Distributed parallel messaging for multiprocessor systems
Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burow, Burkhard; Sugawara, Yutaka
2013-06-04
A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple packet reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling data of a packet received from the network to be written to the memory system. The transmission side of the messaging unit includes a switch interface for reading from the memory system when injecting packets into the network.
Support for non-locking parallel reception of packets belonging to a single memory reception FIFO
Chen, Dong [Yorktown Heights, NY]; Heidelberger, Philip [Yorktown Heights, NY]; Salapura, Valentina [Yorktown Heights, NY]; Senger, Robert M [Yorktown Heights, NY]; Steinmacher-Burow, Burkhard [Boeblingen, DE]; Sugawara, Yutaka [Yorktown Heights, NY]
2011-01-27
A method and apparatus for distributed parallel messaging in a parallel computing system. A plurality of DMA engine units are configured in a multiprocessor system to operate in parallel, one DMA engine unit for transferring a current packet received at a network reception queue to a memory location in a reception memory FIFO (rmFIFO) region of a memory. A control unit implements logic to determine whether any prior received packet destined for that rmFIFO is still in the process of being stored in the associated memory by another DMA engine unit of the plurality, and prevents the one DMA engine unit from indicating completion of storing the current received packet in the rmFIFO until all prior received packets destined for that rmFIFO are completely stored by the other DMA engine units. Thus, there is provided non-locking support so that multiple packets destined for a single rmFIFO are transferred and stored in parallel to predetermined locations in a memory.
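A minimal sketch of the completion-ordering idea in the abstract above, written with C++ atomics rather than hardware DMA logic: each engine reserves a FIFO slot without locking, stores its packet, and only advances a shared committed counter once every earlier slot has been committed. All names (rmFIFO, store_packet) are illustrative, not from the patent.

#include <atomic>
#include <cstdint>
#include <vector>

struct rmFIFO {                               // hypothetical, illustrative only
    std::vector<uint64_t> slots;              // packet payload slots
    std::atomic<uint64_t> tail{0};            // next slot to reserve
    std::atomic<uint64_t> committed{0};       // slots visible to the consumer

    explicit rmFIFO(size_t n) : slots(n) {}

    // Called concurrently by several "DMA engines" (threads here).
    void store_packet(uint64_t payload) {
        uint64_t my = tail.fetch_add(1);      // lock-free slot reservation
        slots[my % slots.size()] = payload;   // store the packet, in parallel
        // Do not signal completion until all earlier packets are stored:
        uint64_t expect = my;
        while (!committed.compare_exchange_weak(expect, my + 1))
            expect = my;                      // spin until predecessors commit
    }
};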
Costa-Font, Joan; Kanavos, Panos
2007-01-01
To examine the effects of parallel simvastatin importation on drug prices in three of the main parallel-importing countries in the European Union, namely the United Kingdom, Germany, and the Netherlands. To estimate the market share of parallel-imported simvastatin and the unit price (both locally produced and parallel imported), adjusted by defined daily dose, in the importing countries and in the exporting country (Spain). Ordinary least squares regression was used to examine the potential price competition resulting from parallel drug trade between 1997 and 2002. The market share of parallel-imported simvastatin progressively expanded (especially in the United Kingdom and Germany) in the period examined, although the price difference between parallel-imported and locally sourced simvastatin was not significant. Prices tended to rise in the United Kingdom and Germany and declined in the Netherlands. We found no evidence of pro-competitive effects resulting from the expansion of parallel trade. The development of parallel drug importation in the European Union produced unexpected effects (limited competition) on prices that differ from those expected from the introduction of a new competitor. This is partially the result of the scant incentives for competition under drug price regulation and of the lack of transparency in the drug reimbursement system, especially due to the effect of informal discounts (not observable to researchers). The case of simvastatin reveals that savings to the health system from parallel trade are trivial. Finally, of the three countries examined, the only country that shows a moderate downward pattern in simvastatin prices is the Netherlands. This effect can be attributed to the existence of a system that claws back informal discounts.
Wijmans, Johannes G [Menlo Park, CA]; Merkel, Timothy C [Menlo Park, CA]; Baker, Richard W [Palo Alto, CA]
2011-10-11
Disclosed herein are combustion systems and power plants that incorporate sweep-based membrane separation units to remove carbon dioxide from combustion gases. In its most basic embodiment, the invention is a combustion system that includes three discrete units: a combustion unit, a carbon dioxide capture unit, and a sweep-based membrane separation unit. In a preferred embodiment, the invention is a power plant including a combustion unit, a power generation system, a carbon dioxide capture unit, and a sweep-based membrane separation unit. In both of these embodiments, the carbon dioxide capture unit and the sweep-based membrane separation unit are configured to be operated in parallel, by which we mean that each unit is adapted to receive exhaust gases from the combustion unit without such gases first passing through the other unit.
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal
Modern hardware contains parallel execution resources that are well suited for data parallelism (vector units) and for task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multiple cores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.
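A toy illustration of the task-block abstraction described above, assuming a simplified reading of the paper: a block carries either a range of data-parallel iterations (a candidate for SIMD execution) or an independent task. The types and the trivial "scheduler" are invented for illustration.

#include <functional>
#include <iostream>
#include <variant>
#include <vector>

struct IterRange {                            // data-parallel iterations
    int lo, hi;
    std::function<void(int)> body;
};
using Task = std::function<void()>;           // task-parallel unit
using TaskBlock = std::variant<IterRange, Task>;

void run(const TaskBlock& b) {
    if (auto* r = std::get_if<IterRange>(&b)) {
        // A real scheduler would map this loop onto SIMD lanes.
        for (int i = r->lo; i < r->hi; ++i) r->body(i);
    } else {
        std::get<Task>(b)();                  // runs independently on a core
    }
}

int main() {
    std::vector<TaskBlock> queue;
    queue.push_back(IterRange{0, 4, [](int i) { std::cout << "iter " << i << '\n'; }});
    queue.push_back(Task{[] { std::cout << "task\n"; }});
    for (const auto& b : queue) run(b);       // a scheduler would choose where
}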
Seeing the forest for the trees: Networked workstations as a parallel processing computer
NASA Technical Reports Server (NTRS)
Breen, J. O.; Meleedy, D. M.
1992-01-01
Unlike traditional 'serial' processing computers, in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system cannot rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Partitioning in parallel processing of production systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oflazer, K.
1987-01-01
This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpreter with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.
NASA Astrophysics Data System (ADS)
Okazaki, Yuji; Uno, Takanori; Asai, Hideki
In this paper, we propose an optimization system with parallel processing for reducing electromagnetic interference (EMI) in electronic control units (ECUs). We adopt simulated annealing (SA), a genetic algorithm (GA) and taboo search (TS) to seek optimal solutions, and a Spice-like circuit simulator to analyze the common-mode current. The proposed system can therefore determine adequate combinations of the parasitic inductance and capacitance values on the printed circuit board (PCB) efficiently and practically, to reduce the EMI caused by the common-mode current. Finally, we apply the proposed system to an example circuit to verify the validity and efficiency of the system.
The crew activity planning system bus interface unit
NASA Technical Reports Server (NTRS)
Allen, M. A.
1979-01-01
The hardware and software designs used to implement a high speed parallel communications interface to the MITRE 307.2 kilobit/second serial bus communications system are described. The primary topic is the development of the bus interface unit.
NASA Astrophysics Data System (ADS)
Ramirez, Andres; Rahnemoonfar, Maryam
2017-04-01
A hyperspectral image provides a multidimensional, data-rich picture consisting of hundreds of spectral bands. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms results in long computation times. In order to overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyze a hyperspectral image through the usage of parallel hardware and a parallel programming model, which is simpler to handle than other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy and timing results between the Hadoop and GPU system and the following test cases: a combined CPU-and-GPU test case, a CPU-only test case, and a test case where no dimensionality reduction was applied.
Effect of Cooling Units on the Performance of an Automotive Exhaust-Based Thermoelectric Generator
NASA Astrophysics Data System (ADS)
Su, C. Q.; Zhu, D. C.; Deng, Y. D.; Wang, Y. P.; Liu, X.
2017-05-01
Currently, automotive exhaust-based thermoelectric generators (AETEGs) are a hot topic in energy recovery. In order to investigate the influence of coolant flow rate, coolant flow direction and cooling unit arrangement in the AETEG, a thermoelectric generator (TEG) model and a related test bench are constructed. Water cooling is adopted in this study. Due to the non-uniformity of the surface temperature of the heat source, the coolant flow direction would affect the output performance of the TEG. Changing the volumetric flow rate of coolant can increase the output power of multi-modules connected in series and/or parallel, as it can improve the temperature uniformity of the cooling unit. Since the temperature uniformity of the cooling unit has a strong influence on the output power, two cooling units are connected in series or parallel to study the effect of cooling unit arrangements on the maximum output power of the TEG. Experimental and theoretical analyses reveal that the net output power is generally higher with cooling units connected in parallel than with cooling units connected in series in a cooling system with two cooling units.
Advanced Electric Distribution, Switching, and Conversion Technology for Power Control
NASA Technical Reports Server (NTRS)
Soltis, James V.
1998-01-01
The Electrical Power Control Unit currently under development by Sundstrand Aerospace for use on the Fluids Combustion Facility of the International Space Station is the precursor of modular power distribution and conversion concepts for future spacecraft and aircraft applications. This unit combines modular current-limiting flexible remote power controllers and paralleled power converters into one package. Each unit includes three 1-kW, current-limiting power converter modules designed for a variable-ratio load sharing capability. The flexible remote power controllers can be used in parallel to match load requirements and can be programmed for an initial ON or OFF state on powerup. The unit contains an integral cold plate. The modularity and hybridization of the Electrical Power Control Unit sets the course for future spacecraft electrical power systems, both large and small. In such systems, the basic hybridized converter and flexible remote power controller building blocks could be configured to match power distribution and conversion capabilities to load requirements. In addition, the flexible remote power controllers could be configured in assemblies to feed multiple individual loads and could be used in parallel to meet the specific current requirements of each of those loads. Ultimately, the Electrical Power Control Unit design concept could evolve to a common switch module hybrid, or family of hybrids, for both converter and switchgear applications. By assembling hybrids of a common current rating and voltage class in parallel, researchers could readily adapt these units for multiple applications. The Electrical Power Control Unit concept has the potential to be scaled to larger and smaller ratings for both small and large spacecraft and for aircraft where high-power density, remote power controllers or power converters are required and a common replacement part is desired for multiples of a base current rating.
A hybrid algorithm for parallel molecular dynamics simulations
NASA Astrophysics Data System (ADS)
Mangiardi, Chris M.; Meyer, R.
2017-10-01
This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
NASA Astrophysics Data System (ADS)
Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian
2018-01-01
We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs.
MARBLE: A system for executing expert systems in parallel
NASA Technical Reports Server (NTRS)
Myers, Leonard; Johnson, Coe; Johnson, Dean
1990-01-01
This paper details the MARBLE 2.0 system, which provides a parallel environment for cooperating expert systems. The work has been done in conjunction with the development of an intelligent computer-aided design system, ICADS, by the CAD Research Unit of the Design Institute at California Polytechnic State University. MARBLE (Multiple Accessed Rete Blackboard Linked Experts) is a system of cooperating expert systems built with the C Language Integrated Production System (CLIPS) expert system tool. A copied blackboard is used for communication between the shells to establish an architecture which supports cooperating expert systems that execute in parallel. The design of MARBLE is simple, but it provides support for a rich variety of configurations, while making it relatively easy to demonstrate the correctness of its parallel execution features. In its most elementary configuration, individual CLIPS expert systems execute on their own processors and communicate with each other through a modified blackboard. Control of the system as a whole, and specifically of writing to the blackboard, is provided by one of the CLIPS expert systems, an expert control system.
Work stealing for GPU-accelerated parallel programs in a global address space framework
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram
Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
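For readers unfamiliar with the work-stealing primitive underlying this design, here is a deliberately simplified owner/thief queue in C++; a production system would use a lock-free Chase-Lev deque and add the paper's CPU/GPU task-placement and data-movement logic, none of which is shown here.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

// Toy work-stealing queue: the owner pushes/pops at the back; idle workers
// (or, in the paper's setting, a host thread feeding a GPU) steal from the
// front, which tends to take the oldest, largest-grained work.
class WSQueue {
    std::deque<Task> q_;
    std::mutex m_;
public:
    void push(Task t) {
        std::lock_guard<std::mutex> g(m_);
        q_.push_back(std::move(t));
    }
    std::optional<Task> pop() {               // owner end
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.back()); q_.pop_back(); return t;
    }
    std::optional<Task> steal() {             // thief end
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.front()); q_.pop_front(); return t;
    }
};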
Smoldyn on graphics processing units: massively parallel Brownian dynamics simulations.
Dematté, Lorenzo
2012-01-01
Space is a very important aspect in the simulation of biochemical systems; recently, the need for simulation algorithms able to cope with space has become more and more compelling. Complex and detailed models of biochemical systems need to deal with the movement of single molecules and particles, taking into consideration localized fluctuations, transportation phenomena, and diffusion. A common drawback of spatial models lies in their complexity: models can become very large, and their simulation can be time consuming, especially if we want to capture the system's behavior in a reliable way using stochastic methods in conjunction with a high spatial resolution. In order to deliver on the promise made by systems biology to understand a system as a whole, we need to scale up the size of the models we are able to simulate, moving from sequential to parallel simulation algorithms. In this paper, we analyze Smoldyn, a widely used algorithm for stochastic simulation of chemical reactions with spatial resolution and single-molecule detail, and we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs). The implementation executes the most computationally demanding steps (computation of diffusion, unimolecular and bimolecular reactions, as well as the most common cases of molecule-surface interaction) on the GPU, computing them in parallel for each molecule of the system. The implementation offers good speed-ups and real-time, high-quality graphics output.
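The per-molecule parallelism described above can be pictured with this illustrative C++ diffusion step (names and the threading scheme are ours, not Smoldyn's): every molecule takes an independent Gaussian step, so the loop partitions freely across CPU threads here, or across GPU threads in the paper's implementation.

#include <algorithm>
#include <cmath>
#include <random>
#include <thread>
#include <vector>

struct Vec3 { double x, y, z; };

// One Brownian diffusion step for all molecules, with diffusion
// coefficient D and time step dt; each molecule is independent.
void diffuse(std::vector<Vec3>& pos, double D, double dt, unsigned nthreads) {
    const double s = std::sqrt(2.0 * D * dt);          // step std-dev
    auto worker = [&](size_t lo, size_t hi, unsigned seed) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> g(0.0, s);
        for (size_t i = lo; i < hi; ++i) {             // no cross-molecule reads
            pos[i].x += g(rng); pos[i].y += g(rng); pos[i].z += g(rng);
        }
    };
    std::vector<std::thread> ts;
    size_t chunk = pos.size() / nthreads + 1;
    for (unsigned t = 0; t < nthreads; ++t) {
        size_t lo = t * chunk, hi = std::min(pos.size(), lo + chunk);
        if (lo < hi) ts.emplace_back(worker, lo, hi, t + 1);
    }
    for (auto& th : ts) th.join();
}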
Parallelization of the FLAPW method and comparison with the PPW method
NASA Astrophysics Data System (ADS)
Canning, Andrew; Mannstadt, Wolfgang; Freeman, Arthur
2000-03-01
The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining electronic and magnetic properties of crystals and surfaces. In the past the FLAPW method has been limited to systems of about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell running on up to 512 processors on a Cray T3E parallel supercomputer. Some results will also be presented on a comparison of the plane-wave pseudopotential method and the FLAPW method on large systems.
Method and apparatus for probing relative volume fractions
Jandrasits, Walter G.; Kikta, Thomas J.
1998-01-01
A relative volume fraction probe particularly for use in a multiphase fluid system includes two parallel conductive paths defining therebetween a sample zone within the system. A generating unit generates time varying electrical signals which are inserted into one of the two parallel conductive paths. A time domain reflectometer receives the time varying electrical signals returned by the second of the two parallel conductive paths and, responsive thereto, outputs a curve of impedance versus distance. An analysis unit then calculates the area under the curve, subtracts the calculated area from an area produced when the sample zone consists entirely of material of a first fluid phase, and divides this calculated difference by the difference between an area produced when the sample zone consists entirely of material of the first fluid phase and an area produced when the sample zone consists entirely of material of a second fluid phase. The result is the volume fraction.
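The arithmetic in the last three sentences of this abstract reduces to one expression; the helper below restates it, with A the measured area and A1, A2 the single-phase calibration areas (variable names assumed for illustration).

// A  = measured area under the impedance-vs-distance curve
// A1 = area when the sample zone contains only the first fluid phase
// A2 = area when the sample zone contains only the second fluid phase
double volume_fraction(double A, double A1, double A2) {
    return (A1 - A) / (A1 - A2);   // 0 when A == A1, 1 when A == A2
}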
Method and apparatus for probing relative volume fractions
Jandrasits, W.G.; Kikta, T.J.
1998-03-17
A relative volume fraction probe particularly for use in a multiphase fluid system includes two parallel conductive paths defining therebetween a sample zone within the system. A generating unit generates time varying electrical signals which are inserted into one of the two parallel conductive paths. A time domain reflectometer receives the time varying electrical signals returned by the second of the two parallel conductive paths and, responsive thereto, outputs a curve of impedance versus distance. An analysis unit then calculates the area under the curve, subtracts the calculated area from an area produced when the sample zone consists entirely of material of a first fluid phase, and divides this calculated difference by the difference between an area produced when the sample zone consists entirely of material of the first fluid phase and an area produced when the sample zone consists entirely of material of a second fluid phase. The result is the volume fraction. 9 figs.
NASA Astrophysics Data System (ADS)
Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua
2017-12-01
Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large-scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we proposed a stream tiling approach to surface area estimation that first decomposed a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process was broken. Then, we realized a streaming framework for the scheduling of the I/O processes and computing units. Herein, each computing unit encapsulated an identical copy of the estimation algorithm, and multiple asynchronous computing units could work individually in parallel. Finally, the experiments demonstrated that our stream tiling estimation can efficiently alleviate the heavy pressure from I/O-bound work, and the measured speedups after optimization greatly outperformed the directly parallelized versions on shared-memory systems with multi-core processors.
STOCHSIMGPU: parallel stochastic simulation for the Systems Biology Toolbox 2 for MATLAB.
Klingbeil, Guido; Erban, Radek; Giles, Mike; Maini, Philip K
2011-04-15
The importance of stochasticity in biological systems is becoming increasingly recognized and the computational cost of biologically realistic stochastic simulations urgently requires development of efficient software. We present a new software tool STOCHSIMGPU that exploits graphics processing units (GPUs) for parallel stochastic simulations of biological/chemical reaction systems and show that significant gains in efficiency can be made. It is integrated into MATLAB and works with the Systems Biology Toolbox 2 (SBTOOLBOX2) for MATLAB. The GPU-based parallel implementation of the Gillespie stochastic simulation algorithm (SSA), the logarithmic direct method (LDM) and the next reaction method (NRM) is approximately 85 times faster than the sequential implementation of the NRM on a central processing unit (CPU). Using our software does not require any changes to the user's models, since it acts as a direct replacement of the stochastic simulation software of the SBTOOLBOX2. The software is open source under the GPL v3 and available at http://www.maths.ox.ac.uk/cmb/STOCHSIMGPU. The web site also contains supplementary information. klingbeil@maths.ox.ac.uk Supplementary data are available at Bioinformatics online.
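As context for what each GPU thread in such a tool computes, here is a minimal single step of the Gillespie direct method in C++ (propensities are a placeholder; STOCHSIMGPU's LDM/NRM variants and MATLAB integration are not represented). Many independent realizations of this loop are what the GPU runs in parallel.

#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// One direct-method step: advances time t and returns the index of the
// fired reaction, or -1 if no reaction is possible. a[] holds propensities.
int ssa_step(std::mt19937& rng, const std::vector<double>& a, double& t) {
    double a0 = std::accumulate(a.begin(), a.end(), 0.0);
    if (a0 <= 0.0) return -1;
    std::uniform_real_distribution<double> u(0.0, 1.0);
    t += -std::log(u(rng)) / a0;                // exponential waiting time
    double r = u(rng) * a0, c = 0.0;            // pick reaction j with
    for (size_t j = 0; j < a.size(); ++j) {     // probability a[j] / a0
        c += a[j];
        if (r < c) return static_cast<int>(j);
    }
    return static_cast<int>(a.size()) - 1;      // guard against rounding
}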
Redundant binary number representation for an inherently parallel arithmetic on optical computers.
De Biase, G A; Massini, A
1993-02-10
A simple redundant binary number representation suitable for digital-optical computers is presented. By means of this representation it is possible to build an arithmetic with carry-free parallel algebraic sums carried out in constant time and parallel multiplication in log N time. This redundant number representation naturally fits the 2's complement binary number system and permits the construction of inherently parallel arithmetic units that are used in various optical technologies. Some properties of this number representation and several examples of computation are presented.
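To make the carry-free property concrete, here is a hedged C++ sketch of two-stage signed-digit addition, shown in radix 4 with digits in {-3,...,3} rather than the paper's binary redundant encoding (the transfer rules are analogous): stage 1 is independent per digit position, and stage 2 provably cannot create a new carry, so the sum completes in constant time regardless of word length.

#include <cstdio>
#include <vector>

// x, y: little-endian signed-digit vectors, digits in [-3, 3].
std::vector<int> sd_add(const std::vector<int>& x, const std::vector<int>& y) {
    size_t n = x.size();
    std::vector<int> w(n), t(n + 1, 0), s(n + 1);
    for (size_t i = 0; i < n; ++i) {            // stage 1: per-digit, parallel
        int p = x[i] + y[i];                    // p in [-6, 6]
        if (p >= 3)       { t[i + 1] = 1;  w[i] = p - 4; }   // |w| <= 2
        else if (p <= -3) { t[i + 1] = -1; w[i] = p + 4; }
        else              { t[i + 1] = 0;  w[i] = p;     }
    }
    for (size_t i = 0; i < n; ++i) s[i] = w[i] + t[i];  // stage 2: |s| <= 3
    s[n] = t[n];
    return s;
}

int main() {                                    // 7 + 6 = 13 in radix-4 SD
    std::vector<int> a{3, 1}, b{2, 1};          // 3 + 1*4 = 7, 2 + 1*4 = 6
    auto s = sd_add(a, b);
    long v = 0, base = 1;
    for (int d : s) { v += d * base; base *= 4; }
    std::printf("%ld\n", v);                    // prints 13
}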
Parallelization of the FLAPW method
NASA Astrophysics Data System (ADS)
Canning, A.; Mannstadt, W.; Freeman, A. J.
2000-08-01
The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining structural, electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work, we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel supercomputer.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chiang, Patrick
2014-01-31
The research goal of this CAREER proposal is to develop energy-efficient VLSI interconnect circuits and systems that will facilitate future massively parallel, high-performance computing. Extreme-scale computing will exhibit massive parallelism on multiple vertical levels, from thousands of computational units on a single processor to thousands of processors in a single data center. Unfortunately, the energy required to communicate between these units at every level (on-chip, off-chip, off-rack) will be the critical limitation to energy efficiency. Therefore, the PI's career goal is to become a leading researcher in the design of energy-efficient VLSI interconnect for future computing systems.
NASA Technical Reports Server (NTRS)
Gorospe, George E., Jr.; Daigle, Matthew J.; Sankararaman, Shankar; Kulkarni, Chetan S.; Ng, Eley
2017-01-01
Prognostic methods enable operators and maintainers to predict the future performance for critical systems. However, these methods can be computationally expensive and may need to be performed each time new information about the system becomes available. In light of these computational requirements, we have investigated the application of graphics processing units (GPUs) as a computational platform for real-time prognostics. Recent advances in GPU technology have reduced cost and increased the computational capability of these highly parallel processing units, making them more attractive for the deployment of prognostic software. We present a survey of model-based prognostic algorithms with considerations for leveraging the parallel architecture of the GPU and a case study of GPU-accelerated battery prognostics with computational performance results.
NASA Astrophysics Data System (ADS)
Grzeszczuk, A.; Kowalski, S.
2015-04-01
Compute Unified Device Architecture (CUDA) is a parallel computing platform developed by Nvidia to increase the speed of graphics through parallel computation. The success of this solution has opened the technology of General-Purpose Graphics Processing Units (GPGPUs) to applications not coupled with graphics. A GPGPU system can be applied as an effective tool for reducing the huge volumes of data from pulse-shape-analysis measurements, either by on-line recalculation or by a very fast compression scheme. The simplified structure of the CUDA system and its programming model, based on the example of an Nvidia GeForce GTX 580 card, are presented in our poster contribution, both in a stand-alone version and as a ROOT application.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chaves, Mario Paul
2017-07-01
For my project I have selected to research and design a high-current pulse system, which will be externally triggered from a 5 V pulse. The research will be conducted in the area of paralleling solid-state switches for a higher current output, as well as to see if there will be any other advantages in doing so. The paralleled solid-state switches will ultimately be used in a Capacitive Discharge Unit (CDU). For the first part of my project, I have set my focus on the design of the circuit, selection of components, and simulation of the circuit.
Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schmalz, Mark S
2011-07-24
Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation G' for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G → G', which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.
Pelletier, Mathew G
2008-02-08
One of the main hurdles standing in the way of optimal cleaning of cotton lint is the lack of sensing systems that can react fast enough to provide the control system with real-time information as to the level of trash contamination of the cotton lint. This research examines the use of programmable graphic processing units (GPU) as an alternative to the PC's traditional use of the central processing unit (CPU). The use of the GPU, as an alternative computation platform, allowed the machine vision system to gain a significant improvement in processing time. By improving the processing time, this research seeks to address the lack of availability of rapid trash sensing systems and thus alleviate a situation in which the current systems view the cotton lint either well before, or after, the cotton is cleaned. This extended lag/lead time that is currently imposed on the cotton trash cleaning control systems is what is responsible for system operators utilizing a very large dead-band safety buffer in order to ensure that the cotton lint is not under-cleaned. Unfortunately, the utilization of a large dead-band buffer results in the majority of the cotton lint being over-cleaned, which in turn causes lint fiber damage as well as significant losses of the valuable lint due to the excessive use of cleaning machinery. This research estimates that upwards of a 30% reduction in lint loss could be gained through the use of a trash sensor tightly coupled to the cleaning machinery control systems. This research seeks to improve processing times through the development of a new algorithm for cotton trash sensing that allows for implementation on a highly parallel architecture. Additionally, by moving the new parallel algorithm onto an alternative computing platform, the graphic processing unit (GPU), for processing of the cotton trash images, a speed-up of over 6.5 times over optimized code running on the PC's central processing unit (CPU) was gained. The new parallel algorithm operating on the GPU was able to process a 1024x1024 image in less than 17 ms. At this improved speed, the image processing system's performance should now be sufficient to provide a system that would be capable of real-time feedback control in tight cooperation with the cleaning equipment.
Two schemes for rapid generation of digital video holograms using PC cluster
NASA Astrophysics Data System (ADS)
Park, Hanhoon; Song, Joongseok; Kim, Changseob; Park, Jong-Il
2017-12-01
Computer-generated holography (CGH), which is a process of generating digital holograms, is computationally expensive. Recently, several methods/systems of parallelizing the process using graphic processing units (GPUs) have been proposed. Indeed, use of multiple GPUs or a personal computer (PC) cluster (each PC with GPUs) enabled great improvements in the process speed. However, extant literature has less often explored systems involving rapid generation of multiple digital holograms and specialized systems for rapid generation of a digital video hologram. This study proposes a system that uses a PC cluster and is able to more efficiently generate a video hologram. The proposed system is designed to simultaneously generate multiple frames and accelerate the generation by parallelizing the CGH computations across a number of frames, as opposed to separately generating each individual frame while parallelizing the CGH computations within each frame. The proposed system also enables the subprocesses for generating each frame to execute in parallel through multithreading. With these two schemes, the proposed system significantly reduced the data communication time for generating a digital hologram when compared with that of the state-of-the-art system.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Volkov, M V; Garanin, S G; Dolgopolov, Yu V
2014-11-30
A seven-channel fibre laser system operated by the master oscillator – multichannel power amplifier scheme is phase locked using a stochastic parallel gradient algorithm. The phase modulators on lithium niobate crystals are controlled by a multichannel electronic unit with the microcontroller processing signals in real time. Dynamic phase locking of the laser system with a bandwidth of 14 kHz is demonstrated; the phasing time is 3 – 4 ms. (fibre and integrated-optical structures)
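A rough numerical sketch of stochastic parallel gradient (SPGD) phase locking, with invented gain and dither values: all channel phases are perturbed simultaneously with random signs, the change in a combining metric is measured, and all phases are updated in parallel. The metric J here is an assumed coherent-combining efficiency, not the paper's detector signal.

#include <cmath>
#include <complex>
#include <cstdio>
#include <random>
#include <vector>

// Combining efficiency: 1.0 when all channel phases are equal.
double J(const std::vector<double>& phi) {
    std::complex<double> e{0.0, 0.0};
    for (double p : phi) e += std::polar(1.0, p);
    return std::norm(e) / (phi.size() * phi.size());
}

int main() {
    const int N = 7;                                       // seven channels
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    std::vector<double> phi(N), d(N);
    for (auto& p : phi) p = u(rng) * 3.14;                 // random initial phases
    const double delta = 0.1, gain = 30.0;                 // illustrative values
    for (int it = 0; it < 500; ++it) {
        for (auto& x : d) x = (u(rng) > 0 ? delta : -delta);   // random dither
        std::vector<double> plus(phi), minus(phi);
        for (int i = 0; i < N; ++i) { plus[i] += d[i]; minus[i] -= d[i]; }
        double dJ = J(plus) - J(minus);                    // two-sided measurement
        for (int i = 0; i < N; ++i) phi[i] += gain * dJ * d[i];  // parallel update
    }
    std::printf("metric = %.3f\n", J(phi));                // typically near 1.0
}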
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zuo, Wangda; McNeil, Andrew; Wetter, Michael
2013-05-23
Building designers are increasingly relying on complex fenestration systems to reduce energy consumed for lighting and HVAC in low-energy buildings. Radiance, a lighting simulation program, has been used to conduct daylighting simulations for complex fenestration systems. Depending on the configuration, the simulation can take hours or even days using a personal computer. This paper describes how to accelerate the matrix multiplication portion of a Radiance three-phase daylight simulation by conducting parallel computing on the heterogeneous hardware of a personal computer. The algorithm was optimized and the computational part was implemented in parallel using OpenCL. The speed of the new approach was evaluated using various daylighting simulation cases on a multicore central processing unit and a graphics processing unit. Based on the measurements and analysis of the time usage for the Radiance daylighting simulation, further speedups can be achieved by using fast I/O devices and storing the data in a binary format.
Method and apparatus for probing relative volume fractions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jandrasits, W.G.; Kikta, T.J.
1996-12-31
A relative volume fraction probe particularly for use in a multiphase fluid system includes two parallel conductive paths defining therebetween a sample zone within the system. A generating unit generates time varying electrical signals which are inserted into one of the two parallel conductive paths. A time domain reflectometer receives the time varying electrical signals returned by the second of the two parallel conductive paths and, responsive thereto, outputs a curve of impedance versus distance. An analysis unit then calculates the area under the curve, subtracts the calculated area from an area produced when the sample zone consists entirely of material of a first fluid phase, and divides this calculated difference by the difference between an area produced when the sample zone consists entirely of material of the first fluid phase and an area produced when the sample zone consists entirely of material of a second fluid phase. The result is the volume fraction.
Parallel Execution of Functional Mock-up Units in Buildings Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ozmen, Ozgur; Nutaro, James J.; New, Joshua Ryan
2016-06-30
A Functional Mock-up Interface (FMI) defines a standardized interface to be used in computer simulations to develop complex cyber-physical systems. FMI implementation by a software modeling tool enables the creation of a simulation model that can be interconnected, or the creation of a software library called a Functional Mock-up Unit (FMU). This report describes an FMU wrapper implementation that imports FMUs into a C++ environment and uses an Euler solver that executes FMUs in parallel using Open Multi-Processing (OpenMP). The purpose of this report is to elucidate the runtime performance of the solver when a multi-component system is imported as a single FMU (for the whole system) or as multiple FMUs (for different groups of components as sub-systems). This performance comparison is conducted using two test cases: (1) a simple, multi-tank problem; and (2) a more realistic use case based on the Modelica Buildings Library. In both test cases, the performance gains are promising when each FMU consists of a large number of states and state events that are wrapped in a single FMU. Load balancing is demonstrated to be a critical factor in speeding up parallel execution of multiple FMUs.
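A caricature of the report's parallel Euler step, assuming a stand-in Fmu struct rather than the FMI C API: each FMU's derivative evaluation and state update run inside an OpenMP loop, and the dynamic schedule hints at why load balancing across unevenly sized FMUs matters.

#include <vector>

struct Fmu {                                      // stand-in, not the FMI C API
    std::vector<double> x;                        // continuous states
    void derivatives(std::vector<double>& dx) const {
        for (size_t i = 0; i < x.size(); ++i) dx[i] = -x[i];  // placeholder dynamics
    }
};

void euler_step(std::vector<Fmu>& fmus, double h) {
    #pragma omp parallel for schedule(dynamic)    // dynamic: FMUs differ in size
    for (int k = 0; k < (int)fmus.size(); ++k) {
        std::vector<double> dx(fmus[k].x.size());
        fmus[k].derivatives(dx);
        for (size_t i = 0; i < dx.size(); ++i) fmus[k].x[i] += h * dx[i];
    }
    // ...exchange values across inter-FMU connections before the next step...
}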
NASA Technical Reports Server (NTRS)
Monford, L. G., Jr. (Inventor)
1974-01-01
A digital communication system is reported for parallel operation of 16 or more transceiver units with the use of only four interconnecting wires. A remote synchronization circuit produces unit address control words sequentially in data frames of 16 words. Means are provided in each transceiver unit to decode calling signals and to transmit calling and data signals. The transceivers communicate with each other over one data line. The synchronization unit communicates the address control information to the transceiver units over an address line and further provides the timing information over a clock line. A reference voltage level or ground line completes the interconnecting four wire hookup.
NASA Astrophysics Data System (ADS)
Ray, Sibdas; Das, Aniruddha
2015-06-01
Reaction of 2-ethoxymethyleneamino-2-cyanoacetamide with primary alkyl amines in acetonitrile solvent affords 1-substituted-5-aminoimidazole-4-carboxamides. Single-crystal X-ray diffraction studies of these imidazole compounds show that there are both anti-parallel and syn-parallel π-π stackings between two imidazole units in parallel-displaced (PD) conformations, and that the distance between two π-π stacked imidazole units depends mainly on the anti-/syn-parallel nature and to some extent on the alkyl group attached to N-1 of the imidazole; molecules with anti-parallel PD-stacking arrangements of the imidazole units have vertical π-π stacking distances short enough to impart stabilization, whereas imidazole units in the syn-parallel stacking arrangement have much larger π-π stacking distances. DFT studies on a pair of anti-parallel imidazole units of such an AICA lead to curves of π-π stacking stabilization energy versus π-π stacking distance that resemble the Morse potential energy diagram for a diatomic molecule; this makes it possible to find a minimum π-π stacking distance corresponding to the maximum stacking stabilization energy between the pair of imidazole units. In contrast, the DFT-calculated curve of π-π stacking stabilization energy versus π-π stacking distance for a pair of syn-parallel imidazole units is shown to have an exponential nature.
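For reference, the Morse form to which the computed stacking-energy curves are compared (symbols assumed here: D_e the well depth, r_e the optimal stacking distance, a the curve-width parameter):

% E(r): stacking stabilization energy at stacking distance r; the minimum
% E(r_e) = -D_e corresponds to the maximum stacking stabilization energy.
\[
  E(r) = D_e \left( 1 - e^{-a (r - r_e)} \right)^{2} - D_e
\]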
Dragas, Jelena; Viswam, Vijay; Shadmani, Amir; Chen, Yihui; Bounik, Raziyeh; Stettler, Alexander; Radivojevic, Milos; Geissler, Sydney; Obien, Marie; Müller, Jan; Hierlemann, Andreas
2017-06-01
Biological cells are characterized by highly complex phenomena and processes that are, to a great extent, interdependent. To gain detailed insights, devices designed to study cellular phenomena need to enable tracking and manipulation of multiple cell parameters in parallel; they have to provide high signal quality and high spatiotemporal resolution. To this end, we have developed a CMOS-based microelectrode array system that integrates six measurement and stimulation functions, the largest number to date. Moreover, the system features the largest active electrode array area to date (4.48×2.43 mm²) to accommodate 59,760 electrodes, while its power consumption, noise characteristics, and spatial resolution (13.5 μm electrode pitch) are comparable to the best state-of-the-art devices. The system includes: 2,048 action-potential (AP, bandwidth: 300 Hz to 10 kHz) recording units, 32 local-field-potential (LFP, bandwidth: 1 Hz to 300 Hz) recording units, 32 current recording units, 32 impedance measurement units, and 28 neurotransmitter detection units, in addition to the 16 dual-mode voltage-only or current/voltage-controlled stimulation units. The electrode array architecture is based on a switch matrix, which allows for connecting any measurement/stimulation unit to any electrode in the array and for performing different measurement/stimulation functions in parallel.
Electro-Optic Computing Architectures: Volume II. Components and System Design and Analysis
1998-02-01
The objective of the Electro-Optic Computing Architecture (EOCA) program was to develop multi-function electro-optic interfaces and optical interconnect units to enhance the performance of parallel processor systems and form the building blocks for future electro-optic computing architectures. Specifically, three multi-function interface modules were targeted for development: an Electro-Optic Interface (EOI), an Optical Interconnection Unit
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing
NASA Astrophysics Data System (ADS)
Johl, John T.; Baker, Nick C.
1988-10-01
The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
Vascular system modeling in parallel environment - distributed and shared memory approaches
Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
2011-01-01
The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Classified one-step high-radix signed-digit arithmetic units
NASA Astrophysics Data System (ADS)
Cherri, Abdallah K.
1998-08-01
High-radix number systems enable higher information storage density, less complexity, fewer system components, and fewer cascaded gates and operations. A simple one-step fully parallel high-radix signed-digit arithmetic is proposed for parallel optical computing based on new joint spatial encodings. This reduces hardware requirements and improves throughput by reducing the space-bandwidth product needed. The high-radix signed-digit arithmetic operations are based on classifying the neighboring input digit pairs into various groups to reduce the computation rules. A new joint spatial encoding technique is developed to represent both the operands and the computation rules. This technique increases the spatial bandwidth product of the spatial light modulators of the system. An optical implementation of the proposed high-radix signed-digit arithmetic operations is also presented. It is shown that our one-step trinary signed-digit and quaternary signed-digit arithmetic units are much simpler and better than all previously reported high-radix signed-digit techniques.
Scalable software architecture for on-line multi-camera video processing
NASA Astrophysics Data System (ADS)
Camplani, Massimo; Salgado, Luis
2011-03-01
In this paper we present a scalable software architecture for on-line multi-camera video processing that guarantees a good trade-off between computational power, scalability and flexibility. The software system is modular and its main blocks are the Processing Units (PUs) and the Central Unit. The Central Unit works as a supervisor of the running PUs, and each PU manages the acquisition phase and the processing phase. Furthermore, an approach to easily parallelize the desired processing application is presented. In this paper, as a case study, we apply the proposed software architecture to a multi-camera system in order to efficiently manage multiple 2D object detection modules in a real-time scenario. System performance has been evaluated under different load conditions, such as the number of cameras and image sizes. The results show that the software architecture scales well with the number of cameras and can easily work with different image formats while respecting the real-time constraints. Moreover, the parallelization approach can be used to speed up the processing tasks with a low level of overhead.
Vasan, S N Swetadri; Ionita, Ciprian N; Titus, A H; Cartwright, A N; Bednarek, D R; Rudin, S
2012-02-23
We present the image processing upgrades implemented on a Graphics Processing Unit (GPU) in the Control, Acquisition, Processing, and Image Display System (CAPIDS) for the custom Micro-Angiographic Fluoroscope (MAF) detector. Most of the image processing currently implemented in the CAPIDS system is pixel independent; that is, the operation on each pixel is the same, and the operation on one pixel does not depend upon the result of the operation on any other, allowing the entire image to be processed in parallel. GPU hardware was developed for this kind of massively parallel processing implementation. Thus, for an algorithm with a high amount of parallelism, a GPU implementation is much faster than a CPU implementation. The image processing algorithm upgrades implemented on the CAPIDS system include flat-field correction, temporal filtering, image subtraction, roadmap mask generation, and display windowing and leveling. A comparison between the previous and the upgraded versions of CAPIDS is presented to demonstrate how the improvement is achieved. By performing the image processing on a GPU, significant improvements with respect to timing and frame rate have been achieved, including stable operation of the system at 30 fps during a fluoroscopy run, a DSA run, a roadmap procedure, and automatic image windowing and leveling during each frame.
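Flat-field correction is a good example of the pixel independence described above; this illustrative loop (the normalization convention is assumed, not taken from CAPIDS) becomes a per-thread GPU kernel precisely because no pixel reads any other pixel's result.

#include <cstddef>
#include <vector>

// Per-pixel flat-field correction: out = (raw - dark) / (flat - dark).
// Every iteration touches only index i, so all pixels can run in parallel.
void flat_field(const std::vector<float>& raw, const std::vector<float>& dark,
                const std::vector<float>& flat, std::vector<float>& out) {
    for (size_t i = 0; i < raw.size(); ++i) {        // embarrassingly parallel
        float denom = flat[i] - dark[i];
        out[i] = denom > 0.f ? (raw[i] - dark[i]) / denom : 0.f;
    }
}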
Parallelized CCHE2D flow model with CUDA Fortran on Graphics Processing Units
USDA-ARS's Scientific Manuscript database
This paper presents the CCHE2D implicit flow model parallelized using the CUDA Fortran programming technique on Graphics Processing Units (GPUs). A parallelized Alternating Direction Implicit (ADI) solver using the Parallel Cyclic Reduction (PCR) algorithm on the GPU is developed and tested. This solve...
Method of up-front load balancing for local memory parallel processors
NASA Technical Reports Server (NTRS)
Baffes, Paul Thomas (Inventor)
1990-01-01
In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balanced load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy five percent.
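A hedged sketch of the merging idea in this patent, with the partition-threshold test simplified to a greedy lightest-pair merge: many small process sets are reduced to exactly the number of physical processing units while keeping the per-processor loads even.

#include <functional>
#include <queue>
#include <vector>

// load[i] is the computational weight of process set i; returns the merged
// per-processor loads. (The patent's partition threshold is not modeled.)
std::vector<double> merge_to_processors(const std::vector<double>& load,
                                        size_t nproc) {
    std::priority_queue<double, std::vector<double>, std::greater<double>>
        pq(load.begin(), load.end());                 // min-heap of set loads
    while (pq.size() > nproc) {
        double a = pq.top(); pq.pop();                // two lightest sets...
        double b = pq.top(); pq.pop();
        pq.push(a + b);                               // ...merge into one
    }
    std::vector<double> out;
    while (!pq.empty()) { out.push_back(pq.top()); pq.pop(); }
    return out;
}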
Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito
2016-11-15
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.
On the utility of threads for data parallel programming
NASA Technical Reports Server (NTRS)
Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush
1995-01-01
Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily by overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.
Pluggable microbial fuel cell stacks for septic wastewater treatment and electricity production.
Yazdi, Hadi; Alzate-Gaviria, Liliana; Ren, Zhiyong Jason
2015-03-01
Septic tanks and other decentralized wastewater treatment systems play an important role in protecting public health and water resources for remote or developing communities. Current septic systems do not have energy production capability, yet such a feature can be very valuable for areas that lack access to electricity. Here we present an easy-to-operate microbial fuel cell (MFC) stack that consists of a common base and multiple pluggable units, which can be connected in either series or parallel for electricity generation during waste treatment in septic tanks. Lab studies showed that this configuration obtained a power density of 142±6.71 mW m(-2) when 3 units were connected in parallel, and preliminary calculation indicates that a system costing approximately US $25 can power a 6-watt LED light for 4 h per day, with great improvement potential. Detailed electrochemical characterizations provide insights on system internal losses and the technology advancement needed. Copyright © 2015 Elsevier Ltd. All rights reserved.
A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.
Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang
2018-01-01
The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns for designing efficient algorithms in computer science, and it has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm QuickHull, to serve as a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers develop their own efficient GPU implementations with less effort.
Electro-Optic Computing Architectures. Volume I
1998-02-01
The objective of the Electro-Optic Computing Architecture (EOCA) program was to develop multi-function electro-optic interfaces and optical...interconnect units to enhance the performance of parallel processor systems and form the building blocks for future electro-optic computing architectures...Specifically, three multi-function interface modules were targeted for development - an Electro-Optic Interface (EOI), an Optical Interconnection Unit (OW
Examinations in the Final Year of Transition to Mathematical Methods Computer Algebra System (CAS)
ERIC Educational Resources Information Center
Leigh-Lancaster, David; Les, Magdalena; Evans, Michael
2010-01-01
2009 was the final year of parallel implementation for Mathematical Methods Units 3 and 4 and Mathematical Methods (CAS) Units 3 and 4. From 2006-2009 there was a common technology-free short answer examination that covered the same function, algebra, calculus and probability content for both studies with corresponding expectations for key…
SURVEY OF FLUE GAS DESULFURIZATION SYSTEMS: WILL COUNTY STATION, COMMONWEALTH EDISON CO
The report gives results of a second survey of the flue gas desulfurization (FGD) system on Unit 1 of Commonwealth Edison Co.'s Will County Station. The FGD system, started up in February 1972, uses a limestone slurry in two parallel scrubbing trains. Each train includes a ventur...
Proposal for massively parallel data storage system
NASA Technical Reports Server (NTRS)
Mansuripur, M.
1992-01-01
An architecture for integrating large numbers of data storage units (drives) to form a distributed mass storage system is proposed. The network of interconnected units consists of nodes and links. At each node there resides a controller board, a data storage unit and, possibly, a local/remote user-terminal. The links (twisted-pair wires, coax cables, or fiber-optic channels) provide the communications backbone of the network. There is no central controller for the system as a whole; all decisions regarding allocation of resources, routing of messages and data-blocks, creation and distribution of redundant data-blocks throughout the system (for protection against possible failures), frequency of backup operations, etc., are made locally at individual nodes. The system can handle as many user-terminals as there are nodes in the network. Various users compete for resources by sending their requests to the local controller-board and receiving allocations of time and storage space. In principle, each user can have access to the entire system, and all drives can be running in parallel to service the requests for one or more users. The system is expandable up to a maximum number of nodes, determined by the number of routing-buffers built into the controller boards. Additional drives, controller-boards, user-terminals, and links can be simply plugged into an existing system in order to expand its capacity.
NASA Astrophysics Data System (ADS)
Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang
2018-04-01
The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.
Information criteria for quantifying loss of reversibility in parallelized KMC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gourgoulias, Konstantinos, E-mail: gourgoul@math.umass.edu; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Rey-Bellet, Luc, E-mail: luc@math.umass.edu
Parallel Kinetic Monte Carlo (KMC) is a potent tool to simulate stochastic particle systems efficiently. However, despite literature on quantifying domain decomposition errors of the particle system for this class of algorithms in the short and in the long time regime, no study yet explores and quantifies the loss of time-reversibility in Parallel KMC. Inspired by concepts from non-equilibrium statistical mechanics, we propose the entropy production per unit time, or entropy production rate, given in terms of an observable and a corresponding estimator, as a metric that quantifies the loss of reversibility. Typically, this is a quantity that cannot be computed explicitly for Parallel KMC, which is why we develop a posteriori estimators that have good scaling properties with respect to the size of the system. Through these estimators, we can connect the different parameters of the scheme, such as the communication time step of the parallelization, the choice of the domain decomposition, and the computational schedule, with its performance in controlling the loss of reversibility. From this point of view, the entropy production rate can be seen both as an information criterion to compare the reversibility of different parallel schemes and as a tool to diagnose reversibility issues with a particular scheme. As a demonstration, we use Sandia Lab's SPPARKS software to compare different parallelization schemes and different domain (lattice) decompositions.
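For orientation, the entropy production rate can be written (in generic notation that is an assumption here, not the paper's own formulas) as the relative entropy rate between the forward path measure and its time reversal:

```latex
% Sketch in generic notation: entropy production rate as the relative entropy
% rate between the forward path measure P and its time reversal P^rev on [0,T].
\[
  \sigma \;=\; \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{P}\!\left[\, \log \frac{\mathrm{d}P_{[0,T]}}{\mathrm{d}P^{\mathrm{rev}}_{[0,T]}} \,\right]
  \;\ge\; 0 .
\]
```

It vanishes exactly for reversible (detailed-balance) dynamics, so a positive value measures how strongly a parallelization scheme breaks reversibility.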
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability comes at very high computation cost, and 3-D computations have had to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit cooperates with the graphics processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in the GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained parallel computing. Experimental results verify that the developed system can perform 50 fps (frames per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained using an NVIDIA GT560Ti graphics card rather than a sequential C implementation on a 3.4 GHz Intel Core i7 3770.
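To give a flavor of the fine-grained stage, here is a minimal CUDA sketch of the classic four-step phase-shifting computation, one thread per pixel. The four-image formula is standard, but the kernel and buffer names are illustrative, and the paper's pipeline includes further stages (rectification, correspondence, reconstruction).

```cuda
// Classic four-step phase shifting: with intensities I_n = A + B*cos(phi + n*pi/2),
// the wrapped phase is phi = atan2(I4 - I2, I1 - I3). One thread per pixel.
#include <math.h>

__global__ void wrappedPhase4(const float* I1, const float* I2,
                              const float* I3, const float* I4,
                              float* phi, int nPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels) phi[i] = atan2f(I4[i] - I2[i], I1[i] - I3[i]);
}
```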
Bayer image parallel decoding based on GPU
NASA Astrophysics Data System (ADS)
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In the photoelectrical tracking system, Bayer images are traditionally decompressed with a CPU-based method, which becomes too slow when the images grow large, for example 2K×2K×16 bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's Graphics Processing Unit (GPU), which supports the CUDA architecture. The decoding procedure can be divided into three parts: the first is the serial part, the second is the task-parallelism part, and the last is the data-parallelism part, including inverse quantization, inverse discrete wavelet transform (IDWT), and image post-processing. To reduce the execution time, the task-parallelism part is optimized with OpenMP techniques. The data-parallelism part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced global memory access optimization, and texture memory optimization. In particular, the IDWT is significantly sped up by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16 bit Bayer image, the data-parallelism part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.
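As a sketch of what rewriting a 2D serial IDWT into a 1D parallel IDWT can look like, here is one inverse level of the simplest wavelet (Haar) applied along rows, one thread per output pair. The paper does not specify its wavelet in this record, so the filter and all names are illustrative assumptions.

```cuda
// One inverse Haar DWT level along rows: each thread reconstructs one
// (even, odd) output pair from one (approximation, detail) coefficient pair.
// Forward Haar: a = (e + o)/sqrt(2), d = (e - o)/sqrt(2); inverted below.
__global__ void inverseHaarRows(const float* approx, const float* detail,
                                float* out, int width, int height) {
    int half = width / 2;
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // coefficient index in [0, half)
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x >= half || y >= height) return;
    float a = approx[y * half + x];
    float d = detail[y * half + x];
    const float s = 0.70710678f;                    // 1/sqrt(2)
    out[y * width + 2 * x]     = (a + d) * s;       // even sample
    out[y * width + 2 * x + 1] = (a - d) * s;       // odd sample
}
```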
Efficient parallel architecture for highly coupled real-time linear system applications
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo
1988-01-01
A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.
Supercomputing on massively parallel bit-serial architectures
NASA Technical Reports Server (NTRS)
Iobst, Ken
1985-01-01
Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.
Graphics processing unit-assisted lossless decompression
Loughry, Thomas A.
2016-04-12
Systems and methods for decompressing compressed data that has been compressed by way of a lossless compression algorithm are described herein. In a general embodiment, a graphics processing unit (GPU) is programmed to receive compressed data packets and decompress such packets in parallel. The compressed data packets are compressed representations of an image, and the lossless compression algorithm is a Rice compression algorithm.
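A minimal CUDA sketch of the packet-parallel idea: one thread decodes one Rice-coded packet. The bit-level conventions assumed here (unary quotient of 1-bits terminated by a 0, then a k-bit remainder, MSB first) and the packet-offset tables are illustrative; the patented system's actual format is not specified in the abstract.

```cuda
// Hypothetical layout: each compressed packet starts at a known bit offset
// and holds a known number of Rice-coded samples; one thread per packet.
struct BitReader {
    const unsigned char* buf;
    unsigned long long bitpos;
};

__device__ int readBit(BitReader& r) {
    int bit = (r.buf[r.bitpos >> 3] >> (7 - (r.bitpos & 7))) & 1;
    r.bitpos++;
    return bit;
}

__device__ unsigned readBits(BitReader& r, int n) {
    unsigned v = 0;
    for (int i = 0; i < n; ++i) v = (v << 1) | readBit(r);
    return v;
}

__global__ void riceDecodePackets(const unsigned char* data,
                                  const unsigned long long* packetBitOffset,
                                  const int* samplesPerPacket,
                                  const int* outOffset,
                                  unsigned* out, int k, int numPackets) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPackets) return;
    BitReader r{data, packetBitOffset[p]};
    unsigned* dst = out + outOffset[p];
    for (int s = 0; s < samplesPerPacket[p]; ++s) {
        unsigned q = 0;
        while (readBit(r)) ++q;          // unary part: quotient
        unsigned rem = readBits(r, k);   // binary part: remainder
        dst[s] = (q << k) | rem;         // value = q * 2^k + remainder
    }
}
```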
Gara, Alan; Ohmacht, Martin
2014-09-16
In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write-through, the corresponding line is deleted from the first level cache and/or prefetch unit, so that any further accesses to the same location in main memory have to be retrieved from the second level cache. The second level cache keeps track of multiple versions of data where more than one speculative thread is running in parallel, while the first level cache does not hold any of the versions during speculation. A switch allows choosing between modes of operation of a speculation-blind first level cache.
NASA Astrophysics Data System (ADS)
Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.
2012-06-01
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
119. VIEW OF NORTH SIDE OF LANDLINE INSTRUMENTATION ROOM (206), ...
119. VIEW OF NORTH SIDE OF LANDLINE INSTRUMENTATION ROOM (206), LSB (BLDG. 751). POWER DISTRIBUTION UNITS AND CABLE DISTRIBUTION UNITS ON RIGHT SIDE OF PHOTO; LOGIC CONTROL AND MONITOR UNITS FOR BOOSTER AND FUEL SYSTEMS LEFT OF AND PARALLEL TO EAST ROW OF CABINETS; SIGNAL CONDITIONERS AT NORTH END OF ROOM PERPENDICULAR TO OTHER CABINETS. - Vandenberg Air Force Base, Space Launch Complex 3, Launch Pad 3 East, Napa & Alden Roads, Lompoc, Santa Barbara County, CA
Acceleration of Radiance for Lighting Simulation by Using Parallel Computing with OpenCL
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zuo, Wangda; McNeil, Andrew; Wetter, Michael
2011-09-06
We report on the acceleration of annual daylighting simulations for fenestration systems in the Radiance ray-tracing program. The algorithm was optimized to reduce both the redundant data input/output operations and the floating-point operations. To further accelerate the simulation speed, the calculation for matrix multiplications was implemented using parallel computing on a graphics processing unit. We used OpenCL, which is a cross-platform parallel programming language. Numerical experiments show that the combination of the above measures can speed up the annual daylighting simulations 101.7 times or 28.6 times when the sky vector has 146 or 2306 elements, respectively.
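The paper implemented its matrix multiplications in OpenCL; the same shared-memory tiling idea looks like this in CUDA terms (the tile size and all names are assumptions for illustration):

```cuda
// Tiled dense matrix multiply C = A * B, with A (M x K), B (K x N), C (M x N).
// Each block computes one TILE x TILE tile of C from shared-memory tiles.
#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int ac = t * TILE + threadIdx.x;  // column index into A
        int br = t * TILE + threadIdx.y;  // row index into B
        As[threadIdx.y][threadIdx.x] = (row < M && ac < K) ? A[row * K + ac] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (br < K && col < N) ? B[br * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```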
Development, Fabrication, and Testing of Inverter Power System for Metroliner
DOT National Transportation Integrated Search
1979-11-01
This report documents the development and subsequent fabrication of a solid state auxiliary power conditioning unit (APCU) for the upgraded Metroliner. The APCU is an inverter of the pulse width modulated type having multiple parallel transistors in ...
DOE Office of Scientific and Technical Information (OSTI.GOV)
None
This is a fact sheet on the U.S. Department of Energy's (DOE) Advanced Reciprocating Engine Systems program (ARES), which is designed to promote separate, but parallel engine development between the major stationary, gaseous fueled engine manufacturers in the United States.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Brian B; Purba, Victor; Jafarpour, Saber
Given that next-generation infrastructures will contain large numbers of grid-connected inverters and these interfaces will be satisfying a growing fraction of system load, it is imperative to analyze the impacts of power electronics on such systems. However, since each inverter model has a relatively large number of dynamic states, it would be impractical to execute complex system models where the full dynamics of each inverter are retained. To address this challenge, we derive a reduced-order structure-preserving model for parallel-connected grid-tied three-phase inverters. Here, each inverter in the system is assumed to have a full-bridge topology, LCL filter at the point of common coupling, and the control architecture for each inverter includes a current controller, a power controller, and a phase-locked loop for grid synchronization. We outline a structure-preserving reduced-order inverter model for the setting where the parallel inverters are each designed such that the filter components and controller gains scale linearly with the power rating. By structure preserving, we mean that the reduced-order three-phase inverter model is also composed of an LCL filter, a power controller, current controller, and PLL. That is, we show that the system of parallel inverters can be modeled exactly as one aggregated inverter unit and this equivalent model has the same number of dynamical states as an individual inverter in the paralleled system. Numerical simulations validate the reduced-order models.
Store-operate-coherence-on-value
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Dong; Heidelberger, Philip; Kumar, Sameer
A system, method and computer program product for performing various store-operate instructions in a parallel computing environment that includes a plurality of processors and at least one cache memory device. A queue in the system receives, from a processor, a store-operate instruction that specifies under which condition a cache coherence operation is to be invoked. A hardware unit in the system runs the received store-operate instruction. The hardware unit evaluates whether a result of running the received store-operate instruction satisfies the condition. The hardware unit invokes a cache coherence operation on a cache memory address associated with the received store-operate instruction if the result satisfies the condition. Otherwise, the hardware unit does not invoke the cache coherence operation on the cache memory device.
ERIC Educational Resources Information Center
Allen, Robert; Layer, Geoff
This book discusses organizational, management and professional dimensions of change as credit-based systems are introduced in higher education institutions in the United Kingdom. Credit-based systems are taken to mean the flexible academic structures based around the parallel but interrelated concepts of credit and modularity. They are being…
Simulation of dispersion in layered coastal aquifer systems
Reilly, T.E.
1990-01-01
A density-dependent solute-transport formulation is used to examine ground-water flow in layered coastal aquifers. The numerical experiments indicate that although the transition zone may be thought of as an impermeable 'sharp' interface with freshwater flow parallel to the transition zone in homogeneous aquifers, this is not the case for layered systems. Freshwater can discharge through the transition zone in the confining units. Further, for the best simulation of layered coastal aquifer systems, either a flow-direction-dependent dispersion formulation is required, or the dispersivities must change spatially to reflect the tight thin confining unit. © 1990.
Graphics Processing Units for HEP trigger systems
NASA Astrophysics Data System (ADS)
Ammendola, R.; Bauce, M.; Biagioni, A.; Chiozzi, S.; Cotta Ramusino, A.; Fantechi, R.; Fiorini, M.; Giagu, S.; Gianoli, A.; Lamanna, G.; Lonardo, A.; Messina, A.; Neri, I.; Paolucci, P. S.; Piandani, R.; Pontisso, L.; Rescigno, M.; Simula, F.; Sozzi, M.; Vicini, P.
2016-07-01
General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPUs for a synchronous low-level trigger, focusing on the CERN NA62 experiment trigger system. The use of GPUs in higher-level trigger systems is also briefly considered.
Ultraviolet Communication for Medical Applications
2015-06-01
In the previous Phase I effort, Directed Energy Inc.’s (DEI) parent company Imaging Systems Technology (IST) demonstrated feasibility of several key...accurately model high path loss. Custom photon scatter code was rewritten for parallel execution on a graphics processing unit (GPU). The NVidia CUDA
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vinogradov, A. Yu., E-mail: vinogradov-a@ntcees.ru; Gerasimov, A. S.; Kozlov, A. V.
Consideration is given to different approaches to modeling the control systems of gas turbines as a component of CCPP and GTPP to ensure their reliable parallel operation in the UPS of Russia. The disadvantages of the approaches to the modeling of combined-cycle units in studying long-term electromechanical transients accompanied by power imbalance are pointed out. Examples are presented to support the use of more detailed models of gas turbines in electromechanical transient calculations. It is shown that the modern speed control systems of gas turbines in combination with relatively low equivalent inertia have a considerable effect on electromechanical transients, including those caused by disturbances not related to power imbalance.
Parallel Computer System for 3D Visualization Stereo on GPU
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on Graphics Processing Units (GPUs) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of ray intersections with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions in the GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration of the multi-threaded implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of the size and configuration of the computational Compute Unified Device Architecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as an optimized configuration of the computing CUDA network.
Parallel Distributed Processing Theory in the Age of Deep Networks.
Bowers, Jeffrey S
2017-12-01
Parallel distributed processing (PDP) models in psychology are the precursors of deep networks used in computer science. However, only PDP models are associated with two core psychological claims, namely that all knowledge is coded in a distributed format and cognition is mediated by non-symbolic computations. These claims have long been debated in cognitive science, and recent work with deep networks speaks to this debate. Specifically, single-unit recordings show that deep networks learn units that respond selectively to meaningful categories, and researchers are finding that deep networks need to be supplemented with symbolic systems to perform some tasks. Given the close links between PDP and deep networks, it is surprising that research with deep networks is challenging PDP theory. Copyright © 2017. Published by Elsevier Ltd.
GPU-completeness: theory and implications
NASA Astrophysics Data System (ADS)
Lin, I.-Jong
2011-01-01
This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We define this class of algorithms as "GPU-Complete" and postulate the necessary properties of an algorithm for admission into this class. We also formally relate this algorithmic space to the space of imaging algorithms. This concept is based upon our experience in the print production area, where GPUs (Graphics Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for the CPU, language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for the GPU, video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a well-defined subset of algorithms, GPU-Completeness is intended to connect parallelism, algorithms, and efficient architectures into a unified framework, showing that multiple layers of parallel implementation are guided by the same underlying trade-off.
Operational Use of GPS Navigation for Space Shuttle Entry
NASA Technical Reports Server (NTRS)
Goodman, John L.; Propst, Carolyn A.
2008-01-01
The STS-118 flight of the Space Shuttle Endeavour was the first shuttle mission flown with three Global Positioning System (GPS) receivers in place of the three legacy Tactical Air Navigation (TACAN) units. This marked the conclusion of a 15 year effort involving procurement, missionization, integration, and flight testing of a GPS receiver and a parallel effort to formulate and implement shuttle computer software changes to support GPS. The use of GPS data from a single receiver in parallel with TACAN during entry was successfully demonstrated by the orbiters Discovery and Atlantis during four shuttle missions in 2006 and 2007. This provided the confidence needed before flying the first all GPS, no TACAN flight with Endeavour. A significant number of lessons were learned concerning the integration of a software intensive navigation unit into a legacy avionics system. These lessons have been taken into consideration during vehicle design by other flight programs, including the vehicle that will replace the Space Shuttle, Orion.
The benefits of privatization.
Dirnfeld, V
1996-08-15
The promise of a universal, comprehensive, publicly funded system of medical care that was the foundation of the Medical Care Act passed in 1966 is no longer possible. Massive government debt, increasing health care costs, a growing and aging population and advances in technology have challenged the system, which can no longer meet the expectations of the public or of the health care professions. A parallel, private system, funded by a not-for-profit, regulated system of insurance coverage affordable for all wage-earners, would relieve the overstressed public system without decreasing the quality of care in that system. Critics of a parallel, private system, who base their arguments on the politics of fear and envy, charge that such a private system would "Americanize" Canadian health care and that the wealthy would be able to buy better, faster care than the rest of the population. But this has not happened in the parallel public and private health care systems in other Western countries or in the public and private education system in Canada. Wealthy Canadians can already buy medical care in the United States, where they spend $1 billion each year, an amount that represents a loss to Canada of 10,000 health care jobs. Parallel-system schemes in other countries have proven that people are driven to a private system by dissatisfaction with the quality of service, which is already suffering in Canada. Denial of choice is unacceptable to many people, particularly since the terms and conditions under which Canadians originally decided to forgo choice in medical care no longer apply.
Jeong, Kyeong-Min; Kim, Hee-Seung; Hong, Sung-In; Lee, Sung-Keun; Jo, Na-Young; Kim, Yong-Soo; Lim, Hong-Gi; Park, Jae-Hyeung
2012-10-08
Speed enhancement of integral imaging based incoherent Fourier hologram capture using a graphics processing unit is reported. The integral imaging based method enables exact hologram capture of real-existing three-dimensional objects under regular incoherent illumination. In our implementation, we apply a parallel computation scheme using the graphics processing unit, accelerating the processing speed. Using the enhanced speed of hologram capture, we also implement a pseudo real-time hologram capture and optical reconstruction system. The overall operation speed is measured to be 1 frame per second.
A polymorphic reconfigurable emulator for parallel simulation
NASA Technical Reports Server (NTRS)
Parrish, E. A., Jr.; Mcvey, E. S.; Cook, G.
1980-01-01
Microprocessor and arithmetic support chip technology was applied to the design of a reconfigurable emulator for real time flight simulation. The system developed consists of a master control system to perform all man-machine interactions and to configure the hardware to emulate a given aircraft, and numerous slave compute modules (SCM) which comprise the parallel computational units. It is shown that all parts of the state equations can be worked on simultaneously but that the algebraic equations cannot (unless they are slowly varying). Attempts to obtain algorithms that allow parallel updates are reported. The word length and step size to be used in the SCMs are determined, and the architecture of the hardware and software is described.
A Design Verification of the Parallel Pipelined Image Processings
NASA Astrophysics Data System (ADS)
Wasaki, Katsumi; Harai, Toshiaki
2008-11-01
This paper presents a case study of the design and verification of a parallel and pipelined image processing unit based on an extended Petri net, called a Logical Colored Petri net (LCPN). This is suitable for Flexible Manufacturing System (FMS) modeling and discussion of structural properties. LCPN is another family of colored place/transition net (CPN) with the addition of the following features: integer value assignment of marks, representation of firing conditions as formulae based on marks' values, and coupling of output procedures with transition firing. Therefore, to study the behavior of a system modeled with this net, we provide a means of searching the reachability tree for markings.
Blom, H; Gösch, M
2004-04-01
Over the past few years we have witnessed a tremendous surge of interest in so-called array-based miniaturised analytical systems, due to their value as extremely powerful tools for high-throughput sequence analysis, drug discovery and development, and diagnostic tests in medicine (see articles in Issue 1). Terminologies that have been used to describe these array-based bioscience systems include (but are not limited to): DNA-chip, microarrays, microchip, biochip, DNA-microarrays and genome chip. Potential technological benefits of introducing these miniaturised analytical systems include improved accuracy, multiplexing, lower sample and reagent consumption, disposability, and decreased analysis times, to mention a few examples. Among the many alternative principles of detection-analysis (e.g. chemiluminescence, electroluminescence and conductivity), fluorescence-based techniques are widely used, examples being fluorescence resonance energy transfer, fluorescence quenching, fluorescence polarisation, time-resolved fluorescence, and fluorescence fluctuation spectroscopy (see articles in Issue 11). Time-dependent fluctuations of fluorescent biomolecules with different molecular properties, like molecular weight, translational and rotational diffusion time, colour and lifetime, potentially provide all the kinetic and thermodynamic information required in analysing complex interactions. In this mini-review article, we present recent extensions aimed at implementing parallel laser excitation and parallel fluorescence detection, which can lead to an even further increase in throughput in miniaturised array-based analytical systems. We also report on developments and characterisations of a multiplexing extension that allows multifocal laser excitation together with matched parallel fluorescence detection for parallel confocal dynamical fluorescence fluctuation studies at the single biomolecule level.
NASA Astrophysics Data System (ADS)
Srivastava, D. P.; Sahni, V.; Satsangi, P. S.
2014-08-01
Graph-theoretic quantum system modelling (GTQSM) is facilitated by considering the fundamental unit of quantum computation and information, viz. a quantum bit or qubit as a basic building block. Unit directional vectors "ket 0" and "ket 1" constitute two distinct fundamental quantum across variable orthonormal basis vectors, for the Hilbert space, specifying the direction of propagation of information, or computation data, while complementary fundamental quantum through, or flow rate, variables specify probability parameters, or amplitudes, as surrogates for scalar quantum information measure (von Neumann entropy). This paper applies GTQSM in continuum of protein heterodimer tubulin molecules of self-assembling polymers, viz. microtubules in the brain as a holistic system of interacting components representing hierarchical clustered quantum Hopfield network, hQHN, of networks. The quantum input/output ports of the constituent elemental interaction components, or processes, of tunnelling interactions and Coulombic bidirectional interactions are in cascade and parallel interconnections with each other, while the classical output ports of all elemental components are interconnected in parallel to accumulate micro-energy functions generated in the system as Hamiltonian, or Lyapunov, energy function. The paper presents an insight, otherwise difficult to gain, for the complex system of systems represented by clustered quantum Hopfield network, hQHN, through the application of GTQSM construct.
Reduced-Order Structure-Preserving Model for Parallel-Connected Three-Phase Grid-Tied Inverters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Brian B; Purba, Victor; Jafarpour, Saber
Next-generation power networks will contain large numbers of grid-connected inverters satisfying a significant fraction of system load. Since each inverter model has a relatively large number of dynamic states, it is impractical to analyze complex system models where the full dynamics of each inverter are retained. To address this challenge, we derive a reduced-order structure-preserving model for parallel-connected grid-tied three-phase inverters. Here, each inverter in the system is assumed to have a full-bridge topology, LCL filter at the point of common coupling, and the control architecture for each inverter includes a current controller, a power controller, and a phase-locked loop for grid synchronization. We outline a structure-preserving reduced-order inverter model with lumped parameters for the setting where the parallel inverters are each designed such that the filter components and controller gains scale linearly with the power rating. By structure preserving, we mean that the reduced-order three-phase inverter model is also composed of an LCL filter, a power controller, current controller, and PLL. We show that the system of parallel inverters can be modeled exactly as one aggregated inverter unit and this equivalent model has the same number of dynamical states as any individual inverter in the system. Numerical simulations validate the reduced-order model.
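To make the aggregation concrete, under the stated linear scaling of filter components with power rating, N identical parallel units combine by elementary circuit laws; this is an illustrative sketch, not the paper's derivation:

```latex
% For N identical inverters whose LCL filter elements scale linearly with
% power rating, paralleling them aggregates (parallel inductors and resistors
% divide by N, parallel capacitors add) to:
\[
  L_{\mathrm{agg}} = \frac{L}{N}, \qquad
  C_{\mathrm{agg}} = N\,C, \qquad
  R_{\mathrm{agg}} = \frac{R}{N}, \qquad
  P_{\mathrm{agg}}^{\mathrm{rated}} = N\,P^{\mathrm{rated}} .
\]
```

The aggregate thus behaves as a single inverter of N-fold rating with the same structure, per-unit dynamics, and state count as each individual unit.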
Breaking Barriers to Low-Cost Modular Inverter Production & Use
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bogdan Borowy; Leo Casey; Jerry Foshage
2005-05-31
The goal of this cost share contract is to advance key technologies to reduce size, weight and cost while enhancing performance and reliability of a Modular Inverter Product for Distributed Energy Resources (DER). Efforts address technology development to meet technical needs of the DER market: protection, isolation, reliability, and quality. Program activities build on SatCon Technology Corporation inverter experience (e.g., AIPM, Starsine, PowerGate) for Photovoltaic, Fuel Cell, and Energy Storage applications. Efforts focused on four technical areas: capacitors, cooling, voltage sensing, and control of parallel inverters. Capacitor efforts developed a hybrid capacitor approach for conditioning SatCon's AIPM unit supply voltages by incorporating several types and sizes to store energy and filter at high, medium and low frequencies while minimizing parasitics (ESR and ESL). Cooling efforts converted the liquid-cooled AIPM module to an air-cooled unit using augmented fin, impingement flow cooling. Voltage sensing efforts successfully modified the existing AIPM sensor board to allow several application-dependent configurations and to enable voltage sensor galvanic isolation. Parallel inverter control efforts realized a reliable technique to control individual inverters, connected in a parallel configuration, without a communication link. Individual inverter currents, AC and DC, were balanced in the paralleled modules by introducing a delay to the individual PWM gate pulses. The load current sharing is robust and independent of load types (i.e., linear and nonlinear, resistive and/or inductive). It is a simple yet powerful method for paralleling individual devices that dramatically improves reliability and fault tolerance of parallel inverter power systems. A patent application has been made based on this control technology.
Executive impunity and parallel justice? The United Kingdom debate on secret inquests and inquiries.
Bray, Rebecca Scott
2012-03-01
At the beginning of 2008, the United Kingdom Government rolled into the Counter-Terrorism Bill some controversial proposals to reform coronial inquest processes, namely clauses that would provide for "secret inquests". The provisions were heavily criticised both inside and outside Parliament, and took a rocky passage through both the House of Commons and the House of Lords before eventually being abandoned by the government. In 2009 the government again tried to introduce "secret inquests" with the Coroners and Justice Bill, instead ultimately succeeding in establishing what critics have termed a "parallel" system of justice through provisions around "secret inquiries". This move has been seen as subverting the principles of transparency and open justice in the investigation of contentious deaths. This article examines the government's efforts to introduce "secret inquests" and thereafter "secret inquiries" in the context of the United Kingdom's coronial law and purpose, human rights obligations and the ongoing issues around sensitive intelligence, and examines the clash of laws that gave rise to the controversial proposals.
Unobtrusive Software and System Health Management with R2U2 on a Parallel MIMD Coprocessor
NASA Technical Reports Server (NTRS)
Schumann, Johann; Moosbrugger, Patrick
2017-01-01
Dynamic monitoring of software and system health of a complex cyber-physical system requires observers that continuously monitor variables of the embedded software in order to detect anomalies and reason about root causes. There exists a variety of techniques for code instrumentation, but instrumentation might change runtime behavior and could require costly software re-certification. In this paper, we present R2U2E, a novel realization of our real-time, Realizable, Responsive, and Unobtrusive Unit (R2U2). The R2U2E observers are executed in parallel on a dedicated 16-core EPIPHANY co-processor, thereby avoiding additional computational overhead to the system under observation. A DMA-based shared memory access architecture allows R2U2E to operate without any code instrumentation or program interference.
Massively parallel multicanonical simulations
NASA Astrophysics Data System (ADS)
Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard
2018-03-01
Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with on the order of 10^4 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as a starting point and reference for practitioners in the field.
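A minimal CUDA sketch of the parallel-walker idea on a toy model: each thread advances one independent walker performing a random walk in energy space, with the multicanonical acceptance min(1, exp(W(E') - W(E))) read from a shared log-weight table. The paper's actual simulations use a 2D Ising model and periodic weight synchronization; all names here, and the assumption that the curandState array was seeded beforehand with curand_init, are illustrative.

```cuda
// Toy multicanonical sweep: one thread per independent walker, shared
// log-weight table W indexed by energy bin. States assumed pre-seeded.
#include <curand_kernel.h>

__global__ void mucaSweep(int* energy, const float* W, int nBins,
                          curandState* states, int nWalkers, int sweeps) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= nWalkers) return;
    curandState rng = states[w];   // local copy of this walker's RNG state
    int E = energy[w];
    for (int s = 0; s < sweeps; ++s) {
        // Propose a move to a neighboring energy bin (symmetric proposal).
        int Ep = E + (((curand(&rng) & 1) != 0) ? 1 : -1);
        if (Ep >= 0 && Ep < nBins) {
            // Multicanonical acceptance: min(1, exp(W[Ep] - W[E])).
            if (curand_uniform(&rng) < expf(W[Ep] - W[E])) E = Ep;
        }
    }
    energy[w] = E;
    states[w] = rng;
}
```

Between such sweeps, the host would gather the walkers' histograms and broadcast an updated weight table, mirroring the weight-update communication the abstract describes.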
Sittig, D. F.; Orr, J. A.
1991-01-01
Various methods have been proposed in an attempt to solve problems in artifact and/or alarm identification including expert systems, statistical signal processing techniques, and artificial neural networks (ANN). ANNs consist of a large number of simple processing units connected by weighted links. To develop truly robust ANNs, investigators are required to train their networks on huge training data sets, requiring enormous computing power. We implemented a parallel version of the backward error propagation neural network training algorithm in the widely portable parallel programming language C-Linda. A maximum speedup of 4.06 was obtained with six processors. This speedup represents a reduction in total run-time from approximately 6.4 hours to 1.5 hours. We conclude that use of the master-worker model of parallel computation is an excellent method for obtaining speedups in the backward error propagation neural network training algorithm. PMID:1807607
Monte Carlo MP2 on Many Graphical Processing Units.
Doran, Alexander E; Hirata, So
2016-10-11
In the Monte Carlo second-order many-body perturbation (MC-MP2) method, the long sum-of-product matrix expression of the MP2 energy, whose literal evaluation may be poorly scalable, is recast into a single high-dimensional integral of functions of electron pair coordinates, which is evaluated by the scalable method of Monte Carlo integration. The sampling efficiency is further accelerated by the redundant-walker algorithm, which allows a maximal reuse of electron pairs. Here, a multitude of graphical processing units (GPUs) offers a uniquely ideal platform to expose multilevel parallelism: fine-grain data-parallelism for the redundant-walker algorithm in which millions of threads compute and share orbital amplitudes on each GPU; coarse-grain instruction-parallelism for near-independent Monte Carlo integrations on many GPUs with few and infrequent interprocessor communications. While the efficiency boost by the redundant-walker algorithm on central processing units (CPUs) grows linearly with the number of electron pairs and tends to saturate when the latter exceeds the number of orbitals, on a GPU it grows quadratically before it increases linearly and then eventually saturates at a much larger number of pairs. This is because the orbital constructions are nearly perfectly parallelized on a GPU and thus completed in a near-constant time regardless of the number of pairs. In consequence, an MC-MP2/cc-pVDZ calculation of a benzene dimer is 2700 times faster on 256 GPUs (using 2048 electron pairs) than on two CPUs, each with 8 cores (which can use only up to 256 pairs effectively). We also numerically determine that the cost to achieve a given relative statistical uncertainty in an MC-MP2 energy increases as O(n^3) or better with system size n, which may be compared with the O(n^5) scaling of the conventional implementation of deterministic MP2. We thus establish the scalability of MC-MP2 with both system and computer sizes.
NASA Astrophysics Data System (ADS)
Shcherbakov, Alexandre S.; Chavez Dagostino, Miguel; Arellanes, Adan Omar; Tepichin Rodriguez, Eduardo
2017-08-01
We describe a potential prototype of a modern spectrometer based on acousto-optical technique with three parallel optical arms for the analysis of radio-wave signals specific to astronomical observations. Each optical arm has its own performance characteristics, providing parallel multi-band observations on different scales simultaneously. Such a multi-band instrument is able to realize measurements within various scenarios, from planetary atmospheres to attractive objects in the distant Universe. The arrangement under development has two novelties. First, each optical arm represents an individual spectrum analyzer with its individual performance. This approach is conditioned by exploiting various materials for acousto-optical cells operating in various regimes, frequency ranges, and light wavelengths from independent light sources. Individually produced beam shapers give both the needed incident light polarization and the required apodization of the light beam to increase the dynamic range of the system as a whole. After parallel acousto-optical processing, the data flows from these optical arms are united by a joint CCD matrix at the stage of combined, extremely high bit-rate electronic data processing, which also determines the overall system performance. The other novelty consists in the usage of various materials for designing wide-aperture acousto-optical cells exhibiting the best performance within each optical arm. Here, one can mention specifically selected cuts of tellurium dioxide, bastron, and lithium niobate, which overlap selected areas within the frequency range from 40 MHz to 2.0 GHz. The result is a unified, versatile instrument for comprehensive studies of astronomical objects, with precise synchronization across the various frequency ranges.
Yuan, Jie; Xu, Guan; Yu, Yao; Zhou, Yu; Carson, Paul L; Wang, Xueding; Liu, Xiaojun
2013-08-01
Photoacoustic tomography (PAT) offers structural and functional imaging of living biological tissue with highly sensitive optical absorption contrast and excellent spatial resolution comparable to medical ultrasound (US) imaging. We report the development of a fully integrated PAT and US dual-modality imaging system, which performs signal scanning, image reconstruction, and display for both photoacoustic (PA) and US imaging in a truly real-time manner. The back-projection (BP) algorithm for PA image reconstruction is optimized to reduce the computational cost and facilitate parallel computation on a state-of-the-art graphics processing unit (GPU) card. For the first time, PAT and US imaging of the same object can be conducted simultaneously and continuously at a real-time frame rate, presently limited by the laser repetition rate of 10 Hz. Noninvasive PAT and US imaging of human peripheral joints in vivo were achieved, demonstrating the satisfactory image quality realized with this system. Another experiment, simultaneous PAT and US imaging of contrast agent flowing through an artificial vessel, was conducted to verify the performance of this system for imaging fast biological events. The GPU-based image reconstruction software code for this dual-modality system is open source and available for download from http://sourceforge.net/projects/patrealtime.
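At its core, the BP reconstruction the paper optimizes is a delay-and-sum over transducer signals that parallelizes one pixel per thread. Here is a minimal CUDA sketch under simplifying assumptions (2D geometry, no BP weighting factors, nearest-sample interpolation); all names are illustrative, not the project's actual code.

```cuda
// Naive 2D delay-and-sum backprojection: one thread per image pixel, summing
// each transducer's signal at that pixel's acoustic time of flight.
#include <math.h>

__global__ void backproject(const float* rf,          // [nTrans x nSamples] PA signals
                            const float* transX, const float* transY,
                            int nTrans, int nSamples,
                            float fs, float c,        // sampling rate, sound speed
                            float x0, float y0, float dx,
                            int nx, int ny, float* image) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;
    float px = x0 + ix * dx, py = y0 + iy * dx;       // pixel position
    float sum = 0.0f;
    for (int t = 0; t < nTrans; ++t) {
        float ddx = px - transX[t], ddy = py - transY[t];
        float dist = sqrtf(ddx * ddx + ddy * ddy);
        int s = (int)(dist / c * fs + 0.5f);          // time of flight -> sample
        if (s < nSamples) sum += rf[t * nSamples + s];
    }
    image[iy * nx + ix] = sum;
}
```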
Image Understanding and Intelligent Parallel Systems
1991-05-09
...a common user interface for the interactive, graphical manipulation of those histories, and ...Circuits and Systems, August 1987. Yap, S.-K. and M.L. Scott, "PenGuin: A language for reactive graphical user interface programming," to appear, INTERACT, Cambridge, United Kingdom, 1990. ...of up to a factor of 100 over single-workstation implementations. User interfaces to large multiprocessor computers are a difficult issue addressed...
An Efficient Fuzzy Controller Design for Parallel Connected Induction Motor Drives
NASA Astrophysics Data System (ADS)
Usha, S.; Subramani, C.
2018-04-01
Induction motors are highly non-linear and have complex time-varying dynamics, which makes their speed control a challenging issue in industry. Owing to recent advances in power electronic devices and intelligent controllers, however, speed control of the induction motor can be achieved even when its non-linear characteristics are taken into account. Conventionally, a single inverter drives one induction motor in industrial use. In traction applications, two or more induction motors are operated in parallel to reduce the size and cost of the drive, and the parallel-connected induction motors can then be driven by a single inverter unit. Stability problems may arise in parallel operation under low-speed operating conditions, so the speed deviations should be reduced with the help of suitable controllers. Here, the speed control of the parallel-connected system is performed by a PID controller and by a fuzzy logic controller. In this paper, the speed responses of an induction motor rated at 1 HP, 1440 rpm, and 50 Hz with these controllers are compared in terms of time-domain specifications. A stability analysis of the system at low speed is also performed on the MATLAB platform. A hardware model was developed for speed control using the fuzzy logic controller, which exhibited superior performance over the other controller.
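A minimal sketch of a Mamdani-style fuzzy speed controller of the kind compared in the paper; the membership functions, rule table, and normalization of the inputs to [-2, 2] are assumptions, not the authors' design.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_speed_controller(error, d_error):
    """Return the increment to the inverter command; inputs assumed scaled to [-2, 2]."""
    labels = {"N": (-2.0, -1.0, 0.0), "Z": (-1.0, 0.0, 1.0), "P": (0.0, 1.0, 2.0)}
    out = {"N": -1.0, "Z": 0.0, "P": 1.0}          # output singletons
    # Rule table: d(control) as a function of (error, d_error) labels.
    rules = {("N", "N"): "N", ("N", "Z"): "N", ("N", "P"): "Z",
             ("Z", "N"): "N", ("Z", "Z"): "Z", ("Z", "P"): "P",
             ("P", "N"): "Z", ("P", "Z"): "P", ("P", "P"): "P"}
    num = den = 0.0
    for (le, ld), lo in rules.items():
        w = min(tri(error, *labels[le]), tri(d_error, *labels[ld]))  # rule firing strength
        num += w * out[lo]
        den += w
    return num / den if den > 0 else 0.0           # centroid defuzzification
```

In a drive loop, this increment would be accumulated and applied to the inverter's speed reference at each control step.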
Parallel processing in the honeybee olfactory pathway: structure, function, and evolution.
Rössler, Wolfgang; Brill, Martin F
2013-11-01
Animals face highly complex and dynamic olfactory stimuli in their natural environments, which require fast and reliable olfactory processing. Parallel processing is a common principle of sensory systems supporting this task, for example in visual and auditory systems, but its role in olfaction remained unclear. Studies in the honeybee focused on a dual olfactory pathway. Two sets of projection neurons connect glomeruli in two antennal-lobe hemilobes via lateral and medial tracts in opposite sequence with the mushroom bodies and lateral horn. Comparative studies suggest that this dual-tract circuit represents a unique adaptation in Hymenoptera. Imaging studies indicate that glomeruli in both hemilobes receive redundant sensory input. Recent simultaneous multi-unit recordings from projection neurons of both tracts revealed widely overlapping response profiles strongly indicating parallel olfactory processing. Whereas lateral-tract neurons respond fast with broad (generalistic) profiles, medial-tract neurons are odorant specific and respond slower. In analogy to the "what" and "where" subsystems in visual pathways, this suggests two parallel olfactory subsystems providing "what" (quality) and "when" (temporal) information. Temporal response properties may support across-tract coincidence coding in higher centers. Parallel olfactory processing likely enhances perception of complex odorant mixtures to decode the diverse and dynamic olfactory world of a social insect.
PUP: An Architecture to Exploit Parallel Unification in Prolog
1988-03-01
environment stacking model similar to the Warren Abstract Machine [23] since it has been shown to be superior to other known models (see [21]). The storage...execute in groups of independent operations. Unifications belonging to different groups may not overlap. Also unification operations belonging to the...since all parallel operations on the unification units must complete before any of the units can start executing the next group of parallel
Spudich, Paul A.; Chiou, Brian
2015-01-01
We present a two-dimensional system of generalized coordinates for use with geometrically complex fault ruptures that are neither straight nor continuous. The coordinates are a generalization of the conventional strike-normal and strike-parallel coordinates of a single straight fault. The presented conventions and formulations are applicable to a single curved trace, as well as multiple traces representing the rupture of branching faults or noncontiguous faults. An early application of our generalized system is in the second round of the Next Generation of Ground-Motion Attenuation Model project for the Western United States (NGA-West2), where the coordinates were used in the characterization of hanging-wall effects. We further improve the NGA-West2 strike-parallel formulation for multiple rupture traces with a more intuitive definition of the nominal strike direction. We also derive an analytical expression for the gradient of the generalized strike-normal coordinate. The direction of this gradient may be used as the strike-normal direction in the study of polarization effects on ground motions.
Parallel Work of CO2 Ejectors Installed in a Multi-Ejector Module of Refrigeration System
NASA Astrophysics Data System (ADS)
Bodys, Jakub; Palacz, Michal; Haida, Michal; Smolka, Jacek; Nowak, Andrzej J.; Banasiak, Krzysztof; Hafner, Armin
2016-09-01
A performance analysis of the fixed ejectors installed in a multi-ejector module of a CO2 refrigeration system is presented in this study. The serial and parallel operation of the four fixed-geometry units that compose the multi-ejector pack was examined. The numerical simulations were performed with the use of a validated homogeneous equilibrium model (HEM). The computational tool ejectorPL was used in all the tests, for typical transcritical parameters at the motive nozzle. A wide range of operating conditions for supermarket applications in three different European climate zones was taken into consideration. The obtained results show the high and stable performance of all the ejectors in the multi-ejector pack.
1985-05-01
unit in the data base, with knowing one generic assembly language. ... The 5-tuple describing single operation execution time of the operations ... computing machinery capable of performing these tasks within a given time constraint. Because the majority of the available computing machinery is general
On-board landmark navigation and attitude reference parallel processor system
NASA Technical Reports Server (NTRS)
Gilbert, L. E.; Mahajan, D. T.
1978-01-01
An approach to autonomous navigation and attitude reference for earth observing spacecraft is described, along with a landmark identification technique based on a sequential similarity detection algorithm (SSDA). Laboratory experiments undertaken to determine whether better-than-one-pixel registration accuracy can be achieved, consistent with onboard processor timing and capacity constraints, are included. The SSDA is implemented using a multi-microprocessor system including synchronization logic and a chip library. The data is processed in parallel stages, effectively reducing the time to match the small known image within a larger image as seen by the onboard imaging system. Shared memory is incorporated in the system to help communicate intermediate results among the microprocessors. The functions include finding mean values and summation of absolute differences over the image search area. The hardware is a low-power, compact unit suitable for onboard application, with the flexibility to provide for different parameters depending upon the environment.
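A minimal sketch of the SSDA template matching described above: the sum of absolute differences (SAD) is accumulated row by row for each candidate offset, and a candidate is abandoned as soon as its running error exceeds a threshold, the early exit that makes the SSDA fast. The independent candidate offsets are what the multi-microprocessor system processes in parallel; function names and the threshold policy here are assumptions.

```python
import numpy as np

def ssda_match(search, template, threshold):
    """Return the (row, col) offset with minimal SAD, using early termination."""
    th, tw = template.shape
    best, best_pos = np.inf, (0, 0)
    for r in range(search.shape[0] - th + 1):
        for c in range(search.shape[1] - tw + 1):
            err = 0.0
            for i in range(th):                   # accumulate the error row by row
                err += np.abs(search[r + i, c:c + tw] - template[i]).sum()
                if err > min(threshold, best):    # early exit: hopeless candidate
                    break
            else:
                if err < best:
                    best, best_pos = err, (r, c)
    return best_pos
```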
NASA Astrophysics Data System (ADS)
Lasher, Mark E.; Henderson, Thomas B.; Drake, Barry L.; Bocker, Richard P.
1986-09-01
The modified signed-digit (MSD) number representation offers fully parallel, carry-free addition. An MSD adder has been described by the authors. This paper describes how the adder can be used in a tree structure to implement an optical multiply algorithm. Three different optical schemes, involving position, polarization, and intensity encoding, are proposed for realizing the trinary logic system. When configured in the generic multiplier architecture, these schemes yield the combinatorial logic necessary to carry out the multiplication algorithm. The optical systems are essentially three-dimensional arrangements composed of modular units. This modularity is important for design considerations, while the parallelism and noninterfering communication channels of optical systems are important from the standpoint of reduced complexity. The authors have also designed electronic hardware to demonstrate and model the combinatorial logic required to carry out the algorithm. The electronic and proposed optical systems will be compared in terms of complexity and speed.
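A minimal sketch, in software, of the carry-free MSD addition the multiplier tree builds on: with digits -1, 0, +1 in base 2, a two-step rule produces every output digit from a bounded neighborhood of input digits, so all positions can be computed simultaneously. The transfer rules follow the standard signed-digit scheme and are an assumption, not the paper's exact optical truth tables.

```python
def msd_value(digits):
    """Value of an MSD number; digits[i] in {-1, 0, +1} is the coefficient of 2**i."""
    return sum(d * (1 << i) for i, d in enumerate(digits))

def msd_add(x, y):
    """Carry-free MSD addition of two digit lists (least-significant digit first)."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    s = [x[i] + y[i] for i in range(n)]            # position sums, each in -2..2
    t = [0] * (n + 1)                              # t[i] is the transfer into position i
    w = [0] * (n + 1)                              # interim sum digits
    for i in range(n):                             # positions are independent: parallel step 1
        lower = s[i - 1] if i > 0 else 0           # one-position look-down
        if s[i] == 2:    t[i + 1], w[i] = 1, 0
        elif s[i] == -2: t[i + 1], w[i] = -1, 0
        elif s[i] == 1:  t[i + 1], w[i] = (1, -1) if lower > 0 else (0, 1)
        elif s[i] == -1: t[i + 1], w[i] = (-1, 1) if lower < 0 else (0, -1)
    # Step 2: digit-wise sum; the choice above guarantees it stays in {-1, 0, +1}.
    return [w[i] + t[i] for i in range(n + 1)]

# 5 + (-3) = 2: digits are least-significant first.
print(msd_value(msd_add([1, 0, 1], [-1, -1, 0])))   # -> 2
```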
A closed-loop automatic control system for high-intensity acoustic test systems.
NASA Technical Reports Server (NTRS)
Slusser, R. A.
1973-01-01
Sound at pressure levels in the range from 130 to 160 dB is used in the investigation. Random noise is passed through a series of parallel filters, generally 1/3-octave wide. A basic automatic system is investigated because a study of a typical manually controlled acoustic testing system revealed preadjustment inaccuracies and high costs. The unit described has been successfully used in automatic acoustic tests in connection with the spacecraft tests for the Mariner 1971 program.
Radiative energy transfer in molecular gases
NASA Technical Reports Server (NTRS)
Tiwari, Surendra N.
1992-01-01
Basic formulations, analyses, and numerical procedures are presented to study radiative interactions in gray as well as nongray gases under different physical and flow conditions. After preliminary fluid-dynamical considerations, essential governing equations for radiative transport are presented that are applicable under local and nonlocal thermodynamic equilibrium conditions. Auxiliary relations for relaxation times and spectral absorption models are also provided. For specific applications, several simple gaseous systems are analyzed. The first system considered consists of a gas bounded by two parallel plates having the same temperature. Within the gas there is a uniform heat source per unit volume. For this system, both vibrational nonequilibrium effects and radiation conduction interactions are studied. The second system consists of fully developed laminar flow and heat transfer in a parallel plate duct under the boundary condition of a uniform surface heat flux. For this system, effects of gray surface emittance are studied. The third system is identical to the second except that it has a circular geometry; here, the influence of nongray walls is also studied.
Design of a real-time wind turbine simulator using a custom parallel architecture
NASA Technical Reports Server (NTRS)
Hoffman, John A.; Gluck, R.; Sridhar, S.
1995-01-01
The design of a new parallel-processing digital simulator is described. The new simulator has been developed specifically for analysis of wind energy systems in real time. The new processor has been named: the Wind Energy System Time-domain simulator, version 3 (WEST-3). Like previous WEST versions, WEST-3 performs many computations in parallel. The modules in WEST-3 are pure digital processors, however. These digital processors can be programmed individually and operated in concert to achieve real-time simulation of wind turbine systems. Because of this programmability, WEST-3 is much more flexible and general than its two predecessors. The design features of WEST-3 are described to show how the system produces high-speed solutions of nonlinear time-domain equations. WEST-3 has two very fast Computational Units (CU's) that use minicomputer technology plus special architectural features that make them many times faster than a microcomputer. These CU's are needed to perform the complex computations associated with the wind turbine rotor system in real time. The parallel architecture of the CU causes several tasks to be done in each cycle, including an IO operation and the combination of a multiply, add, and store. The WEST-3 simulator can be expanded at any time for additional computational power. This is possible because the CU's are interfaced to each other and to other portions of the simulation using special serial buses. These buses can be 'patched' together in essentially any configuration (in a manner very similar to the programming methods used in analog computation) to balance the input/output requirements. CU's can be added in any number to share a given computational load. This flexible bus feature is very different from many other parallel processors, which usually have a throughput limit because of rigid bus architecture.
NASA Astrophysics Data System (ADS)
Wang, Ting; Plecháč, Petr
2017-12-01
Stochastic reaction networks that exhibit bistable behavior are common in systems biology, materials science, and catalysis. Sampling of stationary distributions is crucial for understanding and characterizing the long-time dynamics of bistable stochastic dynamical systems. However, simulations are often hindered by the insufficient sampling of rare transitions between the two metastable regions. In this paper, we apply the parallel replica method for a continuous time Markov chain in order to improve sampling of the stationary distribution in bistable stochastic reaction networks. The proposed method uses parallel computing to accelerate the sampling of rare transitions. Furthermore, it can be combined with the path-space information bounds for parametric sensitivity analysis. With the proposed methodology, we study three bistable biological networks: the Schlögl model, the genetic switch network, and the enzymatic futile cycle network. We demonstrate the algorithmic speedup achieved in these numerical benchmarks. More significant acceleration is expected when multi-core or graphics processing unit computer architectures and programming tools such as CUDA are employed.
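A minimal sketch of the ingredients the paper combines, not its implementation: Gillespie's stochastic simulation algorithm (SSA) for the bistable Schlogl network, with independent replicas distributed over worker processes. The full parallel replica method adds dephasing and decorrelation stages that are omitted here, and the rate constants are illustrative assumptions.

```python
import numpy as np
from multiprocessing import Pool

STOICH = np.array([1, -1, 1, -1])              # net change in X for each reaction

def propensities(x):
    """Schlogl model; buffered species A and B are absorbed into the rate constants."""
    return np.array([0.015 * x * (x - 1),                    # A + 2X -> 3X
                     (1e-4 / 6.0) * x * (x - 1) * (x - 2),   # 3X -> A + 2X
                     200.0,                                  # B -> X
                     3.5 * x])                               # X -> B

def ssa_replica(args):
    """One SSA trajectory; returns the final copy number of X."""
    seed, x, t_end = args
    rng = np.random.default_rng(seed)
    t = 0.0
    while t < t_end:
        a = propensities(x)
        total = a.sum()
        t += rng.exponential(1.0 / total)          # waiting time to the next event
        x += STOICH[rng.choice(4, p=a / total)]    # fire one randomly chosen reaction
    return x

if __name__ == "__main__":
    with Pool() as pool:                           # replicas evolve independently, in parallel
        finals = pool.map(ssa_replica, [(seed, 250, 10.0) for seed in range(16)])
    print(sorted(finals))                          # clusters near the two metastable states
```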
NASA Astrophysics Data System (ADS)
Sewell, Stephen
This thesis introduces a software framework that effectively utilizes low-cost commercially available Graphic Processing Units (GPUs) to simulate complex scientific plasma phenomena that are modeled using the Particle-In-Cell (PIC) paradigm. The software framework that was developed conforms to the Compute Unified Device Architecture (CUDA), a standard for general purpose graphic processing that was introduced by NVIDIA Corporation. This framework has been verified for correctness and applied to advance the state of understanding of the electromagnetic aspects of the development of the Aurora Borealis and Aurora Australis. For each phase of the PIC methodology, this research has identified one or more methods to exploit the problem's natural parallelism and effectively map it for execution on the graphic processing unit and its host processor. The sources of overhead that can reduce the effectiveness of parallelization for each of these methods have also been identified. One of the novel aspects of this research was the utilization of particle sorting during the grid interpolation phase. The final representation resulted in simulations that executed about 38 times faster than simulations that were run on a single-core general-purpose processing system. The scalability of this framework to larger problem sizes and future generation systems has also been investigated.
Data Acquisition System for Multi-Frequency Radar Flight Operations Preparation
NASA Technical Reports Server (NTRS)
Leachman, Jonathan
2010-01-01
A three-channel data acquisition system was developed for the NASA Multi-Frequency Radar (MFR) system. The system is based on a commercial-off-the-shelf (COTS) industrial PC (personal computer) and two dual-channel 14-bit digital receiver cards. The decimated complex envelope representations of the three radar signals are passed to the host PC via the PCI bus, and then processed in parallel by multiple cores of the PC CPU (central processing unit). The innovation is this parallelization of the radar data processing using multiple cores of a standard COTS multi-core CPU. The data processing portion of the data acquisition software was built using autonomous program modules or threads, which can run simultaneously on different cores. A master program module calculates the optimal number of processing threads, launches them, and continually supplies each with data. The benefit of this new parallel software architecture is that COTS PCs can be used to implement increasingly complex processing algorithms on an increasing number of radar range gates and data rates. As new PCs become available with higher numbers of CPU cores, the software will automatically utilize the additional computational capacity.
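A minimal sketch of the master/worker pattern the abstract describes, using Python's standard library in place of the radar flight code: a master sizes a pool of workers to the available cores and keeps them supplied with data blocks, each processed independently. The per-block processing step (mean power per range gate) and all sizes are illustrative stand-ins.

```python
import os
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def process_block(block):
    """Per-range-gate power from complex envelope samples: block is (gates, samples)."""
    return (np.abs(block) ** 2).mean(axis=1)

if __name__ == "__main__":
    n_workers = os.cpu_count()                 # "optimal" worker count, as in the master module
    rng = np.random.default_rng(1)
    blocks = [rng.standard_normal((64, 256)) + 1j * rng.standard_normal((64, 256))
              for _ in range(100)]             # simulated data blocks
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        powers = list(pool.map(process_block, blocks))
    print(len(powers), powers[0].shape)        # 100 blocks of 64 range gates
```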
NASA Technical Reports Server (NTRS)
Kimnach, Greg L.; Lebron, Ramon C.; Fox, David A.
1999-01-01
The John H. Glenn Research Center at Lewis Field (GRC) in Cleveland, OH and the Sundstrand Corporation in Rockford, IL have designed and developed an Engineering Model (EM) Electrical Power Control Unit (EPCU) for the Fluids Combustion Facility (FCF) experiments to be flown on the International Space Station (ISS). The EPCU will be used as the power interface to the ISS power distribution system for the FCF's space experiments' test and telemetry hardware. Furthermore, it is proposed to be the common power interface for all experiments. The EPCU is a three-kilowatt 120 Vdc-to-28 Vdc converter utilizing three independent Power Converter Units (PCUs), each rated at 1 kWe (36 Adc @ 28 Vdc), which are paralleled and synchronized. Each converter may be fed from one of two ISS power channels. The 28 Vdc loads are connected to the EPCU output via 48 solid-state and current-limiting switches, rated at 4 Adc each. These switches may be paralleled to supply any given load up to the 108 Adc normal operational limit of the paralleled converters. The EPCU was designed in this manner to maximize allocated-power utilization, to shed loads autonomously, to provide fault tolerance, and to provide a flexible power converter and control module to meet various ISS load demands. Tests of the EPCU in the Power Systems Facility testbed at GRC reveal that the overall converted-power efficiency is approximately 89% with a nominal input voltage of 120 Vdc and a total load in the range of 40% to 110% of the rated 28 Vdc load. (The PCUs alone have an efficiency of approximately 94.5%.) Furthermore, the EM unit passed all flight-qualification level (and beyond) vibration tests, passed ISS EMI (conducted, radiated, and susceptibility) requirements, successfully operated for extended periods in a thermal/vacuum chamber, was integrated with a proto-flight experiment, and passed all stability and functional requirements.
System, methods and apparatus for program optimization for multi-threaded processor architectures
Bastoul, Cedric; Lethin, Richard A; Leung, Allen K; Meister, Benoit J; Szilagyi, Peter; Vasilache, Nicolas T; Wohlford, David E
2015-01-06
Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
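A minimal sketch of the data-parallel core of AAM fitting that the paper distributes over GPU threads: the per-pixel texture residual between the image warped into the model frame and the model's mean texture, followed by a parameter update through a precomputed update matrix. All shapes and names here (R, the pixel count W, the parameter count) are assumptions for illustration.

```python
import numpy as np

def aam_iteration(warped_pixels, mean_texture, R):
    """One fitting step: warped_pixels, mean_texture: (W,); R: (n_params, W)."""
    residual = warped_pixels - mean_texture   # W independent per-pixel operations
    return R @ residual                       # predicted parameter update dp

# Example with an assumed 10,000-pixel texture and 20 model parameters.
rng = np.random.default_rng(0)
dp = aam_iteration(rng.random(10000), rng.random(10000),
                   rng.standard_normal((20, 10000)))
print(dp.shape)                               # -> (20,)
```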
History, Principles, and Policies of Observation Medicine.
Ross, Michael A; Granovsky, Michael
2017-08-01
The history of observation medicine has paralleled the rise of emergency medicine over the past 50 years to meet the needs of patients, emergency departments, hospitals, and the US health care system. Just as emergency departments are the safety net of the health system, observation units are the safety net of emergency departments. The growth of observation medicine has been driven by innovations in health care, an ongoing shift of patients from inpatient to outpatient settings, and changes in health policy. These units have been shown to provide better outcomes than traditional care for selected patients.
Parallel Curriculum Units for Science, Grades 6-12
ERIC Educational Resources Information Center
Leppien, Jann H.; Purcell, Jeanne H.
2011-01-01
Based on the best-selling book "The Parallel Curriculum", this professional development resource gives multifaceted examples of rigorous learning opportunities for science students in Grades 6-12. The four sample units revolve around genetics, the convergence of science and society, the integration of language arts and biology, and the periodic…
NASA Astrophysics Data System (ADS)
Lawry, B. J.; Encarnacao, A.; Hipp, J. R.; Chang, M.; Young, C. J.
2011-12-01
With the rapid growth of multi-core computing hardware, it is now possible for scientific researchers to run complex, computationally intensive software on affordable, in-house commodity hardware. Multi-core CPUs (Central Processing Unit) and GPUs (Graphics Processing Unit) are now commonplace in desktops and servers. Developers today have access to extremely powerful hardware that enables the execution of software that could previously only be run on expensive, massively-parallel systems. It is no longer cost-prohibitive for an institution to build a parallel computing cluster consisting of commodity multi-core servers. In recent years, our research team has developed a distributed, multi-core computing system and used it to construct global 3D earth models using seismic tomography. Traditionally, computational limitations forced certain assumptions and shortcuts in the calculation of tomographic models; however, with the recent rapid growth in computational hardware including faster CPUs, increased RAM, and the development of multi-core computers, we are now able to perform seismic tomography, 3D ray tracing and seismic event location using distributed parallel algorithms running on commodity hardware, thereby eliminating the need for many of these shortcuts. We describe Node Resource Manager (NRM), a system we developed that leverages the capabilities of a parallel computing cluster. NRM is a software-based parallel computing management framework that works in tandem with the Java Parallel Processing Framework (JPPF, http://www.jppf.org/), a third party library that provides a flexible and innovative way to take advantage of modern multi-core hardware. NRM enables multiple applications to use and share a common set of networked computers, regardless of their hardware platform or operating system. Using NRM, algorithms can be parallelized to run on multiple processing cores of a distributed computing cluster of servers and desktops, which results in a dramatic speedup in execution time. NRM is sufficiently generic to support applications in any domain, as long as the application is parallelizable (i.e., can be subdivided into multiple individual processing tasks). At present, NRM has been effective in decreasing the overall runtime of several algorithms: 1) the generation of a global 3D model of the compressional velocity distribution in the Earth using tomographic inversion, 2) the calculation of the model resolution matrix, model covariance matrix, and travel time uncertainty for the aforementioned velocity model, and 3) the correlation of waveforms with archival data on a massive scale for seismic event detection. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Power saver circuit for audio/visual signal unit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Right, R. W.
1985-02-12
A combined audio and visual signal unit with the audio and visual components actuated alternately and powered over a single cable pair in such a manner that only one of the audio and visual components is drawing power from the power supply at any given instant. Thus, the power supply is never called upon to provide more energy than that drawn by the one of the components having the greater power requirement. This is particularly advantageous when several combined audio and visual signal units are coupled in parallel on one cable pair. Typically, the signal unit may comprise a horn and a strobe light for a fire alarm signalling system.
Photovoltaic Powering And Control System For Electrochromic Windows
Schulz, Stephen C.; Michalski, Lech A.; Volltrauer, Hermann N.; Van Dine, John E.
2000-04-25
A sealed insulated glass unit is provided with an electrochromic device for modulating light passing through the unit. The electrochromic device is controlled from outside the unit by a remote control electrically unconnected to the device. Circuitry within the unit may be magnetically controlled from outside. The electrochromic device is powered by photovoltaic cells. The photovoltaic cells may be positioned so that at least a part of the light incident on the cell passes through the electrochromic device, providing a form of feedback control. A variable resistance placed in parallel with the electrochromic element is used to control the response of the electrochromic element to changes in output of the photovoltaic cell.
NASA Astrophysics Data System (ADS)
Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos
1990-03-01
The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16 x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.
The Metropolis Monte Carlo method with CUDA enabled Graphic Processing Units
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hall, Clifford; School of Physics, Astronomy, and Computational Sciences, George Mason University, 4400 University Dr., Fairfax, VA 22030; Ji, Weixiao
2014-02-01
We present a CPU-GPU system for runtime acceleration of large molecular simulations using GPU computation and memory swaps. The memory architecture of the GPU can be used both as container for simulation data stored on the graphics card and as floating-point code target, providing an effective means for the manipulation of atomistic or molecular data on the GPU. To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform atomistic and molecular simulations are essential. Our system implements a versatile molecular engine, including inter-molecule interactions and orientational variables, for performing the Metropolis Monte Carlo (MMC) algorithm, which is one type of Markov chain Monte Carlo. By combining memory objects with floating-point code fragments we have implemented an MMC parallel engine that entirely avoids the communication time of molecular data at runtime. Our runtime acceleration system is a forerunner of a new class of CPU-GPU algorithms exploiting memory concepts combined with threading for avoiding bus bandwidth and communication. The testbed molecular system used here is a condensed phase system of oligopyrrole chains. A benchmark shows a size scaling speedup of 60 for systems with 210,000 pyrrole monomers. Our implementation can easily be combined with MPI to connect in parallel several CPU-GPU duets. Highlights: (1) We parallelize the Metropolis Monte Carlo (MMC) algorithm on one CPU-GPU duet. (2) The Adaptive Tempering Monte Carlo employs MMC and profits from this CPU-GPU implementation. (3) Our benchmark shows a size scaling-up speedup of 62 for systems with 225,000 particles. (4) The testbed involves a polymeric system of oligopyrroles in the condensed phase. (5) The CPU-GPU parallelization includes dipole-dipole and Mie-Jones classic potentials.
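A minimal sketch of one Metropolis Monte Carlo sweep for a particle system, the algorithmic core that the paper offloads to the GPU; a Lennard-Jones potential in reduced units stands in for the paper's Mie-Jones and dipole-dipole terms, and kT, the step size, and the box are illustrative assumptions.

```python
import numpy as np

def pair_energy(pos, i):
    """Lennard-Jones energy of particle i with all others (reduced units)."""
    d2 = np.sum((pos - pos[i]) ** 2, axis=1)
    d2[i] = np.inf                             # exclude self-interaction
    inv6 = 1.0 / d2 ** 3
    return np.sum(4.0 * (inv6 ** 2 - inv6))

def mmc_sweep(pos, rng, kT=1.0, step=0.1):
    """One sweep: trial move per particle, accepted with probability min(1, e^{-dE/kT})."""
    for i in range(len(pos)):
        old = pos[i].copy()
        e_old = pair_energy(pos, i)
        pos[i] = old + rng.uniform(-step, step, 3)
        dE = pair_energy(pos, i) - e_old
        if dE > 0 and rng.random() >= np.exp(-dE / kT):
            pos[i] = old                       # reject: restore the old position
    return pos

rng = np.random.default_rng(0)
positions = rng.random((64, 3)) * 4.0          # 64 particles in a 4x4x4 box
positions = mmc_sweep(positions, rng)          # on the GPU, many such moves run concurrently
```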
ANNarchy: a code generation approach to neural simulations on parallel hardware
Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.
2015-01-01
Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which allows one to easily define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into efficient C++ code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957
Power conditioning unit for photovoltaic power systems
NASA Astrophysics Data System (ADS)
Beghin, G.; Nguyen Phuoc, V. T.
Operational features and components of a power conditioning unit for interconnecting solar cell module power with a utility grid are outlined. The two-stage unit first modifies the voltage to desired levels on an internal dc link, then inverts the current in two power transformers connected to a vector summation control to neutralize harmonic distortion up to the 11th harmonic. The system operates in parallel with the grid, with extra inductors to absorb line-to-line voltage and phase differences, and permits peak power use from the PV array. Reactive power is gained internally, and a power system controller monitors voltages, frequencies, and currents. A booster preregulator adjusts the input voltage from the array to provide voltage regulation for the inverter, and can commutate 450 amps. A total harmonic distortion of less than 5 percent is claimed, with a rating of 5 kVA, 50/60 Hz, 3-phase, and 4-wire.
Wilson, J Adam; Williams, Justin C
2009-01-01
The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
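A minimal sketch of the two stages this study moved to the GPU, written here with NumPy rather than CUDA: a spatial filter applied as a matrix-matrix multiplication, followed by an autoregressive (Yule-Walker) power spectral density estimate computed independently, and hence parallelizably, per channel. The filter contents, model order, and block length are assumptions.

```python
import numpy as np

def ar_psd(x, order=16, n_freq=128):
    """Yule-Walker AR power spectral density of one channel."""
    x = x - x.mean()
    r = np.correlate(x, x, "full")[len(x) - 1:] / len(x)    # autocorrelation, lags 0..N-1
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                  # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                      # prediction error power
    w = np.linspace(0, np.pi, n_freq)
    denom = np.abs(1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a) ** 2
    return sigma2 / denom

channels, samples = 64, 250                                 # 250 samples ~ 250 ms at 1 kHz
rng = np.random.default_rng(0)
data = rng.standard_normal((channels, samples))
spatial = np.eye(channels) - 1.0 / channels                 # common average reference filter
filtered = spatial @ data                                   # the matrix-matrix multiply stage
psd = np.array([ar_psd(ch) for ch in filtered])             # per-channel, parallelizable
print(psd.shape)                                            # -> (64, 128)
```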
NASA Astrophysics Data System (ADS)
Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.
2013-08-01
Space Situational Awareness [8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP [7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms [5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIA's Compute Unified Device Architecture (CUDA).
Varma, Sashank; Karl, Stacy R
2013-05-01
Much of the research on mathematical cognition has focused on the numbers 1, 2, 3, 4, 5, 6, 7, 8, and 9, with considerably less attention paid to more abstract number classes. The current research investigated how people understand decimal proportions--rational numbers between 0 and 1 expressed in the place-value symbol system. The results demonstrate that proportions are represented as discrete structures and processed in parallel. There was a semantic interference effect: When understanding a proportion expression (e.g., "0.29"), both the correct proportion referent (e.g., 0.29) and the incorrect natural number referent (e.g., 29) corresponding to the visually similar natural number expression (e.g., "29") are accessed in parallel, and when these referents lead to conflicting judgments, performance slows. There was also a syntactic interference effect, generalizing the unit-decade compatibility effect for natural numbers: When comparing two proportions, their tenths and hundredths components are processed in parallel, and when the different components lead to conflicting judgments, performance slows. The results also reveal that zero decimals--proportions ending in zero--serve multiple cognitive functions, including eliminating semantic interference and speeding processing. The current research also extends the distance, semantic congruence, and SNARC effects from natural numbers to decimal proportions. These findings inform how people understand the place-value symbol system, and the mental implementation of mathematical symbol systems more generally.
Blaettler, M; Bruegger, A; Forster, I C; Lehareinger, Y
1988-03-01
The design of an analog interface to a digital audio signal processor (DASP)-video cassette recorder (VCR) system is described. The complete system represents a low-cost alternative to both FM instrumentation tape recorders and multi-channel chart recorders. The interface, or DASP input-output unit, described in this paper enables the recording and playback of up to 12 analog channels with a maximum of 12-bit resolution and a bandwidth of 2 kHz per channel. Internal control and timing in the recording component of the interface is performed using ROMs, which can be reprogrammed to suit different analog-to-digital converter hardware. Improvement in the bandwidth specifications is possible by connecting channels in parallel. A parallel 16-bit data output port is provided for direct transfer of the digitized data to a computer.
Huang, Yongyang; Badar, Mudabbir; Nitkowski, Arthur; Weinroth, Aaron; Tansu, Nelson; Zhou, Chao
2017-01-01
Space-division multiplexing optical coherence tomography (SDM-OCT) is a recently developed parallel OCT imaging method that achieves a multi-fold speed improvement. However, the assembly of the fiber optic components used in the first prototype system was labor-intensive and susceptible to errors. Here, we demonstrate a high-speed SDM-OCT system using an integrated photonic chip that can be reliably manufactured with high precision and low per-unit cost. A three-layer cascade of 1 × 2 splitters was integrated in the photonic chip to split the incident light into 8 parallel imaging channels with ~3.7 mm optical delay in air between channels. High-speed imaging (~1 s/volume) of porcine eyes ex vivo and wide-field imaging (~18.0 × 14.3 mm2) of human fingers in vivo were demonstrated with the chip-based SDM-OCT system. PMID:28856055
DOE Office of Scientific and Technical Information (OSTI.GOV)
Del Ben, Mauro, E-mail: mauro.delben@chem.uzh.ch; Hutter, Jürg, E-mail: hutter@chem.uzh.ch; VandeVondele, Joost, E-mail: Joost.VandeVondele@mat.ethz.ch
The forces acting on the atoms as well as the stress tensor are crucial ingredients for calculating the structural and dynamical properties of systems in the condensed phase. Here, these derivatives of the total energy are evaluated for the second-order Møller-Plesset perturbation energy (MP2) in the framework of the resolution of identity Gaussian and plane waves method, in a way that is fully consistent with how the total energy is computed. This consistency is non-trivial, given the different ways employed to compute Coulomb, exchange, and canonical four center integrals, and allows, for example, for energy conserving dynamics in various ensembles. Based on this formalism, a massively parallel algorithm has been developed for finite and extended systems. The designed parallel algorithm displays, with respect to the system size, cubic, quartic, and quintic requirements, respectively, for the memory, communication, and computation. All these requirements are reduced with an increasing number of processes, and the measured performance shows excellent parallel scalability and efficiency up to thousands of nodes. Additionally, the computationally more demanding quintic scaling steps can be accelerated by employing graphics processing units (GPUs), showing, for large systems, a gain of almost a factor of two compared to the standard central processing unit-only case. In this way, the evaluation of the derivatives of the RI-MP2 energy can be performed within a few minutes for systems containing hundreds of atoms and thousands of basis functions. With good time to solution, the implementation thus opens the possibility to perform molecular dynamics (MD) simulations in various ensembles (microcanonical ensemble and isobaric-isothermal ensemble) at the MP2 level of theory. Geometry optimization, full cell relaxation, and energy conserving MD simulations have been performed for a variety of molecular crystals including NH₃, CO₂, formic acid, and benzene.
Research Priorities for School Nursing in the United Arab Emirates (UAE)
ERIC Educational Resources Information Center
Al-Yateem, Nabeel; Docherty, Charles; Brenner, Maria; Alhosany, Jameela; Altawil, Hanan; Al-Tamimi, Muna
2017-01-01
School nurses are challenged with more children having complex conditions, who are now surviving into school age. This is paralleled by a shift in focus of health systems toward primary care, and national efforts to develop the health-care services, especially those offered to vulnerable populations. Being at the forefront of this change, school…
Wire-Guide Manipulator For Automated Welding
NASA Technical Reports Server (NTRS)
Morris, Tim; White, Kevin; Gordon, Steve; Emerich, Dave; Richardson, Dave; Faulkner, Mike; Stafford, Dave; Mccutcheon, Kim; Neal, Ken; Milly, Pete
1994-01-01
Compact motor drive positions a guide for welding filler wire. The drive is part of an automated wire feeder in a partly or fully automated welding system. The drive unit contains three parallel subunits. Rotations of lead screws in the three subunits are coordinated to obtain desired motions in three degrees of freedom. Suitable for both variable-polarity plasma arc welding and gas/tungsten arc welding.
On Collapse and the Next U.S. Democracy: Elements of Applied Systemic Research
ERIC Educational Resources Information Center
Tanaka, Greg
2015-01-01
While concern has been growing in recent years about the structural precursors to economic collapse in the United States, and a parallel decline in democracy, few have asked: (1) what moral and cultural foundations might be necessary as building blocks to launch a democratic renewal and (2) whether a different and "deep" democracy might…
Solar heating and cooling diode module
Maloney, Timothy J.
1986-01-01
A high efficiency solar heating system comprising a plurality of hollow modular units each for receiving a thermal storage mass, the units being arranged in stacked relation in the exterior frame of a building, each of the units including a port for filling the unit with the mass, a collector region and a storage region, each region having inner and outer walls, the outer wall of the collector region being oriented for exposure to sunlight for heating the thermal storage mass; the storage region having an opening therein and the collector region having a corresponding opening, the openings being joined for communicating the thermal storage mass between the storage and collector regions by thermosiphoning; the collector region being disposed substantially below and in parallel relation to the storage region in the modular unit; and the inner wall of the collector region of each successive modular unit in the stacked relation extending over the outer wall of the storage region of the next lower modular unit in the stacked relation for reducing heat loss from the system. Various modifications and alternatives are disclosed for both heating and cooling applications.
The design of RFID conveyor belt gate systems using an antenna control unit.
Park, Chong Ryol; Lee, Seung Joon; Eom, Ki Hwan
2011-01-01
This paper proposes an efficient management system utilizing a Radio Frequency Identification (RFID) antenna control unit which moves along with the path of boxes of materials on the conveyor belt by manipulating a motor. The proposed antenna control unit, which is driven by a motor and is located on top of the gate, has an array structure of two antennas with parallel connection. The array structure helps improve the directivity of the antenna beam pattern and the readable RFID distance due to its configuration. In the experiments, as the control unit follows moving materials, the reading time has been improved almost three-fold compared to an RFID system employing conventional fixed antennas. The proposed system also has a recognition rate of over 99% without additional antennas for detecting the sides of a box of materials. The recognition rate meets the conditions recommended by the Electronic Product Code global network (EPCglobal) for commercializing the system, with three antennas at a 20 dBm reader power and a conveyor belt speed of 3.17 m/s. This will enable a host of new RFID conveyor belt gate systems with increased performance.
GPU real-time processing in NA62 trigger system
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-01-01
A commercial Graphics Processing Unit (GPU) is used to build a fast Level 0 (L0) trigger system tested parasitically with the TDAQ (Trigger and Data Acquisition systems) of the NA62 experiment at CERN. In particular, the parallel computing power of the GPU is exploited to perform real-time fitting in the Ring Imaging CHerenkov (RICH) detector. Direct GPU communication using an FPGA-based board has been used to reduce the data transmission latency. The performance of the system for multi-ring reconstruction obtained during the NA62 physics run will be presented.
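A minimal sketch of an algebraic least-squares ring fit of the kind used for RICH pattern recognition, not the NA62 kernels themselves: the circle equation x^2 + y^2 = 2ax + 2by + c is linear in (a, b, c), so each ring candidate reduces to a tiny least-squares problem that a GPU can solve for many candidates at once. Units and noise levels in the example are assumptions.

```python
import numpy as np

def fit_ring(x, y):
    """Return center (a, b) and radius r of the best-fit circle through the hits."""
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    rhs = x**2 + y**2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return a, b, np.sqrt(c + a**2 + b**2)      # c = r^2 - a^2 - b^2

# Example: noisy hits on a ring of radius 0.19 m centered at (0.1, -0.05).
rng = np.random.default_rng(0)
phi = rng.uniform(0, 2 * np.pi, 20)
x = 0.1 + 0.19 * np.cos(phi) + rng.normal(0, 1e-3, 20)
y = -0.05 + 0.19 * np.sin(phi) + rng.normal(0, 1e-3, 20)
print(fit_ring(x, y))
```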
Analog system for computing sparse codes
Rozell, Christopher John; Johnson, Don Herrick; Baraniuk, Richard Gordon; Olshausen, Bruno A.; Ortman, Robert Lowell
2010-08-24
A parallel dynamical system for computing sparse representations of data, i.e., where the data can be fully represented in terms of a small number of non-zero code elements, and for reconstructing compressively sensed images. The system is based on the principles of thresholding and local competition that solves a family of sparse approximation problems corresponding to various sparsity metrics. The system utilizes Locally Competitive Algorithms (LCAs), in which nodes in a population continually compete with neighboring units using (usually one-way) lateral inhibition to calculate coefficients representing an input in an overcomplete dictionary.
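A minimal sketch of the LCA dynamics summarized above: each node is a leaky integrator driven by its match with the input, inhibited by active neighbors in proportion to dictionary overlap, and thresholded to yield a sparse code. The dictionary, threshold, and integration constants are illustrative assumptions.

```python
import numpy as np

def lca(dictionary, signal, lam=0.1, tau=10.0, steps=200):
    """dictionary: (n_dims, n_atoms) with unit-norm columns; returns a sparse code."""
    b = dictionary.T @ signal                        # driving input to each node
    G = dictionary.T @ dictionary - np.eye(dictionary.shape[1])  # lateral inhibition weights
    u = np.zeros_like(b)                             # internal (membrane) states
    for _ in range(steps):
        a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)   # soft threshold
        u += (b - u - G @ a) / tau                   # leaky integration with competition
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                       # normalize dictionary atoms
s = D[:, [3, 97]] @ np.array([1.0, -0.5])            # signal built from two atoms
code = lca(D, s)
print(np.flatnonzero(code))                          # a few active elements, ideally 3 and 97
```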
Cazzaniga, Paolo; Nobile, Marco S.; Besozzi, Daniela; Bellini, Matteo; Mauri, Giancarlo
2014-01-01
The introduction of general-purpose Graphics Processing Units (GPUs) is boosting scientific applications in Bioinformatics, Systems Biology, and Computational Biology. In these fields, the use of high-performance computing solutions is motivated by the need to perform large numbers of in silico analyses to study the behavior of biological systems in different conditions, which necessitates computing power that usually exceeds the capability of standard desktop computers. In this work we present coagSODA, a CUDA-powered computational tool that was purposely developed for the analysis of a large mechanistic model of the blood coagulation cascade (BCC), defined according to both mass-action kinetics and Hill functions. coagSODA allows the execution of parallel simulations of the dynamics of the BCC by automatically deriving the system of ordinary differential equations and then exploiting the numerical integration algorithm LSODA. We present the biological results achieved with a massive exploration of perturbed conditions of the BCC, carried out with one-dimensional and bi-dimensional parameter sweep analysis, and show that GPU-accelerated parallel simulations of this model can increase the computational performance up to a 181× speedup compared to the corresponding sequential simulations. PMID:25025072
Data Acquisition with GPUs: The DAQ for the Muon g-2 Experiment at Fermilab
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gohn, W.
Graphical Processing Units (GPUs) have recently become a valuable computing tool for the acquisition of data at high rates and for a relatively low cost. The devices work by parallelizing the code into thousands of threads, each executing a simple process, such as identifying pulses from a waveform digitizer. The CUDA programming library can be used to effectively write code to parallelize such tasks on Nvidia GPUs, providing a significant upgrade in performance over CPU based acquisition systems. The muon g-2 experiment at Fermilab is heavily relying on GPUs to process its data. The data acquisition system for this experiment must have the ability to create deadtime-free records from 700 μs muon spills at a raw data rate of 18 GB per second. Data will be collected using 1296 channels of μTCA-based 800 MSPS, 12 bit waveform digitizers and processed in a layered array of networked commodity processors with 24 GPUs working in parallel to perform a fast recording of the muon decays during the spill. The described data acquisition system is currently being constructed, and will be fully operational before the start of the experiment in 2017.
1989-08-28
Voyager violet, green and ultraviolet images of Triton were map projected into cylindrical coordinates and combined to produce this false-color terrain map. Several compositionally distinct terrains and geologic features are portrayed. At center is a gray-blue unit referred to as 'cantaloupe' terrain because of its unusual topographic texture. The unit appears to predate other units to the left. Immediately adjacent to the cantaloupe terrain is a smoother unit, represented by a reddish color, that has been dissected by a prominent fault system. This unit apparently overlies a much-higher-albedo material, seen farther left. A prominent angular albedo boundary separates relatively undisturbed smooth terrain from irregular patches which seem to emanate from circular, often bright-centered features. The parallel streaks may represent vented particulate materials blown in the same direction by winds in Triton's thin atmosphere.
Vigelius, Matthias; Meyer, Bernd
2012-01-01
For many biological applications, a macroscopic (deterministic) treatment of reaction-drift-diffusion systems is insufficient. Instead, one has to properly handle the stochastic nature of the problem and generate true sample paths of the underlying probability distribution. Unfortunately, stochastic algorithms are computationally expensive and, in most cases, the large number of participating particles renders the relevant parameter regimes inaccessible. In an attempt to address this problem we present a genuine stochastic, multi-dimensional algorithm that solves the inhomogeneous, non-linear, drift-diffusion problem on a mesoscopic level. Our method improves on existing implementations in being multi-dimensional and handling inhomogeneous drift and diffusion. The algorithm is well suited for an implementation on data-parallel hardware architectures such as general-purpose graphics processing units (GPUs). We integrate the method into an operator-splitting approach that decouples chemical reactions from the spatial evolution. We demonstrate the validity and applicability of our algorithm with a comprehensive suite of standard test problems that also serve to quantify the numerical accuracy of the method. We provide a freely available, fully functional GPU implementation. Integration into Inchman, a user-friendly web service that allows researchers to perform parallel simulations of reaction-drift-diffusion systems on GPU clusters, is underway. PMID:22506001
Logic design for dynamic and interactive recovery.
NASA Technical Reports Server (NTRS)
Carter, W. C.; Jessep, D. C.; Wadia, A. B.; Schneider, P. R.; Bouricius, W. G.
1971-01-01
Recovery in a fault-tolerant computer means the continuation of system operation with data integrity after an error occurs. This paper delineates two parallel concepts embodied in the hardware and software functions required for recovery: detection, diagnosis, and reconfiguration for the hardware; data integrity, checkpointing, and restart for the software. The hardware relies on the recovery variable set, checking circuits, and diagnostics, and the software relies on the recovery information set, audit, and reconstruct routines, to characterize the system state and assist in recovery when required. Of particular utility is a hardware unit, the recovery control unit, which serves as an interface between error detection and software recovery programs in the supervisor and provides dynamic interactive recovery.
Simulating coupled dynamics of a rigid-flexible multibody system and compressible fluid
NASA Astrophysics Data System (ADS)
Hu, Wei; Tian, Qiang; Hu, HaiYan
2018-04-01
As a continuation of the authors' previous studies, a new parallel computation approach is proposed to simulate the coupled dynamics of a rigid-flexible multibody system and compressible fluid. In this approach, the smoothed particle hydrodynamics (SPH) method is used to model the compressible fluid, while the natural coordinate formulation (NCF) and absolute nodal coordinate formulation (ANCF) are used to model the rigid and flexible bodies, respectively. In order to model the compressible fluid properly and efficiently via the SPH method, three measures are taken. The first is to use a Riemann solver to cope with the fluid compressibility, the second is to define virtual SPH particles to model the dynamic interaction between the fluid and the multibody system, and the third is to impose periodic inflow and outflow boundary conditions to reduce the number of SPH particles involved in the computation. Afterwards, a parallel computation strategy is proposed based on the graphics processing unit (GPU) to detect neighboring SPH particles and to solve the dynamic equations of the SPH particles, improving computational efficiency. Meanwhile, the generalized-alpha algorithm is used to solve the dynamic equations of the multibody system. Finally, four case studies are given to validate the proposed parallel computation approach.
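A minimal sketch, in NumPy rather than the authors' GPU code, of the uniform-grid neighbor detection that such SPH implementations rely on: particles are binned into cells of size 2h, sorted by cell index for memory-coherent access, and each particle tests only candidates from its own and adjacent cells. The 2-D setting and the smoothing length h are assumptions.

```python
import numpy as np

def find_neighbors(pos, h):
    """Return (i, j) pairs of original indices with |pos[i] - pos[j]| < 2h, i before j."""
    cell = np.floor(pos / (2.0 * h)).astype(int)   # integer cell coordinates per particle
    order = np.lexsort(cell.T[::-1])               # sort particles by cell (coherent access)
    pos, cell = pos[order], cell[order]
    keys = [tuple(c) for c in cell]
    members = {}                                   # cell -> sorted particle indices
    for idx, k in enumerate(keys):
        members.setdefault(k, []).append(idx)
    pairs = []
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    for i, k in enumerate(keys):
        for dx, dy in offsets:                     # visit own and 8 adjacent cells only
            for j in members.get((k[0] + dx, k[1] + dy), ()):
                if j > i and np.sum((pos[i] - pos[j]) ** 2) < (2 * h) ** 2:
                    pairs.append((order[i], order[j]))
    return pairs

# Example: 1000 random particles in a unit square, smoothing length 0.02.
rng = np.random.default_rng(0)
print(len(find_neighbors(rng.random((1000, 2)), 0.02)))
```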
Rapid Parallel Calculation of Shell Element Based on GPU
NASA Astrophysics Data System (ADS)
Wang, Jian Hua; Li, Guang Yao; Li, Sheng
2010-06-01
Long computing times have bottlenecked the application of the finite element method (FEM). In this paper, an effective method to speed up FEM calculations using a modern graphics processing unit (GPU) and programmable rendering (shader) tools is put forward. The method devises a representation of element information suited to the features of the GPU, converts the element calculations into a rendering process, carries out the internal-force calculation for all elements, and overcomes the low degree of parallelism previously attainable on a single computer. Studies show that this method can greatly improve efficiency and shorten computing time. Simulation results for an elasticity problem with a large number of cells in sheet metal show that GPU-based parallel calculation is faster than its CPU counterpart. The approach is useful and efficient for solving practical engineering problems.
A design concept of parallel elasticity extracted from biological muscles for engineered actuators.
Chen, Jie; Jin, Hongzhe; Iida, Fumiya; Zhao, Jie
2016-08-23
Series elastic actuation that takes inspiration from biological muscle-tendon units has been extensively studied and used to address the challenges (e.g. energy efficiency, robustness) existing in purely stiff robots. However, there also exists another form of passive property in biological actuation, parallel elasticity within muscles themselves, and our knowledge of it is limited: for example, there is still no general design strategy for the elasticity profile. When we look at nature, on the other hand, there seems to be a universal agreement in biological systems: experimental evidence has suggested that a concave-upward elasticity behaviour is exhibited within the muscles of animals. Seeking to draw possible design clues for elasticity in parallel with actuators, we use a simplified joint model to investigate the mechanisms behind this biologically universal preference of muscles. Actuation of the model is identified from general biological joints and further reduced with a specific focus on muscle elasticity aspects, for the sake of easy implementation. By examining various elasticity scenarios, one without elasticity and three with elasticity of different profiles, we find that parallel elasticity generally exerts contradictory influences on energy efficiency and disturbance rejection, due to the mechanical impedance shift thus caused. The trade-off analysis between them also reveals that concave parallel elasticity is able to achieve a more advantageous balance than linear and convex ones. It is expected that the results could contribute to our further understanding of muscle elasticity and provide a theoretical guideline on how to properly design parallel elasticity behaviours for engineering systems such as artificial actuators and robotic joints.
Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin A.
In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphics Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different levels of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp-level parallelism (one matrix, one Warp) and shared memory. The second works at Thread-block-level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread-level parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and relies only on the high memory bandwidth of the GPU. The first and second solutions support only partial pivoting; the third easily supports partial and full pivoting, making it attractive for problems that require greater numerical stability. We analyze the trade-offs in terms of performance and power consumption as a function of the size of the linear systems that are simultaneously solved. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).
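For reference, the sketch below shows the numerical core of such a solver, LU factorization with partial pivoting plus back substitution for one small matrix, applied over a batch in plain NumPy. The mapping of one matrix to a Warp, Thread-block, or thread is the paper's contribution and is not reproduced here.

```python
import numpy as np

# LU with partial pivoting for one small dense system; the batch loop below
# is embarrassingly parallel, which is what the GPU variants exploit.

def lu_solve_pp(A, b):
    A = A.copy(); b = b.copy(); n = len(b)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))         # pivot row
        A[[k, p]] = A[[p, k]]; b[[k, p]] = b[[p, k]]
        m = A[k + 1:, k] / A[k, k]                  # elimination multipliers
        A[k + 1:, k + 1:] -= np.outer(m, A[k, k + 1:])
        b[k + 1:] -= m * b[k]
    x = np.empty(n)
    for i in range(n - 1, -1, -1):                  # back substitution
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

batch = [(np.random.rand(32, 32) + 32 * np.eye(32), np.random.rand(32))
         for _ in range(100)]
xs = [lu_solve_pp(A, b) for A, b in batch]          # one matrix per work unit
```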
Parallel task processing of very large datasets
NASA Astrophysics Data System (ADS)
Romig, Phillip Richardson, III
This research concerns the use of distributed computer technologies for the analysis and management of very large datasets. Improvements in sensor technology, an emphasis on global change research, and greater access to data warehouses all increase the number of non-traditional users of remotely sensed data. We present a framework for distributed solutions to the challenges of datasets which exceed the online storage capacity of individual workstations. This framework, called parallel task processing (PTP), incorporates both the task- and data-level parallelism exemplified by many image processing operations. An implementation based on the principles of PTP, called Tricky, is also presented. Additionally, we describe the challenges and practical issues in modeling the performance of parallel task processing with large datasets. We present a mechanism for estimating the running time of each unit of work within a system and an algorithm that uses these estimates to simulate the execution environment and produce estimated runtimes. Finally, we describe and discuss experimental results which validate the design. Specifically, the system (a) is able to perform computation on datasets which exceed the capacity of any one disk, (b) provides reduction of overall computation time as a result of the task distribution, even with the additional cost of data transfer and management, and (c) in the simulation mode accurately predicts the performance of the real execution environment.
On program restructuring, scheduling, and communication for parallel processor systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Polychronopoulos, Constantine D.
1986-08-01
This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed with a single goal: to maximize program speedup or, equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler, was used to transform programs into parallel form and to conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm, and its performance is compared to that of self-scheduling. Techniques for program partitioning and for minimization of interprocessor communication, for idealized program models and for real Fortran programs, are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup is presented together with experimental results.
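Loop coalescing, one of the two restructuring techniques named above, flattens a loop nest into a single loop so that all iterations form one pool that a scheduler can partition freely. A minimal illustration, with the loop body as a stand-in:

```python
# Minimal illustration of loop coalescing: the doubly nested loop is flattened
# into a single loop whose iterations form one schedulable pool.

N, M = 4, 6
work = lambda i, j: None  # stand-in for the loop body

# original nest
for i in range(N):
    for j in range(M):
        work(i, j)

# coalesced: one index k, mapped back to (i, j)
for k in range(N * M):
    i, j = divmod(k, M)
    work(i, j)
```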
NASA Astrophysics Data System (ADS)
Wu, Kaihua; Shao, Zhencheng; Chen, Nian; Wang, Wenjie
2018-01-01
The wearing degree of the wheel-set tread is one of the main factors that influence the safety and stability of a running train. The geometrical parameters mainly include flange thickness and flange height. Line-structured laser light was projected onto the wheel tread surface, and the geometrical parameters can be deduced from the profile image. An online image acquisition system was designed based on asynchronous reset of the CCD and a CUDA parallel processing unit; image acquisition is triggered by hardware interrupt. A high-efficiency parallel segmentation algorithm based on CUDA is proposed. The algorithm first divides the image into smaller squares and then extracts the squares belonging to the target by fusing the k-means and STING clustering image segmentation algorithms. Segmentation time is less than 0.97 ms, a considerable acceleration ratio compared with serial CPU calculation, which greatly improves real-time image processing capacity. When a wheel set passes at limited speed, the system, placed alongside the railway line, can measure the geometrical parameters automatically. The maximum measuring speed is 120 km/h.
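The square-wise segmentation described above can be pictured with the toy sketch below: the image is cut into small squares, per-square features are computed, and squares are clustered to separate target from background. This is only a schematic stand-in; the paper's CUDA kernels and the exact k-means/STING fusion are not reproduced.

```python
import numpy as np

# Tile the image into s x s squares, compute square-level features, and
# cluster them with a small k-means; all parameters are illustrative.

def kmeans(X, k=2, iters=20, rng=np.random.default_rng(4)):
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[lab == c].mean(0) if np.any(lab == c)
                            else centers[c] for c in range(k)])
    return lab

img = np.random.rand(128, 128)
s = 8                                                # square size
tiles = img.reshape(128 // s, s, 128 // s, s).swapaxes(1, 2).reshape(-1, s * s)
feats = np.column_stack([tiles.mean(1), tiles.std(1)])
labels = kmeans(feats).reshape(128 // s, 128 // s)   # target vs background
```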
A Programming Framework for Scientific Applications on CPU-GPU Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Owens, John
2013-03-24
At a high level, my research interests center around designing, programming, and evaluating computer systems that use new approaches to solve interesting problems. The rapid change of technology allows a variety of different architectural approaches to computationally difficult problems, and a constantly shifting set of constraints and trends makes the solutions to these problems both challenging and interesting. One of the most important recent trends in computing has been a move to commodity parallel architectures. This sea change is motivated by the industry's inability to continue to profitably increase performance on a single processor, and the consequent move to multiple parallel processors. In the period of review, my most significant work has been leading a research group looking at the use of the graphics processing unit (GPU) as a general-purpose processor. GPUs can potentially deliver superior performance on a broader range of problems than their CPU counterparts, but effectively mapping complex applications to a parallel programming model with an emerging programming environment is a significant and important research problem.
NASA Astrophysics Data System (ADS)
Lazcano, R.; Madroñal, D.; Fabelo, H.; Ortega, S.; Salvador, R.; Callicó, G. M.; Juárez, E.; Sanz, C.
2017-10-01
Hyperspectral Imaging (HI) assembles high-resolution spectral information from hundreds of narrow bands across the electromagnetic spectrum, generating 3D data cubes in which each spatial pixel gathers the spectral reflectance information of the scene. As a result, each image is composed of large volumes of data, which turns its processing into a challenge, as performance requirements have been continuously tightened. For instance, new HI applications demand real-time responses. Hence, parallel processing becomes a necessity to achieve this requirement, and the intrinsic parallelism of the algorithms must be exploited. In this paper, a spatial-spectral classification approach has been implemented using a dataflow language known as RVC-CAL. This language represents a system as a set of functional units, and its main advantage is that it simplifies the parallelization process by mapping the different blocks over different processing units. The spatial-spectral classification approach refines the classification results previously obtained by applying a K-Nearest Neighbors (KNN) filtering process, in which both the pixel spectral value and the spatial coordinates are considered. To do so, KNN needs two inputs: a one-band representation of the hyperspectral image and the classification results provided by a pixel-wise classifier. Thus, the spatial-spectral classification algorithm is divided into three stages: a Principal Component Analysis (PCA) algorithm for computing the one-band representation of the image, a Support Vector Machine (SVM) classifier, and the KNN-based filtering algorithm. The parallelization of these algorithms shows promising results in terms of computational time, as mapping them over different cores yields a speedup of 2.69x when using 3 cores. Consequently, experimental results demonstrate that real-time processing of hyperspectral images is achievable.
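A schematic of the three-stage chain (PCA for the one-band image, SVM for pixel-wise labels, KNN-style spatial-spectral filtering) might look as follows in plain Python. The data shapes, the use of scikit-learn, and the distance used by the filter are illustrative assumptions, not the paper's RVC-CAL implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Toy spatial-spectral chain: PCA -> SVM -> KNN majority-vote filter.
H, W, B = 20, 20, 50
cube = np.random.rand(H, W, B)                 # hyperspectral cube
labels = np.random.randint(0, 3, H * W)        # toy training labels

X = cube.reshape(-1, B)
one_band = PCA(n_components=1).fit_transform(X).ravel()   # stage 1
svm_map = SVC().fit(X, labels).predict(X)                 # stage 2

# stage 3: KNN filter over (spectral value, spatial coordinates)
yy, xx = np.mgrid[0:H, 0:W]
feats = np.column_stack([one_band, yy.ravel(), xx.ravel()])
k = 8
refined = np.empty_like(svm_map)
for p in range(H * W):
    d = np.linalg.norm(feats - feats[p], axis=1)
    knn = np.argsort(d)[1:k + 1]                      # k nearest neighbours
    refined[p] = np.bincount(svm_map[knn]).argmax()   # majority vote
```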
NASA Astrophysics Data System (ADS)
Marvel, K.; Milkey, R.; Murdin, P.
2000-11-01
As the 19th century ended, astronomy underwent a period of rapid growth in the United States, a growth that was fueled by both the expansion of the university system and private philanthropy and which also paralleled the growth in astrophysical research. For the first half of the 20th century, the US government took little interest in the funding of astronomical research, concentrating on those a...
The Affordability of University Education: A Perspective from Both Sides of the 49th Parallel
ERIC Educational Resources Information Center
Swail, Watson Scott
2004-01-01
This study was conducted to better understand the relative affordability of public university education in Canada and the United States. The report was written to answer two key questions: (1) How does access to university education in Canada compare to access in the US? and (2) How affordable is the Canadian university system compared to the…
Molecular Monte Carlo Simulations Using Graphics Processing Units: To Waste Recycle or Not?
Kim, Jihan; Rodgers, Jocelyn M; Athènes, Manuel; Smit, Berend
2011-10-11
In the waste recycling Monte Carlo (WRMC) algorithm, multiple trial states may be simultaneously generated and utilized during Monte Carlo moves to improve the statistical accuracy of the simulations, suggesting that such an algorithm may be well suited for implementation in parallel on graphics processing units (GPUs). In this paper, we implement two waste recycling Monte Carlo algorithms in CUDA (Compute Unified Device Architecture) using uniformly distributed random trial states and trial states based on displacement random-walk steps, and we test the methods on a methane-zeolite MFI framework system to evaluate their utility. We discuss the specific implementation details of the waste recycling GPU algorithm and compare the methods to other parallel algorithms optimized for the framework system. We analyze the relationship between the statistical accuracy of our simulations and the CUDA block size to determine the efficient allocation of the GPU hardware resources. We make comparisons between the GPU and the serial CPU Monte Carlo implementations to assess speedup over conventional microprocessors. Finally, we apply our optimized GPU algorithms to the important problem of determining free energy landscapes, in this case for molecular motion through the zeolite LTA.
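The "waste recycling" idea, letting every generated trial state contribute to ensemble averages rather than only the accepted one, can be illustrated with a deliberately simple 1D toy: an independence-type multi-proposal sampler (i-SIR style) on a harmonic well, used here as a stand-in rather than the paper's CUDA implementation or zeolite system.

```python
import numpy as np

# Toy waste-recycling Monte Carlo: several i.i.d. trial states are drawn from
# a fixed proposal, the current state joins the pool, and every pool member
# contributes to the average through its normalised importance weight, so
# nothing is "wasted". Illustrative only.

beta, n_trials, n_moves, sig = 1.0, 8, 20000, 2.0
rng = np.random.default_rng(0)
log_pi = lambda x: -beta * 0.5 * x**2          # target: harmonic well
log_q = lambda x: -0.5 * (x / sig) ** 2        # proposal (constant dropped)

x, num, den = 0.0, 0.0, 0.0
for _ in range(n_moves):
    pool = np.append(rng.normal(0.0, sig, n_trials), x)
    w = np.exp(log_pi(pool) - log_q(pool))
    w /= w.sum()
    num += (w * pool**2).sum()       # weighted contribution of every trial
    den += 1.0
    x = rng.choice(pool, p=w)        # next state drawn from the pool

print("<x^2> ~", num / den, "(exact:", 1 / beta, ")")
```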
Massive parallelization of serial inference algorithms for a complex generalized linear model
Suchard, Marc A.; Simpson, Shawn E.; Zorych, Ivan; Ryan, Patrick; Madigan, David
2014-01-01
Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we show how high-performance statistical computation, including graphics processing units, relatively inexpensive highly parallel computing devices, can enable complex methods in large databases. We focus on optimization and massive parallelization of cyclic coordinate descent approaches to fit a conditioned generalized linear model involving tens of millions of observations and thousands of predictors in a Bayesian context. We find orders-of-magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in high-dimensional statistics and the algorithms we propose open up exciting new methodological possibilities with the potential to significantly improve drug safety. PMID:25328363
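The coordinate-wise structure that makes cyclic coordinate descent attractive for massive parallelization can be seen in a generic sketch for an L1-penalized least-squares fit (soft-thresholding updates). The paper's conditioned GLM, Bayesian setting, and GPU kernels are richer than this; everything below is illustrative.

```python
import numpy as np

# Cyclic coordinate descent for min 0.5*||y - X b||^2 + lam*||b||_1,
# updating one coefficient at a time while keeping the residual current.

def ccd_lasso(X, y, lam, n_sweeps=100):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    r = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):                # cycle through coordinates
            r += X[:, j] * beta[j]        # remove coordinate j's contribution
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta

X = np.random.randn(200, 50)
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(200)
print(ccd_lasso(X, y, lam=10.0)[:5])
```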
Peker, Musa; Şen, Baha; Gürüler, Hüseyin
2015-02-01
The effect of anesthesia on the patient is referred to as the depth of anesthesia. Rapid classification of the appropriate depth level of anesthesia is a matter of great importance in surgical operations. Similarly, accelerating classification algorithms is important for the rapid solution of problems in the field of biomedical signal processing. However, numerous time-consuming mathematical operations are required during the training and testing stages of classification algorithms, especially in neural networks. In this study, to accelerate the process, Nvidia CUDA, a parallel programming and computing platform that facilitates dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU), was utilized. The system was employed to detect the anesthetic depth level on a related electroencephalogram (EEG) data set, which is rather complex and large. Moreover, achieving rapid classification across more anesthetic levels is critical in anesthesia. The proposed parallelization method yielded highly accurate classification results in a faster time.
Detection system of capillary array electrophoresis microchip based on optical fiber
NASA Astrophysics Data System (ADS)
Yang, Xiaobo; Bai, Haiming; Yan, Weiping
2009-11-01
To meet the demands of post-genomic-era studies and large parallel detections for epidemic diseases and drug screening, high-throughput micro-fluidic detection systems are urgently needed. A scanning laser-induced fluorescence detection system based on optical fiber has been established, using a laser-diode double-pumped solid-state green laser as the excitation source. It includes a laser-induced fluorescence detection subsystem, a capillary array electrophoresis microchip, a channel identification unit, and a fluorescent signal processing subsystem. A V-shaped detection probe composed of two optical fibers, one transmitting the excitation light and the other collecting the induced fluorescence, was constructed. Parallel four-channel capillary electrophoresis signal analysis was performed on this system using Rhodamine B as the sample. The distinction of different samples and the separation of samples were achieved with the constructed detection system. The lowest detected concentration is 1×10⁻⁵ mol/L for Rhodamine B. The results show that the detection system possesses advantages such as a compact structure, better stability, and higher sensitivity, which are beneficial to the miniaturization and integration of capillary array electrophoresis chips.
Simulation framework for intelligent transportation systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ewing, T.; Doss, E.; Hanebutte, U.
1996-10-01
A simulation framework has been developed for a large-scale, comprehensive, scalable simulation of an Intelligent Transportation System (ITS). The simulator is designed for running on parallel computers and distributed (networked) computer systems, but can run on standalone workstations for smaller simulations. The simulator currently models instrumented smart vehicles with in-vehicle navigation units capable of optimal route planning, and Traffic Management Centers (TMC). The TMC has probe-vehicle tracking capabilities (displaying the position and attributes of instrumented vehicles) and can provide two-way interaction with traffic to provide advisories and link times. Both the in-vehicle navigation module and the TMC feature detailed graphical user interfaces to support human-factors studies. Realistic modeling of variations in the posted driving speed is based on human-factors studies that take into consideration weather, road conditions, driver personality and behavior, and vehicle type. The prototype has been developed on a distributed system of networked UNIX computers but is designed to run on parallel computers, such as ANL's IBM SP-2, for large-scale problems. A novel feature of the approach is that vehicles are represented by autonomous computer processes which exchange messages with other processes. The vehicles have a behavior model which governs route selection and driving behavior, and can react to external traffic events much like real vehicles. With this approach, the simulation is scalable to take advantage of emerging massively parallel processor (MPP) systems.
System design of ELITE power processing unit
NASA Astrophysics Data System (ADS)
Caldwell, David J.
The Electric Propulsion Insertion Transfer Experiment (ELITE) is a space mission planned for the mid 1990s in which technological readiness will be demonstrated for electric orbit transfer vehicles (EOTVs). A system-level design of the power processing unit (PPU), which conditions solar array power for the arcjet thruster, was performed to optimize performance with respect to reliability, power output, efficiency, specific mass, and radiation hardness. The PPU system consists of multiphased parallel switchmode converters, configured as current sources, connected directly from the array to the thruster. The PPU control system includes a solar array peak power tracker (PPT) to maximize the power delivered to the thruster regardless of variations in array characteristics. A stability analysis has been performed to verify that the system is stable despite the nonlinear negative impedance of the PPU input and the arcjet thruster. Performance specifications are given to provide the required spacecraft capability with existing technology.
Chantrapromma, Suchada; Chanawanno, Kullapa; Boonnak, Nawong; Fun, Hoong-Kun
2014-01-01
The asymmetric unit of the title salt, C36H32N2 2+·2C6H4ClO3S−, consists of one anion and one half-cation, the other half being generated by inversion symmetry. The dihedral angle between the pyridinium ring and the naphthalene ring system in the asymmetric unit is 42.86 (6)°. In the crystal, cations and anions are linked by weak C—H⋯O interactions into chains along [010]. Adjacent chains are further arranged in an antiparallel manner into sheets parallel to the bc plane. π–π interactions are observed involving the cations, with centroid–centroid distances of 3.7664 (8) and 3.8553 (8) Å. PMID:24860326
Hemani, H; Warrier, M; Sakthivel, N; Chaturvedi, S
2014-05-01
Molecular dynamics (MD) simulations are used in the study of void nucleation and growth in crystals that are subjected to tensile deformation. These simulations are run for typically several hundred thousand time steps, depending on the problem. We output the atom positions at a required frequency for post-processing to determine void nucleation, growth and coalescence due to tensile deformation. The simulation volume is broken up into voxels of size equal to the unit-cell size of the crystal. In this paper, we present the algorithm to identify the empty unit cells (voids), their connections (void size) and dynamic changes (growth and coalescence of voids) for MD simulations of large atomic systems (multi-million atoms). We present the parallel algorithms that were implemented and discuss their relative applicability in terms of speedup and scalability. We also present results on the scalability of our algorithm when it is incorporated into the MD software LAMMPS. Copyright © 2014 Elsevier Inc. All rights reserved.
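A compact serial version of the voxel bookkeeping described above, binning atoms into unit-cell-sized voxels, flagging empty cells, and labelling connected empty regions as voids, can be written with NumPy and SciPy. The parallel decomposition and LAMMPS integration discussed in the paper are not shown, and the sizes are invented.

```python
import numpy as np
from scipy import ndimage

# Bin atoms into voxels, flag empty voxels, and label connected empty
# regions (6-connectivity) as voids; return void count and sizes.

def find_voids(atoms, box, cell):
    nbins = np.floor(box / cell).astype(int)
    idx = np.floor(atoms / cell).astype(int) % nbins     # voxel of each atom
    counts = np.zeros(nbins, dtype=int)
    np.add.at(counts, tuple(idx.T), 1)
    empty = counts == 0
    labels, n_voids = ndimage.label(empty)
    sizes = ndimage.sum(empty, labels, range(1, n_voids + 1))
    return n_voids, sizes

atoms = np.random.rand(50000, 3) * 100.0                 # toy configuration
print(find_voids(atoms, box=np.array([100.0] * 3), cell=4.0))
```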
NASA Astrophysics Data System (ADS)
Das, Aniruddha
2017-11-01
5-amino-1-(phenyl/p-halophenyl)imidazole-4-carboxamides (N-phenyl AICA) (2a-e) and 5-amino-1-(phenyl/p-halophenyl)imidazole-4-carbonitriles (N-phenyl AICN) (3a-e) were synthesized. X-ray crystallographic studies of 2a-e and 3a-e were performed to identify any distinct change in stacking patterns in their crystal lattices. Single-crystal X-ray diffraction studies of 2a-e revealed π-π stack formation with both the imidazole and phenyl/p-halophenyl units in anti and syn parallel-displaced (PD)-type dispositions. No π-π stacking of imidazole occurred when the halogen substituent was bromo or iodo; π-π stacking in these cases involved the phenyl rings only. An additional T-stacking was observed in the crystal lattices of 3a-e. Vertical π-π stacking distances in the anti-parallel PD-type arrangements, as well as the T-stacking distances, were short enough to impart stabilization, whereas the syn-parallel arrangements had much larger π-π stacking distances, ruling out any syn-parallel stacking stabilization. DFT studies were pursued to quantify the π-π stacking and T-stacking stabilization. The plotted curves for the anti-parallel and T-stacked moieties resemble the Morse potential energy curve for a diatomic molecule; the minima of the curves correspond to the most stable stacking distances, and the related energy values indicate the stacking stabilization. Similar DFT studies on the syn-parallel systems of 2b indicated no π-π stacking stabilization at all. Halogen-halogen interactions were also observed to stabilize compounds 2d, 2e and 3d. The nano-structural behaviour of the series of compounds 2a-e and 3a-e was thoroughly investigated.
ERIC Educational Resources Information Center
Stanford Univ., CA. School Mathematics Study Group.
The first chapter of the seventh unit in this SMSG series discusses perpendiculars and parallels; topics covered include the relationship between parallelism and perpendicularity, rectangles, transversals, parallelograms, general triangles, and measurement of the circumference of the earth. The second chapter, on similarity, discusses scale…
Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units
NASA Astrophysics Data System (ADS)
Corral, Roque; Gisbert, Fernando; Pueblas, Jesus
2017-02-01
The implementation of an edge-based three-dimensional Reynolds-averaged Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processing unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passing Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver's parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high-bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with equivalent computational power.
Parallel multiphase microflows: fundamental physics, stabilization methods and applications.
Aota, Arata; Mawatari, Kazuma; Kitamori, Takehiko
2009-09-07
Parallel multiphase microflows, which can integrate unit operations in a microchip under continuous flow conditions, are discussed. Fundamental physics, stabilization methods and some applications are shown.
System life and reliability modeling for helicopter transmissions
NASA Technical Reports Server (NTRS)
Savage, M.; Brikmanis, C. K.
1986-01-01
A computer program which simulates life and reliability of helicopter transmissions is presented. The helicopter transmissions may be composed of spiral bevel gear units and planetary gear units - alone, in series or in parallel. The spiral bevel gear units may have either single or dual input pinions, which are identical. The planetary gear units may be stepped or unstepped and the number of planet gears carried by the planet arm may be varied. The reliability analysis used in the program is based on the Weibull distribution lives of the transmission components. The computer calculates the system lives and dynamic capacities of the transmission components and the transmission. The system life is defined as the life of the component or transmission at an output torque at which the probability of survival is 90 percent. The dynamic capacity of a component or transmission is defined as the output torque which can be applied for one million output shaft cycles for a probability of survival of 90 percent. A complete summary of the life and dynamic capacity results is produced by the program.
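The reliability arithmetic sketched in the abstract reduces to multiplying component Weibull survival probabilities and solving for the life at which the product equals 0.9. A minimal numeric example follows; the shape and scale parameters are invented for illustration and are not the program's data.

```python
import numpy as np
from scipy.optimize import brentq

# System survival is the product of component Weibull survivals; the system
# 90% life is the time at which that product drops to 0.9.

shapes = np.array([1.5, 2.0, 1.2])        # Weibull slopes (illustrative)
scales = np.array([900., 1500., 2500.])   # characteristic lives, hours

def system_survival(t):
    return np.exp(-np.sum((t / scales) ** shapes))

L10_system = brentq(lambda t: system_survival(t) - 0.9, 1e-6, min(scales))
print(f"system 90% life: {L10_system:.1f} h")
```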
Quasi-one-dimensional compressible flow across face seals and narrow slots. 2: Computer program
NASA Technical Reports Server (NTRS)
Zuk, J.; Smith, P. J.
1972-01-01
A computer program is presented for compressible fluid flow with friction across face seals and through narrow slots. The computer program carries out a quasi-one-dimensional flow analysis which is valid for laminar and turbulent flows under both subsonic and choked flow conditions for parallel surfaces. The program is written in FORTRAN IV. The input and output variables are in either the International System of Units (SI) or the U.S. customary system.
Karger, Barry L.; Kotler, Lev; Foret, Frantisek; Minarik, Marek; Kleparnik, Karel
2003-12-09
A modular multiple lane or capillary electrophoresis (chromatography) system that permits automated parallel separation and comprehensive collection of all fractions from samples in all lanes or columns, with the option of further on-line automated sample fraction analysis, is disclosed. Preferably, fractions are collected in a multi-well fraction collection unit, or plate (40). The multi-well collection plate (40) is preferably made of a solvent permeable gel, most preferably a hydrophilic, polymeric gel such as agarose or cross-linked polyacrylamide.
Besozzi, Daniela; Pescini, Dario; Mauri, Giancarlo
2014-01-01
Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently from the others, a massive parallelization of tau-leaping can bring significant reductions in the overall running time. The emerging field of General Purpose Graphics Processing Units (GPGPU) provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, fully exploiting Nvidia's Fermi GPU architecture. We show how a considerable computational speedup is achieved on the GPU by partitioning the execution of tau-leaping into multiple separated phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae. PMID:24663957
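For orientation, the inner update that such a simulator executes in each of its many parallel threads is a single tau-leaping step: every reaction fires a Poisson-distributed number of times over the leap. The toy system, rate constants, and fixed tau below are illustrative; cuTauLeaping's tau-selection and GPU partitioning are not reproduced.

```python
import numpy as np

# One tau-leaping trajectory for a toy system (A -> B, A + B -> C).

rng = np.random.default_rng(1)
V = np.array([[-1,  1,  0],      # state change of A -> B
              [-1, -1,  1]])     # state change of A + B -> C
c = np.array([0.5, 0.001])       # stochastic rate constants

def propensities(x):
    return np.array([c[0] * x[0], c[1] * x[0] * x[1]])

x, tau = np.array([1000, 0, 0]), 0.01
for _ in range(500):
    k = rng.poisson(propensities(x) * tau)   # firings of each reaction
    x = np.maximum(x + k @ V, 0)             # crude guard against negatives;
                                             # proper tau selection is omitted
print(x)
```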
Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas
2015-01-01
Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
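The statistical core named above, mutual information plus permutation testing, can be written compactly for a single gene pair. The sketch below uses a simple 2D-histogram estimator as a stand-in for TINGe's estimator and shows none of the Xeon Phi threading or vectorization.

```python
import numpy as np

# Mutual information between two expression profiles via a 2D histogram,
# with a permutation test for significance; bin count and permutation
# budget are illustrative.

def mutual_info(x, y, bins=10):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))

def perm_pvalue(x, y, n_perm=1000, rng=np.random.default_rng(3)):
    obs = mutual_info(x, y)
    null = [mutual_info(rng.permutation(x), y) for _ in range(n_perm)]
    return (1 + sum(m >= obs for m in null)) / (n_perm + 1)

x = np.random.randn(300)
y = 0.5 * x + np.random.randn(300)
print(perm_pvalue(x, y))
```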
Mechanical catalysis on the centimetre scale
Miyashita, Shuhei; Audretsch, Christof; Nagy, Zoltán; Füchslin, Rudolf M.; Pfeifer, Rolf
2015-01-01
Enzymes play important roles in catalysing biochemical transaction paths, acting as logical machines through the morphology of the processes. A key challenge in elucidating the nature of these systems, and for engineering manufacturing methods inspired by biochemical reactions, is to attain a comprehensive understanding of the stereochemical ground rules of enzymatic reactions. Here, we present a model of catalysis that can be performed magnetically by centimetre-sized passive floating units. The designed system, which is equipped with permanent magnets only, passively obeys the local causalities imposed by magnetic interactions, albeit it shows a spatial behaviour and an energy profile analogous to those of biochemical enzymes. In this process, the enzyme units trigger physical conformation changes of the target by levelling out the magnetic potential barrier (activation potential) to a funnel type and, thus, induce cascading conformation changes of the targeted substrate units reacting in parallel. The inhibitor units, conversely, suppress such changes by increasing the potential. Because the model is purely mechanical and established on a physics basis in the absence of turbulence, each performance can be explained by the morphology of the unit, extending the definition of catalysis to systems of alternative scales. PMID:25652461
Study on observation planning of LAMOST focal plane positioning system and its simulation
NASA Astrophysics Data System (ADS)
Zhai, Chao; Jin, Yi; Peng, Xiaobo; Xing, Xiaozheng
2006-06-01
The fiber positioning system of the LAMOST focal plane is based on a subarea scheme and adopts a parallel, controllable positioning plan: each positioning unit covers a circular area, and adjacent areas overlap in order to eliminate unobservable regions. This, however, makes the observation efficiency of the system an important problem. In this paper, a model of LAMOST focal-plane observation planning comprising 4000 fiber positioning units is built. Stars are allocated using a netflow algorithm, mechanical collisions are eliminated through a retreat algorithm, and the observation efficiency of the system is then simulated. The observation efficiency of the LAMOST focal plane is analyzed systematically and comprehensively with respect to the overlapped regions, the fiber positioning units, the observation radius, collisions, and so on. The theoretical observation efficiency of the system is described, and the simulation indicates that the system's observation efficiency is acceptable. The analysis plays an indicative role in the design of the LAMOST focal-plane structure.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1980-09-01
A number of investigations, including those conducted by The Aerospace Corporation and other contractors, have led to the recognition of technical, economic, and institutional issues relating to the interface between solar electric technologies and electric utility systems. These issues derive from three attributes of solar electric power concepts: (1) the variability and unpredictability of the solar resources, (2) the dispersed nature of those resources, which suggests the feasible deployment of small dispersed power units, and (3) a high initial capital cost coupled with relatively low operating costs. It is imperative that these integration issues be pursued in parallel with the development of each technology if the nation's electric utility systems are to effectively utilize these technologies in the near to intermediate term. Analyses of three of these issues are presented: utility information requirements, generation mix and production cost impacts, and rate structures in the context of photovoltaic units integrated into the utility system. (WHK)
Compact, high energy gas laser
Rockwood, Stephen D.; Stapleton, Robert E.; Stratton, Thomas F.
1976-08-03
An electrically pumped gas laser amplifier unit having a disc-like configuration in which light propagation is radially outward from the axis rather than along the axis. The input optical energy is distributed over a much smaller area than the output optical energy, i.e., the amplified beam, while still preserving the simplicity of parallel electrodes for pumping the laser medium. The system may thus be driven by a comparatively low optical energy input, while at the same time, owing to the large output area, large energies may be extracted while maintaining the energy per unit area below the threshold of gas breakdown.
Massively parallel simulator of optical coherence tomography of inhomogeneous turbid media.
Malektaji, Siavash; Lima, Ivan T; Escobar I, Mauricio R; Sherif, Sherif S
2017-10-01
An accurate and practical simulator for Optical Coherence Tomography (OCT) could be an important tool to study the underlying physical phenomena in OCT such as multiple light scattering. Recently, many researchers have investigated simulation of OCT of turbid media, e.g., tissue, using Monte Carlo methods. The main drawback of these earlier simulators is the long computational time required to produce accurate results. We developed a massively parallel simulator of OCT of inhomogeneous turbid media that obtains both Class I diffusive reflectivity, due to ballistic and quasi-ballistic scattered photons, and Class II diffusive reflectivity due to multiply scattered photons. This Monte Carlo-based simulator is implemented on graphics processing units (GPUs), using the Compute Unified Device Architecture (CUDA) platform and programming model, to exploit the parallel nature of propagation of photons in tissue. It models an arbitrarily shaped sample medium as a tetrahedron-based mesh and uses an advanced importance sampling scheme. This new simulator speeds up simulations of OCT of inhomogeneous turbid media by about two orders of magnitude. To demonstrate this result, we have compared the computation times of our new parallel simulator and its serial counterpart using two samples of inhomogeneous turbid media. We have shown that our parallel implementation reduced simulation time of OCT of the first sample medium from 407 min to 92 min by using a single GPU card, to 12 min by using 8 GPU cards and to 7 min by using 16 GPU cards. For the second sample medium, the OCT simulation time was reduced from 209 h to 35.6 h by using a single GPU card, and to 4.65 h by using 8 GPU cards, and to only 2 h by using 16 GPU cards. Therefore our new parallel simulator is considerably more practical to use than its central processing unit (CPU)-based counterpart. Our new parallel OCT simulator could be a practical tool to study the different physical phenomena underlying OCT, or to design OCT systems with improved performance. Copyright © 2017 Elsevier B.V. All rights reserved.
Real-time digital holographic microscopy using the graphic processing unit.
Shimobaba, Tomoyoshi; Sato, Yoshikuni; Miura, Junya; Takenouchi, Mai; Ito, Tomoyoshi
2008-08-04
Digital holographic microscopy (DHM) is a well-known powerful method allowing both the amplitude and phase of a specimen to be simultaneously observed. In order to obtain a reconstructed image from a hologram, numerous calculations for the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphics processing unit (GPU) with many stream processors, which allows use as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms of 512 x 512 grid points at 24 frames per second.
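The Fresnel step that dominates the computation can be written as an FFT-based transfer-function propagation, which is the operation the GPU accelerates. A NumPy sketch with illustrative wavelength, pixel pitch, and distance:

```python
import numpy as np

# Fresnel propagation by the FFT transfer-function method: forward FFT,
# multiply by the Fresnel transfer function, inverse FFT.

def fresnel_fft(u0, wavelength, z, dx):
    n = u0.shape[0]
    fx = np.fft.fftfreq(n, d=dx)
    FX, FY = np.meshgrid(fx, fx)
    H = np.exp(-1j * np.pi * wavelength * z * (FX**2 + FY**2))
    return np.fft.ifft2(np.fft.fft2(u0) * H)

u0 = np.zeros((512, 512), dtype=complex)
u0[224:288, 224:288] = 1.0                      # toy square "hologram"
u1 = fresnel_fft(u0, wavelength=633e-9, z=0.05, dx=10e-6)
intensity = np.abs(u1) ** 2                     # reconstructed image
```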
ERIC Educational Resources Information Center
Block, Alan A.
This book explores the construction of the idea of the child as a product of adult needs and school as a place where children may be confined until they are considered socially useful. Drawing parallels with folk singer Bob Dylan's song, "It's All Right, Ma, I'm Only Bleeding," the book argues that the United States' educational system practices a…
Learning in Neural Networks: VLSI Implementation Strategies
NASA Technical Reports Server (NTRS)
Duong, Tuan Anh
1995-01-01
Fully-parallel hardware neural network implementations may be applied to high-speed recognition, classification, and mapping tasks in areas such as vision, or can be used as low-cost self-contained units for tasks such as error detection in mechanical systems (e.g. autos). Learning is required not only to satisfy application requirements, but also to overcome hardware-imposed limitations such as reduced dynamic range of connections.
NASA Astrophysics Data System (ADS)
Mackowski, Daniel; Ramezanpour, Bahareh
2018-07-01
A formulation is developed for numerically solving the frequency domain Maxwell's equations in plane parallel layers of inhomogeneous media. As was done in a recent work [1], the plane parallel layer is modeled as an infinite square lattice of W × W × H unit cells, with W being a sample width of the layer and H the layer thickness. As opposed to the 3D volume integral/discrete dipole formulation, the derivation begins with a Fourier expansion of the electric field amplitude in the lateral plane, and leads to a coupled system of 1D ordinary differential equations in the depth direction of the layer. A 1D dyadic Green's function is derived for this system and used to construct a set of coupled 1D integral equations for the field expansion coefficients. The resulting mathematical formulation is considerably simpler and more compact than that derived, for the same system, using the discrete dipole approximation applied to the periodic plane lattice. Furthermore, the fundamental property variable appearing in the formulation is the Fourier transformed complex permittivity distribution in the unit cell, and the method obviates any need to define or calculate a dipole polarizability. Although designed primarily for random media calculations, the method is also capable of predicting the single scattering properties of individual particles; comparisons are presented to demonstrate that the method can accurately reproduce, at scattering angles not too close to 90°, the polarimetric scattering properties of single and multiple spheres. The derivation of the dyadic Green's function allows for an analytical preconditioning of the equations, and it is shown that this can result in significantly accelerated solution times when applied to densely-packed systems of particles. Calculation results demonstrate that the method, when applied to inhomogeneous media, can predict coherent backscattering and polarization opposition effects.
Pathogenic Policy: Immigrant Policing, Fear, and Parallel Medical Systems in the US South.
Kline, Nolan
2017-01-01
Medical anthropology has a vital role in identifying health-related impacts of policy. In the United States, increasingly harsh immigration policies have formed a multilayered immigrant policing regime comprising state and federal laws and local police practices, the effects of which demand ethnographic attention. In this article, I draw from ethnographic fieldwork in Atlanta, Georgia, to examine the biopolitics of immigrant policing. I underscore how immigrant policing directly impacts undocumented immigrants' health by producing a type of fear-based governance that alters immigrants' health behaviors and sites for seeking health services. Ethnographic data further point to how immigrant policing sustains a need for an unequal, parallel medical system, reflecting broader social inequalities impacting vulnerable populations. Moreover, by focusing on immigrant policing, I demonstrate the analytical utility of examining the biopolitics of fear, which can reveal individual experiences and structural influences on health-related vulnerability.
Torak, Lynn J.; Painter, Jaime A.
2006-01-01
The lower Apalachicola-Chattahoochee-Flint (ACF) River Basin contains about 4,600 square miles of karstic and fluvial plains and nearly 100,000 cubic miles of predominantly karst limestone connected hydraulically to the principal rivers and lakes in the Coastal Plain of southwestern Georgia, northwestern Florida, and southwestern Alabama. Sediments of late-middle Eocene to Holocene in hydraulic connection with lakes, streams, and land surface comprise the surficial aquifer system, upper semiconfining unit, Upper Floridan aquifer, and lower semiconfining unit and contribute to the exchange of ground water and surface water in the stream-lake-aquifer flow system. Karst processes, hydraulic properties, and stratigraphic relations limit ground-water and surface-water interaction to the following hydrologic units of the stream-lake-aquifer flow system: the surficial aquifer system, upper semiconfining unit, Upper Floridan aquifer, and lower confining unit. Geologic units corresponding to these hydrologic units are, in ascending order: Lisbon Formation; Clinchfield Sand; Ocala, Marianna, Suwannee, and Tampa Limestones; Hawthorn Group; undifferentiated overburden (residuum); and terrace and undifferentiated (surficial) deposits. Similarities in hydraulic properties and direct or indirect interaction with surface water allow grouping sediments within these geologic units into the aforementioned hydrologic units, which transcend time-stratigraphic classifications and define the geohydrologic framework for the lower ACF River Basin. The low water-transmitting properties of the lower confining unit, principally the Lisbon Formation, allow it to act as a nearly impermeable base to the stream-lake-aquifer flow system. Hydraulic connection of the surficial aquifer system with surface water and the Upper Floridan aquifer is direct where sandy deposits overlie the limestone, or indirect where fluvial deposits overlie clayey limestone residuum. The water level in perched zones within the surficial aquifer system fluctuates independently of water-level changes in the underlying aquifer, adjacent streams, or lakes. Where the surficial aquifer system is connected with surface water and the Upper Floridan aquifer, water-table fluctuations parallel those in adjacent streams or the underlying aquifer.
Robertson, Benjamin D; Vadakkeveedu, Siddarth; Sawicki, Gregory S
2017-05-24
We present a novel biorobotic framework composed of a biological muscle-tendon unit (MTU) mechanically coupled to a feedback-controlled robotic environment simulation that mimics in vivo inertial/gravitational loading and mechanical assistance from a parallel elastic exoskeleton. Using this system, we applied select combinations of biological muscle activation (modulated with rate-coded direct neural stimulation) and parallel elastic assistance (applied via closed-loop mechanical environment simulation) hypothesized to mimic human behavior based on previously published modeling studies. These conditions resulted in constant system-level force-length dynamics (i.e., stiffness), reduced biological loads, increased muscle excursion, and constant muscle average positive power output, all consistent with laboratory experiments on intact humans during exoskeleton-assisted hopping. Mechanical assistance led to reduced estimated metabolic cost and MTU apparent efficiency, but increased apparent efficiency for the MTU+Exo system as a whole. Findings from this study suggest that the increased natural resonant frequency of the artificially stiffened MTU+Exo system, along with invariant movement frequencies, may underlie observed limits on the benefits of exoskeleton assistance. Our novel approach demonstrates that it is possible to capture the salient features of human locomotion with exoskeleton assistance in an isolated muscle-tendon preparation, and introduces a powerful new tool for detailed, direct examination of how assistive devices affect muscle-level neuromechanics and energetics. Copyright © 2017 Elsevier Ltd. All rights reserved.
German-Korean cooperation for erection and test of industrialized solar technologies
NASA Astrophysics Data System (ADS)
Pfeiffer, H.
1986-01-01
A combined small solar-wind power station and a solar-thermal experimental plant were built. The plants are designed to demonstrate the effective exploitation of solar energy and wind energy and enhanced availability achievable through combination of these two energy sources. A 14 kW wind energy converter and a 2.5 kW solar-cell generator were operated in parallel. The biaxial tracking system used on the solar generator leads to increased and constant generation of electricity throughout the day. A consumer control system switches the energy generators and the consumers in autonomous mode according to changing supply and demand. The solar powered air conditioning unit operates with an absorption type refrigerating unit, high-output flat collectors and an automatic control system. All design values are achieved on start-up of the plant.
Lossless data compression for improving the performance of a GPU-based beamformer.
Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
2015-04-01
The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming. However, a real-time GPU-based beamformer requires a high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. Data compression methods (e.g., Joint Photographic Experts Group (JPEG)) are available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of the original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm that enables compression and decompression of data in parallel. By this means, the data transfer requirement of the hardware interface and the transmission time of CPU-to-GPU data transfers are reduced without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field-programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved threefold, as compared with transferring the original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigates the transmission bandwidth requirement for transferring data from the hardware front end to the software system but also reduces the transmission time for CPU-to-GPU data transfer. © The Author(s) 2014.
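The architectural point, compressing data in independent chunks so the receiving side can decode them concurrently, can be illustrated with stock zlib standing in for the paper's hardware-friendly encoder. Chunk sizes and worker counts below are arbitrary.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Frames are compressed per chunk, so chunks can be decompressed in parallel
# on the receiving side (zlib releases the GIL on large buffers, so threads
# can help here; this is an illustration, not the paper's codec).

def compress_frames(frames):
    return [zlib.compress(f) for f in frames]      # independent chunks

def decompress_parallel(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, chunks))

frames = [bytes(1024 * 64) for _ in range(32)]     # toy RF data frames
chunks = compress_frames(frames)
restored = decompress_parallel(chunks)
assert restored == frames                          # lossless round trip
```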
Efficient Scalable Median Filtering Using Histogram-Based Operations.
Green, Oded
2018-05-01
Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted values. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow, which makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is not sorting-based. The new algorithm uses efficient histogram-based operations, which reduce the computational requirements of the algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA-supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near-perfect linear scaling on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, comparison-based approaches are preferable, as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
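The histogram trick can be seen in a serial sketch of Huang-style median filtering: sliding the window one column updates a 256-bin histogram instead of re-sorting the window, so the per-pixel cost no longer depends on sorting the window contents. This illustrates the general technique only, not the paper's or OpenCV's implementation.

```python
import numpy as np

# Histogram-based k x k median filter for uint8 images: one column of the
# window leaves and one enters as the window slides, so the histogram is
# updated incrementally rather than rebuilt.

def median_filter_hist(img, k):
    r = k // 2
    pad = np.pad(img, r, mode='edge').astype(np.uint8)
    H, W = img.shape
    out = np.empty_like(img)
    half = (k * k) // 2 + 1                   # rank of the median
    for y in range(H):
        hist = np.zeros(256, dtype=int)
        np.add.at(hist, pad[y:y + k, 0:k].ravel(), 1)  # first window
        out[y, 0] = np.searchsorted(np.cumsum(hist), half)
        for x in range(1, W):
            np.add.at(hist, pad[y:y + k, x - 1], -1)      # column leaves
            np.add.at(hist, pad[y:y + k, x + k - 1], 1)   # column enters
            out[y, x] = np.searchsorted(np.cumsum(hist), half)
    return out

img = (np.random.rand(64, 64) * 255).astype(np.uint8)
smoothed = median_filter_hist(img, k=5)
```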
A Verification System for Distributed Objects with Asynchronous Method Calls
NASA Astrophysics Data System (ADS)
Ahrendt, Wolfgang; Dylla, Maximilian
We present a verification system for Creol, an object-oriented modeling language for concurrent distributed applications. The system is an instance of KeY, a framework for object-oriented software verification, which has so far been applied foremost to sequential Java. Building on KeY characteristic concepts, like dynamic logic, sequent calculus, explicit substitutions, and the taclet rule language, the system presented in this paper addresses functional correctness of Creol models featuring local cooperative thread parallelism and global communication via asynchronous method calls. The calculus heavily operates on communication histories which describe the interfaces of Creol units. Two example scenarios demonstrate the usage of the system.
Massively parallel processor computer
NASA Technical Reports Server (NTRS)
Fung, L. W. (Inventor)
1983-01-01
An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.
Ruland, C M; Ravn, I H
2001-01-01
An important strategy for improving resource management and cost containment in health care is to develop information systems that assist hospital managers in financial management, resource allocation, and activity planning. A crucial part of such development is a rigorous evaluation to assess whether the system accomplishes its intended goals. To evaluate CLASSICA, a decision support system (DSS) that assists nurse managers in financial management, resource allocation, staffing, and activity planning. Using a pre-post test design with control units, CLASSICA was evaluated in four test units. Baseline data and simultaneous parallel measures were collected prior to system implementation at test sites and control units. Using expense reports, staffing and financial statistics, surveys, interviews with nurse managers, and logs as data sources, CLASSICA was evaluated on: cost reduction; quality of management information; usefulness as decision support for improved financial management and decision-making; user satisfaction; and ease of use. Evaluation results showed a 41% reduction in expenditures for overtime and extra hours, compared to a 1.8% reduction in control units during the same time period. Users reported a significant improvement in management information; nurse managers stated that they had gained control over costs. The system helped them analyze the relationships between patient activity, staffing, and cost of care. Users reported high satisfaction with the system, the information and decision support it provided, and its ease of use. These results suggest that CLASSICA is a DSS that successfully assists nurse managers in cost-effective management of their units.
Graphics processing units in bioinformatics, computational biology and systems biology.
Nobile, Marco S; Cazzaniga, Paolo; Tangherloni, Andrea; Besozzi, Daniela
2017-09-01
Several studies in Bioinformatics, Computational Biology and Systems Biology rely on the definition of physico-chemical or mathematical models of biological systems at different scales and levels of complexity, ranging from the interaction of atoms in single molecules up to genome-wide interaction networks. Traditional computational methods and software tools developed in these research fields share a common trait: they can be computationally demanding on Central Processing Units (CPUs), therefore limiting their applicability in many circumstances. To overcome this issue, general-purpose Graphics Processing Units (GPUs) are gaining increasing attention from the scientific community, as they can considerably reduce the running time required by standard CPU-based software, and allow more intensive investigations of biological systems. In this review, we present a collection of GPU tools recently developed to perform computational analyses in life science disciplines, emphasizing the advantages and the drawbacks in the use of these parallel architectures. The complete list of GPU-powered tools here reviewed is available at http://bit.ly/gputools. © The Author 2016. Published by Oxford University Press.
Fischer-Baum, Simon; Englebretson, Robert
2016-08-01
Reading relies on the recognition of units larger than single letters and smaller than whole words. Previous research has linked sublexical structures in reading to properties of the visual system, specifically to the parallel processing of letters that the visual system enables. But whether the visual system is essential for this, or whether the recognition of sublexical structures may emerge by other means, is an open question. To address this question, we investigate braille, a writing system that relies exclusively on the tactile rather than the visual modality. We provide experimental evidence demonstrating that adult readers of (English) braille are sensitive to sublexical units. Contrary to prior assumptions in the braille research literature, we find strong evidence that braille readers do indeed access sublexical structure, namely the processing of multi-cell contractions as single orthographic units and the recognition of morphemes within morphologically complex words. Therefore, we conclude that the recognition of sublexical structure is not exclusively tied to the visual system. However, our findings also suggest that there are aspects of morphological processing on which braille and print readers differ, and that these differences may, crucially, be related to reading using the tactile rather than the visual sensory modality. Copyright © 2016 Elsevier B.V. All rights reserved.
OpenMP parallelization of a gridded SWAT (SWATG)
NASA Astrophysics Data System (ADS)
Zhang, Ying; Hou, Jinliang; Cao, Yongpan; Gu, Juan; Huang, Chunlin
2017-12-01
Large-scale, long-term, high-spatial-resolution simulation is a common challenge in environmental modeling. A gridded Hydrologic Response Unit (HRU)-based Soil and Water Assessment Tool (SWATG), which integrates a grid modeling scheme with different spatial representations, faces the same problem: the computational cost limits applications of very high resolution, large-scale watershed modeling. The OpenMP (Open Multi-Processing) parallel application interface is integrated with SWATG (called SWATGP) to accelerate grid modeling at the HRU level. Such a parallel implementation takes better advantage of the computational power of a shared-memory computer system. We conducted two experiments at multiple temporal and spatial scales of hydrological modeling using SWATG and SWATGP on a high-end server. At 500-m resolution, SWATGP was found to be up to nine times faster than SWATG in modeling over a roughly 2000 km2 watershed with one CPU and a 15-thread configuration. The study results demonstrate that parallel models save considerable time relative to traditional sequential simulation runs. Parallel computation of environmental models is beneficial for model applications, especially at large spatial and temporal scales and at high resolutions. The proposed SWATGP model is thus a promising tool for large-scale and high-resolution water resources research and management, in addition to offering data fusion and model coupling ability.
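A minimal sketch of the HRU-level parallelism that SWATGP exploits with OpenMP, transposed here to Python workers: within one time step, the water balance of each gridded HRU is independent, so the HRU loop can be split across threads. simulate_hru() and its state fields are hypothetical stand-ins, not SWAT code.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_hru(state):
    # placeholder daily water-balance update for one gridded HRU
    return {**state, "soil_water": 0.95 * state["soil_water"] + state["rain"]}

hru_states = [{"soil_water": 100.0, "rain": 2.0} for _ in range(10_000)]

with ThreadPoolExecutor() as pool:  # analogue of an OpenMP parallel-for over HRUs
    hru_states = list(pool.map(simulate_hru, hru_states))
```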
Liwo, Adam; Ołdziej, Stanisław; Czaplewski, Cezary; Kleinerman, Dana S.; Blood, Philip; Scheraga, Harold A.
2010-01-01
We report the implementation of our united-residue UNRES force field for simulations of protein structure and dynamics on massively parallel architectures. In addition to the coarse-grained parallelism already implemented in our previous work, in which each conformation was treated by a different task, we introduce a fine-grained level in which energy and gradient evaluation are split between several tasks. The Message Passing Interface (MPI) libraries have been utilized to construct the parallel code. The parallel performance of the code has been tested on a professional Beowulf cluster (Xeon Quad Core), a Cray XT3 supercomputer, and two IBM BlueGene/P supercomputers with canonical and replica-exchange molecular dynamics. With IBM BlueGene/P, about 50% efficiency and a 120-fold speed-up of the fine-grained part was achieved for a single trajectory of a 767-residue protein using 256 processors per trajectory. Because of averaging over the fast degrees of freedom, UNRES provides an effective 1000-fold speed-up compared to the experimental time scale and therefore enables us to carry out millisecond-scale simulations of proteins with 500 or more amino-acid residues in days of wall-clock time. PMID:20305729
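The fine-grained level can be sketched with mpi4py as follows: the pairwise energy sum for one conformation is split round-robin across the ranks assigned to a trajectory and combined with an allreduce. The toy Lennard-Jones term and pair list are illustrative; they are not the UNRES force field.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_res = 767
rng = np.random.default_rng(0)                 # same coordinates on every rank
coords = rng.random((n_res, 3))                # stand-in coarse-grained sites
pairs = [(i, j) for i in range(n_res) for j in range(i + 1, n_res)]

local_e = 0.0
for i, j in pairs[rank::size]:                 # round-robin split of energy terms
    r = np.linalg.norm(coords[i] - coords[j])
    local_e += 4.0 * (r**-12 - r**-6)          # toy pair term, not UNRES

total_e = comm.allreduce(local_e, op=MPI.SUM)  # every rank sees the full energy
if rank == 0:
    print(total_e)
```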
Advanced autostereoscopic display for G-7 pilot project
NASA Astrophysics Data System (ADS)
Hattori, Tomohiko; Ishigaki, Takeo; Shimamoto, Kazuhiro; Sawaki, Akiko; Ishiguchi, Tsuneo; Kobayashi, Hiromi
1999-05-01
An advanced autostereoscopic display is described that permits the observation of a stereo pair by several persons simultaneously, without the use of special glasses or any head-tracking device worn by the viewers. The system is composed of a right-eye system, a left-eye system, and a sophisticated head-tracking system. In each eye system, a transparent-type color liquid crystal imaging plate is used with a special backlight unit. The backlight unit consists of a monochrome 2D display and a large-format convex lens, and directs light only to the correct eye of each viewer. The right-eye perspective system is combined with the left-eye perspective system by a half mirror in order to function as a time-parallel stereoscopic system. The viewer's IR image is taken through and focused by the large-format convex lens and fed back to the backlight as a modulated binary half-face image. The autostereoscopic display employs the TTL method for accurate head tracking. The system was operated as a stereoscopic TV phone between the telemedicine department of Duke University and the radiology department of Nagoya University School of Medicine, using a high-speed digital line of the GIBN. Applications are also described in this paper.
Speech recognition systems on the Cell Broadband Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Y; Jones, H; Vaidya, S
In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.
Forward Period Analysis Method of the Periodic Hamiltonian System.
Wang, Pengfei
2016-01-01
Using forward period analysis (FPA), we obtain the period of a Morse oscillator and of the mathematical pendulum system to an accuracy of 100 significant digits. From these results, long-term solutions over [0, 10^60] time units, ranging from the Planck time to the age of the universe, are computed reliably and quickly with a parallel multiple-precision Taylor series (PMT) scheme. The application of FPA to periodic systems can greatly reduce the computation time of long-term reliable simulations. This scheme provides an efficient way to generate reference solutions, against which long-term simulations using other schemes can be tested.
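For a sense of the precision involved, the closed-form period of the mathematical pendulum can be evaluated to 100 significant digits with mpmath; this is a sketch of the target accuracy only, since FPA obtains the period from the forward numerical integration itself, and the pendulum parameters below are arbitrary choices.

```python
from mpmath import mp, mpf, ellipk, sin, sqrt, pi

mp.dps = 100                                   # 100 significant decimal digits

g, L, theta0 = mpf('9.80665'), mpf(1), pi / 3  # amplitude of 60 degrees
k2 = sin(theta0 / 2) ** 2                      # elliptic parameter m = k^2
T = 4 * sqrt(L / g) * ellipk(k2)               # exact period via elliptic K(m)

print(T)
```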
High performance, low cost, self-contained, multipurpose PC based ground systems
NASA Technical Reports Server (NTRS)
Forman, Michael; Nickum, William; Troendly, Gregory
1993-01-01
The use of embedded processors greatly enhances the capabilities of personal computers when used for telemetry processing and command control center functions. Parallel architectures based on the use of transputers are shown to be very versatile and reusable, and the synergism between the PC and the embedded processor with transputers results in single-unit, low-cost workstations of 20 < MIPS <= 1000.
Parallel processing and expert systems
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Lau, Sonie
1991-01-01
Whether it be monitoring the thermal subsystem of Space Station Freedom or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert systems. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited.
Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.
2014-01-01
Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868
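A minimal sketch of the work assignment described above: the raster of A-scans is split into contiguous batches, one per GPU, each handled by its own worker. process_batch() is a hypothetical stand-in for the GD-OCM pipeline and runs a CPU FFT here; on a real system it would bind to its GPU and launch CUDA kernels, with the batch size tuned to GPU memory and core throughput.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

N_GPUS = 4
ascans = np.random.rand(8192, 1024).astype(np.float32)  # stand-in A-scan raster

def process_batch(gpu_id, batch):
    # real version: select device gpu_id, copy batch, run the GPU pipeline
    return np.abs(np.fft.rfft(batch, axis=1))

batches = np.array_split(ascans, N_GPUS)
with ThreadPoolExecutor(max_workers=N_GPUS) as pool:     # one worker per GPU
    depth_profiles = list(pool.map(process_batch, range(N_GPUS), batches))
```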
Traditional Tracking with Kalman Filter on Parallel Architectures
NASA Astrophysics Data System (ADS)
Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; MacNeill, Ian; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
2015-05-01
Power density constraints are limiting the performance improvements of modern CPUs. To address this, we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The most common track finding techniques in use today are however those based on the Kalman Filter. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. We report the results of our investigations into the potential and limitations of these algorithms on the new parallel hardware.
High performance data transfer
NASA Astrophysics Data System (ADS)
Cottrell, R.; Fang, C.; Hanushevsky, A.; Kreuger, W.; Yang, W.
2017-10-01
The exponentially increasing need for high-speed data transfer is driven by big data and cloud computing, together with the needs of data-intensive science, High Performance Computing (HPC), defense, the oil and gas industry, etc. We report on the Zettar ZX software. This has been developed since 2013 to meet these growing needs by providing high-performance data transfer and encryption in a scalable, balanced, easy-to-deploy-and-use way while minimizing power and space utilization. In collaboration with several commercial vendors, Proofs of Concept (PoC) consisting of clusters have been put together using off-the-shelf components to test the ZX scalability and its ability to balance services using multiple cores and links. The PoCs are based on SSD flash storage that is managed by a parallel file system. Each cluster occupies 4 rack units. Using the PoCs, we have achieved almost 200 Gbps memory-to-memory between clusters over two 100 Gbps links, and 70 Gbps parallel file to parallel file with encryption over a 5000-mile 100 Gbps link.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thompson, Kelly; Budge, Kent; Lowrie, Rob
2016-03-03
Draco is an object-oriented component library geared towards numerically intensive, radiation (particle) transport applications built for parallel computing hardware. It consists of semi-independent packages and a robust build system. The packages in Draco provide a set of components that can be used by multiple clients to build transport codes. The build system can also be extracted for use in clients. Software includes smart pointers, Design-by-Contract assertions, unit test framework, wrapped MPI functions, a file parser, unstructured mesh data structures, a random number generator, root finders and an angular quadrature component.
Parallel Computation of Ocean-Atmosphere-Wave Coupled Storm Surge Model
NASA Astrophysics Data System (ADS)
Kim, K.; Yamashita, T.
2003-12-01
Ocean-atmosphere interactions are very important in the formation and development of tropical storms. These interactions dominate the exchange of heat, momentum, and moisture fluxes. Heat flux is usually computed using a bulk equation, in which the air-sea interface supplies heat energy to the atmosphere and to the storm. Dynamical interaction is most often one-way: it is the atmosphere that drives the ocean. The winds transfer momentum to both ocean surface waves and the ocean current, and wind waves play an important role in the exchange of momentum, heat and matter between the atmosphere and the ocean. Storm surges can be considered as phenomena of mean sea-level change: the frictional stresses of strong winds blowing toward the land cause a set-up of the sea level, and the low atmospheric pressure at the centre of the cyclone can additionally raise the sea level. In addition to this rise in water level, another wave factor must be considered: a rise of mean sea level due to white-cap wave dissipation. In bounded bodies of water, such as small seas, wind-driven sea-level set-up is much more serious than the inverted-barometer effect, and the effects of wind waves on the wind-driven current play an important role. It is therefore necessary to develop a coupled system of a full spectral third-generation wind-wave model (WAM or WAVEWATCH III), a meso-scale atmosphere model (MM5) and a coastal ocean model (POM) for simulating these physical interactions. Because the components of the coupled system are too computationally heavy for personal use, a parallel computing system had to be developed. In this study, we first developed the coupling system of the atmosphere model, the ocean wave model and the coastal ocean model on a Beowulf system for the simulation of storm surge, and applied it to the storm surge caused by Typhoon Bart (T9918) in the Yatsushiro Sea. The atmosphere model and the ocean model were parallelized using SPMD methods. The wave-current interface model was developed by defining the wave breaking stresses, and a coupler program was developed to collect and distribute the exchanged data within the parallel system. All models and the coupler execute at the same time, each calculating its own job and passing data in an organized manner. MPMD programming was used to couple the models: the coupler and each model form separate groups, calculation proceeds per group, and messages are passed globally when exchanging data. The data are exchanged every 60 seconds of model time, which is the least common multiple of the time steps of the atmosphere model, the wave model and the ocean model. The model was applied to the storm surge simulation in the Yatsushiro Sea, where the observed maximum surge height could not be reproduced by a numerical model that did not include the wave breaking stress. It is confirmed that the simulation including wave breaking stress effects can reproduce the observed maximum height, 450 cm, at Matsuai.
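The coupling schedule can be sketched as below: each component advances with its own time step, and fields are exchanged whenever the model clock reaches a multiple of the least common multiple of the steps (60 s in this study). The individual step sizes shown are hypothetical; only the 60-second exchange interval comes from the abstract.

```python
import math

steps = {"atmosphere": 20, "wave": 30, "ocean": 12}  # seconds, hypothetical
couple_every = math.lcm(*steps.values())             # -> 60 s

t, t_end = 0, 300
while t < t_end:
    t += couple_every
    # each component advances independently to time t, then exchanges fields
    for name, dt in steps.items():
        n_sub = couple_every // dt
        print(f"t={t:4d}s  {name}: {n_sub} substeps of {dt}s")
    print(f"t={t:4d}s  coupler: exchange fluxes/stresses among components")
```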
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80
NASA Astrophysics Data System (ADS)
Kamat, Manohar P.; Watson, Brian C.
1992-02-01
The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern-day computers with a single processing unit was estimated at 3 billion floating-point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization, as on the Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special-purpose code, named with the acronym SAPNEW, performs static and eigenvalue analysis of multi-degree-of-freedom blade models built up from flat thin shell elements.
Parallel approach in RDF query processing
NASA Astrophysics Data System (ADS)
Vajgl, Marek; Parenica, Jan
2017-07-01
The parallel approach is nowadays a very cheap way to increase computational power, owing to the availability of multithreaded computational units. Such hardware has become a standard part of today's personal computers and notebooks and is widespread. This contribution presents experiments on how the computationally complex evaluation of inference over RDF data can be parallelized on graphics cards to decrease computation time.
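A minimal sketch of the data-parallel evaluation the contribution aims at: integer-encoded (s, p, o) triples are scanned against a pattern in chunks, which is the kind of uniform, branch-light work that maps onto GPU threads; here plain Python threads and NumPy stand in for the graphics card, and the vocabulary encoding is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

triples = np.random.randint(0, 1000, size=(1_000_000, 3), dtype=np.int32)
P_TYPE, C_PERSON = 7, 42                    # pattern: ?s rdf:type :Person

def scan(chunk):
    mask = (chunk[:, 1] == P_TYPE) & (chunk[:, 2] == C_PERSON)
    return chunk[mask, 0]                   # matching subjects

with ThreadPoolExecutor() as pool:          # one chunk per worker (or GPU block)
    parts = pool.map(scan, np.array_split(triples, 8))
subjects = np.concatenate(list(parts))
```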
NASA Astrophysics Data System (ADS)
Balaji, V.; Benson, Rusty; Wyman, Bruce; Held, Isaac
2016-10-01
Climate models represent a large variety of processes on a variety of timescales and space scales, a canonical example of multi-physics multi-scale modeling. Current hardware trends, such as Graphical Processing Units (GPUs) and Many Integrated Core (MIC) chips, are based on, at best, marginal increases in clock speed, coupled with vast increases in concurrency, particularly at the fine grain. Multi-physics codes face particular challenges in achieving fine-grained concurrency, as different physics and dynamics components have different computational profiles, and universal solutions are hard to come by. We propose here one approach for multi-physics codes. These codes are typically structured as components interacting via software frameworks. The component structure of a typical Earth system model consists of a hierarchical and recursive tree of components, each representing a different climate process or dynamical system. This recursive structure generally encompasses a modest level of concurrency at the highest level (e.g., atmosphere and ocean on different processor sets) with serial organization underneath. We propose to extend concurrency much further by running more and more lower- and higher-level components in parallel with each other. Each component can further be parallelized on the fine grain, potentially offering a major increase in the scalability of Earth system models. We present here first results from this approach, called coarse-grained component concurrency, or CCC. Within the Geophysical Fluid Dynamics Laboratory (GFDL) Flexible Modeling System (FMS), the atmospheric radiative transfer component has been configured to run in parallel with a composite component consisting of every other atmospheric component, including the atmospheric dynamics and all other atmospheric physics components. We will explore the algorithmic challenges involved in such an approach, and present results from such simulations. Plans to achieve even greater levels of coarse-grained concurrency by extending this approach within other components, such as the ocean, will be discussed.
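A minimal sketch of coarse-grained component concurrency under the stated assumptions: the radiative transfer component runs concurrently with a composite of the remaining components, both reading the start-of-step state, and their tendencies are merged afterwards. The component functions are hypothetical placeholders, not FMS code.

```python
from concurrent.futures import ThreadPoolExecutor

def radiation(state):
    return {"rad_heating": 0.1 * state["T"]}        # toy radiative tendency

def dynamics_and_physics(state):
    return {"T": state["T"] + 0.01}                 # toy composite component

state = {"T": 288.0}
with ThreadPoolExecutor(max_workers=2) as pool:
    f_rad = pool.submit(radiation, dict(state))     # both use start-of-step state
    f_dyn = pool.submit(dynamics_and_physics, dict(state))
    state.update(f_dyn.result())
    state.update(f_rad.result())                    # merge radiation tendency
```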
Rosso, Diego; Lothman, Sarah E; Jeung, Matthew K; Pitt, Paul; Gellner, W James; Stone, Alan L; Howard, Don
2011-11-15
Integrated fixed-film activated sludge (IFAS) processes are becoming more popular for both secondary and sidestream treatment in wastewater facilities. These processes are a combination of biofilm reactors and activated sludge processes, achieved by introducing and retaining biofilm carrier media in activated sludge reactors. A full-scale train of three IFAS reactors equipped with AnoxKaldnes media and coarse-bubble aeration was tested using off-gas analysis. It was operated independently, in parallel to an existing full-scale activated sludge process. Both processes achieved the same percent removal of COD and ammonia, despite the double oxygen demand on the IFAS reactors. In order to prevent kinetic limitations associated with DO diffusional gradients through the IFAS biofilm, this system was operated at an elevated dissolved oxygen concentration, in line with the manufacturer's recommendation. Also, to avoid media coalescence on the reactor surface and promote biofilm contact with the substrate, high mixing requirements are specified. Therefore, the air flux in the IFAS reactors was much higher than that of the parallel activated sludge reactors. However, the standardized oxygen transfer efficiency in process water was almost the same for both processes. In theory, when the oxygen transfer efficiency is the same, the air used per unit load removed should be the same. However, due to the high DO and mixing requirements, the IFAS reactors were characterized by elevated air flux and air use per unit load treated. This was directly reflected in the relative energy footprint for aeration, which in this case was much higher for the IFAS system than for activated sludge. Copyright © 2011 Elsevier Ltd. All rights reserved.
Estrada-Arriaga, Edson Baltazar; Hernández-Romano, Jesús; García-Sánchez, Liliana; Guillén Garcés, Rosa Angélica; Bahena-Bahena, Erick Obed; Guadarrama-Pérez, Oscar; Moeller Chavez, Gabriela Eleonora
2018-05-15
In this study, a continuous-flow stack consisting of 40 individual air-cathode MFC units was used to determine the performance of a stacked MFC during domestic wastewater treatment, operated with unconnected individual MFC units and in series and parallel configurations. The voltages obtained from individual MFC units were 0.08-1.1 V at open circuit voltage, while in series connection the maximum power and current density were 2500 mW/m2 and 500 mA/m2 (4.9 V), respectively. In parallel connection, the maximum power and current density were 5.8 mW/m2 and 24 mA/m2, respectively. When the cells were not connected to any other MFC unit, the main bacterial species found in the anode biofilms were Bacillus and Lysinibacillus. After switching from unconnected to series and parallel connections, the most abundant species in the stacked MFC were Pseudomonas aeruginosa, followed by different Bacilli classes. This study demonstrated that when the stacked MFC was switched from unconnected to series and parallel connections, the pollutant removal, electricity generation performance and microbial community changed significantly. Voltage drops were observed in the stacked MFC, which was mainly limited by the cathodes. These voltage losses indicated high resistances within the stacked MFC, generating a parasitic cross current. Copyright © 2018 Elsevier Ltd. All rights reserved.
The Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. The interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Laseke, B.A. Jr.
The report presents the results of a survey of operational flue gas desulfurization (FGD) systems on coal-fired utility boilers in the United States. The FGD system installed on Unit 1 at the Duck Creek Station of Central Illinois Light Company is described in terms of design and performance. The system consists of four parallel, wet-limestone, rod-deck scrubber modules designed for 25% capacity each, providing a total sulfur dioxide removal efficiency of 85%. The bottom ash, fly ash, and scrubbing wastes are disposed of in a sludge pond lined with a natural impermeable material. The first module of this four-module FGD system was placed in service on July 1, 1976, and operated intermittently throughout the remainder of the year and for approximately one month in early 1977. On July 23, 1978, the three remaining modules were completed and all four modules were placed in the gas path for treatment of high sulfur flue gas.
Self-organised fractional quantisation in a hole quantum wire
NASA Astrophysics Data System (ADS)
Gul, Y.; Holmes, S. N.; Myronov, M.; Kumar, S.; Pepper, M.
2018-03-01
We have investigated hole transport in quantum wires formed by electrostatic confinement in strained germanium two-dimensional layers. The ballistic conductance characteristics show the regular staircase of quantum levels with plateaux at n(2e^2/h), where n is an integer, e is the fundamental unit of charge and h is Planck's constant. However, as the carrier concentration is reduced, the quantised levels show a behaviour that is indicative of the formation of a zig-zag structure, and new quantised plateaux appear at low temperatures. In units of 2e^2/h the new quantised levels correspond to values of n = 1/4, reducing to 1/8 in the presence of a strong parallel magnetic field which lifts the spin degeneracy but does not quantise the wavefunction. A further plateau is observed corresponding to n = 1/32, which does not change in the presence of a parallel magnetic field. These values indicate that the system is behaving as if charge were fractionalised with values e/2 and e/4; possible mechanisms are discussed.
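The reported plateaux translate into the following conductance values; a small worked example using CODATA constants from scipy.

```python
from scipy.constants import e, h

# Conductance quantum for spin-degenerate channels and the reported levels.
G0 = 2 * e**2 / h                                  # ~7.748e-5 S (siemens)
for n in (1, 1/4, 1/8, 1/32):
    print(f"n = {n:<6} G = {n * G0:.3e} S")
```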
A learnable parallel processing architecture towards unity of memory and computing
NASA Astrophysics Data System (ADS)
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-08-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need for efficient information processing in data-driven applications such as big data and the Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with a single type of device. Leveraging the nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such an architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700-fold reduction in the circuit area.
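A rough sketch of the crossbar principle behind such architectures: inputs are stored as low/high-resistance states and many logic terms are read out in parallel as column currents. The conductances, read voltage and thresholds below are illustrative, not device parameters from the paper.

```python
import numpy as np

G_ON, G_OFF = 1e-3, 1e-6                      # siemens: low/high resistance states
inputs = np.random.randint(0, 2, (64, 128))   # 64 wordlines x 128 bitlines
G = np.where(inputs, G_ON, G_OFF)             # program the RS devices

v_read = np.ones(64) * 0.2                    # read voltage on all rows at once
i_col = v_read @ G                            # column currents, all in parallel
or_out = i_col > 0.5 * 0.2 * G_ON             # wired-OR of each column
and_out = i_col > 0.2 * G_ON * (64 - 0.5)     # AND of all 64 rows per column
```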
Adapting high-level language programs for parallel processing using data flow
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1988-01-01
EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.
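The large-grained model can be sketched with futures: each subprogram call is a unit, and calls whose inputs are ready run in parallel, which is essentially the data flow graph the abstract mentions. The subprograms below are hypothetical placeholders, not EASY-FLOW syntax.

```python
from concurrent.futures import ThreadPoolExecutor

def load(name): return list(range(10))
def stats(xs): return sum(xs) / len(xs)
def scale(xs): return [2 * x for x in xs]
def report(mean, ys): return f"mean={mean}, scaled head={ys[:3]}"

with ThreadPoolExecutor() as pool:
    xs = pool.submit(load, "input").result()
    f_mean = pool.submit(stats, xs)      # these two branches have no mutual
    f_scaled = pool.submit(scale, xs)    # dependencies and run in parallel
    print(report(f_mean.result(), f_scaled.result()))
```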
NASA Technical Reports Server (NTRS)
Hsia, T. C.; Lu, G. Z.; Han, W. H.
1987-01-01
In advanced robot control problems, on-line computation of the inverse Jacobian solution is frequently required. A parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be implemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.
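For reference, the computation being accelerated is the inverse differential kinematics dq = J^{-1} dx; a dense-solver sketch is shown below with a random stand-in Jacobian rather than the PUMA arm's, whereas the paper pipelines this through 27 systolic cells.

```python
import numpy as np

J = np.random.rand(6, 6) + np.eye(6)   # hypothetical 6-DOF manipulator Jacobian
dx = np.array([0.01, 0.0, -0.02, 0.0, 0.005, 0.0])  # end-effector velocity

dq = np.linalg.solve(J, dx)            # joint rates; the systolic array pipelines this
print(dq)
```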
Parallel/Vector Integration Methods for Dynamical Astronomy
NASA Astrophysics Data System (ADS)
Fukushima, Toshio
1999-01-01
This paper reviews three recent works on numerical methods to integrate ordinary differential equations (ODE), which are specially designed for parallel, vector, and/or multi-processor-unit (PU) computers. The first is the Picard-Chebyshev method (Fukushima, 1997a). It obtains a global solution of the ODE in the form of a Chebyshev polynomial of large (> 1000) degree by applying the Picard iteration repeatedly. The iteration converges for smooth problems and/or perturbed dynamics. The method runs around 100-1000 times faster in the vector mode than in the scalar mode of a certain computer with vector processors (Fukushima, 1997b). The second is a parallelization of a symplectic integrator (Saha et al., 1997). It regards the implicit midpoint rules covering thousands of timesteps as large-scale nonlinear equations and solves them by fixed-point iteration. The method is applicable to Hamiltonian systems and is expected to yield an acceleration factor of around 50 on parallel computers with more than 1000 PUs. The last is a parallelization of the extrapolation method (Ito and Fukushima, 1997). It performs trial integrations in parallel, and the trial integrations are further accelerated by balancing the computational load among PUs by the technique of folding. The method is all-purpose and achieves an acceleration factor of around 3.5 using several PUs. Finally, we give a perspective on the parallelization of some implicit integrators which require multiple corrections in solving implicit formulas, like the implicit Hermitian integrators (Makino and Aarseth, 1992), (Hut et al., 1995) or the implicit symmetric multistep methods (Fukushima, 1998), (Fukushima, 1999).
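The Picard iteration at the heart of the first method can be sketched in a few lines: the whole solution on a grid is refined globally on each sweep, and each sweep's function evaluations are independent, which is what vectorizes. Chebyshev fitting is omitted here; plain trapezoidal quadrature on a test problem stands in for it.

```python
import numpy as np

f = lambda t, y: -y                        # test problem, exact y = exp(-t)
t = np.linspace(0.0, 1.0, 201)
y = np.ones_like(t)                        # initial guess y(t) = y0 = 1

for _ in range(30):                        # Picard sweeps; each is data-parallel
    g = f(t, y)                            # y <- y0 + integral_0^t f(s, y(s)) ds
    integral = np.concatenate(
        ([0.0], np.cumsum((g[1:] + g[:-1]) / 2 * np.diff(t))))
    y = 1.0 + integral

print(abs(y[-1] - np.exp(-1.0)))           # small residual after convergence
```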
Specialized Computer Systems for Environment Visualization
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Bashkov, Evgeniy A.; Zori, Sergii A.
2018-06-01
The need for real-time image generation of landscapes arises in various fields, as part of tasks solved by virtual and augmented reality systems as well as geographic information systems. Such systems provide opportunities for collecting, storing, analyzing and graphically visualizing geographic data. Algorithmic and hardware-software tools for increasing the realism and efficiency of environment visualization in 3D visualization systems are proposed. This paper discusses a modified path tracing algorithm with a two-level hierarchy of bounding volumes that finds intersections with axis-aligned bounding boxes. The proposed algorithm eliminates branching and hence is more suitable for implementation on multi-threaded CPUs and GPUs. A modified ROAM algorithm is used to solve the problems of qualitative visualization of reliefs and landscapes. The algorithm is implemented on parallel systems: clusters and Compute Unified Device Architecture networks. Results show that the implementation on MPI clusters is more efficient than on Graphics Processing Units/Graphics Processing Clusters and allows real-time synthesis. The organization and algorithms of a parallel GPU system for 3D pseudo-stereo image/video synthesis are proposed. After analyzing the possibility of realizing each stage on a parallel GPU architecture, 3D pseudo-stereo synthesis is performed. An experimental prototype of a specialized hardware-software system for 3D pseudo-stereo imaging and video was developed on the CPU/GPU. The experimental results show that the proposed adaptation of 3D pseudo-stereo imaging to the architecture of GPU systems is efficient, and that it accelerates the computational procedures of 3D pseudo-stereo synthesis for the anaglyph and anamorphic formats of the 3D stereo frame without additional optimization procedures. The acceleration is on average 11 and 54 times for the test GPUs.
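The branch elimination mentioned for the intersection test can be illustrated with a vectorized slab test against an axis-aligned bounding box; keeping all rays on the same arithmetic path is what suits warps of GPU threads. The ray set and box below are arbitrary.

```python
import numpy as np

def ray_aabb(origin, inv_dir, box_min, box_max):
    t1 = (box_min - origin) * inv_dir          # per-axis slab entry parameters
    t2 = (box_max - origin) * inv_dir
    t_near = np.minimum(t1, t2).max(axis=-1)   # latest entry across axes
    t_far = np.maximum(t1, t2).min(axis=-1)    # earliest exit across axes
    return (t_near <= t_far) & (t_far >= 0.0)  # hit mask, no branches

rays_o = np.zeros((1000, 3))
rays_d = np.random.randn(1000, 3)
hits = ray_aabb(rays_o, 1.0 / rays_d, np.array([1, 1, 1]), np.array([2, 2, 2]))
```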
Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan
2017-08-04
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
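A rough sketch of the kind of launch-configuration search TLPOM performs: choose the threads-per-block value that maximizes estimated occupancy under per-node resource limits. All limits below are illustrative placeholders, not values from the paper.

```python
MAX_THREADS_PER_SM = 2048
MAX_THREADS_PER_BLOCK = 1024
REGS_PER_SM, REGS_PER_THREAD = 65536, 40   # hypothetical register budget

def occupancy(threads_per_block):
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks_by_regs = REGS_PER_SM // (REGS_PER_THREAD * threads_per_block)
    blocks = min(blocks_by_threads, blocks_by_regs)
    return blocks * threads_per_block / MAX_THREADS_PER_SM

best = max((32 * k for k in range(1, MAX_THREADS_PER_BLOCK // 32 + 1)),
           key=occupancy)                  # search warp-sized configurations
print(best, occupancy(best))
```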
Development of a Cross-Flow Fan Powered Quad-Rotor Unmanned Aerial Vehicle
2015-06-01
Large Scale Document Inversion using a Multi-threaded Computing System
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2018-01-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. PMID:29861701
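A minimal sketch of the hash-based SPMD scheme: every worker runs the same inversion program on its own slice of documents, building a partial postings table in time linear in its tokens, and the partial tables are merged. The three toy documents are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

docs = {1: "parallel index on gpu", 2: "gpu computing", 3: "parallel gpu index"}

def invert(chunk):
    partial = defaultdict(list)          # hash table keyed by term
    for doc_id, text in chunk:
        for term in text.split():
            partial[term].append(doc_id) # linear in total tokens
    return partial

items = list(docs.items())
with ThreadPoolExecutor(max_workers=2) as pool:   # same program, different data
    parts = pool.map(invert, [items[:2], items[2:]])

index = defaultdict(list)
for partial in parts:                    # merge the partial postings tables
    for term, postings in partial.items():
        index[term].extend(postings)
print(dict(index))
```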
Evaluation of Urinary Tract Dilation Classification System for Grading Postnatal Hydronephrosis.
Hodhod, Amr; Capolicchio, John-Paul; Jednak, Roman; El-Sherif, Eid; El-Doray, Abd El-Alim; El-Sherbiny, Mohamed
2016-03-01
We assessed the reliability and validity of the Urinary Tract Dilation classification system as a new grading system for postnatal hydronephrosis. We retrospectively reviewed charts of patients who presented with hydronephrosis from 2008 to 2013. We included patients diagnosed prenatally and those with hydronephrosis discovered incidentally during the first year of life. We excluded cases involving urinary tract infection, neurogenic bladder and chromosomal anomalies, those associated with extraurinary congenital malformations and those with followup of less than 24 months without resolution. Hydronephrosis was graded postnatally using the Society for Fetal Urology system, and then the management protocol was chosen. All units were regraded using the Urinary Tract Dilation classification system and compared to the Society for Fetal Urology system to assess reliability. Univariate and multivariate analyses were performed to assess the validity of the Urinary Tract Dilation classification system in predicting hydronephrosis resolution and surgical intervention. A total of 490 patients (730 renal units) were eligible to participate. The Urinary Tract Dilation classification system was reliable in the assessment of hydronephrosis (parallel forms 0.92). Hydronephrosis resolved in 357 units (49%), and 86 units (12%) were managed by surgical intervention. The remainder of renal units demonstrated stable or improved hydronephrosis. Multivariate analysis revealed that the likelihood of surgical intervention was predicted independently by Urinary Tract Dilation classification system risk group, while Society for Fetal Urology grades were predictive of likelihood of resolution. The Urinary Tract Dilation classification system is reliable for evaluation of postnatal hydronephrosis and is valid in predicting surgical intervention. Copyright © 2016 American Urological Association Education and Research, Inc. Published by Elsevier Inc. All rights reserved.
Kalman Filter Tracking on Parallel Architectures
NASA Astrophysics Data System (ADS)
Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
2015-12-01
Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques including Cellular Automata or returning to Hough Transform. The most common track finding techniques in use today are however those based on the Kalman Filter [2]. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust and are exactly those being used today for the design of the tracking system for HL-LHC. Our previous investigations showed that, using optimized data structures, track fitting with Kalman Filter can achieve large speedup both with Intel Xeon and Xeon Phi. We report here our further progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a realistic simulation setup.
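The vectorization idea can be sketched by batching the same Kalman predict/update over many track candidates at once, so the arithmetic maps onto wide vector units or GPU threads; a 1-D constant-velocity toy model stands in for the real helix track state used in collider tracking.

```python
import numpy as np

n_tracks = 10_000
F = np.array([[1.0, 1.0], [0.0, 1.0]])      # state transition (position, slope)
H = np.array([[1.0, 0.0]])                  # we measure position only
Q, R = 1e-4 * np.eye(2), np.array([[0.01]])

x = np.zeros((n_tracks, 2, 1))              # one state vector per track
P = np.tile(np.eye(2), (n_tracks, 1, 1))    # one covariance per track
z = np.random.randn(n_tracks, 1, 1)         # one hit per track at this layer

# Predict and update, batched over all tracks via broadcasting.
x = F @ x
P = F @ P @ F.T + Q
S = H @ P @ H.T + R                          # innovation covariance, (n,1,1)
K = P @ H.T / S                              # gain; scalar S divides elementwise
x = x + K @ (z - H @ x)
P = (np.eye(2) - K @ H) @ P
```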
NASA Astrophysics Data System (ADS)
Morita, S.; Nakajima, T.; Goto, S.; Yamada, Y.; Kawamura, K.
2012-12-01
A great number of slump (submarine landslide) units have been identified by reflection seismic surveys performed off the Shimokita Peninsula, NE Japan (Morita et al., 2011). 3-D seismic data revealed typical deformations caused by slumping and related dewatering in the Pliocene and overlying formations. The slumping was generated primarily by layer-parallel slip on a very gentle (<1 degree), flat continental slope. Slump units extend up to 30 km in both width and slip direction. The slump units often exhibit an imbrication structure formed by repeated thrusting in the bottom layers, being mostly composed of thrust blocks with little matrix. The dewatering structure is observed as widespread parallel dikes whose distribution depends strongly on the imbrication of the slump units. Slip planes of the slumps are traceable in the seismic data because of the layer-parallel slip. The layers corresponding to the slip planes are generally characterized as low-amplitude layers with some thickness, and some slip planes exhibit flattened features under the imbricated slump units accompanied by parallel dikes. This implies that excess fluid in the slip plane provided lubrication that enhanced the slumping and was drained through the parallel dikes during slumping. Some typical structures related to natural gas, e.g. enhanced reflections and gas chimneys, have been identified in the seismic data. The shakedown cruise of D/V Chikyu in 2006 reported a recovery of gas hydrate in a nearby area (Higuchi et al., 2009), and a shallow sulfate-methane interface (SMI) of 3.5-12 mbsf has been reported in the survey area (Kotani et al., 2007). These features indicate that a high methane flux in the area is likely an important ground-instability factor causing the slumping and dewatering phenomena. The set of slump units in the survey area is thus one of the most suitable targets for investigating the mechanism of submarine landslides, and we have started exploring the feasibility of scientific drilling in this survey area.
20 kHz main inverter unit. [for space station power supplies
NASA Technical Reports Server (NTRS)
Hussey, S.
1989-01-01
A proof-of-concept main inverter unit has demonstrated the operation of a pulse-width-modulated parallel resonant power stage topology as a 20-kHz ac power source driver, showing simple output regulation, parallel operation, power sharing and short-circuit operation. The use of a two-stage dc input filter controls the electromagnetic compatibility (EMC) characteristics of the dc power bus, and the use of an ac harmonic trap controls the EMC characteristics of the 20-kHz ac power bus.
Cooperative storage of shared files in a parallel computing system with dynamic block size
Bent, John M.; Faibish, Sorin; Grider, Gary
2015-11-10
Improved techniques are provided for parallel writing of data to a shared object in a parallel computing system. A method is provided for storing data generated by a plurality of parallel processes to a shared object in a parallel computing system. The method is performed by at least one of the processes and comprises: dynamically determining a block size for storing the data; exchanging a determined amount of the data with at least one additional process to achieve a block of the data having the dynamically determined block size; and writing the block of the data having the dynamically determined block size to a file system. The determined block size comprises, e.g., a total amount of the data to be stored divided by the number of parallel processes. The file system comprises, for example, a log structured virtual parallel file system, such as a Parallel Log-Structured File System (PLFS).
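A minimal numeric sketch of the dynamically determined block size described above: the total data volume divided by the number of parallel writers, with the remainder exchanged between processes so that each writer holds one full block before writing. The sizes are illustrative.

```python
total_bytes = 10_000_000
n_procs = 48
block = total_bytes // n_procs            # dynamically determined block size
leftover = total_bytes - block * n_procs  # exchanged with a neighbor pre-write
print(block, leftover)
```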
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram
2013-04-09
A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J. Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing units (GPUs), is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.
Long-time atomistic simulations with the Parallel Replica Dynamics method
NASA Astrophysics Data System (ADS)
Perez, Danny
Molecular Dynamics (MD) -- the numerical integration of atomistic equations of motion -- is a workhorse of computational materials science. Indeed, MD can in principle be used to obtain any thermodynamic or kinetic quantity, without introducing any approximation or assumptions beyond the adequacy of the interaction potential. It is therefore an extremely powerful and flexible tool to study materials with atomistic spatio-temporal resolution. These enviable qualities however come at a steep computational price, hence limiting the system sizes and simulation times that can be achieved in practice. While the size limitation can be efficiently addressed with massively parallel implementations of MD based on spatial decomposition strategies, allowing for the simulation of trillions of atoms, the same approach usually cannot extend the timescales much beyond microseconds. In this article, we discuss an alternative parallel-in-time approach, the Parallel Replica Dynamics (ParRep) method, that aims at addressing the timescale limitation of MD for systems that evolve through rare state-to-state transitions. We review the formal underpinnings of the method and demonstrate that it can provide arbitrarily accurate results for any definition of the states. When an adequate definition of the states is available, ParRep can simulate trajectories with a parallel speedup approaching the number of replicas used. We demonstrate the usefulness of ParRep by presenting different examples of materials simulations where access to long timescales was essential to access the physical regime of interest and discuss practical considerations that must be addressed to carry out these simulations. Work supported by the United States Department of Energy (U.S. DOE), Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division.
Molecular pathways to parallel evolution: I. Gene nexuses and their morphological correlates.
Zuckerkandl, E
1994-12-01
Aspects of the regulatory interactions among genes are probably as old as most genes are themselves. Correspondingly, similar predispositions to changes in such interactions must have existed for long evolutionary periods. Features of the structure and the evolution of the system of gene regulation furnish the background necessary for a molecular understanding of parallel evolution. Patently "unrelated" organs, such as the fat body of a fly and the liver of a mammal, can exhibit fractional homology, a fraction expected to become subject to quantitation. This also seems to hold for different organs in the same organism, such as wings and legs of a fly. In informational macromolecules, on the other hand, homology is indeed all or none. In the quite different case of organs, analogy is expected usually to represent attenuated homology. Many instances of putative convergence are likely to turn out to be predominantly parallel evolution, presumably including the case of the vertebrate and cephalopod eyes. Homology in morphological features reflects a similarity in networks of active genes. Similar nexuses of active genes can be established in cells of different embryological origins. Thus, parallel development can be considered a counterpart to parallel evolution. Specific macromolecular interactions leading to the regulation of the c-fos gene are given as an example of a "controller node" defined as a regulatory unit. Quantitative changes in gene control are distinguished from relational changes, and frequent parallelism in quantitative changes is noted in Drosophila enzymes. Evolutionary reversions in quantitative gene expression are also expected. The evolution of relational patterns is attributed to several distinct mechanisms, notably the shuffling of protein domains. The growth of such patterns may in part be brought about by a particular process of compensation for "controller gene diseases," a process that would spontaneously tend to lead to increased regulatory and organismal complexity. Despite the inferred increase in gene interaction complexity, whose course over evolutionary time is unknown, the number of homology groups for the functional and structural protein units designated as domains has probably remained rather constant, even as, in some of its branches, evolution moved toward "higher" organisms. In connection with this process, the question is raised of parallel evolution within the purview of activating and repressing master switches and in regard to the number of levels into which the hierarchies of genic master switches will eventually be resolved.
Automatic Management of Parallel and Distributed System Resources
NASA Technical Reports Server (NTRS)
Yan, Jerry; Ngai, Tin Fook; Lundstrom, Stephen F.
1990-01-01
Viewgraphs on automatic management of parallel and distributed system resources are presented. Topics covered include: parallel applications; intelligent management of multiprocessing systems; performance evaluation of parallel architecture; dynamic concurrent programs; compiler-directed system approach; lattice gaseous cellular automata; and sparse matrix Cholesky factorization.
Kalman Filter Tracking on Parallel Architectures
NASA Astrophysics Data System (ADS)
Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
2016-11-01
Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
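Because the Kalman filter is the computational core here, a minimal predict/update step of the kind a track fit repeats hit-by-hit is sketched below. The matrices and the toy constant-velocity state are illustrative assumptions rather than the experiments' actual track model; the parallelism discussed above comes from running many such fits concurrently across vector lanes and cores.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One predict/update cycle: propagate the track state to the next
    detector layer, then blend the prediction with the measured hit z."""
    x_pred = F @ x                         # predicted state
    P_pred = F @ P @ F.T + Q               # predicted covariance
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy 1-D constant-velocity model: state = [position, velocity].
F = np.array([[1.0, 1.0], [0.0, 1.0]])     # straight-line propagation
H = np.array([[1.0, 0.0]])                 # we measure position only
Q, R = 1e-4 * np.eye(2), np.array([[0.01]])
x, P = np.array([0.0, 1.0]), np.eye(2)
for z in [1.02, 1.98, 3.05]:               # simulated hits
    x, P = kalman_step(x, P, np.array([z]), F, Q, H, R)
print(x.round(2))                          # state approaches [3, 1]
```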
An extended algebraic reconstruction technique (E-ART) for dual spectral CT.
Zhao, Yunsong; Zhao, Xing; Zhang, Peng
2015-03-01
Compared with standard computed tomography (CT), dual spectral CT (DSCT) has many advantages for object separation, contrast enhancement, artifact reduction, and material composition assessment. But it is generally difficult to reconstruct images from polychromatic projections acquired by DSCT, because of the nonlinear relation between the polychromatic projections and the images to be reconstructed. This paper first models the DSCT reconstruction problem as a nonlinear system problem, and then extends the classic ART method to solve the nonlinear system. One feature of the proposed method is its flexibility: it fits any commonly used scanning configuration and does not require consistent rays for different X-ray spectra. Another feature of the proposed method is its high degree of parallelism, which means that the method is suitable for acceleration on GPUs (graphics processing units) or other parallel systems. The method is validated with numerical experiments on simulated noise-free and noisy data. High-quality images are reconstructed with the proposed method from the polychromatic projections of DSCT. The reconstructed images are still satisfactory even if there are certain errors in the estimated X-ray spectra.
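For orientation, the classic linear ART (Kaczmarz) sweep that E-ART extends projects the current image onto each ray equation in turn; in the paper's nonlinear setting, this linear residual is replaced by a residual of the polychromatic forward model. A minimal linear sketch with toy data (the names and relaxation factor are illustrative assumptions):

```python
import numpy as np

def art_sweeps(A, b, n_sweeps=50, relax=0.5):
    """Classic ART/Kaczmarz: for each ray i, nudge the image x so that
    the i-th projection equation A[i] @ x = b[i] is better satisfied."""
    x = np.zeros(A.shape[1])
    row_norms = (A * A).sum(axis=1)
    for _ in range(n_sweeps):
        for i in range(A.shape[0]):
            if row_norms[i] > 0:
                resid = b[i] - A[i] @ x
                x += relax * (resid / row_norms[i]) * A[i]
    return x

A = np.array([[1.0, 1.0, 0.0],      # toy 3-pixel "image", 3 rays
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
x_true = np.array([1.0, 2.0, 3.0])
print(art_sweeps(A, A @ x_true).round(3))   # ~[1. 2. 3.]
```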
A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gittens, Alex; Kottalam, Jey; Yang, Jiyan
We investigate the performance and scalability of the randomized CX low-rank matrix factorization and demonstrate its applicability through the analysis of a 1TB mass spectrometry imaging (MSI) dataset, using Apache Spark on an Amazon EC2 cluster, a Cray XC40 system, and an experimental Cray cluster. We implemented this factorization both as a parallelized C implementation with hand-tuned optimizations and in Scala using the Apache Spark high-level cluster computing framework. We obtained consistent performance across the three platforms: using Spark we were able to process the 1TB size dataset in under 30 minutes with 960 cores on all systems, with the fastest times obtained on the experimental Cray cluster. In comparison, the C implementation was 21X faster on the Amazon EC2 system, due to careful cache optimizations, bandwidth-friendly access of matrices and vector computation using SIMD units. We report these results and their implications on the hardware and software issues arising in supporting data-centric workloads in parallel and distributed environments.
Description of inpatient medication management using cognitive work analysis.
Pingenot, Alleene Anne; Shanteau, James; Sengstacke, L T C Daniel N
2009-01-01
The purpose of this article was to describe key elements of an inpatient medication system using the cognitive work analysis method of Rasmussen et al (Cognitive Systems Engineering. Wiley Series in Systems Engineering; 1994). The work of nurses and physicians was observed in routine care of inpatients on a medical-surgical unit and attached ICU. Interaction with pharmacists was included. Preoperative, postoperative, and medical care was observed. Personnel were interviewed to obtain information not easily observable during routine work. Communication between healthcare workers was projected onto an abstraction/decomposition hierarchy. Decision ladders and information flow charts were developed. Results suggest that decision making on an inpatient medical/surgical unit or ICU setting is a parallel, distributed process. Personnel are highly mobile and often are working on multiple issues concurrently. In this setting, communication is key to maintaining organization and synchronization for effective care. Implications for research approaches to system and interface designs and decision support for personnel involved in the process are discussed.
High-performance computing with quantum processing units
Britt, Keith A.; Oak Ridge National Lab.; Humble, Travis S.; ...
2017-03-01
The prospects of quantum computing have driven efforts to realize fully functional quantum processing units (QPUs). Recent success in developing proof-of-principle QPUs has prompted the question of how to integrate these emerging processors into modern high-performance computing (HPC) systems. We examine how QPUs can be integrated into current and future HPC system architectures by accounting for functional and physical design requirements. We identify two integration pathways that are differentiated by infrastructure constraints on the QPU and the use cases expected for the HPC system. This includes a tight integration that assumes infrastructure bottlenecks can be overcome as well as a loose integration that assumes they cannot. We find that the performance of both approaches is likely to depend on the quantum interconnect that serves to entangle multiple QPUs. As a result, we also identify several challenges in assessing QPU performance for HPC, and we consider new metrics that capture the interplay between system architecture and the quantum parallelism underlying computational performance.
Ridge Orientations of the Ridge-Forming Unit, Sinus Meridiani, Mars-A Fluvial Explanation
NASA Technical Reports Server (NTRS)
Wilkinson, M. Justin; Herridge, A.
2013-01-01
Imagery and MOLA data were used in an analysis of the ridge-forming rock unit (RFU) exposed in Sinus Meridiani (SM). This unit shows parallels at different scales with fluvial sedimentary bodies. We propose the terrestrial megafan as the prime analog for the RFU, and likely for other members of the layered units. Megafans are partial cones of fluvial sediment, with radii up to hundreds of km. Although recent reviews of hypotheses for the RFU exclude fluvial hypotheses [1], inverted ridges in the deserts of Oman have been suggested as putative analogs for some ridges [2], apparently without appreciating that the wider context in which these ridges formed is a series of megafans [3], a relatively unappreciated geomorphic feature. It has been argued that these units conform to the megafan model at the regional, subregional and local scales [4]. At the regional scale, suites of terrestrial megafans are known to cover large areas at the foot of uplands on all continents - a close parallel with the setting of the Meridiani sediments at the foot of the southern uplands of Mars, with its incised fluvial systems leading down the regional NW slope [2, 3] towards the sedimentary units. At the subregional scale, the layering and internal discontinuities of the Meridiani rocks are consistent, inter alia, with stacked fluvial units [4]. Although poorly recognized as such, the prime geomorphic environment in which stream channel networks cover large areas, without intervening hillslopes, is the megafan [see e.g. 4]. Single megafans can reach 200,000 km2 [5]. Megafans thus supply an analog for areas where channel-like ridges (as a palimpsest of a prior landscape) cover the intercrater plains of Meridiani [6]. At the local, or river-reach, scale, the numerous sinuous features of the RFU are suggestive of fluvial channels. Cross-cutting relationships, a common feature of channels on terrestrial megafans, are ubiquitous. Desert megafans show cemented paleo-channels as inverted topography [4] with all these characteristics.
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted graphics cards with Graphics Processing Units (GPUs) and the associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on protein database search using the intertask parallelization technique, employing the GPU only to perform the SW computations one at a time. Hence, in this paper, we propose an efficient SW alignment method, called CUDA-SWfr, for protein database search using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on the GPU, a procedure is applied on the CPU using the frequency distance filtration scheme (FDFS) to eliminate unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
Quasi-Newton parallel geometry optimization methods
NASA Astrophysics Data System (ADS)
Burger, Steven K.; Ayers, Paul W.
2010-07-01
Algorithms for parallel unconstrained minimization of molecular systems are examined. The overall framework of minimization is the same except for the choice of directions for updating the quasi-Newton Hessian. Ideally these directions are chosen so that the updated Hessian gives steps that are the same as those of the Newton method. Three approaches to determining the update directions are presented: the straightforward approach of simply cycling through the Cartesian unit vectors (finite difference), a concurrent set of minimizations, and the Lanczos method. We show the importance of using preconditioning and a multiple secant update in these approaches. For the Lanczos algorithm, an initial set of directions is required to start the method, and a number of possibilities are explored. To test the methods we used the standard 50-dimensional analytic Rosenbrock function. Results are also reported for the histidine dipeptide, the isoleucine tripeptide, and cyclic adenosine monophosphate. All of these systems show a significant speed-up as processors are added, up to about eight processors.
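The finite-difference variant mentioned above lends itself to a compact illustration: Hessian columns are estimated from gradient differences along Cartesian unit vectors (independent evaluations that could be farmed out one per processor), after which a Newton-like step is taken. This is a sketch under simplifying assumptions (no preconditioning, secant updates, or step control), not the paper's production algorithm:

```python
import numpy as np

def rosenbrock_grad(x):
    """Analytic gradient of the n-dimensional Rosenbrock function."""
    g = np.zeros_like(x)
    g[:-1] += -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1])
    g[1:]  += 200 * (x[1:] - x[:-1]**2)
    return g

def fd_newton_step(x, grad, h=1e-5):
    """Estimate Hessian columns along Cartesian unit vectors, then take
    a Newton step. Each column is independent work for one processor."""
    n = len(x)
    g0 = grad(x)
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (grad(x + e) - g0) / h
    H = 0.5 * (H + H.T)              # symmetrize the estimate
    return x - np.linalg.solve(H, g0)

x = np.full(4, 0.8)                  # start near the curved valley
for _ in range(30):
    x = fd_newton_step(x, rosenbrock_grad)
print(x.round(4))                    # typically approaches all-ones
```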
NASA Technical Reports Server (NTRS)
Fraser, A. S.; Wells, A. F.; Tenoso, H. J.; Linnecke, C. B.
1976-01-01
Organon Diagnostics has developed, under NASA sponsorship, a monitoring system to test the capability of a water recovery system to reject the passage of viruses into the recovered water. In this system, a non-pathogenic marker virus, bacteriophage F2, is fed into the process stream before the recovery unit and the reclaimed water is assayed for its presence. An engineering preliminary design has been performed as a parallel effort to the laboratory development of the marker virus test system. Engineering schematics and drawings present a preliminary instrument design of a fully functional laboratory prototype capable of zero-G operation.
A cost-effective methodology for the design of massively-parallel VLSI functional units
NASA Technical Reports Server (NTRS)
Venkateswaran, N.; Sriram, G.; Desouza, J.
1993-01-01
In this paper we propose a generalized methodology for the design of cost-effective massively-parallel VLSI Functional Units. This methodology is based on a technique of generating and reducing a massive bit-array on the mask-programmable PAcube VLSI array. This methodology unifies (maintains identical data flow and control) the execution of complex arithmetic functions on PAcube arrays. It is highly regular, expandable and uniform with respect to problem-size and wordlength, thereby reducing the communication complexity. The memory-functional unit interface is regular and expandable. Using this technique, functional units of dedicated processors can be mask-programmed on the naked PAcube arrays, reducing the turn-around time. The production cost of such dedicated processors can be drastically reduced, since the naked PAcube arrays can be mass-produced. Analysis of the performance of functional units designed by our method yields promising results.
Internode data communications in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.
2013-09-03
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Internode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
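Both patent records above describe the same buffering discipline, which is easy to see in miniature: pre-allocate one buffer per expected process at boot, park any message that arrives before its target process exists, and copy the parked messages out when the process initializes. The following sketch uses hypothetical names and elides the DMA and memory-registration details:

```python
# Illustrative sketch of the patents' early-arrival buffering scheme;
# class and method names are assumptions, not the patented interfaces.

class MessagingUnit:
    def __init__(self, process_ids):
        # one pre-allocated buffer per process expected on this node
        self.buffers = {pid: [] for pid in process_ids}

    def receive(self, pid, message):
        self.buffers[pid].append(message)    # park early arrivals

    def drain_to(self, pid, main_memory_buffer):
        # called by the process once it has set up its own buffer
        main_memory_buffer.extend(self.buffers[pid])
        self.buffers[pid].clear()

mu = MessagingUnit(process_ids=[0, 1])
mu.receive(1, b"early data")         # arrives before process 1 exists
proc1_buffer = []                    # process 1 initializes later...
mu.drain_to(1, proc1_buffer)         # ...and copies its pending messages
print(proc1_buffer)                  # [b'early data']
```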
File concepts for parallel I/O
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1989-01-01
The subject of input/output (I/O) was often neglected in the design of parallel computer systems, although for many problems I/O rates will limit the speedup attainable. The I/O problem is addressed by considering the role of files in parallel systems. The notion of parallel files is introduced. Parallel files provide for concurrent access by multiple processes, and utilize parallelism in the I/O system to improve performance. Parallel files can also be used conventionally by sequential programs. A set of standard parallel file organizations using multiple storage devices is proposed. Problem areas are also identified and discussed.
A Module Experimental Process System Development Unit (MEPSDU)
NASA Technical Reports Server (NTRS)
1982-01-01
Restructuring research objectives from a technical readiness demonstration program to an investigation of high risk, high payoff activities associated with producing photovoltaic modules using non-CZ sheet material is reported. Deletion of the module frame in favor of a frameless design, and modification in cell series parallel electrical interconnect configuration are reviewed. A baseline process sequence was identified for the fabrication of modules using the selected dendritic web sheet material, and economic evaluations of the sequence were completed.
Molecular-dynamics simulations of self-assembled monolayers (SAM) on parallel computers
NASA Astrophysics Data System (ADS)
Vemparala, Satyavani
The purpose of this dissertation is to investigate the properties of self-assembled monolayers, particularly alkanethiols and poly(ethylene glycol)-terminated alkanethiols. These simulations are based on realistic interatomic potentials and require scalable and portable multiresolution algorithms implemented on parallel computers. Large-scale molecular dynamics simulations of self-assembled alkanethiol monolayer systems have been carried out using an all-atom model involving a million atoms to investigate their structural properties as a function of temperature, lattice spacing and molecular chain-length. Results show that the alkanethiol chains tilt from the surface normal by a collective angle of 25° along the next-nearest-neighbor direction at 300 K. At 350 K the system transforms to a disordered phase characterized by a small tilt angle, flexible tilt direction, and random distribution of backbone planes. With increasing lattice spacing, a, the tilt angle increases rapidly from a nearly zero value at a = 4.7 Å to as high as 34° at a = 5.3 Å at 300 K. We also studied the effect of end groups on the tilt structure of SAM films. We characterized the system with respect to temperature, the alkane chain length, lattice spacing, and the length of the end group. We found that gauche defects were predominant only in the tails, and that the gauche defects increased with temperature and the number of EG units. The effect of an electric field on the structure of poly(ethylene glycol) (PEG)-terminated alkanethiol SAMs on gold has also been studied using the parallel molecular dynamics method. An applied electric field triggers a conformational transition from all-trans to a mostly gauche conformation. The polarity of the electric field has a significant effect on the surface structure of PEG, leading to a profound effect on the hydrophilicity of the surface. An electric field applied anti-parallel to the surface normal causes a reversible transition to an ordered state in which the oxygen atoms are exposed. On the other hand, an electric field applied parallel to the surface normal introduces considerable disorder in the system and the oxygen atoms are buried inside.
Bypass apparatus and method for series connected energy storage devices
Rouillard, Jean; Comte, Christophe; Daigle, Dominik
2000-01-01
A bypass apparatus and method for series connected energy storage devices. Each of the energy storage devices coupled to a common series connection has an associated bypass unit connected thereto in parallel. A current bypass unit includes a sensor which is coupled in parallel with an associated energy storage device or cell and senses an energy parameter indicative of an energy state of the cell, such as cell voltage. A bypass switch is coupled in parallel with the energy storage cell and operable between a non-activated state and an activated state. The bypass switch, when in the non-activated state, is substantially non-conductive with respect to current passing through the energy storage cell and, when in the activated state, provides a bypass current path for passing current to the series connection so as to bypass the associated cell. A controller controls activation of the bypass switch in response to the voltage of the cell deviating from a pre-established voltage setpoint. The controller may be included within the bypass unit or be disposed on a control platform external to the bypass unit. The bypass switch may, when activated, establish a permanent or a temporary bypass current path.
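The control rule in this abstract reduces to comparing each cell's sensed voltage against a setpoint. A minimal sketch follows; the setpoint, tolerance, and names are assumed for illustration, and a real controller would add hysteresis, timing, and fault handling:

```python
# Activate a cell's bypass switch when its sensed voltage deviates from
# the setpoint by more than a tolerance, routing series current around it.

SETPOINT_V = 3.6     # assumed per-cell voltage setpoint
TOLERANCE_V = 0.4    # assumed allowed deviation

def update_bypass(cell_voltages, bypassed):
    for i, v in enumerate(cell_voltages):
        if abs(v - SETPOINT_V) > TOLERANCE_V:
            bypassed[i] = True   # close the bypass path for this cell
    return bypassed

cells = [3.61, 3.58, 2.90, 3.62]          # cell 2 has sagged
print(update_bypass(cells, [False] * 4))  # [False, False, True, False]
```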
Effective mass and spin susceptibility of dilute two-dimensional holes in GaAs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chiu, Y-T.; Padmanabhan, M.; Gokmen, T.
2011-10-31
We report effective hole mass (m*) measurements through analyzing the temperature dependence of Shubnikov-de Haas oscillations in dilute (density p ~ 7 x 10^10 cm^-2, r_s ~ 6) two-dimensional (2D) hole systems confined to a 20-nm-wide, (311)A GaAs quantum well. The holes in this system occupy two nearly degenerate spin subbands whose m* we measure to be ~0.2 (in units of the free electron mass). Despite the relatively large r_s in our 2D system, the measured m* is in reasonably good agreement with the results of our energy band calculations, which do not take interactions into account. We then apply a sufficiently strong parallel magnetic field to fully depopulate one of the spin subbands, and measure m* for the populated subband. We find that this latter m* is close to the m* we measure in the absence of the parallel field. We also deduce the spin susceptibility of the 2D hole system from the depopulation field, and we conclude that the susceptibility is enhanced by about 50% relative to the value expected from the band calculations.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
GAPD: a GPU-accelerated atom-based polychromatic diffraction simulation code.
E, J C; Wang, L; Chen, S; Zhang, Y Y; Luo, S N
2018-03-01
GAPD, a graphics-processing-unit (GPU)-accelerated atom-based polychromatic diffraction simulation code for direct, kinematics-based simulations of X-ray/electron diffraction of large-scale atomic systems with mono-/polychromatic beams and arbitrary plane detector geometries, is presented. This code implements GPU parallel computation via both real- and reciprocal-space decompositions. With GAPD, direct simulations are performed of the reciprocal lattice nodes of ultralarge systems (~5 billion atoms) and of diffraction patterns of single-crystal and polycrystalline configurations with mono- and polychromatic X-ray beams (including synchrotron undulator sources), and validation, benchmark and application cases are presented.
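The kinematic (single-scattering) sum at the heart of such codes, I(q) = |sum_j f_j exp(i q·r_j)|^2, is embarrassingly parallel over atoms and detector pixels, which is what the GPU decompositions exploit. A toy sketch with uniform scattering factors and random positions (assumptions for illustration; GAPD's actual spectra, detector geometry, and decompositions are far richer):

```python
import numpy as np

def kinematic_intensity(positions, q, f=1.0):
    """Kinematic diffraction intensity at scattering vector q:
    I(q) = |sum_j f_j * exp(i q . r_j)|^2."""
    phases = positions @ q                  # q . r_j for every atom
    amplitude = np.sum(f * np.exp(1j * phases))
    return np.abs(amplitude) ** 2

rng = np.random.default_rng(0)
atoms = rng.random((1000, 3)) * 50.0        # toy atomic configuration
print(kinematic_intensity(atoms, q=np.array([1.2, 0.0, 0.0])))
```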
REDOX electrochemical energy storage
NASA Technical Reports Server (NTRS)
Thaller, L. H.
1980-01-01
Reservoirs of chemical solutions can store electrical energy with high efficiency. Reactant solutions are stored outside conversion section where charging and discharging reactions take place. Conversion unit consists of stacks of cells connected together in parallel hydraulically, and in series electrically. Stacks resemble fuel cell batteries. System is 99% ampere-hour efficient, 75% watt hour efficient, and has long projected lifetime. Applications include storage buffering for remote solar or wind power systems, and industrial load leveling. Cost estimates are $325/kW of power requirement plus $51/kWh storage capacity. Mass production would reduce cost by about factor of two.
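The quoted cost model is linear in rated power and storage capacity: cost = $325/kW x P + $51/kWh x E. For a hypothetical 10 kW unit buffering 8 hours of output (80 kWh), the sizing being an assumed example:

```python
# Worked example of the quoted REDOX cost estimate.
def redox_cost(power_kw, energy_kwh):
    return 325 * power_kw + 51 * energy_kwh

print(redox_cost(10, 80))   # 3250 + 4080 = $7,330
```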
Argonne simulation framework for intelligent transportation systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ewing, T.; Doss, E.; Hanebutte, U.
1996-04-01
A simulation framework has been developed which defines a high-level architecture for a large-scale, comprehensive, scalable simulation of an Intelligent Transportation System (ITS). The simulator is designed to run on parallel computers and distributed (networked) computer systems; however, a version for a stand-alone workstation is also available. The ITS simulator includes an Expert Driver Model (EDM) of instrumented "smart" vehicles with in-vehicle navigation units. The EDM is capable of performing optimal route planning and communicating with Traffic Management Centers (TMC). A dynamic road map database is used for optimum route planning, where the data is updated periodically to reflect any changes in road or weather conditions. The TMC has probe vehicle tracking capabilities (display position and attributes of instrumented vehicles), and can provide 2-way interaction with traffic to provide advisories and link times. Both the in-vehicle navigation module and the TMC feature detailed graphical user interfaces that include human-factors studies to support safety and operational research. Realistic modeling of variations of the posted driving speed is based on human factor studies that take into consideration weather, road conditions, the driver's personality and behavior, and vehicle type. The simulator has been developed on a distributed system of networked UNIX computers, but is designed to run on ANL's IBM SP-X parallel computer system for large-scale problems. A novel feature of the developed simulator is that vehicles are represented by autonomous computer processes, each with a behavior model which performs independent route selection and reacts to external traffic events much like real vehicles. Vehicle processes interact with each other and with ITS components by exchanging messages. With this approach, one will be able to take advantage of emerging massively parallel processor (MPP) systems.
NASA Astrophysics Data System (ADS)
Bellerby, Tim
2015-04-01
PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling being the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load imbalance at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference: Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
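The nested distribution rule described above (processors shared among tasks, or tasks shared among processors, whichever is more numerous) can be sketched in a few lines. Even splitting is assumed, matching PM's stated default; the helper name is illustrative and this is not PM's actual implementation:

```python
# Divide processors among tasks, or tasks among processors, depending
# on which side of a parallel statement is more numerous.

def distribute(n_tasks, n_procs):
    if n_tasks <= n_procs:
        # each task owns a contiguous share of the processors
        base, extra = divmod(n_procs, n_tasks)
        return [base + (1 if t < extra else 0) for t in range(n_tasks)]
    # otherwise each processor owns a share of the tasks
    base, extra = divmod(n_tasks, n_procs)
    return [base + (1 if p < extra else 0) for p in range(n_procs)]

print(distribute(3, 8))   # [3, 3, 2] processors per task
print(distribute(8, 3))   # [3, 3, 2] tasks per processor
```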
Ho, ThienLuan; Oh, Seung-Rohk
2017-01-01
Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results for real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over a sequential CPU algorithm and a previous parallel approximate string matching algorithm on GPUs, respectively.
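As a serial reference for what the GPU kernel computes, the k-differences problem admits a classic dynamic program whose first row is all zeros, so a match may start anywhere in the text; the GPU algorithm distributes this table's cells across warp threads and shares neighboring cells via shuffles. A minimal sketch with toy data:

```python
# Report end positions in the text where the pattern matches a substring
# with at most k edits (substitutions, insertions, deletions).

def k_difference_matches(pattern, text, k):
    m = len(pattern)
    prev = [0] * (len(text) + 1)        # row 0: a match can start anywhere
    for i in range(1, m + 1):
        curr = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # substitute/match
                          prev[j] + 1,         # delete from pattern
                          curr[j - 1] + 1)     # insert into pattern
        prev = curr
    return [j for j, d in enumerate(prev) if d <= k and j > 0]

print(k_difference_matches("ACGT", "TTACGGTT", 1))   # [5, 6, 7]
```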
Electronic scraps--recovering of valuable materials from parallel wire cables.
de Araújo, Mishene Christie Pinheiro Bezerra; Chaves, Arthur Pinto; Espinosa, Denise Crocce Romano; Tenório, Jorge Alberto Soares
2008-11-01
Every year, the number of discarded electrical and electronic products is increasing. For this reason recycling is needed to avoid wasting non-renewable natural resources. The objective of this work is to study the recycling of materials from parallel wire cables through unit operations of mineral processing. Parallel wire cables are basically composed of polymer and copper. The following unit operations were tested: grinding, size classification, dense medium separation, electrostatic separation, scrubbing, panning, and elutriation. It was observed that the operations used obtained copper and PVC concentrates with a low degree of cross contamination. It was concluded that total liberation of the materials was accomplished after grinding to less than 3 mm using a cage mill. Separation using panning and elutriation presented the best results in terms of recovery and cross contamination.
Parallel Fast Multipole Method For Molecular Dynamics
2007-06-01
Parallel Fast Multipole Method for Molecular Dynamics. Thesis, Reid G. Ormseth, Captain, USAF, AFIT/GAP/ENP/07-J02, Department of the Air Force, Air Force Institute of Technology. Background material is drawn from 'The Art of Molecular Dynamics Simulation' by Dennis Rapaport, described as the clearest treatment of the Fast Multipole Method.
NASA Astrophysics Data System (ADS)
Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.
2017-07-01
Calculation of matrix-vector multiplication in real-world problems often involves large matrices of arbitrary size. Therefore, parallelization is needed to speed up the calculation process, which usually takes a long time. Graph partitioning techniques discussed in previous studies cannot be used to parallelize matrix-vector multiplication with matrices of arbitrary size, because graph partitioning assumes square, symmetric matrices. Hypergraph partitioning techniques overcome this shortcoming of graph partitioning. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on the GPU (graphics processing unit).
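The kernel being partitioned is a sparse matrix-vector multiply. In CSR form, rows are independent, so a partitioner (graph or hypergraph, the latter also handling rectangular matrices) assigns row blocks to processors, and on a GPU each thread or row block computes its rows. A serial CSR sketch with toy data:

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a CSR matrix; each row is independent work, which
    is the loop a partitioner distributes across processors or threads."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = data[start:end] @ x[indices[start:end]]
    return y

# 3x4 matrix [[1,0,2,0],[0,3,0,0],[4,0,0,5]] in CSR form
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 3])
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(spmv_csr(indptr, indices, data, np.ones(4)))   # [3. 3. 9.]
```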
Komarov, Ivan; D'Souza, Roshan M
2012-01-01
The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
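For context, one exact direct-method GSSA realization looks as follows for a toy reversible isomerization A <-> B (rates and counts are assumed for illustration); the GPU variant described above maps one such realization to a warp and runs many realizations across multiprocessors:

```python
import numpy as np

def gillespie(x_a, x_b, k1, k2, t_end, rng):
    """One exact direct-method trajectory of A <-> B."""
    t, times, states = 0.0, [0.0], [(x_a, x_b)]
    while t < t_end:
        a = np.array([k1 * x_a, k2 * x_b])   # reaction propensities
        a0 = a.sum()
        if a0 == 0:
            break
        t += rng.exponential(1.0 / a0)       # time to the next reaction
        if rng.random() < a[0] / a0:         # choose which reaction fires
            x_a, x_b = x_a - 1, x_b + 1
        else:
            x_a, x_b = x_a + 1, x_b - 1
        times.append(t); states.append((x_a, x_b))
    return times, states

times, states = gillespie(100, 0, k1=1.0, k2=0.5, t_end=1.0,
                          rng=np.random.default_rng(1))
print(len(times), states[-1])
```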
Topical perspective on massive threading and parallelism.
Farber, Robert M
2011-09-01
Unquestionably computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three orders of magnitude increased computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive threading and massive parallelism, as in the 1.3-million-thread Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world--is how to capitalize on these new massively threaded computational architectures, especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelism, with some insight into the future. Published by Elsevier Inc.
ERIC Educational Resources Information Center
Johnson, Dennis; And Others
This manual, the second of three curriculum guides for an electronics course, is intended for use in a program combining vocational English as a second language (VESL) with bilingual vocational education. Ten units cover the electrical team, Ohm's law, Watt's law, series resistive circuits, parallel resistive circuits, series parallel circuits,…
Full pillar extraction at the Kathleen Mine with mobile roof supports
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grimm, E.S.
1994-12-31
The Voest Alpine Breaker Line Supports (ABLS) resemble self-propelled longwall shields. Each individual unit consists of four hydraulic legs extending from the base of the unit, pressing a solid flat canopy against the mine roof. Each support unit is capable of exerting 606 tons of force against the roof. A chain curtain on the sides and rear protects the interior of the support from falling rock. The internal scissoring lemniscate design allows for parallel movement of the canopy as it is raised or lowered. Each ABLS has 750 feet of 4 AWG trailing cable to supply 480 volts AC to a permissible controller and a 40 hp explosion-proof electrical motor. The hydraulic pump and reservoir are self-contained and protected with an automatic fire suppression system.
Computer Drawing Method for Operating Characteristic Curve of PV Power Plant Array Unit
NASA Astrophysics Data System (ADS)
Tan, Jianbin
2018-02-01
The engineering design of large-scale grid-connected photovoltaic power stations, and the research and development of the many simulation and analysis systems that support it, require computer plotting of the operating characteristic curves of photovoltaic array units; a segmented non-linear interpolation algorithm is proposed for this purpose. In the calculation method, component performance parameters serve as the main design basis, from which the computer obtains five characteristic performance points of a PV module. Combined with the series and parallel connections of the PV array, the computer can then draw the performance curve of the PV array unit. The computed data can also be fed into PV development software, improving the practical operation of PV array units.
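The series/parallel scaling step is the simple part of such a plotting method: for Ns modules in series and Np strings in parallel, voltages scale by Ns and currents by Np. The sketch below uses an empirical single-module I-V curve as an illustrative stand-in for a curve interpolated through the module's characteristic points:

```python
import numpy as np

def module_curve(v, isc=8.0, voc=40.0, c=5.0):
    """Empirical module I-V shape (illustrative, not a datasheet model)."""
    return isc * (1.0 - (np.exp(c * v / voc) - 1.0) / (np.exp(c) - 1.0))

def array_curve(v_mod, ns, n_par):
    """Series multiplies voltage by ns; parallel multiplies current."""
    return ns * v_mod, n_par * module_curve(v_mod)

v = np.linspace(0.0, 40.0, 200)          # sampled module voltages [V]
v_arr, i_arr = array_curve(v, ns=20, n_par=10)
p = v_arr * i_arr
print(f"array Voc ~ {v_arr[-1]:.0f} V, Isc ~ {i_arr[0]:.1f} A, "
      f"Pmax ~ {p.max() / 1000:.1f} kW")
```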
Performance of the Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
As the input/output (I/O) needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. Initial experiments, reported in this paper, indicate that Galley is capable of providing high-performance I/O to applications that access data in patterns that have been observed to be common.
GPURFSCREEN: a GPU based virtual screening tool using random forest classifier.
Jayaraj, P B; Ajay, Mathias K; Nufail, M; Gopakumar, G; Jaleel, U C A
2016-01-01
In-silico methods are an integral part of the modern drug discovery paradigm. Virtual screening, an in-silico method, is used to refine data models and reduce the chemical space on which wet lab experiments need to be performed. Virtual screening of a ligand data model requires large-scale computations, making it a highly time consuming task. This process can be sped up by implementing parallelized algorithms on a Graphics Processing Unit (GPU). Random Forest is a robust classification algorithm that can be employed in virtual screening. A ligand-based virtual screening tool (GPURFSCREEN) that uses random forests on GPU systems has been proposed and evaluated in this paper. This tool produces optimized results at a lower execution time for large bioassay data sets. The quality of results produced by our tool on the GPU is the same as that in a regular serial environment. Considering the magnitude of data to be screened, the parallelized virtual screening has a significantly lower running time at high throughput. The proposed parallel tool outperforms its serial counterpart by successfully screening billions of molecules in training and prediction phases.
Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K
2010-01-01
An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as the heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as a 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on a system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular Java simulator. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.
Integration of perception and reasoning in fast neural modules
NASA Technical Reports Server (NTRS)
Fritz, David G.
1989-01-01
Artificial neural systems promise to integrate symbolic and sub-symbolic processing to achieve real time control of physical systems. Two potential alternatives exist. In one, neural nets can be used to front-end expert systems. The expert systems, in turn, are developed with varying degrees of parallelism, including their implementation in neural nets. In the other, rule-based reasoning and sensor data can be integrated within a single hybrid neural system. The hybrid system reacts as a unit to provide decisions (problem solutions) based on the simultaneous evaluation of data and rules. Discussed here is a model hybrid system based on the fuzzy cognitive map (FCM). The operation of the model is illustrated with the control of a hypothetical satellite that intelligently alters its attitude in space in response to an intersecting micrometeorite shower.
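A fuzzy cognitive map of the kind described reduces to repeated matrix-and-squash updates of concept activations, with sensor-driven concepts clamped to their measured values. The three-concept weight matrix below is an illustrative assumption, not the paper's satellite model:

```python
import numpy as np

def fcm_step(activations, W, clamp=None):
    """One FCM update: push activations through the signed weight
    matrix, squash with a logistic, and hold clamped (sensed) inputs."""
    a = 1.0 / (1.0 + np.exp(-W @ activations))
    if clamp is not None:
        for idx, val in clamp.items():
            a[idx] = val
    return a

# concepts: 0 = meteorite flux (sensed), 1 = threat level, 2 = alter attitude
W = np.array([[0.0, 0.0, 0.0],
              [1.5, 0.0, 0.0],      # flux raises threat
              [0.0, 1.2, 0.0]])     # threat drives an attitude change
a = np.array([1.0, 0.0, 0.0])
for _ in range(10):
    a = fcm_step(a, W, clamp={0: 1.0})
print(a.round(2))   # threat and attitude concepts settle at elevated values
```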
The Amazing Labyrinth: An Ancient-Modern Humanities Unit
ERIC Educational Resources Information Center
Ladensack, Carl
1973-01-01
The image of the labyrinth from mythology finds modern-day parallels in architecture, art, music, and literature--all of which contribute to a humanities unit combining the old with the new. (MM)
NASA Astrophysics Data System (ADS)
Ying, Jia-ju; Chen, Yu-dan; Liu, Jie; Wu, Dong-sheng; Lu, Jun
2016-10-01
Misalignment of the parallel binocular optical axes of a photoelectric instrument directly degrades observation. A binocular optical axis parallelism digital calibration system is designed. On the basis of the principle of binocular photoelectric instrument optical axis calibration, the scheme of the system is designed and the binocular optical axis parallelism digital calibration system is realized, comprising four modules: a multiband parallel light tube, optical axis translation, an image acquisition system, and a software system. According to the different characteristics of the thermal infrared imager and the low-light-level night viewer, different algorithms are used to localize the center of the cross reticle, realizing binocular optical axis parallelism calibration for both instruments.
Software Design for Real-Time Systems on Parallel Computers: Formal Specifications.
1996-04-01
This research investigated the important issues related to the analysis and design of real-time systems targeted to parallel architectures. In particular, the software specification models for real-time systems on parallel architectures were evaluated. A survey of current formal methods for uniprocessor real-time systems specifications was conducted to determine their extensibility in specifying real-time systems on parallel architectures.
Parallel/distributed direct method for solving linear systems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate parallel algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmic changes.
Are baboons learning "orthographic" representations? Probably not.
Linke, Maja; Bröker, Franziska; Ramscar, Michael; Baayen, Harald
2017-01-01
The ability of baboons (Papio papio) to distinguish between English words and nonwords has been modeled using a deep learning convolutional network model that simulates a ventral pathway in which lexical representations of different granularity develop. However, given that pigeons (Columba livia), whose brain morphology is drastically different, can also be trained to distinguish between English words and nonwords, it appears that a less species-specific learning algorithm may be required to explain this behavior. Accordingly, we examined whether the learning model of Rescorla and Wagner, which has proved amazingly fruitful in understanding animal and human learning, could account for these data. We show that a discrimination learning network using gradient orientation features as input units and word and nonword units as outputs succeeds in predicting baboon lexical decision behavior, including key lexical similarity effects and the ups and downs in accuracy as learning unfolds, with surprising precision. The model's performance, in which words are not explicitly represented, is remarkable because it is usually assumed that lexicality decisions, including the decisions made by baboons and pigeons, are mediated by explicit lexical representations. By contrast, our results suggest that in learning to perform lexical decision tasks, baboons and pigeons do not construct a hierarchy of lexical units. Rather, they make optimal use of low-level information obtained through the massively parallel processing of gradient orientation features. Accordingly, we suggest that reading in humans first involves learning a high-level system building on letter representations acquired from explicit instruction in literacy, which is then integrated into a conventionalized oral communication system, and that like the latter, fluent reading involves the massively parallel processing of the low-level features encoding semantic contrasts.
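The learning rule at issue is the Rescorla-Wagner error-driven update, which in network form is a delta-rule adjustment of cue-to-outcome weights. A minimal sketch with random stand-ins for the gradient orientation features (the cue coding and learning rate are assumptions, not the paper's fitted model):

```python
import numpy as np

def rw_update(W, cues, outcome, alpha=0.1):
    """Rescorla-Wagner: adjust weights from active cues toward each
    outcome in proportion to the prediction error."""
    pred = W @ cues                      # summed support for each outcome
    error = outcome - pred               # prediction error per outcome
    W += alpha * np.outer(error, cues)   # strengthen/weaken active cues
    return W

rng = np.random.default_rng(0)
n_cues, n_outcomes = 20, 2               # outcomes: [word, nonword]
W = np.zeros((n_outcomes, n_cues))
word_cues = (rng.random(n_cues) < 0.4).astype(float)  # toy feature vector
for _ in range(200):                     # repeated exposure to one word
    W = rw_update(W, word_cues, np.array([1.0, 0.0]))
print((W @ word_cues).round(2))          # support tends toward [1, 0]
```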
Integration experiences and performance studies of a COTS parallel archive system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Gary
2010-01-01
Current and future archive storage systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management, (d) scale to support the changing needs of very large data sets, (e) support standard interfaces, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same things, but at one or more orders of magnitude higher performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds, adopting more caching and less robust semantics. Currently, the number of extremely scalable parallel archive solutions is very small, especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products, including (a) parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high-volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux, such as copy, move, ls, and tar. We have successfully applied our working COTS parallel archive system to the current world's first petaflop/s computing system, LANL's Roadrunner, and demonstrated its capability to address the requirements of future archival storage systems.
Integration experiments and performance studies of a COTS parallel archive system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Gary
2010-06-16
Current and future archive storage systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management, (d) scale to support the changing needs of very large data sets, (e) support standard interfaces, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same things, but at one or more orders of magnitude higher performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds, adopting more caching and less robust semantics. Currently, the number of extremely scalable parallel archive solutions is very small, especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products, including (a) parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high-volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux, such as copy, move, ls, and tar. We have successfully applied our working COTS parallel archive system to the current world's first petaflop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address the requirements of future archival storage systems.
A real-time GNSS-R system based on software-defined radio and graphics processing units
NASA Astrophysics Data System (ADS)
Hobiger, Thomas; Amagai, Jun; Aida, Masanori; Narita, Hideki
2012-04-01
Reflected signals of the Global Navigation Satellite System (GNSS) from the sea or land surface can be used to deduce and monitor physical and geophysical parameters of the reflecting area. Unlike most other remote sensing techniques, GNSS-Reflectometry (GNSS-R) operates as a passive radar that takes advantage of the increasing number of navigation satellites broadcasting their L-band signals. So far, most GNSS-R receiver architectures have been based on dedicated hardware solutions. Software-defined radio (SDR) technology has advanced in recent years and now enables signal processing in real time, which makes it an ideal candidate for the realization of a flexible GNSS-R system. Additionally, modern commodity graphics cards, which offer massive parallel computing performance, allow the whole signal processing chain to be handled without interfering with the PC's CPU. This paper therefore describes a GNSS-R system developed on the principles of software-defined radio supported by General Purpose Graphics Processing Units (GPGPUs), and presents results from initial field tests which confirm the anticipated capability of the system.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Peterson, Steven K
The U.S. Department of Energy (DOE) has a significant programmatic interest in the safe and secure routing and transportation of Spent Nuclear Fuel (SNF) and High Level Waste (HLW) in the United States, including shipments entering the country from locations outside U.S. borders. In any shipment of SNF/HLW, there are multiple chains: a jurisdictional chain as the material moves between jurisdictions (state, federal, tribal, administrative), a physical supply chain (which mode is used), and a custody chain (which stakeholder is in charge or possession of the materials being transported). Within these interconnected networks lie vulnerabilities, whether a lack of communication between interested stakeholders or physical vulnerabilities such as interdiction. By identifying key links and nodes as well as administrative weaknesses, decisions can be made to harden the physical network and improve communication between stakeholders. This paper examines the parallel chains of oversight and custody, as well as the chain of stakeholder interests, for shipments of SNF/HLW and the potential impacts on systemic resiliency. Using the Crystal River shutdown location as well as a hypothetical international shipment brought into the United States, this paper illustrates the parallel chains and maps them out visually.
Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology
NASA Astrophysics Data System (ADS)
Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.
2009-04-01
Currently, the basic problem of tsunami modeling is the low speed of calculations, which is unacceptable for operational warning services. Existing algorithms for the numerical modeling of the hydrodynamic processes of tsunami waves were developed without taking advantage of modern computing facilities. Considerable acceleration of the calculations can be achieved with parallel algorithms. We discuss here a new approach to parallelizing tsunami modeling code using OpenMP technology (for multiprocessor systems with shared memory). Nowadays, multiprocessor systems are easily accessible to everyone, and the cost of using such systems is much lower than that of clusters. This also allows programmers to apply multithreaded algorithms on the desktop computers of researchers. Another important advantage of this approach is the shared-memory mechanism: there is no need to send data over slow networks (for example, Ethernet). All memory is common to all computing processes, which yields almost linear scalability of the program. In the new version of NAMI DANCE, OpenMP technology and a multithreaded algorithm provide an 80% gain in speed over the single-threaded version on a dual-processor unit, and a 320% gain was attained on a four-core processor unit. Thus, it was possible to considerably reduce calculation time on scientific workstations (desktops) without completely changing the program and user interfaces. Further modernization of the algorithms for preparing initial data and processing results using OpenMP looks reasonable. The final version of NAMI DANCE with the increased computational speed can be used not only for research purposes but also in real-time tsunami warning systems.
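For readers unfamiliar with such figures, the following Python sketch converts percentage gains of the kind quoted above into speedup and parallel efficiency; it assumes an "80% gain" means a 1.8x speedup on two processors and a "320% gain" means a 4.2x speedup on four cores, which is our reading of the abstract rather than the authors' stated definition.

    # Sketch: turning reported percentage gains into speedup and
    # parallel efficiency (the mapping of gain to cores is assumed).
    def speedup_and_efficiency(gain_percent, n_cores):
        s = 1.0 + gain_percent / 100.0   # e.g. 80% gain -> 1.8x
        return s, s / n_cores            # efficiency = speedup per core

    for gain, cores in [(80, 2), (320, 4)]:
        s, e = speedup_and_efficiency(gain, cores)
        print(f"{cores} cores: speedup {s:.2f}x, efficiency {e:.2f}")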
P43-S Computational Biology Applications Suite for High-Performance Computing (BioHPC.net)
Pillardy, J.
2007-01-01
One of the challenges of high-performance computing (HPC) is user accessibility. At the Cornell University Computational Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that allows researchers from biological laboratories to submit their jobs to the parallel cluster through an easy-to-use Web interface. Through this system, we are providing users with popular bioinformatics tools including BLAST, HMMER, InterproScan, and MrBayes. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our servers currently processes approximately 8500 job submissions per year, many of them requiring massively parallel computations. It also has a built-in user management system, which can limit software and/or database access to specified users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database, are both using our system for storage and data analysis. The system consists of a Web server running the interface (ASP.NET C#), Microsoft SQL server (ADO.NET), compute cluster running Microsoft Windows, ftp server, and file server. Users can interact with their jobs and data via a Web browser, ftp, or e-mail. The interface is accessible at http://cbsuapps.tc.cornell.edu/.
Research in Parallel Computing: 1987-1990
1994-08-05
emulation, we layered UNIX BSD 4.3 functionality above the kernel primitives, but packaged both as a monolithic unit running in privileged state. This ... further, so that only a "pure kernel" or "microkernel" runs in privileged mode, while the other components of the environment execute as one or more client ... (Table-of-contents fragments: 2.2.2 Nectar's communication software; 2.2.3 A Nectar programming interface; 2.3 System evaluation)
Zhu, Xiang; Zhang, Dianwen
2013-01-01
We present a fast, accurate, and robust parallel Levenberg-Marquardt minimization optimizer, GPU-LMFit, implemented on a graphics processing unit for high-performance, scalable, parallel model fitting. GPU-LMFit provides a dramatic speed-up in massive model fitting analyses, enabling real-time automated pixel-wise parametric imaging microscopy. We demonstrate the performance of GPU-LMFit for applications in super-resolution localization microscopy and fluorescence lifetime imaging microscopy. PMID:24130785
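As an illustration of the arithmetic such an optimizer repeats for every pixel, the following Python sketch implements a single damped Gauss-Newton (Levenberg-Marquardt) step and applies it to a toy exponential fit; it is a serial CPU sketch with hypothetical names, not GPU-LMFit's CUDA implementation.

    # One Levenberg-Marquardt step via damped normal equations
    # (illustrative sketch; not the GPU implementation).
    import numpy as np

    def lm_step(residual, jacobian, params, lam):
        r = residual(params)
        J = jacobian(params)
        JtJ = J.T @ J
        A = JtJ + lam * np.diag(np.diag(JtJ))   # Marquardt damping
        return params + np.linalg.solve(A, -J.T @ r)

    # Toy usage: fit y = a*exp(-b*x) to noisy data.
    x = np.linspace(0, 1, 50)
    y = 2.0 * np.exp(-3.0 * x) + 0.01 * np.random.default_rng(1).standard_normal(50)
    res = lambda p: p[0] * np.exp(-p[1] * x) - y
    jac = lambda p: np.column_stack([np.exp(-p[1] * x),
                                     -p[0] * x * np.exp(-p[1] * x)])
    p = np.array([1.0, 1.0])
    for _ in range(20):
        p = lm_step(res, jac, p, lam=1e-3)

In a GPU setting, each pixel's small solve like this one is independent, which is what makes the fitting embarrassingly parallel.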
A multi-satellite orbit determination problem in a parallel processing environment
NASA Technical Reports Server (NTRS)
Deakyne, M. S.; Anderle, R. J.
1988-01-01
The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube parallel processor to investigate the performance of parallel processors, and to gain experience with them, on a multi-satellite orbit determination problem. A general approach was selected in which major blocks of computation for the multi-satellite orbit computations served as the units assigned to the various processors on the Hypercube, so that problems encountered or successes achieved in addressing the orbit determination problem would be more likely to transfer to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems, and the facility was used to bring a realism to the study that would highlight problems which might not otherwise be anticipated. Secondary objectives were to gain experience with a non-trivial problem in a parallel processor environment; to explore the necessary interplay of serial and parallel sections of the algorithm through timing studies; and to explore granularity (coarse vs. fine grain) in order to discover the limit above which there is a risk of starvation, with the majority of nodes idle, and below which the overhead associated with splitting the problem may require more work and communication time than it saves.
Parallel computing method for simulating hydrological processes of large rivers under climate change
NASA Astrophysics Data System (ADS)
Wang, H.; Chen, Y.
2016-12-01
Climate change is one of the most widely recognized global environmental problems. It has altered the temporal and spatial distribution of watershed hydrological processes, especially in the world's large rivers. Watershed hydrological process simulation based on physically based distributed hydrological models can give better results than lumped models. However, such simulation involves a large amount of calculation, especially for large rivers, and thus needs huge computing resources that may not be steadily available to researchers, or only at high expense; this has seriously restricted research and application. Current parallel methods mostly parallelize in the space and time dimensions, calculating the natural features of a distributed hydrological model in order, grid by grid (unit by unit, basin by basin) from upstream to downstream. This article proposes a high-performance computing method for hydrological process simulation with a high speedup ratio and parallel efficiency. It combines the temporal and spatial runoff characteristics of a distributed hydrological model with distributed data storage, an in-memory database, distributed computing, and parallel computing based on computing power units. The method is highly adaptable and extensible: it makes full use of computing and storage resources under conditions of limited resources, and computing efficiency improves linearly as computing resources increase. The method can satisfy the parallel computing requirements of hydrological process simulation in small, medium, and large rivers.
High Input Voltage, Silicon Carbide Power Processing Unit Performance Demonstration
NASA Technical Reports Server (NTRS)
Bozak, Karin E.; Pinero, Luis R.; Scheidegger, Robert J.; Aulisio, Michael V.; Gonzalez, Marcelo C.; Birchenough, Arthur G.
2015-01-01
A silicon carbide brassboard power processing unit has been developed by the NASA Glenn Research Center in Cleveland, Ohio. The power processing unit operates from two sources: a nominal 300 Volt high voltage input bus and a nominal 28 Volt low voltage input bus. The design of the power processing unit includes four low voltage, low power auxiliary supplies, and two parallel 7.5 kilowatt (kW) discharge power supplies that are capable of providing up to 15 kilowatts of total power at 300 to 500 Volts (V) to the thruster. Additionally, the unit contains a housekeeping supply, high voltage input filter, low voltage input filter, and master control board, such that the complete brassboard unit is capable of operating a 12.5 kilowatt Hall effect thruster. The performance of the unit was characterized under both ambient and thermal vacuum test conditions, and the results demonstrate exceptional performance with full power efficiencies exceeding 97%. The unit was also tested with a 12.5 kW Hall effect thruster to verify compatibility and output filter specifications. With space-qualified silicon carbide or similar high voltage, high efficiency power devices, this design would provide a solution to address the need for high power electric propulsion systems.
Use of general purpose graphics processing units with MODFLOW
Hughes, Joseph D.; White, Jeremy T.
2013-01-01
To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete lower-upper (LU), modified-incomplete LU, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the CPU and GPGPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
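To make the solver structure concrete, here is a minimal Python sketch of a Jacobi-preconditioned conjugate gradient iteration on a compressed sparse row matrix, the combination named above; the toy matrix and tolerances are illustrative choices, not the UPCG solver's actual code.

    # Jacobi-preconditioned CG on a CSR matrix (simplified sketch).
    import numpy as np
    import scipy.sparse as sp

    def jacobi_pcg(A, b, tol=1e-8, maxiter=500):
        Minv = 1.0 / A.diagonal()            # Jacobi preconditioner
        x = np.zeros_like(b)
        r = b - A @ x
        z = Minv * r
        p = z.copy()
        rz = r @ z
        for _ in range(maxiter):
            Ap = A @ p                        # sparse mat-vec: the GPU-friendly kernel
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                break
            z = Minv * r
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    # Toy usage on a 1-D Poisson matrix stored in CSR format.
    n = 1000
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    x = jacobi_pcg(A, b)

The dominant cost is the sparse matrix-vector product, which is exactly the operation that benefits most from a GPGPU.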
Examining the architecture of cellular computing through a comparative study with a computer
Wang, Degeng; Gribskov, Michael
2005-01-01
The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software–hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's ‘hardware’ equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the ‘bandwidth’ of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179
Examining the architecture of cellular computing through a comparative study with a computer.
Wang, Degeng; Gribskov, Michael
2005-06-22
The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed.
Introduction of Parallel GPGPU Acceleration Algorithms for the Solution of Radiative Transfer
NASA Technical Reports Server (NTRS)
Godoy, William F.; Liu, Xu
2011-01-01
General-purpose computing on graphics processing units (GPGPU) is a recent technique that allows the parallel graphics processing unit (GPU) to accelerate calculations performed sequentially by the central processing unit (CPU). To introduce GPGPU to radiative transfer, the Gauss-Seidel solution of the well-known expressions for 1-D and 3-D homogeneous, isotropic media is selected as a test case. Different algorithms are introduced to balance memory and GPU-CPU communication, critical aspects of GPGPU. Results show that speed-ups of one to two orders of magnitude are obtained when compared to sequential solutions. The underlying value of GPGPU is its potential extension in radiative solvers (e.g., Monte Carlo, discrete ordinates) at a minimal learning curve.
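As a point of reference for the test case, the following Python sketch shows a plain Gauss-Seidel sweep for a generic linear system; it is the kind of sequential CPU baseline the paper accelerates, with a toy diagonally dominant matrix standing in for the radiative transfer discretization, which is not reproduced here.

    # Sequential Gauss-Seidel sweeps for Ax = b (illustrative baseline).
    import numpy as np

    def gauss_seidel(A, b, x, sweeps=100):
        n = len(b)
        for _ in range(sweeps):
            for i in range(n):           # in-place updates use the newest values
                s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
                x[i] = (b[i] - s) / A[i, i]
        return x

    # Toy usage on a small diagonally dominant system.
    A = np.array([[4.0, -1, 0], [-1, 4, -1], [0, -1, 4]])
    b = np.array([1.0, 2, 3])
    x = gauss_seidel(A, b, np.zeros(3))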
NASA Astrophysics Data System (ADS)
1990-01-01
System 8400 is an advanced system for the measurement of gas and liquid pressure, along with a variety of other parameters, including voltage, frequency, and digital inputs. System 8400 offers exceptionally high speed data acquisition through parallel processing, and its modular design allows expansion from a relatively inexpensive entry-level system by the addition of modular Input Units that can be installed or removed in minutes. Douglas Juanarena was on the team of engineers that developed a new technology known as ESP (electronically scanned pressure). The Langley ESP measurement system was based on miniature integrated-circuit pressure-sensing transducers that communicated pressure information to a minicomputer. In 1977, Juanarena formed PSI to exploit the NASA technology. In 1978 he left Langley, obtained a NASA license for the technology, and introduced the first commercial product, the 780B pressure measurement system. PSI later developed a pressure scanner for the automation of industrial processes. Now in its second design generation, the DPT-6400 is capable of making 2,000 measurements a second and offers 64 channels through the addition of slave units. The new System 8400 represents PSI's bid to further exploit the $600 million U.S. industrial pressure measurement market. It is geared to provide a turnkey solution to physical measurement.
Large Scale GW Calculations on the Cori System
NASA Astrophysics Data System (ADS)
Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven
The NERSC Cori system, powered by 9000+ Intel Xeon-Phi processors, represents one of the largest HPC systems for open-science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node level and system-scale optimizations. We highlight multiple large scale (thousands of atoms) case studies and discuss both absolute application performance and comparison to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.
Removal of gasoline volatile organic compounds via air biofiltration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, R.S.; Saberiyan, A.G.; Esler, C.T.
1995-12-31
Volatile organic compounds (VOCs) generated by vapor extraction and air-stripping systems can be biologically treated in an air biofiltration unit. An air biofilter consists of one or more beds of packing material inoculated with heterotrophic microorganisms capable of degrading the organic contaminant of concern. Waste gases and oxygen are passed through the inoculated packing material, where the microorganisms degrade the contaminant and release CO₂ and H₂O. Based on data obtained from a treatability study, a full-scale unit was designed and constructed to treat gasoline vapors generated by a vapor-extraction and groundwater-treatment system at a site in California. The unit is composed of two cylindrical reactors with a total packing volume of 3 m³. Both reactors are packed with sphagnum moss and inoculated with hydrocarbon-degrading microorganisms of Pseudomonas and Arthrobacter spp. The two reactors are connected in series for air-flow passage. Parallel lines are used for the injection of water, nutrients, and buffer into each reactor. Data collected during the startup program have demonstrated an air biofiltration unit with high organic-vapor-removal efficiency.
Real-time track-less Cherenkov ring fitting trigger system based on Graphics Processing Units
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Cotta Ramusino, A.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Gianoli, A.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-12-01
The parallel computing power of commercial Graphics Processing Units (GPUs) is exploited to perform real-time ring fitting at the lowest trigger level using information coming from the Ring Imaging Cherenkov (RICH) detector of the NA62 experiment at CERN. To this purpose, direct GPU communication with a custom FPGA-based board has been used to reduce the data transmission latency. The GPU-based trigger system is currently integrated in the experimental setup of the RICH detector of the NA62 experiment, in order to reconstruct ring-shaped hit patterns. The ring-fitting algorithm running on GPU is fed with raw RICH data only, with no information coming from other detectors, and is able to provide more complex trigger primitives with respect to the simple photodetector hit multiplicity, resulting in a higher selection efficiency. The performance of the system for multi-ring Cherenkov online reconstruction obtained during the NA62 physics run is presented.
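For intuition about why ring fitting parallelizes well, the following Python sketch shows an algebraic (Kasa-style) least-squares circle fit: each candidate ring reduces to one small linear solve, so many rings can be fitted independently. This is a generic illustration, not the NA62 trigger algorithm.

    # Algebraic circle fit: solve 2*cx*x + 2*cy*y + c = x^2 + y^2
    # in least squares, then recover radius (illustrative sketch).
    import numpy as np

    def fit_ring(x, y):
        A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
        rhs = x**2 + y**2
        (cx, cy, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
        return cx, cy, np.sqrt(c + cx**2 + cy**2)

    # Toy usage: noisy hits on a ring of radius 1 centred at (0.5, -0.2).
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 20)
    x = 0.5 + np.cos(t) + 0.01 * rng.standard_normal(20)
    y = -0.2 + np.sin(t) + 0.01 * rng.standard_normal(20)
    print(fit_ring(x, y))   # approx (0.5, -0.2, 1.0)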
Rubus: A compiler for seamless and extensible parallelism.
Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor
2017-01-01
Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special-purpose processing unit called the Graphics Processing Unit (GPU), originally designed for 2D/3D games, is now available for general-purpose use in computers and mobile devices. However, traditional programming languages, which were designed to work with machines having single-core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. As a result, code written in these languages is difficult to understand, debug, and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of it in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimization. This paper proposes a new open-source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without requiring a programmer's expertise in parallel programming. Across five different benchmarks, an average speedup of 34.54 times was achieved by Rubus compared to Java on a basic GPU having only 96 cores, and for a matrix multiplication benchmark an average execution speedup of 84 times was achieved on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program.
Rubus: A compiler for seamless and extensible parallelism
Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor
2017-01-01
Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special-purpose processing unit called the Graphics Processing Unit (GPU), originally designed for 2D/3D games, is now available for general-purpose use in computers and mobile devices. However, traditional programming languages, which were designed to work with machines having single-core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. As a result, code written in these languages is difficult to understand, debug, and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of it in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimization. This paper proposes a new open-source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without requiring a programmer's expertise in parallel programming. Across five different benchmarks, an average speedup of 34.54 times was achieved by Rubus compared to Java on a basic GPU having only 96 cores, and for a matrix multiplication benchmark an average execution speedup of 84 times was achieved on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program. PMID:29211758
Peng, Kuan; He, Ling; Zhu, Ziqiang; Tang, Jingtian; Xiao, Jiaying
2013-12-01
Compared with commonly used analytical reconstruction methods, the frequency-domain finite element method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented within the framework of the FEM-based reconstruction algorithm using a graphics-processing-unit parallel framework, the compute unified device architecture (CUDA). A series of simulation experiments is carried out to test the accuracy and the acceleration of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced, by a factor of 38.9, with a GTX 580 graphics card.
Implementation and performance of parallel Prolog interpreter
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wei, S.; Kale, L.V.; Balkrishna, R.
1988-01-01
In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE-OR process model, which exploits both AND and OR parallelism in logic programs. It is machine independent, as it runs on top of the chare kernel, a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark programs on parallel machines, including shared-memory systems (an Alliant FX/8, a Sequent, and a MultiMax) and a non-shared-memory system (an Intel iPSC/32 hypercube), in addition to its performance on a multiprocessor simulation system.
Parallel processing and expert systems
NASA Technical Reports Server (NTRS)
Lau, Sonie; Yan, Jerry C.
1991-01-01
Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.
Speech motor development: Integrating muscles, movements, and linguistic units.
Smith, Anne
2006-01-01
A fundamental problem for those interested in human communication is to determine how ideas and the various units of language structure are communicated through speaking. The physiological concepts involved in the control of muscle contraction and movement are theoretically distant from the processing levels and units postulated to exist in language production models. A review of the literature on adult speakers suggests that they engage complex, parallel processes involving many units, including sentence, phrase, syllable, and phoneme levels. Infants must develop multilayered interactions among language and motor systems. This discussion describes recent studies of speech motor performance relative to varying linguistic goals during the childhood, teenage, and young adult years. Studies of the developing interactions between speech motor and language systems reveal both qualitative and quantitative differences between the developing and the mature systems. These studies provide an experimental basis for a more comprehensive theoretical account of how mappings between units of language and units of action are formed and how they function. Readers will be able to: (1) understand the theoretical differences between models of speech motor control and models of language processing, as well as the nature of the concepts used in the two different kinds of models, (2) explain the concept of coarticulation and state why this phenomenon has confounded attempts to determine the role of linguistic units, such as syllables and phonemes, in speech production, (3) describe the development of speech motor performance skills and specify quantitative and qualitative differences between speech motor performance in children and adults, and (4) describe experimental methods that allow scientists to study speech and limb motor control, as well as compare units of action used to study non-speech and speech movements.
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
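As an indication of the kind of particle loop being tuned, here is a minimal Python sketch of an SPH density summation with a poly6-style smoothing kernel over a brute-force all-pairs distance matrix; the kernel choice, particle count, and smoothing length are illustrative, and the authors' optimized GPU/MIC/CPU implementations are not reproduced.

    # Brute-force SPH density summation (illustrative sketch).
    import numpy as np

    def w_poly6(r, h):
        """Compact-support poly6 smoothing kernel in 3-D."""
        q = np.clip(1.0 - (r / h) ** 2, 0.0, None)
        return (315.0 / (64.0 * np.pi * h**3)) * q**3

    def density(pos, mass, h):
        """O(N^2) density: sum kernel-weighted masses over all pairs."""
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        return (mass * w_poly6(d, h)).sum(axis=1)

    # Toy usage: 200 random particles in a unit box.
    rng = np.random.default_rng(2)
    pos = rng.random((200, 3))
    rho = density(pos, mass=1.0 / 200, h=0.2)

Production SPH codes replace the all-pairs loop with neighbor lists, and it is precisely the memory-access pattern of that neighbor loop that differs between CUDA and OpenMP implementations.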
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
NASA Astrophysics Data System (ADS)
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the shallow water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides three implementations of a high-resolution 2D shallow water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
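For orientation, the following Python sketch performs the kind of explicit structured-grid update SWsolver accelerates: a first-order Lax-Friedrichs step for the 1-D shallow water equations with a dam-break initial condition; the grid size, time step, and boundary handling are illustrative simplifications of the distributed 2-D solver.

    # First-order Lax-Friedrichs step for 1-D shallow water (sketch).
    import numpy as np

    g = 9.81

    def flux(h, hu):
        u = hu / h
        return np.array([hu, hu * u + 0.5 * g * h * h])

    def lax_friedrichs_step(h, hu, dx, dt):
        q = np.array([h, hu])
        f = flux(h, hu)
        qn = q.copy()
        # interior update; boundary cells kept fixed for simplicity
        qn[:, 1:-1] = (0.5 * (q[:, 2:] + q[:, :-2])
                       - dt / (2 * dx) * (f[:, 2:] - f[:, :-2]))
        return qn[0], qn[1]

    # Toy usage: dam break on 200 cells (dt chosen to satisfy the CFL limit).
    n, dx, dt = 200, 1.0 / 200, 0.001
    h = np.where(np.arange(n) < n // 2, 2.0, 1.0)
    hu = np.zeros(n)
    for _ in range(100):
        h, hu = lax_friedrichs_step(h, hu, dx, dt)

Because each cell update reads only its immediate neighbors, the same stencil maps naturally onto MPI domain decomposition at the cluster level and onto Cell or GPU data-parallel instructions within a node.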
Mountain Plains Learning Experience Guide: Radio and T.V. Repair. Course: A.C. Circuits.
ERIC Educational Resources Information Center
Hoggatt, P.; And Others
One of four individualized courses included in a radio and television repair curriculum, this course focuses on alternating current relationships and computations, transformers, power supplies, series and parallel resistive-reactive circuits, and series and parallel resonance. The course is comprised of eight units: (1) Introduction to Alternating…
Genetic Parallel Programming: design and implementation.
Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong
2006-01-01
This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.
NASA Astrophysics Data System (ADS)
Moon, Hongsik
What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research, and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited from the increased computing power in the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit; it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization, and this dissertation details the investigation undertaken to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared using benchmark software, and the metric was floating-point operations per second (FLOPS), which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore systems? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to the type and utilization of the hardware, such as the CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPS, and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of changing computer hardware platforms in order to provide fast, accurate and efficient solutions to large, complex electromagnetic problems. The research in this dissertation proves that the performance of parallel code is intimately related to the configuration of the computer hardware, and can be maximized for different hardware platforms. To benchmark and optimize the performance of parallel CEM software, a variety of large, complex projects are created and executed on a variety of computer platforms. The computer platforms used in this research are detailed in this dissertation. The projects run as benchmarks are also described in detail and results are presented. The parameters that affect parallel CEM software on High Performance Computing Clusters (HPCC) are investigated. This research demonstrates methods to maximize the performance of parallel CEM software code.
Playback system designed for X-Band SAR
NASA Astrophysics Data System (ADS)
Yuquan, Liu; Changyong, Dou
2014-03-01
SAR (Synthetic Aperture Radar) has extensive applications because it is daylight and weather independent. In particular, the X-Band SAR strip-map system designed by the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, provides high ground-resolution images while covering a large spatial area in a short acquisition time, so it is promising for multiple applications. When a sudden disaster strikes, emergency response requires radar signal data and imagery as soon as possible, in order to take action quickly to reduce losses and save lives. This paper summarizes a type of X-Band SAR playback processing system designed for disaster response and scientific needs. It describes the SAR data workflow, including the payload data transmission and reception process. The playback processing system performs signal analysis on the original data, providing SAR level-0 products and quick-look images. A gigabit network provides efficient radar signal transmission from the recorder to the calculation unit, while multithreaded parallel computing and ping-pong operation ensure computation speed. Together, the gigabit network, multithreaded parallel computing, and ping-pong operation allow high-speed data transmission and processing to meet the real-time requirement of SAR data playback.
Graphics Processors in HEP Low-Level Trigger Systems
NASA Astrophysics Data System (ADS)
Ammendola, Roberto; Biagioni, Andrea; Chiozzi, Stefano; Cotta Ramusino, Angelo; Cretaro, Paolo; Di Lorenzo, Stefano; Fantechi, Riccardo; Fiorini, Massimiliano; Frezza, Ottorino; Lamanna, Gianluca; Lo Cicero, Francesca; Lonardo, Alessandro; Martinelli, Michele; Neri, Ilaria; Paolucci, Pier Stanislao; Pastorelli, Elena; Piandani, Roberto; Pontisso, Luca; Rossetti, Davide; Simula, Francesco; Sozzi, Marco; Vicini, Piero
2016-11-01
Usage of Graphics Processing Units (GPUs) in the so called general-purpose computing is emerging as an effective approach in several fields of science, although so far applications have been employing GPUs typically for offline computations. Taking into account the steady performance increase of GPU architectures in terms of computing power and I/O capacity, the real-time applications of these devices can thrive in high-energy physics data acquisition and trigger systems. We will examine the use of online parallel computing on GPUs for the synchronous low-level trigger, focusing on tests performed on the trigger system of the CERN NA62 experiment. To successfully integrate GPUs in such an online environment, latencies of all components need analysing, networking being the most critical. To keep it under control, we envisioned NaNet, an FPGA-based PCIe Network Interface Card (NIC) enabling GPUDirect connection. Furthermore, it is assessed how specific trigger algorithms can be parallelized and thus benefit from a GPU implementation, in terms of increased execution speed. Such improvements are particularly relevant for the foreseen Large Hadron Collider (LHC) luminosity upgrade where highly selective algorithms will be essential to maintain sustainable trigger rates with very high pileup.
NASA Astrophysics Data System (ADS)
Mills, R. T.
2014-12-01
As the high performance computing (HPC) community pushes towards the exascale horizon, the importance and prevalence of fine-grained parallelism in new computer architectures is increasing. This is perhaps most apparent in the proliferation of so-called "accelerators" such as the Intel Xeon Phi or NVIDIA GPGPUs, but the trend also holds for CPUs, where serial performance has grown slowly and effective use of hardware threads and vector units are becoming increasingly important to realizing high performance. This has significant implications for weather, climate, and Earth system modeling codes, many of which display impressive scalability across MPI ranks but take relatively little advantage of threading and vector processing. In addition to increasing parallelism, next generation codes will also need to address increasingly deep hierarchies for data movement: NUMA/cache levels, on node vs. off node, local vs. wide neighborhoods on the interconnect, and even in the I/O system. We will discuss some approaches (grounded in experiences with the Intel Xeon Phi architecture) for restructuring Earth science codes to maximize concurrency across multiple levels (vectors, threads, MPI ranks), and also discuss some novel approaches for minimizing expensive data movement/communication.
Li, Kangkang; Yu, Hai; Feron, Paul; Tade, Moses; Wardhaugh, Leigh
2015-08-18
Using a rate-based model, we assessed the technical feasibility and energy performance of an advanced aqueous-ammonia-based postcombustion capture process integrated with a coal-fired power station. The capture process consists of three identical process trains in parallel, each containing a CO2 capture unit, an NH3 recycling unit, a water separation unit, and a CO2 compressor. A sensitivity study of important parameters, such as NH3 concentration, lean CO2 loading, and stripper pressure, was performed to minimize the energy consumption involved in the CO2 capture process. Process modifications of the rich-split process and the interheating process were investigated to further reduce the solvent regeneration energy. The integrated capture system was then evaluated in terms of the mass balance and the energy consumption of each unit. The results show that our advanced ammonia process is technically feasible and energy-competitive, with a low net power-plant efficiency penalty of 7.7%.
Kruger, Tina M; Gilland, Sarah; Frank, Jacquelyn B; Murphy, Bridget C; English, Courtney; Meade, Jana; Morrow, Kaylee; Rush, Evan
2017-01-01
In May 2014, a short-term study-abroad experience was conducted in Finland through a course offered at Indiana State University (ISU). Students and faculty from ISU and Eastern Illinois University participated in the experience, which was created to facilitate a cross-cultural comparison of long-term-care settings in the United States and Finland. With its outstanding system of caring for the health and social needs of its aging populace, Finland is a logical model to examine when considering ways to improve the quality of life for older adults who require care in the United States. Those participating in the course visited a series of long-term-care facilities in the region surrounding Terre Haute, Indiana, then travelled to Lappeenranta, Finland, to visit parallel sites. Through limited-participation observation and semistructured interviews, similarities and differences in the experiences, educations, and policies affecting long-term-care workers in the United States and Finland were identified and are described here.
Iterative algorithms for large sparse linear systems on parallel computers
NASA Technical Reports Server (NTRS)
Adams, L. M.
1982-01-01
Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
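As a concrete instance of a parallel linear stationary iteration of the kind analyzed above, the following Python sketch shows the Jacobi method, in which every unknown updates independently and so maps naturally onto an array architecture; the test matrix and sweep count are illustrative, not drawn from the paper.

    # Jacobi iteration: all components update simultaneously, so each
    # sweep is embarrassingly parallel (illustrative sketch).
    import numpy as np

    def jacobi(A, b, x0, sweeps=200):
        D = np.diag(A)
        R = A - np.diagflat(D)
        x = x0
        for _ in range(sweeps):
            x = (b - R @ x) / D
        return x

    # Toy usage on a 1-D Poisson matrix from a finite difference stencil.
    n = 50
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = jacobi(A, b, np.zeros(n))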
NASA Technical Reports Server (NTRS)
Batcher, K. E.; Eddey, E. E.; Faiss, R. O.; Gilmore, P. A.
1981-01-01
The processing of synthetic aperture radar (SAR) signals using the massively parallel processor (MPP) is discussed. The fast Fourier transform convolution procedures employed in the algorithms are described. The MPP architecture comprises an array unit (ARU) which processes arrays of data; an array control unit which controls the operation of the ARU and performs scalar arithmetic; a program and data management unit which controls the flow of data; and a unique staging memory (SM) which buffers and permutes data. The ARU contains a 128 by 128 array of bit-serial processing elements (PE). Two-by-four surarrays of PE's are packaged in a custom VLSI HCMOS chip. The staging memory is a large multidimensional-access memory which buffers and permutes data flowing with the system. Efficient SAR processing is achieved via ARU communication paths and SM data manipulation. Real time processing capability can be realized via a multiple ARU, multiple SM configuration.
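Since the processing rests on FFT convolution, the following Python sketch shows the convolution-theorem kernel together with a toy range-compression example, correlating a delayed chirp against its time-reversed conjugate; the chirp parameters are hypothetical, and the MPP's bit-serial array implementation is not modeled.

    # FFT-based convolution via the convolution theorem (sketch).
    import numpy as np

    def fft_convolve(signal, kernel):
        n = len(signal) + len(kernel) - 1     # full linear convolution length
        S = np.fft.fft(signal, n)             # zero-padded transforms
        K = np.fft.fft(kernel, n)
        return np.fft.ifft(S * K)

    # Toy usage: compress an echo against a time-reversed conjugate chirp,
    # i.e. matched filtering expressed as a convolution.
    t = np.linspace(0, 1, 512)
    chirp = np.exp(1j * np.pi * 100 * t**2)   # linear FM reference
    echo = np.roll(chirp, 40)                 # (circularly) delayed return
    compressed = fft_convolve(echo, np.conj(chirp[::-1]))
    print(np.argmax(np.abs(compressed)))      # peak near len(chirp) - 1 + 40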
Accelerating large-scale protein structure alignments with graphics processing units
2012-01-01
Background Large-scale protein structure alignment, an indispensable tool in structural bioinformatics, poses a tremendous challenge to computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign can incorporate many current methods, such as TM-align and Fr-TM-align, into its parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues of protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massive parallel computing power of GPUs. PMID:22357132
Characterizing parallel file-access patterns on a large-scale multiprocessor
NASA Technical Reports Server (NTRS)
Purakayastha, A.; Ellis, Carla; Kotz, David; Nieuwejaar, Nils; Best, Michael L.
1995-01-01
High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such high-performance parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. This paper is part of the CHARISMA project, which intends to fill this void by measuring real file-system workloads on various production parallel machines. In particular, we present results from the CM-5 at the National Center for Supercomputing Applications. Our results are unique because we collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for parallel file-system design.
High Input Voltage, Silicon Carbide Power Processing Unit Performance Demonstration
NASA Technical Reports Server (NTRS)
Bozak, Karin E.; Pinero, Luis R.; Scheidegger, Robert J.; Aulisio, Michael V.; Gonzalez, Marcelo C.; Birchenough, Arthur G.
2015-01-01
A silicon carbide brassboard power processing unit has been developed by the NASA Glenn Research Center in Cleveland, Ohio. The power processing unit operates from two sources: a nominal 300-Volt high voltage input bus and a nominal 28-Volt low voltage input bus. The design of the power processing unit includes four low voltage, low power supplies that provide power to the thruster auxiliary supplies, and two parallel 7.5-kilowatt power supplies that are capable of providing up to 15 kilowatts of total power at 300 to 500 Volts to the thruster discharge supply. Additionally, the unit contains a housekeeping supply, high voltage input filter, low voltage input filter, and master control board, such that the complete brassboard unit is capable of operating a 12.5-kilowatt Hall Effect Thruster. The performance of the unit was characterized under both ambient and thermal vacuum test conditions, and the results demonstrated exceptional performance, with full-power efficiencies exceeding 97 percent. With a space-qualified silicon carbide or similar high voltage, high efficiency power device, this design could evolve into a flight design for future missions that require high power electric propulsion systems.
GASPRNG: GPU accelerated scalable parallel random number generator library
NASA Astrophysics Data System (ADS)
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs), along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to use GASPRNG the same way as SPRNG on traditional serial or parallel computers, as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install and use GASPRNG. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates streams of pseudorandom numbers identical to those of SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications.
Catalogue identifier: AEOI_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: UTK license.
No. of lines in distributed program, including test data, etc.: 167900
No. of bytes in distributed program, including test data, etc.: 1422058
Distribution format: tar.gz
Programming language: C and CUDA.
Computer: Any PC or workstation with an NVIDIA GPU (tested on Fermi GTX480, Tesla C1060, Tesla M2070).
Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX.
Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives.
RAM: 512 MB-732 MB (main memory on host CPU, depending on the data type of random numbers) / 512 MB (GPU global memory)
Classification: 4.13, 6.5.
Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations can consume limitless random numbers as long as computing resources are available. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphics processing units (GPUs).
Solution method: Multiple copies of random number generators on GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs.
Running time: The tests provided take a few minutes to run.
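The usage model the library exposes, one independent stream per task, can be illustrated with a hedged stand-in; the sketch below uses the C++ standard library rather than the actual SPRNG/GASPRNG API, and all names and seeds are ours.

```cpp
// Hypothetical sketch of the "one independent stream per task" usage model
// (not the SPRNG/GASPRNG API): each worker owns its own seeded generator.
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

int main() {
    const int nstreams = 4;
    std::vector<double> estimates(nstreams);
    std::vector<std::thread> pool;
    for (int id = 0; id < nstreams; ++id)
        pool.emplace_back([id, &estimates] {
            std::seed_seq seq{20240101, id};           // distinct seed per stream
            std::mt19937_64 gen(seq);
            std::uniform_real_distribution<double> u(0.0, 1.0);
            long hits = 0, n = 1'000'000;
            for (long i = 0; i < n; ++i) {             // Monte Carlo pi estimate
                double x = u(gen), y = u(gen);
                if (x * x + y * y < 1.0) ++hits;
            }
            estimates[id] = 4.0 * hits / n;            // each stream is independent
        });
    for (auto& t : pool) t.join();
    for (double e : estimates) std::printf("%f\n", e);
}
```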
Alaska Native Land Claims. Workbook to Accompany Textbook.
ERIC Educational Resources Information Center
Hays, Lydia L.
Written as a companion to the secondary textbook, "Alaska Native Land Claims", this student workbook is organized via 9 units and 39 chapters which parallel the text's organizational format. Each unit presents unit goals and has anywhere from three to five subsections or chapters. Each titled chapter (e.g., Alaska's First Settlers)…
Introducing parallelism to histogramming functions for GEM systems
NASA Astrophysics Data System (ADS)
Krawczyk, Rafał D.; Czarski, Tomasz; Kolasinski, Piotr; Pozniak, Krzysztof T.; Linczuk, Maciej; Byszuk, Adrian; Chernyshova, Maryna; Juszczyk, Bartlomiej; Kasprowicz, Grzegorz; Wojenski, Andrzej; Zabolotny, Wojciech
2015-09-01
This article assesses the potential parallelization of histogramming algorithms in a GEM detector system. Histogramming and preprocessing algorithms in MATLAB were analyzed with regard to adding parallelism. A preliminary implementation of parallel strip histogramming resulted in a speedup. An analysis of the algorithms' parallelizability is presented, and an overview of potential hardware and software support for implementing the parallel algorithm is discussed.
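A common non-locking strategy for this kind of parallelization, shown below as a hypothetical C++ sketch rather than the authors' MATLAB code, gives each thread a private histogram over its slice of the samples and merges the partial results afterward.

```cpp
// Minimal sketch of contention-free parallel histogramming (assumed strategy):
// private per-thread bins, followed by a serial merge.
#include <cstdint>
#include <thread>
#include <vector>

int main() {
    const std::size_t nsamples = 1 << 20, nbins = 256, nthreads = 4;
    std::vector<uint16_t> samples(nsamples);
    for (std::size_t i = 0; i < nsamples; ++i)
        samples[i] = uint16_t(i % 1024);                    // stand-in detector data
    std::vector<std::vector<uint64_t>> partial(nthreads, std::vector<uint64_t>(nbins, 0));
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            std::size_t lo = t * nsamples / nthreads, hi = (t + 1) * nsamples / nthreads;
            for (std::size_t i = lo; i < hi; ++i)
                ++partial[t][samples[i] % nbins];           // private bins: no locking
        });
    for (auto& th : pool) th.join();
    std::vector<uint64_t> hist(nbins, 0);                   // serial merge of partials
    for (std::size_t t = 0; t < nthreads; ++t)
        for (std::size_t b = 0; b < nbins; ++b) hist[b] += partial[t][b];
}
```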
Field Trials of the Multi-Source Approach for Resistivity and Induced Polarization Data Acquisition
NASA Astrophysics Data System (ADS)
LaBrecque, D. J.; Morelli, G.; Fischanger, F.; Lamoureux, P.; Brigham, R.
2013-12-01
Implementing systems of distributed receivers and transmitters for resistivity and induced polarization data is an almost inevitable result of the availability of wireless data communication modules and GPS modules offering precise timing and instrument locations. Such systems have a number of advantages; for example, they can be deployed around obstacles such as rivers, canyons, or mountains, which would be difficult with traditional 'hard-wired' systems. However, deploying a system of identical, small, battery-powered transceivers, each capable of injecting a known current and measuring the induced potential, has an additional and less obvious advantage in that multiple units can inject current simultaneously. The original purpose for using multiple simultaneous current sources (multi-source) was to increase signal levels. In traditional systems, to double the received signal you inject twice the current, which requires you to apply twice the voltage and thus four times the power. Alternatively, one approach to increasing signal levels for large-scale surveys collected using small, battery-powered transceivers is to allow multiple units to transmit in parallel. In theory, using four 400-watt transmitters on separate, parallel dipoles yields roughly the same signal as a single 6400-watt transmitter. Furthermore, implementing the multi-source approach creates the opportunity to apply more complex current flow patterns than simple, parallel dipoles. For a perfect, noise-free system, the multi-source approach adds no new information to a data set that contains a comprehensive set of data collected using single sources. However, for realistic, noisy systems, it appears that multi-source data can substantially impact survey results. In preliminary model studies, the multi-source data produced such startling improvements in subsurface images that even the authors questioned their veracity. Between December of 2012 and July of 2013, we completed multi-source surveys at five sites with depths of exploration ranging from 150 to 450 m. The sites included shallow geothermal sites near Reno, Nevada, Pomarance, Italy, and Volterra, Italy; a mineral exploration site near Timmins, Quebec; and a landslide investigation near Vajont Dam in northern Italy. These sites provided a series of challenges in survey design and deployment, including some extremely difficult terrain and a broad range of background resistivity and induced polarization values. Despite these challenges, comparison of multi-source results to resistivity and induced polarization data collected with more traditional methods supports the thesis that the multi-source approach is capable of providing substantial improvements in both depth of penetration and resolution over conventional approaches.
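The 6400-watt figure follows from the scaling stated in the abstract: received signal amplitude is proportional to the injected current I, while transmitter power grows as the square of the current (P = R I^2). Matching the combined signal of four parallel dipoles each driven at current I (400 W apiece) therefore requires a single source to drive 4I, which costs 4^2 x 400 W = 6400 W.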
NASA Astrophysics Data System (ADS)
Niwase, Hiroaki; Takada, Naoki; Araki, Hiromitsu; Maeda, Yuki; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi
2016-09-01
Parallel calculations of large-pixel-count computer-generated holograms (CGHs) are suitable for multiple-graphics processing unit (multi-GPU) cluster systems. However, it is not easy for a multi-GPU cluster system to accomplish fast CGH calculations when CGH transfers between PCs are required. In these cases, the CGH transfer between the PCs becomes a bottleneck. Usually, this problem occurs only in multi-GPU cluster systems with a single spatial light modulator. To overcome this problem, we propose a simple method using the InfiniBand network. The computational speed of the proposed method using 13 GPUs (NVIDIA GeForce GTX TITAN X) was more than 3000 times faster than that of a CPU (Intel Core i7 4770) when the number of three-dimensional (3-D) object points exceeded 20,480. In practice, we achieved ~40 tera floating point operations per second (TFLOPS) when the number of 3-D object points exceeded 40,960. Our proposed method was able to reconstruct a real-time movie of a 3-D object comprising 95,949 points.
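The per-pixel work that makes CGH generation so parallel-friendly can be sketched as follows; this is a hypothetical CPU-thread rendering of the standard point-source summation, not the authors' multi-GPU code, and all constants are illustrative.

```cpp
// Sketch of point-based CGH fringe computation: every hologram pixel sums
// contributions from all object points, so pixels parallelize trivially.
#include <cmath>
#include <thread>
#include <vector>

struct Point { float x, y, z, amp; };

int main() {
    const int W = 512, H = 512, nthreads = 4;
    const float pitch = 8e-6f;                                // assumed pixel pitch
    const float k = 2.0f * 3.14159265f / 532e-9f;             // assumed 532 nm light
    std::vector<Point> obj = {{0, 0, 0.1f, 1.0f}, {1e-3f, 0, 0.12f, 1.0f}};
    std::vector<float> fringe(W * H, 0.0f);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            for (int row = t; row < H; row += nthreads)       // rows cycle over threads
                for (int col = 0; col < W; ++col) {
                    float px = (col - W / 2) * pitch, py = (row - H / 2) * pitch, s = 0;
                    for (const Point& p : obj) {              // sum over object points
                        float dx = px - p.x, dy = py - p.y;
                        float r = std::sqrt(dx * dx + dy * dy + p.z * p.z);
                        s += p.amp * std::cos(k * r);
                    }
                    fringe[row * W + col] = s;
                }
        });
    for (auto& th : pool) th.join();
}
```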
Control and protection system for paralleled modular static inverter-converter systems
NASA Technical Reports Server (NTRS)
Birchenough, A. G.; Gourash, F.
1973-01-01
A control and protection system was developed for use with a paralleled 2.5-kWe-per-module static inverter-converter system. The control and protection system senses internal and external fault parameters such as voltage, frequency, current, and paralleling current unbalance. A logic system controls contactors to isolate defective power conditioners or loads. The system sequences contactor operation to automatically control parallel operation, startup, and fault isolation. Transient overload protection and fault checking sequences are included. The operation and performance of a control and protection system, with detailed circuit descriptions, are presented.
Accelerating atomistic calculations of quantum energy eigenstates on graphic cards
NASA Astrophysics Data System (ADS)
Rodrigues, Walter; Pecchia, A.; Lopez, M.; Auf der Maur, M.; Di Carlo, A.
2014-10-01
Electronic properties of nanoscale materials require the calculation of eigenvalues and eigenvectors of large matrices. This bottleneck can be overcome by parallel computing techniques or the introduction of faster algorithms. In this paper we report a custom implementation of the Lanczos algorithm with simple restart, optimized for graphical processing units (GPUs). The whole algorithm has been developed using CUDA and runs entirely on the GPU, with a specialized implementation that conserves memory and minimizes host-to-device data transfers. Furthermore, parallel distribution over several GPUs has been attained using the standard Message Passing Interface (MPI). Benchmark calculations performed on a GaN/AlGaN wurtzite quantum dot with up to 600,000 atoms are presented. The empirical tight-binding (ETB) model with an sp3d5s*+spin-orbit parametrization has been used to build the system Hamiltonian (H).
GPU Based Software Correlators - Perspectives for VLBI2010
NASA Technical Reports Server (NTRS)
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved into massively parallel processing systems and have entered the area of non-graphics applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology, either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is investigated, including the pros and cons of this technology. Additionally, a GPU-based software correlator is reviewed with respect to energy consumption per GFlop/s and cost per GFlop/s.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kahlenberg, Volker; Konzett, Juergen; Kaindl, Reinhard
High-pressure synthesis experiments in the system Na₂O-Y₂O₃-SiO₂ revealed the existence of a previously unknown polymorph of NaYSi₂O₆ or Na₃Y₃[Si₃O₉]₂ which was quenched from 3.0 GPa and 1000 °C. Structural investigations on this modification have been performed using single-crystal X-ray diffraction data collected at ambient conditions. Furthermore, unpolarized micro-Raman spectra have been obtained from single-crystal material. The high-P modification of NaYSi₂O₆ crystallizes in the centrosymmetric space group C2/c with 12 formula units per cell (a = 8.2131(9) Å, b = 10.3983(14) Å, c = 17.6542(21) Å, β = 100.804(9)°, V = 1481.0(3) Å³, R(|F|) = 0.033 for 1142 independent observed reflections) and belongs to the group of cyclo-silicates. Basic building units are isolated three-membered [Si₃O₉] rings located in layers parallel to (010). Within a single layer the rings are concentrated in strings parallel to [100]. The sequence of directedness of up (U) or down (D) pointing tetrahedra of a single ring is UUU or DDD, respectively. Stacking of the layers parallel to b results in the formation of a three-dimensional structure in which yttrium and sodium cations are incorporated for charge compensation. In more detail, four non-tetrahedral cation positions can be differentiated which are coordinated by 6 and 8 oxygen ligands. Refinements of the site occupancies did not reveal any indication for mixed Na-Y populations on these positions. Finally, several geometrical parameters of rings occurring in cyclo-trisilicate structures have been compiled and are discussed. - Graphical abstract: Projection of the whole structure of high-P NaYSi₂O₆ parallel to [100].
A method for real-time implementation of HOG feature extraction
NASA Astrophysics Data System (ADS)
Luo, Hai-bo; Yu, Xin-rong; Liu, Hong-mei; Ding, Qing-hai
2011-08-01
Histogram of oriented gradients (HOG) is an efficient feature extraction scheme, and HOG descriptors are widely used in computer vision and image processing for biometrics, target tracking, automatic target detection (ATD), automatic target recognition (ATR), and related purposes. However, the computation involved in HOG feature extraction is unsuitable for hardware implementation because it includes complicated operations. In this paper, an optimized design method and theoretical framework for real-time HOG feature extraction on an FPGA are proposed. The main principle is as follows: first, a parallel gradient computing unit circuit based on a parallel pipeline structure was designed; second, the calculation of the arctangent and square root operations was simplified; finally, a histogram generator based on a parallel pipeline structure was designed to calculate the histogram of each sub-region. Experimental results showed that HOG extraction can be implemented within one pixel period by these computing units.
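For reference, the cell-histogram stage being pipelined can be modeled in software roughly as below; the function name and layout are hypothetical, and the hardware replaces the arctangent and square root with the simplified operations described above.

```cpp
// Software model of one HOG stage: per-pixel gradients feed a 9-bin
// orientation histogram for an 8x8 cell (magnitude-weighted voting).
#include <algorithm>
#include <cmath>
#include <vector>

// Histogram for the cell whose top-left corner is (cx, cy); the caller must
// keep the cell at least one pixel away from the image border.
std::vector<float> hog_cell(const std::vector<float>& img, int w, int cx, int cy) {
    const float PI = 3.14159265f;
    std::vector<float> hist(9, 0.0f);
    for (int y = cy; y < cy + 8; ++y)
        for (int x = cx; x < cx + 8; ++x) {
            float gx = img[y * w + x + 1] - img[y * w + x - 1];   // central difference
            float gy = img[(y + 1) * w + x] - img[(y - 1) * w + x];
            float mag = std::sqrt(gx * gx + gy * gy);
            float ang = std::fmod(std::atan2(gy, gx) + PI, PI);   // unsigned orientation
            int bin = std::min(8, int(ang / (PI / 9.0f)));
            hist[bin] += mag;                                     // magnitude-weighted vote
        }
    return hist;
}

int main() {
    const int w = 16, h = 16;
    std::vector<float> img(w * h);
    for (int i = 0; i < w * h; ++i) img[i] = float(i % 7);        // stand-in image
    std::vector<float> hist = hog_cell(img, w, 4, 4);
}
```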
Potential Application of a Graphical Processing Unit to Parallel Computations in the NUBEAM Code
NASA Astrophysics Data System (ADS)
Payne, J.; McCune, D.; Prater, R.
2010-11-01
NUBEAM is a comprehensive computational Monte Carlo based model for neutral beam injection (NBI) in tokamaks. NUBEAM computes NBI-relevant profiles in tokamak plasmas by tracking the deposition and the slowing of fast ions. At the core of NUBEAM are vector calculations used to track fast ions. These calculations have recently been parallelized to run on MPI clusters. However, cost and interlink bandwidth limit the ability to fully parallelize NUBEAM on an MPI cluster. Recent implementation of double precision capabilities for Graphical Processing Units (GPUs) presents a cost effective and high performance alternative or complement to MPI computation. Commercially available graphics cards can achieve up to 672 GFLOPS double precision and can handle hundreds of thousands of threads. The ability to execute at least one thread per particle simultaneously could significantly reduce the execution time and the statistical noise of NUBEAM. Progress on implementation on a GPU will be presented.
NASA Technical Reports Server (NTRS)
Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)
1990-01-01
Attention is given to such topics as an evaluation of block algorithm variants in LAPACK, a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM-2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercubes vs. two-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
NASA Astrophysics Data System (ADS)
Shi, Sheng-bing; Chen, Zhen-xing; Qin, Shao-gang; Song, Chun-yan; Jiang, Yun-hong
2014-09-01
With the development of science and technology, photoelectric equipment now integrates visible, infrared, and laser systems, and its degree of integration, information content, and complexity are higher than in the past. Parallelism and jumpiness of the optical axes are important performance characteristics of photoelectric equipment and directly affect aiming, ranging, and orientation. Optical-axis jumpiness directly affects the hit precision of precision point-damage weapons, yet facilities for testing this performance have been lacking. In this paper, a test system for measuring the parallelism and jumpiness of optical axes is devised. Accurate aiming is not necessary, and data processing is fully digital during parallelism testing. The system can directly test the parallelism of multiple axes, including the aiming axis against the laser emission axis and the laser emission axis against the laser receiving axis, and it is the first to measure the optical-axis jumpiness of an optical sighting device. It is a universal test system.
The Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.
JSD: Parallel Job Accounting on the IBM SP2
NASA Technical Reports Server (NTRS)
Saphir, William; Jones, James Patton; Walter, Howard (Technical Monitor)
1995-01-01
The IBM SP2 is one of the most promising parallel computers for scientific supercomputing - it is fast and usually reliable. One of its biggest problems is a lack of robust and comprehensive system software. Among other things, this software allows a collection of Unix processes to be treated as a single parallel application. It does not, however, provide accounting for parallel jobs other than what is provided by AIX for the individual process components. Without parallel job accounting, it is not possible to monitor system use, measure the effectiveness of system administration strategies, or identify system bottlenecks. To address this problem, we have written jsd, a daemon that collects accounting data for parallel jobs. jsd records information in a format that is easily machine- and human-readable, allowing us to extract the most important accounting information with very little effort. jsd also notifies system administrators in certain cases of system failure.
Photonic content-addressable memory system that uses a parallel-readout optical disk
NASA Astrophysics Data System (ADS)
Krishnamoorthy, Ashok V.; Marchand, Philippe J.; Yayla, Gökçe; Esener, Sadik C.
1995-11-01
We describe a high-performance associative-memory system that can be implemented by means of an optical disk modified for parallel readout and a custom-designed silicon integrated circuit with parallel optical input. The system can achieve associative recall on 128 × 128 bit images and also on variable-size subimages. The system's behavior and performance are evaluated on the basis of experimental results on a motionless-head parallel-readout optical-disk system, logic simulations of the very-large-scale integrated chip, and a software emulation of the overall system.
RAMA: A file system for massively parallel computers
NASA Technical Reports Server (NTRS)
Miller, Ethan L.; Katz, Randy H.
1993-01-01
This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated into the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
Collectively loading an application in a parallel computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
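A rough rendering of this protocol in MPI terms (the patent targets a parallel computer control system; rank 0 stands in for the job leader, the file name is hypothetical, and error handling is omitted) looks like this:

```cpp
// Sketch of collective application loading: the leader reads the application
// image once from storage and broadcasts it to the subset of compute nodes.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    long size = 0;
    std::vector<char> image;
    if (rank == 0) {                                  // job leader: read the image
        FILE* f = std::fopen("app.bin", "rb");        // hypothetical application file
        std::fseek(f, 0, SEEK_END);
        size = std::ftell(f);
        std::rewind(f);
        image.resize(size);
        std::fread(image.data(), 1, size, f);
        std::fclose(f);
    }
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);          // announce image size
    if (rank != 0) image.resize(size);
    MPI_Bcast(image.data(), (int)size, MPI_CHAR, 0, MPI_COMM_WORLD); // one read, many copies
    MPI_Finalize();
}
```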
Accelerating Wright–Fisher Forward Simulations on the Graphics Processing Unit
Lawrie, David S.
2017-01-01
Forward Wright–Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the Central Processor Unit (CPU), thus limiting their usefulness. However, the single-locus Wright–Fisher forward algorithm is exceedingly parallelizable, with many steps that are so-called “embarrassingly parallel,” consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and programming languages designed to leverage the inherent parallel nature of these processors have allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright–Fisher simulation, or “GO Fish” for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data, all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, GO Fish serves as an exciting template for future research into accelerating computation in evolution. GO Fish is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/. PMID:28768689
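The "embarrassingly parallel" structure is easy to see in a sketch: under the single-locus model, each simulated site evolves independently, so sites can be distributed across threads (GPU threads in GO Fish; portable CPU threads below, with illustrative parameters and a simplified selection step).

```cpp
// Hedged sketch of single-locus Wright-Fisher forward simulation: selection
// shifts the expected allele frequency, binomial sampling supplies drift,
// and every site is independent of every other site.
#include <random>
#include <thread>
#include <vector>

int main() {
    const int nsites = 20000, N = 10000, gens = 500, nthreads = 4;
    const double s = 0.001;                          // selection coefficient (assumed)
    std::vector<double> freq(nsites, 0.5);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            std::mt19937_64 gen(7u + t);             // independent stream per thread
            for (int i = t; i < nsites; i += nthreads)
                for (int g = 0; g < gens; ++g) {
                    double x = freq[i];
                    double xp = x * (1 + s) / (x * (1 + s) + (1 - x));  // selection
                    std::binomial_distribution<int> drift(2 * N, xp);   // drift
                    freq[i] = drift(gen) / double(2 * N);
                }
        });
    for (auto& th : pool) th.join();
}
```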
Triton - False Color of Cantaloupe Terrain
1996-09-26
Voyager violet, green, and ultraviolet images of Triton were map projected into cylindrical coordinates and combined to produce this false color terrain map. Several compositionally distinct terrain and geologic features are portrayed. At center is a gray blue unit referred to as 'cantaloupe' terrain because of its unusual topographic texture. The unit appears to predate other units to the left. Immediately adjacent to the cantaloupe terrain, is a smoother unit, represented by a reddish color, that has been dissected by a prominent fault system. This unit apparently overlies a much higher albedo material, seen farther left. A prominent angular albedo boundary separates relatively undisturbed smooth terrain from irregular patches which have been derived from breakup of the same material. Also visible at the far left are diffuse, elongated streaks, which seem to emanate from circular, often bright centered features. The parallel streaks may represent vented particulate materials blown in the same direction by winds in Triton's thin atmosphere. The Voyager Mission was conducted by JPL for NASA's Office of Space Science and Applications. http://photojournal.jpl.nasa.gov/catalog/PIA00060
The neural circuits of innate fear: detection, integration, action, and memorization
Silva, Bianca A.; Gross, Cornelius T.
2016-01-01
How fear is represented in the brain has generated a lot of research attention, not only because fear increases the chances for survival when appropriately expressed but also because it can lead to anxiety and stress-related disorders when inadequately processed. In this review, we summarize recent progress in the understanding of the neural circuits processing innate fear in rodents. We propose that these circuits are contained within three main functional units in the brain: a detection unit, responsible for gathering sensory information signaling the presence of a threat; an integration unit, responsible for incorporating the various sensory information and recruiting downstream effectors; and an output unit, in charge of initiating appropriate bodily and behavioral responses to the threatful stimulus. In parallel, the experience of innate fear also instructs a learning process leading to the memorization of the fearful event. Interestingly, while the detection, integration, and output units processing acute fear responses to different threats tend to be harbored in distinct brain circuits, memory encoding of these threats seems to rely on a shared learning system. PMID:27634145
A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
Song, Fengguang; Dongarra, Jack
2014-10-01
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
Numerical simulation of disperse particle flows on a graphics processing unit
NASA Astrophysics Data System (ADS)
Sierakowski, Adam J.
In both nature and technology, we commonly encounter solid particles being carried within fluid flows, from dust storms to sediment erosion and from food processing to energy generation. The motion of uncountably many particles in highly dynamic flow environments characterizes the tremendous complexity of such phenomena. While methods exist for the full-scale numerical simulation of such systems, current computational capabilities require the simplification of the numerical task with significant approximation using closure models widely recognized as insufficient. There is therefore a fundamental need for the investigation of the underlying physical processes governing these disperse particle flows. In the present work, we develop a new tool based on the Physalis method for the first-principles numerical simulation of thousands of particles (a small fraction of an entire disperse particle flow system) in order to assist in the search for new reduced-order closure models. We discuss numerous enhancements to the efficiency and stability of the Physalis method, which introduces the influence of spherical particles to a fixed-grid incompressible Navier-Stokes flow solver using a local analytic solution to the flow equations. Our first-principles investigation demands the modeling of unresolved length and time scales associated with particle collisions. We introduce a collision model alongside Physalis, incorporating lubrication effects and proposing a new nonlinearly damped Hertzian contact model. By reproducing experimental studies from the literature, we document extensive validation of the methods. We discuss the implementation of our methods for massively parallel computation using a graphics processing unit (GPU). We combine Eulerian grid-based algorithms with Lagrangian particle-based algorithms to achieve computational throughput up to 90 times faster than the legacy implementation of Physalis for a single central processing unit. By avoiding all data communication between the GPU and the host system during the simulation, we utilize with great efficacy the GPU hardware with which many high performance computing systems are currently equipped. We conclude by looking forward to the future of Physalis with multi-GPU parallelization in order to perform resolved disperse flow simulations of more than 100,000 particles and further advance the development of reduced-order closure models.
NASA Astrophysics Data System (ADS)
Shi, Wei; Hu, Xiaosong; Jin, Chao; Jiang, Jiuchun; Zhang, Yanru; Yip, Tony
2016-05-01
With the development and popularization of electric vehicles, it is urgent and necessary to develop effective management and diagnosis technology for battery systems. In this work, we design a parallel battery model, according to equivalent circuits of the parallel voltage and branch currents, to study the effects of imbalanced currents on parallel large-format LiFePO4/graphite battery systems. Taking a 60 Ah LiFePO4/graphite battery system manufactured by ATL (Amperex Technology Limited, China) as an example, the causes of imbalanced currents in the parallel connection are analyzed using our model, and the associated effect mechanisms on the long-term stability of each single battery are examined. Theoretical and experimental results show that continuously increasing imbalanced currents during cycling are mainly responsible for the capacity fade of LiFePO4/graphite parallel batteries. Suppressing variations of branch currents is thus a good way to avoid fast performance fade of parallel battery systems.
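The basic mechanism, branches sharing one terminal voltage while their currents must sum to the pack current, can be illustrated with a toy two-branch calculation (values are illustrative, not the authors' 60 Ah cell parameters):

```cpp
// Minimal illustration of imbalanced branch currents in a parallel pack:
// branch i has open-circuit voltage ocv[i] and internal resistance r[i],
// all branches share terminal voltage V, and the currents sum to I.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> ocv = {3.30, 3.28};      // volts, slightly mismatched cells
    std::vector<double> r   = {0.0010, 0.0013};  // ohms, mismatched resistances
    double I = 120.0;                            // total discharge current (A)
    double g = 0, gi = 0;                        // sums of 1/R and OCV/R
    for (std::size_t i = 0; i < r.size(); ++i) { g += 1.0 / r[i]; gi += ocv[i] / r[i]; }
    double V = (gi - I) / g;                     // shared terminal voltage
    for (std::size_t i = 0; i < r.size(); ++i)   // branch currents differ markedly
        std::printf("branch %zu: %.1f A\n", i, (ocv[i] - V) / r[i]);
}
```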
Interdisciplinary Science through the Parallel Curriculum Model: Lessons from the Sea
ERIC Educational Resources Information Center
Hathcock, Stephanie J.
2018-01-01
The Parallel Curriculum Model (PCM) lends itself to considering curriculum development from different angles. It begins with a solid Core Curriculum and can then be extended through the Curriculum of Connections, Practice, and Identity. This article showcases a way of thinking about the creation of a PCM unit by providing examples from an…
NASA Astrophysics Data System (ADS)
Shcherbakov, Alexandre S.; Chavez Dagostino, Miguel; Arellanes, Adan O.; Aguirre Lopez, Arturo
2016-09-01
We develop a multi-band spectrometer with a few spatially parallel optical arms for the combined processing of their data flow. Such multi-band capability has various applications in astrophysical scenarios at different scales: from objects in the distant universe to planetary atmospheres in the Solar system. Each optical arm offers its own performance, providing parallel multi-band observations at different scales simultaneously. This capability comes from designing each optical arm individually, exploiting different materials for acousto-optical cells operating in various regimes, frequency ranges, and light wavelengths from independent light sources. Individual beam shapers provide both the needed incident light polarization and the required apodization to increase the dynamic range of the system. After parallel acousto-optical processing, the data flows are united on the joint CCD matrix at the stage of combined electronic data processing. At the moment, the prototype combines three bands, i.e., it includes three spatial optical arms. The first, low-frequency arm operates at central frequencies of 60-80 MHz with a frequency bandwidth of 40 MHz. The second arm is oriented to middle frequencies of 350-500 MHz with a frequency bandwidth of 200-300 MHz. The third arm is intended for ultra-high-frequency radio-wave signals of about 1.0-1.5 GHz with a frequency bandwidth of <300 MHz. To date, the spectrometer has the following preliminary performance: the first arm exhibits a frequency resolution of 20 kHz, while the second and third arms give a resolution of 150-200 kHz. The number of resolvable spots is 1500-2000, depending on the regime of operation. A fourth optical arm for the 3.5 GHz frequency range is currently under construction.
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report Graphics Processing Unit (GPU) and OpenMP parallel implementations of water-specific force calculations and of bond constraints for use in molecular dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss the design of the code in detail and illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is a four-fold improvement over the factor of 10 reported in our initial GPU implementation, which did not include water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints on all bonds, runs in parallel on multiple OpenMP cores or entirely on the GPU. It is based on a conjugate gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of the Particle Mesh Ewald (PME) method.
Kuroda, Noritaka; Hird, Nick; Cork, David G
2006-01-01
During further improvement of a high-throughput, solution-phase synthesis system, new workup tools and apparatus for parallel liquid-liquid extraction and evaporation have been developed. A combination of in-house design and collaboration with external manufacturers has been used to address (1) environmental issues concerning solvent emissions and (2) sample tracking errors arising from manual intervention. A parallel liquid-liquid extraction unit, containing miniature high-speed magnetic stirrers for efficient mixing of organic and aqueous phases, has been developed for use on a multichannel liquid handler. Separation of the phases is achieved by dispensing them into a newly patented filter tube containing a vertical hydrophobic porous membrane, which allows only the organic phase to pass into collection vials positioned below. The vertical positioning of the membrane overcomes the hitherto dependence on the use of heavier-than-water, bottom-phase, organic solvents such as dichloromethane, which are restricted due to environmental concerns. Both small (6-mL) and large (60-mL) filter tubes were developed for parallel phase separation in library and template synthesis, respectively. In addition, an apparatus for parallel solvent evaporation was developed to (1) remove solvent from the above samples with highly efficient recovery and (2) avoid the movement of individual samples between their collection on a liquid handler and registration to prevent sample identification errors. The apparatus uses a diaphragm pump to achieve a dynamic circulating closed system with a heating block for the rack of 96 sample vials and an efficient condenser to trap the solvents. Solvent recovery is typically >98%, and convenient operation and monitoring has made the apparatus the first choice for removal of volatile solvents.
Mathematics Classrooms in Japan, Taiwan, and the United States.
ERIC Educational Resources Information Center
Stigler, James W.; And Others
1987-01-01
Studies were conducted in Chinese, Japanese, and American classrooms during mathematics classes. Large cross-cultural differences were found in variables related to classroom structure and management. These paralleled differences in mathematics achievement among China, Japan, and the United States. (PCB)
De Lazzari, Claudio; Genuini, Igino; Pisanelli, Domenico M; D'Ambrosi, Alessandra; Fedele, Francesco
2014-12-18
There is an established tradition of cardiovascular simulation tools, but the application of this kind of technology in the e-Learning arena is a novel approach. This paper presents an e-Learning environment aimed at teaching the interaction of the cardiovascular and lung systems to health-care professionals. Heart-lung interaction must be analyzed while assisting patients with severe respiratory problems or with heart failure in the intensive care unit. Such patients can be assisted by mechanical ventilatory assistance or by a thoracic artificial lung. The "in silico" cardiovascular simulator was evaluated during a training course given to graduate students of the School of Specialization in Cardiology at 'Sapienza' University in Rome. The training course employed CARDIOSIM©, a numerical simulator of the cardiovascular system. The simulator is able to reproduce the pathophysiological conditions of patients affected by cardiovascular and/or lung disease. In order to study the interactions among the cardiovascular system, the natural lung, and the thoracic artificial lung (TAL), a numerical model of this device was implemented. After a patient's pathological condition had been reproduced, the TAL model was applied in parallel and hybrid configurations during the training course. Results obtained during the training course show that parallel TAL assistance reduces right ventricular end-systolic (diastolic) volume but increases left ventricular end-systolic (diastolic) volume. The percentage changes induced by hybrid TAL assistance on haemodynamic variables are lower than those produced by parallel assistance. Only for the mean pulmonary arterial pressure is the percentage reduction greater with hybrid assistance (about 40%) than with parallel assistance (20-30%). At the end of the course, a short questionnaire was submitted to students in order to assess the quality of the course. The feedback obtained was positive, showing good results with respect to the degree of students' learning and the ease of use of the software simulator.
Tutorial: Performance and reliability in redundant disk arrays
NASA Technical Reports Server (NTRS)
Gibson, Garth A.
1993-01-01
A disk array is a collection of physically small magnetic disks that is packaged as a single unit but operates in parallel. Disk arrays capitalize on the availability of small-diameter disks from a price-competitive market to provide the cost, volume, and capacity of current disk systems but many times their performance. Unfortunately, relative to current disk systems, the larger number of components in disk arrays leads to higher rates of failure. To tolerate failures, redundant disk arrays devote a fraction of their capacity to an encoding of their information. This redundant information enables the contents of a failed disk to be recovered from the contents of non-failed disks. The simplest and least expensive encoding for this redundancy, known as N+1 parity, is highlighted. In addition to compensating for the higher failure rates of disk arrays, redundancy allows highly reliable secondary storage systems to be built much more cost-effectively than is now achieved in conventional duplicated disks. Disk arrays that combine redundancy with the parallelism of many small-diameter disks are often called Redundant Arrays of Inexpensive Disks (RAID). This combination promises improvements to both the performance and the reliability of secondary storage. For example, IBM's premier disk product, the IBM 3390, is compared to a redundant disk array constructed of 84 IBM 0661 3 1/2-inch disks. The redundant disk array has comparable or superior values for each of the metrics given and appears likely to cost less. In the first section of this tutorial, I explain how disk arrays exploit the emergence of high performance, small magnetic disks to provide cost-effective disk parallelism that combats the access and transfer gap problems. The flexibility of disk-array configurations benefits manufacturer and consumer alike. In contrast, I describe in this tutorial's second half how parallelism, achieved through increasing numbers of components, causes overall failure rates to rise. Redundant disk arrays overcome this threat to data reliability by ensuring that data remains available during and after component failures.
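The N+1 parity encoding is simple enough to state in a few lines of code: the parity disk holds the XOR of the data disks, so the contents of any single failed disk equal the XOR of the survivors. The sketch below is a generic illustration, not tied to any particular array product.

```cpp
// N+1 parity demonstration: 4 data disks plus 1 parity disk, one stripe.
#include <array>
#include <cstdint>
#include <vector>

int main() {
    const std::size_t N = 4, stripe = 4096;                // 4 data disks + 1 parity
    std::vector<std::array<uint8_t, 4096>> disk(N + 1);
    for (std::size_t d = 0; d < N; ++d)                    // fill with stand-in data
        for (std::size_t i = 0; i < stripe; ++i) disk[d][i] = uint8_t(d * 31 + i);
    for (std::size_t i = 0; i < stripe; ++i) {             // parity = XOR of data disks
        uint8_t p = 0;
        for (std::size_t d = 0; d < N; ++d) p ^= disk[d][i];
        disk[N][i] = p;
    }
    const std::size_t failed = 2;                          // simulate losing disk 2
    std::array<uint8_t, 4096> rebuilt{};
    for (std::size_t i = 0; i < stripe; ++i)
        for (std::size_t d = 0; d <= N; ++d)
            if (d != failed) rebuilt[i] ^= disk[d][i];     // XOR of the survivors
}
```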
NASA Astrophysics Data System (ADS)
Shi, X.
2015-12-01
As NSF has indicated, "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and the geosciences. With the exponential growth of geodata, the challenge of scalable and high-performance computing for big data analytics becomes urgent, because many research activities are constrained by software or tools that simply cannot complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and the operational time frame. Many large-scale geospatial problems may not be processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) architecture and the Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions that employ massive parallelism and hardware resources to achieve scalability and high performance for data-intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environments to achieve scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise of achieving scalability and high performance by exploiting task- and data-level parallelism that is not supported by conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data, as demonstrated by our prior work, while the potential of such advanced infrastructure remains unexplored in this domain. In this presentation, our prior and ongoing initiatives are summarized to exemplify how we exploit multicore CPUs, GPUs, and MICs, and clusters of CPUs, GPUs, and MICs, to accelerate geocomputation in different applications.
ERIC Educational Resources Information Center
Schaap, Eileen, Ed.; Fresen, Sue, Ed.
This teacher's guide and student guide unit contains supplemental readings, activities, and methods adapted for secondary students who have disabilities and other students with diverse learning needs. The unit focuses on world history and correlates to Florida's Sunshine State Standards. It is divided into the following 21 units of study that…
pcircle - A Suite of Scalable Parallel File System Tools
DOE Office of Scientific and Technical Information (OSTI.GOV)
WANG, FEIYI
2015-10-01
Most file system software is written for conventional local file systems; it is serialized and cannot take advantage of a large-scale parallel file system. The "pcircle" software builds on the ubiquitous MPI in cluster computing environments and a "work-stealing" pattern to provide a scalable, high-performance suite of file system tools. In particular, it implements parallel data copy and parallel data checksumming, with advanced features such as asynchronous progress reporting, checkpoint and restart, and integrity checking.
Bio-inspired multi-mode optic flow sensors for micro air vehicles
NASA Astrophysics Data System (ADS)
Park, Seokjun; Choi, Jaehyuk; Cho, Jihyun; Yoon, Euisik
2013-06-01
Monitoring wide-field surrounding information is essential for vision-based autonomous navigation in micro air vehicles (MAVs). Our image-cube (iCube) module, which consists of multiple sensors facing different angles in 3-D space, can be applied to wide-field optic flow estimation (μ-Compound eyes) and to attitude control (μ-Ocelli) in the Micro Autonomous Systems and Technology (MAST) platforms. In this paper, we report an analog/digital (A/D) mixed-mode optic-flow sensor, which generates both optic flows and normal images in different modes for μ-Compound eyes and μ-Ocelli applications. The sensor employs a time-stamp-based optic flow algorithm, modified from the conventional elementary motion detector (EMD) algorithm, to give an optimum partitioning of hardware blocks in the analog and digital domains as well as an adequate allocation of pixel-level, column-parallel, and chip-level signal processing. Temporal filtering, which may require huge hardware resources if implemented in the digital domain, remains in a pixel-level analog processing unit. The rest of the blocks, including feature detection and time-stamp latching, are implemented using digital circuits in a column-parallel processing unit. Finally, time-stamp information is decoded into velocity using look-up tables, multiplications, and simple subtraction circuits in a chip-level processing unit, thus significantly reducing core digital processing power consumption. In the normal image mode, the sensor generates 8-b digital images using single-slope ADCs in the column unit. In the optic flow mode, the sensor estimates 8-b 1-D optic flows from the integrated mixed-mode algorithm core and 2-D optic flows with external time-stamp processing, respectively.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dritz, K.W.; Boyle, J.M.
This paper addresses the problem of measuring and analyzing the performance of fine-grained parallel programs running on shared-memory multiprocessors. Such processors use locking (either directly in the application program, or indirectly in a subroutine library or the operating system) to serialize accesses to global variables. Given sufficiently high rates of locking, the chief factor preventing linear speedup (besides lack of adequate inherent parallelism in the application) is lock contention - the blocking of processes that are trying to acquire a lock currently held by another process. We show how a high-resolution, low-overhead clock may be used to measure both lock contention and lack of parallel work. Several ways of presenting the results are covered, culminating in a method for calculating, in a single multiprocessing run, both the speedup actually achieved and the speedup lost to contention for each lock and to lack of parallel work. The speedup losses are reported in the same units, "processor-equivalents," as the speedup achieved. Both are obtained without having to perform the usual one-process comparison run. We chronicle also a variety of experiments motivated by actual results obtained with our measurement method. The insights into program performance that we gained from these experiments helped us to refine the parts of our programs concerned with communication and synchronization. Ultimately these improvements reduced lock contention to a negligible amount and yielded nearly linear speedup in applications not limited by lack of parallel work. We describe two generally applicable strategies ("code motion out of critical regions" and "critical-region fissioning") for reducing lock contention and one ("lock/variable fusion") applicable only on certain architectures.
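The core measurement idea can be sketched with portable primitives (the paper relied on a machine-specific high-resolution clock; std::chrono stands in here, and the accumulated time also includes the small uncontended acquisition cost):

```cpp
// Sketch of per-thread lock-contention measurement: time each acquisition
// and accumulate the blocked time, to be reported after the run.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::mutex lock;
    const int nthreads = 4, acquisitions = 100000;
    std::vector<double> blocked_s(nthreads, 0.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            using clk = std::chrono::steady_clock;
            for (int i = 0; i < acquisitions; ++i) {
                auto t0 = clk::now();
                lock.lock();                        // blocks if the lock is contended
                blocked_s[t] +=
                    std::chrono::duration<double>(clk::now() - t0).count();
                lock.unlock();                      // critical region elided
            }
        });
    for (auto& th : pool) th.join();
    for (int t = 0; t < nthreads; ++t)
        std::printf("thread %d blocked %.3f s\n", t, blocked_s[t]);
}
```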
NASA Astrophysics Data System (ADS)
Lü, Hua-Ping; Wang, Shi-Hong; Li, Xiao-Wen; Tang, Guo-Ning; Kuang, Jin-Yu; Ye, Wei-Ping; Hu, Gang
2004-06-01
Two-dimensional one-way coupled map lattices are used for cryptography, where multiple space units produce chaotic outputs in parallel. One of the outputs plays the role of driving for synchronization of the decryption system while the others perform the function of information encoding. With this separation of functions the receiver can establish a self-checking and self-correction mechanism, and enjoys the advantages of both synchronous and self-synchronizing schemes. The present system is compared with the Advanced Encryption Standard (AES) with respect to the influence of channel noise. Numerical investigations show that our system is much more robust than AES against channel noise perturbations, and thus can be better used for secure communications over channels with large noise.
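A minimal Python sketch of the underlying construction, a one-way coupled logistic map lattice whose units evolve in parallel, may help; the coupling constant, map parameter and byte extraction are illustrative assumptions, not the paper's scheme:

    # Minimal sketch of a one-way (ring-)coupled map lattice producing
    # parallel chaotic key streams; constants are assumptions.
    import numpy as np

    def cml_step(x, eps=0.9, a=3.99):
        f = a * x * (1.0 - x)                  # local logistic map
        # one-way coupling: unit j is driven by unit j-1 (ring wrap at j=0)
        return (1 - eps) * f + eps * np.roll(f, 1)

    x = np.linspace(0.1, 0.9, 8)               # 8 space units evolve in parallel
    for _ in range(100):                       # discard the transient
        x = cml_step(x)
    key_bytes = (x[1:] * 255).astype(np.uint8) # units 1..7 encode; unit 0 drives sync
    print(key_bytes)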
NASA Technical Reports Server (NTRS)
Wang, P.; Li, P.
1998-01-01
A high-resolution numerical study of three-dimensional, time-dependent, thermal convective flows on parallel systems is reported. A parallel implementation of the finite volume method with a multigrid scheme is discussed, and a parallel visualization system is developed on distributed systems for visualizing the flow.
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gosink, Luke; Wu, Kesheng; Bethel, E. Wes
2009-06-02
The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture; for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU- and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).
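The bin-then-cluster layout can be sketched compactly. The following Python fragment (serial, with invented sizes) bins the base data, stores each bin as a contiguous cluster, and answers a less-than query by counting whole bins below the query value and scanning only the boundary bin; in DP-BIS the per-record checks run with one thread per record:

    # Simplified, serial sketch of a bin-based index; names are assumptions.
    import numpy as np

    data = np.random.rand(100000)
    edges = np.linspace(0.0, 1.0, 65)              # 64 bins
    bins = np.digitize(data, edges) - 1
    order = np.argsort(bins, kind="stable")
    clustered = data[order]                        # bin-based data clusters
    starts = np.searchsorted(bins[order], np.arange(65))

    def query_less_than(v):
        b = int(np.digitize(v, edges)) - 1
        hits = int(starts[b])                      # records in bins wholly below v
        boundary = clustered[starts[b]:starts[b + 1]]
        return hits + int((boundary < v).sum())    # scan only the boundary bin

    print(query_less_than(0.37), int((data < 0.37).sum()))   # the two counts match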
AdiosStMan: Parallelizing Casacore Table Data System using Adaptive IO System
NASA Astrophysics Data System (ADS)
Wang, R.; Harris, C.; Wicenec, A.
2016-07-01
In this paper, we investigate the Casacore Table Data System (CTDS) used in the casacore and CASA libraries, and methods to parallelize it. CTDS provides a storage manager plugin mechanism for third-party developers to design and implement their own CTDS storage managers. With this in mind, we looked into various storage backend techniques that could enable parallel I/O for CTDS by implementing new storage managers. After carrying out benchmarks that showed the excellent parallel I/O throughput of the Adaptive IO System (ADIOS), we implemented an ADIOS based parallel CTDS storage manager. We then applied the CASA MSTransform frequency split task to verify the ADIOS Storage Manager. We also ran a series of performance tests to examine the I/O throughput in a massively parallel scenario.
NASA Astrophysics Data System (ADS)
Ford, Eric B.
2009-05-01
We present the results of a highly parallel Kepler equation solver using the Graphics Processing Unit (GPU) on a commercial nVidia GeForce 280GTX and the "Compute Unified Device Architecture" (CUDA) programming environment. We apply this to evaluate a goodness-of-fit statistic (e.g., χ2) for Doppler observations of stars potentially harboring multiple planetary companions (assuming negligible planet-planet interactions). Given the high dimensionality of the model parameter space (at least five dimensions per planet), a global search is extremely computationally demanding. We expect that the underlying Kepler solver and model evaluator will be combined with a wide variety of more sophisticated algorithms to provide efficient global search, parameter estimation, model comparison, and adaptive experimental design for radial velocity and/or astrometric planet searches. We tested multiple implementations using single precision, double precision, pairs of single precision, and mixed precision arithmetic. We find that the vast majority of computations can be performed using single precision arithmetic, with selective use of compensated summation for increased precision. However, standard single precision is not adequate for calculating the mean anomaly from the time of observation and orbital period when evaluating the goodness-of-fit for real planetary systems and observational data sets. Using all double precision, our GPU code outperforms a similar code using a modern CPU by a factor of over 60. Using mixed precision, our GPU code provides a speed-up factor of over 600 when evaluating nsys > 1024 model planetary systems, each containing npl = 4 planets and assuming nobs = 256 observations of each system. We conclude that modern GPUs offer a powerful tool for repeatedly evaluating Kepler's equation and a goodness-of-fit statistic for orbital models when presented with a large parameter space.
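For concreteness, here is a minimal Python sketch of the core operation: Newton iteration on Kepler's equation E - e sin E = M, vectorized over many anomalies at once the way a GPU thread grid would be. The mixed-precision and compensated-summation details of the paper are omitted, and the starting guess and iteration count are assumptions:

    # Vectorized Newton solver for Kepler's equation (illustrative sketch).
    import numpy as np

    def kepler_E(M, e, iters=12):
        E = M + e * np.sin(M)                  # common starting guess
        for _ in range(iters):
            E -= (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        return E

    M = np.random.uniform(0, 2 * np.pi, 1 << 20)   # one "thread" per anomaly
    e = np.full_like(M, 0.3)
    E = kepler_E(M, e)
    print(np.max(np.abs(E - e * np.sin(E) - M)))   # residual near machine epsilon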
NAS Requirements Checklist for Job Queuing/Scheduling Software
NASA Technical Reports Server (NTRS)
Jones, James Patton
1996-01-01
The increasing reliability of parallel systems and clusters of computers has resulted in these systems becoming more attractive for true production workloads. Today, the primary obstacle to production use of clusters of computers is the lack of a functional and robust Job Management System for parallel applications. This document provides a checklist of NAS requirements for job queuing and scheduling in order to make the most efficient use of parallel systems and clusters for parallel applications. Future requirements are also identified to assist software vendors with design planning.
Graphics processing unit based computation for NDE applications
NASA Astrophysics Data System (ADS)
Nahas, C. A.; Rajagopal, Prabhu; Balasubramaniam, Krishnan; Krishnamurthy, C. V.
2012-05-01
Advances in parallel processing in recent years are helping to reduce the cost of numerical simulation. Breakthroughs in Graphical Processing Unit (GPU) based computation now offer the prospect of further drastic improvements. The introduction of 'compute unified device architecture' (CUDA) by NVIDIA (the global technology company based in Santa Clara, California, USA) has made programming GPUs for general purpose computing accessible to the average programmer. Here we use CUDA to develop parallel finite difference schemes as applicable to two problems of interest to the NDE community, namely heat diffusion and elastic wave propagation. The implementations are two-dimensional. Performance improvement of the GPU implementation over a serial CPU implementation is then discussed.
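A minimal Python sketch of the explicit finite-difference heat-diffusion update may help; this is the kind of stencil in which each grid point is updated independently and therefore maps onto one GPU thread per point. Grid size, coefficients and the periodic boundary treatment are illustrative assumptions:

    # Explicit 2-D finite-difference heat diffusion (illustrative sketch).
    import numpy as np

    nx = ny = 256
    alpha, dx, dt = 1.0, 1.0, 0.2              # dt <= dx**2/(4*alpha) for stability
    T = np.zeros((ny, nx)); T[ny // 2, nx // 2] = 100.0

    for _ in range(500):
        lap = (np.roll(T, 1, 0) + np.roll(T, -1, 0) +
               np.roll(T, 1, 1) + np.roll(T, -1, 1) - 4 * T) / dx**2
        T = T + alpha * dt * lap               # every point updated independently

    print(T.max(), T.sum())                    # heat spreads; the total is conserved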
Template based parallel checkpointing in a massively parallel computer system
Archer, Charles Jens [Rochester, MN; Inglett, Todd Alan [Rochester, MN
2009-01-13
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in storage and was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
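The rsync-like comparison can be sketched in a few lines of Python: each node hashes its data blocks, compares them against the template's checksums, and stores only the blocks that differ. Block size and hash choice are assumptions, not the patent's specification:

    # Sketch of template-based delta checkpointing (illustrative assumptions).
    import hashlib

    BLOCK = 4096

    def checksums(data):
        return [hashlib.md5(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)]

    def delta_checkpoint(node_data, template_sums):
        delta = {}
        for i in range(0, len(node_data), BLOCK):
            blk = node_data[i:i + BLOCK]
            j = i // BLOCK
            if j >= len(template_sums) or hashlib.md5(blk).digest() != template_sums[j]:
                delta[j] = blk                 # only changed blocks are stored
        return delta

    template = bytes(64 * BLOCK)
    node = bytearray(template); node[5 * BLOCK] = 1    # one block differs
    print(list(delta_checkpoint(bytes(node), checksums(template))))   # -> [5]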
Customizing FP-growth algorithm to parallel mining with Charm++ library
NASA Astrophysics Data System (ADS)
Puścian, Marek
2017-08-01
This paper presents a frequent item mining algorithm that was customized to handle growing data repositories. The proposed solution applies a master-slave scheme to the frequent pattern growth technique. Efficient utilization of the available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers based on their workload. The proposed enhancements have been successfully implemented using the Charm++ library. This paper discusses the performance of the parallelized FP-growth algorithm on different datasets. The approach is illustrated with many experiments and measurements performed on a multiprocessor, multithreaded computer.
Using Parallel Processing for Problem Solving.
1979-12-01
… are the basic parallel processing primitive. Different goals of the system can be pursued in parallel by placing them in separate activities. Language primitives are provided for manipulating running activities. Viewpoints are a generalization of contexts.
A data distributed parallel algorithm for ray-traced volume rendering
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Painter, James S.; Hansen, Charles D.; Krogh, Michael F.
1993-01-01
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5 and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolume concurrently. No communication between processing units is needed during this local ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on both the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
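A minimal Python sketch of the compositing step: each processing unit produces a premultiplied RGBA subimage, and the final image is a front-to-back fold with the "over" operator in the depth order determined a priori. Array shapes and the premultiplied convention are illustrative assumptions:

    # Ordered subimage compositing with the "over" operator (sketch).
    import numpy as np

    def over(front, back):
        """Composite two premultiplied RGBA images, front over back."""
        a = front[..., 3:4]
        return front + (1.0 - a) * back

    def composite(subimages_front_to_back):
        out = subimages_front_to_back[0]
        for img in subimages_front_to_back[1:]:
            out = over(out, img)            # order fixed by the view direction
        return out

    subs = [np.random.rand(64, 64, 4) * 0.3 for _ in range(4)]
    print(composite(subs).shape)            # (64, 64, 4)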
Chantrapromma, Suchada; Chanawanno, Kullapa; Boonnak, Nawong; Fun, Hoong-Kun
2012-01-01
The asymmetric unit of the title compound, C(36)H(32)N(2)(2+)·2I(-), consists of one half-molecule of the cation and one I(-) anion. The cation is located on an inversion centre. The dihedral angle between the pyridinium ring and the naphthalene ring system in the asymmetric unit is 19.01 (14)°. In the crystal, the cations and the anions are linked by C-H⋯I interactions into a layer parallel to the bc plane. Intra- and intermolecular π-π interactions with centroid-centroid distances of 3.533 (2)-3.807 (2) Å are also observed.
Watanabe, Yuuki; Takahashi, Yuhei; Numazawa, Hiroshi
2014-02-01
We demonstrate intensity-based optical coherence tomography (OCT) angiography using the squared difference of two sequential frames with bulk-tissue-motion (BTM) correction. This motion correction was performed by minimization of the sum of the pixel values using axial- and lateral-pixel-shifted structural OCT images. We extract the BTM-corrected image from a total of 25 calculated OCT angiographic images. Image processing was accelerated by a graphics processing unit (GPU) with many stream processors to optimize the parallel processing procedure. The GPU processing rate was faster than that of a line scan camera (46.9 kHz). Our OCT system provides the means of displaying structural OCT images and BTM-corrected OCT angiographic images in real time.
Method of synchronizing independent functional unit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Changhoan
2017-05-16
A system for synchronizing parallel processing of a plurality of functional processing units (FPUs) includes a first FPU and a first program counter to control timing of a first stream of program instructions issued to the first FPU by advancement of the first program counter, and a second FPU and a second program counter to control timing of a second stream of program instructions issued to the second FPU by advancement of the second program counter. The first FPU is in communication with the second FPU to synchronize the issuance of the first stream of program instructions to the second stream of program instructions, and the second FPU is in communication with the first FPU to synchronize the issuance of the second stream of program instructions to the first stream of program instructions.
Method of synchronizing independent functional unit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Changhoan
2017-02-14
A system for synchronizing parallel processing of a plurality of functional processing units (FPUs) includes a first FPU and a first program counter to control timing of a first stream of program instructions issued to the first FPU by advancement of the first program counter, and a second FPU and a second program counter to control timing of a second stream of program instructions issued to the second FPU by advancement of the second program counter. The first FPU is in communication with the second FPU to synchronize the issuance of the first stream of program instructions to the second stream of program instructions, and the second FPU is in communication with the first FPU to synchronize the issuance of the second stream of program instructions to the first stream of program instructions.
Fiber Optic Communication System For Medical Images
NASA Astrophysics Data System (ADS)
Arenson, Ronald L.; Morton, Dan E.; London, Jack W.
1982-01-01
This paper discusses a fiber optic communication system linking ultrasound devices, computerized tomography scanners, a nuclear medicine computer system, and a digital fluorographic system to a central radiology research computer. These centrally archived images are available for near-instantaneous recall at various display consoles. When a suitable laser optical disk is available for mass storage, more extensive image archiving will be added to the network, including digitized images of standard radiographs for comparison purposes and for remote display in such areas as the intensive care units, the operating room, and selected outpatient departments. This fiber optic system allows for the transfer of high resolution images in less than a second over distances exceeding 2,000 feet. The advantages of using fiber optic cables instead of typical parallel or serial communication techniques are described. The switching methodology and communication protocols are also discussed.
Parallel design of JPEG-LS encoder on graphics processing units
NASA Astrophysics Data System (ADS)
Duan, Hao; Fang, Yong; Huang, Bormin
2012-01-01
With recent technical advances in graphics processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels, and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on an NVIDIA GPU using the compute unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we obtain the best GPU performance, a 26.3x speedup over the original CPU code.
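Of the CUDA techniques listed, the parallel prefix sum is the easiest to illustrate. The following Python sketch shows the log-step (Hillis-Steele) data flow, written serially with array shifts standing in for the simultaneous per-thread updates; sizes are illustrative:

    # Inclusive parallel prefix sum, Hillis-Steele form (illustrative sketch).
    import numpy as np

    def scan_inclusive(x):
        x = x.astype(np.int64).copy()
        step = 1
        while step < len(x):
            shifted = np.concatenate([np.zeros(step, np.int64), x[:-step]])
            x = x + shifted                 # all positions update "in parallel"
            step *= 2
        return x

    a = np.arange(1, 9)
    print(scan_inclusive(a))                # [ 1  3  6 10 15 21 28 36]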
A collimator-converter system for IEC propulsion
NASA Astrophysics Data System (ADS)
Momota, Hiromu; Miley, George H.
2002-01-01
The collimator-converter system extracts fusion power from D-3He fueled IEC devices and provides the electricity needed to operate ionic thrusters and other power components. The whole system is linear and consists of a series of collimator units at the center, magnetic expander units at both sides of the fusion units, followed by direct energy converters at both ends. This system is enclosed in a vacuum chamber with a magnetic channel provided by magnetic solenoids outside the respective chambers. The fusion unit consists of an IEC fusion core, a pair of coils anti-parallel to the solenoid coils, and a stabilization coil that stabilizes the position of the coil pair. The IEC fusion core is installed at the center of the pair coils. After the magnetic expander, fusion particles from the D-3He fueled IEC units are directed into the magnetic channel, which guides energetic fusion particles as well as leaking unburned fuel components to a high-efficiency traveling wave direct energy converter (TWDEC). Leaking unburned fuel components are separated with a magnetic separator at the entrance of a direct energy converter and pumped out for further refueling. A TWDEC is made of an array of metallic meshed grids, each of which is connected to every terminal with an external transmission circuit. The transmission line couples to the direct energy converter. Substations for electricity, a cryogenic plant, and various power control systems are outside of the vacuum chamber. The length of the cylindrical system is essentially determined by the proton energy of 14.8 MeV, and the radius should be large so as to reduce the power flow density. The present system provides 250 MWf of fusion power and converts it to 150 MWc of electricity. It measures 150 m (length) × 6.6 m (diameter) and weighs 185 tons.
Parallel Signal Processing and System Simulation using aCe
NASA Technical Reports Server (NTRS)
Dorband, John E.; Aburdene, Maurice F.
2003-01-01
Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C-based parallel language, aCe, for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we focus on some fundamental features of aCe and present a signal processing application (FFT).
An Expert System for the Development of Efficient Parallel Code
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit
2004-01-01
We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.
Casimir force in O(n) systems with a diffuse interface.
Dantchev, Daniel; Grüneberg, Daniel
2009-04-01
We study the behavior of the Casimir force in O(n) systems with a diffuse interface and slab geometry ∞^{d-1}×L, where 2 < d < 4 is the dimensionality of the system.
Comparison between four dissimilar solar panel configurations
NASA Astrophysics Data System (ADS)
Suleiman, K.; Ali, U. A.; Yusuf, Ibrahim; Koko, A. D.; Bala, S. I.
2017-12-01
Several studies on photovoltaic systems have focused on how they operate and the energy required to operate them. Little attention has been paid to their configurations, the modeling of mean time to system failure, availability, cost benefit, and comparisons of parallel and series-parallel designs. In this research work, four system configurations were studied. Configuration I consists of two sub-components arranged in parallel with 24 V each, configuration II consists of four sub-components arranged logically in parallel with 12 V each, configuration III consists of four sub-components arranged in series-parallel with 8 V each, and configuration IV has six sub-components with 6 V each arranged in series-parallel. Comparative analysis was made using the Chapman-Kolmogorov method. Explicit expressions for the mean time to system failure, steady-state availability and cost benefit were derived for the comparison. A ranking method was used to determine the optimal configuration of the systems. Analytical and numerical solutions for system availability and mean time to system failure were determined, and configuration I was found to be optimal.
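The availability arithmetic behind such comparisons reduces, in the steady state, to simple products. A Python sketch with assumed failure and repair rates (not the paper's data) contrasts a pure parallel layout with a series-parallel one:

    # Steady-state availability of parallel vs. series-parallel layouts
    # (a simplified stand-in for the Chapman-Kolmogorov analysis; rates assumed).
    failure_rate, repair_rate = 0.01, 0.5
    A = repair_rate / (failure_rate + repair_rate)    # single sub-component

    def parallel(*avail):                  # any one unit suffices
        p_all_down = 1.0
        for a in avail:
            p_all_down *= (1.0 - a)
        return 1.0 - p_all_down

    def series(*avail):                    # every unit must be up
        p = 1.0
        for a in avail:
            p *= a
        return p

    config_I = parallel(A, A)                          # two units in parallel
    config_III = parallel(series(A, A), series(A, A))  # 2x2 series-parallel
    print(config_I, config_III)                        # here configuration I dominates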
Support for Debugging Automatically Parallelized Programs
NASA Technical Reports Server (NTRS)
Hood, Robert; Jost, Gabriele; Biegel, Bryan (Technical Monitor)
2001-01-01
This viewgraph presentation provides information on the technical aspects of debugging computer code that has been automatically converted for use in a parallel computing system. Shared memory parallelization and distributed memory parallelization entail separate and distinct challenges for a debugging program. A prototype system has been developed which integrates various tools for the debugging of automatically parallelized programs including the CAPTools Database which provides variable definition information across subroutines as well as array distribution information.
CCA 3101/4101 Environmental Humanities: The History of a Unit through an Ecopedagogical Lens
ERIC Educational Resources Information Center
Ryan, John Charles
2012-01-01
In 2011 the author taught, for the first time, the well-established unit CCA3101/4101 Environmental Humanities in the School of Communications and Arts at ECU (Edith Cowan University) in Western Australia. The unit has a 20-year history through associate professor Rod Giblett and parallels the development of the environmental humanities as a field…
Gas diffusion as a new fluidic unit operation for centrifugal microfluidic platforms.
Ymbern, Oriol; Sández, Natàlia; Calvo-López, Antonio; Puyol, Mar; Alonso-Chamarro, Julian
2014-03-07
A centrifugal microfluidic platform prototype with an integrated membrane for gas diffusion is presented for the first time. The centrifugal platform allows multiple, parallel analyses on a single disk and integrates at least ten independent microfluidic subunits, which allow both calibration and sample determination. It is constructed from a polymeric substrate material and is designed to perform colorimetric determinations by the use of a simple miniaturized optical detection system. The determination of three different analytes, sulfur dioxide, nitrite and carbon dioxide, is carried out as a proof of concept of a versatile microfluidic system for the determination of analytes which involve a gas-diffusion separation step during the analytical procedure.
Investigation of Combined Motor/Magnetic Bearings for Flywheel Energy Storage Systems
NASA Technical Reports Server (NTRS)
Hofmann, Heath
2003-01-01
Dr. Hofmann's work in the summer of 2003 consisted of two separate projects. In the first part of the summer, Dr. Hofmann prepared and collected information regarding rotor losses in synchronous machines; in particular, machines with low rotor losses operating in vacuum and supported by magnetic bearings, such as the motor/generator for flywheel energy storage systems. This work culminated in a presentation at NASA Glenn Research Center on this topic. In the second part, Dr. Hofmann investigated an approach to flywheel energy storage where the phases of the flywheel motor/generator are connected in parallel with the phases of an induction machine driving a mechanical actuator. With this approach, additional power electronics for driving the flywheel unit are not required. Simulations of the connection of a flywheel energy storage system to a model of an electromechanical actuator testbed at NASA Glenn were performed that validated the proposed approach. A proof-of-concept experiment using the D1 flywheel unit at NASA Glenn and a Sundstrand induction machine connected to a dynamometer was successfully conducted.
Lin, Jyh-Miin; Patterson, Andrew J; Chang, Hing-Chiu; Gillard, Jonathan H; Graves, Martin J
2015-10-01
To propose a new reduced field-of-view (rFOV) strategy for iterative reconstructions in a clinical environment. Iterative reconstructions can incorporate regularization terms to improve the image quality of periodically rotated overlapping parallel lines with enhanced reconstruction (PROPELLER) MRI. However, the large amount of calculation required for full-FOV iterative reconstructions has posed a huge computational challenge for clinical usage. By subdividing the entire problem into smaller rFOVs, the iterative reconstruction can be accelerated on a desktop with a single graphics processing unit (GPU). This rFOV strategy divides the iterative reconstruction into blocks, based on the block-diagonal dominant structure. A near real-time reconstruction system was developed for the clinical MR unit, and parallel computing was implemented using an object-oriented model. In addition, the Toeplitz method was implemented on the GPU to reduce the time required for full interpolation. Using the data acquired from the PROPELLER MRI, the reconstructed images were then saved in the digital imaging and communications in medicine (DICOM) format. The proposed rFOV reconstruction reduced the gridding time by 97%, and the total iteration time was 3 s even with multiple processes running. A phantom study showed that the structural similarity index for rFOV reconstruction was statistically superior to conventional density compensation (p < 0.001). An in vivo study validated the increased signal-to-noise ratio, which is over four times higher than with density compensation. The image sharpness index was improved using the implemented regularized reconstruction. The rFOV strategy permits near real-time iterative reconstruction to improve the image quality of PROPELLER images. Substantial improvements in image quality metrics were validated in the experiments. The concept of rFOV reconstruction may potentially be applied to other kinds of iterative reconstruction to shorten reconstruction time.
NASA Astrophysics Data System (ADS)
Wang, Linlin; Wang, Zhenqi; Yu, Shui; Ngia, Ngong Roger
2016-08-01
The Miocene deepwater gravity-flow sedimentary system in Block A of the southwestern part of the Lower Congo Basin was identified and interpreted using high-resolution 3-D seismic, drilling and logging data to reveal development characteristics and main controlling factors. Five types of deepwater gravity-flow sedimentary units have been identified in the Miocene section of Block A, including mass transport, deepwater channel, levee, abandoned channel and sedimentary lobe deposits. Each type of sedimentary unit has distinct external features, internal structures and lateral characteristics in seismic profiles. Mass transport deposits (MTDs) in particular correspond to chaotic low-amplitude reflections in contact with mutants on both sides. The cross section of deepwater channel deposits in the seismic profile is in U- or V-shape. The channel deposits change in ascending order from low-amplitude, poor-continuity, chaotic filling reflections at the bottom, to high-amplitude, moderate to poor continuity, chaotic or sub-parallel reflections in the middle section and to moderate-weak amplitude, good continuity, parallel or sub-parallel reflections in the upper section. The sedimentary lobes are laterally lobate, which corresponds to high-amplitude, good-continuity, moundy reflection signatures in the seismic profile. Due to sediment flux, faults, and inherited terrain, few mass transport deposits occur in the northeastern part of the study area. The front of MTDs is mainly composed of channel-levee complex deposits, while abandoned-channel and lobe-deposits are usually developed in high-curvature channel sections and the channel terminals, respectively. The distribution of deepwater channel, levee, abandoned channel and sedimentary lobe deposits is predominantly controlled by relative sea level fluctuations and to a lesser extent by tectonism and inherited terrain.
Structural studies on a high-pressure polymorph of NaYSi 2O 6
NASA Astrophysics Data System (ADS)
Kahlenberg, Volker; Konzett, Jürgen; Kaindl, Reinhard
2007-06-01
High-pressure synthesis experiments in the system Na2O-Y2O3-SiO2 revealed the existence of a previously unknown polymorph of NaYSi2O6 or Na3Y3[Si3O9]2 which was quenched from 3.0 GPa and 1000 °C. Structural investigations on this modification have been performed using single-crystal X-ray diffraction data collected at ambient conditions. Furthermore, unpolarized micro-Raman spectra have been obtained from single-crystal material. The high-P modification of NaYSi2O6 crystallizes in the centrosymmetric space group C2/c with 12 formula units per cell (a=8.2131(9) Å, b=10.3983(14) Å, c=17.6542(21) Å, β=100.804(9)°, V=1481.0(3) Å3, R(|F|)=0.033 for 1142 independent observed reflections) and belongs to the group of cyclo-silicates. Basic building units are isolated three-membered [Si3O9] rings located in layers parallel to (010). Within a single layer the rings are concentrated in strings parallel to [100]. The sequence of directedness of up (U) or down (D) pointing tetrahedra of a single ring is UUU or DDD, respectively. Stacking of the layers parallel to b results in the formation of a three-dimensional structure in which yttrium and sodium cations are incorporated for charge compensation. In more detail, four non-tetrahedral cation positions can be differentiated, which are coordinated by 6 and 8 oxygen ligands. Refinements of the site occupancies did not reveal any indication of mixed Na-Y populations on these positions. Finally, several geometrical parameters of rings occurring in cyclo-trisilicate structures have been compiled and are discussed.
Using parallel computing for the display and simulation of the space debris environment
NASA Astrophysics Data System (ADS)
Möckel, M.; Wiedemann, C.; Flegel, S.; Gelhaus, J.; Vörsmann, P.; Klinkrad, H.; Krag, H.
2011-07-01
Parallelism is becoming the leading paradigm in today's computer architectures. In order to take full advantage of this development, new algorithms have to be specifically designed for parallel execution while many old ones have to be upgraded accordingly. One field in which parallel computing has been firmly established for many years is computer graphics. Calculating and displaying three-dimensional computer generated imagery in real time requires complex numerical operations to be performed at high speed on a large number of objects. Since most of these objects can be processed independently, parallel computing is applicable in this field. Modern graphics processing units (GPUs) have become capable of performing millions of matrix and vector operations per second on multiple objects simultaneously. As a side project, a software tool is currently being developed at the Institute of Aerospace Systems that provides an animated, three-dimensional visualization of both actual and simulated space debris objects. Due to the nature of these objects it is possible to process them individually and independently from each other. Therefore, an analytical orbit propagation algorithm has been implemented to run on a GPU. By taking advantage of all its processing power a huge performance increase, compared to its CPU-based counterpart, could be achieved. For several years efforts have been made to harness this computing power for applications other than computer graphics. Software tools for the simulation of space debris are among those that could profit from embracing parallelism. With recently emerged software development tools such as OpenCL it is possible to transfer the new algorithms used in the visualization outside the field of computer graphics and implement them, for example, into the space debris simulation environment. This way they can make use of parallel hardware such as GPUs and Multi-Core-CPUs for faster computation. In this paper the visualization software will be introduced, including a comparison between the serial and the parallel method of orbit propagation. Ways of how to use the benefits of the latter method for space debris simulation will be discussed. An introduction to OpenCL will be given as well as an exemplary algorithm from the field of space debris simulation.
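The per-object independence that makes the propagation GPU-friendly is easy to see in a vectorized sketch. The following Python fragment advances a million assumed circular orbits in one shot; the simplification to circular orbits and all constants are illustrative, not the tool's actual analytical propagator:

    # Vectorized propagation of many independent (assumed circular) orbits.
    import numpy as np

    MU = 398600.4418                       # km^3/s^2, Earth's gravitational parameter

    def propagate_circular(a_km, theta0, t):
        """Advance the phase angle of circular orbits by mean motion * t."""
        n = np.sqrt(MU / a_km**3)          # mean motion, rad/s
        return (theta0 + n * t) % (2 * np.pi)

    a = np.random.uniform(6800.0, 42000.0, 1_000_000)   # one row per debris object
    th0 = np.random.uniform(0, 2 * np.pi, a.size)
    th = propagate_circular(a, th0, t=3600.0)           # state one hour later
    print(th[:3])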
Parallel-Processing Test Bed For Simulation Software
NASA Technical Reports Server (NTRS)
Blech, Richard; Cole, Gary; Townsend, Scott
1996-01-01
Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-the-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
Experimental design, operation, and results of a 4 kW high temperature steam electrolysis experiment
Zhang, Xiaoyu; O'Brien, James E.; Tao, Greg; ...
2015-08-06
High temperature steam electrolysis (HTSE) is a promising technology for large-scale hydrogen production. However, research on HTSE performance above the kW level is limited. This paper presents the results of a 4 kW HTSE long-term test completed in a multi-kW test facility recently developed at the Idaho National Laboratory (INL). The 4 kW HTSE unit included two solid oxide electrolysis stacks operating in parallel, each of which included 40 electrode-supported planar cells. A current density of 0.41 A/cm2 was used for the long-term operation, resulting in a hydrogen production rate of about 25 slpm. A demonstration of 920 hours of stable operation was achieved. The paper also includes detailed descriptions of the piping layout, steam generation and delivery system, test fixture, heat recuperation system, hot zone, instrumentation, and operating conditions. This successful demonstration of a multi-kW scale HTSE unit will help to advance the technology toward near-term commercialization.
Zinc-chlorine battery plant system and method
Whittlesey, Curtis C.; Mashikian, Matthew S.
1981-01-01
A zinc-chlorine battery plant system and method of redirecting the electrical current around a failed battery module. The battery plant includes a power conditioning unit, a plurality of battery modules connected electrically in series to form battery strings, a plurality of battery strings electrically connected in parallel to the power conditioning unit, and a bypass switch for each battery module in the battery plant. The bypass switch includes a normally open main contact across the power terminals of the battery module, and a set of normally closed auxiliary contacts for controlling the supply of reactants electrochemically transformed in the cells of the battery module. Upon the determination of a failure condition, the bypass switch for the failed battery module is energized to close the main contact and open the auxiliary contacts. Within a short time, the electrical current through the battery module will substantially decrease due to the cutoff of the supply of reactants, and the electrical current flow through the battery string will be redirected through the main contact of the bypass switch.
How do strategic decisions and operative practices affect operating room productivity?
Peltokorpi, Antti
2011-12-01
Surgical operating rooms are cost-intensive parts of health service production. Managing operating units efficiently is essential when hospitals and healthcare systems aim to maximize health outcomes with limited resources. Previous research about operating room management has focused on studying the effect of management practices and decisions on efficiency by utilizing mainly modeling approach or before-after analysis in single hospital case. The purpose of this research is to analyze the synergic effect of strategic decisions and operative management practices on operating room productivity and to use a multiple case study method enabling statistical hypothesis testing with empirical data. 11 hypotheses that propose connections between the use of strategic and operative practices and productivity were tested in a multi-hospital study that included 26 units. The results indicate that operative practices, such as personnel management, case scheduling and performance measurement, affect productivity more remarkably than do strategic decisions that relate to, e.g., units' size, scope or academic status. Units with different strategic positions should apply different operative practices: Focused hospital units benefit most from sophisticated case scheduling and parallel processing whereas central and ambulatory units should apply flexible working hours, incentives and multi-skilled personnel. Operating units should be more active in applying management practices which are adequate for their strategic orientation.
Design and Development of Multi-Purpose CCD Camera System with Thermoelectric Cooling: Hardware
NASA Astrophysics Data System (ADS)
Kang, Y.-W.; Byun, Y. I.; Rhee, J. H.; Oh, S. H.; Kim, D. K.
2007-12-01
We designed and developed a multi-purpose CCD camera system for three kinds of CCDs: the KAF-0401E (768×512), KAF-1602E (1536×1024), and KAF-3200E (2184×1472) made by KODAK Co. The system supports a fast USB port as well as a parallel port for data I/O and control signals. The packaging is based on two-stage circuit boards for size reduction and contains a built-in filter wheel. Basic hardware components include a clock pattern circuit, an A/D conversion circuit, a CCD data flow control circuit, and a CCD temperature control unit. The CCD temperature can be controlled with an accuracy of approximately 0.4 °C over a maximum temperature range of Δ33 °C. This CCD camera system has a readout noise of 6 e^{-} and a system gain of 5 e^{-}/ADU. A total of 10 CCD camera systems were produced, and our tests show that all of them exhibit passable performance.
Design of a nuclear isotope heat source assembly for a spaceborne mini-Brayton power module.
NASA Technical Reports Server (NTRS)
Wein, D.; Gorland, S. H.
1973-01-01
Results of a study to develop a feasible design definition of a heat source assembly (HSA) for use in nominal 500-, 1200-, or 2000-W(e) mini-Brayton spacecraft power systems. The HSA is a modular design which is used either as a single unit to provide thermal energy to the 500-W(e) mini-Brayton power module or in parallel with one or two additional HSAs for the 1200- or 2000-W(e) power module systems. Principal components consist of a multihundred watt RTG isotope heat source, a heat source heat exchanger which transfers the thermal energy from the heat source to the mini-Brayton power conversion system, an auxiliary cooling system which provides requisite cooling during nonoperation of the power conversion module and an emergency cooling system which precludes accidental release of isotope fuel in the event of system failure.
The application of autostereoscopic display in smart home system based on mobile devices
NASA Astrophysics Data System (ADS)
Zhang, Yongjun; Ling, Zhi
2015-03-01
A smart home is a system for controlling home devices, and such systems are becoming more and more popular in daily life. Mobile intelligent terminals for smart homes have been developed, making remote control and monitoring possible with smartphones or tablets. On the other hand, 3D stereo display technology has developed rapidly in recent years. Therefore, an iPad-based smart home system that adopts an autostereoscopic display as its control interface is proposed to improve the user-friendliness of the experience. In consideration of the iPad's limited hardware capabilities, we introduce a 3D image synthesizing method based on parallel processing with the Graphics Processing Unit (GPU) and implement it with the OpenGL ES Application Programming Interface (API) library on the iOS platform for real-time autostereoscopic display. Compared to a traditional smart home system, the proposed system, by applying an autostereoscopic display to the control interface, enhances the realism, user-friendliness and visual comfort of the interface.
A parallel time integrator for noisy nonlinear oscillatory systems
NASA Astrophysics Data System (ADS)
Subber, Waad; Sarkar, Abhijit
2018-06-01
In this paper, we adapt a parallel time integration scheme to track the trajectories of noisy nonlinear dynamical systems. Specifically, we formulate a parallel algorithm to generate the sample paths of a nonlinear oscillator defined by stochastic differential equations (SDEs) using the so-called parareal method for ordinary differential equations (ODEs). The presence of the Wiener process in SDEs causes difficulties in the direct application of any numerical integration technique for ODEs, including the parareal algorithm. The parallel implementation of the algorithm involves two SDE solvers, namely a fine-level scheme to integrate the system in parallel and a coarse-level scheme to generate and correct the required initial conditions to start the fine-level integrators. For the numerical illustration, a randomly excited Duffing oscillator is investigated in order to study the performance of the stochastic parallel algorithm with respect to a range of system parameters. The distributed implementation of the algorithm uses the Message Passing Interface (MPI).
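A minimal parareal sketch in Python may clarify the two-level structure; it uses a deterministic toy oscillator and Euler solvers (the SDE version would additionally fix a common Wiener path per sample), so all solver and step-size choices are assumptions:

    # Parareal iteration for a toy ODE (illustrative sketch).
    # G is the cheap coarse solver, F the fine one; the F evaluations over
    # the N time slices are the part that would run concurrently.
    def f(y): return -y + 0.5 * y**3        # toy nonlinear restoring force

    def step(y, h, n):                       # explicit Euler, n substeps
        for _ in range(n):
            y = y + h * f(y)
        return y

    N, T, y0 = 10, 2.0, 0.5                  # N time slices on N "processors"
    dT = T / N
    G = lambda y: step(y, dT, 1)             # coarse: 1 Euler step per slice
    F = lambda y: step(y, dT / 100, 100)     # fine: 100 steps per slice

    U = [y0]
    for _ in range(N):                       # initial coarse sweep
        U.append(G(U[-1]))
    for k in range(5):                       # parareal correction sweeps
        Fu = [F(U[i]) for i in range(N)]     # independent; would run in parallel
        Unew = [y0]
        for i in range(N):
            Unew.append(G(Unew[-1]) + Fu[i] - G(U[i]))
        U = Unew
    print(U[-1], step(y0, T / 1000, 1000))   # converges toward the serial fine run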
Distributed and parallel Ada and the Ada 9X recommendations
NASA Technical Reports Server (NTRS)
Volz, Richard A.; Goldsack, Stephen J.; Theriault, R.; Waldrop, Raymond S.; Holzbacher-Valero, A. A.
1992-01-01
Recently, the DoD has sponsored work towards a new version of Ada, intended to support the construction of distributed systems. The revised version, often called Ada 9X, will become the new standard sometime in the 1990s. It is intended that Ada 9X should provide language features giving limited support for distributed system construction. The requirements for such features are given. Many of the most advanced computer applications involve embedded systems that are comprised of parallel processors or networks of distributed computers. If Ada is to become the widely adopted language envisioned by many, it is essential that suitable compilers and tools be available to facilitate the creation of distributed and parallel Ada programs for these applications. The major language issues impacting distributed and parallel programming are reviewed, and some principles upon which distributed/parallel language systems should be built are suggested. Based upon these, alternative language concepts for distributed/parallel programming are analyzed.
Partitioning problems in parallel, pipelined and distributed computing
NASA Technical Reports Server (NTRS)
Bokhari, S.
1985-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
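The flavor of such partitioning problems can be shown with a small Python sketch: bottleneck partitioning of a chain of module costs over p processors, minimizing the largest contiguous block sum. This is a simplified stand-in for the Sum-Bottleneck path formulation, and the module weights are invented:

    # Bottleneck partitioning of a chain-structured program (illustrative sketch).
    import functools

    weights = [4, 7, 1, 3, 9, 2, 5, 6]      # assumed module costs along the chain

    @functools.lru_cache(None)
    def bottleneck(i, p):
        """Min over splits of the max block cost for modules i.. on p processors."""
        if p == 1:
            return sum(weights[i:])
        best = float("inf")
        run = 0
        for j in range(i, len(weights) - p + 1):
            run += weights[j]
            best = min(best, max(run, bottleneck(j + 1, p - 1)))
        return best

    print(bottleneck(0, 3))                 # optimal makespan on 3 processors -> 13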
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG
NASA Astrophysics Data System (ADS)
Griffiths, M. K.; Fedun, V.; Erdélyi, R.
2015-03-01
Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.
Graphics Processing Unit Assisted Thermographic Compositing
NASA Technical Reports Server (NTRS)
Ragasa, Scott; McDougal, Matthew; Russell, Sam
2012-01-01
Objective: To develop a software application utilizing general purpose graphics processing units (GPUs) for the analysis of large sets of thermographic data. Background: Over the past few years, an increasing effort among scientists and engineers to utilize the GPU in a more general purpose fashion has allowed for supercomputer-level results at individual workstations. As data sets grow, the methods to work them grow at an equal, and often greater, pace. Certain common computations can take advantage of the massively parallel and optimized hardware constructs of the GPU to allow for throughput that was previously reserved for compute clusters. These common computations have high degrees of data parallelism, that is, they are the same computation applied to a large set of data where the result does not depend on other data elements. Signal (image) processing is one area where GPUs are being used to greatly increase the performance of certain algorithms and analysis techniques. Technical Methodology/Approach: Apply massively parallel algorithms and data structures to the specific analysis requirements presented when working with thermographic data sets.
Particle-in-cell simulations with charge-conserving current deposition on graphic processing units
NASA Astrophysics Data System (ADS)
Ren, Chuang; Kong, Xianglong; Huang, Michael; Decyk, Viktor; Mori, Warren
2011-10-01
Recently, using CUDA, we have developed an electromagnetic Particle-in-Cell (PIC) code with charge-conserving current deposition for NVIDIA graphics processing units (GPUs) (Kong et al., Journal of Computational Physics 230, 1676 (2011)). On a Tesla M2050 (Fermi) card, the GPU PIC code can achieve a one-particle-step process time of 1.2 - 3.2 ns in 2D and 2.3 - 7.2 ns in 3D, depending on plasma temperatures. In this talk we will discuss novel algorithms for GPU PIC, including a charge-conserving current deposition scheme with little branching and parallel particle sorting. These algorithms make efficient use of the GPU shared memory. We will also discuss how to replace the computation kernels of existing parallel CPU codes while keeping their parallel structures. This work was supported by U.S. Department of Energy under Grant Nos. DE-FG02-06ER54879 and DE-FC02-04ER54789 and by NSF under Grant Nos. PHY-0903797 and CCF-0747324.
Quantum supercharger library: hyper-parallelism of the Hartree-Fock method.
Fernandes, Kyle D; Renison, C Alicia; Naidoo, Kevin J
2015-07-05
We present here a set of algorithms that completely rewrites the Hartree-Fock (HF) computations common to many legacy electronic structure packages (such as GAMESS-US, GAMESS-UK, and NWChem) into a massively parallel compute scheme that takes advantage of hardware accelerators such as Graphical Processing Units (GPUs). The HF compute algorithm is core to a library of routines that we name the Quantum Supercharger Library (QSL). We briefly evaluate the QSL's performance and report that it accelerates a HF 6-31G Self-Consistent Field (SCF) computation by up to 20 times for medium sized molecules (such as a buckyball) when compared with mature Central Processing Unit algorithms available in the legacy codes in regular use by researchers. It achieves this acceleration by massive parallelization of the one- and two-electron integrals and optimization of the SCF and Direct Inversion in the Iterative Subspace routines through the use of GPU linear algebra libraries. © 2015 Wiley Periodicals, Inc.
Godlewska, P; Jańczak, J; Kucharska, E; Hanuza, J; Lorenc, J; Michalski, J; Dymińska, L; Węgliński, Z
2014-01-01
Fourier transform IR and Raman spectra, XRD studies and DFT quantum chemical calculations have been used to characterize the structural and vibrational properties of 2-hydroxy-5-methylpyridine-3-carboxylic acid. In the unit cell of this compound, two molecules related by the inversion center interact via OH⋯N hydrogen bonds. The double hydrogen-bridge system lies parallel to the (102) crystallographic plane, forming the eight-membered arrangement characteristic of pyridine derivatives. The six-membered ring is the second characteristic unit, formed via the intramolecular OH⋯O hydrogen bond. Geometry optimizations of the monomer and dimer have been performed using the Gaussian03 program package. All calculations were performed with the B3LYP/6-31G(d,p) basis set using the XRD data as input parameters. The relation between the molecular and crystal structures has been discussed in terms of the hydrogen bonds formed in the unit cell. The vibrations of the dimer have been discussed in terms of the resonance inside the system built of five rings coupled via hydrogen bonds. Copyright © 2013 Elsevier B.V. All rights reserved.
Reliability modelling and analysis of a multi-state element based on a dynamic Bayesian network
NASA Astrophysics Data System (ADS)
Li, Zhiqiang; Xu, Tingxue; Gu, Junyuan; Dong, Qi; Fu, Linyu
2018-04-01
This paper presents a quantitative reliability modelling and analysis method for multi-state elements based on a combination of the Markov process and a dynamic Bayesian network (DBN), taking perfect repair, imperfect repair and condition-based maintenance (CBM) into consideration. The Markov models of elements without repair and under CBM are established, and an absorbing set is introduced to determine the reliability of the repairable element. According to the state-transition relations between the states determined by the Markov process, a DBN model is built. In addition, its parameters for series and parallel systems, namely conditional probability tables, can be calculated by referring to the conditional degradation probabilities. Finally, the power system of a control unit is used as an example. A dynamic fault tree (DFT) is translated into a Bayesian network model and subsequently extended to a DBN. The results give the state probabilities of an element and of the system without repair, with perfect and imperfect repair, and under CBM; the absorbing-set results, obtained from the differential equations, are plotted and verified. Through forward inference, the reliability of the control unit is determined under the different maintenance modes, and weak nodes in the control unit are identified.
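As a hedged sketch of the Markov-with-absorbing-set idea, the snippet below propagates the state probabilities of a small multi-state element through a transition-rate matrix whose failed state is absorbing, and reads reliability off as the probability of not having been absorbed. The three states and all rates are invented for illustration.

```python
import numpy as np

# Toy 3-state element: 0 = good, 1 = degraded, 2 = failed (absorbing).
# Q[i, j] is the transition rate i -> j; each row sums to zero.
lam1, lam2, mu = 0.02, 0.05, 0.10          # assumed degradation/repair rates
Q = np.array([[-lam1,         lam1,  0.0 ],
              [   mu, -(mu + lam2),  lam2],
              [  0.0,          0.0,  0.0 ]])   # failed state absorbs

p = np.array([1.0, 0.0, 0.0])              # element starts in the good state
dt, T = 0.1, 100.0
for _ in range(int(T / dt)):               # Euler step of dp/dt = p @ Q
    p = p + dt * (p @ Q)

reliability = p[0] + p[1]                  # probability of never reaching the absorbing set
print(f"R(T={T}) = {reliability:.4f}")
```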
Parallelized direct execution simulation of message-passing parallel programs
NASA Technical Reports Server (NTRS)
Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.
1994-01-01
As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution in which one directly executes the application code but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, the Large Application Parallel Simulation Environment (LAPSE), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
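A hedged miniature of the direct-execution idea: application work would run natively, while a discrete-event queue advances virtual time and models the presumed machine's network latency. The event loop below is a generic sketch, not LAPSE's engine.

```python
import heapq

LATENCY = 5.0                      # assumed per-message network latency (virtual time)
events = []                        # (time, seq, action) min-heap
seq = 0

def schedule(t, action):
    global seq
    heapq.heappush(events, (t, seq, action)); seq += 1

def send(now, inbox, payload):
    # Model the presumed machine: delivery occurs LATENCY after the send.
    schedule(now + LATENCY, lambda t: inbox.append((t, payload)))

inbox = []                         # receiver's delivered messages
schedule(0.0, lambda t: send(t, inbox, "halo exchange"))

now = 0.0
while events:                      # the discrete-event simulator main loop
    now, _, action = heapq.heappop(events)
    action(now)

print(inbox)                       # [(5.0, 'halo exchange')]
```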
Graphics Processing Unit Assisted Thermographic Compositing
NASA Technical Reports Server (NTRS)
Ragasa, Scott; Russell, Samuel S.
2012-01-01
Objective: Develop a software application utilizing high-performance computing techniques, including general-purpose graphics processing units (GPGPUs), for the analysis and visualization of large thermographic data sets. Over the past several years, an increasing effort among scientists and engineers to utilize graphics processing units (GPUs) in a more general-purpose fashion is allowing for previously unobtainable levels of computation by individual workstations. As data sets grow, the methods to work with them grow at an equal, and often greater, pace. Certain common computations can take advantage of the massively parallel and optimized hardware constructs of the GPU, which yield significant increases in performance. These common computations have high degrees of data parallelism, that is, they are the same computation applied to a large set of data where the result does not depend on other data elements. Image processing is one area where GPUs are being used to greatly increase the performance of certain analysis and visualization techniques.
GPU COMPUTING FOR PARTICLE TRACKING
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna
2011-03-25
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize an accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performance, issues, and challenges from introducing the GPU are also discussed. General Purpose computation on Graphics Processing Units (GPGPU) brings massive parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to optimize existing applications well. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time-consuming part of accelerator modeling and simulation, can benefit from the GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multiprocessors (MPs) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using the GPU to speed up the dynamic aperture calculation by having each thread track a particle.
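The block/grid hierarchy described above maps directly onto kernel code. Below is a hedged sketch using Numba's CUDA bindings (a convenience assumption; it requires a CUDA-capable GPU, and the paper's TracyGPU is not written this way): one thread per tracked particle, with the thread's global index selecting its particle.

```python
from numba import cuda
import numpy as np

@cuda.jit
def drift(x, px, length):
    i = cuda.grid(1)                 # global thread index across the whole grid
    if i < x.size:                   # guard: the grid may exceed the particle count
        x[i] += length * px[i]       # each thread advances one particle

n = 100_000
x = np.zeros(n, dtype=np.float64)
px = np.random.default_rng(2).normal(0.0, 1e-3, n)

threads = 256                        # threads per block (shared-memory scope)
blocks = (n + threads - 1) // threads
drift[blocks, threads](x, px, 1.0)   # the grid of blocks executes as one kernel
```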
Ehsan, Shoaib; Clark, Adrian F.; ur Rehman, Naveed; McDonald-Maier, Klaus D.
2015-01-01
The integral image, an intermediate image representation, has found extensive use in multi-scale local feature detection algorithms, such as Speeded-Up Robust Features (SURF), allowing fast computation of rectangular features at constant speed, independent of filter size. For resource-constrained real-time embedded vision systems, computation and storage of the integral image present several design challenges due to strict timing and hardware limitations. Although calculation of the integral image only consists of simple addition operations, the total number of operations is large owing to the generally large size of image data. Recursive equations allow a substantial decrease in the number of operations but require calculation in a serial fashion. This paper presents two new hardware algorithms that are based on the decomposition of these recursive equations, allowing calculation of up to four integral image values in a row-parallel way without significantly increasing the number of operations. An efficient design strategy is also proposed for a parallel integral image computation unit to reduce the size of the required internal memory (nearly 35% for common HD video). Addressing the storage problem of the integral image in embedded vision systems, the paper presents two algorithms which allow a substantial decrease (at least 44.44%) in the memory requirements. Finally, the paper provides a case study that highlights the utility of the proposed architectures in embedded vision systems. PMID:26184211
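For reference, a hedged sketch of the standard integral image recurrences that the paper decomposes, plus the constant-time box sum they enable; variable names are illustrative.

```python
import numpy as np

def integral_image(img):
    """ii(x, y) = sum of img over the rectangle [0..x, 0..y].

    Serial recurrences: s(x, y) = s(x, y-1) + img(x, y)   (row sum)
                        ii(x, y) = ii(x-1, y) + s(x, y)
    The paper's hardware algorithms decompose these to emit up to
    four values per row in parallel; NumPy's cumsum stands in here.
    """
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x0, y0, x1, y1):
    """Sum over img[x0:x1+1, y0:y1+1] from four lookups, for any filter size."""
    total = ii[x1, y1]
    if x0 > 0: total -= ii[x0 - 1, y1]
    if y0 > 0: total -= ii[x1, y0 - 1]
    if x0 > 0 and y0 > 0: total += ii[x0 - 1, y0 - 1]
    return total

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 2) == img[1:4, 1:3].sum()
```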
Photoacoustic Imaging with a Commercial Ultrasound System and a Custom Probe
Wang, Xueding; Fowlkes, J. Brian; Cannata, Jonathan M.; Hu, Changhong; Carson, Paul L.
2010-01-01
Building photoacoustic imaging (PAI) systems by using stand-alone ultrasound (US) units makes it convenient to take advantage of state-of-the-art ultrasonic technologies. However, the sometimes limited receiving sensitivity and the comparatively narrow bandwidth of commercial US probes may not be sufficient to acquire high quality photoacoustic images. In this work, a high-speed PAI system has been developed using a commercial US unit and a custom built 128-element piezoelectric-polymer array (PPA) probe using a P(VDF-TrFE) film and flexible circuit to define the elements. Since the US unit supports simultaneous signal acquisition from 64 parallel receive channels, PAI data for synthetic image formation from a 64 or 128 element array aperture can be acquired after a single or dual laser firing, respectively. Therefore, 2D B-scan imaging can be achieved with a maximum frame rate of up to 10 Hz, limited only by the laser repetition rate. The unique properties of P(VDF-TrFE) facilitated a wide -6 dB receiving bandwidth of over 120% for the array. A specially designed 128-channel preamplifier board provided the connection between the array and the system cable, which not only enabled element electrical impedance matching but also elevated the signal-to-noise ratio (SNR), further enhancing the detection of weak photoacoustic signals. Through experiments on phantoms and rabbit ears, the good performance of this PAI system was demonstrated. PMID:21276653
Optimization of vehicle-trailer connection systems
NASA Astrophysics Data System (ADS)
Sorge, F.
2016-09-01
The three main requirements of a vehicle-trailer connection system are: en-route stability, restraint of over- or under-steering, and minimum off-tracking along a curved path. Linking the two units by four-bar trapeziums, wider stability margins may be attained in comparison with the conventional pintle hitch for both instability types, divergent or oscillating. The stability maps are traced by applying the Hurwitz method or by direct analysis of the characteristic equation at the instability threshold. Several types of four-bar linkages may be quickly tested, with the drawbars converging towards the trailer or the towing unit. The latter configuration appears preferable in terms of self-stability and may yield high critical speeds by optimising the geometrical and physical properties. Nevertheless, the system stability may be improved in general by additional vibration dampers in parallel with the connection linkage. Moreover, the four-bar connection may produce significant corrections of the under-steering or over-steering behaviour of the vehicle train after a steering command from the driver. The off-tracking along curved paths may also be optimized or kept inside prefixed margins of acceptability. Activating electronic stability systems if necessary, fair results are obtainable for both the steering behaviour and the off-tracking.
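The Hurwitz method mentioned above reduces to a sign test on the leading principal minors of the Hurwitz matrix built from the characteristic polynomial. A hedged generic sketch follows; the polynomial coefficients are placeholders, not the trailer model's.

```python
import numpy as np

def hurwitz_stable(coeffs):
    """True if all roots of a0*s^n + a1*s^(n-1) + ... + an have Re(s) < 0.

    Builds the n x n Hurwitz matrix and checks that all leading
    principal minors are positive (assumes a0 > 0).
    """
    a = np.asarray(coeffs, dtype=float)
    n = len(a) - 1
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            k = 2 * j - i + 1          # coefficient index a_{2j-i} (1-based rule)
            if 0 <= k <= n:
                H[i, j] = a[k]
    minors = [np.linalg.det(H[:k, :k]) for k in range(1, n + 1)]
    return all(m > 0 for m in minors)

# Placeholder 3rd-order characteristic equation s^3 + 6s^2 + 11s + 6 = 0
print(hurwitz_stable([1, 6, 11, 6]))   # True: roots are -1, -2, -3
```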
A Study of Parallels Between Antarctica South Pole Traverse Equipment and Lunar/Mars Surface Systems
NASA Technical Reports Server (NTRS)
Mueller, Robert P.; Hoffman, Stephen J.; Thur, Paul
2010-01-01
The parallels between an actual Antarctica South Pole re-supply traverse conducted by the National Science Foundation (NSF) Office of Polar Programs in 2009 and the latest mission architecture concepts being generated by the United States National Aeronautics and Space Administration (NASA) for lunar and Mars surface systems scenarios have been studied. The challenges faced by both endeavors are similar, since each must deliver equipment and supplies to support operations in an extreme environment, with little margin for error, in order to be successful. By carefully and closely monitoring the manifesting and operational support equipment lists which enable this South Pole traverse, functional areas have been identified. The equipment required to support these functions is listed with relevant properties such as mass, volume, spare parts and maintenance schedules. This equipment is compared to space systems currently in use and projected to be required to support equivalent and parallel functions in lunar and Mars missions, in order to provide a level of realistic benchmarking. Space operations have historically required significant amounts of support equipment and tools to operate and maintain the space systems that are the primary focus of the mission. By gaining insight and expertise in Antarctic South Pole traverses, space missions can use the experience gained over the last half century of Antarctic operations to design for operations, maintenance, dual use, robustness and safety, resulting in a more cost-effective, user-friendly and lower-risk surface system on the Moon and Mars. It is anticipated that the U.S. Antarctic Program (USAP) will also realize benefits from this interaction with NASA in at least two areas: an understanding of how NASA plans and carries out its missions, and possible improved efficiency through factors such as weight savings, alternative technologies, or modifications in training and operations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Seal, Sudip K; Perumalla, Kalyan S; Hirshman, Steven Paul
2013-01-01
Simulations that require solutions of block tridiagonal systems of equations rely on fast parallel solvers for runtime efficiency. Leading parallel solvers that are highly effective for general systems of equations, dense or sparse, are limited in scalability when applied to block tridiagonal systems. This paper presents scalability results as well as detailed analyses of two parallel solvers that exploit the special structure of block tridiagonal matrices to deliver superior performance, often by orders of magnitude. A rigorous analysis of their relative parallel runtimes is shown to reveal the existence of a critical block size that separates the parameter space spanned by the number of block rows, the block size and the processor count into distinct regions that favor one or the other of the two solvers. Dependence of this critical block size on the above parameters as well as on machine-specific constants is established. These formal insights are supported by empirical results on up to 2,048 cores of a Cray XT4 system. To the best of our knowledge, this is the highest reported scalability for parallel block tridiagonal solvers to date.
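For context, the serial baseline that such parallel solvers improve upon is block forward elimination with back substitution (a block Thomas algorithm). A hedged NumPy sketch with invented block data:

```python
import numpy as np

def block_thomas(A, B, C, d):
    """Solve a block tridiagonal system (serial reference implementation).

    A[i], B[i], C[i] are the sub-, main- and super-diagonal blocks of
    block row i (A[0] and C[-1] unused); d[i] is the right-hand-side block.
    """
    n = len(B)
    Bp, dp = [B[0].copy()], [d[0].copy()]
    for i in range(1, n):                         # forward elimination
        W = A[i] @ np.linalg.inv(Bp[i - 1])
        Bp.append(B[i] - W @ C[i - 1])
        dp.append(d[i] - W @ dp[i - 1])
    x = [None] * n
    x[-1] = np.linalg.solve(Bp[-1], dp[-1])
    for i in range(n - 2, -1, -1):                # back substitution
        x[i] = np.linalg.solve(Bp[i], dp[i] - C[i] @ x[i + 1])
    return np.concatenate(x)

rng = np.random.default_rng(3)
n, b = 8, 4                                       # 8 block rows of 4x4 blocks
B = [rng.standard_normal((b, b)) + 5 * np.eye(b) for _ in range(n)]  # well conditioned
A = [rng.standard_normal((b, b)) for _ in range(n)]
C = [rng.standard_normal((b, b)) for _ in range(n)]
d = [rng.standard_normal(b) for _ in range(n)]
x = block_thomas(A, B, C, d)
```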
Digitally controlled twelve-pulse firing generator
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berde, D.; Ferrara, A.A.
1981-01-01
Control system studies for the Tokamak Fusion Test Reactor (TFTR) indicate that accurate thyristor firing in the AC-to-DC conversion system is required in order to achieve good regulation of the various field currents. Rapid update and exact firing angle control are required to avoid instabilities, large eddy currents, or parasitic oscillations. The Prototype Firing Generator was designed to satisfy these requirements. To achieve the required ±0.77° firing accuracy, a three-phase-locked loop reference was designed; otherwise, the Firing Generator employs digital circuitry. The unit, housed in a standard CAMAC crate, operates under microcomputer control. Functions are performed under program control, which resides in nonvolatile read-only memory. Communication with the CICADA control system is provided via an 11-bit parallel interface.
Application of high-performance computing to numerical simulation of human movement
NASA Technical Reports Server (NTRS)
Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
1995-01-01
We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
Mirbozorgi, S Abdollah; Bahrami, Hadi; Sawan, Mohamad; Gosselin, Benoit
2016-04-01
This paper presents a novel experimental chamber with uniform wireless power distribution in 3D for enabling long-term biomedical experiments with small freely moving animal subjects. The implemented power transmission chamber prototype is based on arrays of parallel resonators and multicoil inductive links, forming a novel and highly efficient wireless power transmission system. The power transmitter unit includes several identical resonators enclosed in a scalable array of overlapping square coils which are connected in parallel to provide uniform power distribution along x and y. Moreover, the proposed chamber uses two arrays of primary resonators, facing each other and connected in parallel, to achieve uniform power distribution along the z axis. Each surface includes 9 overlapped coils connected in parallel and implemented in two layers of an FR4 printed circuit board. The chamber features a natural power localization mechanism, which simplifies its implementation and eases its operation by avoiding the need for active detection and control mechanisms. A single power surface based on the proposed approach can provide a power transfer efficiency (PTE) of 69% and a power delivered to the load (PDL) of 120 mW for a separation distance of 4 cm, whereas the complete chamber prototype provides a uniform PTE of 59% and a PDL of 100 mW in 3D, everywhere inside the chamber, with a size of 27×27×16 cm(3).
Line-by-line spectroscopic simulations on graphics processing units
NASA Astrophysics Data System (ADS)
Collange, Sylvain; Daumas, Marc; Defour, David
2008-01-01
We report here on software that performs line-by-line spectroscopic simulations on gases. Elaborate models (such as narrow band and correlated-K) are accurate and efficient for bands where various components are not simultaneously and significantly active. Line-by-line is probably the most accurate model in the infrared for blends of gases that contain high proportions of H2O and CO2, as was the case for our prototype simulation. Our implementation on graphics processing units sustains a speedup close to 330 on computation-intensive tasks and 12 on memory-intensive tasks compared to implementations on one core of high-end processors. This speedup is due to data parallelism, efficient memory access for specific patterns and some dedicated hardware operators only available in graphics processing units. It is obtained leaving most of the processor resources available, and it would scale linearly with the number of graphics processing units in parallel machines. Line-by-line simulation coupled with simulation of fluid dynamics was long believed to be economically intractable, but our work shows that it could be done with affordable additional resources compared to what is necessary to perform simulations of fluid dynamics alone. Program summary: Program title: GPU4RE. Catalogue identifier: ADZY_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADZY_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 62 776. No. of bytes in distributed program, including test data, etc.: 1 513 247. Distribution format: tar.gz. Programming language: C++. Computer: x86 PC. Operating system: Linux, Microsoft Windows; compilation requires either gcc/g++ under Linux or Visual C++ 2003/2005 and Cygwin under Windows, and has been tested using gcc 4.1.2 under Ubuntu Linux 7.04 and using Visual C++ 2005 with Cygwin 1.5.24 under Windows XP. RAM: 1 gigabyte. Classification: 21.2. External routines: OpenGL (http://www.opengl.org). Nature of problem: simulating radiative transfer in high-temperature, high-pressure gases. Solution method: line-by-line Monte-Carlo ray-tracing. Unusual features: parallel computations are moved to the GPU. Additional comments: an nVidia GeForce 7000 or ATI Radeon X1000 series graphics processing unit is required. Running time: a few minutes.
Parallelized Stochastic Cutoff Method for Long-Range Interacting Systems
NASA Astrophysics Data System (ADS)
Endo, Eishin; Toga, Yuta; Sasaki, Munetaka
2015-07-01
We present a method of parallelizing the stochastic cutoff (SCO) method, which is a Monte-Carlo method for long-range interacting systems. After interactions are eliminated by the SCO method, we subdivide the lattice into noninteracting interpenetrating sublattices. This subdivision enables us to parallelize the Monte-Carlo calculation in the SCO method. Such a subdivision is found by numerically solving the vertex coloring of a graph created by the SCO method. We use an algorithm proposed by Kuhn and Wattenhofer to solve the vertex coloring by parallel computation. This method was applied to a two-dimensional magnetic dipolar system on an L × L square lattice to examine its parallelization efficiency. The results showed that, in the case of L = 2304, the speed of computation increased by a factor of about 102 under parallel computation with 288 processors.
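A hedged sketch of the sublattice-finding step: vertex coloring of the post-SCO interaction graph, where each color class is a noninteracting sublattice whose sites can be updated in parallel. Kuhn and Wattenhofer's algorithm is a distributed variant; the serial greedy version below just shows the idea.

```python
def greedy_coloring(adj):
    """Assign each vertex the smallest color unused by its neighbors.

    adj: dict vertex -> set of neighboring vertices (the interaction
    graph left after the SCO step). Vertices sharing no edge may share
    a color, so each color class is an independently updatable sublattice.
    """
    color = {}
    for v in adj:
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

# Toy interaction graph: a 4-cycle needs only two colors (two sublattices).
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(greedy_coloring(adj))   # e.g. {0: 0, 1: 1, 2: 0, 3: 1}
```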
Design of on-board parallel computer on nano-satellite
NASA Astrophysics Data System (ADS)
You, Zheng; Tian, Hexiang; Yu, Shijie; Meng, Li
2007-11-01
This paper presents a scheme for an on-board parallel computer system designed for a nano-satellite. Driven by the requirements that a nano-satellite have small volume, low weight, low power consumption and on-board intelligence, the scheme departs from the traditional single-computer and dual-computer systems in an effort to improve dependability, capability and intelligence simultaneously. Following an integrated design approach, it employs a shared-memory parallel computer as the main structure; connects the telemetry, attitude control and payload systems by an intelligent bus; designs management software that handles static tasks and dynamic task scheduling and that saves and recovers execution state in accordance with the parallel algorithms; and establishes mechanisms for fault diagnosis, recovery and system reconfiguration. The result is an on-board parallel computer system with high dependability, capability and intelligence, flexible management of hardware resources, a sound software system and good extensibility, consistent with the concept and trend of integrated electronic design.
Um, Keehong; Yoo, Sooyeup
2013-10-01
The digital multiplex (DMX-512) protocol, which carries 512 channels of control information, is increasingly adopted in the design of illumination systems. In conventional light-emitting diode systems, the receivers are connected in parallel and each receiving unit receives all the data from the master dimmer console, but each unit operates by recognizing as its own the data corresponding to the number assigned to that receiver. Because the serial numbers of illumination devices are assigned in binary code, synchronization is too complicated to manage properly. In order to improve the protocol of illumination control systems, we propose a reception algorithm that allows the system to be installed and managed in a simpler and more convenient way. We propose systems for controlling light-emitting diode illumination with simplified receiver slaves adopting the DMX-512 protocol, in which the master console and multiple receiver slaves are connected in a daisy-chain fashion. The DMX-512 data packet is received according to each receiver's position in the chain from the console, without assigning a sequence number to each channel at the receiving device. The purpose of this paper is to design a simple and small-sized controller for lamp and lighting control systems adopting the DMX-512 network.
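A hedged sketch of the positional-addressing idea: each slave in the daisy chain consumes the slice of the 512-channel DMX frame determined by its position, so no per-device address needs configuring. The frame layout and channels-per-slave value are illustrative assumptions, not the paper's design.

```python
FRAME_SIZE = 512          # DMX-512 carries 512 one-byte channel values
CHANNELS_PER_SLAVE = 3    # assumed: one RGB LED fixture per slave

def slave_receive(frame, position):
    """Return this slave's channels, selected purely by chain position."""
    start = position * CHANNELS_PER_SLAVE
    return frame[start:start + CHANNELS_PER_SLAVE]

# The master console sends one frame down the daisy chain.
frame = bytes(range(256)) * 2          # stand-in 512-byte frame
for position in range(4):              # four slaves, no addresses assigned
    r, g, b = slave_receive(frame, position)
    print(f"slave {position}: R={r} G={g} B={b}")
```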
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.
Catarinucci, Luca; Tarricone, Luciano
2009-01-01
The finite difference time domain (FDTD) method is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems, among them those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high spatial resolution is adopted, so that demanding memory and central processing unit power requirements have to be satisfied. To better handle the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
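For orientation, a hedged one-dimensional Yee-style FDTD update showing the leapfrogged E and H updates that parallel and graded-mesh schemes accelerate; normalized units and the soft source are simplifying assumptions.

```python
import numpy as np

nx, nt = 200, 300
ez = np.zeros(nx)                      # electric field samples
hy = np.zeros(nx - 1)                  # magnetic field, staggered half a cell
c = 0.5                                # Courant number (normalized units)

for n in range(nt):
    hy += c * (ez[1:] - ez[:-1])       # H update from the curl of E
    ez[1:-1] += c * (hy[1:] - hy[:-1]) # E update from the curl of H
    ez[nx // 2] += np.exp(-((n - 30) / 10.0) ** 2)  # soft Gaussian source

# In a parallel FDTD, the spatial arrays are split across processors and
# only boundary cells are exchanged each step; a graded mesh varies the
# cell size (and thus the local update cost) across regions.
print(ez.max())
```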
Development of the fibre positioning unit of MOONS
NASA Astrophysics Data System (ADS)
Montgomery, David; Atkinson, David; Beard, Stephen; Cochrane, William; Drass, Holger; Guinouard, Isabelle; Lee, David; Taylor, William; Rees, Phil; Watson, Steve
2016-08-01
The Multi-Object Optical and Near-Infrared Spectrograph (MOONS) will exploit the full 500 square arcmin field of view offered by the Nasmyth focus of the Very Large Telescope and will be equipped with two identical triple-arm cryogenic spectrographs covering the wavelength range 0.64μm-1.8μm, with a multiplex capability of over 1000 fibres. This can be configured to produce spectra for chosen targets and to provide close-proximity sky subtraction if required. The system will have both a medium resolution (R 4000-6000) mode and a high resolution (R 20000) mode. The fibre positioning units (FPUs) are used to position each fibre independently in order to pick off each sub-field of 1.0" within a circular patrol area of 85" on sky (50mm physical diameter). The nominal physical separation between FPUs is 25mm, allowing a 100% overlap in coverage between adjacent units. The design of the fibre positioning units allows parallel and rapid reconfiguration between observations. The kinematic geometry is such that pupil alignment is maintained over the patrol area. This paper presents the design of the Fibre Positioning Units at the preliminary design review and the results of verification testing of the advanced prototypes.
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arampatzis, Giorgos, E-mail: garab@math.uoc.gr; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Plechac, Petr, E-mail: plechac@math.udel.edu
2012-10-01
We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physicochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing exactly the stochastic trajectories, our approach relies on approximating the evolution of observables, such as density, coverage, correlations and so on. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decomposition corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). Based on this operator decomposition, we formulate parallel fractional-step kinetic Monte Carlo algorithms by employing the Trotter theorem and its randomized variants; these schemes (a) are partially asynchronous on each fractional-step time window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communication schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example in Ising-type systems, and we demonstrate the capabilities of the method to simulate complex spatially distributed reactions at very large scales on GPUs. Finally, we discuss work load balancing between processors and propose a re-balancing scheme based on probabilistic mass transport methods.
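A hedged toy of the fractional-step idea: the lattice is split into domains, each advanced by its own local KMC clock within a time window (one Trotter substep), after which boundary information would be synchronized. The flip rule and rates are invented for illustration, and events falling past the window edge are simply discarded, a simplification a real scheme handles more carefully.

```python
import numpy as np

rng = np.random.default_rng(4)
lattice = rng.integers(0, 2, 64)            # toy 1D occupation lattice
rate = 1.0                                  # assumed uniform event rate per site

def kmc_window(sites, dt):
    """Advance one domain's local KMC clock through a window of length dt."""
    t = 0.0
    while True:
        total = rate * len(sites)
        t += rng.exponential(1.0 / total)   # waiting time to the next event
        if t > dt:
            break                           # window exhausted for this substep
        lattice[rng.choice(sites)] ^= 1     # execute one event: flip a site

dt = 0.05
domains = [np.arange(0, 32), np.arange(32, 64)]
for step in range(100):                     # fractional steps (Trotter splitting)
    for sites in domains:                   # independent, hence parallelizable
        kmc_window(sites, dt)
    # ...here processors would exchange boundary states before the next window
```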
Accelerated Adaptive MGS Phase Retrieval
NASA Technical Reports Server (NTRS)
Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang
2011-01-01
The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm that allows parallel processing of applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited to this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time, performing the matrix calculations on nVidia graphics cards. The graphics processing unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model called CUDA is used to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphics cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
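For background, a hedged sketch of the classic Gerchberg-Saxton iteration that MGS modifies: FFTs shuttle between pupil and focal planes, enforcing the measured focal-plane amplitude and the pupil support on each pass. This FFT-heavy inner loop is what maps so well onto GPUs. The data here are synthetic, and this is the textbook algorithm, not JPL's MGS.

```python
import numpy as np

n = 128
support = np.zeros((n, n)); support[32:96, 32:96] = 1.0   # assumed pupil mask
true_phase = np.random.default_rng(5).normal(0, 0.3, (n, n))
measured_amp = np.abs(np.fft.fft2(support * np.exp(1j * true_phase)))

field = support * np.exp(1j * np.zeros((n, n)))           # start with a flat phase
for _ in range(200):                                      # Gerchberg-Saxton loop
    focal = np.fft.fft2(field)                            # pupil -> focal plane
    focal = measured_amp * np.exp(1j * np.angle(focal))   # impose measured amplitude
    pupil = np.fft.ifft2(focal)                           # focal -> pupil plane
    field = support * np.exp(1j * np.angle(pupil))        # impose pupil support

recovered_phase = np.angle(field)                         # wavefront estimate
```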
Parallel dispatch: a new paradigm of electrical power system dispatch
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Jun Jason; Wang, Fei-Yue; Wang, Qiang
Modern power systems are evolving into sociotechnical systems with massive complexity, whose real-time operation and dispatch go beyond human capability. Thus, the need for developing and applying new intelligent power system dispatch tools is of great practical significance. In this paper, we introduce the overall business model of power system dispatch, the top-level design approach of an intelligent dispatch system, and the parallel intelligent technology with its dispatch applications. We expect that a new dispatch paradigm, namely parallel dispatch, can be established by incorporating various intelligent technologies, especially the parallel intelligent technology, to enable secure operation of complex power grids, extend system operators' capabilities, suggest optimal dispatch strategies, and provide decision-making recommendations according to power system operational goals.
Self-assembly in densely grafted macromolecules with amphiphilic monomer units: diagram of states.
Lazutin, A A; Vasilevskaya, V V; Khokhlov, A R
2017-11-22
By means of computer modelling, the self-organization of dense planar brushes of macromolecules with amphiphilic monomer units was addressed and their state diagram was constructed. The diagram of states includes the following regions: disordered arrangement of monomer units with respect to each other, strands composed of a few polymer chains, and lamellae with different domain spacing. The transformation between lamellar structures with different domain spacing occurred within the intermediate region and could proceed through the formation of so-called parking garage structures. The parking garage structure joins the lamellae with large (at the top of the brush) and small (close to the grafted surface) domain spacing, and appears as a system of inclined, locally parallel layers connected with each other by bridges. The parking garage structures were observed for incompatible A and B groups in selective solvents, which results in aggregation of the side B groups and dense packing of amphiphilic macromolecules in the restricted volume of the planar brushes.
A cascaded neuro-computational model for spoken word recognition
NASA Astrophysics Data System (ADS)
Hoya, Tetsuya; van Leeuwen, Cees
2010-03-01
In human speech recognition, words are analysed at both pre-lexical (i.e., sub-word) and lexical (word) levels. The aim of this paper is to propose a constructive neuro-computational model that incorporates both these levels as cascaded layers of pre-lexical and lexical units. The layered structure enables the system to handle the variability of real speech input. Within the model, receptive fields of the pre-lexical layer consist of radial basis functions; the lexical layer is composed of units that perform pattern matching between their internal template and a series of labels, corresponding to the winning receptive fields in the pre-lexical layer. The model adapts through self-tuning of all units, in combination with the formation of a connectivity structure through unsupervised (first layer) and supervised (higher layers) network growth. Simulation studies show that the model can achieve a level of performance in spoken word recognition similar to that of a benchmark approach using hidden Markov models, while enabling parallel access to word candidates in lexical decision making.
NASA Astrophysics Data System (ADS)
Wang, Tai-Han; Huang, Da-Nian; Ma, Guo-Qing; Meng, Zhao-Hai; Li, Ye
2017-06-01
With the continuous development of full tensor gradiometer (FTG) measurement techniques, three-dimensional (3D) inversion of FTG data is becoming increasingly used in oil and gas exploration. For fast processing and interpretation of large-scale high-precision data, the use of the graphics processing unit (GPU) and of preconditioning methods is very important in the data inversion. In this paper, an improved preconditioned conjugate gradient algorithm is proposed by combining the symmetric successive over-relaxation (SSOR) technique and the incomplete Cholesky decomposition conjugate gradient algorithm (ICCG). Since preparing the preconditioner requires extra time, a parallel implementation based on the GPU is proposed. The improved method is then applied to the inversion of noise-contaminated synthetic data to prove its adaptability in the inversion of 3D FTG data. Results show that the parallel SSOR-ICCG algorithm based on an NVIDIA Tesla C2050 GPU achieves a speedup of approximately 25 times that of a serial program using a 2.0 GHz Central Processing Unit (CPU). Real airborne gravity-gradiometry data from the Vinton salt dome (southwest Louisiana, USA) are also considered. Good results are obtained, which verifies the efficiency and feasibility of the proposed parallel method for fast inversion of 3D FTG data.
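A hedged sketch of the preconditioned conjugate gradient skeleton into which an SSOR or incomplete-Cholesky preconditioner slots; a simple Jacobi (diagonal) preconditioner stands in below, and the matrix is a random SPD placeholder rather than a gravity-gradient sensitivity matrix.

```python
import numpy as np

def pcg(A, b, precond, tol=1e-8, maxit=500):
    """Preconditioned conjugate gradients; precond(r) applies M^{-1} r."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = precond(r)                   # SSOR or IC(0) solve would go here
        rz_new = r @ z
        p = z + (rz_new / rz) * p        # conjugate search direction
        rz = rz_new
    return x

rng = np.random.default_rng(6)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)          # SPD stand-in system
b = rng.standard_normal(200)
d = np.diag(A)
x = pcg(A, b, lambda r: r / d)           # Jacobi preconditioner stand-in
```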
Li, Bowei; Jiang, Lei; Xie, Hua; Gao, Yan; Qin, Jianhua; Lin, Bingcheng
2009-09-01
A micropump-actuated negative-pressure pinched injection method is developed for parallel electrophoresis on a multi-channel LIF detection system. The system has a home-made device that individually controls 16-port solenoid valves and a high-voltage power supply. The laser beam used for excitation is distributed to the array of separation channels for detection. The hybrid glass-PDMS microfluidic chip comprises two common reservoirs, four separation channels coupled to their respective pneumatic micropumps, and two reference channels. Because pressure is used as the driving force, the proposed method has no sample bias effect in separation. Only one high-voltage supply is needed for separation, regardless of the number of channels, which is significant for high-throughput analysis, and the time for sample loading is shortened to 1 s. In addition, the integrated micropumps provide a versatile interface for coupling with other functional units to satisfy more complicated demands. The performance is verified by separation of a DNA marker and Hepatitis B virus DNA samples. This method is also expected to show potential throughput for DNA analysis in the field of disease diagnosis.
Mniszewski, S M; Cawkwell, M J; Wall, M E; Mohd-Yusof, J; Bock, N; Germann, T C; Niklasson, A M N
2015-10-13
We present an algorithm for the calculation of the density matrix that for insulators scales linearly with system size and parallelizes efficiently on multicore, shared memory platforms with small and controllable numerical errors. The algorithm is based on an implementation of the second-order spectral projection (SP2) algorithm [ Niklasson, A. M. N. Phys. Rev. B 2002 , 66 , 155115 ] in sparse matrix algebra with the ELLPACK-R data format. We illustrate the performance of the algorithm within self-consistent tight binding theory by total energy calculations of gas phase poly(ethylene) molecules and periodic liquid water systems containing up to 15,000 atoms on up to 16 CPU cores. We consider algorithm-specific performance aspects, such as local vs nonlocal memory access and the degree of matrix sparsity. Comparisons to sparse matrix algebra implementations using off-the-shelf libraries on multicore CPUs, graphics processing units (GPUs), and the Intel many integrated core (MIC) architecture are also presented. The accuracy and stability of the algorithm are illustrated with long duration Born-Oppenheimer molecular dynamics simulations of 1000 water molecules and a 303 atom Trp cage protein solvated by 2682 water molecules.
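For reference, a hedged dense-matrix sketch of the SP2 recursion named above: the Hamiltonian is mapped linearly into [0, 1] and repeatedly squared toward idempotency, choosing X^2 or 2X - X^2 at each step to steer the trace toward the electron count. The paper's contribution is doing this in sparse ELLPACK-R algebra; dense NumPy is used here only for clarity.

```python
import numpy as np

def sp2_density(H, nocc, iters=40):
    """Second-order spectral projection (SP2) density matrix purification."""
    eps = np.linalg.eigvalsh(H)
    emin, emax = eps[0], eps[-1]
    X = (emax * np.eye(H.shape[0]) - H) / (emax - emin)  # map spectrum into [0, 1]
    for _ in range(iters):
        X2 = X @ X                       # dominant cost: the matrix square (sparse in the paper)
        # Pick whichever branch leaves the trace closer to the target occupation.
        if abs(X2.trace() - nocc) < abs(2 * X.trace() - X2.trace() - nocc):
            X = X2                       # lowers the trace
        else:
            X = 2 * X - X2               # raises the trace
    return X

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 50))
H = (A + A.T) / 2                        # stand-in symmetric Hamiltonian
D = sp2_density(H, nocc=20)
print(round(D.trace()))                  # approaches 20 occupied states
```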
Mitterauer, Bernhard J.; Kofler-Westergren, Birgitta
2011-01-01
A model of glial–neuronal interactions is proposed that could explain the demyelination identified in brains with schizophrenia. It is based on two hypotheses: (1) that glia–neuron systems are functionally viable and important for normal brain function, and (2) that disruption of this postulated function disturbs the glial categorization function, as shown by formal analysis. According to this model, in schizophrenia receptors on astrocytes in glial–neuronal synaptic units are not functional, losing their modulatory influence on synaptic neurotransmission. Hence, an unconstrained neurotransmission flux occurs that hyperactivates the axon and floods the cognate neurotransmitter receptors on oligodendrocytes. The excess of neurotransmitters may have a toxic effect on oligodendrocytes and myelin, causing demyelination. In parallel, an increasing impairment of axons may disconnect neuronal networks. It is formally shown how oligodendrocytes normally categorize axonic information processing via their processes. Demyelination decomposes the oligodendrocyte–axonic system, making it incapable of generating categories of information. This incoherence may be responsible for symptoms of disorganization in schizophrenia, such as thought disorder, inappropriate affect and incommunicable motor behavior. In parallel, the loss of oligodendrocytes affects gap junctions in the panglial syncytium, presumably responsible for memory impairment in schizophrenia. PMID:21647404
Combined algorithmic and GPU acceleration for ultra-fast circular conebeam backprojection
NASA Astrophysics Data System (ADS)
Brokish, Jeffrey; Sack, Paul; Bresler, Yoram
2010-04-01
In this paper, we describe the first implementation and performance of a fast O(N³ log N) hierarchical backprojection algorithm for cone beam CT with a circular trajectory [1], developed on a modern Graphics Processing Unit (GPU). The resulting tomographic backprojection system for 3D cone beam geometry combines speedup through algorithmic improvements provided by the hierarchical backprojection algorithm with speedup from a massively parallel hardware accelerator. For data parameters typical in diagnostic CT and using a mid-range GPU card, we report reconstruction speeds of up to 360 frames per second, and a relative speedup of almost 6x compared to conventional backprojection on the same hardware. The significance of these results is twofold. First, they demonstrate that the reduction in operation counts demonstrated previously for the FHBP algorithm can be translated to a comparable run-time improvement in a massively parallel hardware implementation, while preserving stringent diagnostic image quality. Second, the dramatic speedup and throughput numbers achieved indicate the feasibility of systems based on this technology, which achieve real-time 3D reconstruction for state-of-the-art diagnostic CT scanners with small footprint, high reliability, and affordable cost.
Ogawa, Yasushi; Fawaz, Farah; Reyes, Candice; Lai, Julie; Pungor, Erno
2007-01-01
Parameter settings of a parallel line analysis procedure were defined by applying statistical analysis procedures to the absorbance data from a cell-based potency bioassay for a recombinant adenovirus, Adenovirus 5 Fibroblast Growth Factor-4 (Ad5FGF-4). The parallel line analysis was performed with commercially available software, PLA 1.2. The software performs the Dixon outlier test on replicates of the absorbance data, performs linear regression analysis to define the linear region of the absorbance data, and tests parallelism between the linear regions of standard and sample. The width of the fiducial limit, expressed as a percentage of the measured potency, was developed as a criterion for rejection of assay data, significantly improving the reliability of the assay results. With the linear range-finding criteria of the software set to a minimum of 5 consecutive dilutions and best statistical outcome, and in combination with the fiducial limit width acceptance criterion of <135%, 13% of the assay results were rejected. With these criteria applied, the assay was linear over the range of 0.25 to 4 relative potency units, defined as the potency of the sample normalized to the potency of the Ad5FGF-4 standard containing 6 x 10(6) adenovirus particles/mL. The overall precision of the assay was estimated to be 52%. Without the application of the fiducial limit width criterion, the assay results were not linear over the range, and an overall precision of 76% was calculated from the data. An absolute unit of potency for the assay was defined by using the parallel line analysis procedure as the amount of Ad5FGF-4 that results in an absorbance value that is 121% of the average absorbance readings of the wells containing cells not infected with the adenovirus.
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU
NASA Astrophysics Data System (ADS)
Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha
2018-03-01
Since the Graphics Processing Unit (GPU) offers strong floating-point computation capability and memory bandwidth for data parallelism, it has been widely used in areas of general computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of the Compute Unified Device Architecture (CUDA), which reduces the complexity of programming, brings great opportunities to CFD. There are three modes for parallel solution of the NS equations: a parallel solver based on the CPU, a parallel solver based on the GPU, and a heterogeneous parallel solver based on collaborating CPU and GPU. GPUs are relatively rich in compute capacity but poor in memory capacity, and CPUs are the opposite. To make full use of both, a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver's computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experimental results, which demonstrates that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow; it decreases for turbulent flow, but can still reach more than 20. Moreover, the speedup increases as the grid size becomes larger.
NASA Astrophysics Data System (ADS)
Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa
2017-08-01
The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically the E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient in power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster, scaling to 40 CPU nodes and 30 KNL nodes with a parallel efficiency of 70.4 % and 42.2 %, respectively.
NASA Astrophysics Data System (ADS)
Ferrando, N.; Gosálvez, M. A.; Cerdá, J.; Gadea, R.; Sato, K.
2011-03-01
Presently, dynamic surface-based models are required to contain increasingly larger numbers of points and to propagate them over longer time periods. For large numbers of surface points, the octree data structure can be used as a balance between low memory occupation and relatively rapid access to the stored data. For evolution rules that depend on neighborhood states, extended simulation periods can be obtained by using simplified atomistic propagation models, such as the Cellular Automata (CA). This method, however, has an intrinsic parallel updating nature and the corresponding simulations are highly inefficient when performed on classical Central Processing Units (CPUs), which are designed for the sequential execution of tasks. In this paper, a series of guidelines is presented for the efficient adaptation of octree-based, CA simulations of complex, evolving surfaces into massively parallel computing hardware. A Graphics Processing Unit (GPU) is used as a cost-efficient example of the parallel architectures. For the actual simulations, we consider the surface propagation during anisotropic wet chemical etching of silicon as a computationally challenging process with a wide-spread use in microengineering applications. A continuous CA model that is intrinsically parallel in nature is used for the time evolution. Our study strongly indicates that parallel computations of dynamically evolving surfaces simulated using CA methods are significantly benefited by the incorporation of octrees as support data structures, substantially decreasing the overall computational time and memory usage.
FPGA implementation of sparse matrix algorithm for information retrieval
NASA Astrophysics Data System (ADS)
Bojanic, Slobodan; Jevtic, Ruzica; Nieto-Taladriz, Octavio
2005-06-01
Information text data retrieval requires a tremendous amount of processing time because of the size of the data and the complexity of information retrieval algorithms. In this paper a solution to this problem is proposed via hardware-supported information retrieval algorithms. Reconfigurable computing accommodates frequent hardware modifications through its tailorable hardware and exploits parallelism for a given application through reconfigurable and flexible hardware units. The degree of parallelism can be tuned to the data. In this work we implemented the standard BLAS (basic linear algebra subprogram) sparse matrix algorithm named Compressed Sparse Row (CSR), which is shown to be more efficient in terms of storage space requirements and query-processing time than other sparse matrix algorithms for information retrieval applications. Although the inverted index algorithm has been treated as the de facto standard for information retrieval for years, an alternative approach that stores the index of a text collection in a sparse matrix structure has gained attention. This approach performs query processing using sparse matrix-vector multiplication and, due to parallelization, achieves substantial efficiency gains over the sequential inverted index. The parallel implementations of the information retrieval kernel presented in this work target a Virtex-II Field Programmable Gate Array (FPGA) board from Xilinx. A recent development in scientific applications is the use of FPGAs to achieve high-performance results. Computational results are compared to implementations on other platforms. The design achieves a high level of parallelism for the overall function while retaining highly optimised hardware within the processing unit.
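A hedged reference sketch of CSR storage and the sparse matrix-vector product used for query processing; in an FPGA design the independent row loop is what gets unrolled across parallel processing units. The toy matrix is invented.

```python
import numpy as np

# CSR storage of a 4x4 term-document matrix (toy example):
# nonzero values and their column indices, plus row start offsets.
data = np.array([3.0, 1.0, 2.0, 4.0, 5.0])
indices = np.array([0, 2, 1, 0, 3])       # column of each nonzero
indptr = np.array([0, 2, 3, 3, 5])        # row i occupies data[indptr[i]:indptr[i+1]]

def csr_matvec(data, indices, indptr, x):
    """y = A @ x with A in CSR; rows are independent, hence parallelizable."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):               # each row could map to one hardware unit
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

query = np.array([1.0, 0.0, 1.0, 0.0])    # stand-in query vector
print(csr_matvec(data, indices, indptr, query))   # document scores: [4. 0. 0. 4.]
```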
Design of high-performance parallelized gene predictors in MATLAB.
Rivard, Sylvain Robert; Mailloux, Jean-Gabriel; Beguenane, Rachid; Bui, Hung Tien
2012-04-10
This paper proposes a method of implementing parallel gene prediction algorithms in MATLAB. The proposed designs are based either on Goertzel's algorithm or on FFTs and have been implemented using varying amounts of parallelism on a central processing unit (CPU) and on a graphics processing unit (GPU). Results show that an implementation using a straightforward approach can require over 4.5 h to process 15 million base pairs (bps), whereas a properly designed one can perform the same task in less than five minutes. In the best case, a GPU implementation can yield these results in 57 s. The present work shows how parallelism can be used in MATLAB for gene prediction in very large DNA sequences to produce results that are over 270 times faster than a conventional approach. This is significant as MATLAB is typically overlooked due to its apparently slow processing time, even though it offers a convenient environment for bioinformatics. From a practical standpoint, this work proposes two strategies for accelerating genome data processing which rely on different parallelization mechanisms. Using a CPU, the work shows that direct access to the MEX function increases execution speed and that the PARFOR construct should be used in order to take full advantage of the parallelizable Goertzel implementation. When the target is a GPU, the work shows that data needs to be segmented into manageable sizes within the GFOR construct before processing in order to minimize execution time.
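For background, a hedged sketch of Goertzel's algorithm applied the way gene predictors commonly use it: evaluating the DFT of base-indicator sequences at the period-3 frequency, whose peaks mark protein-coding regions. The toy sequence is illustrative; this is not the paper's MATLAB code.

```python
import math

def goertzel_power(x, k, n):
    """Power of DFT bin k of the length-n sequence x via the Goertzel recursion."""
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev = s_prev2 = 0.0
    for sample in x:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

seq = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"     # toy DNA window (36 bases)
n = len(seq)
# Period-3 power summed over the four base-indicator sequences; windows with
# high values are candidate coding regions.
p3 = sum(goertzel_power([1.0 if b == base else 0.0 for b in seq], n // 3, n)
         for base in "ACGT")
print(p3)
```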
Real-time SHVC software decoding with multi-threaded parallel processing
NASA Astrophysics Data System (ADS)
Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu
2014-09-01
This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of the SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two-layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high-level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline-designed with multiple threads based on groups of coding tree units (CTUs). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7-2600 processor running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads is compared in terms of decoding speed and resource usage, including processor and memory.
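A hedged miniature of the CTU-group pipeline described above: stages run as threads connected by bounded queues, each handing a group of CTUs to the next stage, with the group size trading parallelism against synchronization. The stage functions are placeholders, not SHM-2.0 code.

```python
import queue, threading

def stage(name, inbox, outbox, work):
    while True:
        ctu_group = inbox.get()          # a group of CTUs is the pipeline unit
        if ctu_group is None:            # poison pill: shut this stage down
            if outbox: outbox.put(None)
            break
        result = work(ctu_group)
        if outbox: outbox.put(result)

q1, q2 = queue.Queue(4), queue.Queue(4)  # bounded queues throttle the pipeline
stages = [
    threading.Thread(target=stage, args=("entropy", q1, q2, lambda g: g)),
    threading.Thread(target=stage, args=("reconstruct+loopfilter", q2, None,
                                         lambda g: print("decoded", g))),
]
for t in stages: t.start()
for group in ["CTUs 0-15", "CTUs 16-31", "CTUs 32-47"]:
    q1.put(group)                        # feed CTU groups into the first stage
q1.put(None)                             # flush and terminate the pipeline
for t in stages: t.join()
```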
System-wide power management control via clock distribution network
Coteus, Paul W.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Reed, Don D.
2015-05-19
An apparatus, method and computer program product for automatically controlling power dissipation of a parallel computing system that includes a plurality of processors. A computing device issues a command to the parallel computing system. A clock pulse-width modulator encodes the command in a system clock signal to be distributed to the plurality of processors. The plurality of processors in the parallel computing system receive the system clock signal including the encoded command, and adjusts power dissipation according to the encoded command.
NASA Technical Reports Server (NTRS)
Reinsch, K. G. (Editor); Schmidt, W. (Editor); Ecer, A. (Editor); Haeuser, Jochem (Editor); Periaux, J. (Editor)
1992-01-01
A conference was held on parallel computational fluid dynamics, and related papers were produced. Topics discussed in these papers include: parallel implicit and explicit solvers for compressible flow, parallel computational techniques for the Euler and Navier-Stokes equations, grid generation techniques for parallel computers, and aerodynamic simulation on massively parallel systems.
Andrade, Xavier; Aspuru-Guzik, Alán
2013-10-08
We discuss the application of graphical processing units (GPUs) to accelerate real-space density functional theory (DFT) calculations. To make our implementation efficient, we have developed a scheme to expose the data parallelism available in the DFT approach; this is applied to the different procedures required for a real-space DFT calculation. We present results for current-generation GPUs from AMD and Nvidia, which show that our scheme, implemented in the free code Octopus, can reach a sustained performance of up to 90 GFlops for a single GPU, representing a significant speed-up when compared to the CPU version of the code. Moreover, for some systems, our implementation can outperform a GPU Gaussian basis set code, showing that the real-space approach is a competitive alternative for DFT simulations on GPUs.
Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL
NASA Astrophysics Data System (ADS)
Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe
2016-10-01
This work presents an effective implementation of a numerical model for complete eco-hydrological cellular automata modeling on graphics processing units (GPU) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc.), showing increasing performance improvements in terms of speedup, and some original optimization strategies were also adopted. Moreover, numerical analysis of the results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) has proven satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83 GHz), one AMD R9 280X GPU, and one nVIDIA Tesla K20c GPU. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit parallel system resources.
Data Partitioning and Load Balancing in Parallel Disk Systems
NASA Technical Reports Server (NTRS)
Scheuermann, Peter; Weikum, Gerhard; Zabback, Peter
1997-01-01
Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible ways, namely via inter-request and intra-request parallelism. In this paper we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent, self-reliant file system that aims to optimize striping by taking into account the requirements of the applications, and that performs load balancing by judicious file allocation and dynamic redistribution of the data when access patterns change. Our system uses simple but effective heuristics that incur little overhead. We present performance experiments based on synthetic workloads and real-life traces.
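As a concrete illustration of striping (not taken from the paper), the following Python sketch maps logical file blocks to disks round-robin; the striping unit controls the trade-off between intra-request parallelism (one large request spans many disks) and inter-request parallelism (independent small requests usually land on different disks).

```python
def stripe_map(block, unit, disks):
    """Map a logical block number to (disk, physical block) under striping."""
    stripe = block // unit                 # index of the striping unit
    disk = stripe % disks                  # units are laid out round-robin
    physical = (stripe // disks) * unit + block % unit
    return disk, physical

# A 12-block file with a striping unit of 2 blocks spread over 3 disks.
for block in range(12):
    print(block, stripe_map(block, unit=2, disks=3))
```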
Conceptual design of a hybrid parallel mechanism for mask exchanging of TMT
NASA Astrophysics Data System (ADS)
Wang, Jianping; Zhou, Hongfei; Li, Kexuan; Zhou, Zengxiang; Zhai, Chao
2015-10-01
The mask exchange system is an important part of the Multi-Object Broadband Imaging Echellette (MOBIE) on the Thirty Meter Telescope (TMT). To solve the problem of the stiffness of the mask exchange system in the MOBIE changing with the gravity vector, a hybrid parallel mechanism design method was adopted throughout this research. Combining the high stiffness and precision of a parallel structure with the large range of motion of a serial structure, a conceptual design of a hybrid parallel mask exchange system based on the 3-RPS parallel mechanism is presented. According to the position requirements of the MOBIE, a SolidWorks structural model of the hybrid parallel mask exchange robot was established, and an appropriate installation position that does not interfere with the related components and light path in the MOBIE of TMT was identified. Simulation results in SolidWorks suggested that the 3-RPS parallel platform has good stiffness properties in different gravity vector directions. Furthermore, through analysis of the mechanism theory, the inverse kinematics of the 3-RPS parallel platform were derived, and the mathematical relationship between the attitude angle of the moving platform and the angles of the ball hinges on the moving platform was established, in order to analyze the attitude adjustment ability of the hybrid parallel mask exchange robot. The proposed conceptual design offers guidance for the design of the mask exchange system of the MOBIE on TMT.
Orthorectification by Using Gpgpu Method
NASA Astrophysics Data System (ADS)
Sahin, H.; Kulur, S.
2012-07-01
Thanks to the nature of graphics processing, newly released products offer highly parallel processing units with high memory bandwidth and computational power exceeding a teraflop per second. Modern GPUs are not only powerful graphics engines but also highly parallel programmable processors, with much faster computing capability and higher memory bandwidth than central processing units (CPUs). Data-parallel computation can be described briefly as mapping data elements to parallel processing threads. The rapid growth in the programmability and capability of GPUs has attracted the attention of researchers dealing with complex problems that require heavy computation, giving rise to the concepts of general-purpose computation on graphics processing units (GPGPU) and stream processing. Graphics processors are powerful yet inexpensive and affordable hardware, which has made them an alternative to conventional processors: graphics chips that were once fixed-function hardware have been transformed into modern, powerful, programmable processors. The biggest obstacle is that GPUs use programming models that differ from current programming methods; efficient GPU programming therefore requires re-coding the existing algorithm to respect the limitations and structure of the graphics hardware, and such many-core processors cannot be programmed with traditional, event-driven procedural methods. GPUs are especially effective at repeating the same computing steps over many data elements when high accuracy is needed, performing such computations quickly and accurately; by comparison, CPUs, which perform one computation at a time according to the flow control, are slower. This study covers how general-purpose parallel programming and the computational power of GPUs can be used in photogrammetric applications, especially direct georeferencing. The direct georeferencing algorithm was coded using the GPGPU method and the CUDA (Compute Unified Device Architecture) programming language, and the results were compared with a traditional CPU implementation. In a second application, projective rectification was coded using the GPGPU method and CUDA, and sample images of various sizes were evaluated against the program's results. The GPGPU method is especially useful for repeating the same computations on highly dense data, and thus for finding solutions quickly.
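As an illustration of the per-pixel data parallelism that makes rectification a good GPGPU fit, here is a hedged NumPy sketch of projective rectification by inverse mapping through a homography H; on a GPU each output pixel would be one thread. The function and matrix are illustrative assumptions, not the authors' CUDA code.

```python
import numpy as np

def rectify(image, H, out_shape):
    """Inverse-map every output pixel through homography H (nearest neighbour).

    Every output pixel is computed independently, which is what makes
    this a natural one-thread-per-pixel GPU kernel.
    """
    h_out, w_out = out_shape
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    pts = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    src = np.linalg.inv(H) @ pts                 # output coords back to source
    src_x = (src[0] / src[2]).round().astype(int)
    src_y = (src[1] / src[2]).round().astype(int)
    h, w = image.shape[:2]
    valid = (0 <= src_x) & (src_x < w) & (0 <= src_y) & (src_y < h)
    out = np.zeros(h_out * w_out, dtype=image.dtype)
    out[valid] = image[src_y[valid], src_x[valid]]
    return out.reshape(out_shape)

img = np.arange(16, dtype=np.uint8).reshape(4, 4)
H = np.eye(3)                                    # identity homography smoke test
print(rectify(img, H, (4, 4)))
```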
Support for Debugging Automatically Parallelized Programs
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)
2001-01-01
We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.
Relative Debugging of Automatically Parallelized Programs
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)
2002-01-01
We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular, the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.
The 20 kW battery study program
NASA Technical Reports Server (NTRS)
1971-01-01
Six battery configurations were selected for detailed study and these are described. A computer program was modified for use in estimation of the weights, costs, and reliabilities of each of the configurations, as a function of several important independent variables, such as system voltage, battery voltage ratio (battery voltage/bus voltage), and the number of parallel units into which each of the components of the power subsystem was divided. The computer program was used to develop the relationship between the independent variables alone and in combination, and the dependent variables: weight, cost, and availability. Parametric data, including power loss curves, are given.
DOE Office of Scientific and Technical Information (OSTI.GOV)
McCormick, B.H.; Narasimhan, R.
1963-01-01
The overall computer system contains three main parts: an input device, a pattern recognition unit (PRU), and a control computer. The bubble chamber picture is divided into a grid of 1-mm squares on the film. It is then processed in parallel in a two-dimensional array of 1024 identical processing modules (stalactites) of the PRU. The array can function as a two-dimensional shift register in which the results of successive shifting operations can be accumulated. The pattern recognition process is generally controlled by a conventional arithmetic computer. (A.G.W.)
Implementation of a production Ada project: The GRODY study
NASA Technical Reports Server (NTRS)
Godfrey, Sara; Brophy, Carolyn Elizabeth
1989-01-01
The use of the Ada language and design methodologies that encourage full use of its capabilities have a strong impact on all phases of the software development project life cycle. At the National Aeronautics and Space Administration/Goddard Space Flight Center (NASA/GSFC), the Software Engineering Laboratory (SEL) conducted an experiment in parallel development of two flight dynamics systems in FORTRAN and Ada. The differences observed during the implementation, unit testing, and integration phases of the two projects are described and the lessons learned during the implementation phase of the Ada development are outlined. Included are recommendations for future Ada development projects.
Malleable architecture generator for FPGA computing
NASA Astrophysics Data System (ADS)
Gokhale, Maya; Kaba, James; Marks, Aaron; Kim, Jang
1996-10-01
The malleable architecture generator (MARGE) is a tool set that translates high-level parallel C to configuration bit streams for field-programmable logic based computing systems. MARGE creates an application-specific instruction set and generates the custom hardware components required to perform exactly those computations specified by the C program. In contrast to traditional fixed-instruction processors, MARGE's dynamic instruction set creation provides for efficient use of hardware resources. MARGE processes intermediate code in which each operation is annotated by the bit lengths of the operands. Each basic block (sequence of straight line code) is mapped into a single custom instruction which contains all the operations and logic inherent in the block. A synthesis phase maps the operations comprising the instructions into register transfer level structural components and control logic which have been optimized to exploit functional parallelism and function unit reuse. As a final stage, commercial technology-specific tools are used to generate configuration bit streams for the desired target hardware. Technology- specific pre-placed, pre-routed macro blocks are utilized to implement as much of the hardware as possible. MARGE currently supports the Xilinx-based Splash-2 reconfigurable accelerator and National Semiconductor's CLAy-based parallel accelerator, MAPA. The MARGE approach has been demonstrated on systolic applications such as DNA sequence comparison.
LSPR chip for parallel, rapid, and sensitive detection of cancer markers in serum.
Aćimović, Srdjan S; Ortega, Maria A; Sanz, Vanesa; Berthelot, Johann; Garcia-Cordero, Jose L; Renger, Jan; Maerkl, Sebastian J; Kreuzer, Mark P; Quidant, Romain
2014-05-14
Label-free biosensing based on metallic nanoparticles supporting localized surface plasmon resonances (LSPR) has recently received growing interest (Anker, J. N., et al. Nat. Mater. 2008, 7, 442-453). Besides its competitive sensitivity (Yonzon, C. R., et al. J. Am. Chem. Soc. 2004, 126, 12669-12676; Svendendahl, M., et al. Nano Lett. 2009, 9, 4428-4433) when compared to the surface plasmon resonance (SPR) approach based on extended metal films, LSPR biosensing features a high-end miniaturization potential and a significant reduction of the interrogation device bulkiness, positioning itself as a promising candidate for point-of-care diagnostic and field applications. Here, we present the first parallelized LSPR lab-on-a-chip realization that goes well beyond the state of the art, by uniting the latest advances in plasmonics, nanofabrication, microfluidics, and surface chemistry. Our system offers parallel, real-time inspection of 32 sensing sites distributed across 8 independent microfluidic channels with very high reproducibility/repeatability. This enables us to test various sensing strategies for the detection of biomolecules. In particular we demonstrate the fast detection of relevant cancer biomarkers (human alpha-feto-protein and prostate specific antigen) down to concentrations of 500 pg/mL in a complex matrix consisting of 50% human serum.
NASA Technical Reports Server (NTRS)
Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan
1994-01-01
A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS, with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to and from the working memory of remote nodes. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also been used to run several benchmark parallel knowledge bases, such as one that sets up a cafeteria. Results from running Parallel CLIPS with parallel knowledge base partitions indicate that significant speed increases, including superlinear ones in some cases, are possible.
Modelling parallel programs and multiprocessor architectures with AXE
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Fineman, Charles E.
1991-01-01
AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulations, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players; their use and behavior are described. Performance data of the multiprocessor model can be observed on a color screen, including CPU and message routing bottlenecks and the dynamic status of the software.
AZTEC. Parallel Iterative method Software for Solving Linear Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hutchinson, S.; Shadid, J.; Tuminaro, R.
1995-07-01
AZTEC is an iterative-solver library that greatly simplifies the parallelization process when solving linear systems of equations Ax = b, where A is a user-supplied n x n sparse matrix, b is a user-supplied vector of length n, and x is a vector of length n to be computed. AZTEC is intended as a software tool for users who want to avoid cumbersome parallel programming details but who have large sparse linear systems that require an efficiently utilized parallel processing system. A collection of data transformation tools is provided that allows easy creation of distributed sparse unstructured matrices for parallel solution.
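For orientation, the sketch below shows one representative Krylov iterative method of the kind such libraries provide, conjugate gradients for symmetric positive-definite systems, in plain Python. The dominant kernels, matrix-vector products and dot products, are exactly the operations a parallel library distributes across processors; this is an illustrative sketch, not AZTEC's implementation.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for symmetric positive-definite A by conjugate gradients."""
    x = np.zeros_like(b)
    r = b - A @ x                    # initial residual
    p = r.copy()                     # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                   # the matvec a parallel solver distributes
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))      # approx [0.0909, 0.6364]
```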
Bit-parallel arithmetic in a massively-parallel associative processor
NASA Technical Reports Server (NTRS)
Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.
1992-01-01
A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations for both integer and floating-point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m^2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.
Equalizer: a scalable parallel rendering framework.
Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato
2009-01-01
Continuing improvements in CPU and GPU performance as well as increasing multi-core processor and cluster-based parallelism demand flexible and scalable parallel rendering solutions that can exploit multipipe hardware-accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop, and often only application-specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic enough to support various types of data and visualization applications and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems, ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture and the basic API, discuss its advantages over previous approaches, and present example configurations, usage scenarios, and scalability results.
ERIC Educational Resources Information Center
Jansen, Jonathan D.
2006-01-01
The parallels between South Africa and the United States run deep. For the United States, that moment of transition, at least as far as education is concerned, was the landmark ruling of 1954, described in the shorthand, "Brown v. Board of Education"; for South Africa, that moment came 40 years later when every citizen could, for the…
On-ground calibration of the ART-XC/SRG mirror system and detector unit at IKI. Part I
NASA Astrophysics Data System (ADS)
Pavlinsky, M.; Tkachenko, A.; Levin, V.; Krivchenko, A.; Rotin, A.; Kuznetsova, M.; Lapshov, I.; Krivonos, R.; Semena, A.; Semena, N.; Serbinov, D.; Shtykovsky, A.; Yaskovich, A.; Oleinikov, V.; Glushenko, A.; Mereminskiy, I.; Molkov, S.; Sazonov, S.; Arefiev, V.
2018-05-01
From October 2016 to September 2017, we performed tests of the ART-XC/SRG spare mirror system and detector unit at the 60-m-long IKI X-ray test facility. We describe some technical features of this test facility. We also present a brief description of the ART-XC mirror system and focal detectors. The nominal focal length of the ART-XC optics is 2700 mm. The field of view is determined by the combination of the mirror system and the detector unit and is equal to ˜0.31 square degrees. The declared operating energy range is 5-30 keV. During the tests, we illuminated the detector with a 55Fe + 241Am calibration source and also with a quasi-parallel X-ray beam. The calibration source is integrated into the detector's collimator. The X-ray beam was generated by a set of Oxford Instruments X-ray tubes with Cr, Cu and Mo targets and an Amptek miniature X-ray tube (Mini-X) with an Ag transmission target. The detector was exposed to the X-ray beam either directly or through the mirror system. We present the obtained results on the detector's energy resolution, the on-ground muon background level and the energy dependence of the W90 value. The accuracy of a mathematical model of the ART-XC mirror system, based on ray-tracing simulations, proves to be within 3.5% in the main energy range of 4-20 keV and 5.4% in the "hard" energy range of 20-40 keV.
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
DOE Office of Scientific and Technical Information (OSTI.GOV)
Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scales to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g., Reduction vs. Forall) yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique being integrated into future compilers or optimization frameworks for autotuning.
Zhang, Zhen; Yan, Peng; Jiang, Huan; Ye, Peiqing
2014-09-01
In this paper, we consider discrete time-varying internal model-based control design for high-precision tracking of complicated reference trajectories generated by time-varying systems. Based on a novel parallel time-varying internal model structure, asymptotic tracking conditions for the design of internal model units are developed, and a low-order robust time-varying stabilizer is further synthesized. In a discrete-time setting, the high-precision tracking control architecture is deployed on a Voice Coil Motor (VCM) actuated servo gantry system, where numerical simulations and real-time experimental results are provided, achieving tracking errors of around 3.5‰ for frequency-varying signals. Copyright © 2014 ISA. Published by Elsevier Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
1979-01-01
The preliminary design for a prototype small (20 kWe) solar thermal electric generating unit was completed, consisting of several subsystems. The concentrator and the receiver collect solar energy and a thermal buffer storage with a transport system is used to provide a partially smoothed heat input to the Stirling engine. A fossil-fuel combustor is included in the receiver designs to permit operation with partial or no solar insolation (hybrid). The engine converts the heat input into mechanical action that powers a generator. To obtain electric power on a large scale, multiple solar modules will be required to operate in parallel. The small solar electric power plant used as a baseline design will provide electricity at remote sites and small communities.
Causes of High-temperature Superconductivity in the Hydrogen Sulfide Electron-phonon System
NASA Astrophysics Data System (ADS)
Degtyarenko, N. N.; Mazur, E. A.
The electron and phonon spectra, as well as the densities of electron and phonon states, of the stable orthorhombic structure of hydrogen sulfide (SH2) at pressures of 100-180 GPa have been calculated. It is found that a set of parallel planes of hydrogen atoms forms at a pressure of ∼175 GPa as a result of structural changes in the unit cell of the crystal under pressure, with complete concentration of the hydrogen atoms in these planes. As a result, the electron properties of the system acquire a quasi-two-dimensional character. The features of the in-phase and antiphase oscillations of hydrogen atoms in these planes, which lead to two narrow high-energy peaks in the phonon density of states, are investigated.
Reasons for high-temperature superconductivity in the electron-phonon system of hydrogen sulfide
NASA Astrophysics Data System (ADS)
Degtyarenko, N. N.; Mazur, E. A.
2015-08-01
We have calculated the electron and phonon spectra, as well as the densities of the electron and phonon states, of the stable orthorhombic structure of hydrogen sulfide SH2 in the pressure interval 100-180 GPa. It is found that at a pressure of 175 GPa, a set of parallel planes of hydrogen atoms is formed due to a structural modification of the unit cell under pressure with complete accumulation of all hydrogen atoms in these planes. As a result, the electronic properties of the system become quasi-two-dimensional. We have also analyzed the collective synphase and antiphase vibrations of hydrogen atoms in these planes, leading to the occurrence of two high-energy peaks in the phonon density of states.
The illegal and the dead: are Mexicans renewable energy?
Smith-Nonini, Sandy
2011-01-01
This article reflects on the production of injury and death among Latino workers in the agro-industrial food complex, with attention to systemic relationships between the United States and Mexico in the post-North American Free Trade Agreement period, which has been characterized by waves of new labor migration that directly enhanced US agricultural profitability. The article draws parallels between literatures on labor productivity and new writings on energy and sustainable agriculture. It examines the usefulness of embodiment as a dialectical approach to eco-social theory, and the concept of "body politic," or a politics of moral ecology, as a means of reasserting the human shape of production systems that have become deformed by the impersonal calculus of neoliberal capitalism.
[Signal reception and processing by the retina].
Eysel, U
2007-01-01
Phototransduction occurs in the retina, which, as an outsourced part of the brain, fulfills important tasks in the neuronal processing of image analysis relevant to perception. Interlinked biochemical cycles with immense amplification factors transform the electromagnetic waves of light into neuronal activity, and photochemical adaptation allows adjustment to light intensities spanning more than 10 logarithmic units. Beginning with its dual system of photoreceptors, with highly sensitive rods and a color-sensitive cone system, the retina, with between 50 and 100 main cell types, is characterized by complex neuronal circuits. The resulting center-surround antagonism of the receptive fields serves, amongst other things, to amplify intensity and color contrasts. Specialized ganglion cell types give rise to parallel signaling pathways into the higher visual centers of the brain.
Economics. Teacher's Guide [and Student Guide]. Parallel Alternative Strategies for Students (PASS).
ERIC Educational Resources Information Center
Chambliss, Robert, Ed.; Fresen, Sue, Ed.
This teacher's guide and student guide unit contains supplemental readings, activities, and methods adapted for secondary students who have disabilities and other students with diverse learning needs. The curriculum correlates to Florida's Sunshine State Standards and is divided into the following six units of study: (1) introduction to economics,…
ERIC Educational Resources Information Center
Schaap, Eileen, Ed.; Fresen, Sue, Ed.
This teacher's guide and student guide unit contains supplemental readings, activities, and methods adapted for secondary students who have disabilities and other students with diverse learning needs. The materials differ from standard textbooks and workbooks in several ways: simplified text; smaller units of study; reduced vocabulary level;…
ERIC Educational Resources Information Center
Danner, Greg, Ed.; Fresen, Sue, Ed.
This teacher's guide and student guide unit contains supplemental readings, activities, and methods adapted for secondary students who have disabilities and other students with diverse learning needs. The materials are designed to help these students succeed in regular education content courses and include simplified text and smaller units of…
A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116
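To illustrate the kind of adaptive array processing involved (not the authors' algorithm), here is a hedged NumPy sketch of a minimum-variance (MVDR) beamformer that keeps unit gain toward the satellite direction while nulling a strong jammer; the array geometry, signal model, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def mvdr_weights(R, steering):
    """MVDR weights: unit gain toward the steering direction while
    minimizing total (jammer + noise) output power."""
    w = np.linalg.inv(R) @ steering
    return w / (steering.conj() @ w)

def steering_vector(theta_deg, n=4):
    """Four-element uniform linear array, half-wavelength spacing (assumed)."""
    k = np.arange(n)
    return np.exp(1j * np.pi * k * np.sin(np.radians(theta_deg)))

jammer = steering_vector(40.0)
snapshots = (np.outer(jammer, 10 * rng.normal(size=1000))      # strong jammer
             + rng.normal(size=(4, 1000)) + 1j * rng.normal(size=(4, 1000)))
R = snapshots @ snapshots.conj().T / 1000                      # sample covariance
w = mvdr_weights(R, steering_vector(0.0))                      # protect boresight
gain = lambda th: abs(w.conj() @ steering_vector(th))
print("gain toward satellite:", gain(0.0))   # ~1 by construction
print("gain toward jammer:  ", gain(40.0))   # strongly suppressed
```

Each antenna channel and each frequency bin can be processed independently, which is the data parallelism a GPU-based SDR exploits.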
A real-time capable software-defined receiver using GPU for adaptive anti-jam GPS sensors.
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities.
Factors Affecting the Length of Stay in the Intensive Care Unit: Our Clinical Experience
Sengul Samanci, Nilay; Akkoc, İbrahim; Yucetas, Esma; Cebeci, Egemen; Ozturk, Savas
2018-01-01
Background and Aim Lengths of stay in the intensive care unit (ICU) due to life-threatening diseases are increasing worldwide. The primary goal in the ICU is to decrease length of stay in order to improve the quality of medical care and reduce cost. The aim of our study is to identify and categorize the factors associated with prolonged stays in the ICU. Materials and Method We retrospectively analyzed 3925 patients. We obtained the patients' demographic, clinical, diagnostic, and physiologic variables, mortality, and lengths of stay by examining the intensive care unit database records. Results The mean age of the study population was 61.6 ± 18.9 years. The average length of stay in the intensive care unit was 10.2 ± 25.2 days. The most common cause of hospitalization was multiple diseases (19.5%). The length of stay was positively correlated with urea, creatinine, and sodium, and negatively correlated with uric acid and hematocrit levels. Length of stay was significantly higher in patients not operated on than in patients operated on (p < 0.001). Conclusion Our study showed a significantly increased length of stay in patients with cardiovascular system diseases, multiple diseases, nervous system diseases, and cerebrovascular diseases. Moreover, we showed that as urea, creatinine, and sodium values increase, the length of stay increases in parallel. PMID:29750174
NASA Astrophysics Data System (ADS)
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
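A minimal NumPy sketch of the checkerboard Metropolis update that GPU Ising codes typically build on follows; sites of one checkerboard colour interact only with the other colour, so each half-sweep is embarrassingly parallel. The parameters are illustrative, and this is not the author's GPU code.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_sweep(spins, beta):
    """One checkerboard Metropolis sweep of the 2D Ising model (J = 1).

    All sites of one colour can be updated simultaneously because none
    of them are neighbours -- the structure a GPU exploits.
    """
    L = spins.shape[0]
    colour_of = np.add.outer(np.arange(L), np.arange(L)) % 2
    for colour in (0, 1):
        neighbours = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0) +
                      np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
        dE = 2.0 * spins * neighbours              # energy cost of a flip
        accept = rng.random(spins.shape) < np.exp(-beta * dE)
        spins = np.where((colour_of == colour) & accept, -spins, spins)
    return spins

spins = rng.choice(np.array([-1, 1]), size=(64, 64))
for _ in range(200):
    spins = metropolis_sweep(spins, beta=0.5)      # below T_c: ordered phase
print("magnetisation per site:", spins.mean())
```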
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dr. Dale M. Snider
2011-02-28
This report gives the results from the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (general-purpose graphics processing units). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The problem test case used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with the number of particles, giving greater than 12x speedup. Phase-1 work provided a path for data structure modifications to give good parallel performance while keeping a friendly environment for new physics development and code maintenance. The implementation of data structure changes will be in Phase-2. Phase-1 laid the groundwork for the complete parallelization of Barracuda in Phase-2, with the caveat that the computer practices for parallel programming implemented in Phase-1 give immediate speedup in the current serial Barracuda code. The Phase-1 tasks were completed successfully, laying the framework for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1, and because advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days. Tasks completed in Phase-1 are: Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (see Section Choosing a Function to Accelerate). Task 2: Select a GPU consultant company and jointly parallelize subroutines (CPFD chose the small business EMPhotonics as the Phase-1 technical partner; see Section Technical Objective and Approach). Task 3: Integrate parallel subroutines into Barracuda (see Section Results from Phase-1 and its subsections). Task 4: Testing, refinement, and optimization of the parallel methodology (see Section Results from Phase-1 and Section Result Comparison Program). Task 5: Integrate Phase-1 parallel subroutines into Barracuda and release (see Section Results from Phase-1 and its subsections). Task 6: Roadmap for Phase-2 (see Section Plan for Phase-2). With the completion of Phase-1 we have the base understanding needed to completely parallelize Barracuda. An overview of the work to move Barracuda to a parallelized code is given in Plan for Phase-2.
A Framework for Load Balancing of Tensor Contraction Expressions via Dynamic Task Partitioning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lai, Pai-Wei; Stock, Kevin; Rajbhandari, Samyam
In this paper, we introduce Dynamic Load-balanced Tensor Contractions (DLTC), a domain-specific library for efficient task-parallel execution of tensor contraction expressions, a class of computation encountered in quantum chemistry and physics. Our framework decomposes each contraction into smaller units of tasks, represented by an abstraction referred to as iterators. We exploit an extra level of parallelism by having tasks across independent contractions executed concurrently through a dynamic load balancing runtime. We demonstrate the improved performance, scalability, and flexibility for the computation of tensor contraction expressions on parallel computers using examples from coupled cluster methods.
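To illustrate the task decomposition idea (with names and tiling that are assumptions, not the DLTC API), the Python sketch below splits one contraction into output tiles and lets a worker pool hand tiles to idle workers, a crude stand-in for dynamic load balancing across task units.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools
import numpy as np

def contract_tile(A, B, i0, j0, tile):
    """One work unit: a tile of the contraction C[i,j] = sum_k A[i,k]*B[k,j]."""
    return i0, j0, A[i0:i0 + tile] @ B[:, j0:j0 + tile]

def contract(A, B, tile=64):
    """Tile the output and let a pool assign tiles to workers dynamically."""
    m, n = A.shape[0], B.shape[1]
    C = np.zeros((m, n))
    tiles = itertools.product(range(0, m, tile), range(0, n, tile))
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(contract_tile, A, B, i, j, tile) for i, j in tiles]
        for f in futures:
            i, j, block = f.result()
            C[i:i + block.shape[0], j:j + block.shape[1]] = block
    return C

A = np.random.rand(200, 150)
B = np.random.rand(150, 120)
print(np.allclose(contract(A, B), A @ B))   # True
```

Tiles from several independent contractions could be submitted to the same pool, which is the extra level of parallelism the abstract describes.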
Hu, Peter F; Yang, Shiming; Li, Hsiao-Chi; Stansbury, Lynn G; Yang, Fan; Hagegeorge, George; Miller, Catriona; Rock, Peter; Stein, Deborah M; Mackenzie, Colin F
2017-01-01
Research and practice based on automated electronic patient monitoring and data collection systems is significantly limited by system downtime. We asked whether a triple-redundant Monitor of Monitors System (MoMs) to collect and summarize key information from system-wide data sources could achieve high fault tolerance, early diagnosis of system failure, and improved data collection rates. In our Level I trauma center, patient vital signs (VS) monitors were networked to collect real-time patient physiologic data streams from 94 bed units in our various resuscitation, operating, and critical care units. To minimize the impact of collection server failure, three BedMaster® VS servers were used in parallel to collect data from all bed units. To locate and diagnose system failures, we summarized critical information from high-throughput data streams in real time in a dashboard viewer, and we compared the pre- and post-MoMs phases to evaluate data collection performance in terms of availability time, active collection rates, and gap duration, occurrence, and categories. Single-server collection rates in the 3-month period before MoMs deployment ranged from 27.8% to 40.5%, with a combined 79.1% collection rate. Reasons for gaps included collection server failure, software instability, individual bed setting inconsistency, and monitor servicing. In the 6-month post-MoMs deployment period, average collection rates were 99.9%. A triple-redundant patient data collection system with real-time diagnostic information summarization and representation improved the reliability of massive clinical data collection to nearly 100% in a Level I trauma center. Such a data collection framework may also increase the automation level of hospital-wide information aggregation for optimal allocation of health care resources.
PRAIS: Distributed, real-time knowledge-based systems made easy
NASA Technical Reports Server (NTRS)
Goldstein, David G.
1990-01-01
This paper discusses an architecture for real-time, distributed (parallel) knowledge-based systems called the Parallel Real-time Artificial Intelligence System (PRAIS). PRAIS strives for transparently parallelizing production (rule-based) systems, even when under real-time constraints. PRAIS accomplishes these goals by incorporating a dynamic task scheduler, operating system extensions for fact handling, and message-passing among multiple copies of CLIPS executing on a virtual blackboard. This distributed knowledge-based system tool uses the portability of CLIPS and common message-passing protocols to operate over a heterogeneous network of processors.
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness
NASA Astrophysics Data System (ADS)
Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.
2014-09-01
Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta J.; Gimenez, J.; Caubet, J.
2003-01-01
Parallel programming paradigms include process-level parallelism, thread-level parallelism, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). The analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.
Methods for design and evaluation of parallel computing systems (The PISCES project)
NASA Technical Reports Server (NTRS)
Pratt, Terrence W.; Wise, Robert; Haught, Mary JO
1989-01-01
The PISCES project started in 1984 under the sponsorship of the NASA Computational Structural Mechanics (CSM) program. A PISCES 1 programming environment and parallel FORTRAN were implemented in 1984 for the DEC VAX (using UNIX processes to simulate parallel processes). This system was used for experimentation with parallel programs for scientific applications and AI (dynamic scene analysis) applications. PISCES 1 was ported to a network of Apollo workstations by N. Fitzgerald.
Karasick, Michael S.; Strip, David R.
1996-01-01
A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.
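A minimal Python sketch of such a directed-edge record follows; the field names are illustrative assumptions, since the patent only requires that each d-edge carry vertex descriptions of the edge and a description of its one face.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DEdge:
    """A directed edge: one undirected edge as seen from one incident face."""
    tail: tuple   # (x, y, z) of the edge's start vertex
    head: tuple   # (x, y, z) of the edge's end vertex
    face: int     # label of the one face this d-edge belongs to

# A triangular face (face 0) contributes three d-edges; the neighbouring
# face across each edge contributes the oppositely directed twin. Each
# processor can own and operate on its own subset of d-edges in parallel.
face0 = [
    DEdge((0, 0, 0), (1, 0, 0), face=0),
    DEdge((1, 0, 0), (0, 1, 0), face=0),
    DEdge((0, 1, 0), (0, 0, 0), face=0),
]
print(face0[0])
```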
A mission-based productivity compensation model for an academic anesthesiology department.
Reich, David L; Galati, Maria; Krol, Marina; Bodian, Carol A; Kahn, Ronald A
2008-12-01
We replaced a nearly fixed-salary academic physician compensation model with a mission-based productivity model with the goal of improving attending anesthesiologist productivity. The base salary system was stratified according to rank and clinical experience. The supplemental pay structure was linked to electronic patient records and a scheduling database to award points for clinical activity; educational, research, and administrative points systems were constructed in parallel. We analyzed monthly American Society of Anesthesiologist (ASA) unit data for operating room activity and physician compensation from 2000 through mid-2007, excluding the 1-yr implementation period (July 2004-June 2005) for the new model. Comparing 2005-2006 with 2000-2004, quarterly ASA units increased by 14% (P = 0.0001) and quarterly ASA units per full-time equivalent increased by 31% (P < 0.0001), while quarterly ASA units per anesthetizing location decreased by 10% (P = 0.046). Compared with a baseline year (2001), Instructor and Assistant Professor faculty compensation increased more than Associate Professor and Professor faculty (P < 0.001) in both pre- and postimplementation periods. There were larger compensation increases for the postimplementation period compared with preimplementation across faculty rank groupings (P < 0.0001). Academic and educational output was stable. Implementing a productivity-based faculty compensation model in an academic department was associated with increased mean supplemental pay with relatively fewer faculty. ASA units per month and ASA units per operating room full-time equivalent increased, and these metrics are the most likely drivers of the increased compensation. This occurred despite a slight decrease in clinical productivity as measured by ASA units per anesthetizing location. Academic and educational output was stable.
Design considerations for parallel graphics libraries
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1994-01-01
Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.
Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++
NASA Technical Reports Server (NTRS)
Bodin, Francois; Priol, Thierry; Mehrotra, Piyush; Gannon, Dennis
1994-01-01
Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++ which is based on a concurrent aggregate parallel model.
Acceleration of discrete stochastic biochemical simulation using GPGPU.
Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira
2015-01-01
For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130.
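The one-realization-per-lane layout described above can be sketched in NumPy as follows, for a toy birth-death model; the model, rate constants, and array layout are illustrative assumptions, not the paper's GPU code. All realizations advance in lockstep, mirroring how a block of GPU threads would run them.

```python
import numpy as np

rng = np.random.default_rng(1)

def ssa_birth_death(k_birth, k_death, x0, t_end, n_runs):
    """Gillespie SSA for a birth-death process (0 -> X at rate k_birth,
    X -> 0 at rate k_death*X), stepping all realizations in lockstep."""
    x = np.full(n_runs, x0, dtype=np.int64)
    t = np.zeros(n_runs)
    active = np.ones(n_runs, dtype=bool)
    while active.any():
        a_birth = np.full(n_runs, k_birth)
        a_death = k_death * x
        a_total = a_birth + a_death
        tau = rng.exponential(1.0 / a_total)       # time to next reaction
        fire = active & (t + tau <= t_end)         # jumps past t_end don't fire
        t = np.where(fire, t + tau, t_end)
        birth = rng.random(n_runs) < a_birth / a_total
        x += np.where(fire & birth, 1, 0) - np.where(fire & ~birth, 1, 0)
        active = fire
    return x

final = ssa_birth_death(k_birth=10.0, k_death=0.1, x0=0, t_end=50.0, n_runs=10000)
print("mean copy number:", final.mean())           # approaches k_birth/k_death = 100
```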
Acceleration of discrete stochastic biochemical simulation using GPGPU
Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira
2015-01-01
For systems made up of a small number of molecules, such as a biochemical network in a single cell, simulation requires a stochastic rather than a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required to obtain statistical results, which incurs a large computational cost. We have implemented a parallel method for running SSA simulations of a stochastic model on a graphics processing unit (GPU), which enables multiple realizations to be computed at the same time and thus reduces the computational time and cost. During the simulation, each time course is recorded at each time step for later analysis. A straightforward implementation of this method on a GPU, with hybrid parallelization (the multiple simulations run simultaneously, and the computational tasks within each simulation are also parallelized), is about 16 times faster than a sequential simulation on a CPU. We also improved the memory access pattern and reduced the memory footprint in order to optimize the computations on the GPU, and implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various model sizes, we performed SSA simulations on different model sizes and compared the computation times to those of sequential simulations on a CPU. With the improved time course recording function, our method accelerates SSA simulation by a factor of up to 130. PMID:25762936
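For readers unfamiliar with SSA, the following is a minimal Python sketch of Gillespie's direct method on an assumed birth-death model; it is not the authors' GPU code. On a GPU, each of the many independent realizations sampled in the final loop would map to its own thread.

    # A minimal sketch of Gillespie's direct method on an assumed
    # birth-death model (not the authors' GPU code). On a GPU, each
    # realization in the final loop would run in its own thread.
    import numpy as np

    def ssa_birth_death(k_birth=10.0, k_death=0.1, x0=0, t_end=50.0, rng=None):
        rng = rng or np.random.default_rng()
        t, x = 0.0, x0
        times, counts = [t], [x]
        while t < t_end:
            a1, a2 = k_birth, k_death * x        # reaction propensities
            a0 = a1 + a2
            t += rng.exponential(1.0 / a0)       # waiting time to next event
            x += 1 if rng.random() * a0 < a1 else -1
            times.append(t)
            counts.append(x)                     # record the time course
        return times, counts

    # Statistics require many independent realizations -- the cost the
    # paper's GPU implementation amortizes by running them in parallel.
    finals = [ssa_birth_death()[1][-1] for _ in range(100)]
    print(np.mean(finals), np.std(finals))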
Use of geographic information system to display water-quality data from San Juan basin
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thorn, C.R.; Dam, W.L.
1989-09-01
The ARC/INFO geographic information system is creating thematic maps of the San Juan basin as part of the USGS Regional Aquifer-System Analysis program. (Use of trade names is for descriptive purposes only and does not constitute endorsement by the US Geological Survey.) Maps created by a Prime version of ARC/INFO, to be published in a series of Hydrologic Investigations Atlas reports for selected geologic units, will include outcrop patterns, water-well locations, and water-quality data. The San Juan basin study area, encompassing about 19,400 mi², can be displayed with ARC/INFO at various scales; on the same scale, generated water-quality maps can be compared and overlain with other maps such as potentiometric surface and depth to top of a geologic or hydrologic unit. Selected water-quality and well data (including latitude and longitude) are retrieved from the USGS National Water Information System data base for a specified geologic unit. Data are formatted by Fortran programs and read into an INFO data base. Two parallel files - an INFO file containing water-quality data and well data and an ARC file containing the site coordinates - are joined to form the ARC/INFO data base. A file containing a series of commands using Prime's Command Procedure language is used to select coverage, display, and position data on the map. Data interpretation is enhanced by displaying water-quality data throughout the basin in combination with other hydrologic and geologic data.
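The join of the two parallel files can be pictured with a small Python sketch; the file layout and field names below are hypothetical, not the USGS schema:

    # A hedged sketch of the "two parallel files" join: a water-quality
    # table and a site-coordinate table merged on a shared site ID. The
    # layout and field names are hypothetical, not the USGS schema.
    import csv, io

    quality = io.StringIO("site_id,dissolved_solids_mg_L\nA1,480\nA2,1250\n")
    coords = io.StringIO("site_id,lat,lon\nA1,36.10,-108.20\nA2,36.45,-107.90\n")

    sites = {row["site_id"]: row for row in csv.DictReader(coords)}
    for row in csv.DictReader(quality):
        merged = {**sites[row["site_id"]], **row}    # join on site_id
        print(merged)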
Jackin, Boaz Jessie; Watanabe, Shinpei; Ootsu, Kanemitsu; Ohkawa, Takeshi; Yokota, Takashi; Hayasaki, Yoshio; Yatagai, Toyohiko; Baba, Takanobu
2018-04-20
A parallel computation method for large-size Fresnel computer-generated hologram (CGH) is reported. The method was introduced by us in an earlier report as a technique for calculating Fourier CGH from 2D object data. In this paper we extend the method to compute Fresnel CGH from 3D object data. The scale of the computation problem is also expanded to 2 gigapixels, making it closer to real application requirements. The significant feature of the reported method is its ability to avoid communication overhead and thereby fully utilize the computing power of parallel devices. The method exhibits three layers of parallelism that favor small to large scale parallel computing machines. Simulation and optical experiments were conducted to demonstrate the workability and to evaluate the efficiency of the proposed technique. A two-times improvement in computation speed has been achieved compared to the conventional method, on a 16-node cluster (one GPU per node) utilizing only one layer of parallelism. A 20-times improvement in computation speed has been estimated utilizing two layers of parallelism on a very large-scale parallel machine with 16 nodes, where each node has 16 GPUs.
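The pixel-level parallelism the method exploits can be seen in a toy Fresnel CGH computation: every hologram pixel accumulates spherical-wave contributions from each 3D object point independently. The following numpy sketch uses assumed wavelength, pitch, and object points and is not the authors' implementation.

    # A toy Fresnel CGH computation (assumed wavelength, pitch, and object
    # points; not the authors' implementation). Each hologram pixel sums
    # spherical-wave contributions exp(i*k*r)/r from every 3D object
    # point, and pixels are mutually independent -- the parallelism the
    # paper exploits.
    import numpy as np

    wavelength = 633e-9
    k = 2 * np.pi / wavelength
    pitch = 8e-6                               # hologram pixel pitch [m]
    N = 256                                    # (tiny) N x N hologram

    pts = np.array([[0.0, 0.0, 0.1],           # hypothetical (x, y, z) points
                    [1e-3, -1e-3, 0.12]])

    ix = (np.arange(N) - N / 2) * pitch
    X, Y = np.meshgrid(ix, ix)
    field = np.zeros((N, N), dtype=complex)
    for x0, y0, z0 in pts:
        r = np.sqrt((X - x0) ** 2 + (Y - y0) ** 2 + z0 ** 2)
        field += np.exp(1j * k * r) / r
    hologram = np.angle(field)                 # phase-only CGH
    print(hologram.shape)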
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.
Chen, Chunlei; He, Li; Zhang, Huixiang; Zheng, Hao; Wang, Lei
2017-01-01
Incremental clustering algorithms play a vital role in applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering place high demands on the computing power of the hardware platform, and parallel computing is a common way to meet these demands. Moreover, the General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, incremental clustering algorithms face a dilemma between clustering accuracy and parallelism when powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering, such as evolving granularity. Second, we formally proved two theorems. The first proves the relation between clustering accuracy and evolving granularity and analyzes the upper and lower bounds of different-to-same mis-affiliation; fewer occurrences of such mis-affiliation mean higher accuracy. The second reveals the relation between parallelism and evolving granularity; smaller work-depth means superior parallelism. Through the proofs, we conclude that the accuracy of an incremental clustering algorithm is negatively related to evolving granularity, while parallelism is positively related to it; these contradictory relations cause the dilemma. Finally, we validated the relations with a demo algorithm, and experimental results verified the theoretical conclusions.
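A toy sketch of the trade-off, with batch size standing in for evolving granularity (all names illustrative, not the paper's demo algorithm): larger batches expose more independent assignment work per step, but centroids are updated less often, which is the source of mis-affiliation.

    # A toy illustration (all names made up): batch size stands in for
    # evolving granularity. Assignments within a batch are independent
    # (parallel work), but centroids update only between batches -- the
    # delayed updates are the source of mis-affiliation.
    import numpy as np

    def incremental_kmeans(stream, centroids, batch=1):
        centroids = centroids.astype(float).copy()
        counts = np.ones(len(centroids))
        for i in range(0, len(stream), batch):
            chunk = stream[i:i + batch]
            d = np.linalg.norm(chunk[:, None] - centroids[None], axis=2)
            for p, c in zip(chunk, d.argmin(axis=1)):
                counts[c] += 1
                centroids[c] += (p - centroids[c]) / counts[c]
        return centroids

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 2)) + rng.choice([-3.0, 3.0], size=(1000, 1))
    print(incremental_kmeans(data, data[:2], batch=64))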
Architecture studies and system demonstrations for optical parallel processor for AI and NI
NASA Astrophysics Data System (ADS)
Lee, Sing H.
1988-03-01
In solving deterministic AI problems, the data search for matching the arguments of a PROLOG expression causes a serious bottleneck when implemented sequentially by electronic systems. To overcome this bottleneck we have developed the concepts for an optical expert system based on a matrix-algebraic formulation, which will be suitable for parallel optical implementation. The optical AI system based on the matrix-algebraic formulation will offer distinct advantages for parallel search, adult learning, etc.
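One way to picture the matrix-algebraic formulation (a hedged sketch; the relation and all names are invented) is to encode ground facts as a Boolean matrix, so that a query becomes a pair of matrix-vector products, an operation optical hardware can perform in parallel:

    # A hedged sketch of the matrix-algebraic idea (the relation and all
    # names are invented): ground facts likes(person, item) become a
    # Boolean matrix, and a query reduces to matrix-vector products --
    # exactly the operation optics can do in parallel.
    import numpy as np

    people = ["ann", "bob", "eve"]
    F = np.array([[1, 0],                 # ann likes tea
                  [1, 1],                 # bob likes tea and jazz
                  [0, 1]])                # eve likes jazz

    q = np.array([1, 0, 0])               # query subject: ann
    shared = (F @ (q @ F)) > 0            # who shares any of ann's likes?
    print([p for p, s in zip(people, shared) if s])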
The Utility of SAR to Monitor Ocean Processes.
1981-11-01
Factors affecting the echo received from ocean waves include the motion of the scattering surfaces. The polarization of radiation is defined by the direction of the electric field intensity (E) vector; a horizontally polarized wave, for example, has its E vector parallel to the scattering surface. Figures include an oil spill off the East Coast of the United States and L-band parallel- and cross-polarized SAR imagery of ice in the Beaufort Sea.
Massively parallel information processing systems for space applications
NASA Technical Reports Server (NTRS)
Schaefer, D. H.
1979-01-01
NASA is developing massively parallel systems for ultra high speed processing of digital image data collected by satellite borne instrumentation. Such systems contain thousands of processing elements. Work is underway on the design and fabrication of the 'Massively Parallel Processor', a ground computer containing 16,384 processing elements arranged in a 128 x 128 array. This computer uses existing technology. Advanced work includes the development of semiconductor chips containing thousands of feedthrough paths. Massively parallel image analog to digital conversion technology is also being developed. The goal is to provide compact computers suitable for real-time onboard processing of images.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krishnamurthy, Dheepak
This paper is an overview of the Power System Simulation Toolbox (psst). psst is an open-source Python application for the simulation and analysis of power system models. psst simulates wholesale market operation by solving a DC Optimal Power Flow (DCOPF), a Security Constrained Unit Commitment (SCUC), and a Security Constrained Economic Dispatch (SCED). psst also includes models for the various entities in a power system, such as Generator Companies (GenCos), Load Serving Entities (LSEs), and an Independent System Operator (ISO). psst features an open, modular, object-oriented architecture that makes it useful for researchers to customize, expand, and experiment beyond solving traditional problems. psst also includes a web-based Graphical User Interface (GUI) that allows for user-friendly interaction and for deployment on remote High Performance Computing (HPC) clusters for parallelized operations. The paper also provides an illustrative application of psst and benchmarks with standard IEEE test cases to show the advanced features and performance of the toolbox.
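To illustrate the kind of problem a DCOPF solve addresses, here is a generic two-bus linear program using scipy; it deliberately does not use psst's API, and all costs, limits, and loads are made-up numbers.

    # A generic two-bus DC optimal power flow as a linear program using
    # scipy (deliberately not psst's API; costs, limits, and loads are
    # made-up numbers). Variables: generator outputs [g1, g2].
    from scipy.optimize import linprog

    cost = [20.0, 30.0]            # $/MWh for gen 1 (bus 1) and gen 2 (bus 2)
    load_bus2 = 150.0              # MW demand at bus 2
    line_limit = 100.0             # MW limit on line 1-2 (flow equals g1)

    res = linprog(c=cost,
                  A_ub=[[1.0, 0.0]], b_ub=[line_limit],   # g1 <= line limit
                  A_eq=[[1.0, 1.0]], b_eq=[load_bus2],    # g1 + g2 = load
                  bounds=[(0, 120.0), (0, 120.0)])
    print(res.x)                   # expected [100, 50]: cheap gen at its limit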
Traditional and modern medicine working in tandem.
Pretorius, E
1991-12-01
Because of the many problems relating to health care delivery in Africa, it is becoming apparent that neither the exclusive/monopolistic nor the tolerant legislative systems should be tolerated any longer. Especially since the Alma Ata Conference held by the WHO/UNICEF there has been growing impetus towards either inclusive/parallel (the beneficial co-existence of traditional and modern medical systems), or integrated systems. Although the idea of making traditional and modern medicine work in tandem in a united treatment context has its merits, it is also plagued by issues such as the nature of the products of an integrated training, resistance by stubborn protagonists of either of the two systems, or that only lip-service is paid to the idea of co-operation. Nevertheless, it is believed that all interest groups--the authorities responsible for health care delivery, the Western-trained health care workers, the traditional healers and the users of these services--stand to gain from such liaison.
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)
1998-01-01
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.
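As a loose Python analogue of the directive-based loop-level parallelism described above (the paper uses SGI compiler directives, not an explicit worker pool), independent loop iterations can be distributed across workers:

    # A loose Python analogue of loop-level parallelism (the paper uses
    # SGI compiler directives, not an explicit pool): independent loop
    # iterations are distributed across worker processes.
    from concurrent.futures import ProcessPoolExecutor
    import math

    def body(i):                   # one independent loop iteration
        return math.sqrt(i) * math.sin(i)

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(body, range(1000)))
        print(sum(results))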
NASA Astrophysics Data System (ADS)
Mosby, Matthew; Matouš, Karel
2015-12-01
Three-dimensional simulations capable of resolving the large range of spatial scales, from the failure-zone thickness up to the size of the representative unit cell, in damage mechanics problems of particle reinforced adhesives are presented. We show that resolving this wide range of scales in complex three-dimensional heterogeneous morphologies is essential in order to apprehend fracture characteristics, such as strength, fracture toughness and shape of the softening profile. Moreover, we show that computations that resolve essential physical length scales capture the particle size-effect in fracture toughness, for example. In the vein of image-based computational materials science, we construct statistically optimal unit cells containing hundreds to thousands of particles. We show that these statistically representative unit cells are capable of capturing the first- and second-order probability functions of a given data-source with better accuracy than traditional inclusion packing techniques. In order to accomplish these large computations, we use a parallel multiscale cohesive formulation and extend it to finite strains including damage mechanics. The high-performance parallel computational framework is executed on up to 1024 processing cores. A mesh convergence and a representative unit cell study are performed. Quantifying the complex damage patterns in simulations consisting of tens of millions of computational cells and millions of highly nonlinear equations requires data-mining the parallel simulations, and we propose two damage metrics to quantify the damage patterns. A detailed study of volume fraction and filler size on the macroscopic traction-separation response of heterogeneous adhesives is presented.
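The second-order (two-point) probability function mentioned above can be estimated by Monte Carlo sampling, as in this hedged 2D sketch: S2(r) is the probability that two points a distance r apart both lie inside particles. The random packing here is illustrative, not a statistically optimal unit cell.

    # A hedged Monte Carlo estimate of the two-point probability function
    # S2(r): the chance that two points a distance r apart both fall
    # inside particles. The 2D random packing is illustrative only, not a
    # statistically optimal unit cell.
    import numpy as np

    rng = np.random.default_rng(1)
    centers = rng.random((50, 2))          # hypothetical particle centers
    radius = 0.04

    def inside(p):                          # is p inside any particle?
        return bool((np.linalg.norm(centers - p, axis=1) < radius).any())

    def s2(r, n=20000):
        hits = 0
        for _ in range(n):
            p = rng.random(2)
            theta = rng.uniform(0, 2 * np.pi)
            q = (p + r * np.array([np.cos(theta), np.sin(theta)])) % 1.0
            hits += inside(p) and inside(q)
        return hits / n

    for r in (0.0, 0.05, 0.2):
        print(r, s2(r))                     # s2(0) approximates volume fraction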
Comparison of multihardware parallel implementations for a phase unwrapping algorithm
NASA Astrophysics Data System (ADS)
Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo
2018-04-01
Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and more accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm for a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that minimizes a cost function by means of a Gauss-Seidel-type iteration. Our algorithm optimizes the same cost function but, unlike the original work, is a parallel Jacobi-class algorithm with alternated minimizations. This strategy is known as the chessboard (red-black) scheme: red pixels are mutually independent and can be updated in parallel within one iteration, and black pixels can likewise be updated in parallel in the alternating iteration. We present parallel implementations of our algorithm for different parallel architectures: multicore CPU, the Xeon Phi coprocessor, and an Nvidia graphics processing unit. In all cases, our parallel algorithm outperforms the original serial version. In addition, we present a detailed performance comparison of the developed parallel versions.
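A minimal numpy sketch of the chessboard update pattern follows; it smooths a generic field rather than optimizing the paper's cost function, but the parallel structure is the same: all pixels of one color are updated simultaneously, then all pixels of the other color.

    # A minimal numpy sketch of the chessboard (red-black) scheme: it
    # smooths a generic field rather than the paper's cost function, but
    # the parallel structure is identical -- all pixels of one color are
    # updated at once, then all pixels of the other color.
    import numpy as np

    def red_black_sweep(u):
        i, j = np.indices(u.shape)
        nb = lambda v: (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                        np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        for color in (0, 1):                    # 0 = red, 1 = black
            mask = (i + j) % 2 == color
            mask[0, :] = mask[-1, :] = mask[:, 0] = mask[:, -1] = False
            u[mask] = nb(u)[mask]               # same-color pixels in parallel
        return u

    u = np.random.default_rng(0).random((64, 64))
    for _ in range(100):
        u = red_black_sweep(u)
    print(u.mean())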
NASA Astrophysics Data System (ADS)
Xavier, M. P.; do Nascimento, T. M.; dos Santos, R. W.; Lobosco, M.
2014-03-01
The development of computational systems that mimic the physiological response of organs or even the entire body is a complex task. One of the issues that make this task extremely complex is the huge amount of computational resources needed to execute the simulations; for this reason, the use of parallel computing is mandatory. In this work, we focus on the simulation of the temporal and spatial behaviour of some human innate immune system cells and molecules in a small three-dimensional section of tissue. To perform this simulation, we use multiple Graphics Processing Units (GPUs) in a shared-memory environment. Despite the high initialization and communication costs imposed by the use of GPUs, the techniques used to implement the HIS simulator have proven very effective for this purpose.
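The voxel-parallel structure of such a simulator can be sketched with one explicit diffusion step for a molecular concentration on a small 3D grid (illustrative parameters; on a GPU each voxel update would be an independent thread):

    # A hedged sketch of the per-voxel kernel such simulators run: one
    # explicit diffusion step for a molecular concentration on a small 3D
    # tissue grid (illustrative parameters). On a GPU, every voxel update
    # is an independent thread; numpy plays that role here.
    import numpy as np

    def diffuse(c, D=0.1, dt=0.1, dx=1.0):
        lap = (-6 * c
               + np.roll(c, 1, 0) + np.roll(c, -1, 0)
               + np.roll(c, 1, 1) + np.roll(c, -1, 1)
               + np.roll(c, 1, 2) + np.roll(c, -1, 2)) / dx ** 2
        return c + dt * D * lap               # all voxels update in parallel

    c = np.zeros((32, 32, 32))
    c[16, 16, 16] = 1.0                       # point source of a molecule
    for _ in range(100):
        c = diffuse(c)
    print(c.sum())                            # mass conserved (periodic box)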
Hybrid sodium heat pipe receivers for dish/Stirling systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Laing, D.; Reusch, M.
1997-12-31
The design of a hybrid solar/gas heat pipe receiver for the SBP 9 kW dish/Stirling system, which uses a United Stirling AB V160 Stirling engine, and the results of on-sun testing in alternative and parallel modes are reported. The receiver is designed to transfer a thermal power of 35 kW. The heat pipe operates at around 800 C; the working fluid is sodium. Operational options are solar-only, gas-augmented, and gas-only modes. Also reported is the design of a second-generation hybrid heat pipe receiver, currently being developed under an EU-funded project and based on the experience gained with the first hybrid receiver. This receiver is designed for the improved SBP/L. and C. 10 kW dish/Stirling system with the reworked SOLO V161 Stirling engine.
NASA Technical Reports Server (NTRS)
2009-01-01
Topics covered include: Direct-Solve Image-Based Wavefront Sensing; Use of UV Sources for Detection and Identification of Explosives; Using Fluorescent Viruses for Detecting Bacteria in Water; Gradiometer Using Middle Loops as Sensing Elements in a Low-Field SQUID MRI System; Volcano Monitor: Autonomous Triggering of In-Situ Sensors; Wireless Fluid-Level Sensors for Harsh Environments; Interference-Detection Module in a Digital Radar Receiver; Modal Vibration Analysis of Large Castings; Structural/Radiation-Shielding Epoxies; Integrated Multilayer Insulation; Apparatus for Screening Multiple Oxygen-Reduction Catalysts; Determining Aliasing in Isolated Signal Conditioning Modules; Composite Bipolar Plate for Unitized Fuel Cell/Electrolyzer Systems; Spectrum Analyzers Incorporating Tunable WGM Resonators; Quantum-Well Thermophotovoltaic Cells; Bounded-Angle Iterative Decoding of LDPC Codes; Conversion from Tree to Graph Representation of Requirements; Parallel Hybrid Vehicle Optimal Storage System; and Anaerobic Digestion in a Flooded Densified Leachbed.
Adaptive-optics optical coherence tomography processing using a graphics processing unit.
Shafer, Brandon A; Kriske, Jeffery E; Kocaoglu, Omer P; Turner, Timothy L; Liu, Zhuolin; Lee, John Jaehwan; Miller, Donald T
2014-01-01
Graphics processing units are increasingly being used for scientific computing because of their powerful parallel processing abilities and moderate price compared to supercomputers and computing grids. In this paper we have used a general-purpose graphics processing unit to process adaptive-optics optical coherence tomography (AOOCT) images in real time. Increasing the processing speed of AOOCT is an essential step in moving this super-high-resolution technology closer to clinical viability.
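As a hedged illustration of the kind of per-A-scan work such pipelines parallelize (a generic Fourier-domain OCT step, not the authors' AOOCT pipeline): each spectrum is background-subtracted, windowed, and Fourier-transformed into a depth profile, and A-scans are mutually independent.

    # A generic Fourier-domain OCT step (not the authors' AOOCT
    # pipeline): each acquired spectrum is background-subtracted,
    # windowed, and Fourier-transformed into a depth profile. A-scans are
    # independent, so a GPU can process many of them at once.
    import numpy as np

    rng = np.random.default_rng(0)
    spectra = rng.random((512, 1024))        # 512 A-scans x 1024 spectral bins

    background = spectra.mean(axis=0)        # estimated DC background
    window = np.hanning(spectra.shape[1])
    ascans = np.abs(np.fft.rfft((spectra - background) * window, axis=1))
    print(ascans.shape)                      # (512, 513) depth profiles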