Science.gov

Sample records for efficient parallel processing

  1. Efficient multitasking: parallel versus serial processing of multiple tasks

    PubMed Central

    Fischer, Rico; Plessow, Franziska

    2015-01-01

    In the context of performance optimizations in multitasking, a central debate has unfolded in multitasking research around whether cognitive processes related to different tasks proceed only sequentially (one at a time), or can operate in parallel (simultaneously). This review features a discussion of theoretical considerations and empirical evidence regarding parallel versus serial task processing in multitasking. In addition, we highlight how methodological differences and theoretical conceptions determine the extent to which parallel processing in multitasking can be detected, to guide their employment in future research. Parallel and serial processing of multiple tasks are not mutually exclusive. Therefore, questions focusing exclusively on either task-processing mode are too simplified. We review empirical evidence and demonstrate that shifting between more parallel and more serial task processing critically depends on the conditions under which multiple tasks are performed. We conclude that efficient multitasking is reflected by the ability of individuals to adjust multitasking performance to environmental demands by flexibly shifting between different processing strategies of multiple task-component scheduling. PMID:26441742

  2. Efficient biased random bit generation for parallel processing

    SciTech Connect

    Slone, D.M.

    1994-09-28

    A lattice gas automaton was implemented on a massively parallel machine (the BBN TC2000) and a vector supercomputer (the CRAY C90). The automaton models Burgers equation $\rho_t + \rho\rho_x = \nu\rho_{xx}$ in 1 dimension. The lattice gas evolves by advecting and colliding pseudo-particles on a 1-dimensional, periodic grid. The specific rules for colliding particles are stochastic in nature and require the generation of many billions of random numbers to create the random bits necessary for the lattice gas. The goal of the thesis was to speed up the process of generating the random bits and thereby lessen the computational bottleneck of the automaton.
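
    The record above hinges on generating biased random bits in bulk. As a rough illustration only (not the thesis's generator), the sketch below produces such bits as a single vectorized draw; the function name and the 0.3 bias are invented for the example.

```python
# Illustrative sketch, not the thesis's algorithm: draw n random bits that are 1
# with probability `bias`, in one vectorized call rather than bit by bit -- the
# same bulk-generation idea used to relieve the automaton's bottleneck.
import numpy as np

def biased_bits(n, bias, rng=None):
    """Return n random bits, each equal to 1 with probability `bias`."""
    rng = rng or np.random.default_rng()
    return (rng.random(n) < bias).astype(np.uint8)

bits = biased_bits(10_000_000, 0.3)   # ten million bits in one call
print(bits.mean())                    # should be close to 0.3
```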

  3. The power and efficiency of advanced software and parallel processing

    NASA Technical Reports Server (NTRS)

    Singh, Ramen P.; Taylor, Lawrence W., Jr.

    1989-01-01

    Real-time simulation of flexible and articulating systems is difficult because of the computational burden of the time-varying calculations. The mobile servicing system of the NASA Space Station Freedom will handle heavy payloads by local arm manipulations and by translating along the spine of the Station, so it is crucial to have real-time simulation available. To enable such a simulation to be of high fidelity and to be hosted on a modest computer, special care must be taken in formulating the structural dynamics. Frontal solution algorithms save considerable time in performing these calculations. In addition, the formulation must be compatible with parallel processing in order to take full advantage of both. An approach is offered which will result in high-fidelity, real-time simulation for flexible, articulating systems such as the Space Station remote servicing system.

  4. Parallel processing for efficient 3D slope stability modelling

    NASA Astrophysics Data System (ADS)

    Marchesini, Ivan; Mergili, Martin; Alvioli, Massimiliano; Metz, Markus; Schneider-Muntau, Barbara; Rossi, Mauro; Guzzetti, Fausto

    2014-05-01

    We test the performance of the GIS-based, three-dimensional slope stability model r.slope.stability. The model was developed as a C- and Python-based raster module of the GRASS GIS software. It considers the three-dimensional geometry of the sliding surface, adopting a modification of the model proposed by Hovland (1977) and revised and extended by Xie and co-workers (2006). Given a terrain elevation map and a set of relevant thematic layers, the model evaluates the stability of slopes for a large number of randomly selected potential slip surfaces, ellipsoidal or truncated in shape. Any single raster cell may be intersected by multiple sliding surfaces, each associated with a value of the factor of safety, FS. For each pixel, the minimum value of FS and the depth of the associated slip surface are stored. This information is used to obtain a spatial overview of the potentially unstable slopes in the study area. We test the model in the Collazzone area, Umbria, central Italy, an area known to be susceptible to landslides of different types and sizes. Availability of a comprehensive and detailed landslide inventory map allowed for a critical evaluation of the model results. The r.slope.stability code automatically splits the study area into a defined number of tiles, with proper overlap in order to provide the same statistical significance for the entire study area. The tiles are then processed in parallel by a given number of processors, exploiting a multi-purpose computing environment at CNR IRPI, Perugia. The map of the FS is obtained by collecting the individual results, taking the minimum values on the overlapping cells. This procedure significantly reduces the processing time. We show how the gain in terms of processing time depends on the tile dimensions and on the number of cores.
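
    As a schematic illustration of the tiling strategy described above (not r.slope.stability's actual code), the sketch below splits a raster into overlapping row bands, evaluates each band in parallel, and merges results by taking the minimum factor of safety on overlapping cells. The function compute_fs_tile and all parameter names are hypothetical placeholders.

```python
# Schematic sketch of the tile-and-merge strategy: overlapping tiles are
# processed in parallel and the minimum FS wins wherever tiles overlap.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def compute_fs_tile(tile):
    # placeholder for the per-tile slope-stability computation
    return np.full(tile.shape, np.inf)  # would hold FS values in practice

def fs_map(dem, tile_rows, overlap, workers=4):
    fs = np.full(dem.shape, np.inf)
    bounds = []
    for r0 in range(0, dem.shape[0], tile_rows):
        lo = max(r0 - overlap, 0)
        hi = min(r0 + tile_rows + overlap, dem.shape[0])
        bounds.append((lo, hi))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(compute_fs_tile, [dem[lo:hi] for lo, hi in bounds])
    for (lo, hi), tile_fs in zip(bounds, results):
        fs[lo:hi] = np.minimum(fs[lo:hi], tile_fs)  # min FS on overlapping cells
    return fs
```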

  5. A high resolution finite volume method for efficient parallel simulation of casting processes on unstructured meshes

    SciTech Connect

    Kothe, D.B.; Turner, J.A.; Mosso, S.J.; Ferrell, R.C.

    1997-03-01

    We discuss selected aspects of a new parallel three-dimensional (3-D) computational tool for the unstructured mesh simulation of Los Alamos National Laboratory (LANL) casting processes. This tool, known as Telluride, draws upon robust, high resolution finite volume solutions of metal alloy mass, momentum, and enthalpy conservation equations to model the filling, cooling, and solidification of LANL castings. We briefly describe the current Telluride physical models and solution methods, then detail our parallelization strategy as implemented with Fortran 90 (F90). This strategy has yielded straightforward and efficient parallelization on distributed and shared memory architectures, aided in large part by the new parallel libraries JTpack90 for Krylov-subspace iterative solution methods and PGSLib for efficient gather/scatter operations. We illustrate our methodology and current capabilities with source code examples and parallel efficiency results for a LANL casting simulation.

  6. Reliable and Efficient Parallel Processing Algorithms and Architectures for Modern Signal Processing. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Liu, Kuojuey Ray

    1990-01-01

    Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in detail. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on a triangular array and the other on a rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-phase operations. Performance issues are also considered.

  7. Parallel Information Processing.

    ERIC Educational Resources Information Center

    Rasmussen, Edie M.

    1992-01-01

    Examines parallel computer architecture and the use of parallel processors for text. Topics discussed include parallel algorithms; performance evaluation; parallel information processing; parallel access methods for text; parallel and distributed information retrieval systems; parallel hardware for text; and network models for information…

  8. Special parallel processing workshop

    SciTech Connect

    1994-12-01

    This report contains viewgraphs from the Special Parallel Processing Workshop. These viewgraphs deal with topics such as parallel processing performance, message passing, queue structure, and other basic concepts dealing with parallel processing.

  9. A new strategy for efficient solar energy conversion: Parallel-processing with surface plasmons

    NASA Technical Reports Server (NTRS)

    Anderson, L. M.

    1982-01-01

    This paper introduces an advanced concept for direct conversion of sunlight to electricity, which aims at high efficiency by tailoring the conversion process to separate energy bands within the broad solar spectrum. The objective is to obtain a high level of spectrum-splitting without sequential losses or unique materials for each frequency band. In this concept, sunlight excites a spectrum of surface plasma waves which are processed in parallel on the same metal film. The surface plasmons transport energy to an array of metal-barrier-semiconductor diodes, where energy is extracted by inelastic tunneling. Diodes are tuned to different frequency bands by selecting the operating voltage and geometry, but all diodes share the same materials.

  10. Automatic analysis (aa): efficient neuroimaging workflows and parallel processing using Matlab and XML.

    PubMed

    Cusack, Rhodri; Vicente-Grabovetsky, Alejandro; Mitchell, Daniel J; Wild, Conor J; Auer, Tibor; Linke, Annika C; Peelle, Jonathan E

    2014-01-01

    Recent years have seen neuroimaging data sets becoming richer, with larger cohorts of participants, a greater variety of acquisition techniques, and increasingly complex analyses. These advances have made data analysis pipelines complicated to set up and run (increasing the risk of human error) and time-consuming to execute (restricting what analyses are attempted). Here we present an open-source framework, automatic analysis (aa), to address these concerns. Human efficiency is increased by making code modular and reusable, and managing its execution with a processing engine that tracks what has been completed and what needs to be (re)done. Analysis is accelerated by optional parallel processing of independent tasks on cluster or cloud computing resources. A pipeline comprises a series of modules that each perform a specific task. The processing engine keeps track of the data, calculating a map of upstream and downstream dependencies for each module. Existing modules are available for many analysis tasks, such as SPM-based fMRI preprocessing, individual and group level statistics, voxel-based morphometry, tractography, and multi-voxel pattern analyses (MVPA). However, aa also allows for full customization, and encourages efficient management of code: new modules may be written with only a small code overhead. aa has been used by more than 50 researchers in hundreds of neuroimaging studies comprising thousands of subjects. It has been found to be robust, fast, and efficient, for studies ranging from simple single-subject designs up to multimodal pipelines on hundreds of subjects. It is attractive to both novice and experienced users. aa can reduce the amount of time neuroimaging laboratories spend performing analyses and reduce errors, expanding the range of scientific questions it is practical to address. PMID:25642185
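
    aa itself is implemented in Matlab with XML parameter files; purely to illustrate the dependency-map idea described above, the toy Python sketch below runs modules whose upstream dependencies are complete in parallel batches. The module names are invented.

```python
# Toy sketch of a dependency-driven pipeline: modules with satisfied upstream
# dependencies run in parallel; the rest wait. Not aa's actual engine.
from concurrent.futures import ThreadPoolExecutor

pipeline = {                        # module -> upstream dependencies
    "realign":      [],
    "coregister":   ["realign"],
    "normalise":    ["coregister"],
    "smooth":       ["normalise"],
    "firstlevel_A": ["smooth"],
    "firstlevel_B": ["smooth"],     # independent of firstlevel_A -> runs in parallel
}

def run(module):
    print("running", module)

done = set()
with ThreadPoolExecutor() as pool:
    while len(done) < len(pipeline):
        ready = [m for m, deps in pipeline.items()
                 if m not in done and all(d in done for d in deps)]
        list(pool.map(run, ready))  # independent modules execute concurrently
        done.update(ready)
```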

  11. Automatic analysis (aa): efficient neuroimaging workflows and parallel processing using Matlab and XML

    PubMed Central

    Cusack, Rhodri; Vicente-Grabovetsky, Alejandro; Mitchell, Daniel J.; Wild, Conor J.; Auer, Tibor; Linke, Annika C.; Peelle, Jonathan E.

    2015-01-01

    Recent years have seen neuroimaging data sets becoming richer, with larger cohorts of participants, a greater variety of acquisition techniques, and increasingly complex analyses. These advances have made data analysis pipelines complicated to set up and run (increasing the risk of human error) and time-consuming to execute (restricting what analyses are attempted). Here we present an open-source framework, automatic analysis (aa), to address these concerns. Human efficiency is increased by making code modular and reusable, and managing its execution with a processing engine that tracks what has been completed and what needs to be (re)done. Analysis is accelerated by optional parallel processing of independent tasks on cluster or cloud computing resources. A pipeline comprises a series of modules that each perform a specific task. The processing engine keeps track of the data, calculating a map of upstream and downstream dependencies for each module. Existing modules are available for many analysis tasks, such as SPM-based fMRI preprocessing, individual and group level statistics, voxel-based morphometry, tractography, and multi-voxel pattern analyses (MVPA). However, aa also allows for full customization, and encourages efficient management of code: new modules may be written with only a small code overhead. aa has been used by more than 50 researchers in hundreds of neuroimaging studies comprising thousands of subjects. It has been found to be robust, fast, and efficient, for studies ranging from simple single-subject designs up to multimodal pipelines on hundreds of subjects. It is attractive to both novice and experienced users. aa can reduce the amount of time neuroimaging laboratories spend performing analyses and reduce errors, expanding the range of scientific questions it is practical to address. PMID:25642185

  12. Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

    PubMed Central

    Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan

    2014-01-01

    By reorganizing the execution order and optimizing the data structures, we propose an efficient parallel framework for an H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework with CUDA on NVIDIA GPUs. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose a series of optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation outperforms the serial program by a speedup factor of 20 and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR ranges from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computational power of the GPU. However, the performance of the control-intensive parts (CAVLC) is strongly tied to memory bandwidth, which gives insight for new architecture designs. PMID:24757432

  13. Efficient Process Migration for Parallel Processing on Non-Dedicated Networks of Workstations

    NASA Technical Reports Server (NTRS)

    Chanchio, Kasidit; Sun, Xian-He

    1996-01-01

    This paper presents the design and preliminary implementation of MpPVM, a software system that supports process migration for PVM application programs in a non-dedicated heterogeneous computing environment. New concepts of migration point as well as migration point analysis and necessary data analysis are introduced. In MpPVM, process migrations occur only at previously inserted migration points. Migration point analysis determines appropriate locations to insert migration points, whereas necessary data analysis provides a minimum set of variables to be transferred at each migration point. A new methodology to perform reliable point-to-point data communications in a migration environment is also discussed. Finally, a preliminary implementation of MpPVM and its experimental results are presented, showing the correctness and promising performance of our process migration mechanism in a scalable non-dedicated heterogeneous computing environment. While MpPVM is developed on top of PVM, the process migration methodology introduced in this study is general and can be applied to any distributed software environment.

  14. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. © 2001 The Willi Hennig Society.

  15. Efficiency of parallel direct optimization.

    PubMed

    Janies, D A; Wheeler, W C

    2001-03-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. PMID:12240679

  16. Speeding up parallel processing

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.

    1988-01-01

    In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.
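
    The competing views can be made concrete with two textbook formulas (not the Sandia group's analysis itself): Amdahl's fixed-size speedup and Gustafson's scaled speedup for a serial fraction s on N processors.

```python
# Textbook speedup formulas behind the debate the abstract describes.
def amdahl(s, N):
    """Fixed-size speedup: S = 1 / (s + (1 - s) / N)."""
    return 1.0 / (s + (1.0 - s) / N)

def gustafson(s, N):
    """Scaled speedup: S = N - s * (N - 1)."""
    return N - s * (N - 1)

N = 1024
for s in (0.01, 0.001):
    print(f"s={s}: Amdahl {amdahl(s, N):7.1f}, Gustafson {gustafson(s, N):7.1f}")
# Even a 1% serial fraction caps the fixed-size speedup near 91 on 1024 nodes,
# while the scaled-problem view admits speedups in the hundreds or more -- the
# distinction the Sandia results brought to the fore.
```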

  17. Coarrays for Parallel Processing

    NASA Technical Reports Server (NTRS)

    Snyder, W. Van

    2011-01-01

    The design of the Coarray feature of Fortran 2008 was guided by answering the question "What is the smallest change required to convert Fortran into a robust and efficient parallel language?" Two fundamental issues that any parallel programming model must address are work distribution and data distribution. In order to coordinate work distribution and data distribution, methods for communication and synchronization must be provided. Although originally designed for Fortran, the Coarray paradigm has stimulated development in other languages. X10, Chapel, UPC, Titanium, and class libraries being developed for C++ have the same conceptual framework.

  18. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Lau, Sonie

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert systems. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited.

  19. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Lau, Sonie; Yan, Jerry C.

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.

  20. An efficient parallel algorithm for mesh smoothing

    SciTech Connect

    Freitag, L.; Plassmann, P.; Jones, M.

    1995-12-31

    Automatic mesh generation and adaptive refinement methods have proven to be very successful tools for the efficient solution of complex finite element applications. A problem with these methods is that they can produce poorly shaped elements; such elements are undesirable because they introduce numerical difficulties in the solution process. However, the shape of the elements can be improved through the determination of new geometric locations for mesh vertices by using a mesh smoothing algorithm. In this paper the authors present a new parallel algorithm for mesh smoothing that has a fast parallel runtime both in theory and in practice. The authors present an efficient implementation of the algorithm that uses non-smooth optimization techniques to find the new location of each vertex. Finally, they present experimental results obtained on the IBM SP system demonstrating the efficiency of this approach.
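
    The paper's algorithm relocates each vertex with non-smooth optimization; as a simpler stand-in that illustrates mesh smoothing itself, the sketch below applies plain Laplacian smoothing, moving each free vertex toward the centroid of its neighbors in Jacobi-style sweeps whose per-vertex updates are independent and hence easy to parallelize. All names are illustrative.

```python
# Plain Laplacian smoothing (a simpler technique than the paper's non-smooth
# optimization): each interior vertex moves to the centroid of its neighbors.
import numpy as np

def laplacian_smooth(coords, neighbors, fixed, sweeps=10):
    """coords: (n, 2) array of vertex positions; neighbors: dict vertex -> list
    of neighbor vertex ids; fixed: set of boundary vertex ids that must not move."""
    coords = coords.copy()
    for _ in range(sweeps):
        new = coords.copy()
        for v, nbrs in neighbors.items():
            if v not in fixed and nbrs:
                new[v] = coords[list(nbrs)].mean(axis=0)
        coords = new  # Jacobi-style update: every vertex uses last sweep's
                      # positions, so the per-vertex moves are independent
    return coords
```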

  1. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1991-01-01

    The main contribution of the effort in the last two years is the introduction of the MOPPS system. After an extensive literature search, we introduced the system, which is described next. MOPPS employs a new solution to the problem of managing programs which solve scientific and engineering applications in a distributed processing environment. Autonomous computers cooperate efficiently in solving large scientific problems with this solution. MOPPS has the advantage of not assuming the presence of any particular network topology or configuration, computer architecture, or operating system. It imposes little overhead on network and processor resources while efficiently managing programs concurrently. The core of MOPPS is an intelligent program manager that builds a knowledge base of the execution performance of the parallel programs it is managing under various conditions. The manager applies this knowledge to improve the performance of future runs. The program manager learns from experience.

  2. EFFICIENT SCHEDULING OF PARALLEL JOBS ON MASSIVELY PARALLEL SYSTEMS

    SciTech Connect

    F. PETRINI; W. FENG

    1999-09-01

    We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.

  3. Efficient, massively parallel eigenvalue computation

    NASA Technical Reports Server (NTRS)

    Huo, Yan; Schreiber, Robert

    1993-01-01

    In numerical simulations of disordered electronic systems, one of the most common approaches is to diagonalize random Hamiltonian matrices and to study the eigenvalues and eigenfunctions of a single electron in the presence of a random potential. An effort to implement a matrix diagonalization routine for real symmetric dense matrices on massively parallel SIMD computers, the Maspar MP-1 and MP-2 systems, is described. Results of numerical tests and timings are also presented.

  4. Knowledge representation into Ada parallel processing

    NASA Technical Reports Server (NTRS)

    Masotto, Tom; Babikyan, Carol; Harper, Richard

    1990-01-01

    The Knowledge Representation into Ada Parallel Processing project is a joint NASA and Air Force funded project to demonstrate the execution of intelligent systems in Ada on the Charles Stark Draper Laboratory fault-tolerant parallel processor (FTPP). Two applications were demonstrated - a portion of the adaptive tactical navigator and a real-time controller. Both systems are implemented as Activation Framework Objects on the Activation Framework intelligent scheduling mechanism developed by Worcester Polytechnic Institute. The implementations and the results of performance analyses, showing speedup due to parallelism and initial efficiency improvements, are detailed, and further areas for performance improvement are suggested.

  5. Parallel processing spacecraft communication system

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary S. (Inventor); Donaldson, James A. (Inventor); Luong, Huy H. (Inventor); Wood, Steven H. (Inventor)

    1998-01-01

    An uplink controlling assembly speeds data processing using a special parallel codeblock technique. A correct start sequence initiates processing of a frame. Two possible start sequences can be used, and the one which is used determines whether data polarity is inverted or non-inverted. Processing continues until uncorrectable errors are found. The frame ends by intentionally sending a block with an uncorrectable error. Each of the codeblocks in the frame has a channel ID. Each channel ID can be separately processed in parallel. This obviates the problem of waiting for error correction processing. If that channel number is zero, however, it indicates that the frame of data represents a critical command only. That data is handled in a special way, independent of the software. Otherwise, the processed data is further handled using special double-buffering techniques to avoid problems from overrun. When overrun does occur, the system takes action to lose only the oldest data.

  6. Efficient communication in massively parallel computers

    SciTech Connect

    Cypher, R.E.

    1989-01-01

    A fundamental operation in parallel computation is sorting. Sorting is important not only because it is required by many algorithms, but also because it can be used to implement irregular, pointer-based communication. The author studies two algorithms for sorting in massively parallel computers. First, he examines Shellsort. Shellsort is a sorting algorithm that is based on a sequence of parameters called increments. Shellsort can be used to create a parallel sorting device known as a sorting network. Researchers have suggested that if the correct increment sequence is used, an optimal size sorting network can be obtained. All published increment sequences have been monotonically decreasing. He shows that no monotonically decreasing increment sequence will yield an optimal size sorting network. Second, he presents a sorting algorithm called Cubesort. Cubesort is the fastest known sorting algorithm for a variety of parallel computers over a wide range of parameters. He also presents a paradigm for developing parallel algorithms that have efficient communication. The paradigm, called the data reduction paradigm, consists of using a divide-and-conquer strategy. Both the division and combination phases of the divide-and-conquer algorithm may require irregular, pointer-based communication between processors. However, the problem is divided so as to limit the amount of data that must be communicated. As a result, the communication can be performed efficiently. He presents data reduction algorithms for the image component labeling problem, the closest pair problem, and four versions of the parallel prefix problem.
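
    The parallel prefix problem mentioned at the end of the abstract can be illustrated with the classic recursive-doubling scan (a generic illustration, not one of the dissertation's data-reduction algorithms); each pass below corresponds to one step that all processors would execute simultaneously.

```python
# Recursive-doubling prefix sums: log2(n) passes, each pass's updates are
# independent and would run concurrently on a parallel machine.
def prefix_sums(x):
    x = list(x)
    n, step = len(x), 1
    while step < n:
        # on a parallel machine, all of these updates happen at the same time
        x = [x[i] + (x[i - step] if i >= step else 0) for i in range(n)]
        step *= 2
    return x

print(prefix_sums([3, 1, 4, 1, 5, 9, 2, 6]))  # [3, 4, 8, 9, 14, 23, 25, 31]
```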

  7. Massively parallel femtosecond laser processing.

    PubMed

    Hasegawa, Satoshi; Ito, Haruyasu; Toyoda, Haruyoshi; Hayasaki, Yoshio

    2016-08-01

    Massively parallel femtosecond laser processing with more than 1000 beams was demonstrated. Parallel beams were generated by a computer-generated hologram (CGH) displayed on a spatial light modulator (SLM). The key to this technique is to optimize the CGH in the laser processing system using a scheme called in-system optimization. It was analytically demonstrated that the number of beams is determined by the horizontal number of pixels in the SLM, $N_{SLM}$, that is imaged at the pupil plane of an objective lens, and a distance parameter $p_d$ obtained by dividing the distance between adjacent beams by the diffraction-limited beam diameter. A performance limitation of parallel laser processing in our system was estimated at $N_{SLM}$ of 250 and $p_d$ of 7.0. Based on these parameters, the maximum number of beams in a hexagonal close-packed structure was calculated to be 1189 by using an analytical equation. PMID:27505815

  8. Efficient parallel global garbage collection on massively parallel computers

    SciTech Connect

    Kamada, Tomio; Matsuoka, Satoshi; Yonezawa, Akinori

    1994-12-31

    On distributed-memory high-performance MPPs where processors are interconnected by an asynchronous network, efficient Garbage Collection (GC) becomes difficult due to inter-node references and references within pending, unprocessed messages. The parallel global GC algorithm (1) takes advantage of reference locality, (2) efficiently traverses references over nodes, (3) admits minimum pause time of ongoing computations, and (4) has been shown to scale up to 1024-node MPPs. The algorithm employs a global weight counting scheme to substantially reduce message traffic. Two methods for confirming the arrival of pending messages are used: one counts the numbers of messages and the other uses network "bulldozing." Performance evaluation in actual implementations on a multicomputer with 32-1024 nodes, the Fujitsu AP1000, reveals various favorable properties of the algorithm.

  9. On the design, analysis, and implementation of efficient parallel algorithms

    SciTech Connect

    Sohn, S.M.

    1989-01-01

    There is considerable interest in developing algorithms for a variety of parallel computer architectures. This is not a trivial problem, although for certain models great progress has been made. Recently, general-purpose parallel machines have become available commercially. These machines possess widely varying interconnection topologies and data/instruction access schemes. It is important, therefore, to develop methodologies and design paradigms for not only synthesizing parallel algorithms from initial problem specifications, but also for mapping algorithms between different architectures. This work has considered both of these problems. A systolic array consists of a large collection of simple processors that are interconnected in a uniform pattern. The author has studied in detail the problem of mapping systolic algorithms onto more general-purpose parallel architectures such as the hypercube. The hypercube architecture is notable due to its symmetry and high connectivity, characteristics which are conducive to the efficient embedding of parallel algorithms. Although the parallel-to-parallel mapping techniques have yielded efficient target algorithms, it is not surprising that an algorithm designed directly for a particular parallel model would achieve superior performance. In this context, the author has developed hypercube algorithms for some important problems in speech and signal processing, text processing, language processing and artificial intelligence. These algorithms were implemented on a 64-node NCUBE/7 hypercube machine in order to evaluate their performance.

  10. Cluster-based parallel image processing toolkit

    NASA Astrophysics Data System (ADS)

    Squyres, Jeffery M.; Lumsdaine, Andrew; Stevenson, Robert L.

    1995-03-01

    Many image processing tasks exhibit a high degree of data locality and parallelism and map quite readily to specialized massively parallel computing hardware. However, as network technologies continue to mature, workstation clusters are becoming a viable and economical parallel computing resource, so it is important to understand how to use these environments for parallel image processing as well. In this paper we discuss our implementation of a parallel image processing software library (the Parallel Image Processing Toolkit). The Toolkit uses a message-passing model of parallelism designed around the Message Passing Interface (MPI) standard. Experimental results are presented to demonstrate the parallel speedup obtained with the Parallel Image Processing Toolkit in a typical workstation cluster over a wide variety of image processing tasks. We also discuss load balancing and the potential for parallelizing portions of image processing tasks that seem to be inherently sequential, such as visualization and data I/O.
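
    Purely as an illustration of the message-passing, data-parallel pattern the Toolkit is built on (using mpi4py and SciPy here, not the Toolkit's own API), the sketch below scatters row strips of an image to the ranks, filters each strip, and gathers the results; a real filter would also need a halo exchange at strip boundaries, omitted for brevity.

```python
# Scatter/filter/gather pattern for data-parallel image processing.
# Run with, e.g.: mpiexec -n 4 python demo.py
import numpy as np
from mpi4py import MPI
from scipy.ndimage import median_filter

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    image = np.random.rand(1024, 1024)
    strips = np.array_split(image, size, axis=0)  # one row strip per rank
else:
    strips = None

strip = comm.scatter(strips, root=0)      # distribute row strips
filtered = median_filter(strip, size=3)   # each rank filters its strip
result = comm.gather(filtered, root=0)    # collect results at the root

if rank == 0:
    output = np.vstack(result)
    print(output.shape)
```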

  11. Multi-petascale highly efficient parallel supercomputer

    SciTech Connect

    Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.; Blumrich, Matthias A.; Boyle, Peter; Brunheroto, Jose R.; Chen, Dong; Cher, Chen -Yong; Chiu, George L.; Christ, Norman; Coteus, Paul W.; Davis, Kristan D.; Dozsa, Gabor J.; Eichenberger, Alexandre E.; Eisley, Noel A.; Ellavsky, Matthew R.; Evans, Kahn C.; Fleischer, Bruce M.; Fox, Thomas W.; Gara, Alan; Giampapa, Mark E.; Gooding, Thomas M.; Gschwind, Michael K.; Gunnels, John A.; Hall, Shawn A.; Haring, Rudolf A.; Heidelberger, Philip; Inglett, Todd A.; Knudson, Brant L.; Kopcsay, Gerard V.; Kumar, Sameer; Mamidala, Amith R.; Marcella, James A.; Megerian, Mark G.; Miller, Douglas R.; Miller, Samuel J.; Muff, Adam J.; Mundy, Michael B.; O'Brien, John K.; O'Brien, Kathryn M.; Ohmacht, Martin; Parker, Jeffrey J.; Poole, Ruth J.; Ratterman, Joseph D.; Salapura, Valentina; Satterfield, David L.; Senger, Robert M.; Smith, Brian; Steinmacher-Burow, Burkhard; Stockdell, William M.; Stunkel, Craig B.; Sugavanam, Krishnan; Sugawara, Yutaka; Takken, Todd E.; Trager, Barry M.; Van Oosten, James L.; Wait, Charles D.; Walkup, Robert E.; Watson, Alfred T.; Wisniewski, Robert W.; Wu, Peng

    2015-07-14

    A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power, and footprint, that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enable a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, each having full access to all system resources, and enables adaptive partitioning of the processors to functions such as compute or messaging I/O on an application-by-application basis and, preferably, adaptive partitioning of functions in accordance with various algorithmic phases within an application; if I/O or other processors are underutilized, they can participate in computation or communication. Nodes are interconnected by a five-dimensional torus network with DMA that maximizes the throughput of packet communications between nodes and minimizes latency.

  12. Parallel Processing at the High School Level.

    ERIC Educational Resources Information Center

    Sheary, Kathryn Anne

    This study investigated the ability of high school students to cognitively understand and implement parallel processing. Data indicates that most parallel processing is being taught at the university level. Instructional modules on C, Linux, and the parallel processing language, P4, were designed to show that high school students are highly…

  13. Hydrologic Terrain Processing Using Parallel Computing

    NASA Astrophysics Data System (ADS)

    Tarboton, D. G.; Watson, D. W.; Wallace, R. M.; Schreuders, K.; Tesfa, T. K.

    2009-12-01

    Topography, in the form of Digital Elevation Models (DEMs), is widely used to derive information for the modeling of hydrologic processes. Hydrologic terrain analysis augments the information content of digital elevation data by removing spurious pits, deriving a structured flow field, and calculating surfaces of hydrologic information derived from the flow field. The increasing availability of high-resolution terrain datasets for large areas poses a challenge for existing algorithms that process terrain data to extract this hydrologic information. This paper will describe parallel algorithms that have been developed to enhance hydrologic terrain pre-processing so that larger datasets can be more efficiently computed. Message Passing Interface (MPI) parallel implementations have been developed for pit removal, flow direction, and generalized flow accumulation methods within the Terrain Analysis Using Digital Elevation Models (TauDEM) package. The parallel algorithm works by decomposing the domain into striped or tiled data partitions where each tile is processed by a separate processor. This method also reduces the memory requirements of each processor so that larger size grids can be processed. The parallel pit removal algorithm is adapted from the method of Planchon and Darboux that starts from a high elevation and then progressively scans the grid, lowering each grid cell to the maximum of the original elevation or the lowest neighbor. The MPI implementation reconciles elevations along process domain edges after each scan. Generalized flow accumulation extends flow accumulation approaches commonly available in GIS through the integration of multiple inputs and a broad class of algebraic rules into the calculation of flow-related quantities. It is based on establishing a flow field through DEM grid cells, which is then used to evaluate any mathematical function that incorporates dependence on values of the quantity being evaluated at upslope (or downslope) grid cells.
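
    A serial sketch of the Planchon and Darboux style scan described above is given below; the TauDEM version partitions the grid with MPI and reconciles tile edges after each scan, which is omitted here, and the 4-neighbor connectivity and eps increment are simplifications chosen for the example.

```python
# Serial pit-filling sketch: start interior cells very high, then repeatedly
# lower each cell to the maximum of its original elevation and its lowest
# already-filled neighbor (plus a small increment to force drainage).
import numpy as np

def fill_pits(dem, eps=1e-6):
    filled = np.full(dem.shape, np.inf)
    filled[0, :] = dem[0, :]; filled[-1, :] = dem[-1, :]   # grid edges drain freely
    filled[:, 0] = dem[:, 0]; filled[:, -1] = dem[:, -1]
    changed = True
    while changed:
        changed = False
        for i in range(1, dem.shape[0] - 1):
            for j in range(1, dem.shape[1] - 1):
                lowest_nbr = min(filled[i-1, j], filled[i+1, j],
                                 filled[i, j-1], filled[i, j+1])
                target = max(dem[i, j], lowest_nbr + eps)
                if target < filled[i, j]:
                    filled[i, j] = target
                    changed = True
    return filled
```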

  14. Parallel processing and medium-scale multiprocessors

    SciTech Connect

    Wouk, A.

    1989-01-01

    For some time, the community interested in large-scale scientific computing has been attempting to come to terms with parallel computation using a number of processors sufficient to make their concurrent utilization interesting, challenging, and, in the long run, beneficial. Unexpected consequences of parallelization have been discovered. It is possible to obtain reduced performance, both relative and absolute, from an increased number of processors, as a result of inappropriate use of resources in a multiprocessor environment. This exemplifies one of the paradoxes which result from our cultural bias towards sequential thought processes. As a consequence there is a bias for sequential styles of program development in a multiprocessor environment. The authors have learned that the problem of automatic optimization in compilation of parallel programs is computationally hard. Early hopes that automatic, optimal parallelization of sequentially conceived programs would be as achievable as earlier automatic vectorization had been, have been dashed. The authors lack the insights and folklore which are needed to develop useful methodologies and heuristics in the area of parallel computation. The authors are embarked on a voyage of exploration of this new territory, and the work described in this volume can provide helpful guidance. The authors have to explore fully the differences between distributed memory systems, shared memory systems, and combinations, as well as the relative applicability of SIMD and MIMD architectures. Based on the information obtained in such exploration, useful steps towards efficient utilization of many processors should become possible. This paper covers several areas: systems programming, parallel/language/programming systems, and applications programming.

  15. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Intel Paragon, IBM SP2, and Cray Origin2000, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is to 1) develop highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, and 3) incorporate newly developed algorithms into actual simulation packages. The work plan has been well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) adopting a mathematical geometry which has a better capacity to describe the fluid, and (2) using a compact scheme to gain high-order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

  16. Efficient parallel CFD-DEM simulations using OpenMP

    NASA Astrophysics Data System (ADS)

    Amritkar, Amit; Deb, Surya; Tafti, Danesh

    2014-01-01

    The paper describes parallelization strategies for the Discrete Element Method (DEM) used for simulating dense particulate systems coupled to Computational Fluid Dynamics (CFD). While the field equations of CFD are best parallelized by spatial domain decomposition techniques, the N-body particulate phase is best parallelized over the number of particles. When the two are coupled together, both modes are needed for efficient parallelization. It is shown that under these requirements, OpenMP thread-based parallelization has advantages over MPI processes. Two representative examples, fairly typical of dense fluid-particulate systems, are investigated, including the validation of the DEM-CFD and thermal-DEM implementation with experiments. Fluidized bed calculations are performed on beds with uniform particle loading, parallelized with MPI and OpenMP. It is shown that as the number of processing cores and the number of particles increase, the communication overhead of building ghost particle lists at processor boundaries dominates time to solution, and OpenMP, which does not require this step, is about twice as fast as MPI. In rotary kiln heat transfer calculations, which are characterized by spatially non-uniform particle distributions, the low overhead of switching the parallelization mode in OpenMP eliminates the load imbalances, but introduces increased overheads in fetching non-local data. In spite of this, it is shown that OpenMP is between 50% and 90% faster than MPI.

  17. Parallel processing in immune networks

    NASA Astrophysics Data System (ADS)

    Agliari, Elena; Barra, Adriano; Bartolucci, Silvia; Galluzzi, Andrea; Guerra, Francesco; Moauro, Francesco

    2013-04-01

    In this work, we adopt a statistical-mechanics approach to investigate basic, systemic features exhibited by adaptive immune systems. The lymphocyte network made by B cells and T cells is modeled by a bipartite spin glass, where, following biological prescriptions, links connecting B cells and T cells are sparse. Interestingly, the dilution performed on links is shown to make the system able to orchestrate parallel strategies to fight several pathogens at the same time; this multitasking capability constitutes a remarkable, key property of immune systems as multiple antigens are always present within the host. We also define the stochastic process ruling the temporal evolution of lymphocyte activity and show its relaxation toward an equilibrium measure allowing statistical-mechanics investigations. Analytical results are compared with Monte Carlo simulations and signal-to-noise outcomes showing overall excellent agreement. Finally, within our model, a rationale for the experimentally well-evidenced correlation between lymphocytosis and autoimmunity is achieved; this sheds further light on the systemic features exhibited by immune networks.

  18. Efficient Parallel Engineering Computing on Linux Workstations

    NASA Technical Reports Server (NTRS)

    Lou, John Z.

    2010-01-01

    A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallel computing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).

  19. Bitplane Image Coding With Parallel Coefficient Processing.

    PubMed

    Auli-Llinas, Francesc; Enfedaque, Pablo; Moure, Juan C; Sanchez, Victor

    2016-01-01

    Image coding systems have been traditionally tailored for multiple instruction, multiple data (MIMD) computing. In general, they partition the (transformed) image into codeblocks that can be coded in the cores of MIMD-based processors. Each core executes a sequential flow of instructions to process the coefficients in the codeblock, independently and asynchronously from the other cores. Bitplane coding is a common strategy to code such data. Most of its mechanisms require sequential processing of the coefficients. Recent years have seen the rise of processing accelerators with enhanced computational performance and power efficiency whose architecture is mainly based on the single instruction, multiple data (SIMD) principle. SIMD computing refers to the execution of the same instruction on multiple data in a lockstep synchronous way. Unfortunately, current bitplane coding strategies cannot fully profit from such processors due to the inherently sequential nature of the coding task. This paper presents bitplane image coding with parallel coefficient (BPC-PaCo) processing, a coding method that can process many coefficients within a codeblock in parallel and synchronously. To this end, the scanning order, the context formation, the probability model, and the arithmetic coder of the coding engine have been re-formulated. The experimental results suggest that the penalization in coding performance of BPC-PaCo with respect to the traditional strategies is almost negligible. PMID:26441420

  20. Applications of Parallel Processing to Astrodynamics

    NASA Astrophysics Data System (ADS)

    Coffey, S.; Healy, L.; Neal, H.

    1996-03-01

    Parallel processing is being used to improve the catalog of earth orbiting satellites and for problems associated with the catalog. Initial efforts centered on using SIMD parallel processors to perform debris conjunction analysis and satellite dynamics studies. More recently, the availability of cheap supercomputing processors and parallel processing software such as PVM has enabled the reutilization of existing astrodynamics software in distributed parallel processing environments. Computations once taking many days with traditional mainframes are now being performed in only a few hours. Efforts underway for the US Naval Space Command include conjunction prediction, uncorrelated target processing, and a new space object catalog based on orbit determination and prediction with special perturbations methods.

  1. Instruction-level parallel processing.

    PubMed

    Fisher, J A; Rau, R

    1991-09-13

    The performance of microprocessors has increased steadily over the past 20 years at a rate of about 50% per year. This is the cumulative result of architectural improvements as well as increases in circuit speed. Moreover, this improvement has been obtained in a transparent fashion, that is, without requiring programmers to rethink their algorithms and programs, thereby enabling the tremendous proliferation of computers that we see today. To continue this performance growth, microprocessor designers have incorporated instruction-level parallelism (ILP) into new designs. ILP utilizes the parallel execution of the lowest level computer operations (adds, multiplies, loads, and so on) to increase performance transparently. The use of ILP promises to make possible, within the next few years, microprocessors whose performance is many times that of a CRAY-1S. This article provides an overview of ILP, with an emphasis on ILP architectures (superscalar, VLIW, and dataflow processors) and the compiler techniques necessary to make ILP work well. PMID:17831442

  2. Multi-prediction particle filter for efficient parallelized implementation

    NASA Astrophysics Data System (ADS)

    Chu, Chun-Yuan; Chao, Chih-Hao; Chao, Min-An; Wu, An-Yeu Andy

    2011-12-01

    The particle filter (PF) is an emerging signal processing methodology, which can effectively deal with nonlinear and non-Gaussian signals by a sample-based approximation of the state probability density function. The particle generation of the PF is a data-independent procedure and can be implemented in parallel. However, the resampling procedure in the PF is a sequential task in nature and difficult to parallelize. Based on Amdahl's law, the sequential portion of a task limits the maximum speed-up of the parallelized implementation. Moreover, a large number of particles is usually required to obtain an accurate estimation, and the complexity of the resampling procedure is highly related to the number of particles. In this article, we propose a multi-prediction (MP) framework with two selection approaches. The proposed MP framework can reduce the particle number required for a target estimation accuracy, and the sequential operation of the resampling can be reduced. Moreover, the overhead of the MP framework can be easily compensated by parallel implementation. The proposed MP-PF alleviates the global sequential operation by increasing the local parallel computation. In addition, the MP-PF is very suitable for the multi-core graphics processing unit (GPU) platform, which is a popular parallel processing architecture. We give prototypical implementations of the MP-PFs on a multi-core GPU platform. For the classic bearing-only tracking experiments, the proposed MP-PF can be 25.1 and 15.3 times faster than the sequential importance resampling PF with 10,000 and 20,000 particles, respectively. Hence, the proposed MP-PF can enhance the efficiency of the parallelization.
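
    For context on the bottleneck described above, the sketch below shows textbook systematic resampling (not the article's MP-PF): the cumulative-sum pass over the weights is the inherently sequential step whose cost grows with the particle count that the MP framework tries to reduce.

```python
# Textbook systematic resampling for a particle filter.
import numpy as np

def systematic_resample(weights, rng=None):
    rng = rng or np.random.default_rng()
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumsum = np.cumsum(weights)   # sequential dependency across all particles
    cumsum[-1] = 1.0              # guard against floating-point round-off
    return np.searchsorted(cumsum, positions)

w = np.array([0.1, 0.2, 0.3, 0.4])
print(systematic_resample(w))     # indices of the surviving particles
```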

  3. Use of parallel computing in mass processing of laser data

    NASA Astrophysics Data System (ADS)

    Będkowski, J.; Bratuś, R.; Prochaska, M.; Rzonca, A.

    2015-12-01

    The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.

  4. An efficient parallel termination detection algorithm

    SciTech Connect

    Baker, A. H.; Crivelli, S.; Jessup, E. R.

    2004-05-27

    Information local to any one processor is insufficient to monitor the overall progress of most distributed computations. Typically, a second distributed computation for detecting termination of the main computation is necessary. In order to be a useful computational tool, the termination detection routine must operate concurrently with the main computation, adding minimal overhead, and it must promptly and correctly detect termination when it occurs. In this paper, we present a new algorithm for detecting the termination of a parallel computation on distributed-memory MIMD computers that satisfies all of those criteria. A variety of termination detection algorithms have been devised. Of these, the algorithm presented by Sinha, Kale, and Ramkumar (henceforth, the SKR algorithm) is unique in its ability to adapt to the load conditions of the system on which it runs, thereby minimizing the impact of termination detection on performance. Because their algorithm also detects termination quickly, we consider it to be the most efficient practical algorithm presently available. The termination detection algorithm presented here was developed for use in the PMESC programming library for distributed-memory MIMD computers. Like the SKR algorithm, our algorithm adapts to system loads and imposes little overhead. Also like the SKR algorithm, ours is tree-based, and it does not depend on any assumptions about the physical interconnection topology of the processors or the specifics of the distributed computation. In addition, our algorithm is easier to implement and requires only half as many tree traverses as does the SKR algorithm. This paper is organized as follows. In section 2, we define our computational model. In section 3, we review the SKR algorithm. We introduce our new algorithm in section 4, and prove its correctness in section 5. We discuss its efficiency and present experimental results in section 6.

  5. Communication-efficient parallel architectures and algorithms for image computations

    SciTech Connect

    Alnuweiri, H.M.

    1989-01-01

    The main purpose of this dissertation is the design of efficient parallel techniques for image computations which require global operations on image pixels, as well as the development of parallel architectures with special communication features which can support global data movement efficiently. The class of image problems considered in this dissertation involves global operations on image pixels, and irregular (data-dependent) data movement operations. Such problems include histogramming, component labeling, proximity computations, computing the Hough Transform, computing convexity of regions and related properties such as computing the diameter and a smallest area enclosing rectangle for each region. Images with multiple figures and multiple labeled-sets of pixels are also considered. Efficient solutions to such problems involve integer sorting, graph theoretic techniques, and techniques from computational geometry. Although such solutions are not computationally intensive (they all require O(n^2) operations to be performed on an n x n image), they require global communications. The emphasis here is on developing parallel techniques for data movement, reduction, and distribution, which lead to processor-time optimal solutions for such problems on the proposed organizations. The proposed parallel architectures are based on a memory array which can be viewed as an arrangement of memory modules in a k-dimensional space such that the modules are connected to buses placed parallel to the orthogonal axes of the space, and each bus is connected to one processor or a group of processors. It will be shown that such organizations are communication-efficient and are thus highly suited to the image problems considered here, and also to several other classes of problems. The proposed organizations have p processors and O(n^2) words of memory to process n x n images.
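
    Histogramming is the simplest of the global operations listed: every pixel contributes to a single global count, so the natural parallel scheme is a data-parallel local histogram per block followed by a reduction. A small sketch of that pattern (our illustration; it does not model the dissertation's bus-connected memory array):

      import numpy as np

      LEVELS = 256
      image = np.random.randint(0, LEVELS, (1024, 1024), dtype=np.uint8)

      # Split the image into blocks, one per (conceptual) processor ...
      blocks = [b for row in np.array_split(image, 4, axis=0)
                  for b in np.array_split(row, 4, axis=1)]
      # ... each processor histograms its own block (purely local work) ...
      partial = [np.bincount(b.ravel(), minlength=LEVELS) for b in blocks]
      # ... and a reduction sums the partial histograms (the only communication step).
      histogram = np.sum(partial, axis=0)
      assert histogram.sum() == image.size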

  6. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1995-01-01

    The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.

  7. Computationally efficient implementation of combustion chemistry in parallel PDF calculations

    NASA Astrophysics Data System (ADS)

    Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.

    2009-08-01

    In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel

  8. Parallel processing of a rotating shaft simulation

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.

    1989-01-01

    A FORTRAN program describing the vibration modes of a rotor-bearing system is analyzed for parallelism in this simulation using a Pascal-like structured language. Potential vector operations are also identified. A critical path through the simulation is identified and used in conjunction with somewhat fictitious processor characteristics to determine the time to calculate the problem on a parallel processing system having those characteristics. A parallel processing overhead time is included as a parameter for proper evaluation of the gain over serial calculation. The serial calculation time is determined for the same fictitious system. An improvement of up to 640 percent is possible depending on the value of the overhead time. Based on the analysis, certain conclusions are drawn pertaining to the development needs of parallel processing technology, and to the specification of parallel processing systems to meet computational needs.
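
    The gain reported above depends directly on the overhead parameter; a worked form of that comparison, using our own notation and made-up times rather than the paper's measurements, is simply the serial run time divided by the critical-path time plus the parallel-processing overhead:

      def speedup(serial_time, critical_path_time, overhead):
          """Gain of the parallel simulation over the serial calculation."""
          return serial_time / (critical_path_time + overhead)

      # Hypothetical timings (arbitrary units): the achievable gain shrinks
      # as the parallel-processing overhead grows.
      for overhead in (0.0, 0.5, 2.0):
          print(overhead, round(speedup(64.0, 10.0, overhead), 2))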

  9. Efficient parallel simulation of CO2 geologic sequestration in saline aquifers

    SciTech Connect

    Zhang, Keni; Doughty, Christine; Wu, Yu-Shu; Pruess, Karsten

    2007-01-01

    An efficient parallel simulator for large-scale, long-term CO2 geologic sequestration in saline aquifers has been developed. The parallel simulator is a three-dimensional, fully implicit model that solves large, sparse linear systems arising from discretization of the partial differential equations for mass and energy balance in porous and fractured media. The simulator is based on the ECO2N module of the TOUGH2 code and inherits all the process capabilities of the single-CPU TOUGH2 code, including a comprehensive description of the thermodynamics and thermophysical properties of H2O-NaCl-CO2 mixtures, modeling single and/or two-phase isothermal or non-isothermal flow processes, two-phase mixtures, fluid phases appearing or disappearing, as well as salt precipitation or dissolution. The new parallel simulator uses MPI for parallel implementation, the METIS software package for simulation domain partitioning, and the iterative parallel linear solver package Aztec for solving linear equations by multiple processors. In addition, the parallel simulator has been implemented with an efficient communication scheme. Test examples show that a linear or super-linear speedup can be obtained on Linux clusters as well as on supercomputers. Because of the significant improvement in both simulation time and memory requirement, the new simulator provides a powerful tool for tackling larger scale and more complex problems than can be solved by single-CPU codes. A high-resolution simulation example is presented that models buoyant convection, induced by a small increase in brine density caused by dissolution of CO2.

  10. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  11. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-08-12

    Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  12. FORTRAN Extensions for Modular Parallel Processing

    Energy Science and Technology Software Center (ESTSC)

    1996-01-12

    FORTRAN M is a small set of extensions to FORTRAN that supports a modular approach to the construction of sequential and parallel programs. FORTRAN M programs use channels to plug together processes which may be written in FORTRAN M or FORTRAN 77. Processes communicate by sending and receiving messages on channels. Channels and processes can be created dynamically, but programs remain deterministic unless specialized nondeterministic constructs are used.

  13. Associative massively parallel processor for video processing

    NASA Astrophysics Data System (ADS)

    Krikelis, Argy; Tawiah, T.

    1996-03-01

    Massively parallel processing architectures have matured primarily through image processing and computer vision applications. The similarity of processing requirements between these areas and video processing suggests that they should be very appropriate for video processing applications. This research describes the use of an associative massively parallel processing based system for video compression which includes architectural and system description, discussion of the implementation of compression tasks such as DCT/IDCT, Motion Estimation and Quantization and system evaluation. The core of the processing system is the ASP (Associative String Processor) architecture, a modular massively parallel, programmable and inherently fault-tolerant fine-grain SIMD processing architecture incorporating a string of identical APEs (Associative Processing Elements), a reconfigurable inter-processor communication network and a Vector Data Buffer for fully-overlapped data input-output. For video compression applications a prototype system is developed, which uses ASP modules to implement the required compression tasks. This scheme leads to a linear speed-up of the computation by simply adding more APEs to the modules.

  14. A fast and efficient adaptive parallel ray tracing based model for thermally coupled surface radiation in casting and heat treatment processes

    NASA Astrophysics Data System (ADS)

    Fainberg, J.; Schaefer, W.

    2015-06-01

    A new algorithm for heat exchange between thermally coupled diffusely radiating interfaces is presented, which can be applied for closed and half open transparent radiating cavities. Interfaces between opaque and transparent materials are automatically detected and subdivided into elementary radiation surfaces named tiles. Contrary to the classical view factor method, the fixed unit sphere area subdivision oriented along the normal tile direction is projected onto the surrounding radiation mesh and not vice versa. Then, the total incident radiating flux of the receiver is approximated as a direct sum of radiation intensities of representative “senders” with the same weight factor. A hierarchical scheme for the space angle subdivision is selected in order to minimize the total memory and the computational demands during thermal calculations. Direct visibility is tested by means of a voxel-based ray tracing method accelerated by means of the anisotropic Chebyshev distance method, which reuses the computational grid as a Chebyshev one. The ray tracing algorithm is fully parallelized using MPI and takes advantage of the balanced distribution of all available tiles among all CPUs. This approach allows tracing of each particular ray without any communication. The algorithm has been implemented in commercial casting process simulation software. The accuracy and computational performance of the new radiation model for heat treatment, investment and ingot casting applications is illustrated using industrial examples.

  15. Casting Pearls Ballistically: Efficient Massively Parallel Simulation of Particle Deposition

    NASA Astrophysics Data System (ADS)

    Lubachevsky, Boris D.; Privman, Vladimir; Roy, Subhas C.

    1996-06-01

    We simulate ballistic particle deposition wherein a large number of spherical particles are "cast" vertically over a planar horizontal surface. Upon first contact (with the surface or with a previously deposited particle) each particle stops. This model helps material scientists to study the adsorption and sediment formation. The model is sequential, with particles deposited one by one. We have found an equivalent formulation using a continuous time random process and we simulate the latter in parallel using a method similar to the one previously employed for simulating Ising spins. We augment the parallel algorithm for simulating Ising spins with several techniques aimed at the increase of efficiency of producing the particle configuration and statistics collection. Some of these techniques are similar to earlier ones. We implement the resulting algorithm on a 16K PE MasPar MP-1 and a 4K PE MasPar MP-2. The parallel code runs on MasPar computers nearly two orders of magnitude faster than an optimized sequential code runs on a fast workstation.
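
    The sequential model itself is compact. A minimal two-dimensional sketch (disks dropped over a line, our own simplification of the spheres-over-a-plane model in the article) shows the one-particle-at-a-time deposition rule that the parallel reformulation removes:

      import random

      def deposit(n_particles, width, radius=0.5):
          """Drop disks vertically; each stops at its first contact (sequential model)."""
          placed = []                                    # (x, y) of settled disks
          for _ in range(n_particles):
              x = random.uniform(0, width)
              y = radius                                 # default: rest on the substrate
              for (px, py) in placed:
                  dx = abs(x - px)
                  if dx < 2 * radius:                    # could land on this disk
                      y = max(y, py + ((2 * radius) ** 2 - dx ** 2) ** 0.5)
              placed.append((x, y))
          return placed

      heights = [y for _, y in deposit(2000, width=100.0)]
      print("mean deposit height:", sum(heights) / len(heights))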

  16. Casting pearls ballistically: Efficient massively parallel simulation of particle deposition

    SciTech Connect

    Lubachevsky, B.D.; Privman, V.; Roy, S.C.

    1996-06-01

    We simulate ballistic particle deposition wherein a large number of spherical particles are "cast" vertically over a planar horizontal surface. Upon first contact (with the surface or with a previously deposited particle) each particle stops. This model helps material scientists to study the adsorption and sediment formation. The model is sequential, with particles deposited one by one. We have found an equivalent formulation using a continuous time random process and we simulate the latter in parallel using a method similar to the one previously employed for simulating Ising spins. We augment the parallel algorithm for simulating Ising spins with several techniques aimed at the increase of efficiency of producing the particle configuration and statistics collection. Some of these techniques are similar to earlier ones. We implement the resulting algorithm on a 16K PE MasPar MP-1 and a 4K PE MasPar MP-2. The parallel code runs on MasPar computers nearly two orders of magnitude faster than an optimized sequential code runs on a fast workstation. 17 refs., 9 figs.

  17. Photon detection with parallel asynchronous processing

    NASA Technical Reports Server (NTRS)

    Coon, D. D.; Perera, A. G. U.

    1990-01-01

    An approach to photon detection with a parallel asynchronous signal processor is described. The visible or IR photon-detection capability of the silicon p(+)-n-n(+) detectors and the parallel asynchronous processing are addressed separately. This approach would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the devices would form a 2D array processor with a 2D array of inputs located directly behind a focal-plane detector array. A 2D image data stream would propagate in neuronlike asynchronous pulse-coded form through the laminar processor. Such systems can integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The possibility of multispectral image processing is addressed.

  18. Parallel and Serial Processes in Visual Search

    ERIC Educational Resources Information Center

    Thornton, Thomas L.; Gilden, David L.

    2007-01-01

    A long-standing issue in the study of how people acquire visual information centers around the scheduling and deployment of attentional resources: Is the process serial, or is it parallel? A substantial empirical effort has been dedicated to resolving this issue. However, the results remain largely inconclusive because the methodologies that have…

  19. Efficient wiring of reconfigurable parallel processors

    SciTech Connect

    Greenberg, D.S.

    1992-08-28

    The advent of chips which include one or more CPUs, some local memory, and rudimentary communications and routing hardware has opened a new area in computer architecture design. What is the best way to connect these chips to solve particular problems? This paper defines the efficiency of a wiring scheme for a set of communication patterns. It then gives upper and lower bounds on the best efficiency achievable. It also presents simple wiring schemes for some stencil patterns used in mesh-based discrete simulations.

  20. Efficient wiring of reconfigurable parallel processors

    SciTech Connect

    Greenberg, D.S.

    1992-08-28

    The advent of chips which include one or more CPUs, some local memory, and rudimentary communications and routing hardware has opened a new area in computer architecture design. What is the best way to connect these chips to solve particular problems? This paper defines the efficiency of a wiring scheme for a set of communication patterns. It then gives upper and lower bounds on the best efficiency achievable. It also presents simple wiring schemes for some stencil patterns used in mesh-based discrete simulations.

  1. Hypercluster parallel processing library user's manual

    NASA Technical Reports Server (NTRS)

    Quealy, Angela

    1990-01-01

    This User's Manual describes the Hypercluster Parallel Processing Library, composed of FORTRAN-callable subroutines which enable a FORTRAN programmer to manipulate and transfer information throughout the Hypercluster at NASA Lewis Research Center. Each subroutine and its parameters are described in detail. A simple heat flow application using Laplace's equation is included to demonstrate the use of some of the library's subroutines. The manual can be used initially as an introduction to the parallel features provided by the library. Thereafter it can be used as a reference when programming an application.
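
    The manual's demonstration problem is a heat-flow solution of Laplace's equation; a serial Jacobi-iteration sketch of that kind of example is shown below (plain Python rather than the manual's FORTRAN, and with our own boundary values). In the Hypercluster version each node would own a block of rows and exchange boundary rows with its neighbours between sweeps.

      import numpy as np

      def laplace_jacobi(nx=64, ny=64, iters=500):
          """Steady-state temperature on a plate with fixed boundary values."""
          t = np.zeros((nx, ny))
          t[0, :] = 100.0                      # hot top edge; other edges held at 0
          for _ in range(iters):
              t[1:-1, 1:-1] = 0.25 * (t[:-2, 1:-1] + t[2:, 1:-1] +
                                      t[1:-1, :-2] + t[1:-1, 2:])
          return t

      print(laplace_jacobi()[32, 32])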

  2. Efficiency Evaluation of Cray XT Parallel IO Stack

    SciTech Connect

    Yu, Weikuan; Oral, H Sarp; Vetter, Jeffrey S; Barrett, Richard F

    2007-01-01

    PetaScale computing platforms need to be coupled with efficient IO subsystems that can deliver commensurate IO throughput to scientific applications. In order to gain insights into the deliverable IO efficiency on the Cray XT platform at ORNL, this paper presents an in-depth efficiency evaluation of its parallel IO software stack. Our evaluation covers the performance of a variety of parallel IO interfaces, including POSIX IO, MPI-IO, and HDF5. Moreover, we describe several tuning parameters for these interfaces and present their effectiveness in enhancing the IO efficiency.

  3. Parallel heterogeneous architectures for efficient OMP compressive sensing reconstruction

    NASA Astrophysics Data System (ADS)

    Kulkarni, Amey; Stanislaus, Jerome L.; Mohsenin, Tinoosh

    2014-05-01

    Compressive Sensing (CS) is a novel scheme in which a signal that is sparse in a known transform domain can be reconstructed using fewer samples. The signal reconstruction techniques are computationally intensive and slow, which makes them impractical for real-time processing applications. This paper presents novel architectures for the Orthogonal Matching Pursuit (OMP) algorithm, one of the popular CS reconstruction algorithms. We show implementation results for the proposed architectures on an FPGA, an ASIC, and a custom many-core platform. For the FPGA and ASIC implementations, a novel thresholding method reduces the processing time of the optimization problem by at least 25%. For the custom many-core platform, efficient parallelization techniques are applied to reconstruct signals with varying signal lengths N and sparsity m. The algorithm is divided into three kernels; each kernel is parallelized to reduce execution time, while efficient reuse of the matrix operators reduces area. Matrix operations are parallelized efficiently by taking advantage of blocked algorithms. For demonstration purposes, all architectures reconstruct a 256-length signal with maximum sparsity of 8 using 64 measurements. The implementation on a Xilinx Virtex-5 FPGA requires 27.14 μs to reconstruct the signal using basic OMP, and 18 μs with the thresholding method. The ASIC implementation reconstructs the signal in 13 μs, while our custom many-core, operating at 1.18 GHz, takes 18.28 μs. Our results show that, compared with previously published work on the same algorithm and matrix size, the proposed FPGA and ASIC architectures are 1.3x and 1.8x faster, respectively. The proposed many-core implementation is also 3000x faster than the CPU and 2000x faster than the GPU.
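
    For reference, the serial core that these architectures accelerate is itself short. A minimal NumPy sketch of Orthogonal Matching Pursuit (our own illustrative version, not the paper's FPGA/ASIC or many-core design) recovers a sparse length-256 signal from 64 measurements, matching the demonstration sizes quoted above:

      import numpy as np

      def omp(A, y, sparsity):
          """Greedy OMP: pick the most correlated atom, re-fit by least squares, repeat."""
          residual, support = y.copy(), []
          x = np.zeros(A.shape[1])
          for _ in range(sparsity):
              support.append(int(np.argmax(np.abs(A.T @ residual))))
              coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
              residual = y - A[:, support] @ coeffs
          x[support] = coeffs
          return x

      rng = np.random.default_rng(0)
      A = rng.standard_normal((64, 256)) / 8.0              # 64 measurements, N = 256
      x_true = np.zeros(256)
      x_true[rng.choice(256, 8, replace=False)] = 1.0       # sparsity m = 8
      x_hat = omp(A, A @ x_true, 8)
      print("max recovery error:", float(np.abs(x_hat - x_true).max()))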

  4. Parallel processing for computer vision and display

    SciTech Connect

    Dew, P.M. . Dept. of Computer Studies); Earnshaw, R.A. ); Heywood, T.R. )

    1989-01-01

    The widespread availability of high performance computers has led to an increased awareness of the importance of visualization techniques, particularly in engineering and science. However, many visualization tasks involve processing large amounts of data or manipulating complex computer models of 3D objects. For example, in the field of computer aided engineering it is often necessary to display an edited solid object (see Plate 1), which can take many minutes even on the fastest serial processors. Another example of a computationally intensive problem, this time from computer vision, is the recognition of objects in a 3D scene from a stereo image pair. To perform visualization tasks of this type in real and reasonable time it is necessary to exploit the advances in parallel processing that have taken place over the last decade. This book uniquely provides a collection of papers from leading visualization researchers with a common interest in the application and exploitation of parallel processing techniques.

  5. High-speed parallel-processing networks for advanced architectures

    SciTech Connect

    Morgan, D.R.

    1988-06-01

    This paper describes various parallel-processing architecture networks that are candidates for eventual airborne use. An attempt at projecting which type of network is suitable or optimum for specific metafunction or stand-alone applications is made. However, specific algorithms will need to be developed and benchmarks executed before firm conclusions can be drawn. Also, a conceptual projection of how these processors can be built in small, flyable units through the use of wafer-scale integration is offered. The use of the PAVE PILLAR system architecture to provide system level support for these tightly coupled networks is described. The author concludes that: (1) extremely high processing speeds implemented in flyable hardware are possible through parallel-processing networks if development programs are pursued; (2) dramatic speed enhancements through parallel processing require an excellent match between the algorithm and computer-network architecture; (3) matching several high speed parallel oriented algorithms across the aircraft system to a limited set of hardware modules may be the most cost-effective approach to achieving speed enhancements; and (4) software-development tools and improved operating systems will need to be developed to support efficient parallel-processor use.

  6. A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.; Markos, A. T.

    1975-01-01

    A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.

  7. Efficient sequential and parallel algorithms for record linkage

    PubMed Central

    Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

    2014-01-01

    Background and objective: Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods: Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results: Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions: We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837
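
    Two of the ideas named here, linking similar records into a graph and clustering them by connected components, can be shown in a few lines. The toy sketch below uses invented records, a difflib similarity ratio standing in for the edit-distance test, and a small union-find; it illustrates the structure of the approach rather than the authors' implementation:

      from difflib import SequenceMatcher

      def similar(a, b, threshold=0.85):
          return SequenceMatcher(None, a, b).ratio() >= threshold

      def clusters(records, threshold=0.85):
          parent = list(range(len(records)))
          def find(i):
              while parent[i] != i:
                  parent[i] = parent[parent[i]]          # path halving
                  i = parent[i]
              return i
          for i in range(len(records)):                   # add an edge per similar pair
              for j in range(i + 1, len(records)):
                  if similar(records[i], records[j], threshold):
                      parent[find(i)] = find(j)
          groups = {}
          for i in range(len(records)):                   # connected components
              groups.setdefault(find(i), []).append(records[i])
          return list(groups.values())

      print(clusters(["john a smith", "john a. smith", "jane doe", "jon a smith"]))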

  8. An efficient parallel algorithm for accelerating computational protein design

    PubMed Central

    Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang

    2014-01-01

    Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991

  9. An Expert System for the Development of Efficient Parallel Code

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.

  10. A multiarchitecture parallel-processing development environment

    NASA Technical Reports Server (NTRS)

    Townsend, Scott; Blech, Richard; Cole, Gary

    1993-01-01

    A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems.

  11. Oxytocin: parallel processing in the social brain?

    PubMed

    Dölen, Gül

    2015-06-01

    Early studies attempting to disentangle the network complexity of the brain exploited the accessibility of sensory receptive fields to reveal circuits made up of synapses connected both in series and in parallel. More recently, extension of this organisational principle beyond the sensory systems has been made possible by the advent of modern molecular, viral and optogenetic approaches. Here, evidence supporting parallel processing of social behaviours mediated by oxytocin is reviewed. Understanding oxytocinergic signalling from this perspective has significant implications for the design of oxytocin-based therapeutic interventions aimed at disorders such as autism, where disrupted social function is a core clinical feature. Moreover, identification of opportunities for novel technology development will require a better appreciation of the complexity of the circuit-level organisation of the social brain. PMID:25912257

  12. Efficient Multidisciplinary Analysis Procedure Using Multi-Level Parallelization Approach

    NASA Technical Reports Server (NTRS)

    Byun, Chansup; Hatay, Ferhat; Farhangnia, Mehrdad; Guruswamy, Guru; VanDalsem, William R. (Technical Monitor)

    1997-01-01

    Multidisciplinary applications are suitable for parallel computing environment by adopting the domain decomposition method. Immediately, a multidisciplinary application can be parallelized by solving each discipline separately. In order to perform coupled multidisciplinary analysis, coupling of each discipline can be accomplished by exchanging boundary data at the interfaces. This is regarded as discipline-level parallelization. Next level could be a "coarse-grain" parallelization of each discipline, which mainly depends on the physical geometry and nature of each discipline. For example, it is almost impossible for structured-grid based computational fluid dynamics codes to do flow analysis of an aircraft by using a single grid because of the complexity of its configuration. Thus, multi-block grid is commonly used to describe the details of complex geometry. Similarly, in structural analysis, the structure is frequently subdivided into substructures. Thus, the computation of each subdomain can be easily parallelized since each subdomain is solved separately independent of other domains. The parallelization is accomplished by solving each subdomain separately on a separate processor and exchanging the boundary conditions at domain interfaces periodically. However, the physical decomposition of the domain introduces explicit boundary conditions at the domain interfaces. This is not desirable for critical areas such as those containing shock waves or flow separations. Thus, a "fine-grain" parallelization is introduced to overcome this problem. The "fine-grain" parallelization is one that solves exactly the same system of equations of a subdomain by using more than one processors without introducing any explicit boundary conditions. An efficient multidisciplinary analysis procedure can be accomplished by successfully combining the above multi-level parallelism. A multidisciplinary analysis code, ENSAERO developed at NASA Ames Research Center is used in this study to

  13. Parallel processing in the mammalian retina.

    PubMed

    Wässle, Heinz

    2004-10-01

    Our eyes send different 'images' of the outside world to the brain - an image of contours (line drawing), a colour image (watercolour painting) or an image of moving objects (movie). This is commonly referred to as parallel processing, and starts as early as the first synapse of the retina, the cone pedicle. Here, the molecular composition of the transmitter receptors of the postsynaptic neurons defines which images are transferred to the inner retina. Within the second synaptic layer - the inner plexiform layer - circuits that involve complex inhibitory and excitatory interactions represent filters that select 'what the eye tells the brain'. PMID:15378035

  14. Memory Scalability and Efficiency Analysis of Parallel Codes

    SciTech Connect

    Janjusic, Tommy; Kartsaklis, Christos

    2015-01-01

    Memory scalability is an enduring problem and bottleneck that plagues many parallel codes. Parallel codes designed for High Performance Systems are typically designed over the span of several, and in some instances 10+, years. As a result, optimization practices which were appropriate for earlier systems may no longer be valid and thus require careful optimization consideration. Specifically, parallel codes whose memory footprint is a function of their scalability must be carefully considered for future exa-scale systems. In this paper we present a methodology and tool to study the memory scalability of parallel codes. Using our methodology we evaluate an application's memory footprint as a function of scalability, which we coined memory efficiency, and describe our results. In particular, using our in-house tools we can pinpoint the specific application components which contribute to the application's overall memory footprint (application data structures, libraries, etc.).

  15. Parallel-Processing Algorithms For Dynamics Of Manipulators

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1991-01-01

    Class of parallel and parallel/pipeline algorithms presented for more efficient computation of manipulator inertia matrix. Essential for implementing advanced dynamic control schemes as well as dynamic simulation of manipulator motion.

  16. Parallel Processing Strategies of the Primate Visual System

    PubMed Central

    Nassi, Jonathan J.; Callaway, Edward M.

    2009-01-01

    Incoming sensory information is sent to the brain along modality-specific channels corresponding to the five senses. Each of these channels further parses the incoming signals into parallel streams to provide a compact, efficient input to the brain. Ultimately, these parallel input signals must be elaborated upon and integrated within the cortex to provide a unified and coherent percept. Recent studies in the primate visual cortex have greatly contributed to our understanding of how this goal is accomplished. Multiple strategies including retinal tiling, hierarchical and parallel processing and modularity, defined spatially and by cell type-specific connectivity, are all used by the visual system to recover the rich detail of our visual surroundings. PMID:19352403

  17. Parallel processing for digital picture comparison

    NASA Technical Reports Server (NTRS)

    Cheng, H. D.; Kou, L. T.

    1987-01-01

    In picture processing an important problem is to identify two digital pictures of the same scene taken under different lighting conditions. This kind of problem can be found in remote sensing, satellite signal processing and the related areas. The identification can be done by transforming the gray levels so that the gray level histograms of the two pictures are closely matched. The transformation problem can be solved by using the packing method. Researchers propose a VLSI architecture consisting of m x n processing elements with extensive parallel and pipelining computation capabilities to speed up the transformation with the time complexity O(max(m,n)), where m and n are the numbers of the gray levels of the input picture and the reference picture respectively. If using a uniprocessor and a dynamic programming algorithm, the time complexity will be O(m^3 x n). The algorithm partition problem, as an important issue in VLSI design, is discussed. Verification of the proposed architecture is also given.
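
    The transformation itself, remapping gray levels so that the input histogram closely matches the reference histogram, can be written directly from the two cumulative distributions. The sketch below is a standard serial formulation in NumPy with synthetic images (our own illustration, not the paper's packing method or VLSI array, which computes the same mapping in O(max(m,n)) time):

      import numpy as np

      def match_histogram(image, reference, levels=256):
          """Map gray levels of `image` so its histogram approximates `reference`'s."""
          src_cdf = np.cumsum(np.bincount(image.ravel(), minlength=levels))
          ref_cdf = np.cumsum(np.bincount(reference.ravel(), minlength=levels))
          src_cdf = src_cdf / src_cdf[-1]
          ref_cdf = ref_cdf / ref_cdf[-1]
          # For each source level, pick the reference level with the nearest CDF value.
          mapping = np.searchsorted(ref_cdf, src_cdf).clip(0, levels - 1)
          return mapping[image].astype(np.uint8)

      rng = np.random.default_rng(1)
      dark = (rng.random((256, 256)) ** 2 * 255).astype(np.uint8)    # skewed scene
      bright = (rng.random((256, 256)) * 255).astype(np.uint8)       # reference scene
      print(match_histogram(dark, bright).mean(), bright.mean())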

  18. Parallel tools in HEVC for high-throughput processing

    NASA Astrophysics Data System (ADS)

    Zhou, Minhua; Sze, Vivienne; Budagavi, Madhukar

    2012-10-01

    HEVC (High Efficiency Video Coding) is the next-generation video coding standard being jointly developed by the ITU-T VCEG and ISO/IEC MPEG JCT-VC team. In addition to the high coding efficiency, which is expected to provide 50% more bit-rate reduction when compared to H.264/AVC, HEVC has built-in parallel processing tools to address bitrate, pixel-rate and motion estimation (ME) throughput requirements. This paper describes how CABAC, which is also used in H.264/AVC, has been redesigned for improved throughput, and how parallel merge/skip and tiles, which are new tools introduced for HEVC, enable high-throughput processing. CABAC has data dependencies which make it difficult to parallelize and thus limit its throughput. The prediction error/residual, represented as quantized transform coefficients, accounts for the majority of the CABAC workload. Various improvements have been made to the context selection and scans in transform coefficient coding that enable CABAC in HEVC to potentially achieve higher throughput and increased coding gains relative to H.264/AVC. The merge/skip mode is a coding efficiency enhancement tool in HEVC; the parallel merge/skip breaks dependency between the regular and merge/skip ME, which provides flexibility for high throughput and high efficiency HEVC encoder designs. For ultra high definition (UHD) video, such as 4kx2k and 8kx4k resolutions, low-latency and real-time processing may be beyond the capability of a single core codec. Tiles are an effective tool which enables pixel-rate balancing among the cores to achieve parallel processing with a throughput scalable implementation of multi-core UHD video codec. With the evenly divided tiles, a multi-core video codec can be realized by simply replicating single core codec and adding a tile boundary processing core on top of that. These tools illustrate that accounting for implementation cost when designing video coding algorithms can enable higher processing speed and reduce

  19. Parallel-processing a large scientific problem

    SciTech Connect

    Hiromoto, R.

    1982-01-01

    The author discusses a parallel-processing experiment that uses a particle-in-cell (PIC) code to study the feasibility of doing large-scale scientific calculations on multiple-processor architectures. A multithread version of this Los Alamos PIC code was successfully implemented and timed on a Univac system 1100/80 computer. Use of a single copy of the instruction stream, and common memory to hold data, eliminated data transmission between processors. The multiple-processing algorithm exploits the PIC code's high degree of large, independent tasks, as well as the configuration of the Univac system 1100/80. Timing results for the multithread version of the PIC code using one, two, three, and four identical processors are given and are shown to have promising speedup times when compared to the overall run times measured for a single-thread version of the PIC code. 4 references.

  20. Parallel processing a large scientific problem

    SciTech Connect

    Hiromoto, R.

    1982-01-01

    A parallel-processing experiment is discussed that uses a particle-in-cell (PIC) code to study the feasibility of doing large-scale scientific calculations on multiple-processor architectures. A multithread version of this Los Alamos PIC code was successfully implemented and timed on a UNIVAC System 1100/80 computer. Use of a single copy of the instruction stream, and common memory to hold data, eliminated data transmission between processors. The multiple-processing algorithm exploits the PIC code's high degree of large, independent tasks, as well as the configuration of the UNIVAC System 1100/80. Timing results for the multithread version of the PIC code using one, two, three, and four identical processors are given and are shown to have promising speedup times when compared to the overall run times measured for a single-thread version of the PIC code.

  1. Parallel digital signal processing architectures for image processing

    NASA Astrophysics Data System (ADS)

    Kshirsagar, Shirish P.; Hartley, David A.; Harvey, David M.; Hobson, Clifford A.

    1994-10-01

    This paper describes research into a high speed image processing system using parallel digital signal processors for the processing of electro-optic images. The objective of the system is to reduce the processing time of non-contact type inspection problems including industrial and medical applications. A single processor cannot deliver the processing power required for such applications; hence, a MIMD system is designed and constructed to enable fast processing of electro-optic images. The Texas Instruments TMS320C40 digital signal processor is used due to its high speed floating point CPU and the support for the parallel processing environment. A custom designed VISION bus is provided to transfer images between processors. The system is being applied for solder joint inspection of high technology printed circuit boards.

  2. Organizing Compression of Hyperspectral Imagery to Allow Efficient Parallel Decompression

    NASA Technical Reports Server (NTRS)

    Klimesh, Matthew A.; Kiely, Aaron B.

    2014-01-01

    A family of schemes has been devised for organizing the output of an algorithm for predictive data compression of hyperspectral imagery so as to allow efficient parallelization in both the compressor and decompressor. In these schemes, the compressor performs a number of iterations, during each of which a portion of the data is compressed via parallel threads operating on independent portions of the data. The general idea is that for each iteration it is predetermined how much compressed data will be produced from each thread.
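
    The organizing idea, knowing in advance how much compressed output each thread produces so the decompressor can seek straight to its share, can be mimicked with a simple block scheme: compress fixed blocks independently and store their lengths up front. A toy sketch, with zlib blocks standing in for the predictive compressor described above:

      import zlib
      import numpy as np

      def compress_blocks(data, n_blocks):
          """Independently compressed blocks plus a length table for parallel decoding."""
          blocks = [zlib.compress(b.tobytes()) for b in np.array_split(data, n_blocks)]
          return [len(b) for b in blocks], b"".join(blocks)

      def decompress_block(lengths, payload, i):
          start = sum(lengths[:i])                 # each thread can seek to its own block
          raw = zlib.decompress(payload[start:start + lengths[i]])
          return np.frombuffer(raw, dtype=np.int16)

      cube = np.arange(4 * 100 * 100, dtype=np.int16)       # stand-in for image data
      lengths, payload = compress_blocks(cube, 4)
      print(all(np.array_equal(decompress_block(lengths, payload, i),
                               np.array_split(cube, 4)[i]) for i in range(4)))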

  3. Fault Tolerance and Parallel Processing for NGST

    NASA Astrophysics Data System (ADS)

    Sengupta, R.; Offenberg, J. D.; Fixsen, D. J.; Nieto-Santisteban, M. A.; Hanisch, R. J.; Stockman, H. S.; Mather, J. C.

    1999-12-01

    The Next Generation Space Telescope (NGST) Image Processing Group is developing scalable cosmic ray rejection and data compression algorithms for parallel processors as part of NASA's Remote Exploration and Experimentation (REE) Project. The primary intention of the REE project is to use commercial off-the-shelf (COTS) technology to develop scalable, low-power, fault tolerant, high performance computers in space. NGST is one of the applications selected to demonstrate the benefit of having on-board supercomputing power. Real-time cosmic ray rejection would enable us to reduce the downlink data volume by as much as two orders of magnitude by combining multiple read-outs on the spacecraft rather than downlinking them separately. The combined read-outs can be further reduced in size by applying lossy and/or lossless data compression algorithms. This work is funded by NASA's REE project, managed by JPL.

  4. A Parallel Processing Algorithm for Gravity Inversion

    NASA Astrophysics Data System (ADS)

    Frasheri, Neki; Bushati, Salvatore; Frasheri, Alfred

    2013-04-01

    The paper presents results of using MPI parallel processing for the 3D inversion of gravity anomalies. The work is done under the FP7 project HP-SEE (http://www.hp-see.eu/). The inversion of geophysical anomalies remains a challenge, and the use of parallel processing can be a tool to achieve better results, “compensating” the complexity of the ill-posed problem of inversion with the increase of volume of calculations. We considered gravity as the simplest case of physical fields and experimented with an algorithm based on the methodology known as CLEAN and developed by Högbom in 1974. The 3D geosection was discretized in finite cuboid elements and represented by a 3D array of nodes, while the ground surface where the anomaly is observed is represented as a 2D array of points. Starting from a geosection with mass density zero in all nodes, the algorithm iteratively selects the 3D node that offers the best anomaly shape approximating the observed anomaly, minimizing the least-squares error; the mass density in the best 3D node is modified by a prefixed density step and the related effect subtracted from the observed anomaly; the process continues until some criterion is fulfilled. The theoretical complexity of the algorithm was evaluated on the basis of iterations and run-time for a geosection discretized at different scales. We considered the average number N of nodes in one edge of the 3D array. The order of the number of iterations was evaluated as O(N^3), and the order of run-time as O(N^8). We used several different methods for the identification of the 3D node whose effect offers the best least-squares error in approximating the observed anomaly: unweighted least-squares error for the whole 2D array of anomalous points; weighting the least-squares error by the inverted value of the observed anomaly over each 3D node; and limiting the area of 2D anomalous points where least squares are calculated over shallow 3D nodes. By comparing results from the inversion of single body and two
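
    A much reduced serial version of the described iteration (our own sketch with a point-mass forward model and made-up dimensions, not the HP-SEE code) makes the greedy structure explicit: at each step the cell whose unit response best reduces the least-squares misfit receives one density increment, its effect is subtracted from the residual anomaly, and the loop repeats until no cell improves the fit.

      import numpy as np

      def invert(observed, responses, step=0.1, iters=200):
          """Greedy CLEAN-like inversion: one density increment per iteration."""
          density = np.zeros(responses.shape[0])
          residual = observed.copy()
          for _ in range(iters):
              # Misfit after adding one density step to each candidate cell.
              misfit = ((residual[None, :] - step * responses) ** 2).sum(axis=1)
              best = int(np.argmin(misfit))
              if misfit[best] >= (residual ** 2).sum():
                  break                                   # no cell improves the fit
              density[best] += step
              residual -= step * responses[best]
          return density, residual

      # Toy setup: 20 surface points, 10 buried cells, point-mass vertical responses.
      xs = np.linspace(0.0, 10.0, 20)
      cells = [(x, z) for x in np.linspace(1.0, 9.0, 5) for z in (1.0, 2.0)]
      responses = np.array([[z / ((xo - x) ** 2 + z ** 2) ** 1.5 for xo in xs]
                            for (x, z) in cells])
      observed = 3.0 * responses[3] + 0.5 * responses[7]  # anomaly from two cells
      density, residual = invert(observed, responses)
      print(np.round(density, 2), float((residual ** 2).sum()))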

  5. Enjoying Sad Music: Paradox or Parallel Processes?

    PubMed

    Schubert, Emery

    2016-01-01

    Enjoyment of negative emotions in music is seen by many as a paradox. This article argues that the paradox exists because it is difficult to view the process that generates enjoyment as being part of the same system that also generates the subjective negative feeling. Compensation theories explain the paradox as the compensation of a negative emotion by the concomitant presence of one or more positive emotions. But compensation brings us no closer to explaining the paradox because it does not explain how experiencing sadness itself is enjoyed. The solution proposed is that an emotion is determined by three critical processes-labeled motivational action tendency (MAT), subjective feeling (SF) and Appraisal. For many emotions the MAT and SF processes are coupled in valence. For example, happiness has positive MAT and positive SF, annoyance has negative MAT and negative SF. However, it is argued that in an aesthetic context, such as listening to music, emotion processes can become decoupled. The decoupling is controlled by the Appraisal process, which can assess if the context of the sadness is real-life (where coupling occurs) or aesthetic (where decoupling can occur). In an aesthetic context sadness retains its negative SF but the aversive, negative MAT is inhibited, leaving sadness to still be experienced as a negatively valenced emotion, while contributing to the overall positive MAT. Individual differences, mood and previous experiences mediate the degree to which the aversive aspects of MAT are inhibited according to this Parallel Processing Hypothesis (PPH). The reason for hesitancy in considering or testing PPH, as well as the preponderance of research on sadness at the exclusion of other negative emotions, are discussed. PMID:27445752

  6. Enjoying Sad Music: Paradox or Parallel Processes?

    PubMed Central

    Schubert, Emery

    2016-01-01

    Enjoyment of negative emotions in music is seen by many as a paradox. This article argues that the paradox exists because it is difficult to view the process that generates enjoyment as being part of the same system that also generates the subjective negative feeling. Compensation theories explain the paradox as the compensation of a negative emotion by the concomitant presence of one or more positive emotions. But compensation brings us no closer to explaining the paradox because it does not explain how experiencing sadness itself is enjoyed. The solution proposed is that an emotion is determined by three critical processes—labeled motivational action tendency (MAT), subjective feeling (SF) and Appraisal. For many emotions the MAT and SF processes are coupled in valence. For example, happiness has positive MAT and positive SF, annoyance has negative MAT and negative SF. However, it is argued that in an aesthetic context, such as listening to music, emotion processes can become decoupled. The decoupling is controlled by the Appraisal process, which can assess if the context of the sadness is real-life (where coupling occurs) or aesthetic (where decoupling can occur). In an aesthetic context sadness retains its negative SF but the aversive, negative MAT is inhibited, leaving sadness to still be experienced as a negatively valenced emotion, while contributing to the overall positive MAT. Individual differences, mood and previous experiences mediate the degree to which the aversive aspects of MAT are inhibited according to this Parallel Processing Hypothesis (PPH). The reason for hesitancy in considering or testing PPH, as well as the preponderance of research on sadness at the exclusion of other negative emotions, are discussed. PMID:27445752

  7. Efficient parallel tempering for first-order phase transitions.

    PubMed

    Neuhaus, T; Magiera, M P; Hansmann, U H E

    2007-10-01

    We present a Monte Carlo algorithm that facilitates efficient parallel tempering simulations of the density of states g(E). We show that the algorithm eliminates the supercritical slowing down in the case of the Q=20 and Q=256 Potts models in two dimensions, typical examples for systems with extreme first-order phase transitions. As recently predicted, and shown here, the microcanonical heat capacity along the calorimetric curve has negative values for finite systems. PMID:17995052
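
    The replica-exchange step at the heart of parallel tempering is a one-line acceptance rule. The schematic sketch below shows a generic Metropolis swap between neighbouring inverse temperatures with invented numbers; it is not the authors' density-of-states algorithm:

      import math, random

      def try_swap(energies, betas, i):
          """Metropolis acceptance for exchanging replicas i and i + 1."""
          delta = (betas[i] - betas[i + 1]) * (energies[i] - energies[i + 1])
          if delta >= 0 or random.random() < math.exp(delta):
              energies[i], energies[i + 1] = energies[i + 1], energies[i]   # swap configurations
              return True
          return False

      betas = [1.0, 0.8, 0.6, 0.4]                 # hypothetical replica ladder
      energies = [-310.0, -295.0, -280.0, -260.0]
      print([try_swap(energies, betas, i) for i in range(3)])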

  8. Fully efficient time-parallelized quantum optimal control algorithm

    NASA Astrophysics Data System (ADS)

    Riahi, M. K.; Salomon, J.; Glaser, S. J.; Sugny, D.

    2016-04-01

    We present a time-parallelization method that enables one to accelerate the computation of quantum optimal control algorithms. We show that this approach is approximately fully efficient when based on a gradient method as optimization solver: the computational time is approximately divided by the number of available processors. The control of spin systems, molecular orientation, and Bose-Einstein condensates are used as illustrative examples to highlight the wide range of applications of this numerical scheme.

  9. Communication-efficient parallel-graph algorithms. Master's thesis

    SciTech Connect

    Maggs, B.M.

    1986-06-01

    Communication bandwidth is a resource ignored by most parallel random-access machine (PRAM) models. This thesis shows that many graph problems can be solved in parallel, not only with polylogarithmic performance, but with efficient communication at each step of the computation. The communication requirements of an algorithm are measured in a restricted PRAM model called the distributed random-access machine (DRAM), which can be viewed as an abstraction of volume-universal networks such as fat trees. In this model, communication cost is measured in terms of the congestion of memory accesses across cuts of the machine. It is demonstrated that the recursive doubling technique frequently used in PRAM algorithms is wasteful of communication resources, and that recursive pairing can be used to perform many of the same functions more efficiently. The prefix computation on linear lists is generalized to trees, and it is shown that these tree-fix computations, which can be performed in a communication-efficient fashion using a variant of the tree-contraction technique of Miller and Reif, simplify many parallel graph algorithms in the literature.
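
    To make the pairing idea concrete, the following serial sketch computes prefix sums by recursively combining adjacent pairs, halving the problem each round; it is an illustrative rendering of the pairing pattern, not code from the thesis, and the sample input is made up.

```python
def prefix_sums(xs):
    """Prefix sums by recursive pairing: combine adjacent pairs, recurse on the
    half-size problem, then expand the pair prefixes back to the full sequence."""
    n = len(xs)
    if n == 1:
        return xs[:]
    pairs = [xs[i] + xs[i + 1] for i in range(0, n - 1, 2)]
    if n % 2:                       # odd trailing element carries over unpaired
        pairs.append(xs[-1])
    pair_prefix = prefix_sums(pairs)
    out = []
    for i, x in enumerate(xs):
        if i % 2 == 0:              # even position: previous pair prefix plus own value
            out.append((pair_prefix[i // 2 - 1] if i >= 2 else 0) + x)
        else:                       # odd position: exactly the prefix of its pair
            out.append(pair_prefix[i // 2])
    return out

print(prefix_sums([3, 1, 4, 1, 5, 9, 2, 6]))   # [3, 4, 8, 9, 14, 23, 25, 31]
```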

  10. Development and Applications of a Modular Parallel Process for Large Scale Fluid/Structures Problems

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Kwak, Dochan (Technical Monitor)

    2002-01-01

    A modular process that can efficiently solve large scale multidisciplinary problems using massively parallel supercomputers is presented. The process integrates disciplines with diverse physical characteristics by retaining the efficiency of individual disciplines. Computational domain independence of individual disciplines is maintained using a meta programming approach. The process integrates disciplines without affecting the combined performance. Results are demonstrated for large scale aerospace problems on several supercomputers. The super scalability and portability of the approach is demonstrated on several parallel computers.

  11. Development and Applications of a Modular Parallel Process for Large Scale Fluid/Structures Problems

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Byun, Chansup; Kwak, Dochan (Technical Monitor)

    2001-01-01

    A modular process that can efficiently solve large scale multidisciplinary problems using massively parallel super computers is presented. The process integrates disciplines with diverse physical characteristics by retaining the efficiency of individual disciplines. Computational domain independence of individual disciplines is maintained using a meta programming approach. The process integrates disciplines without affecting the combined performance. Results are demonstrated for large scale aerospace problems on several supercomputers. The super scalability and portability of the approach is demonstrated on several parallel computers.

  12. Filtering versus parallel processing in RSVP tasks.

    PubMed

    Botella, J; Eriksen, C W

    1992-04-01

    An experiment of McLean, D. E. Broadbent, and M. H. P. Broadbent (1983) using rapid serial visual presentation (RSVP) was replicated. A series of letters in one of 5 colors was presented, and the subject was asked to identify the letter that appeared in a designated color. There were several innovations in our procedure, the most important of which was the use of a response menu. After each trial, the subject was presented with 7 candidate letters from which to choose his/her response. In three experimental conditions, the target, the letter following the target, and all letters other than the target were, respectively, eliminated from the menu. In other conditions, the stimulus list was manipulated by repeating items in the series, repeating the color of successive items, or even eliminating the target color. By means of these manipulations, we were able to determine more precisely the information that subjects had obtained from the presentation of the stimulus series. Although we replicated the results of McLean et al. (1983), the more extensive information that our procedure produced was incompatible with the serial filter model that McLean et al. had used to describe their data. Overall, our results were more compatible with a parallel-processing account. Furthermore, intrusion errors are apparently not only a perceptual phenomenon but a memory problem as well. PMID:1603647

  13. An intelligent allocation algorithm for parallel processing

    NASA Technical Reports Server (NTRS)

    Carroll, Chester C.; Homaifar, Abdollah; Ananthram, Kishan G.

    1988-01-01

    The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of the interprocessor communication network, influence the allocation. To achieve realistic estimations of the execution durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication durations varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network.

  14. Fault tolerant massively parallel processing architecture

    SciTech Connect

    Balasubramanian, V.; Banerjee, P.

    1987-08-01

    This paper presents two massively parallel processing architectures suitable for solving a wide variety of algorithms of divide-and-conquer type for problems such as the discrete Fourier transform, production systems, design automation, and others. The first architecture, called the Chain-structured Butterfly ARchitecture (CBAR), consists of a two-dimensional array of N = L·(log₂(L)+1) processing elements (PE) organized as L levels of log₂(L)+1 stages, and which has the butterfly connection between PEs in consecutive stages with straight-through feedback between PEs in the last and first stages. This connection system has the desirable property of allowing thousands of PEs to be connected with O(N) connection cost, O(log₂(N/log₂N)) communication paths, and a small number (=4) of I/O ports per PE. However, this architecture is not fault tolerant. The authors, therefore, propose a second architecture, called the REconfigurable Chain-structured Butterfly ARchitecture (RECBAR), which is a modified version of the CBAR. The RECBAR possesses all the desirable features of the CBAR, with the number of I/O ports per PE increased to six, and uses O((log₂N)/N) overhead in PEs and approximately 50% overhead in links to achieve single-level fault tolerance. Reliability improvements of the RECBAR over the CBAR are studied. This paper also presents a distributed diagnostic and structuring algorithm for the RECBAR that enables the architecture to detect faults and structure itself accordingly within 2·log₂(L)+1 time steps, thus making it a truly fault tolerant architecture.

  15. Time efficient 3-D electromagnetic modeling on massively parallel computers

    SciTech Connect

    Alumbaugh, D.L.; Newman, G.A.

    1995-08-01

    A numerical modeling algorithm has been developed to simulate the electromagnetic response of a three dimensional earth to a dipole source for frequencies ranging from 100 Hz to 100 MHz. The numerical problem is formulated in terms of a frequency-domain modified vector Helmholtz equation for the scattered electric fields. The resulting differential equation is approximated using a staggered finite difference grid which results in a linear system of equations for which the matrix is sparse and complex symmetric. The system of equations is solved using a preconditioned quasi-minimum-residual method. Dirichlet boundary conditions are employed at the edges of the mesh by setting the tangential electric fields equal to zero. At frequencies less than 1 MHz, normal grid stretching is employed to mitigate unwanted reflections off the grid boundaries. For frequencies greater than this, absorbing boundary conditions must be employed by making the stretching parameters of the modified vector Helmholtz equation complex, which introduces loss at the boundaries. To allow for faster calculation of realistic models, the original serial version of the code has been modified to run on a massively parallel architecture. This modification involves three distinct tasks: (1) mapping the finite difference stencil to a processor stencil which allows for the necessary information to be exchanged between processors that contain adjacent nodes in the model, (2) determining the most efficient method to input the model, which is accomplished by dividing the input into "global" and "local" data and then reading the two sets in differently, and (3) deciding how to output the data, which is an inherently nonparallel process.

  16. Partitioning And Packing Equations For Parallel Processing

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.; Milner, Edward J.

    1989-01-01

    Algorithm developed to identify parallelism in set of coupled ordinary differential equations that describe physical system and to divide set into parallel computational paths, along which parts of solution proceed independently of others during at least part of time. Path-identifying algorithm creates number of paths consisting of equations that must be computed serially and table that gives dependent and independent arguments and "can start," "can end," and "must end" times of each equation. "Must end" time used subsequently by packing algorithm.
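
    The "can start" and "must end" times correspond to the earliest-start and latest-finish times of classical critical-path analysis. The sketch below computes them for a small, hypothetical equation graph; it illustrates the bookkeeping only and is not the NASA path-identifying algorithm itself.

```python
# Hypothetical equation graph: name -> (duration, prerequisite equations).
graph = {
    "eq1": (2, []),
    "eq2": (3, []),
    "eq3": (1, ["eq1"]),
    "eq4": (4, ["eq1", "eq2"]),
    "eq5": (2, ["eq3", "eq4"]),
}

# Forward pass: earliest ("can start") time of each equation.
# The dict is listed in topological order, so a single pass suffices.
earliest = {}
for node, (dur, preds) in graph.items():
    earliest[node] = max((earliest[p] + graph[p][0] for p in preds), default=0)

makespan = max(earliest[n] + graph[n][0] for n in graph)

# Backward pass: latest ("must end") time that still allows the overall makespan.
latest_end = {n: makespan for n in graph}
for node in reversed(list(graph)):
    for p in graph[node][1]:
        latest_end[p] = min(latest_end[p], latest_end[node] - graph[node][0])

for n in graph:
    print(n, "can start at", earliest[n], "and must end by", latest_end[n])
```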

  17. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  18. Parallel Processing with Digital Signal Processing Hardware and Software

    NASA Technical Reports Server (NTRS)

    Swenson, Cory V.

    1995-01-01

    The assembling and testing of a parallel processing system is described which will allow a user to move a Digital Signal Processing (DSP) application from the design stage to the execution/analysis stage through the use of several software tools and hardware devices. The system will be used to demonstrate the feasibility of the Algorithm To Architecture Mapping Model (ATAMM) dataflow paradigm for static multiprocessor solutions of DSP applications. The individual components comprising the system are described followed by the installation procedure, research topics, and initial program development.

  19. Efficiently modeling neural networks on massively parallel computers

    NASA Technical Reports Server (NTRS)

    Farber, Robert M.

    1993-01-01

    Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors))). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed-forward neural networks although this method is extendable to recurrent networks.
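
    The single communication step mentioned, a global summation, is typically organized as a pairwise tree reduction whose depth grows logarithmically with the number of processors. The sketch below simulates that pattern serially; the partial sums are made-up stand-ins for per-processor results.

```python
def tree_sum(values):
    """Pairwise (tree) reduction: the number of combining rounds is ceil(log2(n))."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:            # an odd element carries over to the next round
            paired.append(vals[-1])
        vals = paired
        rounds += 1
    return vals[0], rounds

partial_sums = [p * 0.5 for p in range(64)]   # one partial sum per simulated processor
total, depth = tree_sum(partial_sums)
print(total, "computed in", depth, "combining rounds")   # 64 values -> 6 rounds
```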

  20. Communication Improvement for the LU NAS Parallel Benchmark: A Model for Efficient Parallel Relaxation Schemes

    NASA Technical Reports Server (NTRS)

    Yarrow, Maurice; VanderWijngaart, Rob; Kutler, Paul (Technical Monitor)

    1997-01-01

    The first release of the MPI version of the LU NAS Parallel Benchmark (NPB2.0) performed poorly compared to its companion NPB2.0 codes. The later LU release (NPB2.1 & 2.2) runs up to two and a half times faster, thanks to a revised point access scheme and related communications scheme. The new scheme sends substantially fewer messages, is cache "friendly," and has a better load balance. We detail the observations and modifications that resulted in this efficiency improvement, and show that the poor behavior of the original code resulted from deriving a message passing scheme from an algorithm originally devised for a vector architecture.

  1. The efficient parallel iterative solution of large sparse linear systems

    SciTech Connect

    Jones, M.T.; Plassmann, P.E.

    1992-06-01

    The development of efficient, general-purpose software for the iterative solution of sparse linear systems on a parallel MIMD computer requires an interesting combination of expertise. Parallel graph heuristics, convergence analysis, and basic linear algebra implementation issues must all be considered. In this paper, we discuss how we have incorporated recent results in these areas into a general-purpose iterative solver. First, we consider two recently developed parallel graph coloring heuristics. We show how the method proposed by Luby, based on determining maximal independent sets, can be modified to run in an asynchronous manner and give an expected running time bound for this modified heuristic. In addition, a number of graph reduction heuristics are described that are used in our implementation to improve the individual processor performance. The effect of these various graph reductions on the solution of sparse triangular systems is categorized. Finally, we discuss the performance of this solver from the perspective of two large-scale applications: a piezoelectric crystal finite-element modeling problem, and a nonlinear optimization problem to determine the minimum energy configuration of a three-dimensional, layered superconductor model.
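
    The maximal-independent-set idea attributed to Luby can be sketched as repeated rounds of random priorities: a vertex joins the set when its priority beats all of its remaining neighbours. The serial sketch below (with a hypothetical five-vertex graph) illustrates the selection rule only, not the asynchronous variant or the coloring heuristic built on top of it.

```python
import random

def luby_mis(adj):
    """Maximal independent set by repeated random-priority selection (Luby's idea),
    simulated serially; in a parallel setting each round runs on all vertices at once."""
    remaining = set(adj)
    mis = set()
    while remaining:
        priority = {v: random.random() for v in remaining}
        # A vertex wins the round if its priority beats every remaining neighbour.
        winners = {v for v in remaining
                   if all(priority[v] > priority[u] for u in adj[v] if u in remaining)}
        mis |= winners
        # Winners and their neighbours drop out before the next round.
        remaining -= winners | {u for v in winners for u in adj[v]}
    return mis

# Hypothetical undirected graph as an adjacency list.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
print(luby_mis(adj))
```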

  2. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAP) have been in day to day use for ten years and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.

  3. Parallel Processing in Visual Search Asymmetry

    ERIC Educational Resources Information Center

    Dosher, Barbara Anne; Han, Songmei; Lu, Zhong-Lin

    2004-01-01

    The difficulty of visual search may depend on assignment of the same visual elements as targets and distractors-search asymmetry. Easy C-in-O searches and difficult O-in-C searches are often associated with parallel and serial search, respectively. Here, the time course of visual search was measured for both tasks with speed-accuracy methods. The…

  4. Functional & para-functional parallel processing

    SciTech Connect

    Not Available

    1994-11-01

    For years (about 20, in fact) dataflow researchers have argued for the use of dataflow (a subset of functional) languages for parallel computing, resting their proof on the ability to construct large-scale dataflow machines to realize the inherent parallelism in Functional programs. Unfortunately, such machines have never materialized as commercial products - instead, the market shows a vast variety of parallel multiprocessors that require special skills to program. It may be the case that these machines reflect a wrong direction in computer architecture design, and it may be the case that dataflow machines are the right way to go, but the proof is in the pudding, and thus far there does not exist even a prototype dataflow machine that can prove the "dataflow thesis." Under the circumstances it would seem rather foolhardy simply to ignore the commercial parallel machines that are available now, regardless of one's favorite programming methodology or concurrency model. It has been the authors' thesis that one can in fact use such machines effectively, while maintaining the concomitant thesis that functional programming is good for parallel computation. During the last two years the author has made considerable progress to support this two-fold thesis, and is now prepared to extend this work in several ways. The authors' particular interest, and presumably the primary interest to DOE, is to concentrate the work in the area of scientific computing, including functional language features, program development tools, and systems support tailored for scientific computing applications. The authors' desire to do this reflects confidence that this approach really will work for scientific computing - the author has spent two years proving the viability of the ideas, and now it's time to put them into action.

  5. Extraction of Hydrological Proximity Measures from DEMs using Parallel Processing

    SciTech Connect

    Tesfa, Teklu K.; Tarboton, David G.; Watson, Daniel W.; Schreuders, Kimberly A.; Baker, Matthew M.; Wallace, Robert M.

    2011-12-01

    queue data structure to order the processing of cells so that each cell is visited only once and the cross-process communications that are a standard part of MPI are handled in an efficient manner. This parallel approach allows analysis of much larger DEMs as compared to the serial recursive algorithms. In this paper, we present the definitions of the HPMs, the serial and parallel algorithms used in their extraction and their potential applications in hydrology, geomorphology and ecology.
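
    The queue-driven traversal described here can be illustrated with a serial sketch: cells of a gridded DEM are placed on a queue and each cell is dequeued exactly once. The breadth-first distance-to-source example below is a hypothetical stand-in for the actual hydrological proximity measures and omits the MPI cross-process communication.

```python
from collections import deque

def grid_distance(nrows, ncols, sources):
    """Breadth-first distance (in cells) from a set of source cells,
    using a queue so that every reachable cell is dequeued exactly once."""
    dist = {(r, c): None for r in range(nrows) for c in range(ncols)}
    queue = deque()
    for cell in sources:
        dist[cell] = 0
        queue.append(cell)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in dist and dist[nb] is None:
                dist[nb] = dist[(r, c)] + 1
                queue.append(nb)
    return dist

# Hypothetical 5x5 grid with two "stream" cells as sources.
d = grid_distance(5, 5, [(0, 0), (4, 4)])
print(d[(2, 2)])   # 4
```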

  6. ENERGY EFFICIENT LAUNDRY PROCESS

    SciTech Connect

    Tim Richter

    2005-04-01

    With the rising cost of energy and increased concerns for pollution and greenhouse gas emissions from power generation, increased focus is being put on energy efficiency. This study looks at several approaches to reducing energy consumption in clothes care appliances by considering the appliances and laundry chemistry as a system, rather than individually.

  7. Inverting Magnetic Data Using Parallel Processing

    NASA Astrophysics Data System (ADS)

    Connor, L. M.; Connor, C. B.

    2002-12-01

    We have collaborated to develop an innovative method for inverting magnetic data from high-resolution geomagnetic maps. Our method uses parallel computations and asynchronous communication among multiple nodes of a Beowulf cluster to produce geologically constrained 3-D models of magnetic anomalies. This modeling effort comes in response to the current revolution in gathering geophysical data. Interfacing kinematic differential GPS to magnetometers has presented geo-scientists with the daunting task of interpreting very high-resolution geomagnetic maps. Traditional methods of data interpretation, such as forward modeling, are sorely taxed. Our method manipulates a set of geologically constrained parameters to eventually build a geometric model that accurately represents the magnetic anomaly. Iterations of the code execute in parallel on multiple networked nodes via MPI, a message passing interface. Each node computes a magnetic solution at different geographical field locations based on a modeled set of geological parameters, using various forward calculations. The parameter sets are continually adjusted by the downhill simplex method. Calculated values are continually compared to the observed data using a goodness-of-fit test until all parameter sets generate the same result within a specified tolerance. A set of parameters producing an anomaly mimicking the observed anomaly is the result. By changing bounds on input parameters it is practical to quickly identify equivalent solutions. These techniques are applied to a high resolution geomagnetic data set consisting of 30,000 data points and four discrete magnetic anomalies. Data were smoothed and inverted using the parallel code. The subsurface was discretized and the depth to each unit, magnetization, and depth to the base of the entire structure were allowed to vary as independent parameters. Inversion clearly highlights volcanic features of the source rocks, including the truncated cone, crater, lava flows, and

  8. Bipartite memory network architectures for parallel processing

    SciTech Connect

    Smith, W.; Kale, L.V.

    1990-01-01

    Parallel architectures are broadly classified as either shared memory or distributed memory architectures. In this paper, the authors propose a third family of architectures, called bipartite memory network architectures. In this architecture, processors and memory modules constitute a bipartite graph, where each processor is allowed to access a small subset of the memory modules, and each memory module allows access from a small set of processors. The architecture is particularly suitable for computations requiring dynamic load balancing. The authors explore the properties of this architecture by examining the Perfect Difference set based topology for the graph. Extensions of this topology are also suggested.

  9. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  10. Efficient parallel algorithms for string editing and related problems

    NASA Technical Reports Server (NTRS)

    Apostolico, Alberto; Atallah, Mikhail J.; Larmore, Lawrence; Mcfaddin, H. S.

    1988-01-01

    The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x or the substitution of a symbol of x with another symbol. This problem has a well known O(|x||y|) time sequential solution (25). Efficient parallel random-access machine (PRAM) algorithms for the string editing problem are given. If m = min(|x|,|y|) and n = max(|x|,|y|), then the CREW bound is O(log m log n) time with O(mn/log m) processors. In all algorithms, space is O(mn).
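
    The well-known O(|x||y|) sequential solution referred to above is the classic dynamic program; the sketch below assumes unit edit costs (the paper allows arbitrary weights) and serves only as a serial reference point for the PRAM bounds.

```python
def edit_distance(x, y):
    """Classic O(|x|*|y|) dynamic program with unit edit costs."""
    m, n = len(x), len(y)
    # d[i][j] = minimum cost of transforming x[:i] into y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete from x
                          d[i][j - 1] + 1,        # insert into x
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[m][n]

print(edit_distance("kitten", "sitting"))   # 3
```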

  11. Parallel perfusion imaging processing using GPGPU

    PubMed Central

    Zhu, Fan; Gonzalez, David Rodriguez; Carpenter, Trevor; Atkinson, Malcolm; Wardlaw, Joanna

    2012-01-01

    Background and purpose: The objective of brain perfusion quantification is to generate parametric maps of relevant hemodynamic quantities such as cerebral blood flow (CBF), cerebral blood volume (CBV) and mean transit time (MTT) that can be used in diagnosis of acute stroke. These calculations involve deconvolution operations that can be very computationally expensive when using local Arterial Input Functions (AIF). As time is vitally important in the case of acute stroke, reducing the analysis time will reduce the number of brain cells damaged and increase the potential for recovery. Methods: GPUs originated as graphics generation dedicated co-processors, but modern GPUs have evolved to become a more general processor capable of executing scientific computations. It provides a highly parallel computing environment due to its large number of computing cores and constitutes an affordable high performance computing method. In this paper, we will present the implementation of a deconvolution algorithm for brain perfusion quantification on GPGPU (General Purpose Graphics Processor Units) using the CUDA programming model. We present the serial and parallel implementations of such algorithms and the evaluation of the performance gains using GPUs. Results: Our method achieved speedups of 5.56 and 3.75 for CT and MR images, respectively. Conclusions: It seems that using GPGPU is a desirable approach in perfusion imaging analysis, which does not harm the quality of cerebral hemodynamic maps but delivers results faster than the traditional computation. PMID:22824549

  12. Parafrase restructuring of FORTRAN code for parallel processing

    NASA Technical Reports Server (NTRS)

    Wadhwa, Atul

    1988-01-01

    Parafrase transforms a FORTRAN code, subroutine by subroutine, into a parallel code for a vector and/or shared-memory multiprocessor system. Parafrase is not a compiler; it transforms a code and provides information for a vector or concurrent process. Parafrase uses a data dependency to reveal parallelism among instructions. The data dependency test distinguishes between recurrences and statements that can be directly vectorized or parallelized. A number of transformations are required to build a data dependency graph.

  13. Parallel-Processing Test Bed For Simulation Software

    NASA Technical Reports Server (NTRS)

    Blech, Richard; Cole, Gary; Townsend, Scott

    1996-01-01

    Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-the-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).

  14. Multiple-spot parallel processing for laser micronanofabrication

    NASA Astrophysics Data System (ADS)

    Kato, Jun-ichi; Takeyasu, Nobuyuki; Adachi, Yoshihiro; Sun, Hong-Bo; Kawata, Satoshi

    2005-01-01

    A tightly focused femtosecond laser has been established as a unique tool for micronanostructure fabrication due to its intrinsic three-dimensional processing. In this letter, we utilize a microlens array to produce multiple spots for parallel fabrication, giving rise to a revolutionary augmentation for our previously developed single-beam two-photon photopolymerization technology [S. Kawata, H.-B. Sun, T. Tanaka, and K. Takada, Nature (London) 412, 697 (2001)]. Two- and three-dimensional multiple structures, such as a microletter set and a self-standing microspring array, are demonstrated as examples of mass production. Simultaneous fabrication with more than 200 spots has been realized by optimizing the exposure condition for the photopolymerizable resin, i.e., a two-order-of-magnitude increase in yield efficiency. Potential applications of this technique are discussed.

  15. Parallel processing near supercomputers for science, engineering and AI

    SciTech Connect

    Walker, T.C.; Miller, R.K.

    1987-01-01

    The book explains the workings of several SIMD, MIMD, and dataflow architectures in non-theoretical terminology. The impact of parallel processing computers is examined. Application areas are described, and several case studies are included. The parallel processing projects and products of 37 international research groups and 27 leading corporations are presented. A survey of experts in the field explores opinions and forecasts on general architecture, problem solving strategies, and applications. Views of experts in the United States, Japan, and Europe are compared. The international markets for parallel processing computers are examined for 1986, 1988, and 1990.

  16. Boost Process Heating Efficiency - PHAST

    SciTech Connect

    2005-05-01

    Use the Process Heating Assessment and Survey Tool (PHAST) to survey all process heating equipment within a facility, select the equipment that uses the most energy, and identify ways to increase efficiency.

  17. An Airborne Onboard Parallel Processing Testbed

    NASA Technical Reports Server (NTRS)

    Mandl, Daniel J.

    2014-01-01

    This presentation provides information on the progress of the Intelligent Payload Module (IPM) development effort. In addition, a vision is presented on integration of the IPM architecture with the GeoSocial Application Program Interface (API) architecture to enable efficient distribution of satellite data products.

  18. Roadmap for efficient parallelization of breast anatomy simulation

    NASA Astrophysics Data System (ADS)

    Chui, Joseph H.; Pokrajac, David D.; Maidment, Andrew D. A.; Bakic, Predrag R.

    2012-03-01

    A roadmap has been proposed to optimize the simulation of breast anatomy by parallel implementation, in order to reduce the time needed to generate software breast phantoms. The rapid generation of high resolution phantoms is needed to support virtual clinical trials of breast imaging systems. We have recently developed an octree-based recursive partitioning algorithm for breast anatomy simulation. The algorithm has good asymptotic complexity; however, its current MATLAB implementation cannot provide optimal execution times. The proposed roadmap for efficient parallelization includes the following steps: (i) migrate the current code to a C/C++ platform and optimize it for single-threaded implementation; (ii) modify the code to allow for multi-threaded CPU implementation; (iii) identify and migrate the code to a platform designed for multithreaded GPU implementation. In this paper, we describe our results in optimizing the C/C++ code for single-threaded and multi-threaded CPU implementations. As the first step of the proposed roadmap we have identified a bottleneck component in the MATLAB implementation using MATLAB's profiling tool, and created a single threaded CPU implementation of the algorithm using C/C++'s overloaded operators and standard template library. The C/C++ implementation has been compared to the MATLAB version in terms of accuracy and simulation time. A 520-fold reduction of the execution time was observed in a test of phantoms with 50- 400 μm voxels. In addition, we have identified several places in the code which will be modified to allow for the next roadmap milestone of the multithreaded CPU implementation.

  19. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    PubMed

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812

  20. Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    PubMed Central

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812

  1. Parallel Processing of Large Scale Microphone Arrays for Sound Capture

    NASA Astrophysics Data System (ADS)

    Jan, Ea-Ee.

    1995-01-01

    Performance of microphone sound pick up is degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more prominent in a teleconferencing environment in which the microphone is positioned far away from the speaker. Besides, the ideal teleconference should feel as easy and natural as face-to-face communication with another person. This suggests hands-free sound capture with no tether or encumbrance by hand-held or body-worn sound equipment. Microphone arrays for this application represent an appropriate approach. This research develops new microphone array and signal processing techniques for high quality hands-free sound capture in noisy, reverberant enclosures. The new techniques combine matched-filtering of individual sensors and parallel processing to provide acute spatial volume selectivity which is capable of mitigating the deleterious effects of noise interference and multipath distortion. The new method outperforms traditional delay-and-sum beamformers which provide only directional spatial selectivity. The research additionally explores truncated matched-filtering and random distribution of transducers to reduce complexity and improve sound capture quality. All designs are first established by computer simulation of array performance in reverberant enclosures. The simulation is achieved by a room model which can efficiently calculate the acoustic multipath in a rectangular enclosure up to a prescribed order of images. It also calculates the incident angle of the arriving signal. Experimental arrays were constructed and their performance was measured in real rooms. Real room data were collected in a hard-walled laboratory and a controllable variable acoustics enclosure of similar size, approximately 6 x 6 x 3 m. An extensive speech database was also collected in these two enclosures for future research on microphone arrays. The simulation results are shown to be

  2. Applying Parallel Processing Techniques to Tether Dynamics Simulation

    NASA Technical Reports Server (NTRS)

    Wells, B. Earl

    1996-01-01

    The focus of this research has been to determine the effectiveness of applying parallel processing techniques to a sizable real-world problem, the simulation of the dynamics associated with a tether which connects two objects in low earth orbit, and to explore the degree to which the parallelization process can be automated through the creation of new software tools. The goal has been to utilize this specific application problem as a base to develop more generally applicable techniques.

  3. Parallel astronomical data processing with Python: Recipes for multicore machines

    NASA Astrophysics Data System (ADS)

    Singh, Navtej; Browne, Lisa-Marie; Butler, Ray

    2013-08-01

    High performance computing has been used in various fields of astrophysical research. But most of it is implemented on massively parallel systems (supercomputers) or graphical processing unit clusters. With the advent of multicore processors in the last decade, many serial software codes have been re-implemented in parallel mode to utilize the full potential of these processors. In this paper, we propose parallel processing recipes for multicore machines for astronomical data processing. The target audience is astronomers who use Python as their preferred scripting language and who may be using PyRAF/IRAF for data processing. Three problems of varied complexity were benchmarked on three different types of multicore processors to demonstrate the benefits, in terms of execution time, of parallelizing data processing tasks. The native multiprocessing module available in Python makes it a relatively trivial task to implement the parallel code. We have also compared the three multiprocessing approaches: Pool/Map, Process/Queue, and Parallel Python. Our test codes are freely available and can be downloaded from our website.
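
    Of the three approaches compared, Pool/Map is the most compact. A minimal sketch of that pattern follows; the per-frame reduction function is a made-up CPU-bound stand-in, not one of the paper's benchmark tasks.

```python
from multiprocessing import Pool

def reduce_frame(frame_id):
    """Stand-in for a CPU-bound per-frame data reduction step."""
    return frame_id, sum(i * i for i in range(100_000))

if __name__ == "__main__":          # required on platforms that spawn worker processes
    frames = range(16)
    with Pool(processes=4) as pool:
        results = pool.map(reduce_frame, frames)   # frames are distributed across 4 workers
    print(len(results), "frames processed")
```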

  4. Parallel Signal Processing and System Simulation using aCe

    NASA Technical Reports Server (NTRS)

    Dorband, John E.; Aburdene, Maurice F.

    2003-01-01

    Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C based parallel language (ace C) for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we will focus on some fundamental features of ace C and present a signal processing application (FFT).

  5. High efficiency solar cell processing

    NASA Technical Reports Server (NTRS)

    Ho, F.; Iles, P. A.

    1985-01-01

    At the time of writing, cells made by several groups are approaching 19% efficiency. General aspects of the processing required for such cells are discussed. Most processing used for high efficiency cells is derived from space-cell or concentrator cell technology, and recent advances have been obtained from improved techniques rather than from better understanding of the limiting mechanisms. Theory and modeling are fairly well developed, and adequate to guide further asymptotic increases in performance of near conventional cells. There are several competitive cell designs with promise of higher performance ( 20%) but for these designs further improvements are required. The available cell processing technology to fabricate high efficiency cells is examined.

  6. Parallel Processing of Distributed Video Coding to Reduce Decoding Time

    NASA Astrophysics Data System (ADS)

    Tonomura, Yoshihide; Nakachi, Takayuki; Fujii, Tatsuya; Kiya, Hitoshi

    This paper proposes a parallelized DVC framework that treats each bitplane independently to reduce the decoding time. Unfortunately, simple parallelization generates inaccurate bit probabilities because additional side information is not available for the decoding of subsequent bitplanes, which degrades encoding efficiency. Our solution is an effective estimation method that can calculate the bit probability as accurately as possible by index assignment without recourse to side information. Moreover, we improve the coding performance of Rate-Adaptive LDPC (RA-LDPC), which is used in the parallelized DVC framework. This proposal selects a fitting sparse matrix for each bitplane according to the syndrome rate estimation results at the encoder side. Simulations show that our parallelization method reduces the decoding time by up to 35% and achieves a bit rate reduction of about 10%.

  7. Digital signal processor and programming system for parallel signal processing

    SciTech Connect

    Van den Bout, D.E.

    1987-01-01

    This thesis describes an integrated assault upon the problem of designing high-throughput, low-cost digital signal-processing systems. The dual prongs of this assault consist of: (1) the design of a digital signal processor (DSP) which efficiently executes signal-processing algorithms in either a uniprocessor or multiprocessor configuration, (2) the PaLS programming system which accepts an arbitrary algorithm, partitions it across a group of DSPs, synthesizes an optimal communication link topology for the DSPs, and schedules the partitioned algorithm upon the DSPs. The results of applying a new quasi-dynamic analysis technique to a set of high-level signal-processing algorithms were used to determine the uniprocessor features of the DSP design. For multiprocessing applications, the DSP contains an interprocessor communications port (IPC) which supports simple, flexible, dataflow communications while allowing the total communication bandwidth to be incrementally allocated to achieve the best link utilization. The net result is a DSP with a simple architecture that is easy to program for both uniprocessor and multi-processor modes of operation. The PaLS programming system simplifies the task of parallelizing an algorithm for execution upon a multiprocessor built with the DSP.

  8. Hardware Efficient and High-Performance Networks for Parallel Computers.

    NASA Astrophysics Data System (ADS)

    Chien, Minze Vincent

    High performance interconnection networks are the key to high utilization and throughput in large-scale parallel processing systems. Since many interconnection problems in parallel processing such as concentration, permutation and broadcast problems can be cast as sorting problems, this dissertation considers the problem of sorting on a new model, called an adaptive sorting network. It presents four adaptive binary sorters, the first two of which are ordinary combinational circuits while the last two exploit time-multiplexing and pipelining techniques. These sorter constructions demonstrate that any sequence of n bits can be sorted in O(log^2 n) bit-level delay, using O(n) constant fanin gates. This improves the cost complexity of Batcher's binary sorters by a factor of O(log^2 n) while matching their sorting time. It is further shown that any sequence of n numbers can be sorted on the same model in O(log^2 n) comparator-level delay using O(n log n log log n) comparators. The adaptive binary sorter constructions lead to new O(n) bit-level cost concentrators and superconcentrators with O(log^2 n) bit-level delay. Their employment in recently constructed permutation and generalized connectors leads to permutation and generalized connection networks with O(n log n) bit-level cost and O(log^3 n) bit-level delay. These results provide the least bit-level cost for such networks with competitive delays. Finally, the dissertation considers a key issue in the implementation of interconnection networks, namely, the pin constraint. Current VLSI technologies can house a large number of switches in a single chip, but the mere fact that one chip cannot have too many pins precludes the possibility of implementing a large connection network on a single chip. The dissertation presents techniques for partitioning connection networks into identical modules of switches in such a way that each module is contained in a single chip with an arbitrarily specified number of pins, and that the cost of

  9. Efficient graph algorithms for sequential and parallel computers. Doctoral thesis

    SciTech Connect

    Goldberg, A.V.

    1987-02-01

    This thesis studies graph algorithms, both in sequential and parallel contexts. In the outline of the thesis, algorithm complexities are stated in terms of the number of vertices n, the number of edges m, the largest absolute value of capacities U, and the largest absolute value of costs C. Chapter 1 introduces a new approach to the maximum flow problem that leads to better algorithms for the problem. Chapter 2 is devoted to the minimum cost flow problem, which is a generalization of the maximum flow problem. Chapter 3 addresses implementation of parallel algorithms through a case study of an implementation of a parallel maximum flow algorithm. Parallel prefix operations play an important role in the implementation. Experimental results achieved by the implementation are presented. Parallel symmetry-breaking techniques are the main topic of Chapter 4.

  10. Processing data communications events by awakening threads in parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2016-03-15

    Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
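
    The wait-then-awaken behaviour described in the claim can be mimicked in ordinary threaded code with a condition variable: the advance routine sleeps when no events are pending and is awakened when one is posted. The Python sketch below is only an analogy for the control flow, not the patented PAMI implementation; all names in it are hypothetical.

```python
import threading
import queue

events = queue.Queue()
cond = threading.Condition()

def advance():
    """Poll for pending communications events; enter a wait state when there are none."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            with cond:
                cond.wait(timeout=1.0)   # wait until awakened (timeout guards a missed notify)
            continue
        if event is None:                 # shutdown sentinel
            return
        print("processing event:", event)

def post(event):
    """Post a data communications event and awaken the waiting advance thread."""
    events.put(event)
    with cond:
        cond.notify()

worker = threading.Thread(target=advance)
worker.start()
for i in range(3):
    post(f"message-{i}")
post(None)
worker.join()
```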

  11. Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing

    NASA Technical Reports Server (NTRS)

    Fricker, David M.

    1997-01-01

    The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.
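
    For reference, the parallel efficiency reported in such benchmarks is simply the speedup divided by the processor count; the helper below makes that relationship explicit with made-up timings.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-processor time to parallel time."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Speedup normalized by the number of processors (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_procs

# Hypothetical timings for a fixed-size problem on 1 and 8 processors.
t1, t8 = 1200.0, 180.0
print(speedup(t1, t8))                    # ~6.67x
print(parallel_efficiency(t1, t8, 8))     # ~0.83
```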

  12. Parallel algorithms for high-speed SAR processing

    NASA Astrophysics Data System (ADS)

    Mallorqui, Jordi J.; Bara, Marc; Broquetas, Antoni; Wis, Mariano; Martinez, Antonio; Nogueira, Leonardo; Moreno, Victoriano

    1998-11-01

    The mass production of SAR products and its use in monitoring emergency situations (oil spill detection, floods, etc.) requires high-speed SAR processors. Two different parallel strategies for near-real-time SAR processing based on a multiblock version of the Chirp Scaling Algorithm (CSA) have been studied. The first one is useful for small companies that would like to reduce computation times with no extra investment. It uses a cluster of heterogeneous UNIX workstations as a parallel computer. The second one is oriented to institutions, which have to process large amounts of data in short times and can afford the cost of large parallel computers. In both cases, the parallel implementations reduced computation times compared with the sequential versions.

  13. Study of image processing system based on parallel structure of multiple DSPs

    NASA Astrophysics Data System (ADS)

    Song, Jianxun; Wu, Qin-zhang

    2008-03-01

    A novel parallel image processing architecture using multiple DSPs, able to satisfy real-time image processing demands, is proposed. The architecture is built from high-performance DSPs interconnected by an FPGA; within the FPGA, an IRAM-based interconnection network and a specific data communication protocol are implemented. The system inherits merits from both tightly coupled and loosely coupled parallel systems, and its architecture is reconfigurable and scalable. Measurements on this platform show high data transfer rates, so the system can satisfy the parallel real-time image processing demands of complex tasks with large computation and high-speed data transfer. For the designed parallel hardware we analyze benchmarks including speedup, parallel efficiency, the selection of processing units, and the interconnection network, and we give some suggestions to further improve system performance. The real-time image processing system based on a parallel structure of multiple DSPs is easy to implement; because the structure is reconfigurable and scalable, the number of DSPs can be changed and other DSP families can be substituted, which makes the approach promising for real-time image processing applications.

  14. Watermarking scheme for large images using parallel processing

    NASA Astrophysics Data System (ADS)

    Debes, Eric; Dardier, Genevieve; Ebrahimi, Touradj; Herrigel, Alexander

    2001-08-01

    Large and high-resolution images usually have a high commercial value. Thus they are very good candidates for watermarking. If many images have to be signed in a Client-Server setup, memory and computational requirements could become unrealistic for current and near future solutions. In this paper, we propose to tile the image into sub-images. The watermarking scheme is then applied to each sub-image in the embedding and retrieval process. Thanks to this solution, the first possible optimization consists in creating different threads to read and write the image tile by tile. The time spent in input/output operations, which can be a bottleneck for large images, is reduced. In addition to this optimization, we show that the memory consumption of the application is also highly reduced for large images. Finally, the application can be multithreaded so that different tiles can be watermarked in parallel. Therefore the scheme can take advantage of the processing power of the different processors available in current servers. We show that the correct tile size has to be chosen and the right number of threads created to efficiently distribute the workload. Finally, security, robustness and invisibility issues are addressed, considering the signal redundancy.
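
    The tile-and-thread structure described above can be sketched with a thread pool mapping an embedding routine over sub-images; the per-tile embedding below is a placeholder, not the actual watermarking scheme, and the image and tile size are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def embed_tile(tile, strength=0.01):
    """Placeholder for the per-tile watermark embedding step."""
    return [[px + strength for px in row] for row in tile]

def split_into_tiles(image, tile_size):
    """Yield horizontal strips of rows; each strip is watermarked independently."""
    for r in range(0, len(image), tile_size):
        yield [row[:] for row in image[r:r + tile_size]]

image = [[float(r * 64 + c) for c in range(64)] for r in range(64)]   # hypothetical 64x64 image

with ThreadPoolExecutor(max_workers=4) as pool:
    watermarked_tiles = list(pool.map(embed_tile, split_into_tiles(image, 16)))

watermarked = [row for tile in watermarked_tiles for row in tile]     # reassemble by rows
print(len(watermarked), "rows watermarked in", len(watermarked_tiles), "tiles")
```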

  15. Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

    1996-01-01

    This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamic and aeroacoustic properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state-of-the-art audio and video rendering of the propagating acoustic signals.

  16. Parallel processing implementation for the coupled transport of photons and electrons using OpenMP

    NASA Astrophysics Data System (ADS)

    Doerner, Edgardo

    2016-05-01

    In this work the use of OpenMP to implement the parallel processing of the Monte Carlo (MC) simulation of the coupled transport of photons and electrons is presented. This implementation was carried out using a modified EGSnrc platform which enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the developing tools available in the Intel Parallel Studio XE 2015 (XE2015). The performance study of this new implementation was carried out in a desktop PC with a multi-core CPU, taking as a reference the performance of the original platform. The results were satisfactory, both in terms of scalability and parallelization efficiency.
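
    The listing below is a generic OpenMP Monte Carlo loop of the kind described above, not the EGSnrc code itself; simulateHistory is a placeholder for the coupled photon-electron transport of one particle history, and the per-thread random streams and seeds are assumptions made for the sketch.

      // History-level OpenMP parallelism for a Monte Carlo simulation (sketch).
      #include <omp.h>
      #include <cstdio>
      #include <random>

      double simulateHistory(std::mt19937& rng) {
          // Placeholder "transport" of one particle: returns a deposited-energy score.
          std::exponential_distribution<double> pathLength(1.0);
          return pathLength(rng);
      }

      int main() {
          const long nHistories = 10'000'000;
          double total = 0.0;

          #pragma omp parallel reduction(+:total)
          {
              // One independent random stream per thread avoids shared-state contention.
              std::mt19937 rng(12345u + static_cast<unsigned>(omp_get_thread_num()));
              #pragma omp for schedule(static)
              for (long i = 0; i < nHistories; ++i)
                  total += simulateHistory(rng);
          }
          std::printf("mean score = %f\n", total / nHistories);
          return 0;
      }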

  17. FPGA-Based Filterbank Implementation for Parallel Digital Signal Processing

    NASA Technical Reports Server (NTRS)

    Berner, Stephan; DeLeon, Phillip

    1999-01-01

    One approach to parallel digital signal processing decomposes a high bandwidth signal into multiple lower bandwidth (rate) signals by an analysis bank. After processing, the subband signals are recombined into a fullband output signal by a synthesis bank. This paper describes an implementation of the analysis and synthesis banks using Field Programmable Gate Arrays (FPGAs).

  18. Automatic Mapping Of Large Signal Processing Systems To A Parallel Machine

    NASA Astrophysics Data System (ADS)

    Printz, Harry; Kung, H. T.; Mummert, Todd; Scherer, Paul M.

    1989-12-01

    Since the spring of 1988, Carnegie Mellon University and the Naval Air Development Center have been working together to implement several large signal processing systems on the Warp parallel computer. In the course of this work, we have developed a prototype of a software tool that can automatically and efficiently map signal processing systems to distributed-memory parallel machines, such as Warp. We have used this tool to produce Warp implementations of small test systems. The automatically generated programs compare favorably with hand-crafted code. We believe this tool will be a significant aid in the creation of high speed signal processing systems. We assume that signal processing systems have the following characteristics: • They can be described by directed graphs of computational tasks; these graphs may contain thousands of task vertices. • Some tasks can be parallelized in a systolic or data-partitioned manner, while others cannot be parallelized at all. • The side effects of each task, if any, are limited to changes in local variables. • Each task has a data-independent execution time bound, which may be expressed as a function of the way it is parallelized, and the number of processors it is mapped to. In this paper we describe techniques to automatically map such systems to Warp-like parallel machines. We identify and address key issues in gracefully combining different parallel programming styles, in allocating processor, memory and communication bandwidth, and in generating and scheduling efficient parallel code. When iWarp, the VLSI version of the Warp machine, becomes available in 1990, we will extend this tool to generate efficient code for very large applications, which may require as many as 3000 iWarp processors, with an aggregate peak performance of 60 gigaflops.

  19. On Efficient Parallel Implementation of Moving Body Overset Grid Methods

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew M.; Meakin, Robert L.; Warmbrodt, William (Technical Monitor)

    1997-01-01

    An investigation into the parallel performance of moving-body overset grid methods will be presented. Parallel versions of the OVERFLOW flow solver, DCF3D domain connectivity software, and SIXDO six-degree-of-freedom routine are coupled with an automatic load balance routine and tested for 3D Navier-Stokes calculations on the IBM SP2. The primary source of parallel inefficiency in moving-body problems is the domain connectivity cost of DCF3D. Although this algorithm constitutes a relatively low fraction of the total solution cost (e.g. 10-20%) in calculations on serial machines, it can consequently cause a significant degradation in the overall parallel performance. The paper will highlight some approaches for improving the scalability of DCF3D. The paper will present results of a proposed new load balancing scheme that seeks more equal distribution of the inter-grid boundary points in order to more evenly load balance the donor search costs associated with DCF3D. Some preliminary results will also be given from a new solution-adaption algorithm coupled with OVERFLOW which incorporates overset Cartesian grids with various levels of refinement. The measured parallel performance from a descending delta-wing configuration and a generic store separation from a wing/pylon case will be presented.

  20. Mapping Pixel Windows To Vectors For Parallel Processing

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.

    1996-01-01

    Mapping performed by matrices of transistor switches. Arrays of transistor switches devised for use in forming simultaneous connections from square subarray (window) of n x n pixels within electronic imaging device containing n_p x n_p array of pixels to linear array of n^2 input terminals of electronic neural network or other parallel-processing circuit. Method helps to realize potential for rapidity in parallel processing for such applications as enhancement of images and recognition of patterns. In providing simultaneous connections, overcomes timing bottleneck of older multiplexing, serial-switching, and sample-and-hold methods.

  1. Parallel evolution of image processing tools for multispectral imagery

    NASA Astrophysics Data System (ADS)

    Harvey, Neal R.; Brumby, Steven P.; Perkins, Simon J.; Porter, Reid B.; Theiler, James P.; Young, Aaron C.; Szymanski, John J.; Bloch, Jeffrey J.

    2000-11-01

    We describe the implementation and performance of a parallel, hybrid evolutionary-algorithm-based system, which optimizes image processing tools for feature-finding tasks in multi-spectral imagery (MSI) data sets. Our system uses an integrated spatio-spectral approach and is capable of combining suitably-registered data from different sensors. We investigate the speed-up obtained by parallelization of the evolutionary process via multiple processors (a workstation cluster) and develop a model for prediction of run-times for different numbers of processors. We demonstrate our system on Landsat Thematic Mapper MSI, covering the recent Cerro Grande fire at Los Alamos, NM, USA.

  2. Efficient parallel architecture for highly coupled real-time linear system applications

    NASA Technical Reports Server (NTRS)

    Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo

    1988-01-01

    A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.

  3. Active Storage Processing in a Parallel File System

    SciTech Connect

    Felix, Evan J.; Fox, Kevin M.; Regimbal, Kevin M.; Nieplocha, Jarek

    2006-01-01

    By creating a processing system within a parallel file system one can harness the unused processing power on servers that have very fast access to the disks they are serving. By inserting a module into the Lustre file system, the Active Storage concept is able to perform processing within the file system architecture. Results of using this technology are presented, and the results of the Supercomputing StorCloud Challenge application are reviewed.

  4. FORTRAN M. FORTRAN Extensions for Modular Parallel Processing

    SciTech Connect

    Foster, Ian; Olson, Robert; Tuecke, Steven

    1993-08-01

    FORTRAN M is a small set of extensions to FORTRAN that supports a modular approach to the construction of sequential and parallel programs. FORTRAN M programs use channels to plug together processes which may be written in FORTRAN M or FORTRAN 77. Processes communicate by sending and receiving messages on channels. Channels and processes can be created dynamically, but programs remain deterministic unless specialized nondeterministic constructs are used.

  5. A single user efficiency measure for evaluation of parallel or pipeline computer architectures

    NASA Technical Reports Server (NTRS)

    Jones, W. P.

    1978-01-01

    A precise statement of the relationship between sequential computation at one rate, parallel or pipeline computation at a much higher rate, the data movement rate between levels of memory, the fraction of inherently sequential operations or data that must be processed sequentially, the fraction of data to be moved that cannot be overlapped with computation, and the relative computational complexity of the algorithms for the two processes, scalar and vector, was developed. The relationship should be applied to the multirate processes that obtain in the employment of various new or proposed computer architectures for computational aerodynamics. The relationship, an efficiency measure that the single user of the computer system perceives, argues strongly in favor of separating scalar and vector processes, sometimes referred to as loosely coupled processes, to achieve optimum use of hardware.
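
    The abstract does not reproduce the relationship itself; purely as an illustration of its ingredients, one plausible form of such a single-user efficiency (our notation, not necessarily the paper's expression) is

      $$ E \;=\; \frac{W / r_v}{\; f\,\dfrac{W}{r_s} \;+\; (1-f)\,\dfrac{c\,W}{r_v} \;+\; g\,\dfrac{D}{r_m} \;} $$

    where W is the total work, r_s and r_v the scalar and vector (or parallel) processing rates, f the fraction that must be processed sequentially, c the relative computational complexity of the vector algorithm, D the data volume moved between memory levels, r_m the data movement rate, and g the fraction of that movement that cannot be overlapped with computation.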

  6. Efficient Parallel Learning of Hidden Markov Chain Models on SMPs

    NASA Astrophysics Data System (ADS)

    Li, Lei; Fu, Bin; Faloutsos, Christos

    Quad-core CPUs have been a common desktop configuration for today's office. The increasing number of processors on a single chip opens new opportunities for parallel computing. Our goal is to make use of the multi-core as well as multi-processor architectures to speed up large-scale data mining algorithms. In this paper, we present a general parallel learning framework, Cut-And-Stitch, for training hidden Markov chain models. Particularly, we propose two model-specific variants, CAS-LDS for learning linear dynamical systems (LDS) and CAS-HMM for learning hidden Markov models (HMM). Our main contribution is a novel method to handle the data dependencies due to the chain structure of hidden variables, so as to parallelize the EM-based parameter learning algorithm. We implement CAS-LDS and CAS-HMM using OpenMP on two supercomputers and a quad-core commercial desktop. The experimental results show that parallel algorithms using Cut-And-Stitch achieve comparable accuracy and almost linear speedups over the traditional serial version.

  7. Parallel processing of atmospheric chemistry calculations: Preliminary considerations

    SciTech Connect

    Elliott, S.; Jones, P.

    1995-01-01

    Global climate calculations are already saturating the class of modern vector supercomputers with only a few central processing units. Increased resolution and inclusion of routines to deal with biogeochemical portions of the terrestrial climate system will soon demand massively parallel approaches. The atmospheric photochemistry ensemble is intimately linked to climate through the trace greenhouse gases ozone and methane, and modules for representing it are being attached to global three-dimensional transport and GCM frameworks. Atmospheric kinetics involve dozens of highly interactive tracers and so will accentuate the need for parallel processing of earth system simulations. In the present text we lay some of the groundwork for addition of atmospheric kinetics packages to GCM and global scale atmospheric models on massively parallel computers. The discussion is tailored for consumption by the photochemical modelling community. After a review of numerical atmospheric chemistry methods, we examine how kinetics can be implemented on a parallel computer. We concentrate especially on data layout and flexibility and how these can be implemented in various programming models. We conclude that chemistry can be implemented rather easily within existing frameworks of several parallel atmospheric models. However, memory limitations may preclude high resolution studies of global chemistry.

  8. Partitioning Rectangular and Structurally Nonsymmetric Sparse Matrices for Parallel Processing

    SciTech Connect

    B. Hendrickson; T.G. Kolda

    1998-09-01

    A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix- transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices.
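
    The paper's contribution is the partitioning model itself; the sketch below shows only the downstream kernels being balanced, a row-partitioned CSR matrix-vector product and its transpose product, with OpenMP standing in for the distributed-memory setting actually targeted.

      #include <omp.h>
      #include <algorithm>
      #include <vector>

      struct CSR {
          int rows = 0, cols = 0;
          std::vector<int> rowPtr, colIdx;   // rowPtr has rows + 1 entries
          std::vector<double> val;
      };

      // y = A * x, with rows distributed across workers.
      void spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
          #pragma omp parallel for schedule(static)
          for (int i = 0; i < A.rows; ++i) {
              double s = 0.0;
              for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                  s += A.val[k] * x[A.colIdx[k]];
              y[i] = s;
          }
      }

      // z = A^T * y; the scattered updates are what makes the partition quality matter.
      void spmvTranspose(const CSR& A, const std::vector<double>& y, std::vector<double>& z) {
          std::fill(z.begin(), z.end(), 0.0);
          #pragma omp parallel for schedule(static)
          for (int i = 0; i < A.rows; ++i) {
              for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                  #pragma omp atomic
                  z[A.colIdx[k]] += A.val[k] * y[i];
              }
          }
      }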

  9. DNA encoding for an efficient 'Omics processing.

    PubMed

    Murovec, Bostjan; Tiedje, James M; Stres, Blaz

    2010-11-01

    The exponential growth of available DNA sequences and the increased interoperability of biological information is triggering intergovernmental efforts aimed at increasing the access, dissemination, and analysis of sequence data. Achieving the efficient storage and processing of DNA material is an important goal that parallels well with the foreseen coding standardization on the horizon. This paper proposes novel coding approaches, for both the dissemination and processing of sequences, where the speed of the DNA processing is shown to be boosted by exploring more than the normally utilized eight bits for encoding a single nucleotide. Further gains are achieved by encoding the nucleotides together with their trailing alignment information as a single 64-bit data structure. The paper also proposes a slight modification to the established FASTA scheme in order to improve on its representation of alignment information. The significance of the propositions is confirmed by the encouraging results from empirical tests. PMID:20444519
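
    A minimal sketch of packing a nucleotide together with its trailing alignment information into one 64-bit word follows; the field widths and the choice of fields are assumptions made for the illustration, not the layout actually proposed in the paper.

      #include <cstdint>

      enum Nucleotide : uint64_t { A = 0, C = 1, G = 2, T = 3, GAP = 4 };

      // Assumed layout (low to high bits): 3-bit base code, 29-bit alignment
      // column index, 32-bit sequence/record identifier.
      inline uint64_t pack(Nucleotide base, uint32_t column, uint32_t sequenceId) {
          return (uint64_t(base) & 0x7u)
               | (uint64_t(column & 0x1FFFFFFFu) << 3)
               | (uint64_t(sequenceId) << 32);
      }

      inline Nucleotide baseOf(uint64_t w)     { return Nucleotide(w & 0x7u); }
      inline uint32_t   columnOf(uint64_t w)   { return uint32_t((w >> 3) & 0x1FFFFFFFu); }
      inline uint32_t   sequenceOf(uint64_t w) { return uint32_t(w >> 32); }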

  10. Idaho Chemical Processing Plant Process Efficiency improvements

    SciTech Connect

    Griebenow, B.

    1996-03-01

    In response to decreasing funding levels available to support activities at the Idaho Chemical Processing Plant (ICPP) and a desire to be cost competitive, the Department of Energy Idaho Operations Office (DOE-ID) and Lockheed Idaho Technologies Company have increased their emphasis on cost-saving measures. The ICPP Effectiveness Improvement Initiative involves many activities to improve cost effectiveness and competitiveness. This report documents the methodology and results of one of those cost-cutting measures, the Process Efficiency Improvement Activity. The Process Efficiency Improvement Activity performed a systematic review of major work processes at the ICPP to increase productivity and to identify nonvalue-added requirements. A two-phase approach was selected for the activity to allow for near-term implementation of relatively easy process modifications in the first phase while obtaining long-term continuous improvement in the second phase and beyond. Phase I of the initiative included a concentrated review of processes that had a high potential for cost savings with the intent of realizing savings in Fiscal Year 1996 (FY-96). Phase II consists of implementing long-term strategies too complex for Phase I implementation and evaluation of processes not targeted for Phase I review. The Phase II effort is targeted for realizing cost savings in FY-97 and beyond.

  11. Parallel Processing of Objects in a Naming Task

    ERIC Educational Resources Information Center

    Meyer, Antje S.; Ouellet, Marc; Hacker, Christine

    2008-01-01

    The authors investigated whether speakers who named several objects processed them sequentially or in parallel. Speakers named object triplets, arranged in a triangle, in the order left, right, and bottom object. The left object was easy or difficult to identify and name. During the saccade from the left to the right object, the right object shown…

  12. The Extended Parallel Process Model: Illuminating the Gaps in Research

    ERIC Educational Resources Information Center

    Popova, Lucy

    2012-01-01

    This article examines constructs, propositions, and assumptions of the extended parallel process model (EPPM). Review of the EPPM literature reveals that its theoretical concepts are thoroughly developed, but the theory lacks consistency in operational definitions of some of its constructs. Out of the 12 propositions of the EPPM, a few have not…

  13. Postscript: Parallel Distributed Processing in Localist Models without Thresholds

    ERIC Educational Resources Information Center

    Plaut, David C.; McClelland, James L.

    2010-01-01

    The current authors reply to a response by Bowers on a comment by the current authors on the original article. Bowers (2010) mischaracterizes the goals of parallel distributed processing (PDP) research--explaining performance on cognitive tasks is the primary motivation. More important, his claim that localist models, such as the interactive…

  14. Using Motivational Interviewing Techniques to Address Parallel Process in Supervision

    ERIC Educational Resources Information Center

    Giordano, Amanda; Clarke, Philip; Borders, L. DiAnne

    2013-01-01

    Supervision offers a distinct opportunity to experience the interconnection of counselor-client and counselor-supervisor interactions. One product of this network of interactions is parallel process, a phenomenon by which counselors unconsciously identify with their clients and subsequently present to their supervisors in a similar fashion…

  15. Rapid Parallel Semantic Processing of Numbers without Awareness

    ERIC Educational Resources Information Center

    Van Opstal, Filip; de Lange, Floris P.; Dehaene, Stanislas

    2011-01-01

    In this study, we investigate whether multiple digits can be processed at a semantic level without awareness, either serially or in parallel. In two experiments, we presented participants with two successive sets of four simultaneous Arabic digits. The first set was masked and served as a subliminal prime for the second, visible target set.…

  16. Parallel Processing Method for Airborne Laser Scanning Data Using a PC Cluster and a Virtual Grid.

    PubMed

    Han, Soo Hee; Heo, Joon; Sohn, Hong Gyoo; Yu, Kiyun

    2009-01-01

    In this study, a parallel processing method using a PC cluster and a virtual grid is proposed for the fast processing of enormous amounts of airborne laser scanning (ALS) data. The method creates a raster digital surface model (DSM) by interpolating point data with inverse distance weighting (IDW), and produces a digital terrain model (DTM) by local minimum filtering of the DSM. To make a consistent comparison of performance between the sequential and parallel processing approaches, the means of dealing with boundary data and of selecting interpolation centers were controlled for each processing node in the parallel approach. To test the speedup, efficiency and linearity of the proposed algorithm, actual ALS data of up to 134 million points were processed with a PC cluster consisting of one master node and eight slave nodes. The results showed that parallel processing provides better performance when the computational overhead, the number of processors, and the data size become large. It was verified that the proposed algorithm is a linear time operation and that the products obtained by parallel processing are identical to those produced by sequential processing. PMID:22574032
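
    The listing below is a single-node reference version of the per-cell interpolation step described above (inverse distance weighting onto one raster cell); in the paper each node of the PC cluster runs this kind of computation over its own tiles of the virtual grid, and the point selection and boundary handling are omitted here.

      #include <cmath>
      #include <vector>

      struct Point { double x, y, z; };

      // Interpolate one DSM cell centre (cx, cy) from the ALS points selected for it.
      double idwCell(double cx, double cy, const std::vector<Point>& pts, double power = 2.0) {
          double num = 0.0, den = 0.0;
          for (const Point& p : pts) {
              double d2 = (p.x - cx) * (p.x - cx) + (p.y - cy) * (p.y - cy);
              if (d2 == 0.0) return p.z;                      // sample coincides with the cell centre
              double w = 1.0 / std::pow(d2, power / 2.0);     // weight = 1 / distance^power
              num += w * p.z;
              den += w;
          }
          return den > 0.0 ? num / den : std::nan("");        // no points: leave the cell void
      }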

  17. On the Optimality of Serial and Parallel Processing in the Psychological Refractory Period Paradigm: Effects of the Distribution of Stimulus Onset Asynchronies

    ERIC Educational Resources Information Center

    Miller, Jeff; Ulrich, Rolf; Rolke, Bettina

    2009-01-01

    Within the context of the psychological refractory period (PRP) paradigm, we developed a general theoretical framework for deciding when it is more efficient to process two tasks in serial and when it is more efficient to process them in parallel. This analysis suggests that a serial mode is more efficient than a parallel mode under a wide variety…

  18. Implementation and efficiency analysis of parallel computation using OpenACC: a case study using flow field simulations

    NASA Astrophysics Data System (ADS)

    Zhang, Shanghong; Yuan, Rui; Wu, Yu; Yi, Yujun

    2016-01-01

    The Open Accelerator (OpenACC) application programming interface is a relatively new parallel computing standard. In this paper, particle-based flow field simulations are examined as a case study of OpenACC parallel computation. The parallel conversion process of the OpenACC standard is explained, and further, the performance of the flow field parallel model is analysed using different directive configurations and grid schemes. With careful implementation and optimisation of the data transportation in the parallel algorithm, a speedup factor of 18.26× is possible. In contrast, a speedup factor of just 11.77× was achieved with the conventional Open Multi-Processing (OpenMP) parallel mode on a 20-core computer. These results demonstrate that optimised feature settings greatly influence the degree of speedup, and models involving larger numbers of calculations exhibit greater efficiency and higher speedup factors. In addition, the OpenACC parallel mode is found to have good portability, making it easy to implement parallel computation from the original serial model.
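
    As a rough illustration of the directive-plus-data-region pattern discussed above (not the authors' flow-field model), the loop below keeps its arrays resident on the accelerator for all time steps, which is the kind of data-transport optimisation the reported speedups depend on; the force term and argument names are placeholders.

      #include <vector>

      void advance(std::vector<float>& x, std::vector<float>& v, float dt, int steps) {
          const int n = static_cast<int>(x.size());
          float* px = x.data();
          float* pv = v.data();

          #pragma acc data copy(px[0:n], pv[0:n])              // one transfer in, one out
          for (int s = 0; s < steps; ++s) {
              #pragma acc parallel loop                         // each particle updated independently
              for (int i = 0; i < n; ++i) {
                  pv[i] += -9.81f * dt;                         // illustrative force term
                  px[i] += pv[i] * dt;
              }
          }
      }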

  19. The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce

    NASA Astrophysics Data System (ADS)

    Chen, Xi; Zhou, Liqing

    2015-12-01

    With the development of satellite remote sensing technology and the growth of remote sensing image data, traditional segmentation technology cannot meet the processing and storage requirements of massive remote sensing imagery. This article applies cloud computing and parallel computing technology to the remote sensing image segmentation process, building a cheap and efficient computer cluster that parallelizes the MeanShift segmentation algorithm on the MapReduce model. The approach maintains segmentation quality while improving segmentation speed, thereby better meeting real-time requirements. The MapReduce-based parallel MeanShift segmentation algorithm therefore has practical significance and application value.

  1. Automating the parallel processing of fluid and structural dynamics calculations

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.; Cole, Gary L.

    1987-01-01

    The NASA Lewis Research Center is actively involved in the development of expert system technology to assist users in applying parallel processing to computational fluid and structural dynamic analysis. The goal of this effort is to eliminate the necessity for the physical scientist to become a computer scientist in order to effectively use the computer as a research tool. Programming and operating software utilities have previously been developed to solve systems of ordinary nonlinear differential equations on parallel scalar processors. Current efforts are aimed at extending these capabilities to systems of partial differential equations, that describe the complex behavior of fluids and structures within aerospace propulsion systems. This paper presents some important considerations in the redesign, in particular, the need for algorithms and software utilities that can automatically identify data flow patterns in the application program and partition and allocate calculations to the parallel processors. A library-oriented multiprocessing concept for integrating the hardware and software functions is described.

  2. Parallel-Processing Software for Correlating Stereo Images

    NASA Technical Reports Server (NTRS)

    Klimeck, Gerhard; Deen, Robert; Mcauley, Michael; DeJong, Eric

    2007-01-01

    A computer program implements parallel-processing algorithms for correlating images of terrain acquired by stereoscopic pairs of digital stereo cameras on an exploratory robotic vehicle (e.g., a Mars rover). Such correlations are used to create three-dimensional computational models of the terrain for navigation. In this program, the scene viewed by the cameras is segmented into subimages. Each subimage is assigned to one of a number of central processing units (CPUs) operating simultaneously.

  3. Parallel Processing of Broad-Band PPM Signals

    NASA Technical Reports Server (NTRS)

    Gray, Andrew; Kang, Edward; Lay, Norman; Vilnrotter, Victor; Srinivasan, Meera; Lee, Clement

    2010-01-01

    A parallel-processing algorithm and a hardware architecture to implement the algorithm have been devised for timeslot synchronization in the reception of pulse-position-modulated (PPM) optical or radio signals. As in the cases of some prior algorithms and architectures for parallel, discrete-time, digital processing of signals other than PPM, an incoming broadband signal is divided into multiple parallel narrower-band signals by means of sub-sampling and filtering. The number of parallel streams is chosen so that the frequency content of the narrower-band signals is low enough to enable processing by relatively-low speed complementary metal oxide semiconductor (CMOS) electronic circuitry. The algorithm and architecture are intended to satisfy requirements for time-varying time-slot synchronization and post-detection filtering, with correction of timing errors independent of estimation of timing errors. They are also intended to afford flexibility for dynamic reconfiguration and upgrading. The architecture is implemented in a reconfigurable CMOS processor in the form of a field-programmable gate array. The algorithm and its hardware implementation incorporate three separate time-varying filter banks for three distinct functions: correction of sub-sample timing errors, post-detection filtering, and post-detection estimation of timing errors. The design of the filter bank for correction of timing errors, the method of estimating timing errors, and the design of a feedback-loop filter are governed by a host of parameters, the most critical one, with regard to processing very broadband signals with CMOS hardware, being the number of parallel streams (equivalently, the rate-reduction parameter).

  4. Parallel processing and pipelining usher DSP model into the future

    SciTech Connect

    Kampen, T.V.; Anders, P.

    1986-02-20

    The course of digital signal processing is well plotted into the future. When standard microprocessors, constrained by their von Neumann architectures and weak arithmetic ability, proved inadequate for the task, the first specialized chips appeared. The devices were fortified with Harvard-like parallel architecture, multiplication-accumulation hardware, and instruction pipelines. In the next stage of their evolution, DSP chips will have to rely on faster IC technologies and even greater degrees of parallel operation. Two versions of such a DSP chip are planned, one with and one without data ROM and program memory, and the latter has been cast in silicon. The processor relies on high-speed CMOS technology and a parallel architecture to start an instruction every 125 ns. The chip has the processing power to handle many of the most sophisticated DSP algorithms needed in telecommunications, speech and image processing, and general industry applications. As a result, either version can replace multiple ICs in current designs, affording a single-chip solution that makes many applications practical for the first time. Moreover, its flexible I/O structure qualifies the chip for the multiple processor configurations that offer still more signal-processing power. Architecturally, twin 16-bit data buses, X and Y, connect five functional sections within the chip, all working in parallel. The sections include a 16-bit multiplier and 40-bit accumulator, an ALU teamed with a multiport register file, and combined data memory and address computation logic. Rounding out the chip's functional foundation are a versatile program control section and 16-bit serial and parallel I/O circuits.

  5. Efficient separations & processing crosscutting program

    SciTech Connect

    1996-08-01

    The Efficient Separations and Processing Crosscutting Program (ESP) was created in 1991 to identify, develop, and perfect chemical and physical separations technologies and chemical processes which treat wastes and address environmental problems throughout the DOE complex. The ESP funds several multiyear tasks that address high-priority waste remediation problems involving high-level, low-level, transuranic, hazardous, and mixed (radioactive and hazardous) wastes. The ESP supports applied research and development (R & D) leading to the demonstration or use of these separations technologies by other organizations within the Department of Energy (DOE), Office of Environmental Management.

  6. Airbreathing Propulsion System Analysis Using Multithreaded Parallel Processing

    NASA Technical Reports Server (NTRS)

    Schunk, Richard Gregory; Chung, T. J.; Rodriguez, Pete (Technical Monitor)

    2000-01-01

    In this paper, parallel processing is used to analyze the mixing, and combustion behavior of hypersonic flow. Preliminary work for a sonic transverse hydrogen jet injected from a slot into a Mach 4 airstream in a two-dimensional duct combustor has been completed [Moon and Chung, 1996]. Our aim is to extend this work to three-dimensional domain using multithreaded domain decomposition parallel processing based on the flowfield-dependent variation theory. Numerical simulations of chemically reacting flows are difficult because of the strong interactions between the turbulent hydrodynamic and chemical processes. The algorithm must provide an accurate representation of the flowfield, since unphysical flowfield calculations will lead to the faulty loss or creation of species mass fraction, or even premature ignition, which in turn alters the flowfield information. Another difficulty arises from the disparity in time scales between the flowfield and chemical reactions, which may require the use of finite rate chemistry. The situations are more complex when there is a disparity in length scales involved in turbulence. In order to cope with these complicated physical phenomena, it is our plan to utilize the flowfield-dependent variation theory mentioned above, facilitated by large eddy simulation. Undoubtedly, the proposed computation requires the most sophisticated computational strategies. The multithreaded domain decomposition parallel processing will be necessary in order to reduce both computational time and storage. Without special treatments involved in computer engineering, our attempt to analyze the airbreathing combustion appears to be difficult, if not impossible.

  7. Highly scalable parallel processing of extracellular recordings of Multielectrode Arrays.

    PubMed

    Gehring, Tiago V; Vasilaki, Eleni; Giugliano, Michele

    2015-01-01

    Technological advances of Multielectrode Arrays (MEAs) used for multisite, parallel electrophysiological recordings, lead to an ever increasing amount of raw data being generated. Arrays with hundreds up to a few thousands of electrodes are slowly seeing widespread use and the expectation is that more sophisticated arrays will become available in the near future. In order to process the large data volumes resulting from MEA recordings there is a pressing need for new software tools able to process many data channels in parallel. Here we present a new tool for processing MEA data recordings that makes use of new programming paradigms and recent technology developments to unleash the power of modern highly parallel hardware, such as multi-core CPUs with vector instruction sets or GPGPUs. Our tool builds on and complements existing MEA data analysis packages. It shows high scalability and can be used to speed up some performance critical pre-processing steps such as data filtering and spike detection, helping to make the analysis of larger data sets tractable. PMID:26737215

  8. Parallel System Architecture (PSA): An efficient approach for automatic recognition of volcano-seismic events

    NASA Astrophysics Data System (ADS)

    Cortés, Guillermo; García, Luz; Álvarez, Isaac; Benítez, Carmen; de la Torre, Ángel; Ibáñez, Jesús

    2014-02-01

    Automatic recognition of volcano-seismic events is becoming one of the most demanded features in the early warning area at continuous monitoring facilities. While human-driven cataloguing is time-consuming and often an unreliable task, an appropriate machine framework allows expert technicians to focus only on result analysis and decision-making. This work presents an alternative to serial architectures used in classic recognition systems introducing a parallel implementation of the whole process: configuration, feature extraction, feature selection and classification stages are independently carried out for each type of events in order to exploit the intrinsic properties of each signal class. The system uses Gaussian Mixture Models (GMMs) to classify the database recorded at Deception Volcano Island (Antarctica) obtaining a baseline recognition rate of 84% with a cepstral-based waveform parameterization in the serial architecture. The parallel approach increases the results to close to 92% using mixture-based parameterization vectors or up to 91% when the vector size is reduced by 19% via the Discriminative Feature Selection (DFS) algorithm. Besides the result improvement, the parallel architecture represents a major step in terms of flexibility and reliability thanks to the class-focused analysis, providing an efficient tool for monitoring observatories which require real-time solutions.

  9. Digital intermediate frequency QAM modulator using parallel processing

    DOEpatents

    Pao, Hsueh-Yuan; Tran, Binh-Nien

    2008-05-27

    The digital Intermediate Frequency (IF) modulator applies to various modulation types and offers a simple and low cost method to implement a high-speed digital IF modulator using field programmable gate arrays (FPGAs). The architecture eliminates multipliers and sequential processing by storing the pre-computed modulated cosine and sine carriers in ROM look-up-tables (LUTs). The high-speed input data stream is parallel processed using the corresponding LUTs, which reduces the main processing speed, allowing the use of low cost FPGAs.
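
    A software model of the look-up-table idea follows: the modulated cosine/sine carrier samples are pre-computed for every symbol so that modulation reduces to table reads, which is what lets the hardware process the input stream in parallel at a reduced clock. The constellation, carrier frequency and table sizes here are placeholders, not the values of the patented design.

      #include <cmath>
      #include <cstdint>
      #include <vector>

      class LutQamModulator {
      public:
          LutQamModulator(int samplesPerSymbol, double carrierCyclesPerSymbol)
              : sps_(samplesPerSymbol), lut_(4 * samplesPerSymbol) {
              const double pi = std::acos(-1.0);
              const double iLv[4] = {+1, +1, -1, -1};          // illustrative QPSK mapping
              const double qLv[4] = {+1, -1, +1, -1};
              for (int sym = 0; sym < 4; ++sym)
                  for (int k = 0; k < sps_; ++k) {
                      double phase = 2.0 * pi * carrierCyclesPerSymbol * k / sps_;
                      lut_[sym * sps_ + k] =
                          iLv[sym] * std::cos(phase) - qLv[sym] * std::sin(phase);
                  }
          }

          // Modulate a stream of 2-bit symbols; in the FPGA several such look-ups
          // run in parallel blocks instead of this serial loop.
          std::vector<double> modulate(const std::vector<uint8_t>& symbols) const {
              std::vector<double> out;
              out.reserve(symbols.size() * sps_);
              for (uint8_t s : symbols)
                  for (int k = 0; k < sps_; ++k)
                      out.push_back(lut_[(s & 0x3) * sps_ + k]);
              return out;
          }

      private:
          int sps_;
          std::vector<double> lut_;
      };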

  10. A dataflow analysis tool for parallel processing of algorithms

    NASA Technical Reports Server (NTRS)

    Jones, Robert L., III

    1993-01-01

    A graph-theoretic design process and software tool is presented for selecting a multiprocessing scheduling solution for a class of computational problems. The problems of interest are those that can be described using a dataflow graph and are intended to be executed repetitively on a set of identical parallel processors. Typical applications include signal processing and control law problems. Graph analysis techniques are introduced and shown to effectively determine performance bounds, scheduling constraints, and resource requirements. The software tool is shown to facilitate the application of the design process to a given problem.

  11. Study of parallel efficiency in message passing environments

    SciTech Connect

    Hanebutte, U.R.; Tatsumi, Masahiro

    1996-03-01

    A benchmark test using the Message Passing Interface (MPI, an emerging standard for writing message passing programs) has been developed, to study parallel performance in message passing environments. The test is comprised of a computational task of independent calculations followed by a round-robin data communication step. Performance data as a function of computational granularity and message passing requirements are presented for the IBM SPx at Argonne National Laboratory and for a cluster of quasi-dedicated Sun SPARCstation 20s. In the latter portion of the paper a widely accepted communication cost model combined with Amdahl's law is used to obtain performance predictions for unevenly distributed computational workloads.
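
    One common way to combine a linear communication cost model with Amdahl's law, written here only to fix notation (the paper's exact expressions may differ), is

      $$ T(p) \;=\; f\,T_1 \;+\; (1-f)\,\frac{T_1}{p} \;+\; n_{\mathrm{msg}}(p)\,\bigl(\alpha + \beta m\bigr), \qquad S(p) \;=\; \frac{T_1}{T(p)}, \qquad E(p) \;=\; \frac{S(p)}{p}, $$

    where T_1 is the single-processor time, f the serial fraction of the work, alpha the per-message latency, beta the per-byte transfer time, m the message size, and n_msg(p) the number of messages each process sends in the round-robin exchange.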

  12. Efficient Identification of Assembly Neurons within Massively Parallel Spike Trains

    PubMed Central

    Berger, Denise; Borgelt, Christian; Louis, Sebastien; Morrison, Abigail; Grün, Sonja

    2010-01-01

    The chance of detecting assembly activity is expected to increase if the spiking activities of large numbers of neurons are recorded simultaneously. Although such massively parallel recordings are now becoming available, methods able to analyze such data for spike correlation are still rare, as a combinatorial explosion often makes it infeasible to extend methods developed for smaller data sets. By evaluating pattern complexity distributions the existence of correlated groups can be detected, but their member neurons cannot be identified. In this contribution, we present approaches to actually identify the individual neurons involved in assemblies. Our results may complement other methods and also provide a way to reduce data sets to the “relevant” neurons, thus allowing us to carry out a refined analysis of the detailed correlation structure due to reduced computation time. PMID:19809521

  13. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, J.

    1999-01-01

    A new atmospheric objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 1 X 1 lat-lon grid with 18 levels of heights and winds and 10 levels of moisture) using 120,000 observations in 17 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g. MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system is totally portable and can run on several different architectures at once. In addition, the system can easily scale up to 100 or more CPUs. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from 1 to 32 CPUs is 18%. In addition, the analysis results are identical regardless of the number of processors used. This system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. Static tests with a 2 X 2.5 resolution version of this system showed its analysis increments are comparable to the latest NASA operational system including maintenance of mass-wind balance. Results from several months of cycling tests in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (O-F statistics) as the current operational system.

  14. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, James G.

    1999-01-01

    A new objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 2 x 2.5 lat-lon grid with 20 levels of heights and winds and 10 levels of moisture) using 120,000 observations in less than 3 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g. MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system is totally portable and can run on several different architectures at once. In addition, the system can easily scale up to 100 or more CPUs. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from 1 to 32 CPUs is 18%. In addition, the analysis results are identical regardless of the number of processors used. This system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. It also includes a new quality control (buddy check) system. Static tests with the system showed its analysis increments are comparable to the latest NASA operational system including maintenance of mass-wind balance. Results from a 2-month cycling test in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (O-F statistics) throughout the entire two months.

  15. Applications of massively parallel computers in telemetry processing

    NASA Technical Reports Server (NTRS)

    El-Ghazawi, Tarek A.; Pritchard, Jim; Knoble, Gordon

    1994-01-01

    Telemetry processing refers to the reconstruction of full resolution raw instrumentation data with artifacts, of space and ground recording and transmission, removed. Being the first processing phase of satellite data, this process is also referred to as level-zero processing. This study is aimed at investigating the use of massively parallel computing technology in providing level-zero processing to spaceflights that adhere to the recommendations of the Consultative Committee on Space Data Systems (CCSDS). The workload characteristics, of level-zero processing, are used to identify processing requirements in high-performance computing systems. An example of level-zero functions on a SIMD MPP, such as the MasPar, is discussed. The requirements in this paper are based in part on the Earth Observing System (EOS) Data and Operation System (EDOS).

  16. Morphological evidence for parallel processing of information in rat macula

    NASA Technical Reports Server (NTRS)

    Ross, M. D.

    1988-01-01

    Study of montages, tracings and reconstructions prepared from a series of 570 consecutive ultrathin sections shows that rat maculas are morphologically organized for parallel processing of linear acceleratory information. Type II cells of one terminal field distribute information to neighboring terminals as well. The findings are examined in light of physiological data which indicate that macular receptor fields have a preferred directional vector, and are interpreted by analogy to a computer technology known as an information network.

  17. Parallel Visualization Co-Processing of Overnight CFD Propulsion Applications

    NASA Technical Reports Server (NTRS)

    Edwards, David E.; Haimes, Robert

    1999-01-01

    An interactive visualization system pV3 is being developed for the investigation of advanced computational methodologies employing visualization and parallel processing for the extraction of information contained in large-scale transient engineering simulations. Visual techniques for extracting information from the data in terms of cutting planes, iso-surfaces, particle tracing and vector fields are included in this system. This paper discusses improvements to the pV3 system developed under NASA's Affordable High Performance Computing project.

  18. Parallel software requirements to the design of a general architecture: application to the image processing

    NASA Astrophysics Data System (ADS)

    Bonnin, Patrick J.; Hoeltzener-Douarin, Brigitte; Aubin, N.; Cartier, S.; Porcher, Thierry; Fiorini, P.; Zavidovique, Bertrand

    1993-10-01

    A great number of parallel computer architectures have been proposed, whether they are SIMD machines (Single Instruction Multiple Data) with lots of quite simple processors, or MIMD machines (Multiple Instruction Multiple Data) containing few, but powerful processors. Each one claims to offer some kind of optimality at the hardware level. But implementing parallel image processing algorithms so that they run in real time remains a real challenge; it concerns rather the control of communication networks between processors (message passing, circuit switching...) or the computing model (e.g. the data-parallel model). In that respect, our goal here is to point out some algorithmic needs for distributing image processing operators. They will be translated first in terms of programming models, more general than image processing applications, and then as hardware properties of the processor network. In that way, we do not design yet another parallel machine dedicated to image processing, but a more general parallel architecture on which different kinds of programming models can be implemented efficiently.

  19. An efficient parallel implementation of explicit multirate Runge–Kutta schemes for discontinuous Galerkin computations

    SciTech Connect

    Seny, Bruno; Lambrechts, Jonathan; Toulorge, Thomas; Legat, Vincent; Remacle, Jean-François

    2014-01-01

    Although explicit time integration schemes require small computational efforts per time step, their efficiency is severely restricted by their stability limits. Indeed, the multi-scale nature of some physical processes combined with highly unstructured meshes can lead some elements to impose a severely small stable time step for a global problem. Multirate methods offer a way to increase the global efficiency by gathering grid cells in appropriate groups under local stability conditions. These methods are well suited to the discontinuous Galerkin framework. The parallelization of the multirate strategy is challenging because grid cells have different workloads. The computational cost is different for each sub-time step depending on the elements involved and a classical partitioning strategy is not adequate any more. In this paper, we propose a solution that makes use of multi-constraint mesh partitioning. It tends to minimize the inter-processor communications, while ensuring that the workload is almost equally shared by every computer core at every stage of the algorithm. Particular attention is given to the simplicity of the parallel multirate algorithm while minimizing computational and communication overheads. Our implementation makes use of the MeTiS library for mesh partitioning and the Message Passing Interface for inter-processor communication. Performance analyses for two and three-dimensional practical applications confirm that multirate methods preserve important computational advantages of explicit methods up to a significant number of processors.

  20. Efficient solid state NMR powder simulations using SMP and MPP parallel computation.

    PubMed

    Kristensen, Jørgen Holm; Farnan, Ian

    2003-04-01

    Methods for parallel simulation of solid state NMR powder spectra are presented for both shared and distributed memory parallel supercomputers. For shared memory architectures the performance of simulation programs implementing the OpenMP application programming interface is evaluated. It is demonstrated that the design of correct and efficient shared memory parallel programs is difficult as the performance depends on data locality and cache memory effects. The distributed memory parallel programming model is examined for simulation programs using the MPI message passing interface. The results reveal that both shared and distributed memory parallel computation are very efficient with an almost perfect application speedup and may be applied to the most advanced powder simulations. PMID:12713968

  1. Efficiency of cellular information processing

    NASA Astrophysics Data System (ADS)

    Barato, Andre C.; Hartich, David; Seifert, Udo

    2014-10-01

    We show that a rate of conditional Shannon entropy reduction, characterizing the learning of an internal process about an external process, is bounded by the thermodynamic entropy production. This approach allows for the definition of an informational efficiency that can be used to study cellular information processing. We analyze three models of increasing complexity inspired by the Escherichia coli sensory network, where the external process is an external ligand concentration jumping between two values. We start with a simple model for which ATP must be consumed so that a protein inside the cell can learn about the external concentration. With a second model for a single receptor we show that the rate at which the receptor learns about the external environment can be nonzero even without any dissipation inside the cell since chemical work done by the external process compensates for this learning rate. The third model is more complete, also containing adaptation. For this model we show inter alia that a bacterium in an environment that changes at a very slow time-scale is quite inefficient, dissipating much more than it learns. Using the concept of a coarse-grained learning rate, we show for the model with adaptation that while the activity learns about the external signal the option of changing the methylation level increases the concentration range for which the learning rate is substantial.
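
    Written schematically from the abstract (the notation below is ours, and the precise definitions are given in the paper), the bound and the resulting informational efficiency take the form

      $$ \ell \;=\; -\,\frac{\mathrm{d}}{\mathrm{d}t}\, H\!\left(X_t \,\middle|\, Y_t\right) \;\le\; \sigma, \qquad \eta \;=\; \frac{\ell}{\sigma} \;\le\; 1, $$

    where X is the external process (the ligand concentration), Y the internal process, H the conditional Shannon entropy, and sigma the rate of thermodynamic entropy production.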

  2. [Multi-DSP parallel processing technique of hyperspectral RX anomaly detection].

    PubMed

    Guo, Wen-Ji; Zeng, Xiao-Ru; Zhao, Bao-Wei; Ming, Xing; Zhang, Gui-Feng; Lü, Qun-Bo

    2014-05-01

    To satisfy the requirements of high speed, real-time operation, and mass data storage for RX anomaly detection in hyperspectral image data, the present paper proposes a multi-DSP parallel processing system for hyperspectral images based on the CPCI Express standard bus architecture. The hardware topology of the system combines the tight coupling of four DSPs sharing a data bus and memory unit with interconnection through Link ports. On this hardware platform, parallel processing tasks are assigned to each DSP in consideration of the RX anomaly detection algorithm and the three-dimensional structure of the spectral image data, and a 4-DSP parallel processing technique is proposed that computes the mean and covariance matrix of the whole image by spatially partitioning it. The experimental results show that, for equivalent detection performance, the proposed 4-DSP parallel RX anomaly detection technique is about four times faster than processing on a single DSP. It thus overcomes the constraint that the internal storage capacity of a single DSP places on processing such large image data, while meeting the demands of real-time processing of the spectral data. PMID:25095443
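
    As an illustration of the statistics step described above, the sketch below computes the global mean and covariance of a hyperspectral cube by partitioning the pixels spatially across workers, with OpenMP threads standing in for the four DSPs; the data layout and merge strategy are assumptions made only for this example.

      #include <omp.h>
      #include <vector>

      // cube is pixel-major: the band values of pixel p start at cube[p * nBands].
      void rxStatistics(const std::vector<float>& cube, int nPixels, int nBands,
                        std::vector<double>& mean, std::vector<double>& cov) {
          mean.assign(nBands, 0.0);
          cov.assign(static_cast<size_t>(nBands) * nBands, 0.0);

          #pragma omp parallel
          {
              std::vector<double> locMean(nBands, 0.0);
              std::vector<double> locCov(static_cast<size_t>(nBands) * nBands, 0.0);

              #pragma omp for schedule(static)                 // spatial partition of the image
              for (int p = 0; p < nPixels; ++p) {
                  const float* px = &cube[static_cast<size_t>(p) * nBands];
                  for (int i = 0; i < nBands; ++i) {
                      locMean[i] += px[i];
                      for (int j = 0; j < nBands; ++j)
                          locCov[static_cast<size_t>(i) * nBands + j] += double(px[i]) * px[j];
                  }
              }
              #pragma omp critical                             // merge per-worker partial sums
              {
                  for (int i = 0; i < nBands; ++i) mean[i] += locMean[i];
                  for (size_t k = 0; k < cov.size(); ++k) cov[k] += locCov[k];
              }
          }
          for (int i = 0; i < nBands; ++i) mean[i] /= nPixels;
          for (int i = 0; i < nBands; ++i)                     // cov = E[x x^T] - mean mean^T
              for (int j = 0; j < nBands; ++j)
                  cov[static_cast<size_t>(i) * nBands + j] =
                      cov[static_cast<size_t>(i) * nBands + j] / nPixels - mean[i] * mean[j];
      }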

  3. Massively parallel spatial light modulation-based optical signal processing

    NASA Astrophysics Data System (ADS)

    Li, Yao

    1993-03-01

    A new optical parallel arithmetic processing scheme using a nonholographic optoelectronic content-addressable memory (CAM) was proposed. The design of a four-bit CAM-based optical carry look-ahead adder was studied. Compared with existing optoelectronic binary addition approaches, this nonholographic CAM scheme offers a number of practical advantages, such as faster processing speed and ease of optical implementation and alignment. For an addition of numbers longer than four bits, a number of four-bit CLAs can be cascaded by incorporating the previous stage's carry. Experimental results were also demonstrated. A paper on this work was published in Optics Letters.

  4. Parallel-Processing Equalizers for Multi-Gbps Communications

    NASA Technical Reports Server (NTRS)

    Gray, Andrew; Ghuman, Parminder; Hoy, Scott; Satorius, Edgar H.

    2004-01-01

    Architectures have been proposed for the design of frequency-domain least-mean-square complex equalizers that would be integral parts of parallel-processing digital receivers of multi-gigahertz radio signals and other quadrature-phase-shift-keying (QPSK) or 16-quadrature-amplitude-modulation (16-QAM) data signals at rates of multiple gigabits per second. Equalizers as used here denote receiver subsystems that compensate for distortions in the phase and frequency responses of the broad-band radio-frequency channels typically used to convey such signals. The proposed architectures are suitable for realization in very-large-scale integrated (VLSI) circuitry and, in particular, complementary metal oxide semiconductor (CMOS) application-specific integrated circuits (ASICs) operating at frequencies lower than modulation symbol rates. A digital receiver of the type to which the proposed architecture applies (see Figure 1) would include an analog-to-digital converter (A/D) operating at a rate, fs, of 4 samples per symbol period. To obtain the high speed necessary for sampling, the A/D and a 1:16 demultiplexer immediately following it would be constructed as GaAs integrated circuits. The parallel-processing circuitry downstream of the demultiplexer, including a demodulator followed by an equalizer, would operate at a rate of only fs/16 (in other words, at 1/4 of the symbol rate). The output from the equalizer would be four parallel streams of in-phase (I) and quadrature (Q) samples.

  5. Reducing neural network training time with parallel processing

    NASA Technical Reports Server (NTRS)

    Rogers, James L., Jr.; Lamarsh, William J., II

    1995-01-01

    Obtaining optimal solutions for engineering design problems is often expensive because the process typically requires numerous iterations involving analysis and optimization programs. Previous research has shown that a near optimum solution can be obtained in less time by simulating a slow, expensive analysis with a fast, inexpensive neural network. A new approach has been developed to further reduce this time. This approach decomposes a large neural network into many smaller neural networks that can be trained in parallel. Guidelines are developed to avoid some of the pitfalls when training smaller neural networks in parallel. These guidelines allow the engineer: to determine the number of nodes on the hidden layer of the smaller neural networks; to choose the initial training weights; and to select a network configuration that will capture the interactions among the smaller neural networks. This paper presents results describing how these guidelines are developed.

  6. Probabilistic structural mechanics research for parallel processing computers

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Martin, William R.

    1991-01-01

    Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods was hampered by their computationally intense nature. Solution of PSM problems requires repeated analyses of structures that are often large, and exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make solution of large scale PSM problems practical.

  7. Graphics Processing Unit Enhanced Parallel Document Flocking Clustering

    SciTech Connect

    Cui, Xiaohui; Potok, Thomas E; ST Charles, Jesse Lee

    2010-01-01

    Analyzing and clustering documents is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. One limitation of this method of document clustering is its complexity O(n^2). As the number of documents grows, it becomes increasingly difficult to generate results in a reasonable amount of time. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. In this paper, we have conducted research to exploit this architecture and apply its strengths to the flocking-based document clustering problem. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GEFORCE GPU. Performance gains ranged from thirty-six to nearly sixty times improvement of the GPU over the CPU implementation.

  8. Efficient data IO for a Parallel Global Cloud Resolving Model

    SciTech Connect

    Palmer, Bruce J.; Koontz, Annette S.; Schuchardt, Karen L.; Heikes, Ross P.; Randall, David A.

    2011-11-26

    Execution of a Global Cloud Resolving Model (GCRM) at target resolutions of 2-4 km will generate, at a minimum, 10s of Gigabytes of data per variable per snapshot. Writing this data to disk without creating a serious bottleneck in the execution of the GCRM code while also supporting efficient post-execution data analysis is a significant challenge. This paper discusses an Input/Output (IO) application programmer interface (API) for the GCRM that efficiently moves data from the model to disk while maintaining support for community standard formats, avoiding the creation of very large numbers of files, and supporting efficient analysis. Several aspects of the API will be discussed in detail. First, we discuss the output data layout which linearizes the data in a consistent way that is independent of the number of processors used to run the simulation and provides a convenient format for subsequent analyses of the data. Second, we discuss the flexible API interface that enables modelers to easily add variables to the output stream by specifying where in the GCRM code these variables are located and to flexibly configure the choice of outputs and distribution of data across files. The flexibility of the API is designed to allow model developers to add new data fields to the output as the model develops and new physics is added and also provides a mechanism for allowing users of the GCRM code itself to adjust the output frequency and the number of fields written depending on the needs of individual calculations. Third, we describe the mapping to the NetCDF data model with an emphasis on the grid description. Fourth, we describe our messaging algorithms and IO aggregation strategies that are used to achieve high bandwidth while simultaneously writing concurrently from many processors to shared files. We conclude with initial performance results.

  9. A simple hyperbolic model for communication in parallel processing environments

    NASA Technical Reports Server (NTRS)

    Stoica, Ion; Sultan, Florin; Keyes, David

    1994-01-01

    We introduce a model for communication costs in parallel processing environments called the 'hyperbolic model,' which generalizes two-parameter dedicated-link models in an analytically simple way. Dedicated interprocessor links parameterized by a latency and a transfer rate that are independent of load are assumed by many existing communication models; such models are unrealistic for workstation networks. The communication system is modeled as a directed communication graph in which terminal nodes represent the application processes that initiate the sending and receiving of the information and in which internal nodes, called communication blocks (CBs), reflect the layered structure of the underlying communication architecture. The direction of graph edges specifies the flow of the information carried through messages. Each CB is characterized by a two-parameter hyperbolic function of the message size that represents the service time needed for processing the message. The parameters are evaluated in the limits of very large and very small messages. Rules are given for reducing a communication graph consisting of many CBs to an equivalent two-parameter form, while maintaining an approximation for the service time that is exact in both the large- and small-message limits. The model is validated on a dedicated Ethernet network of workstations by experiments with communication subprograms arising in scientific applications, for which a tight fit of the model predictions with actual measurements of the communication and synchronization time between end processes is demonstrated. The model is then used to evaluate the performance of two simple parallel scientific applications from partial differential equations: domain decomposition and time-parallel multigrid. In an appropriate limit, we also show the compatibility of the hyperbolic model with the recently proposed LogP model.
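
    The abstract does not give the functional form of the two-parameter hyperbolic service time, so the sketch below assumes one simple form that reproduces the stated limits (a constant startup time t0 for very small messages and an asymptotic transfer rate r for very large ones); it is an illustrative guess, not the authors' formula:

        # Hedged sketch: one possible two-parameter "hyperbolic" service-time curve.
        # The record only states that each communication block is characterized by a
        # two-parameter hyperbolic function of message size whose parameters are fixed
        # by the very-small and very-large message limits; the concrete form below
        # (T -> t0 as m -> 0, T -> m/r as m -> infinity) is an assumption for
        # illustration only.
        import math

        def service_time(m_bytes, t0_seconds, rate_bytes_per_s):
            """Hyperbolic interpolation between latency- and bandwidth-dominated regimes."""
            return math.sqrt(t0_seconds**2 + (m_bytes / rate_bytes_per_s)**2)

        # Example: 50 microsecond startup cost, 10 MB/s asymptotic rate.
        for m in (1, 1_000, 1_000_000):
            print(m, "bytes ->", service_time(m, 50e-6, 10e6), "s")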

  10. A learnable parallel processing architecture towards unity of memory and computing

    PubMed Central

    Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.

    2015-01-01

    Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area. PMID:26271243

  11. A learnable parallel processing architecture towards unity of memory and computing

    NASA Astrophysics Data System (ADS)

    Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.

    2015-08-01

    Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.

  12. A learnable parallel processing architecture towards unity of memory and computing.

    PubMed

    Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J

    2015-01-01

    Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area. PMID:26271243

  13. Parallel Processing Systems for Passive Ranging During Helicopter Flight

    NASA Technical Reports Server (NTRS)

    Sridhar, Bavavar; Suorsa, Raymond E.; Showman, Robert D. (Technical Monitor)

    1994-01-01

    The complexity of rotorcraft missions involving operations close to the ground results in high pilot workload. In order to allow a pilot time to perform mission-oriented tasks, sensor-aiding and automation of some of the guidance and control functions are highly desirable. Images from an electro-optical sensor provide a covert way of detecting objects in the flight path of a low-flying helicopter. Passive ranging consists of processing a sequence of images using techniques based on optical flow computation and recursive estimation. The passive ranging algorithm has to extract obstacle information from imagery at rates varying from five to thirty or more frames per second depending on the helicopter speed. We have implemented and tested the passive ranging algorithm off-line using helicopter-collected images. However, the real-time data and computation requirements of the algorithm are beyond the capability of any off-the-shelf microprocessor or digital signal processor. This paper describes the computational requirements of the algorithm and uses parallel processing technology to meet these requirements. Various issues in the selection of a parallel processing architecture are discussed, and four different computer architectures are evaluated regarding their suitability to process the algorithm in real-time. Based on this evaluation, we conclude that real-time passive ranging is a realistic goal and can be achieved within a short time.

  14. Application of parallel distributed processing to space based systems

    NASA Technical Reports Server (NTRS)

    Macdonald, J. R.; Heffelfinger, H. L.

    1987-01-01

    The concept of using Parallel Distributed Processing (PDP) to enhance automated experiment monitoring and control is explored. Recent very large scale integration (VLSI) advances have made such applications an achievable goal. The PDP machine has demonstrated the ability to automatically organize stored information, handle unfamiliar and contradictory input data and perform the actions necessary. The PDP machine has demonstrated that it can perform inference and knowledge operations with greater speed and flexibility and at lower cost than traditional architectures. In applications where the rule set governing an expert system's decisions is difficult to formulate, PDP can be used to extract rules by associating the information an expert receives with the actions taken.

  15. Parallel-Processing Software for Creating Mosaic Images

    NASA Technical Reports Server (NTRS)

    Klimeck, Gerhard; Deen, Robert; McCauley, Michael; DeJong, Eric

    2008-01-01

    A computer program implements parallel processing for nearly real-time creation of panoramic mosaics of images of terrain acquired by video cameras on an exploratory robotic vehicle (e.g., a Mars rover). Because the original images are typically acquired at various camera positions and orientations, it is necessary to warp the images into the reference frame of the mosaic before stitching them together to create the mosaic. [Also see "Parallel-Processing Software for Correlating Stereo Images," Software Supplement to NASA Tech Briefs, Vol. 31, No. 9 (September 2007) page 26.] The warping algorithm in this computer program reflects the considerations that (1) for every pixel in the desired final mosaic, a good corresponding point must be found in one or more of the original images and (2) for this purpose, one needs a good mathematical model of the cameras and a good correlation of individual pixels with respect to their positions in three dimensions. The desired mosaic is divided into slices, each of which is assigned to one of a number of central processing units (CPUs) operating simultaneously. The results from the CPUs are gathered and placed into the final mosaic. The time taken to create the mosaic depends upon the number of CPUs, the speed of each CPU, and whether a local or a remote data-staging mechanism is used.
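
    A hedged sketch of the slice-per-CPU strategy described above is shown below, with a dummy fill standing in for the camera model and image warping; the worker function, slice layout, and process count are illustrative assumptions:

        # Hedged sketch of the slice-per-CPU strategy from the abstract: the output
        # mosaic is divided into horizontal slices, each slice is produced by a
        # separate process, and the results are gathered back into the final mosaic.
        # The real camera model and warping are replaced by a placeholder fill.
        import numpy as np
        from multiprocessing import Pool

        HEIGHT, WIDTH, N_WORKERS = 512, 2048, 4

        def render_slice(bounds):
            """Stand-in for warping source images into one slice of the mosaic."""
            row_start, row_end = bounds
            rows = np.arange(row_start, row_end)[:, None]
            cols = np.arange(WIDTH)[None, :]
            return row_start, ((rows + cols) % 256).astype(np.uint8)   # placeholder pixels

        if __name__ == "__main__":
            step = HEIGHT // N_WORKERS
            slices = [(i, min(i + step, HEIGHT)) for i in range(0, HEIGHT, step)]
            mosaic = np.zeros((HEIGHT, WIDTH), dtype=np.uint8)
            with Pool(N_WORKERS) as pool:
                for row_start, block in pool.map(render_slice, slices):
                    mosaic[row_start:row_start + block.shape[0]] = block
            print("mosaic assembled:", mosaic.shape)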

  16. Parallel processing environment for multi-flexible body dynamics

    NASA Technical Reports Server (NTRS)

    Venugopal, Ravi; Kumar, Manoj N.; Singh, Ramen P.; Taylor, Lawrence W., Jr.

    1989-01-01

    The implementation of a dynamics solution algorithm with inherent parallelism, applicable to the dynamics of large flexible space structures, is described. The algorithm is unique in that parts of the solution can be computed simultaneously by working with different branches of its tree topology. The algorithm exhibits close to O(n)-type behavior. The data flow within the solution algorithm is discussed along with results from its implementation in a multiprocessing environment. A model of the United States Space Station is used as an example. The results show that, with fast multiple scalar processors, an efficient algorithm, and symbolically generated equations of motion, real-time performance can be achieved with present-day hardware technology, even with complex dynamical models.

  17. XTP as a transport protocol for distributed parallel processing

    SciTech Connect

    Strayer, W.T.; Lewis, M.J.; Cline, R.E. Jr.

    1994-12-31

    The Xpress Transfer Protocol (XTP) is a flexible transport layer protocol designed to provide efficient service without dictating the communication paradigm or the delivery characteristics that qualify the paradigm. XTP provides the tools to build communication services appropriate to the application. Current data delivery solutions for many popular cluster computing environments use TCP and UDP. We examine TCP, UDP, and XTP with respect to the communication characteristics typical of parallel applications. We perform measurements of end-to-end latency for several paradigms important to cluster computing. An implementation of XTP is shown to be comparable to TCP in end-to-end latency on preestablished connections, and does better for paradigms where connections must be constructed on the fly.

  18. Efficient parallel algorithm for statistical ion track simulations in crystalline materials

    NASA Astrophysics Data System (ADS)

    Jeon, Byoungseon; Grønbech-Jensen, Niels

    2009-02-01

    We present an efficient parallel algorithm for statistical Molecular Dynamics simulations of ion tracks in solids. The method is based on the Rare Event Enhanced Domain following Molecular Dynamics (REED-MD) algorithm, which has been successfully applied to studies of, e.g., ion implantation into crystalline semiconductor wafers. We discuss the strategies for parallelizing the method, and we settle on a host-client polling scheme in which results from multiple asynchronous client processors are continuously fed to the host, which, in turn, distributes the resulting feedback information to the clients. This real-time feedback consists of, e.g., cumulative damage information or statistics updates necessary for the cloning in the rare event algorithm. We finally demonstrate the algorithm for radiation effects in a nuclear oxide fuel, and we show that the balanced parallel approach achieves high parallel efficiency in multiple processor configurations.

  19. An efficient parallel algorithm for O(N^2) direct summation method and its variations on distributed-memory parallel machines

    NASA Astrophysics Data System (ADS)

    Makino, Junichiro

    2002-10-01

    We present a novel, highly efficient algorithm to parallelize the O(N^2) direct summation method for N-body problems with individual timesteps on distributed-memory parallel machines such as Beowulf clusters. Previously known algorithms, in which all processors have complete copies of the N-body system, have the serious problem that the communication-computation ratio increases as we increase the number of processors, since the communication cost is independent of the number of processors. In the new algorithm, p processors are organized as a p×p two-dimensional array. Each processor has N/p particles, but the data are distributed in such a way that the complete system is represented if we look at any row or column consisting of p processors. In this algorithm, the communication cost scales as N/p, while the calculation cost scales as N^2/p. Thus, we can use a much larger number of processors without losing efficiency compared to what was practical with previously known algorithms.
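
    A serial emulation of the two-dimensional decomposition idea may make the layout clearer: the particles are split into p blocks, the task at grid position (i, j) computes only the forces of block j on block i, and the partial results are then summed along each row. The softened force law, block sizes, and particle counts below are illustrative assumptions; a real run would place each (i, j) task on its own processor with message passing:

        # Hedged sketch (serial emulation) of the 2-D decomposition described in the
        # abstract. Each "processor" (i, j) handles the block-i vs block-j interaction;
        # summing along a row plays the role of the row reduction between processors.
        import numpy as np

        def block_forces(pos_i, pos_j, eps=1.0e-3):
            """Softened gravitational accelerations on block i due to block j (G = m = 1)."""
            d = pos_j[None, :, :] - pos_i[:, None, :]            # pairwise separations
            r2 = (d**2).sum(axis=2) + eps**2
            return (d / r2[..., None]**1.5).sum(axis=1)

        def direct_sum_2d(pos, p):
            blocks = np.array_split(pos, p)
            partial = [[block_forces(blocks[i], blocks[j]) for j in range(p)]  # (i, j) tasks
                       for i in range(p)]
            # "Row reduction": row i sums its partial forces to get the total on block i.
            return np.concatenate([np.sum(partial[i], axis=0) for i in range(p)])

        rng = np.random.default_rng(1)
        pos = rng.normal(size=(64, 3))
        acc = direct_sum_2d(pos, p=4)
        print(acc.shape)    # (64, 3)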

  20. Parallel Latent Semantic Analysis using a Graphics Processing Unit

    SciTech Connect

    Cui, Xiaohui; Potok, Thomas E; Cavanagh, Joseph M

    2009-01-01

    Latent Semantic Analysis (LSA) can be used to reduce the dimensions of large Term-Document datasets using Singular Value Decomposition. However, with the ever expanding size of data sets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. The Graphics Processing Unit (GPU) can solve some highly parallel problems much faster than the traditional sequential processor (CPU). Thus, a deployable system using a GPU to speedup large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a computer cluster. In this paper, we presented a parallel LSA implementation on the GPU, using NVIDIA Compute Unified Device Architecture (CUDA) and Compute Unified Basic Linear Algebra Subprograms (CUBLAS). The performance of this implementation is compared to traditional LSA implementation on CPU using an optimized Basic Linear Algebra Subprograms library. For large matrices that have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version.
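
    For reference, the core LSA reduction the paper accelerates can be written in a few lines of plain numpy on the CPU; the GPU version performs the same truncated-SVD linear algebra with CUDA and CUBLAS. The matrix sizes and the value of k below are illustrative assumptions:

        # Hedged sketch of the core LSA step: reduce a term-document matrix to k
        # latent dimensions with a truncated SVD. This is a plain numpy (CPU)
        # reference; the paper's CUDA/CUBLAS version does the same linear algebra
        # on the GPU.
        import numpy as np

        def lsa_reduce(term_doc, k):
            """Return k-dimensional document vectors from a term-document matrix."""
            U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
            # Documents projected into the k-dimensional latent semantic space.
            return (np.diag(s[:k]) @ Vt[:k]).T          # shape: (n_docs, k)

        rng = np.random.default_rng(0)
        A = rng.random((1000, 200))                     # 1000 terms x 200 documents
        docs_k = lsa_reduce(A, k=50)
        print(docs_k.shape)                             # (200, 50)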

  1. Implementing a Gaussian Process Learning Algorithm in Mixed Parallel Environment

    SciTech Connect

    Chandola, Varun; Vatsavai, Raju

    2011-01-01

    In this paper, we present a scalability analysis of a parallel Gaussian process training algorithm to simultaneously analyze a massive number of time series. We study three different parallel implementations: using threads, MPI, and a hybrid implementation using threads and MPI. We compare the scalability for the multi-threaded implementation on three different hardware platforms: a Mac desktop with two quad-core Intel Xeon processors (16 virtual cores), a Linux cluster node with four quad-core 2.3 GHz AMD Opteron processors, and SGI Altix ICE 8200 cluster node with two quad-core Intel Xeon processors (16 virtual cores). We also study the scalability of the MPI based and the hybrid MPI and thread based implementations on the SGI cluster with 128 nodes (2048 cores). Experimental results show that the hybrid implementation scales better than the multi-threaded and MPI based implementations. The hybrid implementation, using 1536 cores, can analyze a remote sensing data set with over 4 million time series in nearly 5 seconds while the serial algorithm takes nearly 12 hours to process the same data set.

  2. Beta operations: efficient implementation of a primitive parallel operation. Technical report

    SciTech Connect

    Cohn, E.R.; Haddad, R.W.

    1986-08-01

    The ever-decreasing cost of computer processors has created great interest in multi-processor computers. However, along with the increased power that this parallelism brings comes increased complexity in programming. One approach to lessening this complexity is to provide the programmer with general-purpose parallel primitives that shield the programmer from the structure of the underlying machine. In The Connection Machine, Hillis suggests the beta operation as a parallel primitive for his hypercube-based machine. This paper explores efficient ways to perform this operation on several different well-known architectures, including the hypercube. It presents some lower bounds associated with the problem.
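
    As usually described for the Connection Machine, the beta operation sends a value from every processor to a destination and combines values arriving at the same destination with a binary operator. A sequential reference of that combining send is sketched below; the function name and calling convention are assumptions, and the report's actual concern, performing the operation efficiently on hypercubes and other architectures, is not modeled:

        # Hedged sketch: a sequential reference for the "beta operation" as commonly
        # described for the Connection Machine -- every processor sends a value to a
        # destination, and values colliding at the same destination are combined with
        # a binary operator (a parallel reduce-by-key).
        from operator import add

        def beta(op, values, destinations, n_dest):
            """Combine values sent to each destination with the operator `op`."""
            result = [None] * n_dest
            for v, d in zip(values, destinations):
                result[d] = v if result[d] is None else op(result[d], v)
            return result

        # Example: summing values routed to 4 destinations.
        print(beta(add, [1, 2, 3, 4, 5], [0, 2, 2, 3, 0], n_dest=4))   # [6, None, 5, 4]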

  3. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    NASA Astrophysics Data System (ADS)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high-resolution, high-quality video compression technologies such as H.264. Such solutions provide not only exceptional quality but also efficiency, low power, and low latency, previously unattainable in software-based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in an H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements, such as 10-bit pixel depth or a 4:2:2 chroma format, often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder can be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit depths and better color subsampling patterns such as YUV 4:2:2 or 4:4:4. Low-power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264-compliant deblocking filter.

  4. Parallel Beam Approximation for Calculation of Detection Efficiency of Crystals in PET Detector Arrays

    PubMed Central

    Komarov, Sergey; Song, Tae Yong; Wu, Heyu; Tai, Yuan-Chuan

    2014-01-01

    In this work we propose a parallel beam approximation for the computation of the detection efficiency of crystals in a PET detector array. In this approximation the detection efficiency of a crystal is estimated using the distance between source and the crystal and the pre-calculated detection cross section of the crystal in a crystal array which is calculated for a uniform parallel beam of gammas. The pre-calculated detection cross sections for a few representative incident angles and gamma energies can be used to create a look-up table to be used in simulation studies or practical implementation of scatter or random correction algorithms. Utilizing the symmetries of the square crystal array, the pre-calculated look-up tables can be relatively small. The detection cross sections can be measured experimentally, calculated analytically or simulated using a Monte Carlo (MC) approach. In this work we used a MC simulation that takes into account the energy windowing, Compton scattering and factors in the “block effect”. The parallel beam approximation was validated by a separate MC simulation using point sources located at different positions around a crystal array. Experimentally measured detection efficiencies were compared with Monte Carlo simulated detection efficiencies. Results suggest that the parallel beam approximation provides an efficient and accurate way to compute the crystal detection efficiency, which can be used for estimation of random and scatter coincidences for PET data corrections. PMID:25400292
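
    One plausible way to read the approximation, sketched below with heavy hedging, is that the tabulated parallel-beam detection cross section for the relevant incident angle and energy is scaled by the inverse-square solid-angle factor of the source-to-crystal distance; the look-up values, the interpolation, and the sigma/(4*pi*d^2) conversion are illustrative assumptions rather than the authors' exact formulation:

        # Hedged sketch: convert a pre-computed parallel-beam detection cross section
        # (tabulated per incident angle and energy) into an approximate point-source
        # detection efficiency using the source-to-crystal distance d. The conversion
        # efficiency ~ sigma(angle, energy) / (4 * pi * d^2) is one plausible reading
        # of the approximation, not necessarily the authors' exact formula.
        import math
        import numpy as np

        # Hypothetical look-up table: rows = incident angles (deg), cols = energies (keV).
        angles = np.array([0.0, 15.0, 30.0, 45.0])
        energies = np.array([350.0, 511.0])
        sigma_cm2 = np.array([[0.120, 0.100],
                              [0.115, 0.095],
                              [0.105, 0.088],
                              [0.090, 0.075]])

        def crystal_efficiency(angle_deg, energy_kev, distance_cm):
            # Interpolate over angle at each tabulated energy, then over energy.
            sig_at_e = [np.interp(angle_deg, angles, sigma_cm2[:, j]) for j in range(len(energies))]
            sigma = np.interp(energy_kev, energies, sig_at_e)
            return sigma / (4.0 * math.pi * distance_cm**2)

        print(crystal_efficiency(20.0, 511.0, distance_cm=40.0))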

  5. Massively Parallel Processing for Fast and Accurate Stamping Simulations

    NASA Astrophysics Data System (ADS)

    Gress, Jeffrey J.; Xu, Siguang; Joshi, Ramesh; Wang, Chuan-tao; Paul, Sabu

    2005-08-01

    The competitive automotive market drives automotive manufacturers to speed up vehicle development cycles and reduce lead time. Fast tooling development is one of the key areas supporting fast and short vehicle development programs (VDP). In the past ten years, stamping simulation has become the most effective validation tool for predicting and resolving potential formability and quality problems before the dies are physically made. Stamping simulation and formability analysis have become a critical business segment in the GM math-based die engineering process. As simulation has become one of the major production tools in the engineering factory, simulation speed and accuracy are two of the most important measures of stamping simulation technology. The speed and time-in-system of forming analysis are even more critical to supporting fast VDP and tooling readiness. Since 1997, the General Motors Die Center has been working jointly with our software vendor to develop and implement a parallel version of simulation software for mass production analysis applications. By 2001, this technology had matured in the form of distributed memory processing (DMP) of draw die simulations in a networked distributed-memory computing environment. In 2004, this technology was refined to massively parallel processing (MPP) and extended to line die forming analysis (draw, trim, flange, and associated spring-back) running on a dedicated computing environment. The evolution of this technology and the insight gained through the implementation of DMP/MPP technology, as well as performance benchmarks, are discussed in this publication.

  6. MASSIVELY PARALLEL LATENT SEMANTIC ANALYSES USING A GRAPHICS PROCESSING UNIT

    SciTech Connect

    Cavanagh, J.; Cui, S.

    2009-01-01

    Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. A graphics processing unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor or central processing unit (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a PC cluster. Due to the GPU's application-specific architecture, harnessing the GPU's computational prowess for LSA is a great challenge. We presented a parallel LSA implementation on the GPU, using NVIDIA Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms software. The performance of this implementation is compared to a traditional LSA implementation on a CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000x1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits of the GPU for matrices divisible by 16. It should be noted that the overall speeds for the CPU version did not vary from the norm when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.

  7. Parallel information processing channels created in the retina

    PubMed Central

    Schiller, Peter H.

    2010-01-01

    In the retina, several parallel channels originate that extract different attributes from the visual scene. This review describes how these channels arise and what their functions are. Following the introduction four sections deal with these channels. The first discusses the “ON” and “OFF” channels that have arisen for the purpose of rapidly processing images in the visual scene that become visible by virtue of either light increment or light decrement; the ON channel processes images that become visible by virtue of light increment and the OFF channel processes images that become visible by virtue of light decrement. The second section examines the midget and parasol channels. The midget channel processes fine detail, wavelength information, and stereoscopic depth cues; the parasol channel plays a central role in processing motion and flicker as well as motion parallax cues for depth perception. Both these channels have ON and OFF subdivisions. The third section describes the accessory optic system that receives input from the retinal ganglion cells of Dogiel; these cells play a central role, in concert with the vestibular system, in stabilizing images on the retina to prevent the blurring of images that would otherwise occur when an organism is in motion. The last section provides a brief overview of several additional channels that originate in the retina. PMID:20876118

  8. Parallel distributed processing: Implications for cognition and development. Technical report

    SciTech Connect

    McClelland, J.L.

    1988-07-11

    This paper provides a brief overview of the connectionist or parallel distributed processing framework for modeling cognitive processes, and considers the application of the connectionist framework to problems of cognitive development. Several aspects of cognitive development might result from the process of learning as it occurs in multi-layer networks. This learning process has the characteristic that it reduces the discrepancy between expected and observed events. As it does this, representations develop on hidden units which dramatically change both the way in which the network represents the environment from which it learns and the expectations that the network generates about environmental events. The learning process exhibits relatively abrupt transitions corresponding to stage shifts in cognitive development. These points are illustrated using a network that learns to anticipate which side of a balance beam will go down, based on the number of weights on each side of the fulcrum and their distance from the fulcrum on each side of the beam. The network is trained in an environment in which weight more frequently governs which side will go down. It recapitulates the stages of development seen in children, as well as the stage transitions, as it learns to represent weight and distance information.

  9. Efficient parallel algorithms for (delta+1)-coloring and maximal independent set problems

    SciTech Connect

    Goldberg, A.V.; Plotkin, S.A.

    1987-01-01

    An efficient technique for breaking symmetry in parallel is described. The technique works especially well on rooted trees and on graphs with a small maximum degree. In particular, a maximal independent set can be found on a constant-degree graph in O(lg*n) time on an EREW PRAM using a linear number of processors. It is shown how to apply this technique to construct more efficient parallel algorithms for several problems, including coloring of planar graphs and (delta + 1)-coloring of constant-degree graphs. Lower bounds for two related problems are proved.
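
    The classic symmetry-breaking step that this line of work builds on, deterministic coin tossing on a rooted tree, can be emulated sequentially as below; each vertex replaces its color by 2*k + b, where k is the lowest bit position in which its color differs from its parent's and b is its own bit at that position, so a few rounds collapse n initial colors to a small constant number. This is a generic illustration of the idea, not the report's exact technique or bounds:

        # Hedged sketch of one round of deterministic coin tossing on a rooted tree.
        # Each vertex starts with a distinct color (its ID); adjacent vertices stay
        # differently colored after every round while the color range shrinks rapidly.
        def recolor_round(color, parent, root):
            new = dict(color)
            for v, p in parent.items():
                if v == root:
                    continue
                diff = color[v] ^ color[p]
                k = (diff & -diff).bit_length() - 1      # lowest differing bit position
                b = (color[v] >> k) & 1
                new[v] = 2 * k + b
            return new

        # Example: a path 0 - 1 - 2 - 3 - 4 rooted at 0, initial colors = vertex IDs.
        parent = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3}
        colors = {v: v for v in parent}
        for _ in range(3):
            colors = recolor_round(colors, parent, root=0)
        print(colors)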

  10. Energy efficient perlite expansion process

    SciTech Connect

    Jenkins, K.L.

    1982-08-31

    A thermally efficient process for the expansion of perlite ore is described. The inlet port and burner of a perlite expansion chamber (preferably a vertical expander) are enclosed such that no ambient air can enter the chamber. Air and fuel are metered to the burner with the amount of air being controlled such that the fuel/air premix contains at least enough air to start and maintain minimum combustion, but not enough to provide stoichiometric combustion. At a point immediately above the burner, additional air is metered into an insulated enclosure surrounding the expansion chamber where it is preheated by the heat passing through the chamber walls. This preheated additional air is then circulated back to the burner where it provides the remainder of the air needed for combustion, normally full combustion. Flow of the burner fuel/air premix and the preheated additional air is controlled so as to maintain a long luminous flame throughout a substantial portion of the expansion chamber and also to form a moving laminar layer of air on the inner surface of the expansion chamber. Preferably the burner is a delayed mixing gas burner which materially aids in the generation of the long luminous flame. The long luminous flame and the laminar layer of air at the chamber wall eliminate hot spots in the expansion chamber, result in relatively low and uniform temperature gradients across the chamber, significantly reduce the amount of fuel consumed per unit of perlite expanded, increase the yield of expanded perlite and prevent the formation of a layer of perlite sinter on the walls of the chamber.

  11. Parallel Processing of Adaptive Meshes with Load Balancing

    NASA Technical Reports Server (NTRS)

    Das, Sajal K.; Harvey, Daniel J.; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2001-01-01

    Many scientific applications involve grids that lack a uniform underlying structure. These applications are often also dynamic in nature in that the grid structure significantly changes between successive phases of execution. In parallel computing environments, mesh adaptation of unstructured grids through selective refinement/coarsening has proven to be an effective approach. However, achieving load balance while minimizing interprocessor communication and redistribution costs is a difficult problem. Traditional dynamic load balancers are mostly inadequate because they lack a global view of system loads across processors. In this paper, we propose a novel and general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication topology, and compare its performance with a successful global load balancing environment, called PLUM, specifically created to handle adaptive unstructured applications. Our experimental results on an IBM SP2 demonstrate that the SBN-based load balancer achieves lower redistribution costs than that under PLUM by overlapping processing and data migration.

  12. Applying the Extended Parallel Process Model to workplace safety messages.

    PubMed

    Basil, Michael; Basil, Debra; Deshpande, Sameer; Lavack, Anne M

    2013-01-01

    The extended parallel process model (EPPM) proposes fear appeals are most effective when they combine threat and efficacy. Three studies conducted in the workplace safety context examine the use of various EPPM factors and their effects, especially multiplicative effects. Study 1 was a content analysis examining the use of EPPM factors in actual workplace safety messages. Study 2 experimentally tested these messages with 212 construction trainees. Study 3 replicated this experiment with 1,802 men across four English-speaking countries: Australia, Canada, the United Kingdom, and the United States. The results of these three studies (1) demonstrate the inconsistent use of EPPM components in real-world work safety communications, (2) support the necessity of self-efficacy for the effective use of threat, (3) show a multiplicative effect where communication effectiveness is maximized when all model components are present (severity, susceptibility, and efficacy), and (4) validate these findings with gory appeals across four English-speaking countries. PMID:23330856

  13. A Design Verification of the Parallel Pipelined Image Processings

    NASA Astrophysics Data System (ADS)

    Wasaki, Katsumi; Harai, Toshiaki

    2008-11-01

    This paper presents a case study of the design and verification of a parallel, pipelined image processing unit based on an extended Petri net called a Logical Colored Petri net (LCPN). This net is suitable for Flexible Manufacturing System (FMS) modeling and for discussion of structural properties. LCPN is another family of colored place/transition net (CPN) with the addition of the following features: integer value assignment of marks, representation of firing conditions as formulae based on marks' values, and coupling of output procedures with transition firing. Therefore, to study the behavior of a system modeled with this net, we provide a means of searching the reachability tree for markings.

  14. Parallel asynchronous hardware implementation of image processing algorithms

    NASA Technical Reports Server (NTRS)

    Coon, Darryl D.; Perera, A. G. U.

    1990-01-01

    Research is being carried out on hardware for a new approach to focal plane processing. The hardware involves silicon injection mode devices. These devices provide a natural basis for parallel asynchronous focal plane image preprocessing. The simplicity and novel properties of the devices would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture built from arrays of the devices would form a two-dimensional (2-D) array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuron-like asynchronous pulse-coded form through the laminar processor. No multiplexing, digitization, or serial processing would occur in the preprocessing stage. High performance is expected, based on pulse coding of input currents down to one picoampere with noise referred to input of about 10 femtoamperes. Linear pulse coding has been observed for input currents ranging up to seven orders of magnitude. Low power requirements suggest utility in space and in conjunction with very large arrays. Very low dark current and multispectral capability are possible because of hardware compatibility with the cryogenic environment of high-performance detector arrays. The aforementioned hardware development effort is aimed at systems which would integrate image acquisition and image processing.

  15. Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak

    1999-01-01

    The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.

  16. Efficient parallelization of analytic bond-order potentials for large-scale atomistic simulations

    NASA Astrophysics Data System (ADS)

    Teijeiro, C.; Hammerschmidt, T.; Drautz, R.; Sutmann, G.

    2016-07-01

    Analytic bond-order potentials (BOPs) provide a way to compute atomistic properties with controllable accuracy. For large-scale computations of heterogeneous compounds at the atomistic level, both the computational efficiency and memory demand of BOP implementations have to be optimized. Since the evaluation of BOPs is a local operation within a finite environment, the parallelization concepts known from short-range interacting particle simulations can be applied to improve the performance of these simulations. In this work, several efficient parallelization methods for BOPs that use three-dimensional domain decomposition schemes are described. The schemes are implemented into the bond-order potential code BOPfox, and their performance is measured in a series of benchmarks. Systems of up to several millions of atoms are simulated on a high performance computing system, and parallel scaling is demonstrated for up to thousands of processors.

  17. Mobile Devices and GPU Parallelism in Ionospheric Data Processing

    NASA Astrophysics Data System (ADS)

    Mascharka, D.; Pankratius, V.

    2015-12-01

    Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million datapoints in real-time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be interactively adjusted by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
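
    The per-observation computation described here, converting slant TEC to vertical TEC for millions of data points, is commonly done with a single-layer (thin-shell) mapping function; the sketch below shows that standard mapping with an assumed shell height and bias terms, not the Mahali app's exact ionosphere model, bias estimation, or GPU pipeline:

        # Hedged sketch: the standard thin-shell mapping commonly used to convert a
        # bias-corrected slant TEC observation to vertical TEC. The shell height and
        # bias parameters are illustrative assumptions.
        import math

        R_EARTH_KM = 6371.0
        SHELL_HEIGHT_KM = 350.0          # assumed ionospheric shell height

        def vertical_tec(slant_tec, elevation_deg, receiver_bias=0.0, sat_bias=0.0):
            """Map a bias-corrected slant TEC to vertical TEC via the thin-shell factor."""
            z = math.radians(90.0 - elevation_deg)                  # zenith angle at receiver
            sin_zp = R_EARTH_KM / (R_EARTH_KM + SHELL_HEIGHT_KM) * math.sin(z)
            mapping = math.sqrt(1.0 - sin_zp**2)                    # cos of zenith angle at pierce point
            return (slant_tec - receiver_bias - sat_bias) * mapping

        print(vertical_tec(45.0, elevation_deg=30.0))               # TEC units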

  18. Radon-Based Image Processing In A Parallel Pipeline Architecture

    NASA Astrophysics Data System (ADS)

    Hinkle, Eric B.; Sanz, Jorge L. C.; Jain, Anil K.

    1986-04-01

    This paper deals with a novel architecture that makes real-time projection-based algorithms a reality. The design is founded on raster-mode processing, which is exploited in a powerful and flexible pipeline. This architecture, dubbed "P3E" (Parallel Pipeline Projection Engine), supports a large variety of image processing and image analysis applications. The image processing applications include: discrete approximations of the Radon and inverse Radon transform, among other projection operators; CT reconstructions; 2-D convolutions; rotations and translations; discrete Fourier transform computations in polar coordinates; autocorrelations; etc. There is also an extensive list of key image analysis algorithms that are supported by P3E, thus making it a profound and versatile tool for projection-based computer vision. These include: projections of gray-level images along linear patterns (the Radon transform) and other curved contours; generation of multi-color digital masks; convex hull approximations; Hough transform approximations for line and curve detection; diameter computations; calculations of moments and other principal components; etc. The effectiveness of our approach and the feasibility of the proposed architecture have been demonstrated by running some of these image analysis algorithms in conventional short pipelines, to solve some important automated inspection problems. In the present paper, we will concern ourselves with reconstructing images from their linear projections, and performing convolutions via the Radon transform.

  19. Low-cost high-efficiency optical coupling using through-silicon-hole in parallel optical transceiver module

    NASA Astrophysics Data System (ADS)

    Li, Baoxia; Wan, Lixi; Lv, Yao; Gao, Wei; Yang, Chengyue; Li, Zhihua; Zhang, Xu

    2009-06-01

    We present a cost-efficient parallel optical transceiver module based on a 1×4 VCSEL array, a 1×4 PD array, and a 12-wide multimode fiber ribbon for very-short-reach application. A passive alignment technique using through-silicon-hole (TSH) has been developed to realize highly efficient butt-coupling between optoelectronic arrays and multimode fibers. In this paper, the detailed optical coupling structure, misalignment tolerance, micro-assembly process, and measurement results are mainly discussed. Finally, lensed multimode fibers formed by chemical etching are proposed, which exhibit great potential for further improvement of coupling performance.

  20. On the design and implementation of a parallel, object-oriented, image processing toolkit

    SciTech Connect

    Kamath, C; Baldwin, C; Fodor, I; Tang, N A

    2000-06-22

    Advances in technology have enabled us to collect data from observations, experiments, and simulations at an ever increasing pace. As these data sets approach the terabyte and petabyte range, scientists are increasingly using semi-automated techniques from data mining and pattern recognition to find useful information in the data. In order for data mining to be successful, the raw data must first be processed into a form suitable for the detection of patterns. When the data is in the form of images, this can involve a substantial amount of processing on very large data sets. To help make this task more efficient, they are designing and implementing an object-oriented image processing toolkit that specifically targets massively-parallel, distributed-memory architectures. They first show that it is possible to use object-oriented technology to effectively address the diverse needs of image applications. Next, they describe how they abstract out the similarities in image processing algorithms to enable re-use in the software. They will also discuss the difficulties encountered in parallelizing image algorithms on massively parallel machines as well as the bottlenecks to high performance. They will demonstrate the work using images from an astronomical data set, and illustrate how techniques such as filters and denoising through the thresholding of wavelet coefficients can be applied when a large image is distributed across several processors.

  1. Design and implementation of a parallel object-oriented image processing toolkit

    NASA Astrophysics Data System (ADS)

    Kamath, Chandrika; Baldwin, Chuck H.; Fodor, Imola K.; Tang, Nu A.

    2000-10-01

    Advances in technology have enabled us to collect data from observations, experiments, and simulations at an ever increasing pace. As these data sets approach the terabyte and petabyte range, scientists are increasingly using semi-automated techniques from data mining and pattern recognition to find useful information in the data. In order for data mining to be successful, the raw data must first be processed into a form suitable for the detection of patterns. When the data is in the form of images, this can involve a substantial amount of processing on very large data sets. To help make this task more efficient, we are designing and implementing an object-oriented image processing toolkit that specifically targets massively-parallel, distributed-memory architectures. We first show that it is possible to use object-oriented technology to effectively address the diverse needs of image applications. Next, we describe how we abstract out the similarities in image processing algorithms to enable re-use in our software. We will also discuss the difficulties encountered in parallelizing image algorithms on the massively parallel machines as well as the bottlenecks to high performance. We will demonstrate our work using images from an astronomical data set, and illustrate how techniques such as filters and denoising through the thresholding of wavelet coefficients can be applied when a large image is distributed across several processors.
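
    As a serial reference for the denoising step mentioned above (thresholding of wavelet coefficients), the sketch below uses numpy and the PyWavelets package; the wavelet, decomposition level, and threshold are illustrative choices, and the toolkit's distributed, object-oriented implementation across processors is not modeled:

        # Hedged sketch of wavelet-coefficient thresholding for image denoising,
        # written as a serial reference with PyWavelets (pywt) and numpy.
        import numpy as np
        import pywt

        def wavelet_denoise(image, wavelet="db2", level=3, threshold=0.1):
            coeffs = pywt.wavedec2(image, wavelet, level=level)
            # Keep the coarse approximation, soft-threshold every detail band.
            denoised = [coeffs[0]] + [
                tuple(pywt.threshold(band, threshold, mode="soft") for band in detail)
                for detail in coeffs[1:]
            ]
            return pywt.waverec2(denoised, wavelet)

        noisy = np.random.default_rng(0).normal(size=(256, 256))
        clean = wavelet_denoise(noisy)
        print(clean.shape)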

  2. Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme

    NASA Astrophysics Data System (ADS)

    Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.

    2015-09-01

    The planetary boundary layer (PBL) is the lowest part of the atmosphere, where its character is directly affected by contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using the graphics processing unit (GPU) massively parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core when two K40 GPUs are applied.

  3. Resolving Multiscale Processes in Tropical Cyclogenesis Using Parallel EEMD

    NASA Astrophysics Data System (ADS)

    Wu, Y.; Shen, B. W.; Cheung, S.; Li, J. L. F.; Liu, Z.

    2014-12-01

    The recent advance in high-resolution global models has suggested that improved multiscale simulations of tropical waves may help extend the lead time of tropical cyclone (TC) formation prediction (e.g., Shen et al., 2010ab, 2012, 2013a). In previous efforts in the multiscale analysis of tropical waves, the Ensemble Empirical Mode Decomposition (EEMD) has been successfully parallelized and used to detect atmospheric wave signals on different spatial scales (e.g., Shen et al., 2013b) that include Mixed Rossby Gravity (MRG) waves, the Western Wind Belt (WWB), African Easterly Waves (AEWs), etc. We now extend the related studies to examine the evolution of the large-scale waves and their association with the formation of tropical cyclones in the Atlantic for an extensive time period spanning multiple years. Our goal is to analyze the multiscale interaction in the initiation and early intensification stage of an AEW and its subsequent impact on TC genesis, which involves mainly the large-scale downscaling processes. Specific focus is on the impact of barotropic instability and the critical level (CL, or steering level) that may appear in association with the AEW. The presence of the CL is believed to play an important role in providing a favorable environment in the early TC-genesis stage in the marsupial paradigm scenario. Preliminary analysis of satellite data obtained from the newly launched Global Precipitation Measurement (GPM) mission linked to the TC genesis processes will be included.

  4. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-18

    Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  5. Variation in efficiency of parallel algorithms. [for study of stiffness matrices in planar trusses

    NASA Technical Reports Server (NTRS)

    Hayashi, A.; Melosh, R. J.; Utku, S.; Salama, M.

    1985-01-01

    The objective of the present study is to investigate the efficiency of some iterative parallel-processor linear equation solving algorithms for analyses of typical linear engineering systems. Attention is given to a set of n linear equations, Ku = p, where K is an n x n positive definite, sparsely populated, symmetric matrix, u is an n x 1 vector of unknown responses, and p is an n x 1 vector of prescribed constants. This study is concerned with a hybrid method in which iteration is used to solve the problem, while a direct method is used at the local processor level. Variations in the efficiency of parallel algorithms are explored. Measures of the efficiency are based on computer experiments with the algorithms. For all the algorithms, the wall clock time is found to decrease as the number of processors increases.

  6. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations

    NASA Technical Reports Server (NTRS)

    Stone, H. S.

    1971-01-01

    Tridiagonal linear systems of equations are solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computations on computers of the ILLIAC IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log sub 2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
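
    The recursive-doubling idea is easiest to see on the simplest case it covers, a first-order linear recurrence; the sketch below (ours, not Stone's code) expresses every x_i directly in terms of x_0 after about log2 N sweeps, and every update within a sweep is independent, so the sweeps map naturally onto an ILLIAC-IV-style array of processors.

```python
import numpy as np

def recurrence_recursive_doubling(a, b, x0):
    """Solve x[i] = a[i]*x[i-1] + b[i] (i = 1..n, x[0] = x0) by recursive doubling.

    Each sweep composes neighbouring affine maps, so after ceil(log2 n) sweeps
    the i-th map takes x0 straight to x[i]; all updates within one sweep are
    independent and could be executed in parallel."""
    A = np.asarray(a, dtype=float).copy()
    B = np.asarray(b, dtype=float).copy()
    n, step = len(A), 1
    while step < n:
        A_new, B_new = A.copy(), B.copy()
        A_new[step:] = A[step:] * A[:-step]
        B_new[step:] = A[step:] * B[:-step] + B[step:]
        A, B, step = A_new, B_new, 2 * step
    return A * x0 + B                                  # x[1], ..., x[n]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b, x0 = rng.random(1000), rng.random(1000), 1.0
    x_serial, x = [], x0
    for ai, bi in zip(a, b):                           # conventional O(n) serial sweep
        x = ai * x + bi
        x_serial.append(x)
    print(np.allclose(recurrence_recursive_doubling(a, b, x0), x_serial))   # True
```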

  7. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations.

    NASA Technical Reports Server (NTRS)

    Stone, H. S.

    1973-01-01

    Tridiagonal linear systems of equations can be solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computation on computers of the Illiac IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log(sub-2) N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.

  8. On the parallel efficiency of the Frederickson-McBryan multigrid

    NASA Technical Reports Server (NTRS)

    Decker, Naomi H.

    1990-01-01

    To take full advantage of the parallelism in a standard multigrid algorithm requires as many processors as points. However, since coarse grids contain fewer points, most processors are idle during the coarse grid iterations. Frederickson and McBryan claim that retaining all points on all grid levels (using all processors) can lead to a superconvergent algorithm. The purpose of this work is to show that the parallel superconvergent multigrid (PSMG) algorithm of Frederickson and McBryan, though it achieves perfect processor utilization, is no more efficient than a parallel implementation of standard multigrid methods. PSMG is simply a new and perhaps simpler way of achieving the same results.

  9. Efficient Extraction of Regional Subsets from Massive Climate Datasets using Parallel IO

    SciTech Connect

    Daily, Jeffrey A.; Schuchardt, Karen L.; Palmer, Bruce J.

    2010-09-16

    The size of datasets produced by current climate models is increasing rapidly to the scale of petabytes. To handle data at this scale parallel analysis tools are required, however the majority of climate analysis software remains at the scale of workstations. Further, many climate analysis tools adequately process regularly gridded data but lack sufficient features when handling unstructured grids. This paper presents a data-parallel subsetter capable of correctly handling unstructured grids while scaling to over 2000 cores. The approach is based on the partitioned global address space (PGAS) parallel programming model and one-sided communication. The paper demonstrates that IO remains the single greatest bottleneck for this domain of applications and that parallel analysis of climate data succeeds in practice.

  10. Nonlinear structural response using adaptive dynamic relaxation on a massively-parallel-processing system

    NASA Technical Reports Server (NTRS)

    Oakley, David R.; Knight, Norman F., Jr.

    1994-01-01

    A parallel adaptive dynamic relaxation (ADR) algorithm has been developed for nonlinear structural analysis. This algorithm has minimal memory requirements, is easily parallelizable and scalable to many processors, and is generally very reliable and efficient for highly nonlinear problems. Performance evaluations on single-processor computers have shown that the ADR algorithm is reliable and highly vectorizable, and that it is competitive with direct solution methods for the highly nonlinear problems considered. The present algorithm is implemented on the 512-processor Intel Touchstone DELTA system at Caltech, and it is designed to minimize the extent and frequency of interprocessor communication. The algorithm has been used to solve for the nonlinear static response of two- and three-dimensional hyperelastic systems involving contact. Impressive relative speedups have been achieved and demonstrate the high scalability of the ADR algorithm. For the class of problems addressed, the ADR algorithm represents a very promising approach for parallel-vector processing.
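
    For readers unfamiliar with dynamic relaxation, the sketch below shows the basic, non-adaptive idea that ADR builds on: fictitious damped dynamics are marched explicitly until the out-of-balance force vanishes, which yields the static solution. The automatic estimation of mass, damping and time step that makes ADR robust, and the parallel decomposition, are omitted, and all names are illustrative.

```python
import numpy as np

def dynamic_relaxation(residual, u0, mass=1.0, dt=0.05, damping=0.2,
                       tol=1e-8, max_steps=100000):
    """March fictitious damped dynamics  a = r(u)/m - c*v  until the
    out-of-balance force r(u) vanishes; the converged u is the static solution.
    Non-adaptive stand-in for ADR; residual(u) must be zero at equilibrium."""
    u = np.array(u0, dtype=float)
    v = np.zeros_like(u)
    for _ in range(max_steps):
        r = residual(u)
        if np.linalg.norm(r) < tol:
            break
        v += dt * (r / mass - damping * v)
        u += dt * v
    return u

if __name__ == "__main__":
    K = np.array([[4.0, -1.0], [-1.0, 3.0]])
    p = np.array([1.0, 2.0])
    u = dynamic_relaxation(lambda u: p - K @ u, np.zeros(2))
    print(u, np.linalg.solve(K, p))    # the two should agree
```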

  11. Fast phase processing in off-axis holography by CUDA including parallel phase unwrapping.

    PubMed

    Backoach, Ohad; Kariv, Saar; Girshovitz, Pinhas; Shaked, Natan T

    2016-02-22

    We present a parallel processing implementation for rapid extraction of the quantitative phase maps from off-axis holograms on the Graphics Processing Unit (GPU) of the computer using compute unified device architecture (CUDA) programming. To obtain an efficient implementation, we parallelized both the wrapped phase map extraction algorithm and the two-dimensional phase unwrapping algorithm. In contrast to previous implementations, we utilized an unweighted least squares phase unwrapping algorithm that better suits parallelism. We compared the proposed algorithm run times on the CPU and the GPU of the computer for various sizes of off-axis holograms. Using the GPU implementation, we extracted the unwrapped phase maps from the recorded off-axis holograms at 35 frames per second (fps) for 4 mega pixel holograms, and at 129 fps for 1 mega pixel holograms, which, to the best of our knowledge, are the fastest processing frame rates obtained so far. We then used common-path off-axis interferometric imaging to quantitatively capture the phase maps of a micro-organism with rapid flagellum movements. PMID:26906982

  12. Rapid parallel semantic processing of numbers without awareness.

    PubMed

    Van Opstal, Filip; de Lange, Floris P; Dehaene, Stanislas

    2011-07-01

    In this study, we investigate whether multiple digits can be processed at a semantic level without awareness, either serially or in parallel. In two experiments, we presented participants with two successive sets of four simultaneous Arabic digits. The first set was masked and served as a subliminal prime for the second, visible target set. According to the instructions, participants had to extract from the target set either the mean or the sum of the digits, and to compare it with a reference value. Results showed that participants applied the requested instruction to the entire set of digits that was presented below the threshold of conscious perception, because their magnitudes jointly affected the participants' decisions. Indeed, response decision could be accurately modeled as a sigmoid logistic function that pooled together the evidence provided by the four targets and, with lower weights, the four primes. In less than 800 ms, participants successfully approximated the addition and mean tasks, although they tended to overweight the large numbers, particularly in the sum task. These findings extend previous observations on ensemble coding by showing that set statistics can be extracted from abstract symbolic stimuli rather than low-level perceptual stimuli, and that an ensemble code can be represented without awareness. PMID:21489415

  13. An integrated approach to improving the parallel applications development process

    SciTech Connect

    Rasmussen, Craig E; Watson, Gregory R; Tibbitts, Beth R

    2009-01-01

    The development of parallel applications is becoming increasingly important to a broad range of industries. Traditionally, parallel programming was a niche area that was primarily exploited by scientists trying to model extremely complicated physical phenomena. It is becoming increasingly clear, however, that continued hardware performance improvements through clock scaling and feature-size reduction are simply not going to be achievable for much longer. The hardware vendors' approach to addressing this issue is to employ parallelism through multi-processor and multi-core technologies. While there is little doubt that this approach produces scaling improvements, there are still many significant hurdles to be overcome before parallelism can be employed as a general replacement for more traditional programming techniques. The Parallel Tools Platform (PTP) Project was created in 2005 in an attempt to provide developers with new tools aimed at addressing some of the parallel development issues. Since then, the introduction of a new generation of peta-scale and multi-core systems has highlighted the need for such a platform. In this paper, we describe some of the challenges facing parallel application developers, present the current state of PTP, and provide a simple case study that demonstrates how PTP can be used to locate a potential deadlock situation in an MPI code.

  14. Introducing data parallelism into climate model post-processing through a parallel version of the NCAR Command Language (NCL)

    NASA Astrophysics Data System (ADS)

    Jacob, R. L.; Xu, X.; Krishna, J.; Tautges, T.

    2011-12-01

    The relationship between the needs of post-processing climate model output and the capability of the available tools has reached a crisis point. The large volume of data currently produced by climate models is overwhelming the current, decades-old analysis workflow. The tools used to implement that workflow are now a bottleneck in the climate science discovery processes. This crisis will only worsen as ultra-high resolution global climate models with horizontal scales of 4 km or smaller, running on leadership computing facilities, begin to produce tens to hundreds of terabytes for a single, hundred-year climate simulation. While climate models have used parallelism for several years, the post-processing tools are still mostly single-threaded applications. We have created a Parallel Climate Analysis Library (ParCAL) which implements many common climate analysis operations in a data-parallel fashion using the Message Passing Interface. ParCAL has in turn been built on sophisticated packages for describing grids in parallel (the Mesh-Oriented datABase, MOAB) and for performing vector operations on arbitrary grids (Intrepid). ParCAL is also using parallel I/O through the PnetCDF library. ParCAL has been used to implement a parallel version of the NCAR Command Language (NCL). ParNCL/ParCAL not only speeds up analysis of large datasets but also allows operations to be performed on native grids, eliminating the need to transform everything to latitude-longitude grids. In most cases, users' NCL scripts can run unaltered in parallel using ParNCL.

  15. Parallel processing experiences on the Denelcor HEP computer

    SciTech Connect

    Hayes, A.H.

    1984-01-01

    Recent experiments conducted on a Denelcor HEP (Heterogeneous Element Processor) computer are discussed in this paper. Algorithm research was done on four types of problems of interest to the Los Alamos National Laboratory: (1) Monte Carlo, using GAMTAB, in which the interaction of photons with matter is analyzed; (2) Hydrodynamics; (3) Reactor Safety, in which the operation of a nuclear reactor is simulated; and (4) Particle-in-Cell, in which the electrostatic interactions of plasma beams are studied. Means of maximizing programming efficiency are analyzed, and ways of speeding up processing are identified.

  16. Low-cost reconfigurable DSP-based parallel image processing computer

    NASA Astrophysics Data System (ADS)

    Murphy, Ciaron W.; Harvey, David M.; Nicolson, Laurence J.

    1998-10-01

    To develop a cost-effective re-configurable DSP engine, it has been proposed to upgrade an existing custom designed TMS320C40 based multi-processing architecture with run-time configuration capabilities. The upgrade will consist of four Xilinx XC6200 series field programmable gate arrays which will enable concurrent algorithm structures to be efficiently mapped onto the system. Furthermore, the upgraded architecture will provide a platform for the development of adaptive routing structures, self- configuration techniques and facilitate the merging of instruction and hardware based parallelism.

  17. Efficient parallel linear scaling construction of the density matrix for Born-Oppenheimer molecular dynamics.

    PubMed

    Mniszewski, S M; Cawkwell, M J; Wall, M E; Mohd-Yusof, J; Bock, N; Germann, T C; Niklasson, A M N

    2015-10-13

    We present an algorithm for the calculation of the density matrix that for insulators scales linearly with system size and parallelizes efficiently on multicore, shared memory platforms with small and controllable numerical errors. The algorithm is based on an implementation of the second-order spectral projection (SP2) algorithm [ Niklasson, A. M. N. Phys. Rev. B 2002 , 66 , 155115 ] in sparse matrix algebra with the ELLPACK-R data format. We illustrate the performance of the algorithm within self-consistent tight binding theory by total energy calculations of gas phase poly(ethylene) molecules and periodic liquid water systems containing up to 15,000 atoms on up to 16 CPU cores. We consider algorithm-specific performance aspects, such as local vs nonlocal memory access and the degree of matrix sparsity. Comparisons to sparse matrix algebra implementations using off-the-shelf libraries on multicore CPUs, graphics processing units (GPUs), and the Intel many integrated core (MIC) architecture are also presented. The accuracy and stability of the algorithm are illustrated with long duration Born-Oppenheimer molecular dynamics simulations of 1000 water molecules and a 303 atom Trp cage protein solvated by 2682 water molecules. PMID:26574255
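
    A dense-matrix sketch of the SP2 recursion (ours, for illustration only) is given below; the linear scaling reported in the paper comes from carrying out this same recursion in thresholded sparse (ELLPACK-R) matrix algebra with threaded sparse matrix-matrix multiplications, which the sketch deliberately omits.

```python
import numpy as np

def sp2_density_matrix(H, n_occ, tol=1e-7, max_iter=100):
    """Second-order spectral projection (SP2) purification, dense-matrix sketch.

    Builds the zero-temperature density matrix of a symmetric Hamiltonian H with
    n_occ occupied states via the trace-correcting recursion X <- X^2 or
    X <- 2X - X^2 (Niklasson, Phys. Rev. B 2002, 66, 155115)."""
    n = H.shape[0]
    # Gershgorin estimates of the spectral bounds; map the spectrum into [0, 1], reversed
    radii = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))
    e_min = np.min(np.diag(H) - radii)
    e_max = np.max(np.diag(H) + radii)
    X = (e_max * np.eye(n) - H) / (e_max - e_min)
    for _ in range(max_iter):
        X2 = X @ X
        if abs(np.trace(X) - np.trace(X2)) < tol:     # effectively idempotent: done
            break
        # choose whichever projection moves the occupation (trace) toward n_occ
        if abs(np.trace(X2) - n_occ) < abs(2.0 * np.trace(X) - np.trace(X2) - n_occ):
            X = X2
        else:
            X = 2.0 * X - X2
    return X

if __name__ == "__main__":
    n = 20
    H = -(np.eye(n, k=1) + np.eye(n, k=-1))           # 1-D tight-binding chain
    D = sp2_density_matrix(H, n_occ=n // 2)           # half filling
    print(np.trace(D), np.linalg.norm(D @ D - D))     # -> 10.0 and ~0
```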

  18. Parallel-processing with surface plasmons, a new strategy for converting the broad solar spectrum

    NASA Technical Reports Server (NTRS)

    Anderson, L. M.

    1982-01-01

    A new strategy for efficient solar-energy conversion is based on parallel processing with surface plasmons: guided electromagnetic waves supported on thin films of common metals like aluminum or silver. The approach is unique in identifying a broadband carrier with suitable range for energy transport and an inelastic tunneling process which can be used to extract more energy from the more energetic carriers without requiring different materials for each frequency band. The aim is to overcome the fundamental 56-percent loss associated with mismatch between the broad solar spectrum and the monoenergetic conduction electrons used to transport energy in conventional silicon solar cells. This paper presents a qualitative discussion of the unknowns and barrier problems, including ideas for coupling surface plasmons into the tunnels, a step which has been the weak link in the efficiency chain.

  19. A methodology for exploiting parallelism in the finite element process

    NASA Technical Reports Server (NTRS)

    Adams, L. M.; Voigt, R. G.

    1983-01-01

    A methodology is described for developing a parallel system using a top down approach taking into account the requirements of the user. Substructuring, a popular technique in structural analysis, is used to illustrate this approach.

  20. The finite element machine: An experiment in parallel processing

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Peebles, S. W.; Crockett, T. W.; Knott, J. D.; Adams, L.

    1982-01-01

    The finite element machine is a prototype computer designed to support parallel solutions to structural analysis problems. The hardware architecture and support software for the machine, initial solution algorithms and test applications, and preliminary results are described.

  1. Voltage and Reactive Power Control by Parallel Calculation Processing

    NASA Astrophysics Data System (ADS)

    Michihata, Masashi; Aoki, Hidenori; Mizutani, Yoshibumi

    This paper presents a new approach to optimal voltage and reactive power control based on a genetic algorithm (GA) and a tabu search (TS). To reduce the time required to compute the control procedure, parallel computation under Linux is employed. In addition, the TS and GA computations are distributed between the master and each slave using a parallel programming language. The effectiveness of the proposed method is demonstrated on a practical 118-bus system.

  2. Parallelized CCHE2D flow model with CUDA Fortran on Graphics Process Units

    Technology Transfer Automated Retrieval System (TEKTRAN)

    This paper presents the CCHE2D implicit flow model parallelized using CUDA Fortran programming technique on Graphics Processing Units (GPUs). A parallelized implicit Alternating Direction Implicit (ADI) solver using Parallel Cyclic Reduction (PCR) algorithm on GPU is developed and tested. This solve...

  3. Parallel processing for nonlinear dynamics simulations of structures including rotating bladed-disk assemblies

    NASA Technical Reports Server (NTRS)

    Hsieh, Shang-Hsien

    1993-01-01

    The principal objective of this research is to develop, test, and implement coarse-grained, parallel-processing strategies for nonlinear dynamic simulations of practical structural problems. There are contributions to four main areas: finite element modeling and analysis of rotational dynamics, numerical algorithms for parallel nonlinear solutions, automatic partitioning techniques to effect load-balancing among processors, and an integrated parallel analysis system.

  4. A new method to customize protein expression vectors for fast, efficient and background free parallel cloning

    PubMed Central

    2013-01-01

    Background Expression and purification of correctly folded proteins typically require screening of different parameters such as protein variants, solubility enhancing tags or expression hosts. Parallel vector series that cover all variations are available, but not without compromise. We have established a fast, efficient and absolutely background free cloning approach that can be applied to any selected vector. Results Here we describe a method to tailor selected expression vectors for parallel Sequence and Ligation Independent Cloning. SLIC cloning enables precise and sequence independent engineering and is based on joining vector and insert with 15–25 bp homologies on both DNA ends by homologous recombination. We modified expression vectors based on pET, pFastBac and pTT backbones for parallel PCR-based cloning and screening in E. coli, insect cells and HEK293E cells, respectively. We introduced the toxic ccdB gene under control of a strong constitutive promoter for counterselection of insert-less vector. In contrast to DpnI treatment commonly used to reduce vector background, ccdB used in our vector series is 100% efficient in killing parental-vector-carrying cells and reduces vector background to zero. In addition, the 3’ end of ccdB functions as a primer binding site common to all vectors. The second shared primer binding site is provided by a HRV 3C protease cleavage site located downstream of purification and solubility enhancing tags for tag removal. We have so far generated more than 30 different parallel expression vectors, and successfully cloned and expressed more than 250 genes with this vector series. There is no size restriction for gene insertion; clone efficiency is >95%, with clone numbers up to 200. The procedure is simple, fast, efficient and cost-effective. All expression vectors showed efficient expression of eGFP and different target proteins requested to be produced and purified at our Core Facility services. Conclusion This new

  5. Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR

    NASA Astrophysics Data System (ADS)

    Carroll-Nellenback, Jonathan J.; Shroyer, Brandon; Frank, Adam; Ding, Chen

    2013-03-01

    Current adaptive mesh refinement (AMR) simulations require algorithms that are highly parallelized and manage memory efficiently. As compute engines grow larger, AMR simulations will require algorithms that achieve new levels of efficient parallelization and memory management. We have attempted to employ new techniques to achieve both of these goals. Patch or grid based AMR often employs ghost cells to decouple the hyperbolic advances of each grid on a given refinement level. This decoupling allows each grid to be advanced independently. In AstroBEAR we utilize this independence by threading the grid advances on each level with preference going to the finer level grids. This allows for global load balancing instead of level by level load balancing and allows for greater parallelization across both physical space and AMR level. Threading of level advances can also improve performance by interleaving communication with computation, especially in deep simulations with many levels of refinement. While we see improvements of up to 30% on deep simulations run on a few cores, the speedup is typically more modest (5-20%) for larger scale simulations. To improve memory management we have employed a distributed tree algorithm that requires processors to only store and communicate local sections of the AMR tree structure with neighboring processors. Using this distributed approach we are able to get reasonable scaling efficiency (>80%) out to 12288 cores and up to 8 levels of AMR - independent of the use of threading.

  6. Parallel processing architecture for computing inverse differential kinematic equations of the PUMA arm

    NASA Technical Reports Server (NTRS)

    Hsia, T. C.; Lu, G. Z.; Han, W. H.

    1987-01-01

    In advanced robot control problems, on-line computation of the inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be implemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.

  7. Characteristics-Based Methods for Efficient Parallel Integration of the Atmospheric Dynamical Equations

    NASA Astrophysics Data System (ADS)

    Norman, Matthew Ross

    The social need for realistic atmospheric simulation in weather prediction, climate change attribution, seasonal forecasting, and climate projection is great. To obtain realistic simulations, we need more physical processes included in the model with greater fidelity and finer spatial resolution. Spatial resolution primarily drives the need for computational resources because reducing the model grid spacing by a factor f requires f^4 times more computation (assuming 3-D refinement). This compute power comes from large parallel machines with 10,000s of separate nodes and accelerators such as graphics processing units (GPUs), making efficiency a complicated problem. Efficient parallel integration algorithms need low internode communication, minimal synchronization, large time steps, and clustered computation. To this end, we propose new characteristics-based methods for the atmospheric dynamical equations with these properties in mind. These schemes are capable of simulating at a large CFL time step in only one stage of computations, needing only one copy of the state variables. They are implemented in a 2-D non-hydrostatic compressible equation set in an x-z (horizontal-vertical) Cartesian plane to simulate buoyancy-driven flows such as rising thermals and internal gravity waves. The schemes are implemented to run on CPU and multi-GPU architectures using Nvidia's CUDA (Compute Unified Device Architecture) language to test relative efficiency. Even without memory tuning, the GPU code showed roughly 2.5x (5x) better performance per Watt. With optimization, this could increase by an order of magnitude. The methods can use any spatial interpolant, so two major formulations are proposed and tested. One uses WENO interpolants which are pre-computed, and the other uses standard polynomials and computes them on-the-fly. The advantage of on-the-fly calculations is a significant reduction in the volume of data communicated to and from the GPU's slow global memory. In some

  8. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-11

    Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  9. The finite element machine: An experiment in parallel processing

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Peebles, S. W.; Crockett, T. W.; Knott, J. D.; Adams, L.

    1982-01-01

    The Finite Element Machine at the NASA Langley Research Center is a prototype computer designed to support parallel solutions to structural analysis problems. The hardware architecture and support software for the machine, initial solution algorithms and test applications, and preliminary results are described. Directions for future work are presented.

  10. Parallel design of JPEG-LS encoder on graphics processing units

    NASA Astrophysics Data System (ADS)

    Duan, Hao; Fang, Yong; Huang, Bormin

    2012-01-01

    With recent technical advances in graphics processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on an NVIDIA GPU using the compute unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.

  11. Control of automatic processes: A parallel distributed-processing model of the stroop effect. Technical report

    SciTech Connect

    Cohen, J.D.; Dunbar, K.; McClelland, J.L.

    1988-06-16

    A growing body of evidence suggests that traditional views of automaticity are in need of revision. For example, automaticity has often been treated as an all-or-none phenomenon, and traditional theories have held that automatic processes are independent of attention. Yet recent empirical data suggest that automatic processes are continuous, and furthermore are subject to attentional control. In this paper, we present a model of attention which addresses these issues. Using a parallel distributed processing framework we propose that the attributes of automaticity depend upon the strength of a process and that strength increases with training. Using the Stroop effect as an example, we show how automatic processes are continuous and emerge gradually with practice. Specifically, we present a computational model of the Stroop task which simulates the time course of processing as well as the effects of learning.

  12. A Parallel Processing Algorithm for Remote Sensing Classification

    NASA Technical Reports Server (NTRS)

    Gualtieri, J. Anthony

    2005-01-01

    A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level computers using the Linux operating system. For example, on the Medusa cluster at NASA/GSFC, this provides supercomputing performance, 130 Gflops (Linpack benchmark), at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
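
    The task decomposition described here is independent of the classifier, so it can be sketched with a single-node process pool standing in for the cluster's task farm; the per-tile classifier below is a trivial placeholder, not the paper's SVM, and all names are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def classify_tile(tile):
    """Placeholder per-tile classifier; in the paper each independent task would
    run a trained SVM over one block of hyperspectral pixels."""
    return (tile[..., 0] > tile[..., 0].mean()).astype(np.uint8)

def classify_image(cube, n_tiles=8, workers=4):
    """Split a (rows, cols, bands) cube into row blocks and classify them in
    parallel; the tasks are independent, so the speedup is limited mainly by
    IO and load balance."""
    tiles = np.array_split(cube, n_tiles, axis=0)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        labels = list(pool.map(classify_tile, tiles))
    return np.concatenate(labels, axis=0)

if __name__ == "__main__":
    cube = np.random.rand(512, 512, 16)      # synthetic stand-in for a hyperspectral cube
    print(classify_image(cube).shape)        # (512, 512)
```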

  13. Parallel processing numerical method for confined vortex dynamics and applications

    NASA Astrophysics Data System (ADS)

    Bistrian, Diana Alina

    2013-10-01

    This paper explores a combined analytical and numerical technique to investigate the hydrodynamic instability of confined swirling flows, with application to vortex rope dynamics in a Francis turbine diffuser, under sophisticated boundary constraints. We present a new approach based on the method of orthogonal decomposition in the Hilbert space, implemented with a spectral descriptor scheme in discrete space. A parallel implementation of the numerical scheme is conducted, reducing the computational time compared to other techniques.

  14. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, D.B.

    1996-12-31

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.

  15. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, Dario B.

    1996-01-01

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.

  16. Toward a Model Framework of Generalized Parallel Componential Processing of Multi-Symbol Numbers

    ERIC Educational Resources Information Center

    Huber, Stefan; Cornelsen, Sonja; Moeller, Korbinian; Nuerk, Hans-Christoph

    2015-01-01

    In this article, we propose and evaluate a new model framework of parallel componential multi-symbol number processing, generalizing the idea of parallel componential processing of multi-digit numbers to the case of negative numbers by considering the polarity signs similar to single digits. In a first step, we evaluated this account by defining…

  17. Double Take: Parallel Processing by the Cerebral Hemispheres Reduces Attentional Blink

    ERIC Educational Resources Information Center

    Scalf, Paige E.; Banich, Marie T.; Kramer, Arthur F.; Narechania, Kunjan; Simon, Clarissa D.

    2007-01-01

    Recent data have shown that parallel processing by the cerebral hemispheres can expand the capacity of visual working memory for spatial locations (J. F. Delvenne, 2005) and attentional tracking (G. A. Alvarez & P. Cavanagh, 2005). Evidence that parallel processing by the cerebral hemispheres can improve item identification has remained elusive.…

  18. Studies in optical parallel processing. [All optical and electro-optic approaches

    NASA Technical Reports Server (NTRS)

    Lee, S. H.

    1978-01-01

    Threshold and A/D devices for converting a gray scale image into a binary one were investigated for all-optical and opto-electronic approaches to parallel processing. Integrated optical logic circuits (IOC) and optical parallel logic devices (OPAL) were studied as an approach to processing optical binary signals. In the IOC logic scheme, a single row of an optical image is coupled into the IOC substrate at a time through an array of optical fibers. Parallel processing is carried out on each image element of these rows in the IOC substrate, and the resulting output exits via a second array of optical fibers. The OPAL system for parallel processing, which uses a Fabry-Perot interferometer for image thresholding and analog-to-digital conversion, achieves a higher degree of parallel processing than is possible with IOC.

  19. Parallel damage in mitochondrial and lysosomal compartments promotes efficient cell death with autophagy: The case of the pentacyclic triterpenoids

    PubMed Central

    Martins, Waleska K.; Costa, Érico T.; Cruz, Mário C.; Stolf, Beatriz S.; Miotto, Ronei; Cordeiro, Rodrigo M.; Baptista, Maurício S.

    2015-01-01

    The role of autophagy in cell death is still controversial and a lot of debate has concerned the transition from its pro-survival to its pro-death roles. The similar structure of the triterpenoids Betulinic (BA) and Oleanolic (OA) acids allowed us to prove that this transition involves parallel damage in mitochondria and lysosome. After treating immortalized human skin keratinocytes (HaCaT) with either BA or OA, we evaluated cell viability, proliferation and mechanism of cell death, function and morphology of mitochondria and lysosomes, and the status of the autophagy flux. We also quantified the interactions of BA and OA with membrane mimics, both in-vitro and in-silico. Essentially, OA caused mitochondrial damage that relied on autophagy to rescue cellular homeostasis, which failed upon lysosomal inhibition by Chloroquine or Bafilomycin-A1. BA caused parallel damage on mitochondria and lysosome, turning autophagy into a destructive process. The higher cytotoxicity of BA correlated with its stronger efficiency in damaging membrane mimics. Based on these findings, we underlined the concept that autophagy will turn into a destructive outcome when there is parallel damage in mitochondrial and lysosomal membranes. We trust that this concept will help the development of new drugs against aggressive cancers. PMID:26213355

  20. Efficient parallelization of short-range molecular dynamics simulations on many-core systems.

    PubMed

    Meyer, R

    2013-11-01

    This article introduces a highly parallel algorithm for molecular dynamics simulations with short-range forces on single node multi- and many-core systems. The algorithm is designed to achieve high parallel speedups for strongly inhomogeneous systems like nanodevices or nanostructured materials. In the proposed scheme the calculation of the forces and the generation of neighbor lists are divided into small tasks. The tasks are then executed by a thread pool according to a dependent task schedule. This schedule is constructed in such a way that a particle is never accessed by two threads at the same time. Benchmark simulations on a typical 12-core machine show that the described algorithm achieves excellent parallel efficiencies above 80% for different kinds of systems and all numbers of cores. For inhomogeneous systems the speedups are strongly superior to those obtained with spatial decomposition. Further benchmarks were performed on an Intel Xeon Phi coprocessor. These simulations demonstrate that the algorithm scales well to large numbers of cores. PMID:24329381

  1. Efficient parallelization of short-range molecular dynamics simulations on many-core systems

    NASA Astrophysics Data System (ADS)

    Meyer, R.

    2013-11-01

    This article introduces a highly parallel algorithm for molecular dynamics simulations with short-range forces on single node multi- and many-core systems. The algorithm is designed to achieve high parallel speedups for strongly inhomogeneous systems like nanodevices or nanostructured materials. In the proposed scheme the calculation of the forces and the generation of neighbor lists are divided into small tasks. The tasks are then executed by a thread pool according to a dependent task schedule. This schedule is constructed in such a way that a particle is never accessed by two threads at the same time. Benchmark simulations on a typical 12-core machine show that the described algorithm achieves excellent parallel efficiencies above 80% for different kinds of systems and all numbers of cores. For inhomogeneous systems the speedups are strongly superior to those obtained with spatial decomposition. Further benchmarks were performed on an Intel Xeon Phi coprocessor. These simulations demonstrate that the algorithm scales well to large numbers of cores.
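
    A much-simplified illustration of task-based short-range force evaluation is sketched below (ours, in Python purely for readability): space is divided into cells of the cutoff width, each task owns one cell and writes only to its own particles, so concurrently running tasks never write to the same particle. The paper's dependent task schedule, which also avoids concurrent reads and exploits Newton's third law, is not reproduced here.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def cell_forces(ci, cells, pos, cutoff=1.0):
    """Toy short-range force task for one cell (1-D, linear repulsion).

    Reads positions from cells ci-1, ci, ci+1 but writes only to the particles
    it owns (cell ci), so tasks running at the same time never write to the
    same particle."""
    owners = cells[ci]
    neigh = np.concatenate([cells[c] for c in (ci - 1, ci, ci + 1) if 0 <= c < len(cells)])
    out = np.zeros(owners.size)
    for k, i in enumerate(owners):
        d = pos[neigh] - pos[i]
        mask = (np.abs(d) < cutoff) & (neigh != i)
        out[k] = -np.sum(d[mask])                   # net push away from nearby particles
    return owners, out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = np.sort(rng.uniform(0.0, 16.0, 2000))
    edges = np.arange(0.0, 17.0)                    # 16 cells, width = cutoff
    cells = [np.flatnonzero((pos >= lo) & (pos < hi)) for lo, hi in zip(edges, edges[1:])]
    forces = np.zeros_like(pos)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for owners, f in pool.map(lambda c: cell_forces(c, cells, pos), range(len(cells))):
            forces[owners] = f
    print(forces[:3])
```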

  2. Efficient uncertainty quantification of large two-dimensional optical systems with a parallelized stochastic Galerkin method.

    PubMed

    Zubac, Z; Fostier, J; De Zutter, D; Vande Ginste, D

    2015-11-30

    It is well-known that geometrical variations due to manufacturing tolerances can degrade the performance of optical devices. In recent literature, polynomial chaos expansion (PCE) methods were proposed to model this statistical behavior. Nonetheless, traditional PCE solvers require a lot of memory and their computational complexity leads to prohibitively long simulation times, making these methods non-tractable for large optical systems. The uncertainty quantification (UQ) of various types of large, two-dimensional lens systems is presented in this paper, leveraging a novel parallelized Multilevel Fast Multipole Method (MLFMM) based Stochastic Galerkin Method (SGM). It is demonstrated that this technique can handle large optical structures in reasonable time, e.g., a stochastic lens system with more than 10 million unknowns was solved in less than an hour by using 3 compute nodes. The SGM, which is an intrusive PCE method, guarantees the accuracy of the method. Through the combination with MLFMM, the use of a preconditioner, and the construction and implementation of a parallelized algorithm, high efficiency is achieved. This is demonstrated with parallel scalability graphs. The novel approach is illustrated for different types of lens systems, and numerical results are validated against a collocation method, which relies on reusing a traditional deterministic solver. The last example concerns a Cassegrain system with five random variables, for which a speed-up of more than 12× compared to a collocation method is achieved. PMID:26698717

  3. Increasing Efficiency: A Process-Oriented Approach.

    ERIC Educational Resources Information Center

    Harbour, Jerry L.

    1993-01-01

    Discussion of the need to increase efficiency focuses on a process-oriented approach for systematically identifying and minimizing non-value-adding process steps to analyze and improve tasks, services, and production. Highlights include a historical perspective, a discussion of wasted efforts, and a case study. (Contains 16 references.) (LRW)

  4. HHFW Heating Efficiency on NSTX versus B_phi and Antenna k_parallel

    SciTech Connect

    Hosea, J.; Bell, R.; Bernabei, S.; LeBlanc, B.; Phillips, C. K.; Wilson, J. R.; Delgado-Aparicio, L.; Tritz, K.; Ryan, P.; Wilgen, J.; Sabbagh, S.; Yuh, H.

    2007-09-28

    HHFW RF power delivered to the core plasma of NSTX is strongly reduced as the launched wavelength is increased--for B_phi = 4.5 kG, heating is ~1/2 as effective at k_phi = -7 m^-1 as at 14 m^-1 and ~1/10 as effective at -3 m^-1. Measured edge ion heating, attributable to parametric decay (PDI), increases with wavelength but not fast enough to account for the observed power loss. Surface fast waves (FW) may enhance both PDI and also losses in sheaths and structures around the machine--FW fields propagate closer to the wall with decreasing B_phi and k_parallel (onset n_e proportional to B_phi x k_parallel^2). A dramatic increase in core heating efficiency is observed at -7 m^-1 when B_phi is increased to 5.5 kG--central T_e near 4 keV at P_RF = 2 MW. Also, the PDI losses are a weak function of B_phi and k_parallel, whereas the far-field RF poloidal magnetic field (at 5.5 kG) increases by a factor of ~3 when k_parallel is reduced from 14 m^-1 to -3 m^-1, suggesting a large increase in wall/sheath power loss and a major effect of surface fast waves on edge losses.

  5. Efficient Parallel Global Optimization for High Resolution Hydrologic and Climate Impact Models

    NASA Astrophysics Data System (ADS)

    Shoemaker, C. A.; Mueller, J.; Pang, M.

    2013-12-01

    High-resolution hydrologic models are typically computationally expensive, requiring many minutes or perhaps hours for one simulation. Optimization can be used with these models for parameter estimation or for analyzing management alternatives. However, optimization of these computationally expensive simulations requires algorithms that can obtain accurate answers with relatively few simulations to avoid infeasibly long computation times. We have developed a number of efficient parallel algorithms and software codes for optimization of expensive problems with multiple local minima. This is open source software we are distributing. It runs in Matlab and Python, and has been run on the Yellowstone supercomputer. The talk will quickly discuss the characteristics of the problem (e.g. the presence of integer as well as continuous variables, the number of dimensions, the availability of parallel/grid computing, the number of simulations that can be allowed to find a solution, etc.) that determine which algorithms are most appropriate for each type of problem. A major application of this optimization software is for parameter estimation for nonlinear hydrologic models, including contaminant transport in the subsurface (e.g. for groundwater remediation or multi-phase flow for carbon sequestration), nutrient transport in watersheds, and climate models. We will present results for carbon sequestration plume monitoring (multi-phase, multi-constituent), for groundwater remediation, and for the CLM climate model. The carbon sequestration example is based on the Frio CO2 field site and the groundwater example is for a 50,000 acre remediation site (with a model requiring about 1 hour per simulation). Parallel speed-ups are excellent in most cases, and our serial and parallel algorithms tend to outperform alternative methods on complex computationally expensive simulations that have multiple global minima.

  6. Flexible, efficient and robust algorithm for parallel execution and coupling of components in a framework

    NASA Astrophysics Data System (ADS)

    Tóth, Gábor

    2006-05-01

    We describe a general algorithm suitable for executing and coupling components of a software framework on a parallel computer. The requirements of a flexible, efficient and robust algorithm are defined precisely, and the motivation for the requirements is demonstrated on several examples. In short, the requirements are the following: (i) the algorithm should allow arbitrary distribution of processors among the components, (ii) it should allow arbitrary coupling schedule between the components, (iii) it should not use any inter-processor communication other than already required by the components and their couplings, and (iv) it should never get into a dead-lock. We show that the proposed algorithm based on the Temporal and Predefined Ordering of Tasks (TPOT) satisfies all these requirements. The TPOT algorithm has been implemented in the Space Weather Modeling Framework. The flexibility and efficiency of the algorithm is demonstrated with several examples.

  7. On the efficiency of exchange in parallel tempering monte carlo simulations.

    PubMed

    Predescu, Cristian; Predescu, Mihaela; Ciobanu, Cristian V

    2005-03-10

    We introduce the concept of effective fraction, defined as the expected probability that a configuration from the lowest index replica successfully reaches the highest index replica during a replica exchange Monte Carlo simulation. We then argue that the effective fraction represents an adequate measure of the quality of the sampling technique, as far as swapping is concerned. Under the hypothesis that the correlation between successive exchanges is negligible, we propose a technique for the computation of the effective fraction, a technique that relies solely on the values of the acceptance probabilities obtained at the end of the simulation. The effective fraction is then utilized for the study of the efficiency of a popular swapping scheme in the context of parallel tempering in the canonical ensemble. For large dimensional oscillators, we show that the swapping probability that minimizes the computational effort is 38.74%. By studying the parallel tempering swapping efficiency for a 13-atom Lennard-Jones cluster, we argue that the value of 38.74% remains roughly the optimal probability for most systems with continuous distributions that are likely to be encountered in practice. PMID:16851481
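
    To make the swapping scheme concrete, the sketch below (ours, not the authors' code) performs one sweep of nearest-neighbour replica-exchange attempts in the canonical ensemble; the measured acceptance fraction is the quantity one would tune toward the roughly 39% target discussed above, and the toy energy model is purely illustrative.

```python
import numpy as np

def replica_exchange_sweep(energies, betas, rng):
    """Attempt swaps between neighbouring replicas.  energies[i] is the energy of
    the configuration currently at inverse temperature betas[i]; a swap is
    accepted with probability min(1, exp[(beta_i - beta_j)(E_i - E_j)]), which
    preserves detailed balance in the extended ensemble."""
    accepted = 0
    for i in range(len(betas) - 1):
        delta = (betas[i] - betas[i + 1]) * (energies[i] - energies[i + 1])
        if np.log(rng.random()) < delta:            # same as rand() < exp(delta)
            energies[i], energies[i + 1] = energies[i + 1], energies[i]
            accepted += 1
    return accepted / (len(betas) - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas = 1.0 / np.geomspace(0.1, 10.0, 16)       # geometric temperature ladder
    # toy sampler: Boltzmann energies of a system with three quadratic degrees
    # of freedom, E ~ Gamma(3/2, k_B T), drawn fresh before every sweep
    rates = [replica_exchange_sweep(rng.gamma(1.5, 1.0 / betas), betas, rng)
             for _ in range(2000)]
    print(np.mean(rates))                           # average swap acceptance
```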

  8. Signal processing applications of massively parallel charge domain computing devices

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Barhen, Jacob (Inventor); Toomarian, Nikzad (Inventor)

    1999-01-01

    The present invention is embodied in a charge coupled device (CCD)/charge injection device (CID) architecture capable of performing a Fourier transform by simultaneous matrix vector multiplication (MVM) operations in respective plural CCD/CID arrays in parallel in O(1) steps. For example, in one embodiment, a first CCD/CID array stores charge packets representing a first matrix operator based upon permutations of a Hartley transform and computes the Fourier transform of an incoming vector. A second CCD/CID array stores charge packets representing a second matrix operator based upon different permutations of a Hartley transform and computes the Fourier transform of an incoming vector. The incoming vector is applied to the inputs of the two CCD/CID arrays simultaneously, and the real and imaginary parts of the Fourier transform are produced simultaneously in the time required to perform a single MVM operation in a CCD/CID array.

  9. Fault-tolerant interconnection network and image-processing applications for the PASM parallel processing system

    SciTech Connect

    Adams, G.B. III

    1984-01-01

    The demand for very high speed data processing coupled with falling hardware costs has made large-scale parallel and distributed computer systems both desirable and feasible. Two modes of parallel processing are single instruction stream-multiple data stream (SIMD) and multiple instruction stream-multiple data stream (MIMD). PASM, a partitionable SIMD/MIMD system, is a reconfigurable multimicroprocessor system being designed for image processing and pattern recognition. An important component of these systems is the interconnection network, the mechanism for communication among the computation nodes and memories. Assuring high reliability for such complex systems is a significant task. Thus, a crucial practical aspect of an interconnection network is fault tolerance. In answer to this need, the Extra Stage Cube (ESC), a fault-tolerant, multistage cube-type interconnection network, is defined. The fault tolerance of the ESC is explored for both single and multiple faults, routing tags are defined, and consideration is given to permuting data and partitioning the ESC in the presence of faults. The ESC is compared with other fault-tolerant multistage networks. Finally, the reliability of the ESC and an enhanced version of it are investigated.

  10. Parallel workflow tools to facilitate human brain MRI post-processing

    PubMed Central

    Cui, Zaixu; Zhao, Chenxi; Gong, Gaolang

    2015-01-01

    Multi-modal magnetic resonance imaging (MRI) techniques are widely applied in human brain studies. To obtain specific brain measures of interest from MRI datasets, a number of complex image post-processing steps are typically required. Parallel workflow tools have recently been developed, concatenating individual processing steps and enabling fully automated processing of raw MRI data to obtain the final results. These workflow tools are also designed to make optimal use of available computational resources and to support the parallel processing of different subjects or of independent processing steps for a single subject. Automated, parallel MRI post-processing tools can greatly facilitate relevant brain investigations and are being increasingly applied. In this review, we briefly summarize these parallel workflow tools and discuss relevant issues. PMID:26029043

  11. Parallel-Processing CMOS Circuitry for M-QAM and 8PSK TCM

    NASA Technical Reports Server (NTRS)

    Gray, Andrew; Lee, Dennis; Hoy, Scott; Fisher, Dave; Fong, Wai; Ghuman, Parminder

    2009-01-01

    There has been some additional development of parts reported in "Multi-Modulator for Bandwidth-Efficient Communication" (NPO-40807), NASA Tech Briefs, Vol. 32, No. 6 (June 2009), page 34. The focus was on 1) The generation of M-order quadrature amplitude modulation (M-QAM) and octonary-phase-shift-keying, trellis-coded modulation (8PSK TCM), 2) The use of square-root raised-cosine pulse-shaping filters, 3) A parallel-processing architecture that enables low-speed [complementary metal oxide/semiconductor (CMOS)] circuitry to perform the coding, modulation, and pulse-shaping computations at a high rate; and 4) Implementation of the architecture in a CMOS field-programmable gate array.

  12. Creation of the BMA ensemble for SST using a parallel processing technique

    NASA Astrophysics Data System (ADS)

    Kim, Kwangjin; Lee, Yang Won

    2013-10-01

    Although they serve the same purpose, satellite products differ in value because of their inescapable uncertainties. These products have also been generated over long periods, and their kinds and volumes are large and diverse, so efforts to reduce the uncertainty and to handle the enormous data are necessary. In this paper, we create an ensemble Sea Surface Temperature (SST) using MODIS Aqua, MODIS Terra and COMS (Communication, Ocean and Meteorological Satellite). We used Bayesian Model Averaging (BMA) as the ensemble method. The principle of BMA is to synthesize the conditional probability density function (PDF) using the posterior model probabilities as weights. The posterior probabilities are estimated using the EM algorithm, and the BMA PDF is obtained as a weighted average of the member PDFs. As a result, the ensemble SST showed the lowest RMSE and MAE, which demonstrates the applicability of BMA to satellite data ensembles. As future work, parallel processing techniques using the Hadoop framework will be adopted for more efficient computation on very large satellite datasets.
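
    Once the EM step has produced the posterior weights and per-model spreads, the BMA combination itself is a weighted average; the sketch below (ours, with illustrative shapes and synthetic data) shows that combination step only, not the EM estimation or the Hadoop-based parallelization.

```python
import numpy as np

def bma_combine(members, weights, sigmas):
    """Combine k co-located SST fields by Bayesian Model Averaging.

    members: (k, ny, nx) stack of satellite SST estimates,
    weights: length-k posterior model probabilities (e.g. from EM, summing to 1),
    sigmas:  length-k predictive standard deviations of the individual models.
    The BMA mean is the weighted average of the members; the predictive variance
    adds the between-model spread to the weighted within-model variance."""
    w = np.reshape(weights, (-1, 1, 1))
    s2 = np.reshape(np.square(sigmas), (-1, 1, 1))
    mean = np.sum(w * members, axis=0)
    var = np.sum(w * (s2 + (members - mean) ** 2), axis=0)
    return mean, var

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = 20.0 + rng.standard_normal((180, 360))         # synthetic SST field
    members = np.stack([truth + rng.normal(0.0, s, truth.shape) for s in (0.4, 0.6, 0.8)])
    mean, var = bma_combine(members, weights=[0.5, 0.3, 0.2], sigmas=[0.4, 0.6, 0.8])
    print(np.sqrt(np.mean((mean - truth) ** 2)))            # RMSE of the BMA mean
```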

  13. Arts Integration Parallels Between Music and Reading: Process, Product and Affective Response.

    ERIC Educational Resources Information Center

    Merrion, Margaret Dee

    The process of aesthetic education is not limited to the fine arts. Parallels may be identified in the language arts and particularly in the art of creative reading. As in a musical experience, a creative reader will apprehend the content of the literature and couple personal feelings with the events of the reading experience. Parallel brain…

  14. Efficient time-dependent density functional theory approximations for hybrid density functionals: analytical gradients and parallelization.

    PubMed

    Petrenko, Taras; Kossmann, Simone; Neese, Frank

    2011-02-01

    In this paper, we present the implementation of efficient approximations to time-dependent density functional theory (TDDFT) within the Tamm-Dancoff approximation (TDA) for hybrid density functionals. For the calculation of the TDDFT/TDA excitation energies and analytical gradients, we combine the resolution of identity (RI-J) algorithm for the computation of the Coulomb terms and the recently introduced "chain of spheres exchange" (COSX) algorithm for the calculation of the exchange terms. It is shown that for extended basis sets, the RIJCOSX approximation leads to speedups of up to 2 orders of magnitude compared to traditional methods, as demonstrated for hydrocarbon chains. The accuracy of the adiabatic transition energies, excited state structures, and vibrational frequencies is assessed on a set of 27 excited states for 25 molecules with the configuration interaction singles and hybrid TDDFT/TDA methods using various basis sets. Compared to the canonical values, the typical error in transition energies is of the order of 0.01 eV. Similar to the ground-state results, excited state equilibrium geometries differ by less than 0.3 pm in the bond distances and 0.5° in the bond angles from the canonical values. The typical error in the calculated excited state normal coordinate displacements is of the order of 0.01, and relative error in the calculated excited state vibrational frequencies is less than 1%. The errors introduced by the RIJCOSX approximation are, thus, insignificant compared to the errors related to the approximate nature of the TDDFT methods and basis set truncation. For TDDFT/TDA energy and gradient calculations on Ag-TB2-helicate (156 atoms, 2732 basis functions), it is demonstrated that the COSX algorithm parallelizes almost perfectly (speedup ~26-29 for 30 processors). The exchange-correlation terms also parallelize well (speedup ~27-29 for 30 processors). The solution of the Z-vector equations shows a speedup of ~24 on 30 processors. The

  16. Design of a dataway processor for a parallel image signal processing system

    NASA Astrophysics Data System (ADS)

    Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

    1995-04-01

    Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called the 'dataway processor', designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates 8 bits in parallel in full-duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel, so sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated in 0.5-micrometer CMOS technology and amounts to about 200 K gates.

  17. Advantages of Parallel Processing and the Effects of Communications Time

    NASA Technical Reports Server (NTRS)

    Eddy, Wesley M.; Allman, Mark

    2000-01-01

    Many computing tasks involve heavy mathematical calculations, or analyzing large amounts of data. These operations can take a long time to complete using only one computer. Networks such as the Internet provide many computers with the ability to communicate with each other. Parallel or distributed computing takes advantage of these networked computers by arranging them to work together on a problem, thereby reducing the time needed to obtain the solution. The drawback to using a network of computers to solve a problem is the time wasted in communicating between the various hosts. The application of distributed computing techniques to a space environment or to use over a satellite network would therefore be limited by the amount of time needed to send data across the network, which would typically take much longer than on a terrestrial network. This experiment shows how much faster a large job can be performed by adding more computers to the task, what role communications time plays in the total execution time, and the impact a long-delay network has on a distributed computing system.
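
    As a rough, illustrative model only (the formulas and numbers below are assumptions, not the experiment's measurements), the trade-off described above can be sketched in Python: total time is the perfectly divisible compute time split across hosts plus a communication cost that grows with the number of hosts, so on a long-delay (e.g. satellite) link the useful number of hosts is smaller:

        def elapsed_time(compute_seconds, n_hosts, comm_seconds_per_host):
            # compute share per host plus a per-host communication overhead
            return compute_seconds / n_hosts + comm_seconds_per_host * n_hosts

        def speedup(compute_seconds, n_hosts, comm_seconds_per_host):
            return compute_seconds / elapsed_time(compute_seconds, n_hosts, comm_seconds_per_host)

        for n in (1, 2, 4, 8, 16):
            print(n, round(speedup(3600.0, n, comm_seconds_per_host=20.0), 2))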

  18. Parallel Block Structured Adaptive Mesh Refinement on Graphics Processing Units

    SciTech Connect

    Beckingsale, D. A.; Gaudin, W. P.; Hornung, R. D.; Gunney, B. T.; Gamblin, T.; Herdman, J. A.; Jarvis, S. A.

    2014-11-17

    Block-structured adaptive mesh refinement is a technique that can be used when solving partial differential equations to reduce the number of zones necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a native GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an eight-node cluster, and over four thousand nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and has been scaled to over four thousand GPUs using a combination of MPI and CUDA.
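
    The coarsen and refine operators mentioned above can be pictured with a small NumPy sketch (an illustration of the 2x2 averaging/injection idea under assumed conventions, not the library's GPU kernels; the function names are hypothetical):

        import numpy as np

        def coarsen(fine):
            # average each 2x2 block of a fine patch into one coarse zone (restriction)
            ny, nx = fine.shape
            return fine.reshape(ny // 2, 2, nx // 2, 2).mean(axis=(1, 3))

        def refine(coarse):
            # fill a fine patch by piecewise-constant injection from the coarse patch
            return np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)

        patch = np.arange(16.0).reshape(4, 4)
        assert coarsen(refine(patch)).shape == patch.shape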

  19. Model-integrated program synthesis environment for parallel/real-time image processing

    NASA Astrophysics Data System (ADS)

    Moore, Michael S.; Sztipanovitz, Janos; Karsai, Gabor; Nichols, James A.

    1997-09-01

    In this paper, it is shown that, through the use of model-integrated program synthesis (MIPS), parallel real-time implementations of image processing data flows can be synthesized from high-level graphical specifications. The complex details inherent in parallel and real-time software development become transparent to the programmer, enabling the cost-effective exploitation of parallel hardware for building more flexible and powerful real-time imaging systems. The model-integrated real-time image processing system (MIRTIS) is presented as an example. MIRTIS employs the multigraph architecture (MGA), a framework and set of tools for building MIPS systems, to generate parallel real-time image processing software which runs under the control of a parallel run-time kernel on a network of Texas Instruments TMS320C40 DSPs (C40s). The MIRTIS models contain graphical declarations of the image processing computations to be performed, the available hardware resources, and the timing constraints of the application. The MIRTIS model interpreter performs the parallelization, scaling, and mapping of the computations to the resources automatically or determines that the timing constraints cannot be met with the available resources. MIRTIS is a clear example of how parallel real-time image processing systems can be built which are (1) cost-effectively programmable, (2) flexible, (3) scalable, and (4) built from commercial off-the-shelf (COTS) components.

  20. A Parallel Distributed Processing Model of Story Comprehension and Recall.

    ERIC Educational Resources Information Center

    Golden, Richard M.; Rumelhart, David E.

    1993-01-01

    Introduces a multistate probabilistic causal chain notation for describing the knowledge structures implicitly represented by the subjective conditional probability distribution. Proposes a psychological process model of how story comprehension and recall processes operate using causal chain representations. Compares the model's story-recall…

  1. Algorithmic Processes for Increasing Design Efficiency.

    ERIC Educational Resources Information Center

    Terrell, William R.

    1983-01-01

    Discusses the role of algorithmic processes as a supplementary method for producing cost-effective and efficient instructional materials. Examines three approaches to problem solving in the context of developing training materials for the Naval Training Command: application of algorithms, quasi-algorithms, and heuristics. (EAO)

  2. Control of automatic processes: A parallel distributed-processing account of the Stroop effect. Technical report

    SciTech Connect

    Cohen, J.D.; Dunbar, K.; McClelland, J.L.

    1989-11-22

    A growing body of evidence suggests that traditional views of automaticity are in need of revision. For example, automaticity has often been treated as an all-or-none phenomenon, and traditional theories have held that automatic processes are independent of attention. Yet recent empirical data suggest that automatic processes are continuous, and furthermore are subject to attentional control. In this paper we present a model of attention which addresses these issues. Using a parallel distributed processing framework we propose that the attributes of automaticity depend upon the strength of a processing pathway and that strength increases with training. Using the Stroop effect as an example, we show how automatic processes are continuous and emerge gradually with practice. Specifically, we present a computational model of the Stroop task which simulates the time course of processing as well as the effects of learning. This was accomplished by combining the cascade mechanism described by McClelland (1979) with the back propagation learning algorithm (Rumelhart, Hinton, Williams, 1986). The model is able to simulate performance in the standard Stroop task, as well as aspects of performance in variants of this task which manipulate SOA, response set, and degree of practice. In the discussion we contrast our model with other models, and indicate how it relates to many of the central issues in the literature on attention, automaticity, and interference.
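
    The cascade mechanism referred to above can be illustrated with a minimal Python sketch (the parameter names and values are hypothetical; the back-propagation training and the full Stroop network are not shown): each unit's activation moves only a fraction of the way toward its current net input on every time step, so stronger pathways settle faster and produce shorter simulated reaction times:

        def cascade_step(activation, net_input, rate=0.1):
            # one cascade update: activation approaches its net input gradually
            return activation + rate * (net_input - activation)

        # time course of a single unit settling toward a constant net input of 1.0
        a, trace = 0.0, []
        for t in range(30):
            a = cascade_step(a, net_input=1.0)
            trace.append(a)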

  3. Parallel plan execution with self-processing networks

    NASA Technical Reports Server (NTRS)

    Dautrechy, C. Lynne; Reggia, James A.

    1989-01-01

    A critical issue for space operations is how to develop and apply advanced automation techniques to reduce the cost and complexity of working in space. In this context, it is important to examine how recent advances in self-processing networks can be applied for planning and scheduling tasks. For this reason, the feasibility of applying self-processing network models to a variety of planning and control problems relevant to spacecraft activities is being explored. Goals are to demonstrate that self-processing methods are applicable to these problems, and that MIRRORS/II, a general purpose software environment for implementing self-processing models, is sufficiently robust to support development of a wide range of application prototypes. Using MIRRORS/II and marker passing modelling techniques, a model of the execution of a Spaceworld plan was implemented. This is a simplified model of the Voyager spacecraft which photographed Jupiter, Saturn, and their satellites. It is shown that plan execution, a task usually solved using traditional artificial intelligence (AI) techniques, can be accomplished using a self-processing network. The fact that self-processing networks were applied to other space-related tasks, in addition to the one discussed here, demonstrates the general applicability of this approach to planning and control problems relevant to spacecraft activities. It is also demonstrated that MIRRORS/II is a powerful environment for the development and evaluation of self-processing systems.

  4. Parallel and series fed microstrip array with high efficiency and low cross polarization

    NASA Technical Reports Server (NTRS)

    Huang, John (Inventor)

    1995-01-01

    A microstrip array antenna for vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications with a physical area of 1.7 m by 0.17 m comprises two rows of patch elements and employs a parallel feed to left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have tapered distribution with identical transmission line sections, terminated with half wavelength long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the two left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

  5. BioFVM: an efficient, parallelized diffusive transport solver for 3-D biological simulations

    PubMed Central

    Ghaffarizadeh, Ahmadreza; Friedman, Samuel H.; Macklin, Paul

    2016-01-01

    Motivation: Computational models of multicellular systems require solving systems of PDEs for release, uptake, decay and diffusion of multiple substrates in 3D, particularly when incorporating the impact of drugs, growth substrates and signaling factors on cell receptors and subcellular systems biology. Results: We introduce BioFVM, a diffusive transport solver tailored to biological problems. BioFVM can simulate release and uptake of many substrates by cell and bulk sources, diffusion and decay in large 3D domains. It has been parallelized with OpenMP, allowing efficient simulations on desktop workstations or single supercomputer nodes. The code is stable even for large time steps, with linear computational cost scalings. Solutions are first-order accurate in time and second-order accurate in space. The code can be run by itself or as part of a larger simulator. Availability and implementation: BioFVM is written in C ++ with parallelization in OpenMP. It is maintained and available for download at http://BioFVM.MathCancer.org and http://BioFVM.sf.net under the Apache License (v2.0). Contact: paul.macklin@usc.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26656933
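
    BioFVM itself is a C++/OpenMP implementation of an implicit scheme; purely to show the equation being solved, here is a toy explicit NumPy update of the diffusion-decay-source equation d(rho)/dt = D*Laplacian(rho) - lambda*rho + S (the boundary handling, stability limits and parameter values here are assumptions, not the library's):

        import numpy as np

        def explicit_step(rho, D, decay, source, dx, dt):
            # forward-Euler update with a 5-point Laplacian and periodic wrap-around
            lap = (np.roll(rho, 1, 0) + np.roll(rho, -1, 0) +
                   np.roll(rho, 1, 1) + np.roll(rho, -1, 1) - 4.0 * rho) / dx ** 2
            return rho + dt * (D * lap - decay * rho + source)

        rho = np.zeros((64, 64))
        src = np.zeros_like(rho); src[32, 32] = 1.0   # a single point source
        for _ in range(100):
            rho = explicit_step(rho, D=1.0, decay=0.01, source=src, dx=20.0, dt=0.01)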

  6. An iterative expanding and shrinking process for processor allocation in mixed-parallel workflow scheduling.

    PubMed

    Huang, Kuo-Chan; Wu, Wei-Ya; Wang, Feng-Jian; Liu, Hsiao-Ching; Hung, Chun-Hao

    2016-01-01

    Parallel computation has been widely applied in a variety of large-scale scientific and engineering applications. Many studies indicate that exploiting both task and data parallelism, i.e. mixed-parallel workflows, to solve large computational problems can achieve better efficiency than either pure task parallelism or pure data parallelism alone. Scheduling traditional workflows of pure task parallelism on parallel systems has long been known to be an NP-complete problem. Mixed-parallel workflow scheduling has to deal with an additional challenging issue of processor allocation. In this paper, we explore the processor allocation issue in scheduling mixed-parallel workflows of moldable tasks, called M-tasks, and propose an Iterative Allocation Expanding and Shrinking (IAES) approach. Compared to previous approaches, our IAES has two distinguishing features. The first is allocating more processors to the tasks on allocated critical paths for effectively reducing the makespan of workflow execution. The second is allowing the processor allocation of an M-task to shrink during the iterative procedure, resulting in a more flexible and effective process for finding better allocations. The proposed IAES approach has been evaluated with a series of simulation experiments and compared to several well-known previous methods, including CPR, CPA, MCPA, and MCPA2. The experimental results indicate that our IAES approach outperforms those previous methods significantly in most situations, especially when nodes of the same layer in a workflow might have unequal workloads. PMID:27504236

  7. A framework for parallelized efficient global optimization with application to vehicle crashworthiness optimization

    NASA Astrophysics Data System (ADS)

    Hamza, Karim; Shalaby, Mohamed

    2014-09-01

    This article presents a framework for simulation-based design optimization of computationally expensive problems, where economizing the generation of sample designs is highly desirable. One popular approach for such problems is efficient global optimization (EGO), where an initial set of design samples is used to construct a kriging model, which is then used to generate new 'infill' sample designs at regions of the search space where there is high expectancy of improvement. This article attempts to address one of the limitations of EGO, where generation of infill samples can become a difficult optimization problem in its own right, as well as allow the generation of multiple samples at a time in order to take advantage of parallel computing in the evaluation of the new samples. The proposed approach is tested on analytical functions, and then applied to the vehicle crashworthiness design of a full Geo Metro model undergoing frontal crash conditions.
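
    The usual EGO infill criterion is expected improvement computed from the kriging mean and standard deviation; the short sketch below shows that standard closed form (for minimization) only, not the article's parallel multi-sample infill strategy, and the candidate values are made up:

        import numpy as np
        from scipy.stats import norm

        def expected_improvement(mu, sigma, best_observed):
            # standard EI for minimization at candidates with kriging mean mu and std sigma
            sigma = np.maximum(sigma, 1e-12)
            z = (best_observed - mu) / sigma
            return (best_observed - mu) * norm.cdf(z) + sigma * norm.pdf(z)

        mu = np.array([1.2, 0.9, 1.5]); sigma = np.array([0.3, 0.05, 0.6])
        print(np.argmax(expected_improvement(mu, sigma, best_observed=1.0)))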

  8. Performance of a VME-based parallel processing LIDAR data acquisition system (summary)

    SciTech Connect

    Moore, K.; Buttler, B.; Caffrey, M.; Soriano, C.

    1995-05-01

    It may be possible to make accurate real-time, autonomous, 2- and 3-dimensional wind measurements remotely with an elastic backscatter Light Detection and Ranging (LIDAR) system by incorporating digital parallel processing hardware into the data acquisition system. In this paper, we report the performance of a commercially available digital parallel processing system in implementing the maximum correlation technique for wind sensing using actual LIDAR data. Timing and numerical accuracy are benchmarked against a standard microprocessor implementation.

  9. A generic simulation cell method for developing extensible, efficient and readable parallel computational models

    NASA Astrophysics Data System (ADS)

    Honkonen, I.

    2015-03-01

    I present a method for developing extensible and modular computational models without sacrificing serial or parallel performance or source code readability. By using a generic simulation cell method I show that it is possible to combine several distinct computational models to run in the same computational grid without requiring modification of existing code. This is an advantage for the development and testing of, e.g., geoscientific software as each submodel can be developed and tested independently and subsequently used without modification in a more complex coupled program. An implementation of the generic simulation cell method presented here, generic simulation cell class (gensimcell), also includes support for parallel programming by allowing model developers to select which simulation variables of, e.g., a domain-decomposed model to transfer between processes via a Message Passing Interface (MPI) library. This allows the communication strategy of a program to be formalized by explicitly stating which variables must be transferred between processes for the correct functionality of each submodel and the entire program. The generic simulation cell class requires a C++ compiler that supports a version of the language standardized in 2011 (C++11). The code is available at https://github.com/nasailja/gensimcell for everyone to use, study, modify and redistribute; those who do are kindly requested to acknowledge and cite this work.

  10. Parallel processing of Prestack Kirchhoff Time Migration on a PC Cluster

    NASA Astrophysics Data System (ADS)

    Dai, Hengchang

    2005-08-01

    This paper discusses an approach that implements parallel processing of 3-D Prestack Kirchhoff Time Migration (PKTM) on a low-cost PC Cluster by using the Message Passing Interface (MPI), and analyses its performance using real seismic data as an example. The PC Cluster provides a significant acceleration of the migration processing with the exact same image quality. The ratio between the communication time and processing time is a critical indicator for determining the efficiency of the PC Cluster. If the processing time is longer than the communication time, using more CPUs can efficiently reduce the elapsed time; otherwise, using more CPUs cannot reduce the elapsed time. Applying this approach to the Alba dataset on our PC Cluster with up to 15 CPUs, the elapsed time of PKTM is inversely proportional to the number of CPUs used. The elapsed time for migrating a 2-D seismic line is reduced from 15 h using one CPU to 1 h using 15 CPUs. The elapsed time for migrating a 3-D image is reduced from 630 h using one CPU to 42 h using 15 CPUs. Further reduction can be achieved by using more CPUs. However, an optimal CPU number is expected for an application on large PC clusters with hundreds of nodes. Adapting existing algorithms to the cluster environment offers the potential to allow the application of more accurate algorithms for PKTM to construct a more accurate image. This work has proven that the PC Cluster is a powerful and scalable computing resource for oil and gas exploration organizations.
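
    The trace-parallel idea can be sketched with mpi4py (a hypothetical illustration, not the paper's code: migrate_trace is a placeholder for the Kirchhoff summation, and the array sizes are arbitrary); each rank migrates its share of the traces into a partial image, and the partial images are summed on the root rank:

        import numpy as np
        from mpi4py import MPI

        def migrate_trace(trace, image):
            image += trace.mean()              # placeholder for the Kirchhoff summation kernel

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        traces = np.random.rand(1200, 500)     # stand-in for the input seismic traces
        partial = np.zeros((200, 300))
        for trace in traces[rank::size]:       # round-robin split of traces over ranks
            migrate_trace(trace, partial)

        image = np.zeros_like(partial) if rank == 0 else None
        comm.Reduce(partial, image, op=MPI.SUM, root=0)   # sum partial images on rank 0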

  11. Architecture and design of a 500-MHz gallium-arsenide processing element for a parallel supercomputer

    NASA Technical Reports Server (NTRS)

    Fouts, Douglas J.; Butner, Steven E.

    1991-01-01

    The design of the processing element of GASP, a GaAs supercomputer with a 500-MHz instruction issue rate and 1-GHz subsystem clocks, is presented. The novel, functionally modular, block data flow architecture of GASP is described. The architecture and design of a GASP processing element is then presented. The processing element (PE) is implemented in a hybrid semiconductor module with 152 custom GaAs ICs of eight different types. The effects of the implementation technology on both the system-level architecture and the PE design are discussed. SPICE simulations indicate that parts of the PE are capable of being clocked at 1 GHz, while the rest of the PE uses a 500-MHz clock. The architecture utilizes data flow techniques at a program block level, which allows efficient execution of parallel programs while maintaining reasonably good performance on sequential programs. A simulation study of the architecture indicates that an instruction execution rate of over 30,000 MIPS can be attained with 65 PEs.

  12. Sculpting in cyberspace: Parallel processing the development of new software

    NASA Technical Reports Server (NTRS)

    Fisher, Rob

    1993-01-01

    Stimulating creativity in problem solving, particularly where software development is involved, is applicable to many disciplines. Metaphorical thinking keeps the problem in focus but in a different light, jarring people out of their mental ruts and sparking fresh insights. It forces the mind to stretch to find patterns between dissimilar concepts, in the hope of discovering unusual ideas in odd associations (Technology Review January 1993, p. 37). With a background in Engineering and Visual Design from MIT, I have for the past 30 years pursued a career as a sculptor of interdisciplinary monumental artworks that bridge the fields of science, engineering and art. Since 1979, I have pioneered the application of computer simulation to solve the complex problems associated with these projects. A recent project for the roof of the Carnegie Science Center in Pittsburgh made particular use of the metaphoric creativity technique described above. The problem-solving process led to the creation of hybrid software combining scientific, architectural and engineering visualization techniques. David Steich, a Doctoral Candidate in Electrical Engineering at Penn State, was commissioned to develop special software that enabled me to create innovative free-form sculpture. This paper explores the process of inventing the software through a detailed analysis of the interaction between an artist and a computer programmer.

  13. Efficient iteration in data-parallel programs with irregular and dynamically distributed data structures

    SciTech Connect

    Littlefield, R.J.

    1990-02-01

    To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. This provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using it can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.

  14. Highly efficient numerical algorithm based on random trees for accelerating parallel Vlasov-Poisson simulations

    NASA Astrophysics Data System (ADS)

    Acebrón, Juan A.; Rodríguez-Rozas, Ángel

    2013-10-01

    An efficient numerical method based on a probabilistic representation for the Vlasov-Poisson system of equations in Fourier space has been derived. This has been done theoretically for arbitrary-dimensional problems, and specialized to one-dimensional problems for numerical purposes. Such a representation has been validated theoretically in the linear regime by comparing the solution obtained with the classical results of the linear Landau damping. The numerical strategy followed requires generating suitable random trees combined with a Padé approximant for approximating accurately a given divergent series. Such series are obtained by summing the partial contributions to the solution coming from trees with an arbitrary number of branches. These contributions, coming in general from multi-dimensional definite integrals, are efficiently computed by a quasi-Monte Carlo method. It is shown how the accuracy of the method can be effectively increased by considering more terms of the series. The new representation was used successfully to develop a Probabilistic Domain Decomposition method suited for massively parallel computers, which improves the scalability found in classical methods. Finally, a few numerical examples based on classical phenomena such as non-linear Landau damping and the two-stream instability are given, illustrating the remarkable performance of the algorithm when its results are compared with those obtained using a classical method.

  15. SIESTA-PEXSI: Massively parallel method for efficient and accurate ab initio materials simulation

    NASA Astrophysics Data System (ADS)

    Lin, Lin; Huhs, Georg; Garcia, Alberto; Yang, Chao

    2014-03-01

    We describe how to combine the pole expansion and selected inversion (PEXSI) technique with the SIESTA method, which uses numerical atomic orbitals for Kohn-Sham density functional theory (KSDFT) calculations. The PEXSI technique can efficiently utilize the sparsity pattern of the Hamiltonian matrix and the overlap matrix generated from codes such as SIESTA, and solves KSDFT without using a cubic-scaling matrix diagonalization procedure. The complexity of PEXSI scales at most quadratically with respect to the system size, and the accuracy is comparable to that obtained from full diagonalization. One distinct feature of PEXSI is that it achieves low-order scaling without using the near-sightedness property and can therefore be applied to metals as well as insulators and semiconductors, at room temperature or even lower temperatures. The PEXSI method is highly scalable, and the recently developed massively parallel PEXSI technique can make efficient use of 10,000-100,000 processors on high performance machines. We demonstrate the performance of the SIESTA-PEXSI method using several examples of large-scale electronic structure calculations, including a long DNA chain and graphene-like structures with more than 20,000 atoms. Funded by a Luis Alvarez fellowship at LBNL, and a DOE SciDAC project in partnership with BES.

  16. An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

    NASA Astrophysics Data System (ADS)

    Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

    2015-07-01

    This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for the current class of multicore architectures. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism; however, when it comes to 3D data migration, the resource requirements of the algorithm grow as the data size increases. This challenges its practical implementation even on current-generation high-performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute-intensive part of the Kirchhoff depth migration algorithm is the calculation of traveltime tables, due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for poststack and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM-series supercomputer. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.

  17. Non-parallel processing: Gendered attrition in academic computer science

    NASA Astrophysics Data System (ADS)

    Cohoon, Joanne Louise Mcgrath

    2000-10-01

    This dissertation addresses the issue of disproportionate female attrition from computer science as an instance of gender segregation in higher education. By adopting a theoretical framework from organizational sociology, it demonstrates that the characteristics and processes of computer science departments strongly influence female retention. The empirical data identifies conditions under which women are retained in the computer science major at comparable rates to men. The research for this dissertation began with interviews of students, faculty, and chairpersons from five computer science departments. These exploratory interviews led to a survey of faculty and chairpersons at computer science and biology departments in Virginia. The data from these surveys are used in comparisons of the computer science and biology disciplines, and for statistical analyses that identify which departmental characteristics promote equal attrition for male and female undergraduates in computer science. This three-pronged methodological approach of interviews, discipline comparisons, and statistical analyses shows that departmental variation in gendered attrition rates can be explained largely by access to opportunity, relative numbers, and other characteristics of the learning environment. Using these concepts, this research identifies nine factors that affect the differential attrition of women from CS departments. These factors are: (1) The gender composition of enrolled students and faculty; (2) Faculty turnover; (3) Institutional support for the department; (4) Preferential attitudes toward female students; (5) Mentoring and supervising by faculty; (6) The local job market, starting salaries, and competitiveness of graduates; (7) Emphasis on teaching; and (8) Joint efforts for student success. This work contributes to our understanding of the gender segregation process in higher education. In addition, it contributes information that can lead to effective solutions for an

  18. The role of parallelism in the real-time processing of anaphora

    PubMed Central

    Poirier, Josée; Walenski, Matthew; Shapiro, Lewis P.

    2012-01-01

    Parallelism effects refer to the facilitated processing of a target structure when it follows a similar, parallel structure. In coordination, a parallelism-related conjunction triggers the expectation that a second conjunct with the same structure as the first conjunct should occur. It has been proposed that parallelism effects reflect the use of the first structure as a template that guides the processing of the second. In this study, we examined the role of parallelism in real-time anaphora resolution by charting activation patterns in coordinated constructions containing anaphora, Verb-Phrase Ellipsis (VPE) and Noun-Phrase Traces (NP-traces). Specifically, we hypothesised that an expectation of parallelism would incite the parser to assume a structure similar to the first conjunct in the second, anaphora-containing conjunct. The speculation of a similar structure would result in early postulation of covert anaphora. Experiment 1 confirms that following a parallelism-related conjunction, first-conjunct material is activated in the second conjunct. Experiment 2 reveals that an NP-trace in the second conjunct is posited immediately where licensed, which is earlier than previously reported in the literature. In light of our findings, we propose an intricate relation between structural expectations and anaphor resolution. PMID:23741080

  19. A co-design method for parallel image processing accelerator based on DSP and FPGA

    NASA Astrophysics Data System (ADS)

    Wang, Ze; Weng, Kaijian; Cheng, Zhao; Yan, Luxin; Guan, Jing

    2011-11-01

    In this paper, we present a co-design method for a parallel image processing accelerator based on a DSP and an FPGA. The DSP is used as the application and operation subsystem to execute complex operations, in which algorithms are resolved into commands. The FPGA is used as the co-processing subsystem for regular data-parallel processing; operation commands and image data are transmitted to the FPGA for processing acceleration. A series of experiments has been carried out, and up to one half to three quarters of the processing time is saved, which indicates that the proposed accelerator consumes less time and achieves better performance than traditional systems.

  20. Seventh SIAM Conference on Parallel Processing for Scientific Computing. Final technical report

    SciTech Connect

    1996-10-01

    The Seventh SIAM Conference on Parallel Processing for Scientific Computing was held in downtown San Francisco on the dates above. More than 400 people attended the meeting. This SIAM conference is, in this organizer's opinion, the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Other, related areas, most notably parallel software and applications, are also well represented. The strong contributed sessions and minisymposia at the meeting attest to these claims.

  1. Adapting high-level language programs for parallel processing using data flow

    NASA Technical Reports Server (NTRS)

    Standley, Hilda M.

    1988-01-01

    EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.

  2. Note on parallel processing techniques for algebraic equations, ordinary differential equations and partial differential equations

    SciTech Connect

    Allidina, A.Y.; Malinowski, K.; Singh, M.G.

    1982-12-01

    The possibilities were explored for enhancing parallelism in the simulation of systems described by algebraic equations, ordinary differential equations and partial differential equations. These techniques, using multiprocessors, were developed to speed up simulations, e.g. for nuclear accidents. Issues involved in their design included suitable approximations to bring the problem into a numerically manageable form and a numerical procedure to perform the computations necessary to solve the problem accurately. Parallel processing techniques used as simulation procedures, and a design of a simulation scheme and simulation procedure employing parallel computer facilities, were both considered.

  3. Comparisons of elastic and rigid blade-element rotor models using parallel processing technology for piloted simulations

    NASA Technical Reports Server (NTRS)

    Hill, Gary; Duval, Ronald W.; Green, John A.; Huynh, Loc C.

    1991-01-01

    A piloted comparison of rigid and aeroelastic blade-element rotor models was conducted at the Crew Station Research and Development Facility (CSRDF) at Ames Research Center. A simulation development and analysis tool, FLIGHTLAB, was used to implement these models in real time using parallel processing technology. Pilot comments and quantitative analysis performed both on-line and off-line confirmed that elastic degrees of freedom significantly affect perceived handling qualities. Trim comparisons show improved correlation with flight test data when elastic modes are modeled. The results demonstrate the efficiency with which the mathematical modeling sophistication of existing simulation facilities can be upgraded using parallel processing, and the importance of these upgrades to simulation fidelity.

  4. Parallel computer processing and modeling: applications for the ICU

    NASA Astrophysics Data System (ADS)

    Baxter, Grant; Pranger, L. Alex; Draghic, Nicole; Sims, Nathaniel M.; Wiesmann, William P.

    2003-07-01

    Current patient monitoring procedures in hospital intensive care units (ICUs) generate vast quantities of medical data, much of which is considered extemporaneous and not evaluated. Although sophisticated monitors to analyze individual types of patient data are routinely used in the hospital setting, this equipment lacks high order signal analysis tools for detecting long-term trends and correlations between different signals within a patient data set. Without the ability to continuously analyze disjoint sets of patient data, it is difficult to detect slow-forming complications. As a result, the early onset of conditions such as pneumonia or sepsis may not be apparent until the advanced stages. We report here on the development of a distributed software architecture test bed and software medical models to analyze both asynchronous and continuous patient data in real time. Hardware and software has been developed to support a multi-node distributed computer cluster capable of amassing data from multiple patient monitors and projecting near and long-term outcomes based upon the application of physiologic models to the incoming patient data stream. One computer acts as a central coordinating node; additional computers accommodate processing needs. A simple, non-clinical model for sepsis detection was implemented on the system for demonstration purposes. This work shows exceptional promise as a highly effective means to rapidly predict and thereby mitigate the effect of nosocomial infections.

  5. Parallel processing of run-length-encoded Boolean imagery: RLE transformation, arithmetic, and global-reduce operations

    NASA Astrophysics Data System (ADS)

    Schmalz, Mark S.

    1993-09-01

    The processing of Boolean imagery compressed by run-length encoding (RLE) frequently exhibits greater computational efficiency than the processing of uncompressed imagery, due to the data reduction inherent in RLE. In a previous publication, we outlined general methods for developing operators that compute over RLE Boolean imagery. In this paper, we present sequential and parallel algorithms for a variety of operations over RLE imagery, including the customary arithmetic and logical Hadamard operations, as well as the global reduce functions of image sum and maximum. RLE neighborhood-based operations, as well as the more advanced RLE operations of linear transforms, connected component labelling, and pattern recognition are presented in the companion paper.
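
    As a simple illustration of computing directly on run-length-encoded rows (this is not the paper's operator set, and the run representation is an assumption), a pixelwise logical AND of two binary rows reduces to intersecting their sorted runs of 1-pixels without ever decompressing:

        def rle_and(runs_a, runs_b):
            # intersect two sorted lists of (start, end) runs of 1-pixels (end exclusive)
            out, i, j = [], 0, 0
            while i < len(runs_a) and j < len(runs_b):
                lo = max(runs_a[i][0], runs_b[j][0])
                hi = min(runs_a[i][1], runs_b[j][1])
                if lo < hi:
                    out.append((lo, hi))
                if runs_a[i][1] < runs_b[j][1]:   # advance the run that ends first
                    i += 1
                else:
                    j += 1
            return out

        print(rle_and([(0, 4), (10, 15)], [(2, 12)]))   # -> [(2, 4), (10, 12)]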

  6. Design and implementation of the parallel processing system of multi-channel polarization images

    NASA Astrophysics Data System (ADS)

    Li, Zhi-yong; Huang, Qin-chao

    2013-08-01

    Compared with traditional optical intensity image processing, polarization image processing has two main problems: the amount of data is larger, and the processing tasks are more complex. To address these problems, a parallel processing system for multi-channel polarization images is designed using a multi-DSP technique. It contains a communication control unit (CCU) and a data processing array (DPA). The CCU controls communications inside and outside the system; its logic is implemented in an FPGA chip. The DPA is made up of four Digital Signal Processor (DSP) chips, which are interlinked in a loosely coupled manner, and it carries out processing tasks, including image registration and image synthesis, with parallel processing methods. The parallel processing model for polarization images is designed at multiple levels, covering the system task, the algorithm and the operation, and its program is written in assembly language. In the experiment, the polarization image resolution is 782x582 pixels and the pixel data length is 12 bits. After receiving 3 channels of polarization images simultaneously, the system performs the parallel tasks needed to acquire the target polarization characteristics. Experimental results show that the system has good real-time performance and reliability. The processing time for image registration is 293.343 ms, with a registration accuracy of 0.5 pixel, and the processing time for image synthesis is 3.199 ms.

  7. Designing efficient parallel algorithms on mesh-connected computers with multiple broadcasting

    SciTech Connect

    Chen, Y.C.; Chen, W.T. ); Chen, G.H. ); Sheu, J.P. )

    1990-04-01

    Semigroup and prefix computations on two-dimensional mesh-connected computers with multiple broadcasting (2-MCCMBs) are studied in this paper. Previously, only square 2-MCCMBs with N processing elements were considered for semigroup computations on N data items, and O(N^(1/6)) time was required. It is found that square machines are not the best form for semigroup computations, and an O(N^(1/8))-time algorithm is thus derived on an N^(5/8) x N^(3/8) rectangular 2-MCCMB. This time complexity can be further reduced to O(N^(1/9)) if fewer PEs are used. In the same way, parallel algorithms for prefix computations are also derived with the same time complexities.
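
    To make the prefix computation itself concrete (this NumPy sketch only illustrates the data-parallel structure of an inclusive scan; it is not the mesh-with-broadcasting algorithm analysed in the paper), a Hillis-Steele style sweep doubles the shift distance each pass and finishes in about log2(n) steps:

        import numpy as np

        def inclusive_scan(x):
            # inclusive prefix sum via log2(n) data-parallel sweeps
            x = np.array(x, dtype=float)
            d = 1
            while d < len(x):
                shifted = np.zeros_like(x)
                shifted[d:] = x[:-d]
                x = x + shifted        # every element adds the value d positions back
                d *= 2
            return x

        print(inclusive_scan([1, 2, 3, 4, 5]))   # -> [ 1.  3.  6. 10. 15.]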

  8. On the costs of parallel processing in dual-task performance: The case of lexical processing in word production.

    PubMed

    Paucke, Madlen; Oppermann, Frank; Koch, Iring; Jescheniak, Jörg D

    2015-12-01

    Previous dual-task picture-naming studies suggest that lexical processes require capacity-limited processes and prevent other tasks from being carried out in parallel. However, studies involving the processing of multiple pictures suggest that parallel lexical processing is possible. The present study investigated the specific costs that may arise when such parallel processing occurs. We used a novel dual-task paradigm by presenting 2 visual objects associated with different tasks and manipulating between-task similarity. With high similarity, a picture-naming task (T1) was combined with a phoneme-decision task (T2), so that lexical processes were shared across tasks. With low similarity, picture-naming was combined with a size-decision T2 (nonshared lexical processes). In Experiment 1, we found that a manipulation of lexical processes (lexical frequency of T1 object name) showed an additive propagation with low between-task similarity and an overadditive propagation with high between-task similarity. Experiment 2 replicated this differential forward propagation of the lexical effect and showed that it disappeared with longer stimulus onset asynchronies. Moreover, both experiments showed backward crosstalk, indexed as worse T1 performance with high between-task similarity compared with low similarity. Together, these findings suggest that conditions of high between-task similarity can lead to parallel lexical processing in both tasks, which, however, does not result in benefits but rather in extra performance costs. These costs can be attributed to crosstalk based on the dual-task binding problem arising from parallel processing. Hence, the present study reveals that capacity-limited lexical processing can run in parallel across dual tasks but only at the expense of extraordinarily high costs. PMID:26375632

  9. Parallel flow accumulation algorithms for graphical processing units with application to RUSLE model

    NASA Astrophysics Data System (ADS)

    Sten, Johan; Lilja, Harri; Hyväluoma, Jari; Westerholm, Jan; Aspnäs, Mats

    2016-04-01

    Digital elevation models (DEMs) are widely used in the modeling of surface hydrology, which typically includes the determination of flow directions and flow accumulation. The use of high-resolution DEMs increases the accuracy of flow accumulation computation, but as a drawback, the computational time may become excessively long if large areas are analyzed. In this paper we investigate the use of graphical processing units (GPUs) for efficient flow accumulation calculations. We present two new parallel flow accumulation algorithms based on dependency transfer and topological sorting and compare them to previously published flow transfer and indegree-based algorithms. We benchmark the GPU implementations against industry standards, ArcGIS and SAGA. With the flow-transfer D8 flow routing model and binary input data, a speed up of 19 is achieved compared to ArcGIS and 15 compared to SAGA. We show that on GPUs the topological sort-based flow accumulation algorithm leads on average to a speedup by a factor of 7 over the flow-transfer algorithm. Thus a total speed up of the order of 100 is achieved. We test the algorithms by applying them to the Revised Universal Soil Loss Equation (RUSLE) erosion model. For this purpose we present parallel versions of the slope, LS factor and RUSLE algorithms and show that the RUSLE erosion results for an area of 12 km x 24 km containing 72 million cells can be calculated in less than a second. Since flow accumulation is needed in many hydrological models, the developed algorithms may find use in many other applications than RUSLE modeling. The algorithm based on topological sorting is particularly promising for dynamic hydrological models where flow accumulations are repeatedly computed over an unchanged DEM.
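
    The in-degree/topological-sorting idea behind the accumulation algorithm can be pictured with a small serial Python sketch (an illustration only, not the GPU implementation; the downstream-index encoding of the D8 directions is hypothetical): cells with no upstream neighbours are processed first, and each cell passes its accumulated count to the cell it drains into:

        from collections import deque

        def flow_accumulation(downstream):
            # downstream[i] = index of the cell that cell i drains into, or -1 at an outlet
            n = len(downstream)
            acc = [1] * n                      # every cell contributes itself
            indeg = [0] * n
            for d in downstream:
                if d >= 0:
                    indeg[d] += 1
            queue = deque(i for i in range(n) if indeg[i] == 0)   # ridge cells first
            while queue:
                i = queue.popleft()
                d = downstream[i]
                if d >= 0:
                    acc[d] += acc[i]
                    indeg[d] -= 1
                    if indeg[d] == 0:
                        queue.append(d)
            return acc

        # a 5-cell chain draining 0 -> 1 -> 2 -> 3 -> 4 (outlet)
        print(flow_accumulation([1, 2, 3, 4, -1]))   # -> [1, 2, 3, 4, 5]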

  10. Image processing system architecture using parallel arrays of digital signal processors

    NASA Astrophysics Data System (ADS)

    Kshirsagar, Shirish P.; Hobson, Clifford A.; Hartley, David A.; Harvey, David M.

    1993-10-01

    The paper describes the requirements of a high definition, high speed image processing system. Different types of parallel architectures were considered for the system. Advantages and limitations of SIMD and MIMD architectures are briefly discussed for image processing applications. A parallel image processing system based on MIMD architecture has been developed using multiple digital signal processors which can communicate with each other through an interconnection network. Texas Instruments TMS320C40 digital signal processors have been selected because they have a powerful floating point CPU supported by fast parallel communication ports, a DMA coprocessor and two memory interfaces. A five processor system is described in the paper. The EISA bus is used as the host interface and VISION bus is used to transfer images between the processors. The system is being used for automated non-contact inspection in which electro-optic signals are processed to identify manufacturing problems.

  11. Fear Control and Danger Control: A Test of the Extended Parallel Process Model (EPPM).

    ERIC Educational Resources Information Center

    Witte, Kim

    1994-01-01

    Explores cognitive and emotional mechanisms underlying success and failure of fear appeals in context of AIDS prevention. Offers general support for Extended Parallel Process Model. Suggests that cognitions lead to fear appeal success (attitude, intention, or behavior changes) via danger control processes, whereas the emotion fear leads to fear…

  12. Parallel Processing of the Target Language during Source Language Comprehension in Interpreting

    ERIC Educational Resources Information Center

    Dong, Yanping; Lin, Jiexuan

    2013-01-01

    Two experiments were conducted to test the hypothesis that the parallel processing of the target language (TL) during source language (SL) comprehension in interpreting may be influenced by two factors: (i) link strength from SL to TL, and (ii) the interpreter's cognitive resources supplement to TL processing during SL comprehension. The…

  13. Parallels between a Collaborative Research Process and the Middle Level Philosophy

    ERIC Educational Resources Information Center

    Dever, Robin; Ross, Diane; Miller, Jennifer; White, Paula; Jones, Karen

    2014-01-01

    The characteristics of the middle level philosophy as described in This We Believe closely parallel the collaborative research process. The journey of one research team is described in relationship to these characteristics. The collaborative process includes strengths such as professional relationships, professional development, courageous…

  14. A computationally efficient parallel Levenberg-Marquardt algorithm for highly parameterized inverse model analyses

    DOE PAGES

    Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.

    2016-08-19

    Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally-efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace, such that the dimensionality of the problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2D and a random hydraulic conductivity field in 3D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10^1 to ~10^2 in a multi-core computational environment. Furthermore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate- to large-scale problems.
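
    For orientation only, the linear system that each Levenberg-Marquardt iteration solves is the damped normal-equations step sketched below in NumPy (the paper's contribution, the Krylov-subspace projection and its recycling across damping parameters, is precisely what this naive dense solve omits; the toy Jacobian and residuals are made up):

        import numpy as np

        def lm_step(jacobian, residual, damping):
            # solve (J^T J + damping * I) delta = -J^T r for one Levenberg-Marquardt update
            JTJ = jacobian.T @ jacobian
            rhs = -jacobian.T @ residual
            return np.linalg.solve(JTJ + damping * np.eye(JTJ.shape[0]), rhs)

        J = np.array([[1.0], [2.0], [3.0]])   # toy Jacobian for a one-parameter model
        r = np.array([0.5, 1.0, 1.5])         # current residuals
        print(lm_step(J, r, damping=1e-2))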

  15. QuickPIC: a highly efficient fully parallelized PIC code for plasma-based acceleration

    NASA Astrophysics Data System (ADS)

    Huang, C.; Decyk, V. K.; Zhou, M.; Lu, W.; Mori, W. B.; Cooley, J. H.; Antonsen, T. M., Jr.; Feng, B.; Katsouleas, T.; Vieira, J.; Silva, L. O.

    2006-09-01

    A highly efficient, fully parallelized, fully relativistic, three-dimensional particle-in-cell model for simulating plasma and laser wakefield acceleration is described. The model is based on the quasi-static approximation, which reduces a fully three-dimensional electromagnetic field solve and particle push to a two-dimensional field solve and particle push. This is done by calculating the plasma wake assuming that the drive beam and/or laser does not evolve during the time it takes for it to pass a plasma particle. The complete electromagnetic fields of the plasma wake and its associated index of refraction are then used to evolve the drive beam and/or laser using very large time steps. This algorithm reduces the computation time by 2 to 3 orders of magnitude without loss of accuracy for highly nonlinear problems of interest. The code is fully parallelizable with different domain decompositions for the 2D and 3D pieces of the code. The code also has dynamic load balancing. We present the basic algorithms and design of QuickPIC, as well as a comparison between the new algorithm and conventional fully explicit models (OSIRIS). Directions for future work are also presented, including a software pipeline technique to further scale QuickPIC to 10,000+ processors.

  16. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples

    PubMed Central

    Smith, Andrew M.; Heisler, Lawrence E.; St.Onge, Robert P.; Farias-Hesson, Eveline; Wallace, Iain M.; Bodeau, John; Harris, Adam N.; Perry, Kathleen M.; Giaever, Guri; Pourmand, Nader; Nislow, Corey

    2010-01-01

    Next-generation sequencing has proven an extremely effective technology for molecular counting applications where the number of sequence reads provides a digital readout for RNA-seq, ChIP-seq, Tn-seq and other applications. The extremely large number of sequence reads that can be obtained per run permits the analysis of increasingly complex samples. For lower complexity samples, however, a point of diminishing returns is reached when the number of counts per sequence results in oversampling with no increase in data quality. A solution to making next-generation sequencing as efficient and affordable as possible involves assaying multiple samples in a single run. Here, we report the successful 96-plexing of complex pools of DNA barcoded yeast mutants and show that such ‘Bar-seq’ assessment of these samples is comparable with data provided by barcode microarrays, the current benchmark for this application. The cost reduction and increased throughput permitted by highly multiplexed sequencing will greatly expand the scope of chemogenomics assays and, equally importantly, the approach is suitable for other sequence counting applications that could benefit from massive parallelization. PMID:20460461
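
    The sequence-counting side of such 'Bar-seq' assays is essentially a tally of reads per barcode; the trivial Python sketch below assumes (hypothetically) that a fixed-length sample barcode sits at the start of each read, which is an illustration rather than the study's actual read layout:

        from collections import Counter

        def count_barcodes(reads, barcode_length=6):
            # tally reads by the barcode assumed to occupy the first few bases of each read
            return Counter(read[:barcode_length] for read in reads)

        reads = ["ACGTAAGGGTTT", "ACGTAACCCTTT", "TTGCAAGGGAAA"]
        print(count_barcodes(reads))   # e.g. Counter({'ACGTAA': 2, 'TTGCAA': 1})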

  17. Efficient massively parallel simulation of dynamic channel assignment schemes for wireless cellular communications

    NASA Technical Reports Server (NTRS)

    Greenberg, Albert G.; Lubachevsky, Boris D.; Nicol, David M.; Wright, Paul E.

    1994-01-01

    Fast, efficient parallel algorithms are presented for discrete event simulations of dynamic channel assignment schemes for wireless cellular communication networks. The driving events are call arrivals and departures, in continuous time, to cells geographically distributed across the service area. A dynamic channel assignment scheme decides which call arrivals to accept, and which channels to allocate to the accepted calls, attempting to minimize call blocking while ensuring co-channel interference is tolerably low. Specifically, the scheme ensures that the same channel is used concurrently at different cells only if the pairwise distances between those cells are sufficiently large. Much of the complexity of the system comes from ensuring this separation. The network is modeled as a system of interacting continuous time automata, each corresponding to a cell. To simulate the model, conservative methods are used; i.e., methods in which no errors occur in the course of the simulation and so no rollback or relaxation is needed. Implemented on a 16K processor MasPar MP-1, an elegant and simple technique provides speedups of about 15 times over an optimized serial simulation running on a high speed workstation. A drawback of this technique, typical of conservative methods, is that processor utilization is rather low. To overcome this, new methods were developed that exploit slackness in event dependencies over short intervals of time, thereby raising the utilization to above 50 percent and the speedup over the optimized serial code to about 120 times.
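
    A serial, event-driven sketch of the model described above: Poisson call arrivals and departures per cell, with a channel granted only if no cell within a minimum reuse distance is already using it. The conservative parallel synchronization that the paper contributes is not shown, and the line-of-cells layout and rates are hypothetical.

```python
import heapq, random

random.seed(1)
N_CELLS, N_CHANNELS, MIN_SEP = 10, 4, 2        # cells on a line, reuse separation (toy values)
ARRIVAL_RATE, MEAN_HOLD = 0.5, 3.0

in_use = {c: {} for c in range(N_CELLS)}       # cell -> {channel: departure_time}
events = [(random.expovariate(ARRIVAL_RATE), "arrival", c) for c in range(N_CELLS)]
heapq.heapify(events)
blocked = accepted = 0

def channel_free(cell, ch):
    """A channel is assignable only if no cell within MIN_SEP is using it."""
    return all(ch not in in_use[c]
               for c in range(max(0, cell - MIN_SEP), min(N_CELLS, cell + MIN_SEP + 1)))

while events:
    t, kind, cell = heapq.heappop(events)
    if t > 100.0:
        break
    if kind == "arrival":
        free = [ch for ch in range(N_CHANNELS) if channel_free(cell, ch)]
        if free:
            ch = free[0]
            in_use[cell][ch] = t + random.expovariate(1.0 / MEAN_HOLD)
            heapq.heappush(events, (in_use[cell][ch], "departure", (cell, ch)))
            accepted += 1
        else:
            blocked += 1                        # call blocking we would like to minimize
        heapq.heappush(events, (t + random.expovariate(ARRIVAL_RATE), "arrival", cell))
    else:
        c, ch = cell                            # departure events carry (cell, channel)
        in_use[c].pop(ch, None)

print(f"accepted={accepted} blocked={blocked}")
```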

  18. Parallel artificial liquid membrane extraction as an efficient tool for removal of phospholipids from human plasma.

    PubMed

    Ask, Kristine Skoglund; Bardakci, Turgay; Parmer, Marthe Petrine; Halvorsen, Trine Grønhaug; Øiestad, Elisabeth Leere; Pedersen-Bjergaard, Stig; Gjelstad, Astrid

    2016-09-10

    Generic Parallel Artificial Liquid Membrane Extraction (PALME) methods for non-polar basic and non-polar acidic drugs from human plasma were investigated with respect to phospholipid removal. In both cases, extractions in 96-well format were performed from plasma (125μL), through 4μL organic solvent used as supported liquid membranes (SLMs), and into 50μL aqueous acceptor solutions. The acceptor solutions were subsequently analysed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) using in-source fragmentation and monitoring the m/z 184→184 transition for investigation of phosphatidylcholines (PC), sphingomyelins (SM), and lysophosphatidylcholines (Lyso-PC). In both generic methods, no phospholipids were detected in the acceptor solutions. Thus, PALME appeared to be highly efficient for phospholipid removal. To further support this, qualitative (post-column infusion) and quantitative matrix effects were investigated with fluoxetine, fluvoxamine, and quetiapine as model analytes. No signs of matrix effects were observed. Finally, PALME was evaluated for the aforementioned drug substances, and data were in accordance with European Medicines Agency (EMA) guidelines. PMID:27433988

  19. An efficient non-iterative reconstruction algorithm for parallel MRI with arbitrary k-space trajectories.

    PubMed

    Ying, Leslie; Haldar, Justin; Liang, Zhi-Pei

    2005-01-01

    Parallel imaging using multiple receiver coils has emerged as an effective tool to reduce imaging time in various MRI applications. Although several different image reconstruction methods have been developed and demonstrated to be successful for Cartesian k-space trajectories, there is a lack of efficient reconstruction methods for arbitrary trajectories. In this paper, we formulate the reconstruction problem in k-space and propose a novel image reconstruction method that is fast and effective for arbitrary trajectories. To obtain the desired image, the method reconstructs the Nyquist-sampled k-space data of the image on a uniform Cartesian grid from the undersampled multichannel k-space data on an arbitrary grid, followed by inverse Fourier transform. We demonstrate the effectiveness of the proposed fast algorithm using simulations. In particular, we compare the proposed method with the existing iterative method and show that the former is able to achieve similar image quality to the latter but with reduced computational complexity. PMID:17282445

  20. Parallel processing technology for large-scale production of synthetic aperture radar imagery

    NASA Astrophysics Data System (ADS)

    Kirk, David; Bessette, Loretta A.; Fawcett, Glenn; Nobles, David

    1999-08-01

    This paper presents a case study in using parallel processing technology for large-scale production of Foliage Penetration (FOPEN) Synthetic Aperture Radar (SAR) imagery. The initial version of the FOPEN SAR image formation software ran on a Unix workstation. The research-grade parallel image formation software was transitioned into a full-scale remote processing facility resulting in a significant improvement in processing speed. The primary goal of this effort was to increase the production rate of calibrated, well-focused SAR imagery, but an important secondary objective was to gain insight into the capabilities and limitations of high performance parallel platforms. This paper discusses lessons that were learned in transitioning and utilizing the research-grade image formation code in a 'turn key' production setting, and discusses configuration control and image quality metrics.

  1. Medical image processing utilizing neural networks trained on a massively parallel computer.

    PubMed

    Kerr, J P; Bartlett, E B

    1995-07-01

    While finding many applications in science, engineering, and medicine, artificial neural networks (ANNs) have typically been limited to small architectures. In this paper, we demonstrate how very large architecture neural networks can be trained for medical image processing utilizing a massively parallel, single-instruction multiple data (SIMD) computer. The two- to three-orders of magnitude improvement in processing time attainable using a parallel computer makes it practical to train very large architecture ANNs. As an example we have trained several ANNs to demonstrate the tomographic reconstruction of 64 x 64 single photon emission computed tomography (SPECT) images from 64 planar views of the images. The potential for these large architecture ANNs lies in the fact that once the neural network is properly trained on the parallel computer the corresponding interconnection weight file can be loaded on a serial computer. Subsequently, relatively fast processing of all novel images can be performed on a PC or workstation. PMID:7497701
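
    A minimal sketch of the train-then-export workflow described above: a tiny numpy MLP is trained (standing in for the SIMD-trained reconstruction network), its interconnection weights are written to a file, and a separate "serial" step reloads the file and runs only the cheap forward pass. The toy data and network sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, w1, w2):
    h = np.tanh(x @ w1)
    return h @ w2, h

# --- "training machine": fit a tiny MLP on toy data ---
x = rng.normal(size=(256, 8))
y = np.sin(x.sum(axis=1, keepdims=True))
w1 = rng.normal(scale=0.1, size=(8, 16))
w2 = rng.normal(scale=0.1, size=(16, 1))
lr = 0.01
for _ in range(2000):
    pred, h = forward(x, w1, w2)
    err = pred - y
    grad_w2 = h.T @ err / len(x)
    grad_w1 = x.T @ ((err @ w2.T) * (1 - h**2)) / len(x)
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

np.savez("weights.npz", w1=w1, w2=w2)       # the "interconnection weight file"

# --- "inference machine": load the weight file and run only the forward pass ---
w = np.load("weights.npz")
new_pred, _ = forward(rng.normal(size=(4, 8)), w["w1"], w["w2"])
print(new_pred.ravel())
```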

  2. Advancing the extended parallel process model through the inclusion of response cost measures.

    PubMed

    Rintamaki, Lance S; Yang, Z Janet

    2014-01-01

    This study advances the Extended Parallel Process Model through the inclusion of response cost measures, which are drawbacks associated with a proposed response to a health threat. A sample of 502 college students completed a questionnaire on perceptions regarding sexually transmitted infections and condom use after reading information from the Centers for Disease Control and Prevention on the health risks of sexually transmitted infections and the utility of latex condoms in preventing sexually transmitted infection transmission. The questionnaire included standard Extended Parallel Process Model assessments of perceived threat and efficacy, as well as questions pertaining to response costs associated with condom use. Results from hierarchical ordinary least squares regression demonstrated how the addition of response cost measures improved the predictive power of the Extended Parallel Process Model, supporting the inclusion of this variable in the model. PMID:24730535

  3. Managing internode data communications for an uninitialized process in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E

    2014-05-20

    A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.
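
    A toy software analogue of the buffering policy described above, with hypothetical class and method names: messages arriving for an uninitialized process accumulate in a fixed-capacity MU buffer and spill into a temporary buffer in main memory when that buffer fills; on initialization the process drains both in arrival order.

```python
from collections import deque

class MessageUnit:
    """Toy model of the buffering policy for an uninitialized process."""
    def __init__(self, mu_capacity=4):
        self.mu_buffer = deque()          # fixed-size MU memory, capacity enforced below
        self.mu_capacity = mu_capacity
        self.temp_buffer = []             # established in main memory on overflow
        self.initialized = False

    def receive(self, message):
        if not self.initialized and len(self.mu_buffer) >= self.mu_capacity:
            # application agent: MU buffer full before initialization -> spill to temp
            self.temp_buffer.extend(self.mu_buffer)
            self.mu_buffer.clear()
        self.mu_buffer.append(message)

    def initialize_process(self):
        """On initialization the process drains the temporary buffer first, in order."""
        self.initialized = True
        pending = self.temp_buffer + list(self.mu_buffer)
        self.temp_buffer.clear()
        self.mu_buffer.clear()
        return pending

mu = MessageUnit(mu_capacity=2)
for i in range(5):
    mu.receive(f"msg-{i}")
print(mu.initialize_process())   # ['msg-0', 'msg-1', 'msg-2', 'msg-3', 'msg-4']
```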

  4. Application of integration algorithms in a parallel processing environment for the simulation of jet engines

    NASA Technical Reports Server (NTRS)

    Krosel, S. M.; Milner, E. J.

    1982-01-01

    The application of predictor-corrector integration algorithms developed for the digital parallel processing environment is investigated. The algorithms are implemented and evaluated through the use of a software simulator which provides an approximate representation of the parallel processing hardware. Test cases which focus on the use of the algorithms are presented and a specific application using a linear model of a turbofan engine is considered. Results are presented showing the effects of integration step size and the number of processors on simulation accuracy. Real time performance, interprocessor communication, and algorithm startup are also discussed.
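
    A minimal serial sketch of a predictor-corrector (PECE) integrator of the kind discussed above: a two-step Adams-Bashforth predictor with a trapezoidal corrector, applied to a small linear state-space model standing in for the linearized engine model. The parallel-processor partitioning is not shown.

```python
import numpy as np

# Linear test model x' = A x (stand-in for a linearized engine model).
A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])
f = lambda x: A @ x

def pece(x0, h, n_steps):
    """AB2 predictor, trapezoidal corrector; one predict-evaluate-correct-evaluate pass per step."""
    x = np.array(x0, dtype=float)
    f_prev = f(x)
    x_next = x + h * f_prev                 # bootstrap the multistep method with one Euler step
    history = [x.copy(), x_next.copy()]
    x, f_curr = x_next, f(x_next)
    for _ in range(n_steps - 1):
        x_pred = x + h / 2.0 * (3.0 * f_curr - f_prev)   # predict (Adams-Bashforth 2)
        f_pred = f(x_pred)                               # evaluate
        x = x + h / 2.0 * (f_curr + f_pred)              # correct (trapezoidal rule)
        f_prev, f_curr = f_curr, f(x)                    # evaluate
        history.append(x.copy())
    return np.array(history)

traj = pece(x0=[1.0, 1.0], h=0.05, n_steps=40)
print(traj[-1])          # decays toward the origin for this stable system
```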

  5. Relative Language Exposure, Processing Efficiency and Vocabulary in Spanish-English Bilingual Toddlers

    ERIC Educational Resources Information Center

    Hurtado, Nereyda; Gruter, Theres; Marchman, Virginia A.; Fernald, Anne

    2014-01-01

    Research with monolingual children has shown that early efficiency in real-time word recognition predicts later language and cognitive outcomes. In parallel research with young bilingual children, processing ability and vocabulary size are closely related within each language, although not across the two languages. For children in dual-language…

  6. a Hadoop-Based Distributed Framework for Efficient Managing and Processing Big Remote Sensing Images

    NASA Astrophysics Data System (ADS)

    Wang, C.; Hu, F.; Hu, X.; Zhao, S.; Wen, W.; Yang, C.

    2015-07-01

    Various sensors from airborne and satellite platforms are producing large volumes of remote sensing images for mapping, environmental monitoring, disaster management, military intelligence, and others. However, it is challenging to efficiently store, query and process such big data due to data- and computing-intensive issues. In this paper, a Hadoop-based framework is proposed to manage and process the big remote sensing data in a distributed and parallel manner. Especially, remote sensing data can be directly fetched from other data platforms into the Hadoop Distributed File System (HDFS). The Orfeo toolbox, a ready-to-use tool for large image processing, is integrated into MapReduce to provide affluent image processing operations. With the integration of HDFS, the Orfeo toolbox and MapReduce, these remote sensing images can be directly processed in parallel in a scalable computing environment. The experimental results show that the proposed framework can efficiently manage and process such big remote sensing data.

  7. Efficient algorithms for processing remotely sensed imagery

    NASA Astrophysics Data System (ADS)

    Zhang, Zengyan

    At regional and global scales, satellite-based sensors are the primary source of information to study the Earth's environment, as they provide the needed dynamic temporal view of the Earth's surface. We focus this dissertation on the development of efficient methodologies and algorithms to generate custom tailored data products using Global Area Coverage (GAC) data from the NOAA Advanced Very High Resolution Radiometer (AVHRR) sensor. Furthermore, we show the retrieval of the global Bidirectional Reflectance Distribution Function (BRDF) and albedo of the Earth land surface using Pathfinder AVHRR Land (PAL) data sets which can be generated using our system. These are the first algorithms to retrieve such land surface properties on a global scale. We start by describing a software system Kronos, which allows the generation of a rich set of data products that can be easily specified through a Java interface by scientists wishing to carry out Earth system modeling or analysis. Kronos is based on a flexible methodology and consists of four major components: ingest and preprocessing, indexing and storage, search and processing engine, and a Java interface. Then efficient algorithms for custom-tailored AVHRR data products generation are developed, implemented, and tested on the UMIACS high performance computer systems. A major challenge lies in the efficient processing, storage, and archival of the available large amounts of data in such a way that pertinent information can be extracted with ease. Finally high performance algorithms are developed to retrieve global BRDF and albedo in the red and near-infrared wavelengths using three widely different models with multiangular, multi-temporal, and multi-band PAL data. Given the volume of data involved (about 27 GBytes), I/O time as well as the overall computational complexity are minimized. The algorithms access the global data once, followed by a redistribution of land pixel data to balance the computational loads among the

  8. Next generation Purex modeling by way of parallel processing with high performance computers

    SciTech Connect

    DeMuth, S.F.

    1993-08-01

    The Plutonium and Uranium Extraction (Purex) process is the predominant method used worldwide for solvent extraction in reprocessing spent nuclear fuels. Proper flowsheet design has a significant impact on the character of the process waste. Past Purex flowsheet modeling has been based on equilibrium conditions. It can be shown for the Purex process that optimum separation does not necessarily occur at equilibrium conditions. The next generation Purex flowsheet models should incorporate the fundamental diffusion and chemical kinetic processes required to study time-dependent behavior. Use of parallel processing with high-performance computers will permit transient multistage and multispecies design calculations based on mass transfer with simultaneous chemical reaction models. This paper presents an applicable mass transfer with chemical reaction model for the Purex system and presents a parallel processing solution methodology.

  9. Parallel processing algorithms for hydrocodes on a computer with MIMD architecture (DENELCOR's HEP)

    SciTech Connect

    Hicks, D.L.

    1983-11-01

    Real-time simulation/prediction of complex systems such as water-cooled nuclear reactors is of great practical value: if reactor operators had fast simulators/predictors to check the consequences of their operations before implementing them, events such as the incident at Three Mile Island might be avoided. However, existing simulator/predictors such as RELAP run slower than real time on serial computers. It appears that the only way to overcome the barrier to higher computing rates is to use computers with architectures that allow concurrent computations or parallel processing. The computer architecture with the greatest degree of parallelism is labeled Multiple Instruction Stream, Multiple Data Stream (MIMD). An example of a machine of this type is the HEP computer by DENELCOR. It appears that hydrocodes are very well suited for parallelization on the HEP. It is a straightforward exercise to parallelize explicit, one-dimensional Lagrangean hydrocodes in a zone-by-zone parallelization. Similarly, implicit schemes can be parallelized in a zone-by-zone fashion via an a priori, symbolic inversion of the tridiagonal matrix that arises in an implicit scheme. These techniques are extended to Eulerian hydrocodes by using Harlow's rezone technique. The extension from single-phase Eulerian to two-phase Eulerian is straightforward. This step-by-step extension leads to hydrocodes with zone-by-zone parallelization that are capable of two-phase flow simulation. Extensions to two and three spatial dimensions can be achieved by operator splitting. It appears that a zone-by-zone parallelization is the best way to utilize the capabilities of an MIMD machine. 40 references.

  10. Friction Stir Processing for Efficient Manufacturing

    SciTech Connect

    Mr. Christopher B. Smith; Dr. Oyelayo Ajayi

    2012-01-31

    Friction at contacting surfaces in relative motion is a major source of parasitic energy loss in machine systems and manufacturing processes. Consequently, friction reduction usually translates to efficiency gain and reduction in energy consumption. Furthermore, friction at surfaces eventually leads to wear and failure of the components, thereby compromising reliability and durability. In order to reduce friction and wear in tribological components, material surfaces are often hardened by a variety of methods, including conventional heat treatment, laser surface hardening, and thin-film coatings. While these surface treatments are effective when used in conjunction with lubrication to prevent failure, they are all energy intensive and could potentially add significant cost. A new concept for surface hardening of metallic materials and components is Friction Stir Processing (FSP). Compared to the current surface hardening technologies, FSP is more energy efficient, has no emissions or waste by-products, and may result in better tribological performance. FSP involves plunging a rotating tool to a predetermined depth (case layer thickness) and translating the FSP tool along the area to be processed. This action of the tool produces heating and severe plastic deformation of the processed area. For steel the temperature is high enough to cause phase transformation, ultimately forming the hard martensitic phase. Indeed, FSP has been used for surface modification of several metals and alloys so as to homogenize the microstructure and refine the grain size, both of which led to improved fatigue and corrosion resistance. Based on the effect of FSP on the near-surface layer material, it was expected to have beneficial effects on the friction and wear performance of metallic materials. However, little or no knowledge existed on the impact of FSP on friction and wear performance, which is the subject of this project and final report. Specifically for steel, which is the most dominant

  11. Parallel image processing and image understanding. Final report, April 1985-March 1986

    SciTech Connect

    Rosenfeld, A.

    1986-03-31

    This research was conducted to obtain better methods for image processing. It focused on several aspects of this problem, including parallel algorithms for image processing, knowledge-based techniques for image understanding, and modeling images using shape and texture. Eighteen technical reports were produced, which will also appear as published papers in journals. In the paper Holes and Genus of 3D images, it was shown that certain geometric invariants of a digital image (number of components, number of holes, and number of cavities) do not determine the topology (in the sense of connectivity) of the image, refuting the commonly believed assumption that they do. This research lays the groundwork for research on digital and computational geometry of 3D images. In the paper Hough Transform Algorithms for Mesh-Connected SIMD Parallel Processors, several methods of Hough transform computation are studied in terms of suitability for implementation on a parallel processor, providing a valuable tool for straight-line detection.

  12. Processing communications events in parallel active messaging interface by awakening thread from wait state

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-10-22

    Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
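
    A toy illustration of the wait/awaken pattern described above, using a Python condition variable: the advance thread sleeps while no events are pending for its context and is awakened when a communications event is posted. Class and method names are hypothetical, not the PAMI API.

```python
import threading, time
from collections import deque

class Context:
    """Toy PAMI-like context: advance() sleeps when idle and wakes on a posted event."""
    def __init__(self):
        self.events = deque()
        self.cond = threading.Condition()

    def post_event(self, event):
        with self.cond:
            self.events.append(event)
            self.cond.notify()               # awaken the waiting advance thread

    def advance(self, n_events):
        processed = 0
        while processed < n_events:
            with self.cond:
                while not self.events:       # no actionable events: enter wait state
                    self.cond.wait()
                event = self.events.popleft()
            print("processed", event)        # handle the event outside the lock
            processed += 1

ctx = Context()
worker = threading.Thread(target=ctx.advance, args=(3,))
worker.start()
for i in range(3):
    time.sleep(0.1)
    ctx.post_event(f"recv-{i}")
worker.join()
```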

  13. Bin-Hash Indexing: A Parallel Method for Fast Query Processing

    SciTech Connect

    Bethel, Edward W; Gosink, Luke J.; Wu, Kesheng; Bethel, Edward Wes; Owens, John D.; Joy, Kenneth I.

    2008-06-27

    This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency, and is therefore well-suited for the emerging commodity of parallel processors, such as multi-cores, cell processors, and general purpose graphics processing units (GPU). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that can not be resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records are able to be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be simultaneously evaluated at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.
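
    A serial sketch of the two-level Bin-Hash idea: records are binned by value, each bin keeps a small hash table of record id to exact value, and a range query resolves fully covered bins from the bin boundaries alone while checking only boundary bins against the stored values. The GPU/CUDA execution and the perfect spatial hashing are not reproduced; plain dictionaries stand in for the hash tables.

```python
import numpy as np

def build_bin_hash(values, n_bins=8):
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bins = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    tables = {b: {} for b in range(n_bins)}            # per-bin hash-table stand-in
    for rid, (b, v) in enumerate(zip(bins, values)):
        tables[b][rid] = v
    return edges, tables

def range_query(edges, tables, lo, hi):
    hits = []
    for b, table in tables.items():
        b_lo, b_hi = edges[b], edges[b + 1]
        if b_hi < lo or b_lo > hi:
            continue                                   # bin entirely outside the range
        if lo <= b_lo and b_hi <= hi:
            hits.extend(table.keys())                  # bin entirely inside: no value check
        else:
            hits.extend(rid for rid, v in table.items() if lo <= v <= hi)  # boundary bin
    return sorted(hits)

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 100.0, size=1000)
edges, tables = build_bin_hash(data, n_bins=16)
result = range_query(edges, tables, 20.0, 30.0)
assert all(20.0 <= data[r] <= 30.0 for r in result)
print(len(result), "records in [20, 30]")
```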

  14. Compressed bitmap indices for efficient query processing

    SciTech Connect

    Wu, Kesheng; Otoo, Ekow; Shoshani, Arie

    2001-09-30

    Many database applications make extensive use of bitmap indexing schemes. In this paper, we study how to improve the efficiencies of these indexing schemes by proposing new compression schemes for the bitmaps. Most compression schemes are designed primarily to achieve good compression. During query processing they can be orders of magnitude slower than their uncompressed counterparts. The new schemes are designed to bridge this performance gap by trading some compression effectiveness for improved operation speed. In a number of tests on both synthetic data and real application data, we found that the new schemes significantly outperform the well-known compression schemes while using only modestly more space. For example, compared to the Byte-aligned Bitmap Code, the new schemes are 12 times faster and use only 50 percent more space. The new schemes use much less space (<30 percent) than the uncompressed scheme and are faster in a majority of the test cases.
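
    A simplified run-length compression sketch (not the byte- or word-aligned codes evaluated above) showing the key property that queries can operate directly on the compressed form: two compressed bitmaps are ANDed run-by-run without decompressing them.

```python
def rle_encode(bits):
    """Compress a 0/1 sequence into (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_and(a, b):
    """Bitwise AND of two run-length encoded bitmaps, computed on the runs themselves."""
    out, i, j = [], 0, 0
    rem_a, rem_b = (a[0][1], b[0][1]) if a and b else (0, 0)
    while i < len(a) and j < len(b):
        take = min(rem_a, rem_b)                 # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += take
        else:
            out.append([bit, take])
        rem_a -= take
        rem_b -= take
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

x = [1, 1, 1, 0, 0, 1, 0, 1]
y = [1, 0, 1, 0, 1, 1, 1, 1]
assert rle_and(rle_encode(x), rle_encode(y)) == rle_encode([p & q for p, q in zip(x, y)])
print(rle_and(rle_encode(x), rle_encode(y)))
```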

  15. Parallel Digital Watermarking Process on Ultrasound Medical Images in Multicores Environment

    PubMed Central

    Khor, Hui Liang; Liew, Siau-Chuin; Zain, Jasni Mohd.

    2016-01-01

    With advances in communication network technology, digital medical images can be transmitted to healthcare professionals via internal networks or public networks (e.g., the Internet), but this also exposes the transmitted images to security threats, such as image tampering or the insertion of false data, which may cause an inaccurate diagnosis and treatment. Medical image distortion is not to be tolerated for diagnosis purposes; thus digital watermarking of medical images is introduced. So far most watermarking research has been done on single-frame medical images, which is impractical in the real environment. In this paper, digital watermarking of multiframe medical images is proposed. In order to reduce the multiframe watermarking processing time, parallel watermarking processing of medical images utilizing multicore technology is introduced. Experimental results show that the elapsed time for parallel watermarking processing is much shorter than for sequential watermarking processing. PMID:26981111

  16. Parallel Digital Watermarking Process on Ultrasound Medical Images in Multicores Environment.

    PubMed

    Khor, Hui Liang; Liew, Siau-Chuin; Zain, Jasni Mohd

    2016-01-01

    With advances in communication network technology, digital medical images can be transmitted to healthcare professionals via internal networks or public networks (e.g., the Internet), but this also exposes the transmitted images to security threats, such as image tampering or the insertion of false data, which may cause an inaccurate diagnosis and treatment. Medical image distortion is not to be tolerated for diagnosis purposes; thus digital watermarking of medical images is introduced. So far most watermarking research has been done on single-frame medical images, which is impractical in the real environment. In this paper, digital watermarking of multiframe medical images is proposed. In order to reduce the multiframe watermarking processing time, parallel watermarking processing of medical images utilizing multicore technology is introduced. Experimental results show that the elapsed time for parallel watermarking processing is much shorter than for sequential watermarking processing. PMID:26981111

  17. An Inconvenient Truth: An Application of the Extended Parallel Process Model

    ERIC Educational Resources Information Center

    Goodall, Catherine E.; Roberto, Anthony J.

    2008-01-01

    "An Inconvenient Truth" is an Academy Award-winning documentary about global warming presented by Al Gore. This documentary is appropriate for a lesson on fear appeals and the extended parallel process model (EPPM). The EPPM is concerned with the effects of perceived threat and efficacy on behavior change. Perceived threat is composed of an…

  18. Cocaine Use and Delinquent Behavior among High-Risk Youths: A Growth Model of Parallel Processes

    ERIC Educational Resources Information Center

    Dembo, Richard; Sullivan, Christopher

    2009-01-01

    We report the results of a parallel-process, latent growth model analysis examining the relationships between cocaine use and delinquent behavior among youths. The study examined a sample of 278 justice-involved juveniles completing at least one of three follow-up interviews as part of a National Institute on Drug Abuse-funded study. The results…

  19. Tracking the Continuity of Language Comprehension: Computer Mouse Trajectories Suggest Parallel Syntactic Processing

    ERIC Educational Resources Information Center

    Farmer, Thomas A.; Cargill, Sarah A.; Hindy, Nicholas C.; Dale, Rick; Spivey, Michael J.

    2007-01-01

    Although several theories of online syntactic processing assume the parallel activation of multiple syntactic representations, evidence supporting simultaneous activation has been inconclusive. Here, the continuous and non-ballistic properties of computer mouse movements are exploited, by recording their streaming x, y coordinates to procure…

  20. The Large Laboratory Course: Organize It to Parallel Industrial Process Development.

    ERIC Educational Resources Information Center

    Eckert, Roger E.; Ybarra, Robert M.

    1988-01-01

    Describes a senior level chemical engineering course at Purdue University that parallels an industrial process development department. Stresses the course organization, manager-engineer contract, evaluation of students, course evaluation, and gives examples of course improvements made during the course. (CW)

  1. One Factor or Two Parallel Processes? Comorbidity and Development of Adolescent Anxiety and Depressive Disorder Symptoms

    ERIC Educational Resources Information Center

    Hale, William W., III; Raaijmakers, Quinten A. W.; Muris, Peter; van Hoof, Anne; Meeus, Wim H. J.

    2009-01-01

    Background: This study investigates whether anxiety and depressive disorder symptoms of adolescents from the general community are best described by a model that assumes they are indicative of one general factor or by a model that assumes they are two distinct disorders with parallel growth processes. Additional analyses were conducted to explore…

  2. High Performance Parallel Processing Project: Industrial computing initiative. Progress reports for fiscal year 1995

    SciTech Connect

    Koniges, A.

    1996-02-09

    This project is a package of 11 individual CRADAs plus hardware. This innovative project established a three-year multi-party collaboration that is significantly accelerating the availability of commercial massively parallel processing computing software technology to U.S. government, academic, and industrial end-users. This report contains individual presentations from nine principal investigators along with overall program information.

  3. Parallel Distributed Processing at 25: Further Explorations in the Microstructure of Cognition

    ERIC Educational Resources Information Center

    Rogers, Timothy T.; McClelland, James L.

    2014-01-01

    This paper introduces a special issue of "Cognitive Science" initiated on the 25th anniversary of the publication of "Parallel Distributed Processing" (PDP), a two-volume work that introduced the use of neural network models as vehicles for understanding cognition. The collection surveys the core commitments of the PDP…

  4. Using the Extended Parallel Process Model to Examine Teachers' Likelihood of Intervening in Bullying

    ERIC Educational Resources Information Center

    Duong, Jeffrey; Bradshaw, Catherine P.

    2013-01-01

    Background: Teachers play a critical role in protecting students from harm in schools, but little is known about their attitudes toward addressing problems like bullying. Previous studies have rarely used theoretical frameworks, making it difficult to advance this area of research. Using the Extended Parallel Process Model (EPPM), we examined the…

  5. Psychodrama: A Creative Approach for Addressing Parallel Process in Group Supervision

    ERIC Educational Resources Information Center

    Hinkle, Michelle Gimenez

    2008-01-01

    This article provides a model for using psychodrama to address issues of parallel process during group supervision. Information on how to utilize the specific concepts and techniques of psychodrama in relation to group supervision is discussed. A case vignette of the model is provided.

  6. A Neurally Plausible Parallel Distributed Processing Model of Event-Related Potential Word Reading Data

    ERIC Educational Resources Information Center

    Laszlo, Sarah; Plaut, David C.

    2012-01-01

    The Parallel Distributed Processing (PDP) framework has significant potential for producing models of cognitive tasks that approximate how the brain performs the same tasks. To date, however, there has been relatively little contact between PDP modeling and data from cognitive neuroscience. In an attempt to advance the relationship between…

  7. Parallel Process and Isomorphism: A Model for Decision Making in the Supervisory Triad

    ERIC Educational Resources Information Center

    Koltz, Rebecca L.; Odegard, Melissa A.; Feit, Stephen S.; Provost, Kent; Smith, Travis

    2012-01-01

    Parallel process and isomorphism are two supervisory concepts that are often discussed independently but rarely discussed in connection with each other. These two concepts, philosophically, have different historical roots, as well as different implications for interventions with regard to the supervisory triad. The authors examine the difference…

  8. Real-time SHVC software decoding with multi-threaded parallel processing

    NASA Astrophysics Data System (ADS)

    Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

    2014-09-01

    This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7-2600 processor running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads is compared in terms of decoding speed and resource usage, including processor and memory.
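
    A toy three-stage pipeline over groups of CTUs, connected by bounded queues so that entropy decoding, reconstruction, and in-loop filtering of different groups overlap in time. The stage bodies are placeholders and none of the SHM/SHVC specifics are reproduced.

```python
import threading, queue

STOP = object()

def stage(name, work, inbox, outbox=None):
    """Run one pipeline stage: pull a CTU group, process it, pass it downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            if outbox is not None:
                outbox.put(STOP)             # propagate shutdown to the next stage
            break
        result = work(item)                  # real decoding work would happen here
        if outbox is not None:
            outbox.put(result)
        else:
            print(f"{name}: finished CTU group {result}")

# Bounded queues give backpressure between the pipeline stages.
q_entropy, q_recon, q_filter = queue.Queue(4), queue.Queue(4), queue.Queue(4)

threads = [
    threading.Thread(target=stage, args=("entropy", lambda g: g, q_entropy, q_recon)),
    threading.Thread(target=stage, args=("recon",   lambda g: g, q_recon, q_filter)),
    threading.Thread(target=stage, args=("filter",  lambda g: g, q_filter)),
]
for t in threads:
    t.start()

for ctu_group in range(8):                   # one frame split into 8 CTU groups (hypothetical)
    q_entropy.put(ctu_group)
q_entropy.put(STOP)
for t in threads:
    t.join()
```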

  9. Process Simulation of Complex Biological Pathways in Physical Reactive Space and Reformulated for Massively Parallel Computing Platforms.

    PubMed

    Ganesan, Narayan; Li, Jie; Sharma, Vishakha; Jiang, Hanyu; Compagnoni, Adriana

    2016-01-01

    Biological systems encompass complexity that far surpasses many artificial systems. Modeling and simulation of large and complex biochemical pathways is a computationally intensive challenge. Traditional tools, such as ordinary differential equations, partial differential equations, stochastic master equations, and Gillespie type methods, are all limited either by their modeling fidelity or computational efficiency or both. In this work, we present a scalable computational framework based on modeling biochemical reactions in explicit 3D space, that is suitable for studying the behavior of large and complex biological pathways. The framework is designed to exploit parallelism and scalability offered by commodity massively parallel processors such as the graphics processing units (GPUs) and other parallel computing platforms. The reaction modeling in 3D space is aimed at enhancing the realism of the model compared to traditional modeling tools and framework. We introduce the Parallel Select algorithm that is key to breaking the sequential bottleneck limiting the performance of most other tools designed to study biochemical interactions. The algorithm is designed to be computationally tractable, handle hundreds of interacting chemical species and millions of independent agents by considering all-particle interactions within the system. We also present an implementation of the framework on the popular graphics processing units and apply it to the simulation study of JAK-STAT Signal Transduction Pathway. The computational framework will offer a deeper insight into various biological processes within the cell and help us observe key events as they unfold in space and time. This will advance the current state-of-the-art in simulation study of large scale biological systems and also enable the realistic simulation study of macro-biological cultures, where inter-cellular interactions are prevalent. PMID:27045833

  10. Parallel pulse processing and data acquisition for high speed, low error flow cytometry

    SciTech Connect

    van den Engh, Gerrit J.; Stokdijk, Willem

    1992-01-01

    A digitally synchronized parallel pulse processing and data acquisition system for a flow cytometer has multiple parallel input channels with independent pulse digitization and FIFO storage buffer. A trigger circuit controls the pulse digitization on all channels. After an event has been stored in each FIFO, a bus controller moves the oldest entry from each FIFO buffer onto a common data bus. The trigger circuit generates an ID number for each FIFO entry, which is checked by an error detection circuit. The system has high speed and low error rate.
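
    A software analogue of the acquisition scheme: the trigger stamps each event with an ID, every channel digitizes into its own FIFO, and the bus controller pops the oldest entry from each FIFO while the error-detection step verifies that all IDs agree. Channel count and pulse values are made up.

```python
from collections import deque
import random

random.seed(0)
N_CHANNELS = 3
fifos = [deque() for _ in range(N_CHANNELS)]

def trigger(event_id):
    """Trigger circuit: every channel digitizes the same event, tagged with one ID."""
    for fifo in fifos:
        pulse_height = random.random()            # stand-in for the digitized pulse
        fifo.append((event_id, pulse_height))

def bus_controller():
    """Move the oldest entry of each FIFO onto the 'bus' and check that the IDs agree."""
    records = []
    while all(fifos):
        entries = [fifo.popleft() for fifo in fifos]
        ids = {event_id for event_id, _ in entries}
        if len(ids) != 1:                          # error-detection circuit
            raise RuntimeError(f"channel desynchronization detected: {ids}")
        records.append((ids.pop(), [value for _, value in entries]))
    return records

for event_id in range(5):
    trigger(event_id)
for event_id, values in bus_controller():
    print(event_id, [round(v, 3) for v in values])
```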

  11. Parallel pulse processing and data acquisition for high speed, low error flow cytometry

    DOEpatents

    Engh, G.J. van den; Stokdijk, W.

    1992-09-22

    A digitally synchronized parallel pulse processing and data acquisition system for a flow cytometer has multiple parallel input channels with independent pulse digitization and FIFO storage buffer. A trigger circuit controls the pulse digitization on all channels. After an event has been stored in each FIFO, a bus controller moves the oldest entry from each FIFO buffer onto a common data bus. The trigger circuit generates an ID number for each FIFO entry, which is checked by an error detection circuit. The system has high speed and low error rate. 17 figs.

  12. mHealthMon: toward energy-efficient and distributed mobile health monitoring using parallel offloading.

    PubMed

    Ahnn, Jong Hoon; Potkonjak, Miodrag

    2013-10-01

    Although mobile health monitoring, where mobile sensors continuously gather, process, and update sensor readings (e.g. vital signals) from a patient's sensors, is emerging, little effort has been invested in the energy-efficient management of sensor information gathering and processing. Mobile health monitoring with a focus on energy consumption may instead be holistically analyzed and systematically designed as a global solution to optimization subproblems. This paper presents an attempt to decompose the very complex mobile health monitoring system, in which each layer of the system corresponds to a decomposed subproblem and the interfaces between them are quantified as functions of the optimization variables in order to orchestrate the subproblems. We propose a distributed and energy-saving mobile health platform, called mHealthMon, where mobile users publish/access sensor data via a cloud computing-based distributed P2P overlay network. The key objective is to satisfy the mobile health monitoring application's quality of service requirements by modeling each subsystem: mobile clients with medical sensors, the wireless network medium, and distributed cloud services. Through simulations based on experimental data, we show that the proposed system can be up to 10.1 times more energy-efficient and 20.2 times faster than a standalone mobile health monitoring application in various mobile health monitoring scenarios applying a realistic mobility model. PMID:23897403

  13. Toward energy-efficient and distributed mobile health monitoring using parallel offloading.

    PubMed

    Ahnn, Jong Hoon; Potkonjak, Miodrag

    2013-01-01

    Although mobile health monitoring, where mobile sensors continuously gather, process, and update sensor readings (e.g. vital signals) from a patient's sensors, is emerging, little effort has been invested in the energy-efficient management of sensor information gathering and processing. Mobile health monitoring with a focus on energy consumption may instead be holistically analyzed and systematically designed as a global solution to optimization subproblems. We propose a distributed and energy-saving mobile health platform, called mHealthMon, where mobile users publish/access sensor data via a cloud computing-based distributed P2P overlay network. The key objective is to satisfy the mobile health monitoring application's quality of service requirements by modeling each subsystem: mobile clients with medical sensors, the wireless network medium, and distributed cloud services. Through simulations based on experimental data, we show that the proposed system can be up to 10.1 times more energy-efficient and 20.2 times faster than a standalone mobile health monitoring application in various mobile health monitoring scenarios applying a realistic mobility model. PMID:24111420

  14. A Pervasive Parallel Processing Framework For Data Visualization And Analysis At Extreme Scale Final Scientific and Technical Report

    SciTech Connect

    Geveci, Berk

    2014-10-31

    The evolution of the computing world from teraflop to petaflop has been relatively effortless, with several of the existing programming models scaling effectively to the petascale. The migration to exascale, however, poses considerable challenges. All industry trends infer that the exascale machine will be built using processors containing hundreds to thousands of cores per chip. It can be inferred that efficient concurrency on exascale machines requires a massive amount of concurrent threads, each performing many operations on a localized piece of data. Currently, visualization libraries and applications are based off what is known as the visualization pipeline. In the pipeline model, algorithms are encapsulated as filters with inputs and outputs. These filters are connected by setting the output of one component to the input of another. Parallelism in the visualization pipeline is achieved by replicating the pipeline for each processing thread. This works well for today’s distributed memory parallel computers but cannot be sustained when operating on processors with thousands of cores. Our project investigates a new visualization framework designed to exhibit the pervasive parallelism necessary for extreme scale machines. Our framework achieves this by defining algorithms in terms of worklets, which are localized stateless operations. Worklets are atomic operations that execute when invoked, unlike filters, which execute when a pipeline request occurs. The worklet design allows execution on a massive amount of lightweight threads with minimal overhead. Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale machine.
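
    A minimal illustration of the worklet idea described above: a stateless per-element function is handed to an executor that decides how to map it over the data, here with a process pool standing in for the massive number of lightweight threads the framework targets. All names are hypothetical.

```python
from multiprocessing import Pool
import math

def magnitude_worklet(vec):
    """A 'worklet': stateless, operates on one localized piece of data."""
    return math.sqrt(sum(c * c for c in vec))

def execute(worklet, field, processes=4):
    """The executor maps the worklet over every element; scheduling is its concern,
    not the worklet's, so the same worklet could be dispatched to other backends."""
    with Pool(processes) as pool:
        return pool.map(worklet, field)

if __name__ == "__main__":
    velocity_field = [(1.0, 2.0, 2.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
    print(execute(magnitude_worklet, velocity_field))   # [3.0, 5.0, 5.0]
```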

  15. Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

    NASA Astrophysics Data System (ADS)

    Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

    2016-04-01

    Advances in graphics processing units' technology towards encompassing parallel architectures [1], comprised of thousands of cores and multiples of parallel threads, provide the foundation in terms of hardware for the rapid processing of various parallel applications regarding seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel for comparison, are: fuzzy k-means clustering with expert knowledge [7] in assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and

  16. Efficient computer architecture for the realization of realtime linear systems with intelligent processing of large sparse matrices

    SciTech Connect

    Chae, S.H.

    1988-01-01

    This dissertation describes an intelligent rule-based iterative parallel algorithm for solving randomly distributed large sparse linear systems of equations, and also the efficient parallel processing computer architecture for the implementation of the algorithm. Implemented with the Jacobi iterative method, the intelligent rule-based algorithm reduces the parallel execution time by reducing the individual inner product operation time. A static dataflow architecture is proposed for implementing the intelligent rule-based iterative parallel algorithm. The dataflow computer architecture has the capability to support the parallelism exploited in the algorithm and to execute the algorithm asynchronously. The proposed computer architecture consists of a main processor, several control processors, scalar slave processors, and pipelined slave processors. Several control processors share with the main processor the heavy burden of allocation of operation packets and of synchronization for parallel processing.
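
    A plain Jacobi iteration on a sparse, diagonally dominant system (using scipy.sparse, an assumption here): each component update depends only on the previous iterate, which is the property that makes the method attractive for the parallel and dataflow hardware discussed above. The rule-based inner-product reduction itself is not modeled, and the test matrix is hypothetical.

```python
import numpy as np
from scipy.sparse import diags

def jacobi(A, b, tol=1e-10, max_iter=500):
    """Jacobi iteration x_{k+1} = D^{-1} (b - (A - D) x_k); every component update
    is independent of the others within one sweep."""
    D = A.diagonal()
    R = A - diags(D)
    x = np.zeros_like(b)
    for k in range(max_iter):
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

# Diagonally dominant tridiagonal test system.
n = 100
A = diags([np.full(n - 1, -1.0), np.full(n, 4.0), np.full(n - 1, -1.0)],
          offsets=[-1, 0, 1], format="csr")
b = np.ones(n)
x, iters = jacobi(A, b)
print(iters, np.linalg.norm(A @ x - b))
```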

  17. Parallel multigrid solver of radiative transfer equation for photon transport via graphics processing unit.

    PubMed

    Gao, Hao; Phan, Lan; Lin, Yuting

    2012-09-01

    A graphics processing unit-based parallel multigrid solver for a radiative transfer equation with vacuum boundary condition or reflection boundary condition is presented for heterogeneous media with complex geometry based on two-dimensional triangular meshes or three-dimensional tetrahedral meshes. The computational complexity of this parallel solver is linearly proportional to the degrees of freedom in both angular and spatial variables, while the full multigrid method is utilized to minimize the number of iterations. The overall gain of speed is roughly 30 to 300 fold with respect to our prior multigrid solver, which depends on the underlying regime and the parallelization. The numerical validations are presented with the MATLAB codes at https://sites.google.com/site/rtefastsolver/. PMID:23085905

  18. A multi-satellite orbit determination problem in a parallel processing environment

    NASA Technical Reports Server (NTRS)

    Deakyne, M. S.; Anderle, R. J.

    1988-01-01

    The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube Parallel Processor to investigate the performance of parallel processors, and to gain experience with them, on a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube. Problems encountered or successes achieved in addressing the orbit determination problem would then be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems, and the facility was used to bring a realism to the study which would highlight problems that may not otherwise be anticipated. Secondary objectives were to gain experience with a non-trivial problem in a parallel processor environment, to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies, and to explore the granularity (coarse vs. fine grain): above a certain granularity limit there is a risk of starvation, in which the majority of nodes are idle, while below the limit the overhead associated with splitting the problem may require more work and communication time than is useful.

  19. Parallel processing in a host plus multiple array processor system for radar

    NASA Technical Reports Server (NTRS)

    Barkan, B. Z.

    1983-01-01

    Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

  20. An automated workflow for parallel processing of large multiview SPIM recordings

    PubMed Central

    Schmied, Christopher; Steinbach, Peter; Pietzsch, Tobias; Preibisch, Stephan; Tomancak, Pavel

    2016-01-01

    Summary: Selective Plane Illumination Microscopy (SPIM) allows imaging of developing organisms in 3D at unprecedented temporal resolution over long periods of time. The resulting massive amounts of raw image data require extensive processing, done interactively via dedicated graphical user interface (GUI) applications. The consecutive processing steps can be easily automated and the individual time points can be processed independently, which lends itself to trivial parallelization on a high performance computing (HPC) cluster. Here, we introduce an automated workflow for processing large multiview, multichannel, multiillumination time-lapse SPIM data on a single workstation or in parallel on an HPC cluster. The pipeline relies on snakemake to resolve dependencies among consecutive processing steps and can be easily adapted to any cluster environment for processing SPIM data in a fraction of the time required to collect it. Availability and implementation: The code is distributed free and open source under the MIT license http://opensource.org/licenses/MIT. The source code can be downloaded from github: https://github.com/mpicbg-scicomp/snakemake-workflows. Documentation can be found here: http://fiji.sc/Automated_workflow_for_parallel_Multiview_Reconstruction. Contact: schmied@mpi-cbg.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26628585

  1. Massively parallel per-pixel-based zerotree processing architecture for real-time video compression

    NASA Astrophysics Data System (ADS)

    Alagoda, Geoffrey; Rassau, Alexander M.; Eshraghian, Kamran

    2001-11-01

    In the span of a few years, mobile multimedia communication has rapidly become a significant area of research and development constantly challenging boundaries on a variety of technological fronts. Video compression, a fundamental component for most mobile multimedia applications, generally places heavy demands in terms of the required processing capacity. Hardware implementations of typical modern hybrid codecs require realisation of components such as motion compensation, wavelet transform, quantisation, zerotree coding and arithmetic coding in real-time. While the implementation of such codecs using a fast generic processor is possible, undesirable trade-offs in terms of power consumption and speed must generally be made. The improvement in power consumption that is achievable through the use of a slow-clocked massively parallel processing environment, while maintaining real-time processing speeds, should thus not be overlooked. An architecture to realise such a massively parallel solution for a zerotree entropy coder is, therefore, presented in this paper.

  2. Parallel processing of large datasets from NanoLC-FTICR-MS measurements.

    PubMed

    van der Burgt, Y E M; Taban, I M; Konijnenburg, M; Biskup, M; Duursma, M C; Heeren, R M A; Römpp, A; van Nieuwpoort, R V; Bal, H E

    2007-01-01

    A new approach for automatic parallel processing of large mass spectral datasets in a distributed computing environment is demonstrated to significantly decrease the total processing time. The implementation of this novel approach is described and evaluated for large nanoLC-FTICR-MS datasets. The speed benefits are determined by the network speed and file transfer protocols only and allow almost real-time analysis of complex data (e.g., a 3-gigabyte raw dataset is fully processed within 5 min). Key advantages of this approach are not limited to the improved analysis speed, but also include the improved flexibility, reproducibility, and the possibility to share and reuse the pre- and postprocessing strategies. The storage of all raw data combined with the massively parallel processing approach described here allows the scientist to reprocess data with a different set of parameters (e.g., apodization, calibration, noise reduction), as is recommended by the proteomics community. This approach of parallel processing was developed in the Virtual Laboratory for e-Science (VL-e), a science portal that aims at allowing access to users outside the computer research community. As such, this strategy can be applied to all types of serially acquired large mass spectral datasets such as LC-MS, LC-MS/MS, and high-resolution imaging MS results. PMID:17055738

  3. Lossy hyperspectral image compression on a graphics processing unit: parallelization strategy and performance evaluation

    NASA Astrophysics Data System (ADS)

    Santos, Lucana; Magli, Enrico; Vitulli, Raffaele; Núñez, Antonio; López, José F.; Sarmiento, Roberto

    2013-01-01

    There is a pressing need for new hardware architectures for the implementation of algorithms for hyperspectral image compression on board satellites. Graphics processing units (GPUs) represent a very attractive opportunity, offering the possibility to dramatically increase the computation speed in applications that are data and task parallel. An algorithm for the lossy compression of hyperspectral images is implemented on a GPU using the Nvidia compute unified device architecture (CUDA) parallel computing architecture. The parallelization strategy is explained, with emphasis on the entropy coding and bit packing phases, for which a more sophisticated strategy is necessary due to the existing data dependencies. Experimental results are obtained by comparing the performance of the GPU implementation with a single-threaded CPU implementation, showing high speedups of up to 15.41. A profiling of the algorithm is provided, demonstrating the high performance of the designed parallel entropy coding phase. The accuracy of the GPU implementation is presented, as well as the effect of the configuration parameters on performance. The convenience of using GPUs for on-board processing is demonstrated, and solutions to the potential difficulties encountered when accelerating hyperspectral compression algorithms are proposed, if space-qualified GPUs become a reality in the near future.

  4. Application of parallel computing to seismic damage process simulation of an arch dam

    NASA Astrophysics Data System (ADS)

    Zhong, Hong; Lin, Gao; Li, Jianbo

    2010-06-01

    The simulation of the damage process of a high arch dam subjected to strong earthquake shocks is significant to the evaluation of its performance and seismic safety, considering the catastrophic effect of dam failure. However, such numerical simulation requires rigorous computational capacity. Conventional serial computing falls short of that, and parallel computing is a fairly promising solution to this problem. The parallel finite element code PDPAD was developed for the damage prediction of arch dams, utilizing a damage model that accounts for the heterogeneity of concrete. Developed in Fortran, the code uses a master/slave programming model, the domain decomposition method for allocation of tasks, MPI (Message Passing Interface) for communication, and solvers from the AZTEC library for solution of large-scale equations. A speedup test showed that the performance of PDPAD was quite satisfactory. The code was employed to study the damage process of an arch dam under construction on a 4-node PC cluster, with more than one million degrees of freedom considered. The obtained damage mode was quite similar to that of the shaking table test, indicating that the proposed procedure and parallel code PDPAD have good potential in simulating the seismic damage mode of arch dams. With the rapidly growing need for massive computation emerging from engineering problems, parallel computing will find more and more applications in pertinent areas.

  5. Efficient audio signal processing for embedded systems

    NASA Astrophysics Data System (ADS)

    Chiu, Leung Kin

    As mobile platforms continue to pack on more computational power, electronics manufacturers start to differentiate their products by enhancing the audio features. However, consumers also demand smaller devices that can operate for a longer time, hence imposing design constraints. In this research, we investigate two design strategies that would allow us to efficiently process audio signals on embedded systems such as mobile phones and portable electronics. In the first strategy, we exploit properties of the human auditory system to process audio signals. We designed a sound enhancement algorithm to make piezoelectric loudspeakers sound "richer" and "fuller." Piezoelectric speakers have a small form factor but exhibit poor response in the low-frequency region. In the algorithm, we combine psychoacoustic bass extension and dynamic range compression to improve the perceived bass coming out from the tiny speakers. We also developed an audio energy reduction algorithm for loudspeaker power management. The perceptually transparent algorithm extends the battery life of mobile devices and prevents thermal damage in speakers. This method is similar to audio compression algorithms, which encode audio signals in such a way that the compression artifacts are not easily perceivable. Instead of reducing the storage space, however, we suppress the audio contents that are below the hearing threshold, therefore reducing the signal energy. In the second strategy, we use low-power analog circuits to process the signal before digitizing it. We designed an analog front-end for sound detection and implemented it on a field programmable analog array (FPAA). The system is an example of an analog-to-information converter. The sound classifier front-end can be used in a wide range of applications because programmable floating-gate transistors are employed to store classifier weights. Moreover, we incorporated a feature selection algorithm to simplify the analog front-end. A machine
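
    The record above pairs psychoacoustic bass extension with dynamic range compression but gives no implementation detail. As a rough, illustrative sketch of the dynamic-range-compression half only (not the authors' algorithm), the following Python fragment applies a simple feed-forward compressor driven by a smoothed envelope follower; all parameter names and values are assumptions chosen for the example.

```python
import numpy as np

def compress_dynamic_range(x, fs, threshold_db=-20.0, ratio=4.0,
                           attack_ms=5.0, release_ms=50.0):
    """Feed-forward dynamic range compressor (illustrative only).

    x            : mono signal in [-1, 1]
    threshold_db : level above which gain reduction is applied
    ratio        : compression ratio (e.g. 4:1)
    """
    # Smoothed envelope follower with separate attack/release time constants.
    a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
    env = np.zeros_like(x)
    level = 0.0
    for n, sample in enumerate(np.abs(x)):
        a = a_att if sample > level else a_rel
        level = a * level + (1.0 - a) * sample
        env[n] = level

    env_db = 20.0 * np.log10(np.maximum(env, 1e-9))
    # Gain reduction is applied only above the threshold.
    over = np.maximum(env_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)
    return x * (10.0 ** (gain_db / 20.0))

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = 0.9 * np.sin(2 * np.pi * 440 * t)        # loud test tone
    out = compress_dynamic_range(tone, fs)
    print("peak before/after:", np.max(np.abs(tone)), np.max(np.abs(out)))
```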

  6. pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs

    SciTech Connect

    Wu, Changjun; Kalyanaraman, Anantharaman; Cannon, William R.

    2012-09-15

    Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for detecting homology on large data sets using distributed memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048 processor distributed memory cluster for a wide range of inputs ranging from as small as 20,000 sequences to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.
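
    pGraph itself is described as a hybrid of a hierarchical multiple-master/worker model and a producer-consumer model; its actual implementation is not reproduced here. The toy Python sketch below only illustrates the producer-consumer idea on a single machine: a producer enqueues sequence pairs, workers pull tasks from a shared queue, and a dummy scoring function stands in for the alignment kernel. All names and the scoring function are illustrative assumptions.

```python
import itertools
import multiprocessing as mp

def align(pair):
    """Stand-in for a pairwise alignment kernel: returns a toy similarity score."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b))    # not a real alignment

def worker(task_q, result_q):
    # Each worker consumes tasks until it sees the poison pill (None).
    for pair in iter(task_q.get, None):
        result_q.put((pair, align(pair)))

if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTTCGT", "TTGTACGA", "ACGAACGT"]
    n_workers = 4
    task_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(task_q, result_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()

    # Producer: generate all pairs (the irregular work that pGraph must balance).
    n_tasks = 0
    for pair in itertools.combinations(seqs, 2):
        task_q.put(pair)
        n_tasks += 1
    for _ in procs:                              # one poison pill per worker
        task_q.put(None)

    for _ in range(n_tasks):
        pair, score = result_q.get()
        print(pair, score)
    for p in procs:
        p.join()
```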

  7. Parallel Processing of Big Point Clouds Using Z-Order Partitioning

    NASA Astrophysics Data System (ADS)

    Alis, C.; Boehm, J.; Liu, K.

    2016-06-01

    As laser scanning technology improves and costs come down, the amount of point cloud data being generated can be prohibitively difficult and expensive to process on a single machine. This data explosion is not only limited to point cloud data. Voluminous amounts of high-dimensionality and quickly accumulating data, collectively known as Big Data, such as those generated by social media, Internet of Things devices and commercial transactions, are becoming more prevalent as well. New computing paradigms and frameworks are being developed to efficiently handle the processing of Big Data, many of which utilize a compute cluster composed of several commodity grade machines to process chunks of data in parallel. A central concept in many of these frameworks is data locality. By its nature, Big Data is large enough that the entire dataset would not fit on the memory and hard drives of a single node, hence replicating the entire dataset to each worker node is impractical. The data must then be partitioned across worker nodes in a manner that minimises data transfer across the network. This is a challenge for point cloud data because there exist different ways to partition data and they may require data transfer. We propose a partitioning based on Z-order, which is a form of locality-sensitive hashing. The Z-order or Morton code is computed by dividing each dimension to form a grid then interleaving the binary representation of each dimension. For example, the Z-order code for the grid square with coordinates (x = 1 = 01₂, y = 3 = 11₂) is 1011₂ = 11. The number of points in each partition is controlled by the number of bits per dimension: the more bits, the fewer the points. The number of bits per dimension also controls the level of detail, with more bits yielding finer partitioning. We present this partitioning method by implementing it on Apache Spark and investigating how different parameters affect the accuracy and running time of the k nearest neighbour algorithm
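
    The bit-interleaving step described above is easy to make concrete. The sketch below computes the Morton (Z-order) code by interleaving the bits of the gridded x and y coordinates, reproducing the example from the abstract (x = 1, y = 3 gives code 11); the z_partition helper and its parameters are illustrative assumptions rather than the paper's Spark implementation.

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of x and y to produce the Z-order (Morton) index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x occupies the even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y occupies the odd bit positions
    return code

# The example from the abstract: x = 1 (binary 01), y = 3 (binary 11) -> 1011 = 11.
assert morton_code(1, 3, bits=2) == 0b1011 == 11

def z_partition(px, py, cell_size, bits_per_dim):
    """Map a point to its Z-order partition key by gridding, then interleaving."""
    gx = int(px // cell_size)
    gy = int(py // cell_size)
    return morton_code(gx, gy, bits_per_dim)

print(z_partition(12.3, 45.6, cell_size=10.0, bits_per_dim=8))
```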

  8. An FPGA-based High Speed Parallel Signal Processing System for Adaptive Optics Testbed

    NASA Astrophysics Data System (ADS)

    Kim, H.; Choi, Y.; Yang, Y.

    In this paper a state-of-the-art FPGA (Field Programmable Gate Array) based high speed parallel signal processing system (SPS) for an adaptive optics (AO) testbed with 1 kHz wavefront error (WFE) correction frequency is reported. The AO system consists of a Shack-Hartmann sensor (SHS) and deformable mirror (DM), a tip-tilt sensor (TTS), a tip-tilt mirror (TTM) and an FPGA-based high performance SPS to correct wavefront aberrations. The SHS is composed of 400 subapertures and the DM of 277 actuators with Fried geometry, requiring an SPS with high speed parallel computing capability. In this study, the target WFE correction speed is 1 kHz; therefore, it requires massive parallel computing capability and imposes strict hard real-time constraints on measurements from sensors, matrix computation latency for correction algorithms, and output of control signals for actuators. In order to meet them, an FPGA based real-time SPS with parallel computing capabilities is proposed. In particular, the SPS is made up of a National Instruments (NI) real-time computer and five FPGA boards based on the state-of-the-art Xilinx Kintex 7 FPGA. Programming is done with NI's LabView environment, providing flexibility when applying different algorithms for WFE correction. It also facilitates a faster programming and debugging environment as compared to conventional ones. One of the five FPGAs is assigned to measure the TTS and calculate control signals for the TTM, while the other four are used to receive the SHS signal, calculate slopes for each subaperture and the correction signal for the DM. With these parallel processing capabilities of the SPS, the overall closed-loop WFE correction speed of 1 kHz has been achieved. System requirements, architecture and implementation issues are described; furthermore, experimental results are also given.

  9. Use of maximum entropy method with parallel processing machine. [for x-ray object image reconstruction

    NASA Technical Reports Server (NTRS)

    Yin, Lo I.; Bielefeld, Michael J.

    1987-01-01

    The maximum entropy method (MEM) and balanced correlation method were used to reconstruct the images of low-intensity X-ray objects obtained experimentally by means of a uniformly redundant array coded aperture system. The reconstructed images from MEM are clearly superior. However, the MEM algorithm is computationally more time-consuming because of its iterative nature. On the other hand, both the inherently two-dimensional character of images and the iterative computations of MEM suggest the use of parallel processing machines. Accordingly, computations were carried out on the massively parallel processor at Goddard Space Flight Center as well as on the serial processing machine VAX 8600, and the results are compared.

  10. Parallel, multi-stage processing of colors, faces and shapes in macaque inferior temporal cortex

    PubMed Central

    Lafer-Sousa, Rosa; Conway, Bevil R.

    2014-01-01

    Visual-object processing culminates in inferior temporal (IT) cortex. To assess the organization of IT, we measured fMRI responses in alert monkeys to achromatic images (faces, fruit, bodies, places) and colored gratings. IT contained multiple color-biased regions, which were typically ventral to face patches and, remarkably, yoked to them, spaced regularly at four locations predicted by known anatomy. Color and face selectivity increased for more anterior regions, indicative of a broad hierarchical arrangement. Responses to non-face shapes were found across IT, but were stronger outside color-biased regions and face patches, consistent with multiple parallel streams. IT also contained multiple coarse eccentricity maps: face patches overlapped central representations; color-biased regions spanned mid-peripheral representations; and place-biased regions overlapped peripheral representations. These results suggest that IT comprises parallel, multi-stage processing networks subject to one organizing principle. PMID:24141314

  11. Efficient parallel seismic simulations including topography and 3-D material heterogeneities on locally refined composite grids

    NASA Astrophysics Data System (ADS)

    Petersson, Anders; Rodgers, Arthur

    2010-05-01

    The finite difference method on a uniform Cartesian grid is a highly efficient and easy to implement technique for solving the elastic wave equation in seismic applications. However, the spacing in a uniform Cartesian grid is fixed throughout the computational domain, whereas the resolution requirements in realistic seismic simulations usually are higher near the surface than at depth. This can be seen from the well-known formula h ≤ L/P, which relates the grid spacing h to the wave length L, and the required number of grid points per wavelength P for obtaining an accurate solution. The compressional and shear wave lengths in the earth generally increase with depth and are often a factor of ten larger below the Moho discontinuity (at about 30 km depth), than in sedimentary basins near the surface. A uniform grid must have a grid spacing based on the small wave lengths near the surface, which results in over-resolving the solution at depth. As a result, the number of points in a uniform grid is unnecessarily large. In the wave propagation project (WPP) code, we address the over-resolution-at-depth issue by generalizing our previously developed single grid finite difference scheme to work on a composite grid consisting of a set of structured rectangular grids of different spacings, with hanging nodes on the grid refinement interfaces. The computational domain in a regional seismic simulation often extends to depth 40-50 km. Hence, using a refinement ratio of two, we need about three grid refinements from the bottom of the computational domain to the surface, to keep the local grid size in approximate parity with the local wave lengths. The challenge of the composite grid approach is to find a stable and accurate method for coupling the solution across the grid refinement interface. Of particular importance is the treatment of the solution at the hanging nodes, i.e., the fine grid points which are located in between coarse grid points. WPP implements a new, energy
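
    The resolution rule h ≤ L/P can be made concrete with a small worked example. The numbers below (wave speeds, frequency, points per wavelength) are illustrative assumptions only, but they show why a grid sized for slow near-surface material over-resolves the faster material at depth.

```python
# Illustrative numbers only: shear wave speeds near the surface and below the Moho.
def grid_spacing(wave_speed_m_s, frequency_hz, points_per_wavelength):
    wavelength = wave_speed_m_s / frequency_hz          # L = c / f
    return wavelength / points_per_wavelength           # h <= L / P

f_max = 1.0      # Hz, highest frequency to resolve (assumed)
P = 8            # grid points per wavelength (assumed)

h_surface = grid_spacing(500.0, f_max, P)    # slow sediments near the surface
h_depth = grid_spacing(4500.0, f_max, P)     # faster material below the Moho
print(f"near surface: h <= {h_surface:.0f} m, at depth: h <= {h_depth:.0f} m")
# A uniform grid would need the ~60 m spacing everywhere, over-resolving at depth;
# a composite grid lets the spacing grow with the local wavelength.
```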

  12. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    SciTech Connect

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  13. Queueing Network Models for Parallel Processing of Task Systems: an Operational Approach

    NASA Technical Reports Server (NTRS)

    Mak, Victor W. K.

    1986-01-01

    Computer performance modeling of possibly complex computations running on highly concurrent systems is considered. Earlier works in this area either dealt with a very simple program structure or resulted in methods with exponential complexity. An efficient procedure is developed to compute the performance measures for series-parallel-reducible task systems using queueing network models. The procedure is based on the concept of hierarchical decomposition and a new operational approach. Numerical results for three test cases are presented and compared to those of simulations.

  14. Initial operating capability for the hypercluster parallel-processing test bed

    NASA Technical Reports Server (NTRS)

    Cole, Gary L.; Blech, Richard A.; Quealy, Angela

    1989-01-01

    The NASA Lewis Research Center is investigating the benefits of parallel processing to applications in computational fluid and structural mechanics. To aid this investigation, NASA Lewis is developing the Hypercluster, a multi-architecture, parallel-processing test bed. The initial operating capability (IOC) being developed for the Hypercluster is described. The IOC will provide a user with a programming/operating environment that is interactive, responsive, and easy to use. The IOC effort includes the development of the Hypercluster Operating System (HYCLOPS). HYCLOPS runs in conjunction with a vendor-supplied disk operating system on a Front-End Processor (FEP) to provide interactive, run-time operations such as program loading, execution, memory editing, and data retrieval. Run-time libraries that augment the FEP FORTRAN libraries are being developed to support parallel and vector processing on the Hypercluster. Special utilities are being provided to enable passage of information about application programs and their mapping to the operating system. Communications between the FEP and the Hypercluster are being handled by dedicated processors, each running a Message-Passing Kernel (MPK). A shared-memory interface allows rapid data exchange between HYCLOPS and the communications processors. Input/output handlers are built into the HYCLOPS-MPK interface, eliminating the need for the user to supply separate I/O support programs on the FEP.

  15. Tuning of tool dynamics for increased stability of parallel (simultaneous) turning processes

    NASA Astrophysics Data System (ADS)

    Ozturk, E.; Comak, A.; Budak, E.

    2016-01-01

    Parallel (simultaneous) turning operations make use of more than one cutting tool acting on a common workpiece, offering potential for higher productivity. However, dynamic interaction between the tools and workpiece and the resulting chatter vibrations may create quality problems on machined surfaces. In order to determine chatter-free cutting process parameters, stability models can be employed. In this paper, the stability of parallel turning processes is formulated in the frequency and time domains for two different parallel turning cases. Predictions of the frequency and time domain methods demonstrated reasonable agreement with each other. In addition, the predicted stability limits are also verified experimentally. Simulation and experimental results show multi-regional stability diagrams which can be used to select the most favorable set of process parameters for higher stable material removal rates. In addition to parameter selection, the developed models can be used to determine the natural frequency ratio of the tools that results in the highest stable depth of cut. It is concluded that the most stable operations are obtained when the natural frequencies of the tools are slightly offset from each other, and the worst stability occurs when the natural frequencies of the tools are exactly the same.

  16. Parallel and Space-Efficient Construction of Burrows-Wheeler Transform and Suffix Array for Big Genome Data.

    PubMed

    Liu, Yongchao; Hankeln, Thomas; Schmidt, Bertil

    2016-01-01

    Next-generation sequencing technologies have led to the sequencing of more and more genomes, propelling related research into the era of big data. In this paper, we present ParaBWT, a parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data. In ParaBWT, we have investigated a progressive construction approach to constructing the BWT of single genome sequences in linear space complexity, but with a small constant factor. This approach has been further parallelized using multi-threading based on a master-slave coprocessing model. After obtaining the BWT, the suffix array is constructed in a memory-efficient manner. The performance of ParaBWT has been evaluated using two sequences generated from two human genome assemblies: the Ensembl Homo sapiens assembly and the human reference genome. Our performance comparison to FMD-index and Bwt-disk reveals that on 12 CPU cores, ParaBWT runs up to 2.2× faster than FMD-index and up to 99.0× faster than Bwt-disk. BWT construction algorithms for very long genomic sequences are time consuming and (due to their incremental nature) inherently difficult to parallelize. Thus, their parallelization is challenging and even relatively small speedups like the ones of our method over FMD-index are of high importance to research. ParaBWT is written in C++, and is freely available at http://parabwt.sourceforge.net. PMID:27295644
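
    ParaBWT's progressive, multi-threaded construction is not reproduced here; as a reminder of what is being computed, the sketch below builds the BWT and suffix array of a short string directly from sorted rotations and suffixes. This naive approach is quadratic and only serves to illustrate the definitions.

```python
def naive_bwt(text, terminator="$"):
    """Burrows-Wheeler transform by sorting all rotations (O(n^2 log n);
    fine for illustration, hopeless for whole genomes)."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def naive_suffix_array(text, terminator="$"):
    """Suffix array by direct sorting of all suffixes."""
    s = text + terminator
    return sorted(range(len(s)), key=lambda i: s[i:])

seq = "ACAACG"
print(naive_bwt(seq))            # BWT string
print(naive_suffix_array(seq))   # suffix start positions in lexicographic order
```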

  17. A systolic array parallelizing compiler

    SciTech Connect

    Tseng, P.S. )

    1990-01-01

    This book presents a completely new approach to the problem of systolic array parallelizing compilation. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler which can generate efficient parallel code for complete LINPACK routines. The book begins by analyzing the architectural strength of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and compiler-generated parallel code is given to clarify the overall picture of the compiler. The book concludes that a systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.

  18. GWM-VI: groundwater management with parallel processing for multiple MODFLOW versions

    USGS Publications Warehouse

    Banta, Edward R.; Ahlfeld, David P.

    2013-01-01

    Groundwater Management–Version Independent (GWM–VI) is a new version of the Groundwater Management Process of MODFLOW. The Groundwater Management Process couples groundwater-flow simulation with a capability to optimize stresses on the simulated aquifer based on an objective function and constraints imposed on stresses and aquifer state. GWM–VI extends prior versions of Groundwater Management in two significant ways—(1) it can be used with any version of MODFLOW that meets certain requirements on input and output, and (2) it is structured to allow parallel processing of the repeated runs of the MODFLOW model that are required to solve the optimization problem. GWM–VI uses the same input structure for files that describe the management problem as that used by prior versions of Groundwater Management. GWM–VI requires only minor changes to the input files used by the MODFLOW model. GWM–VI uses the Joint Universal Parameter IdenTification and Evaluation of Reliability Application Programming Interface (JUPITER-API) to implement both version independence and parallel processing. GWM–VI communicates with the MODFLOW model by manipulating certain input files and interpreting results from the MODFLOW listing file and binary output files. Nearly all capabilities of prior versions of Groundwater Management are available in GWM–VI. GWM–VI has been tested with MODFLOW-2005, MODFLOW-NWT (a Newton formulation for MODFLOW-2005), MF2005-FMP2 (the Farm Process for MODFLOW-2005), SEAWAT, and CFP (Conduit Flow Process for MODFLOW-2005). This report provides sample problems that demonstrate a range of applications of GWM–VI and the directory structure and input information required to use the parallel-processing capability.

  19. Library Preparation and Multiplex Capture for Massive Parallel Sequencing Applications Made Efficient and Easy

    PubMed Central

    Neiman, Mårten; Sundling, Simon; Grönberg, Henrik; Hall, Per; Czene, Kamila

    2012-01-01

    During the recent years, rapid development of sequencing technologies and a competitive market has enabled researchers to perform massive sequencing projects at a reasonable cost. As the price for the actual sequencing reactions drops, enabling more samples to be sequenced, the relative price for preparing libraries gets larger and the practical laboratory work becomes complex and tedious. We present a cost-effective strategy for simplified library preparation compatible with both whole genome- and targeted sequencing experiments. An optimized enzyme composition and reaction buffer reduces the number of required clean-up steps and allows for usage of bulk enzymes which makes the whole process cheap, efficient and simple. We also present a two-tagging strategy, which allows for multiplex sequencing of targeted regions. To prove our concept, we have prepared libraries for low-pass sequencing from 100 ng DNA, performed 2-, 4- and 8-plex exome capture and a 96-plex capture of a 500 kb region. In all samples we see a high concordance (>99.4%) of SNP calls when comparing to commercially available SNP-chip platforms. PMID:23139805

  20. Real-time processing of radar return on a parallel computer

    NASA Technical Reports Server (NTRS)

    Aalfs, David D.

    1992-01-01

    NASA is working with the FAA to demonstrate the feasibility of pulse Doppler radar as a candidate airborne sensor to detect low altitude windshears. The need to provide the pilot with timely information about possible hazards has motivated a demand for real-time processing of a radar return. Investigated here is parallel processing as a means of accommodating the high data rates required. A PC-based parallel computer built from transputers is used to investigate issues in real-time concurrent processing of radar signals. A transputer network is made up of an array of single instruction stream processors that can be networked in a variety of ways. They are easily reconfigured and software development is largely independent of the particular network topology. The performance of the transputer is evaluated in light of the computational requirements. A number of algorithms have been implemented on the transputers in OCCAM, a language specially designed for parallel processing. These include signal processing algorithms such as the Fast Fourier Transform (FFT), pulse-pair, and autoregressive modeling, as well as routing software to support concurrency. The most computationally intensive task is estimating the spectrum. Two approaches have been taken on this problem, the first and most conventional of which is to use the FFT. By using table look-ups for the basis function and other optimizing techniques, an algorithm has been developed that is sufficient for real time. The other approach is to model the signal as an autoregressive process and estimate the spectrum based on the model coefficients. This technique is attractive because it does not suffer from the spectral leakage problem inherent in the FFT. Benchmark tests indicate that autoregressive modeling is feasible in real time.
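
    The abstract contrasts FFT-based spectral estimation with an autoregressive (AR) model whose coefficients determine the spectrum. The Python sketch below illustrates that second idea with a least-squares AR fit and the corresponding model spectrum; it is a generic illustration, not the OCCAM code or the specific estimator used on the transputer network.

```python
import numpy as np

def ar_spectrum(x, order, n_freq=256):
    """Fit an AR(p) model by least squares and return its power spectrum."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Linear prediction system: x[t] ~= sum_k a[k] * x[t-1-k]
    X = np.column_stack([x[order - 1 - k : n - 1 - k] for k in range(order)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_var = np.mean((y - X @ a) ** 2)
    freqs = np.linspace(0.0, 0.5, n_freq)            # cycles per sample
    # PSD of the AR model: sigma^2 / |1 - sum_k a[k] e^{-j 2 pi f (k+1)}|^2
    denom = np.abs(1.0 - sum(a[k] * np.exp(-2j * np.pi * freqs * (k + 1))
                             for k in range(order))) ** 2
    return freqs, resid_var / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(512)
    x = np.sin(2 * np.pi * 0.12 * t) + 0.5 * rng.standard_normal(512)
    freqs, psd = ar_spectrum(x, order=8)
    print("AR spectral peak near f =", freqs[np.argmax(psd)])   # expect roughly 0.12
```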

  1. Parallel field programmable gate array particle filtering architecture for real-time neural signal processing.

    PubMed

    Mountney, John; Silage, Dennis; Obeid, Iyad

    2010-01-01

    Both linear and nonlinear estimation algorithms have been successfully applied as neural decoding techniques in brain machine interfaces. Nonlinear approaches such as Bayesian auxiliary particle filters offer improved estimates over other methodologies seemingly at the expense of computational complexity. Real-time implementation of particle filtering algorithms for neural signal processing may become prohibitive when the number of neurons in the observed ensemble becomes large. By implementing a parallel hardware architecture, filter performance can be improved in terms of throughput over conventional sequential processing. Such an architecture is presented here and its FPGA resource utilization is reported. PMID:21096196
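
    The FPGA architecture itself is not shown here; as a software reference point for the filtering task it accelerates, the sketch below implements a minimal bootstrap particle filter for a one-dimensional random-walk state observed in Gaussian noise. The state-space model, noise levels and particle count are illustrative assumptions, and the auxiliary-particle-filter refinements mentioned in the abstract are omitted.

```python
import numpy as np

def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=0.1, obs_std=0.5, seed=0):
    """Minimal bootstrap particle filter for a 1-D random-walk state."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)    # initial prior
    estimates = []
    for y in observations:
        # Predict: propagate each particle through the random-walk model.
        particles = particles + rng.normal(0.0, process_std, n_particles)
        # Weight: likelihood of the observation under Gaussian noise.
        weights = np.exp(-0.5 * ((y - particles) / obs_std) ** 2)
        weights /= weights.sum()
        estimates.append(np.sum(weights * particles))
        # Resample (multinomial) to avoid weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return np.array(estimates)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_state = np.cumsum(rng.normal(0.0, 0.1, 200))
    obs = true_state + rng.normal(0.0, 0.5, 200)
    est = bootstrap_particle_filter(obs)
    print("RMSE:", np.sqrt(np.mean((est - true_state) ** 2)))
```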

  2. Some computational challenges of developing efficient parallel algorithms for data-dependent computations in thermal-hydraulics supercomputer applications

    SciTech Connect

    Woodruff, S.B.

    1992-05-01

    The Transient Reactor Analysis Code (TRAC), which features a two-fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local, the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers should consider alternative algorithms, such as a neural net representation, that do not exhibit load-balancing problems.

  3. Distributed Parallel Processing and Dynamic Load Balancing Techniques for Multidisciplinary High Speed Aircraft Design

    NASA Technical Reports Server (NTRS)

    Krasteva, Denitza T.

    1998-01-01

    Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g., the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.) This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance.

  4. Parallel architecture for labeling, segmentation, and lexical processing in speech understanding

    SciTech Connect

    Bronson, E.C.; Siegel, L.J.

    1983-01-01

    Speech understanding is a complex task which requires extensive computation. To increase the processing speed, a speech understanding system is decomposed into tasks which can be performed by a series of distributed processing subsystems. An architecture to perform labeling, segmentation, and lexical processing is described. Using a parametric characterization of the speech signal, this system divides an utterance into labeled homogeneous regions. The system then performs dictionary lookups based on all probable labelings and segmentations in order to generate a complete set of word hypotheses. Using realistic assumptions from existing speech understanding systems, a statistical model of speech input, and simulations of the speech processing algorithms, the attributes of the parallel system to perform labeling, segmentation, and lexical processing for real-time speech understanding are derived. 36 references.

  5. Propagation of electromagnetic fields between non-parallel planes: a fully vectorial formulation and an efficient implementation.

    PubMed

    Zhang, Site; Asoubar, Daniel; Hellmann, Christian; Wyrowski, Frank

    2016-01-20

    The propagation of electromagnetic fields between non-parallel planes based on a spectrum-of-plane-wave analysis is discussed and formulations for an efficient numerical implementation are presented in detail. It is shown that with the help of interpolation techniques, the numerical implementation can be done with the standard uniform fast Fourier transform (FFT), which is readily available. Different interpolation techniques are numerically examined, and it turns out that the use of cubic interpolation, together with the uniform FFT, brings both significantly increased computational efficiency and high simulation accuracy. Apart from the aspect of computational efficiency, all formulations in this work are generalized in a fully vectorial manner in comparison to previous works. PMID:26835928

  6. Efficient VLSI parallel algorithm for Delaunay triangulation on orthogonal tree network in two and three dimensions

    SciTech Connect

    Saena, S.; Bhatt, P.C.P.; Prasad, V.C. )

    1990-03-01

    In this paper, a parallel algorithm for two- and three-dimensional Delaunay triangulation on an orthogonal tree network is described. The worst case time complexity of this algorithm is O(log^2 N) in two dimensions and O(m^(1/2) log N) in three dimensions, with N input points and m the number of tetrahedra in the triangulation. The AT^2 VLSI complexity on Thompson's logarithmic delay model is O(N^2 log^6 N) in two dimensions and O(m^2 N log^4 N) in three dimensions.

  7. Refractories for Industrial Processing. Opportunities for Improved Energy Efficiency

    SciTech Connect

    Hemrick, James G.; Hayden, H. Wayne; Angelini, Peter; Moore, Robert E.; Headrick, William L.

    2005-01-01

    Refractories are a class of materials of critical importance to manufacturing industries with high-temperature unit processes. This study describes industrial refractory applications and identifies refractory performance barriers to energy efficiency for processing. The report provides recommendations for R&D pathways leading to improved refractories for energy-efficient manufacturing and processing.

  8. Massively parallel mathematical sieves

    SciTech Connect

    Montry, G.R.

    1989-01-01

    The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
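
    The hypercube implementation is not reproduced here, but the underlying decomposition is easy to sketch: precompute the base primes up to √N once, then let each worker sieve its own segment of the range independently. The Python fragment below uses a process pool to illustrate this segmented, data-parallel layout; segment sizes and worker counts are arbitrary choices for the example.

```python
from multiprocessing import Pool
from math import isqrt

def base_primes(limit):
    """Ordinary sieve for primes up to sqrt(N); every worker reuses these."""
    mark = bytearray([1]) * (limit + 1)
    mark[0:2] = b"\x00\x00"
    for p in range(2, isqrt(limit) + 1):
        if mark[p]:
            mark[p * p :: p] = bytearray(len(mark[p * p :: p]))
    return [p for p in range(2, limit + 1) if mark[p]]

def sieve_segment(args):
    """Each worker sieves one segment [lo, hi) using the shared base primes."""
    lo, hi, primes = args
    mark = bytearray([1]) * (hi - lo)
    for p in primes:
        start = max(p * p, ((lo + p - 1) // p) * p)
        mark[start - lo :: p] = bytearray(len(mark[start - lo :: p]))
    return [lo + i for i, m in enumerate(mark) if m and lo + i >= 2]

def parallel_sieve(n, n_workers=4):
    primes = base_primes(isqrt(n))
    step = (n + n_workers) // n_workers
    tasks = [(lo, min(lo + step, n + 1), primes) for lo in range(2, n + 1, step)]
    with Pool(n_workers) as pool:
        segments = pool.map(sieve_segment, tasks)
    return [p for seg in segments for p in seg]

if __name__ == "__main__":
    print(parallel_sieve(100))   # primes up to 100
```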

  9. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    SciTech Connect

    Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarría-Miranda, Daniel

    2009-05-29

    We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
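
    The lock-free multithreaded implementation is tied to the XMT and UltraSparc T2 hardware, so it is not reproduced here. As a plain reference for what is being computed, the sketch below estimates betweenness centrality by running Brandes-style single-source accumulation from a random sample of sources on an unweighted graph, in the spirit of the approximate computation mentioned for the IMDb network; the sampling scheme and rescaling are illustrative assumptions.

```python
import random
from collections import deque, defaultdict

def approximate_betweenness(adj, n_samples, seed=0):
    """Approximate betweenness centrality by running Brandes' single-source
    accumulation from a random sample of source vertices (unweighted graph).

    adj: dict mapping vertex -> list of neighbours
    """
    rng = random.Random(seed)
    bc = defaultdict(float)
    nodes = list(adj)
    for s in rng.sample(nodes, n_samples):
        # BFS from s, recording shortest-path counts and predecessors.
        dist = {s: 0}
        sigma = defaultdict(float); sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate dependencies in reverse BFS order.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Rescale by the sampling fraction.
    scale = len(nodes) / n_samples
    return {v: bc[v] * scale for v in nodes}

if __name__ == "__main__":
    # Tiny path-plus-branch graph; vertex "c" should score highest.
    adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"],
           "d": ["c"], "e": ["c"]}
    print(approximate_betweenness(adj, n_samples=5))
```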

  10. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    SciTech Connect

    Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarria-Miranda, Daniel

    2009-02-15

    We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.

  11. Locality-Aware Parallel Process Mapping for Multi-Core HPC Systems

    SciTech Connect

    Hursey, Joshua J; Squyres, Jeffrey M.; Dontje, Terry

    2011-01-01

    High Performance Computing (HPC) systems are composed of servers containing an ever-increasing number of cores. With such high processor core counts, non-uniform memory access (NUMA) architectures are almost universally used to reduce inter-processor and memory communication bottlenecks by distributing processors and memory throughout a server-internal networking topology. Application studies have shown that tuning the placement of processes in a server's NUMA networking topology to the application can have a dramatic impact on performance. The performance implications are magnified when running a parallel job across multiple server nodes, especially with large-scale HPC applications. This paper presents the Locality-Aware Mapping Algorithm (LAMA) for distributing the individual processes of a parallel application across processing resources in an HPC system, paying particular attention to the internal server NUMA topologies. The algorithm is able to support both homogeneous and heterogeneous hardware systems, and dynamically adapts to the available hardware and user-specified process layout at run-time. As implemented in Open MPI, the LAMA provides 362,880 mapping permutations and is able to naturally scale out to additional hardware resources as they become available in future architectures.

  12. An experimental research on the mixing process of supersonic oxygen-iodine parallel streams

    NASA Astrophysics Data System (ADS)

    Wang, Zengqiang; Sang, Fengting; Zhang, Yuelong; Hui, Xiaokang; Xu, Mingxiu; Zhang, Peng; Zhao, Weili; Fang, Benjie; Duo, Liping; Jin, Yuqi

    2014-12-01

    The O2(1Δ)/I2 mixing process is one of the most important steps in chemical oxygen-iodine laser (COIL). Based on the chemical fluorescence method (CFM), a diagnostic system was set up to image electronically excited fluorescent I2(B3Π0) by means of a high speed camera. An optimized data analysis approach was proposed to analyze the mixing process of supersonic oxygen-iodine parallel streams, employing a set of qualitative and quantitative parameters and a proper percentage boundary threshold of the fluorescence zone. A slit nozzle bank with supersonic parallel streams and a trip tab set for enhancing the mixing process were designed and fabricated. With the diagnostic system and the data analysis approach, the performance of the trip tab set was examined and is demonstrated in this work. With the mixing enhancement, the fluorescence zone area was enlarged 3.75 times. We have studied the mixing process under different flow conditions and demonstrated the mixing properties with different iodine buffer gases, including N2, Ar, He and CO2. It was found that, among the four tested gases, Ar had the best penetration ability, whilst He showed the best free diffusion ability, and both of them could be well used as the buffer gas in our experiments. These experimental results can be useful for designing and optimizing COIL systems.

  13. FAST Observations of Acceleration Processes in the Cusp--Evidence for Parallel Electric Fields

    NASA Technical Reports Server (NTRS)

    Pfaff, R. F., Jr.; Carlson, C.; McFadden, J.; Ergun, R.; Clemmons, J.; Klumpar, D.; Strangeway, R.

    1999-01-01

    The existence of precipitating keV ions in the Earth's cusp originating at the magnetosheath provides a unique means to test our understanding of particle acceleration and parallel electric fields in the lower altitude acceleration region. On numerous occasions, the FAST (Fast Auroral Snapshot) spacecraft has encountered the Earth's cusp regions near its apogee of 4175 km, which are characterized by their signatures of dispersed keV ion injections. The FAST instruments also reveal a complex microphysics inherent to many, but not all, of the cusp regions encountered by the spacecraft, that include upgoing ion beams and conics, inverted-V electrons, upgoing electron beams, and spikey DC-coupled electric fields and plasma waves. Detailed inspection of the FAST data often shows clear modulation of the precipitating magnetosheath ions that indicates they are affected by local electric potentials. For example, the magnetosheath ion precipitation is sometimes abruptly shut off precisely in regions where downgoing localized inverted-V electrons are observed. Such observations support the existence of a localized process, such as parallel electric fields, above the spacecraft which accelerates the electrons downward and consequently impedes the ion precipitation. Other acceleration events in the cusp are sometimes organized with an apparent cellular structure that suggests Alfven waves or other large-scale phenomena are controlling the localized potentials. We examine several cusp encounters by the FAST satellite where the modulation of energetic particle populations reveals evidence of localized acceleration, most likely by parallel electric fields.

  14. Operational efficiency in STS cargo processing

    NASA Technical Reports Server (NTRS)

    Wilson, R. E.

    1985-01-01

    A multifaceted program is presented that addresses both the operational aspects of Shuttle-cargo integration and the needs of the STS Cargo Community. The program consists of the following key elements: (1) processing team awareness of cargo needs and requirements; (2) standardization of Orbiter preparation and cargo integration procedures and methods; (3) maximum application of state-of-the-art ADP techniques in all relevant areas; (4) continual review of cargo integration facility and ground system capabilities versus requirements and enhancement; (5) continual assessment of proposed cargo processing changes for safety and other needs; and (6) review of cargo processing philosophies, policies, and concepts for potential improvements.

  15. Processing on high efficiency solar collector coatings

    NASA Technical Reports Server (NTRS)

    Roberts, M.

    1977-01-01

    Wavelength selective coatings for solar collectors are considered. Substrates with good infrared reflectivity were examined along with their susceptibility to physical and environmental damage. Improvements of reflective surfaces were accomplished through buffing, chemical polishing and other surface processing methods.

  16. Efficient multiprocessor architecture for digital signal processing

    SciTech Connect

    Auguin, M.; Boeri, F.

    1982-01-01

    There is continuing pressure for better processing performance in numerical signal processing. Effective utilization of LSI semiconductor technology allows the consideration of multiprocessor architectures. The problem of interconnecting the components of the architecture arises. The authors describe a control algorithm for the Benes interconnection network in an asynchronous multiprocessor system. A simulation study of the time-shared bus, the omega network, the Benes network and the crossbar network gives a comparison of their performance. 8 references.

  17. Development of a Robust and Efficient Parallel Solver for Unsteady Turbomachinery Flows

    NASA Technical Reports Server (NTRS)

    West, Jeff; Wright, Jeffrey; Thakur, Siddharth; Luke, Ed; Grinstead, Nathan

    2012-01-01

    The traditional design and analysis practice for advanced propulsion systems relies heavily on expensive full-scale prototype development and testing. Over the past decade, use of high-fidelity analysis and design tools such as CFD early in the product development cycle has been identified as one way to alleviate testing costs and to develop these devices better, faster and cheaper. In the design of advanced propulsion systems, CFD plays a major role in defining the required performance over the entire flight regime, as well as in testing the sensitivity of the design to the different modes of operation. Increased emphasis is being placed on developing and applying CFD models to simulate the flow field environments and performance of advanced propulsion systems. This necessitates the development of next generation computational tools which can be used effectively and reliably in a design environment. The turbomachinery simulation capability presented here is being developed in a computational tool called Loci-STREAM [1]. It integrates proven numerical methods for generalized grids and state-of-the-art physical models in a novel rule-based programming framework called Loci [2] which allows: (a) seamless integration of multidisciplinary physics in a unified manner, and (b) automatic handling of massively parallel computing. The objective is to be able to routinely simulate problems involving complex geometries requiring large unstructured grids and complex multidisciplinary physics. An immediate application of interest is simulation of unsteady flows in rocket turbopumps, particularly in cryogenic liquid rocket engines. The key components of the overall methodology presented in this paper are the following: (a) high fidelity unsteady simulation capability based on Detached Eddy Simulation (DES) in conjunction with second-order temporal discretization, (b) compliance with Geometric Conservation Law (GCL) in order to maintain conservative property on moving meshes for

  18. Parallel Demand-Withdraw Processes in Family Therapy for Adolescent Drug Abuse

    PubMed Central

    Rynes, Kristina N.; Rohrbaugh, Michael J.; Lebensohn-Chialvo, Florencia; Shoham, Varda

    2013-01-01

    Isomorphism, or parallel process, occurs in family therapy when patterns of therapist-client interaction replicate problematic interaction patterns within the family. This study investigated parallel demand-withdraw processes in Brief Strategic Family Therapy (BSFT) for adolescent drug abuse, hypothesizing that therapist-demand/adolescent-withdraw interaction (TD/AW) cycles observed early in treatment would predict poor adolescent outcomes at follow-up for families who exhibited entrenched parent-demand/adolescent-withdraw interaction (PD/AW) before treatment began. Participants were 91 families who received at least 4 sessions of BSFT in a multi-site clinical trial on adolescent drug abuse (Robbins et al., 2011). Prior to receiving therapy, families completed videotaped family interaction tasks from which trained observers coded PD/AW. Another team of raters coded TD/AW during two early BSFT sessions. The main dependent variable was the number of drug use days that adolescents reported in Timeline Follow-Back interviews 7 to 12 months after family therapy began. Zero-inflated Poisson (ZIP) regression analyses supported the main hypothesis, showing that PD/AW and TD/AW interacted to predict adolescent drug use at follow-up. For adolescents in high PD/AW families, higher levels of TD/AW predicted significant increases in drug use at follow-up, whereas for low PD/AW families, TD/AW and follow-up drug use were unrelated. Results suggest that attending to parallel demand-withdraw processes in parent/adolescent and therapist/adolescent dyads may be useful in family therapy for substance-using adolescents. PMID:23438248

  19. On the Control of Automatic Processes: A Parallel Distributed Processing Account of the Stroop Effect.

    ERIC Educational Resources Information Center

    Cohen, Jonathan D.; And Others

    1990-01-01

    It is proposed that attributes of automatization depend on the strength of a processing pathway, and that strength increases with training. With the Stroop effect as an example, automatic processes are shown through simulation to be continuous and to emerge gradually with practice. (SLD)

  20. A Parallel Non-Alignment Based Approach to Efficient Sequence Comparison using Longest Common Subsequences

    NASA Astrophysics Data System (ADS)

    Bhowmick, S.; Shafiullah, M.; Rai, H.; Bastola, D.

    2010-11-01

    Biological sequence comparison programs have revolutionized the practice of biochemistry, and molecular and evolutionary biology. Pairwise comparison of genomic sequences is a popular method of choice for analyzing genetic sequence data. However, the quality of results from most sequence comparison methods is significantly affected by small perturbations in the data and, furthermore, there is a dearth of computational tools to compare sequences beyond a certain length. In this paper, we describe a parallel algorithm for comparing genetic sequences using an alignment-free method based on computing the Longest Common Subsequence (LCS) between genetic sequences. We validate the quality of our results by comparing the phylogenetic trees obtained from ClustalW and LCS. We also show, through isoefficiency analysis and empirical measurement of the running time, that our algorithm is very scalable.
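
    The parallel decomposition used by the authors is not reproduced here, but the kernel being distributed is the standard longest-common-subsequence recurrence. The sketch below computes the LCS length with the usual two-row dynamic program and derives a simple LCS-based dissimilarity; the dissimilarity formula is an illustrative assumption, not necessarily the measure used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b,
    computed with the standard O(len(a) * len(b)) dynamic program."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

def lcs_distance(a, b):
    """A simple LCS-based dissimilarity in [0, 1]; an illustrative choice."""
    return 1.0 - lcs_length(a, b) / max(len(a), len(b))

print(lcs_length("AGGTAB", "GXTXAYB"))    # 4 ("GTAB")
print(lcs_distance("AGGTAB", "GXTXAYB"))
```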

  1. Asymmetric Exclusion Process with Constrained Hopping and Parallel Dynamics at a Junction

    NASA Astrophysics Data System (ADS)

    Liu, Mingzhe; Tuo, Xianguo; Li, Zhe; Yang, Jianbo

    In this article the totally asymmetric simple exclusion process (TASEP) with constrained hopping and parallel dynamics at a junction is investigated using a mean-field approximation and Monte Carlo simulations. The constrained particle hopping probability r (r ≤ 1) at a junction may correspond to a delay caused by a driver choosing the right direction or a delay waiting for a green traffic light in the real world. There are six stationary phases in the system, which can reflect free flow and congested traffic situations. Correlations at the junction point are investigated via simulations. It is observed that small r leads to stronger correlations. The theoretical results agree well with computer simulations.
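
    The junction geometry and the constrained hopping probability r of this model are not reproduced here; the sketch below only simulates a single-lane TASEP with open boundaries and fully parallel (synchronous) updates, which is the basic ingredient the paper builds on. Rates, system size and step counts are illustrative assumptions.

```python
import numpy as np

def tasep_parallel(n_sites=200, alpha=0.3, beta=0.3, p=0.75,
                   n_steps=20000, seed=0):
    """Single-lane TASEP, open boundaries, synchronous (parallel) update.
    alpha: injection prob., beta: extraction prob., p: hop probability."""
    rng = np.random.default_rng(seed)
    tau = np.zeros(n_sites, dtype=int)            # 1 = occupied, 0 = empty
    density = np.zeros(n_sites)
    for step in range(n_steps):
        # Decide all bulk hops from the current configuration, then apply at once.
        hop = (tau[:-1] == 1) & (tau[1:] == 0) & (rng.random(n_sites - 1) < p)
        tau[:-1][hop] = 0
        tau[1:][hop] = 1
        # Boundary moves applied sequentially for simplicity.
        if tau[-1] == 1 and rng.random() < beta:  # extraction at the right end
            tau[-1] = 0
        if tau[0] == 0 and rng.random() < alpha:  # injection at the left end
            tau[0] = 1
        if step > n_steps // 2:                   # average over the second half
            density += tau
    return density / (n_steps - n_steps // 2 - 1)

if __name__ == "__main__":
    rho = tasep_parallel()
    print("bulk density ~", rho[50:150].mean())
```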

  2. Strong Asymmetric Coupling of Two Parallel Exclusion Processes: Effect of Unequal Injection Rates

    NASA Astrophysics Data System (ADS)

    Xiao, Song; Dong, Peng; Zhang, Yingjie; Liu, Yanna

    2016-03-01

    In this letter, strong asymmetric coupling of two parallel exclusion processes with unequal injection rates is investigated. It is a generalization of the work of Xiao et al. (Phys. Lett. A 8, 374 (2009)), in which the particles only move on two lanes with rate 1 toward the right. We obtain the phase diagram and density profiles of the system. The vertical cluster mean-field approach and extensive Monte Carlo simulations are used to study the system, and the theoretical predictions are in excellent agreement with the simulation results.

  3. Adventures in Parallel Processing: Entry, Descent and Landing Simulation for the Genesis and Stardust Missions

    NASA Technical Reports Server (NTRS)

    Lyons, Daniel T.; Desai, Prasun N.

    2005-01-01

    This paper will describe the Entry, Descent and Landing simulation tradeoffs and techniques that were used to provide the Monte Carlo data required to approve entry during a critical period just before entry of the Genesis Sample Return Capsule. The same techniques will be used again when Stardust returns on January 15, 2006. Only one hour was available for the simulation which propagated 2000 dispersed entry states to the ground. Creative simulation tradeoffs combined with parallel processing were needed to provide the landing footprint statistics that were an essential part of the Go/NoGo decision that authorized release of the Sample Return Capsule a few hours before entry.

  4. Leveraging human oversight and intervention in large-scale parallel processing of open-source data

    NASA Astrophysics Data System (ADS)

    Casini, Enrico; Suri, Niranjan; Bradshaw, Jeffrey M.

    2015-05-01

    The popularity of cloud computing, along with the increased availability of cheap storage, has led to the need to process and transform large volumes of open-source data in parallel. One way to handle such extensive volumes of information properly is to take advantage of distributed computing frameworks like Map-Reduce. Unfortunately, an entirely automated approach that excludes human intervention is often unpredictable and error prone. Highly accurate data processing and decision-making can be achieved by supporting an automatic process through human collaboration, in a variety of environments such as warfare, cyber security and threat monitoring. Although this mutual participation seems easily exploitable, human-machine collaboration in the field of data analysis presents several challenges. First, due to the asynchronous nature of human intervention, it is necessary to verify that once a correction is made, all of the necessary downstream reprocessing is carried out. Second, the amount of reprocessing often needs to be minimized in order to make the best use of limited resources. To address these strict requirements, this paper introduces improvements to an innovative approach for human-machine collaboration in the processing of large amounts of open-source data in parallel.

  5. Target-specific IPSC kinetics promote temporal processing in auditory parallel pathways

    PubMed Central

    Xie, Ruili; Manis, Paul B.

    2013-01-01

    The acoustic environment contains biologically relevant information on time scales from microseconds to tens of seconds. The auditory brainstem nuclei process this temporal information through parallel pathways that originate in the cochlear nucleus from different classes of cells. While the roles of ion channels and excitatory synapses in temporal processing have been well studied, the contribution of inhibition is less well understood. Here, we show in CBA/CaJ mice that the two major projection neurons of the ventral cochlear nucleus, the bushy and T-stellate cells, receive glycinergic inhibition with different synaptic conductance time courses. Bushy cells, which provide precisely timed spike trains used in sound localization and pitch identification, receive slow inhibitory inputs. In contrast, T-stellate cells, which encode slower envelope information, receive inhibition that is eight-fold faster. Both types of inhibition improved the precision of spike timing, but engage different cellular mechanisms and operate on different time scales. Computer models reveal that slow IPSCs in bushy cells can improve spike timing on the scale of tens of microseconds. While fast and slow IPSCs in T-stellate cells improve spike timing on the scale of milliseconds, only fast IPSCs can enhance the detection of narrowband acoustic signals in a complex background. Our results suggest that target-specific IPSC kinetics are critical for the segregated parallel processing of temporal information from the sensory environment. PMID:23345233

  6. Efficient Quantum Information Processing via Quantum Compressions

    NASA Astrophysics Data System (ADS)

    Deng, Y.; Luo, M. X.; Ma, S. Y.

    2016-01-01

    Our purpose is to improve quantum transmission efficiency and reduce resource cost by means of quantum compression. Lossless quantum compression is accomplished using invertible quantum transformations and applied to quantum teleportation and to simultaneous transmission over quantum butterfly networks. The new schemes can greatly reduce the entanglement cost and partially resolve transmission conflicts over common links. Moreover, the local compression scheme is useful for approximate entanglement creation from pre-shared entanglement. This special task has not been addressed because of the quantum no-cloning theorem. Our scheme depends on local quantum compression and bipartite entanglement transfer. Simulations show that the success probability depends strongly on the minimal entanglement coefficient. These results may be useful in general quantum network communication.

  7. Towards a Standard Mixed-Signal Parallel Processing Architecture for Miniature and Microrobotics

    PubMed Central

    Sadler, Brian M; Hoyos, Sebastian

    2014-01-01

    The conventional analog-to-digital conversion (ADC) and digital signal processing (DSP) architecture has led to major advances in miniature and micro-systems technology over the past several decades. The outlook for these systems is significantly enhanced by advances in sensing, signal processing, communications and control, and the combination of these technologies enables autonomous robotics on the miniature to micro scales. In this article we look at trends in the combination of analog and digital (mixed-signal) processing, and consider a generalized sampling architecture. Employing a parallel analog basis expansion of the input signal, this scalable approach is adaptable and reconfigurable, and is suitable for a large variety of current and future applications in networking, perception, cognition, and control. PMID:26601042

  8. Parallel photonic information processing at gigabyte per second data rates using transient states

    NASA Astrophysics Data System (ADS)

    Brunner, Daniel; Soriano, Miguel C.; Mirasso, Claudio R.; Fischer, Ingo

    2013-01-01

    The increasing demands on information processing require novel computational concepts and true parallelism. Nevertheless, hardware realizations of unconventional computing approaches never exceeded a marginal existence. While the application of optics in super-computing receives reawakened interest, new concepts, partly neuro-inspired, are being considered and developed. Here we experimentally demonstrate the potential of a simple photonic architecture to process information at unprecedented data rates, implementing a learning-based approach. A semiconductor laser subject to delayed self-feedback and optical data injection is employed to solve computationally hard tasks. We demonstrate simultaneous spoken digit and speaker recognition and chaotic time-series prediction at data rates beyond 1 Gbyte/s. We identify all digits with very low classification errors and perform chaotic time-series prediction with 10% error. Our approach bridges the areas of photonic information processing, cognitive and information science.
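
    The delayed-feedback laser described above is an instance of delay-based reservoir computing, in which a single nonlinear node is time-multiplexed into many "virtual nodes" and only a linear readout is trained. The sketch below is a purely numerical analogue of that idea in Python; the node model, parameter values, and the toy prediction task are assumptions for illustration and do not reproduce the photonic hardware or its benchmark tasks.

        import numpy as np

        # Minimal numerical analogue of a delay-based reservoir computer:
        # one nonlinear node with delayed feedback, time-multiplexed into
        # "virtual nodes", and a linear readout trained by ridge regression.
        # All parameter names/values here are illustrative, not taken from the paper.

        rng = np.random.default_rng(0)
        n_virtual = 50          # virtual nodes per delay interval
        eta, gamma = 0.5, 0.05  # feedback strength and input scaling (assumed)
        mask = rng.uniform(-1.0, 1.0, n_virtual)  # random input mask

        def reservoir_states(u):
            """Drive the delayed-feedback node with input sequence u and
            collect the virtual-node states for each input sample."""
            states = np.zeros((len(u), n_virtual))
            prev = np.zeros(n_virtual)            # state one delay ago
            for t, u_t in enumerate(u):
                x = np.empty(n_virtual)
                for k in range(n_virtual):
                    drive = eta * prev[k] + gamma * mask[k] * u_t
                    x[k] = np.tanh(drive)         # nonlinear node response
                states[t] = x
                prev = x
            return states

        # Toy task: one-step-ahead prediction of a noisy sine wave.
        u = np.sin(0.2 * np.arange(2000)) + 0.05 * rng.standard_normal(2000)
        target = np.roll(u, -1)
        X = reservoir_states(u)

        # Ridge-regression readout (the only trained part of the system).
        lam = 1e-4
        W = np.linalg.solve(X.T @ X + lam * np.eye(n_virtual), X.T @ target)
        pred = X @ W
        print("NMSE:", np.mean((pred - target) ** 2) / np.var(target))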

  9. Towards a Standard Mixed-Signal Parallel Processing Architecture for Miniature and Microrobotics.

    PubMed

    Sadler, Brian M; Hoyos, Sebastian

    2014-01-01

    The conventional analog-to-digital conversion (ADC) and digital signal processing (DSP) architecture has led to major advances in miniature and micro-systems technology over the past several decades. The outlook for these systems is significantly enhanced by advances in sensing, signal processing, communications and control, and the combination of these technologies enables autonomous robotics on the miniature to micro scales. In this article we look at trends in the combination of analog and digital (mixed-signal) processing, and consider a generalized sampling architecture. Employing a parallel analog basis expansion of the input signal, this scalable approach is adaptable and reconfigurable, and is suitable for a large variety of current and future applications in networking, perception, cognition, and control. PMID:26601042

  10. A polarization-based frequency scanning interferometer and the signal processing acceleration method based on parallel processing architecture

    NASA Astrophysics Data System (ADS)

    Lee, Seung Hyun; Kim, Min Young

    The frequency scanning interferometry (FSI) system, one of the most promising optical surface measurement techniques, generally achieves superior optical performance compared with other 3-dimensional measuring methods because its hardware remains fixed during operation and only the light frequency is scanned over a specific spectral band, without vertical scanning of the target surface or the objective lens. The FSI system collects a set of interference fringe images while changing the frequency of the light source, transforms the intensity data of the acquired images into frequency information, and calculates the height profile of the target with FFT-based frequency analysis. However, it still suffers from optical noise originating at the target surface and from relatively long processing times due to the number of images acquired during the frequency scanning phase. First, a polarization-based frequency scanning interferometry (PFSI) is proposed for robustness against optical noise. It consists of a tunable laser as the light source, a λ/4 plate in front of the reference mirror, a λ/4 plate in front of the target object, a polarizing beam splitter, a polarizer in front of the image sensor, a polarizer in front of the fiber-coupled light source, and a λ/2 plate between the PBS and the polarizer of the light source. With this system, the low contrast of the acquired fringe images is resolved through the polarization technique, and the light distribution between the object and reference beams can be controlled. Second, a signal processing acceleration method is proposed for PFSI, based on a parallel processing architecture consisting of parallel processing hardware and software such as a GPU (Graphics Processing Unit) programmed with CUDA (Compute Unified Device Architecture). As a result, the processing time reaches the tact-time level required for real-time processing. Finally, the proposed system is evaluated in terms of accuracy and processing speed through a series of experiments, and the results show the effectiveness of the proposed system and method.
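
    The frequency-analysis step described above reduces, per pixel, to locating the dominant fringe frequency in the stack of images acquired during the sweep. The following numpy sketch illustrates that step under simplified assumptions (linear sweep, ideal two-beam fringes); the sweep range, frame count, and synthetic data are invented for illustration, and the GPU/CUDA acceleration of the paper is only noted in a comment since it amounts to swapping the array backend.

        import numpy as np

        # Hedged sketch of the FSI frequency-analysis step: a stack of fringe
        # images is acquired while the laser frequency is swept linearly; the
        # per-pixel fringe frequency (FFT peak) is proportional to the optical
        # path difference and hence to surface height.  Constants below
        # (sweep range, frame count) are illustrative only.

        n_frames = 256                 # images acquired during the sweep
        delta_nu = 1e12                # total optical frequency sweep [Hz] (assumed)
        c = 3.0e8                      # speed of light [m/s]

        def height_map(stack):
            """stack: (n_frames, H, W) array of fringe intensities."""
            stack = stack - stack.mean(axis=0)          # remove DC term per pixel
            spec = np.abs(np.fft.rfft(stack, axis=0))   # per-pixel spectrum
            peak_bin = spec[1:].argmax(axis=0) + 1      # skip the zero bin
            # bin k corresponds to k full fringes over the sweep, i.e. an
            # optical path difference of k * c / delta_nu (two-pass geometry halves it)
            return peak_bin * c / (2.0 * delta_nu)

        # Synthetic example: a tilted surface produces a linearly varying height.
        H, W = 64, 64
        true_h = np.linspace(0, 2e-3, W)[None, :] * np.ones((H, 1))
        k = np.arange(n_frames)[:, None, None]
        stack = 1 + np.cos(2 * np.pi * (2 * delta_nu * true_h / c)[None] * k / n_frames)
        print(height_map(stack).mean(axis=0)[:5])

        # The same per-pixel FFT maps naturally onto GPU array libraries
        # (the paper uses CUDA); only the array backend changes.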

  11. Efficient Parallel All-Electron Four-Component Dirac-Kohn-Sham Program Using a Distributed Matrix Approach II.

    PubMed

    Storchi, Loriano; Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Quiney, Harry M

    2013-12-10

    We propose a new complete memory-distributed algorithm, which significantly improves the parallel implementation of the all-electron four-component Dirac-Kohn-Sham (DKS) module of BERTHA (J. Chem. Theory Comput. 2010, 6, 384). We devised an original procedure for mapping the DKS matrix between an efficient integral-driven distribution, guided by the structure of specific G-spinor basis sets and by density fitting algorithms, and the two-dimensional block-cyclic distribution scheme required by the ScaLAPACK library employed for the linear algebra operations. This implementation, because of the efficiency in the memory distribution, represents a leap forward in the applicability of the DKS procedure to arbitrarily large molecular systems and its porting on last-generation massively parallel systems. The performance of the code is illustrated by some test calculations on several gold clusters of increasing size. The DKS self-consistent procedure has been explicitly converged for two representative clusters, namely Au20 and Au34, for which the density of electronic states is reported and discussed. The largest gold cluster uses more than 39k basis functions and DKS matrices of the order of 23 GB. PMID:26592273

  12. Accurate, efficient, and scalable parallel simulation of mesoscale electrostatic/magnetostatic problems accelerated by a fast multipole method

    NASA Astrophysics Data System (ADS)

    Jiang, Xikai; Karpeev, Dmitry; Li, Jiyuan; de Pablo, Juan; Hernandez-Ortiz, Juan; Heinonen, Olle

    Boundary integrals arise in many electrostatic and magnetostatic problems. In computational modeling of these problems, although the integral is performed only on the boundary of a domain, its direct evaluation needs O(N^2) operations, where N is the number of unknowns on the boundary. The O(N^2) scaling impedes a wider usage of the boundary integral method in the scientific and engineering communities. We have developed a parallel computational approach that utilizes the Fast Multipole Method to evaluate the boundary integral in O(N) operations. To demonstrate the accuracy, efficiency, and scalability of our approach, we consider two test cases. In the first case, we solve a boundary value problem for a ferroelectric/ferromagnetic volume in free space using a hybrid finite element-boundary integral method. In the second case, we solve an electrostatic problem involving the polarization of dielectric objects in free space using the boundary element method. The results from both test cases show that our parallel approach can enable highly efficient and accurate simulations of mesoscale electrostatic/magnetostatic problems. Computing resources were provided by Blues, a high-performance cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory. Work at Argonne was supported by U.S. DOE, Office of Science under Contract No. DE-AC02-06CH11357.
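
    For orientation, the sketch below evaluates a single-layer boundary sum directly, making the O(N^2) pairwise cost visible; it is a generic illustration of the scaling problem, not the authors' fast multipole implementation or their finite element-boundary integral coupling.

        import numpy as np

        # Naive O(N^2) evaluation of a single-layer boundary potential,
        #   phi(x_i) = sum_j q_j / |x_i - x_j|,  i != j,
        # the pairwise sum whose cost motivates the fast multipole method.
        # Points and charges are synthetic; this is not the authors' FMM code.

        rng = np.random.default_rng(1)
        N = 2000
        pts = rng.standard_normal((N, 3))
        pts /= np.linalg.norm(pts, axis=1, keepdims=True)  # points on a spherical "boundary"
        q = rng.standard_normal(N)

        def direct_potential(points, charges):
            phi = np.zeros(len(points))
            for i, x in enumerate(points):           # O(N) targets ...
                r = np.linalg.norm(points - x, axis=1)
                r[i] = np.inf                        # exclude self-interaction
                phi[i] = np.sum(charges / r)         # ... each over O(N) sources
            return phi

        phi = direct_potential(pts, q)
        print(phi[:3])
        # An FMM replaces the inner sum by hierarchical multipole/local
        # expansions, bringing the total cost down to O(N) at controlled accuracy.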

  13. Efficient volume reconstruction for parallel-beam computed laminography by filtered backprojection on multi-core clusters.

    PubMed

    Myagotin, Anton; Voropaev, Alexey; Helfen, Lukas; Haenschke, Daniel; Baumbach, Tilo

    2013-10-17

    Computed laminography (CL) was developed to use X-rays from synchrotron sources for high-resolution imaging of the internal structure of a flat specimen from a series of two-dimensional (2D) projection images. The projections are acquired by irradiation of the sample under different rotation angles where the object rotation axis is inclined with respect to the beam direction. For laterally extended objects, this yields a more uniform average transmitted intensity during sample rotation compared to computed tomography (CT). The reconstruction problem of CL cannot be reduced to a data-efficient 2D case (as for parallel-beam CT) since each single slice perpendicular to the rotation axis requires a 2D region on the detector as input data for all projection directions. This paper describes a computationally efficient reconstruction procedure based on filtered backprojection (FBP) adapted to the CL acquisition geometry. From the Fourier slice theorem we derive a framework for analytic image reconstruction and outline implementation details of the generic FBP algorithm. Different approaches reducing the reconstruction time by means of parallel and distributed computations are considered and evaluated. PMID:24235251
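
    As a point of reference, the sketch below implements ordinary 2-D parallel-beam filtered backprojection, the data-efficient CT special case mentioned above; the laminographic geometry of the paper generalizes this, but the filter-then-backproject structure and the independence of the per-angle work (which the multi-core/cluster implementations exploit) are the same. The ramp filter and nearest-neighbour interpolation used here are simplifying assumptions.

        import numpy as np

        # Minimal 2-D parallel-beam filtered backprojection (the CT special
        # case mentioned in the abstract).  The laminographic reconstruction
        # in the paper generalizes this to an inclined rotation axis; here we
        # only illustrate the filter-then-backproject structure.

        def ramp_filter(sino):
            """Apply the ramp filter to each projection (rows = angles)."""
            n = sino.shape[1]
            freqs = np.fft.fftfreq(n)
            return np.real(np.fft.ifft(np.fft.fft(sino, axis=1) * np.abs(freqs), axis=1))

        def fbp(sino, angles):
            """sino: (n_angles, n_det); returns an (n_det, n_det) image."""
            n = sino.shape[1]
            filtered = ramp_filter(sino)
            xs = np.arange(n) - n / 2.0
            X, Y = np.meshgrid(xs, xs)
            image = np.zeros((n, n))
            for proj, theta in zip(filtered, angles):       # each angle is independent:
                t = X * np.cos(theta) + Y * np.sin(theta)   # detector coordinate
                idx = np.clip(np.round(t + n / 2.0).astype(int), 0, n - 1)
                image += proj[idx]                          # nearest-neighbour backprojection
            return image * np.pi / len(angles)

        # Usage: with sino the parallel-beam sinogram of a sample,
        #   angles = np.linspace(0, np.pi, sino.shape[0], endpoint=False)
        #   image = fbp(sino, angles)
        # The per-angle loop is embarrassingly parallel, which is what the
        # multi-core/cluster implementations in the paper exploit.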

  14. Efficient volume reconstruction for parallel-beam computed laminography by filtered backprojection on multi-core clusters.

    PubMed

    Myagotin, Anton; Voropaev, Alexey; Helfen, Lukas; Hänschke, Daniel; Baumbach, Tilo

    2013-12-01

    Computed laminography (CL) was developed to use X-rays from synchrotron sources for high-resolution imaging of the internal structure of a flat specimen from a series of 2-D projection images. The projections are acquired by irradiation of the sample under different rotation angles where the object rotation axis is inclined with respect to the beam direction. This yields for laterally extended objects a more uniform average transmitted intensity during sample rotation compared with computed tomography (CT). The reconstruction problem of CL cannot be reduced to a data-efficient 2-D case (as for parallel-beam CT) since each single slice perpendicular to the rotation axis requires a 2-D region on the detector as input data for all projection directions. This paper describes a computationally efficient reconstruction procedure based on filtered backprojection (FBP) adapted to the CL acquisition geometry. From the Fourier slice theorem, we derive a framework for analytic image reconstruction and outline implementation details of the generic FBP algorithm. Different approaches reducing the reconstruction time by means of parallel and distributed computations are considered and evaluated. PMID:24228274

  15. Parallelized multi–graphics processing unit framework for high-speed Gabor-domain optical coherence microscopy

    PubMed Central

    Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.

    2014-01-01

    Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm³ skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868

  16. 3D-radiative transfer in terrestrial atmosphere: An efficient parallel numerical procedure

    NASA Astrophysics Data System (ADS)

    Bass, L. P.; Germogenova, T. A.; Nikolaeva, O. V.; Kokhanovsky, A. A.; Kuznetsov, V. S.

    2003-04-01

    Light propagation and scattering in the terrestrial atmosphere is usually studied in the framework of 1D radiative transfer theory [1]. However, in reality particles (e.g., ice crystals, solid and liquid aerosols, cloud droplets) are randomly distributed in 3D space. In particular, their concentrations vary both in vertical and horizontal directions. Therefore, 3D effects influence modern cloud and aerosol retrieval procedures, which are currently based on 1D radiative transfer theory. It should be pointed out that the standard radiative transfer equation allows these more complex situations to be studied as well [2]. In recent years, a parallel version of the 2D and 3D RADUGA code has been developed. This version is successfully used in gamma and neutron transport problems [3]. Applications of this code to radiative transfer problems in the atmosphere are contained in [4]. The capabilities of the RADUGA code are presented in [5]. The RADUGA code system is a universal solver of radiative transfer problems for complicated models, including 2D and 3D aerosol and cloud fields with arbitrary scattering anisotropy, light absorption, inhomogeneous underlying surface and topography. Both delta-type and distributed light sources can be accounted for in the framework of the algorithm developed. The accurate numerical procedure is based on the new discrete ordinate SWDD scheme [6]. The algorithm is specifically designed for parallel supercomputers. The version RADUGA 5.1(P) can run on MBC1000M [7] (768 processors with 10 Gb of hard disc memory for each processor). The peak performance equals 1 Tflops. The corresponding scalar version RADUGA 5.1 runs on a PC. As a first example of application of the algorithm developed, we have studied the shadowing effects of clouds on the neighboring cloudless atmosphere, depending on the cloud optical thickness, surface albedo, and illumination conditions. This is of importance for the development of modern satellite aerosol retrieval algorithms. [1] Sobolev

  17. Efficient Bayesian inference for ARFIMA processes

    NASA Astrophysics Data System (ADS)

    Graves, T.; Gramacy, R. B.; Franzke, C. L. E.; Watkins, N. W.

    2015-03-01

    Many geophysical quantities, like atmospheric temperature, water levels in rivers, and wind speeds, have shown evidence of long-range dependence (LRD). LRD means that these quantities experience non-trivial temporal memory, which potentially enhances their predictability, but also hampers the detection of externally forced trends. Thus, it is important to reliably identify whether or not a system exhibits LRD. In this paper we present a modern and systematic approach to the inference of LRD. Rather than Mandelbrot's fractional Gaussian noise, we use the more flexible Autoregressive Fractional Integrated Moving Average (ARFIMA) model which is widely used in time series analysis, and of increasing interest in climate science. Unlike most previous work on the inference of LRD, which is frequentist in nature, we provide a systematic treatment of Bayesian inference. In particular, we provide a new approximate likelihood for efficient parameter inference, and show how nuisance parameters (e.g. short memory effects) can be integrated over in order to focus on long memory parameters, and hypothesis testing more directly. We illustrate our new methodology on the Nile water level data, with favorable comparison to the standard estimators.
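
    The long-memory ingredient of an ARFIMA(p, d, q) model is the fractional differencing operator (1 - B)^d. The sketch below computes its binomial weights and applies the operator to a synthetic series; it illustrates only that standard building block and not the paper's approximate Bayesian likelihood or its treatment of nuisance parameters.

        import numpy as np

        # The long-memory part of an ARFIMA(p, d, q) model is the fractional
        # differencing operator (1 - B)^d.  Its binomial expansion gives weights
        #   w_0 = 1,  w_k = w_{k-1} * (k - 1 - d) / k,
        # so that (1 - B)^d x_t = sum_k w_k x_{t-k}.  This sketch only
        # illustrates that operator; the paper's contribution is a Bayesian
        # treatment of the long-memory parameter d.

        def frac_diff_weights(d, n_weights):
            w = np.empty(n_weights)
            w[0] = 1.0
            for k in range(1, n_weights):
                w[k] = w[k - 1] * (k - 1 - d) / k
            return w

        def frac_diff(x, d):
            """Apply (1 - B)^d to a series x (truncated at the series length)."""
            w = frac_diff_weights(d, len(x))
            return np.array([np.dot(w[: t + 1], x[t::-1]) for t in range(len(x))])

        # Example: fractionally difference a long-memory-looking series with d = 0.3.
        rng = np.random.default_rng(42)
        x = np.cumsum(rng.standard_normal(500)) * 0.1
        y = frac_diff(x, 0.3)
        print(y[:5])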

  18. Monitoring agricultural processing electrical energy use and efficiency

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Energy costs have become proportionately larger as cotton post-harvest processing facilities have utilized other inputs more efficiently. A discrepancy in energy consumption per unit processed between facilities suggests that energy could be utilized more efficiently. Cotton gin facilities were in...

  19. The Development of Mental Processing: Efficiency, Working Memory, and Thinking.

    ERIC Educational Resources Information Center

    Demetriou, Andreas; Christou, Constantinos; Spanoudis, George; Platsidou, Maria

    2002-01-01

    Examined, over 1 year, relations between information processing efficiency, working memory, and problem solving in a sample of 8-, 10-, 12-, and 14-year-olds. Identified a three-stratum hierarchy with individual dimensions organized in three constructs: processing efficiency, working memory, and problem solving. Found that individual dimensions were…

  20. A self-adaptive parameter optimization algorithm in a real-time parallel image processing system.

    PubMed

    Li, Ge; Zhang, Xuehe; Zhao, Jie; Zhang, Hongli; Ye, Jianwei; Zhang, Weizhe

    2013-01-01

    To address the stalemate in which precision, speed, robustness, and other parameters constrain one another in a parallel-processed vision servo system, this paper proposes an adaptive load-capacity balance strategy for the servo parameter optimization algorithm (ALBPO) to improve computing precision and achieve a high detection ratio without lengthening the servo cycle. Load capacity (LC) functions are used to estimate the load on each processor, and the system continuously adapts toward a balanced state based on the fluctuating LC results; meanwhile, a proper set of target detection and location parameters is selected according to the LC results. Compared with current load-balancing algorithms, the proposed algorithm operates without prior knowledge of the maximum and current loads of the processors, which gives it great extensibility. Simulation results showed that the ALBPO algorithm has strong load-balancing performance, optimizing the QoS of each processor and fulfilling the servo-cycle, precision, and robustness requirements of the parallel-processed vision servo system. PMID:24174920
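
    The abstract's core idea, estimating a load-capacity (LC) value per processor and continuously shifting work toward a balanced state without knowing the maximum load in advance, can be caricatured by a simple greedy loop. The sketch below is such a caricature; the LC estimate, migration rule, and stopping criterion are assumptions for illustration and are not the ALBPO algorithm.

        # Hedged sketch of load-capacity-driven rebalancing: each processor's
        # recent behaviour is summarized by a load-capacity (LC) estimate, and
        # work items are migrated from the most to the least loaded processor
        # while doing so still lowers the peak load.  Names and the greedy rule
        # are assumptions for illustration, not the ALBPO algorithm itself.

        def lc_estimate(recent_times):
            """Crude LC estimate: average processing time over a sliding window."""
            return sum(recent_times) / len(recent_times)

        def rebalance(queues, lc, max_moves=100):
            """queues: per-processor work lists; lc: per-processor LC estimates."""
            for _ in range(max_moves):
                loads = [lc[i] * len(q) for i, q in enumerate(queues)]
                hi = max(range(len(loads)), key=loads.__getitem__)
                lo = min(range(len(loads)), key=loads.__getitem__)
                # stop when moving one item would no longer reduce the peak load
                if not queues[hi] or loads[lo] + lc[lo] >= loads[hi]:
                    break
                queues[lo].append(queues[hi].pop())   # migrate one work item
            return queues

        # Example with three processors of unequal speed and unequal queues.
        queues = [list(range(10)), list(range(3)), list(range(1))]
        lc = [0.02, 0.01, 0.05]          # seconds per item, estimated online
        print([len(q) for q in rebalance(queues, lc)])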

  1. Investigation of Mediational Processes Using Parallel Process Latent Growth Curve Modeling.

    ERIC Educational Resources Information Center

    Cheong, JeeWon; MacKinnon, David P.; Khoo, Siek Toon

    2003-01-01

    Investigated a method to evaluate mediational processes using latent growth curve modeling and tested it with empirical data from a longitudinal steroid use prevention program focusing on 1,506 high school football players over 4 years. Findings suggest the usefulness of the approach. (SLD)

  2. Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    DOEpatents

    Bhanot, Gyan V.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Steinmacher-Burow, Burkhard D.; Vranas, Pavlos M.

    2008-01-01

    The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
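
    The scheme in this patent (and in the related entries below) interleaves local 1-D FFTs with an "all-to-all" redistribution of the array. A minimal distributed 2-D FFT with a slab (row-block) decomposition can be sketched with mpi4py as follows; the randomized send order that the patent claims, and any higher-dimensional generalization, are omitted, and the grid size is assumed divisible by the number of ranks.

        from mpi4py import MPI
        import numpy as np

        # Hedged sketch of a distributed 2-D FFT using an all-to-all
        # redistribution between the two 1-D FFT stages.
        # Run e.g. with: mpirun -n 4 python dist_fft2d.py

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        N = 512                  # global grid size; assumes N % size == 0
        nloc = N // size

        # Each rank owns a contiguous slab of rows of the global N x N array.
        local = np.random.default_rng(rank).standard_normal((nloc, N)).astype(complex)

        # 1) 1-D FFT along the dimension that is complete on this rank.
        local = np.fft.fft(local, axis=1)

        # 2) All-to-all: split the row slab into one column block per rank,
        #    exchange blocks, and reassemble into a slab of complete columns.
        send = np.ascontiguousarray(local.reshape(nloc, size, nloc).transpose(1, 0, 2))
        recv = np.empty_like(send)
        comm.Alltoall(send, recv)
        cols = recv.reshape(N, nloc)      # dimension 0 is now complete locally

        # 3) 1-D FFT along the other dimension; the result is the 2-D FFT of
        #    the global array, distributed by columns across ranks.
        result = np.fft.fft(cols, axis=0)
        if rank == 0:
            print("local result block:", result.shape)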

  3. Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    DOEpatents

    Bhanot, Gyan V.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Steinmacher-Burow, Burkhard D.; Vranas, Pavlos M.

    2012-01-10

    The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.

  4. Parallel implementation of 3D FFT with volumetric decomposition schemes for efficient molecular dynamics simulations

    NASA Astrophysics Data System (ADS)

    Jung, Jaewoon; Kobayashi, Chigusa; Imamura, Toshiyuki; Sugita, Yuji

    2016-03-01

    Three-dimensional Fast Fourier Transform (3D FFT) plays an important role in a wide variety of computer simulations and data analyses, including molecular dynamics (MD) simulations. In this study, we develop hybrid (MPI+OpenMP) parallelization schemes of 3D FFT based on two new volumetric decompositions, mainly for the particle mesh Ewald (PME) calculation in MD simulations. In one scheme, (1d_Alltoall), five all-to-all communications in one dimension are carried out, and in the other, (2d_Alltoall), one two-dimensional all-to-all communication is combined with two all-to-all communications in one dimension. 2d_Alltoall is similar to the conventional volumetric decomposition scheme. We performed benchmark tests of 3D FFT for the systems with different grid sizes using a large number of processors on the K computer in RIKEN AICS. The two schemes show comparable performances, and are better than existing 3D FFTs. The performances of 1d_Alltoall and 2d_Alltoall depend on the supercomputer network system and number of processors in each dimension. There is enough leeway for users to optimize performance for their conditions. In the PME method, short-range real-space interactions as well as long-range reciprocal-space interactions are calculated. Our volumetric decomposition schemes are particularly useful when used in conjunction with the recently developed midpoint cell method for short-range interactions, due to the same decompositions of real and reciprocal spaces. The 1d_Alltoall scheme of 3D FFT takes 4.7 ms to simulate one MD cycle for a virus system containing more than 1 million atoms using 32,768 cores on the K computer.

  5. A Computationally Efficient Parallel Levenberg-Marquardt Algorithm for Large-Scale Big-Data Inversion

    NASA Astrophysics Data System (ADS)

    Lin, Y.; O'Malley, D.; Vesselinov, V. V.

    2015-12-01

    Inverse modeling seeks model parameters given a set of observed state variables. However, because the observed data sets are often large and the model parameters are often numerous, conventional methods for solving inverse modeling problems can be computationally expensive for many practical applications. We have developed a new, computationally efficient Levenberg-Marquardt method for solving large-scale inverse modeling problems. Levenberg-Marquardt methods require the solution of a dense linear system of equations which can be prohibitively expensive to compute for large-scale inverse problems. Our novel method projects the original large-scale linear problem down to a Krylov subspace, such that the dimensionality of the measurements can be significantly reduced. Furthermore, instead of solving the linear system for every Levenberg-Marquardt damping parameter, we store the Krylov subspace computed when solving for the first damping parameter and recycle it for all the following damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved by using these computational techniques. We apply this new inverse modeling method to invert for a random transmissivity field. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) at each computational node in the model domain. The inversion is also aided by the use of regularization techniques. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). Julia is an advanced high-level scientific programming language that allows for efficient memory management and utilization of high-performance computational resources. By comparing with a Levenberg-Marquardt method using standard linear inversion techniques, our Levenberg-Marquardt method yields a speed-up ratio of 15 in a multi-core computational environment and a speed-up ratio of 45 in a single-core computational environment. Therefore, our new inverse modeling method is a
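
    For context, the sketch below shows the conventional dense Levenberg-Marquardt iteration, i.e. solving (J^T J + λI) Δp = -J^T r for each damping parameter λ. The paper's contribution is precisely to avoid this dense solve by projecting onto a Krylov subspace and recycling it across damping parameters; the toy exponential-fit problem and the simple damping update below are assumptions for illustration.

        import numpy as np

        # Minimal dense Levenberg-Marquardt iteration: at each step solve
        #   (J^T J + lam * I) dp = -J^T r
        # and accept the step only if it reduces the residual.  This is the
        # conventional baseline that the paper accelerates with Krylov
        # projection and subspace recycling.

        def levenberg_marquardt(residual, jacobian, p0, lam=1e-2, max_iter=50):
            p = np.asarray(p0, dtype=float)
            for _ in range(max_iter):
                r, J = residual(p), jacobian(p)
                dp = np.linalg.solve(J.T @ J + lam * np.eye(len(p)), -J.T @ r)
                if np.sum(residual(p + dp) ** 2) < np.sum(r ** 2):
                    p, lam = p + dp, lam * 0.5      # good step: relax damping
                else:
                    lam *= 2.0                      # bad step: increase damping
                if np.linalg.norm(dp) < 1e-10:
                    break
            return p

        # Toy example: fit y = a * exp(b * x) to noisy data.
        rng = np.random.default_rng(3)
        x = np.linspace(0, 1, 50)
        y = 2.0 * np.exp(-1.5 * x) + 0.01 * rng.standard_normal(x.size)
        res = lambda p: p[0] * np.exp(p[1] * x) - y
        jac = lambda p: np.column_stack([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)])
        print(levenberg_marquardt(res, jac, [1.0, 0.0]))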

  6. Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array

    NASA Astrophysics Data System (ADS)

    Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

    2008-04-01

    This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
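
    The "work farm" pattern described above, a parallel set of identical workers with one input and one output stream, can be emulated in software with ordinary processes and queues. The sketch below does so with Python's multiprocessing module; the placeholder kernel and queue-based channels are stand-ins for the MPPA's per-core objects and self-synchronizing channels.

        import multiprocessing as mp

        # Software analogue of the "work farm" pattern: a set of identical
        # workers consuming from one input stream and producing to one output
        # stream.  On the MPPA each worker would be an object on its own RISC
        # core communicating through self-synchronizing channels; here we
        # emulate the structure with processes and queues.

        def worker(in_q, out_q):
            while True:
                item = in_q.get()
                if item is None:              # sentinel: no more work
                    break
                idx, block = item
                out_q.put((idx, sum(b * b for b in block)))   # placeholder kernel

        if __name__ == "__main__":
            n_workers = 4
            in_q, out_q = mp.Queue(), mp.Queue()
            pool = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
            for p in pool:
                p.start()

            blocks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
            for idx, block in enumerate(blocks):           # one input stream
                in_q.put((idx, block))
            for _ in pool:
                in_q.put(None)

            results = dict(out_q.get() for _ in blocks)    # one output stream
            for p in pool:
                p.join()
            print(results[0], results[9])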

  7. Method of moment solutions to scattering problems in a parallel processing environment

    NASA Technical Reports Server (NTRS)

    Cwik, Tom; Partee, Jonathan; Patterson, Jean

    1991-01-01

    This paper describes the implementation of a parallelized method of moments (MOM) code into an interactive workstation environment. The workstation allows interactive solid body modeling and mesh generation, MOM analysis, and the graphical display of results. After describing the parallel computing environment, the implementation and results of parallelizing a general MOM code are presented in detail.

  8. LightForce Photon-Pressure Collision Avoidance: Updated Efficiency Analysis Utilizing a Highly Parallel Simulation Approach

    NASA Technical Reports Server (NTRS)

    Stupl, Jan; Faber, Nicolas; Foster, Cyrus; Yang, Fan Yang; Nelson, Bron; Aziz, Jonathan; Nuttall, Andrew; Henze, Chris; Levit, Creon

    2014-01-01

    This paper provides an updated efficiency analysis of the LightForce space debris collision avoidance scheme. LightForce aims to prevent collisions on warning by utilizing photon pressure from ground based, commercial off the shelf lasers. Past research has shown that a few ground-based systems consisting of 10 kilowatt class lasers directed by 1.5 meter telescopes with adaptive optics could lower the expected number of collisions in Low Earth Orbit (LEO) by an order of magnitude. Our simulation approach utilizes the entire Two Line Element (TLE) catalogue in LEO for a given day as initial input. Least-squares fitting of a TLE time series is used for an improved orbit estimate. We then calculate the probability of collision for all LEO objects in the catalogue for a time step of the simulation. The conjunctions that exceed a threshold probability of collision are then engaged by a simulated network of laser ground stations. After those engagements, the perturbed orbits are used to re-assess the probability of collision and evaluate the efficiency of the system. This paper describes new simulations with three updated aspects: 1) By utilizing a highly parallel simulation approach employing hundreds of processors, we have extended our analysis to a much broader dataset. The simulation time is extended to one year. 2) We analyze not only the efficiency of LightForce on conjunctions that naturally occur, but also take into account conjunctions caused by orbit perturbations due to LightForce engagements. 3) We use a new simulation approach that is regularly updating the LightForce engagement strategy, as it would be during actual operations. In this paper we present our simulation approach to parallelize the efficiency analysis, its computational performance and the resulting expected efficiency of the LightForce collision avoidance system. Results indicate that utilizing a network of four LightForce stations with 20 kilowatt lasers, 85% of all conjunctions with a

  9. Distinct cerebellar lobules process arousal, valence and their interaction in parallel following a temporal hierarchy.

    PubMed

    Styliadis, Charis; Ioannides, Andreas A; Bamidis, Panagiotis D; Papadelis, Christos

    2015-04-15

    The cerebellum participates in emotion-related neural circuits formed by different cortical and subcortical areas, which sub-serve arousal and valence. Recent neuroimaging studies have shown a functional specificity of cerebellar lobules in the processing of emotional stimuli. However, little is known about the temporal component of this process. The goal of the current study is to assess the spatiotemporal profile of neural responses within the cerebellum during the processing of arousal and valence. We hypothesized that the excitation and timing of distinct cerebellar lobules is influenced by the emotional content of the stimuli. Using magnetoencephalography, we recorded magnetic fields from twelve healthy human individuals while they passively viewed affective pictures rated along arousal and valence. Using a beamformer, we localized gamma-band activity in the cerebellum across time and related the foci of activity to the anatomical organization of the cerebellum. Successive cerebellar activations were observed within distinct lobules starting ~160 ms after stimulus onset. Arousal was processed within both vermal (VI and VIIIa) and hemispheric (left Crus II) lobules. Valence (left VI) and its interaction (left V and left Crus I) with arousal were processed only within hemispheric lobules. Arousal processing was identified first, at early latencies (160 ms), and was long-lived (until 980 ms). In contrast, the processing of valence and of its interaction with arousal was short-lived and occurred at later stages (420-530 ms and 570-640 ms, respectively). Our findings provide for the first time evidence that distinct cerebellar lobules process arousal, valence, and their interaction in a parallel yet temporally hierarchical manner determined by the emotional content of the stimuli. PMID:25665964

  10. High Efficiency Bi-Directional DC-DC Converter With ZVS-ZCS Applied For Parallel Active Filtering

    NASA Astrophysics Data System (ADS)

    Romero, V.; Soto, A.

    2011-10-01

    In space missions, it is becoming more and more common to have strict EMC requirements to be met. Coping with this is a challenge for all those instruments and subsystems implementing AC loads. In particular, the driving of motors is one of the greatest challenges due to the low frequency and high amplitude of the emissions. Driving these motors without exceeding typical EMC levels implies adding an active filter at their input. A passive filtering approach is not useful due to the bulky components required to filter such low frequencies. The aim of this paper is to show a parallel active filtering solution that offers significant advantages over other classical approaches in terms of mass and efficiency.

  11. A Neurally Plausible Parallel Distributed Processing Model of Event-Related Potential Word Reading Data

    PubMed Central

    Laszlo, Sarah; Plaut, David C.

    2011-01-01

    The Parallel Distributed Processing (PDP) framework has significant potential for producing models of cognitive tasks that approximate how the brain performs the same tasks. To date, however, there has been relatively little contact between PDP modeling and data from cognitive neuroscience. In an attempt to advance the relationship between explicit, computational models and physiological data collected during the performance of cognitive tasks, we developed a PDP model of visual word recognition which simulates key results from the ERP reading literature, while simultaneously being able to successfully perform lexical decision—a benchmark task for reading models. Simulations reveal that the model’s success depends on the implementation of several neurally plausible features in its architecture which are sufficiently domain-general to be relevant to cognitive modeling more generally. PMID:21945392

  12. A neurally plausible parallel distributed processing model of event-related potential word reading data.

    PubMed

    Laszlo, Sarah; Plaut, David C

    2012-03-01

    The Parallel Distributed Processing (PDP) framework has significant potential for producing models of cognitive tasks that approximate how the brain performs the same tasks. To date, however, there has been relatively little contact between PDP modeling and data from cognitive neuroscience. In an attempt to advance the relationship between explicit, computational models and physiological data collected during the performance of cognitive tasks, we developed a PDP model of visual word recognition which simulates key results from the ERP reading literature, while simultaneously being able to successfully perform lexical decision-a benchmark task for reading models. Simulations reveal that the model's success depends on the implementation of several neurally plausible features in its architecture which are sufficiently domain-general to be relevant to cognitive modeling more generally. PMID:21945392

  13. All-polymer based fabrication process for an all-polymer flexible and parallel optical interconnect

    NASA Astrophysics Data System (ADS)

    Yang, Jilin; Ge, Tao; Summitt, Chris; Wang, Sunglin; Milster, Tom; Takashima, Yuzuru

    2015-08-01

    We proposed and demonstrated a new all-polymer fabrication process for an all-polymer flexible and parallel optical interconnect cable in which a vertical light coupler is integrated. The approach potentially cuts cost by eliminating the metallization process used to align multiple masks. Throughout the process, we used polyimide as the substrate, coated with Epoclad as the cladding; AP2210B and WPR 5100 were then used to fabricate the waveguides and 45-degree mirror couplers, respectively. In addition, precise alignment of the mirror couplers to the waveguides is achieved using polymer-based, non-metallic, transparent phase-based alignment marks. We tested a feature-specific phase-based alignment system, and the shape and depth of the phase-based alignment marks were optimized for phase-contrast and Schlieren imaging. Our results show that the image contrast is enhanced compared with that observed by a conventional imaging system. This process enables polymer-based waveguides to be integrated using only the polymer-based alignment marks on the WPR 5100 layer.

  14. The efficiency of parallel quantum memory for light in a cavity configuration

    NASA Astrophysics Data System (ADS)

    Vetlugin, A. N.; Sokolov, I. V.

    2013-12-01

    We present a new scheme of quantum memory for optical images (spatially multimode light fields) that allows mapping the quantum state of the signal onto the long-lived coherence of the ground state of an ensemble of stationary atoms or impurity centers. The memory medium is embedded in an optical cavity with degenerate transverse modes, which increases the effective optical thickness of the medium and allows one, in principle, to store information in optically thin atomic layers. Since, in reality, storage and retrieval of limited-duration signals, including signals shorter than the lifetime of the field in the cavity, is of interest, we do not use the low-Q cavity approximation. The influence of losses due to partial reflection of the nonstationary signal field incident on a coupling mirror on the storage efficiency is considered. We used the method of approximate impedance matching, wherein losses due to reflection can be minimized by controlling the coupling parameter of the light field with the memory medium in time, thus creating conditions for destructive interference of the signal and local fields on the coupling mirror. The influence of diffraction on the transverse resolution of memory at the writing and readout stages is investigated, and the number of effectively stored transverse spatial modes of the signal is estimated.

  15. Analyzing Business Process Efficiency by Combining Business Process Simulation with Data Envelopment Analysis

    NASA Astrophysics Data System (ADS)

    Dohmen, Anne; Leyer, Michael

    2010-10-01

    A well-grounded understanding of process efficiency is essential for the sustainable success of organizations. This paper presents a novel method for analyzing the efficiency of business processes. It combines Data Envelopment Analysis (DEA) and Business Process Simulation (BPS) at the process level. DEA is used to measure the efficiency of a process while BPS analyzes potential changes leading to better efficiency. The combination of DEA and BPS is a promising approach for analyzing the structure of process (in-)efficiency. The methodology is illustrated by a numerical example dealing with a loan application process. The results show that it is a powerful methodology for assessing process efficiency improvements. However, it is limited by the general disadvantages of DEA and the assumptions required for conducting a business process simulation.
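
    The DEA half of the method boils down to solving one small linear program per process instance. The sketch below sets up the input-oriented CCR envelopment model with scipy; the two inputs, one output, and the three synthetic process instances are invented for illustration and are not the loan-application data of the paper.

        import numpy as np
        from scipy.optimize import linprog

        # Input-oriented CCR envelopment LP: for one decision-making unit
        # (here, one process instance) find the factor theta by which its
        # inputs could be scaled down while a convex combination of peers
        # still matches its outputs.  Data below are synthetic.

        def ccr_efficiency(X, Y, o):
            """X: (m_inputs, n_units), Y: (s_outputs, n_units), o: unit index."""
            m, n = X.shape
            s = Y.shape[0]
            c = np.r_[1.0, np.zeros(n)]                   # minimize theta
            A_in = np.hstack([-X[:, [o]], X])             # sum_j lam_j x_ij <= theta x_io
            A_out = np.hstack([np.zeros((s, 1)), -Y])     # sum_j lam_j y_rj >= y_ro
            A_ub = np.vstack([A_in, A_out])
            b_ub = np.r_[np.zeros(m), -Y[:, o]]
            bounds = [(None, None)] + [(0, None)] * n     # theta free, lambdas >= 0
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
            return res.x[0]

        # Three process instances, two inputs (staff hours, cycle time), one output.
        X = np.array([[8.0, 10.0, 12.0],
                      [4.0,  6.0,  5.0]])
        Y = np.array([[100.0, 110.0, 90.0]])
        print([round(ccr_efficiency(X, Y, o), 3) for o in range(3)])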

  16. Information processing in parallel through directionally resolved molecular polarization components in coherent multidimensional spectroscopy

    NASA Astrophysics Data System (ADS)

    Yan, Tian-Min; Fresch, Barbara; Levine, R. D.; Remacle, F.

    2015-08-01

    We propose that information processing can be implemented by measuring the directional components of the macroscopic polarization of an ensemble of molecules subject to a sequence of laser pulses. We describe the logic operation theoretically and demonstrate it by simulations. The measurement of integrated stimulated emission in different phase-matching spatial directions provides a logic decomposition of a function that is the discrete analog of an integral transform. The logic operation is reversible and all the possible outputs are computed in parallel for all sets of possible multivalued inputs. The number of logic variables of the function is the number of laser pulses used in sequence. The logic function that is computed depends on the chosen chromophoric molecular complex and on its interactions with the solvent and on the two time intervals between the three pulses and the pulse strengths and polarizations. The outputs are the homodyne detected values of the polarization components that are measured in the allowed phase-matching macroscopic directions k_l, where k_l = Σ_i l_i k_i, k_i is the propagation direction of the i-th pulse, and {l_i} is a set of integers that encodes the multivalued inputs. Parallelism is inherently implemented because all the partial polarizations that define the outputs are processed simultaneously. The outputs, which are read directly on the macroscopic level, can be multivalued because the high dynamical range of partial polarization measurements by nonlinear coherent spectroscopy allows for fine binning of the signals. The outputs are uniquely related to the inputs so that the logic is reversible.

  17. In-Database Raster Analytics: Map Algebra and Parallel Processing in Oracle Spatial Georaster

    NASA Astrophysics Data System (ADS)

    Xie, Q. J.; Zhang, Z. Z.; Ravada, S.

    2012-07-01

    Over the past decade several products have been using enterprise database technology to store and manage geospatial imagery and raster data inside RDBMS, which in turn provides the best manageability and security. With the data volume growing exponentially, real-time or near real-time processing and analysis of such big data becomes more challenging. Oracle Spatial GeoRaster, different from most other products, takes the enterprise database-centric approach for both data management and data processing. This paper describes one of the central components of this database-centric approach: the processing engine built completely inside the database. Part of this processing engine is raster algebra, which we call the In-database Raster Analytics. This paper discusses the three key characteristics of this in-database analytics engine and the benefits. First, it moves the data processing closer to the data instead of moving the data to the processing, which helps achieve greater performance by overcoming the bottleneck of computer networks. Second, we designed and implemented a new raster algebra expression language. This language is based on PL/SQL and is currently focused on the "local" function type of map algebra. This language includes general arithmetic, logical and relational operators and any combination of them, which dramatically improves the analytical capability of the GeoRaster database. The third feature is the implementation of parallel processing of such operations to further improve performance. This paper also presents some sample use cases. The testing results demonstrate that this in-database approach for raster analytics can effectively help solve the biggest performance challenges we are facing today with big raster and image data.
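
    The "local" class of map algebra that the expression language targets operates strictly cell by cell, which is also why it parallelizes so naturally over raster blocks. The numpy sketch below illustrates that class of operation generically; it is not the GeoRaster PL/SQL expression language nor its in-database parallel engine, and the bands, mask, and threshold are invented.

        import numpy as np

        # "Local" map algebra operates cell-by-cell on co-registered rasters:
        # each output cell depends only on the corresponding input cells,
        # which is why such expressions parallelize trivially over raster
        # blocks.  This numpy illustration is generic, not the GeoRaster
        # PL/SQL expression language described in the paper.

        rng = np.random.default_rng(7)
        red  = rng.uniform(0.0, 1.0, (4, 6))      # two co-registered bands
        nir  = rng.uniform(0.0, 1.0, (4, 6))
        mask = rng.random((4, 6)) > 0.2           # e.g. a cloud-free mask

        # Arithmetic, relational, and logical operators combined per cell:
        ndvi = (nir - red) / (nir + red + 1e-9)
        vegetated = (ndvi > 0.3) & mask
        classified = np.where(vegetated, 1, 0)    # simple per-cell classification

        print(classified)

        # In-database, each such expression would be evaluated block by block
        # close to the stored raster, and blocks can be processed in parallel.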

  18. On the Development of an Efficient Parallel Hybrid Solver with Application to Acoustically Treated Aero-Engine Nacelles

    NASA Technical Reports Server (NTRS)

    Watson, Willie R.; Nark, Douglas M.; Nguyen, Duc T.; Tungkahotara, Siroj

    2006-01-01

    A finite element solution to the convected Helmholtz equation in a nonuniform flow is used to model the noise field within 3-D acoustically treated aero-engine nacelles. Options to select linear or cubic Hermite polynomial basis functions and isoparametric elements are included. However, the key feature of the method is a domain decomposition procedure that is based upon the inter-mixing of an iterative and a direct solve strategy for solving the discrete finite element equations. This procedure is optimized to take full advantage of sparsity and exploit the increased memory and parallel processing capability of modern computer architectures. Example computations are presented for the Langley Flow Impedance Test facility and a rectangular mapping of a full scale, generic aero-engine nacelle. The accuracy and parallel performance of this new solver are tested on both model problems using a supercomputer that contains hundreds of central processing units. Results show that the method gives extremely accurate attenuation predictions, achieves super-linear speedup over hundreds of CPUs, and solves upward of 25 million complex equations in a quarter of an hour.

  19. Combined shared and distributed memory ab-initio computations of molecular-hydrogen systems in the correlated state: Process pool solution and two-level parallelism

    NASA Astrophysics Data System (ADS)

    Biborski, Andrzej; Kądzielawa, Andrzej P.; Spałek, Józef

    2015-12-01

    An efficient computational scheme devised for investigating ground-state properties of electronically correlated systems is presented. As an example, an (H2)n chain is considered with long-range electron-electron interactions taken into account. The implemented procedure covers: (i) construction of the single-particle Wannier wave-function basis in the correlated state, (ii) calculation of the microscopic parameters, and (iii) ground-state energy optimization. The optimization loop is based on a highly effective process-pool solution, a dedicated root-workers approach. Hierarchical, two-level parallelism was applied: both shared-memory (via Open Multi-Processing) and distributed-memory (via the Message Passing Interface) models were utilized. We discuss in detail how such an approach yields a substantial increase in calculation speed, reaching a factor of 300 for the fully parallelized solution. The scheme elaborated in detail reflects the situation in which the most demanding task is the single-particle basis optimization.
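
    The root-workers process pool mentioned above can be sketched with mpi4py: rank 0 dispatches independent parameter-set evaluations and collects results, while each worker is free to use shared-memory threading internally, giving the two-level parallelism described. The placeholder objective function and task list below are assumptions; this is not the authors' correlated-state code, and the sketch assumes at least two MPI ranks.

        from mpi4py import MPI

        # Hedged sketch of a root-workers process pool: rank 0 hands out
        # independent evaluations (e.g. the energy for one set of variational
        # parameters) and collects the results; each worker could additionally
        # use shared-memory threading internally.

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        def energy(params):                       # placeholder objective
            return sum((p - 0.5) ** 2 for p in params)

        if rank == 0:
            tasks = [[i * 0.1, i * 0.2] for i in range(20)]
            results, status = {}, MPI.Status()
            next_task, active = 0, 0
            for dest in range(1, size):           # prime every worker
                if next_task < len(tasks):
                    comm.send((next_task, tasks[next_task]), dest=dest)
                    next_task, active = next_task + 1, active + 1
                else:
                    comm.send(None, dest=dest)
            while active:
                tid, value = comm.recv(source=MPI.ANY_SOURCE, status=status)
                results[tid] = value
                src = status.Get_source()
                if next_task < len(tasks):
                    comm.send((next_task, tasks[next_task]), dest=src)
                    next_task += 1
                else:
                    comm.send(None, dest=src)     # no more work
                    active -= 1
            print(min(results.values()))
        else:
            while True:
                task = comm.recv(source=0)
                if task is None:
                    break
                tid, params = task
                comm.send((tid, energy(params)), dest=0)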

  20. Improving hospital efficiency: a process model of organizational change commitments.

    PubMed

    Nigam, Amit; Huising, Ruthanne; Golden, Brian R

    2014-02-01

    Improving hospital efficiency is a critical goal for managers and policy makers. We draw on participant observation of the perioperative coaching program in seven Ontario hospitals to develop knowledge of the process by which the content of change initiatives to increase hospital efficiency is defined. The coaching program was a change initiative involving the use of external facilitators with the goal of increasing perioperative efficiency. Focusing on the role of subjective understandings in shaping initiatives to improve efficiency, we show that physicians, nurses, administrators, and external facilitators all have differing frames of the problems that limit efficiency, and propose different changes that could enhance efficiency. Dynamics of strategic and contested framing ultimately shaped hospital change commitments. We build on work identifying factors that enhance the success of change efforts to improve hospital efficiency, highlighting the importance of subjective understandings and the politics of meaning-making in defining what hospitals change. PMID:24132582

  1. Efficiency and Effectiveness of a Resident Assistant Selection Process.

    ERIC Educational Resources Information Center

    Broitman, Thomas

    The American phenomenon of "more is better" extends a value-loaded concept implicit in budget preparation. At any university, the scope, magnitude and cost of a residence hall assistant program selection process is a metaphor to illustrate efficiency and effectiveness of human resources. In order to discover a more efficient and effective…

  2. Evaluation of parallel operated small-scale bubble columns for microbial process development using Staphylococcus carnosus.

    PubMed

    Dilsen, S; Paul, W; Herforth, D; Sandgathe, A; Altenbach-Rehm, J; Freudl, R; Wandrey, C; Weuster-Botz, D

    2001-06-01

    Shake flasks and pH-controlled small-scale bubble columns were compared with respect to their usefulness as a basic tool for process development for human calcitonin precursor fusion-protein production with Staphylococcus carnosus. Parallel control of the pH (and making use of the base addition data) is necessary to study the effects of medium composition, to identify pH optima and to develop a medium that minimizes the acid excretion of S. carnosus. This medium, with glycerol as the energy source and yeast extract as the carbon and nitrogen source, resulted in cell dry weight concentrations in shake flasks of 5 g l(-1), an improvement by a factor of 10. Cell dry weight concentrations of up to 12.5 g l(-1) were measured in the batch process with pH-controlled small-scale bubble columns due to their higher oxygen transfer capability. In contrast to shake flasks, the batch process performance of recombinant S. carnosus secreting the human calcitonin precursor fusion protein was demonstrated to be identical, within the estimation error, in pH-controlled small-scale bubble columns and in the stirred-tank reactor. PMID:11377767

  3. Membrane Transport Processes Analyzed by a Highly Parallel Nanopore Chip System at Single Protein Resolution.

    PubMed

    Urban, Michael; Vor der Brüggen, Marc; Tampé, Robert

    2016-01-01

    Membrane protein transport at the single-protein level still evades detailed analysis if the substrate translocated is non-electrogenic. Considerable efforts have been made in this field, but techniques enabling automated high-throughput transport analysis in combination with the solvent-free lipid bilayer techniques required for the analysis of membrane transporters are rare. This class of transporters, however, is crucial to cell homeostasis and is therefore a key target in drug development, and methodologies to gain new insights are desperately needed. The manuscript presented here describes the establishment and handling of a novel biochip for the analysis of membrane protein mediated transport processes at single-transporter resolution. The biochip is composed of microcavities enclosed by nanopores; it is highly parallel in its design and can be produced at industrial grade and quantity. Protein-harboring liposomes can be applied directly to the chip surface, forming self-assembled pore-spanning lipid bilayers using SSM (solid-supported lipid membrane) techniques. Pore-spanning parts of the membrane are freestanding, providing the interface for substrate translocation into or out of the cavity space, which can be followed by multi-spectral fluorescent readout in real-time. The establishment of standard operating procedures (SOPs) allows protein-harboring lipid bilayers to be established straightforwardly on the chip surface for virtually every membrane protein that can be functionally reconstituted. The sole prerequisite is the establishment of a fluorescent read-out system for non-electrogenic transport substrates. High-content screening applications are achievable by using automated inverted fluorescent microscopes that record multiple chips in parallel. Large data sets can be analyzed using the freely available custom-designed analysis software. Three-color multi-spectral fluorescent read-out furthermore allows for unbiased data discrimination into different

  4. SIAM Conference on Parallel Processing for Scientific Computing - March 12-14, 2008

    SciTech Connect

    2008-09-08

    The themes of the 2008 conference included, but were not limited to: Programming languages, models, and compilation techniques; The transition to ubiquitous multicore/manycore processors; Scientific computing on special-purpose processors (Cell, GPUs, etc.); Architecture-aware algorithms; From scalable algorithms to scalable software; Tools for software development and performance evaluation; Global perspectives on HPC; Parallel computing in industry; Distributed/grid computing; Fault tolerance; Parallel visualization and large scale data management; and The future of parallel architectures.

  5. High-efficiency spin-resolved and spin-integrated electron detection: Parallel mounting on a hemispherical analyzer

    NASA Astrophysics Data System (ADS)

    Ghiringhelli, G.; Larsson, K.; Brookes, N. B.

    1999-11-01

    We have mounted a compact 25 kV mini-Mott spin polarimeter on a commercial high-throughput hemispherical electron analyzer with a double purpose: to maximize the polarization detection and to preserve the original efficiency of the spectrometer in the spin-integrated measurements. We have thus replaced the 16-anode microchannel-plate detector with a 12-anode microsphere-plate detector in parallel with a Rice University retarding Mott spin polarimeter. Passing from one detection mode to the other is quick and easy. The transfer optics from the analyzer exit slit to the scattering target of the polarimeter allows the full potential of both the electron analyzer and the spin detector to be exploited. The expected effective Sherman function (S_eff = 0.17) and figure of merit (η_0 ≅ 1.4×10^-4) are found in the spin-resolved mode, and only 25% of the original efficiency is lost in the spin-integrated acquisitions.

  6. Information Processing Efficiency and Regulation at Five Months

    PubMed Central

    Diaz, Anjolii; Bell, Martha Ann

    2011-01-01

    Infants with short look durations are generally thought to have better attentional capabilities due to their efficient information processing. Although effortful attention is considered a key component of developing regulatory abilities, little is known about the relation between speed and efficiency of processing and self-regulation. In this study, 5-month-old infants with shorter look durations had greater EEG power values than infants with longer look durations during baseline, as well as during a distressing task and a post-distress attentional processing task. These short-looking infants also demonstrated higher heart rate, relative to long-looking infants, during post-distress information processing. Behaviorally, the two groups differed in the amount of distraction during distress. These data provide evidence for an association between the efficiency of information processing and beginning regulatory abilities in early infancy. PMID:21269705

  7. Parallel processing approach for radiative heat transfer prediction in participating media

    NASA Astrophysics Data System (ADS)

    Saltiel, C.; Naraghi, M. H. N.

    1993-10-01

    Numerical analysis of radiative transfer in participating media can be very complex. Computer simulations of practical situations often require both large computer memory and long calculation times. The use of massively parallel machines has proven very effective in simulating large complex systems. This technical note presents a unified matrix formulation for node-to-node-based radiative exchange in isotropically scattering homogeneous media using the discrete exchange factor (DEF) method. Computational implementation is compared between serial and parallel computing machines. The results demonstrate that parallel computing has the potential for changing the nature of radiative transfer calculations. Parallel computing allows for faster, more manageable calculations; it is especially effective for nonlinear problems.
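
    What makes an exchange-factor formulation attractive for parallel machines is its matrix character: once the node-to-node exchange factors are assembled into a matrix, each solution sweep reduces to dense matrix-vector products whose rows can be distributed across processors. The sketch below only illustrates that structure; the exchange-factor matrix, emissivities, and temperatures are made-up placeholders, not the authors' DEF formulation.

```python
import numpy as np

SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def net_radiative_gain(D, T, eps):
    """Net radiative gain per node given an exchange-factor matrix D.

    D[i, j] is an illustrative exchange factor from node j to node i,
    T the nodal temperatures (K) and eps the emissivities.  The absorbed
    term is a single dense matrix-vector product, so the work splits
    naturally over rows of D -- the structure a parallel machine exploits.
    """
    e = eps * SIGMA * T**4            # nodal emissive powers
    absorbed = D @ e                  # energy arriving at each node
    emitted = D.sum(axis=0) * e       # energy each node sends out through D
    return absorbed - emitted

# Toy example: 4 nodes with a symmetric, made-up exchange-factor matrix.
D = np.array([[0.0, 0.2, 0.1, 0.1],
              [0.2, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.2],
              [0.1, 0.1, 0.2, 0.0]])
T = np.array([1000.0, 800.0, 600.0, 400.0])
eps = np.full(4, 0.9)
print(net_radiative_gain(D, T, eps))
```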

  8. Vector processing efficiency of plasma MHD codes by use of the FACOM 230-75 APU

    NASA Astrophysics Data System (ADS)

    Matsuura, T.; Tanaka, Y.; Naraoka, K.; Takizuka, T.; Tsunematsu, T.; Tokuda, S.; Azumi, M.; Kurita, G.; Takeda, T.

    1982-06-01

    In the framework of pipelined vector architecture, the efficiency of vector processing is assessed with respect to plasma MHD codes in nuclear fusion research. By using a vector processor, the FACOM 230-75 APU, the limit of the enhancement factor due to parallelism of current vector machines is examined for three numerical codes based on a fluid model. Reasonable speed-up factors of approximately 6, 6, and 4 times faster than the highly optimized scalar version are obtained for ERATO (linear stability code), AEOLUS-R1 (nonlinear stability code) and APOLLO (1-1/2D transport code), respectively. Problems of the pipelined vector processors are discussed from the viewpoint of restructuring, optimization and choice of algorithms. In conclusion, the important concept of "concurrency within pipelined parallelism" is emphasized.
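
    The "limit of the enhancement factor due to parallelism" can be made concrete with the usual Amdahl-type bound for pipelined vector processing (a generic illustration with made-up numbers, not figures taken from the codes): if a fraction f of the work vectorizes with vector speed-up v, the overall speed-up is

    \[
    S \;=\; \frac{1}{(1-f) + f/v},
    \]

    so even for v → ∞ a vectorizable fraction of f = 0.85 caps the gain near S ≈ 6.7, the same order as the factors of 4-6 reported above.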

  9. Parallel processing in the brain's visual form system: an fMRI study

    PubMed Central

    Shigihara, Yoshihito; Zeki, Semir

    2014-01-01

    We here extend and complement our earlier time-based, magneto-encephalographic (MEG), study of the processing of forms by the visual brain (Shigihara and Zeki, 2013) with a functional magnetic resonance imaging (fMRI) study, in order to better localize the activity produced in early visual areas when subjects view simple geometric stimuli of increasing perceptual complexity (lines, angles, rhombuses) constituted from the same elements (lines). Our results show that all three categories of form activate all three visual areas with which we were principally concerned (V1–V3), with angles producing the strongest and rhombuses the weakest activity in all three. The difference between the activity produced by angles and rhombuses was significant, that between lines and rhombuses was trend significant while that between lines and angles was not. Taken together with our earlier MEG results, the present ones suggest that a parallel strategy is used in processing forms, in addition to the well-documented hierarchical strategy. PMID:25126064

  10. Rapid LC-MS Drug Metabolite Profiling Using Microsomal Enzyme Bioreactors in a Parallel Processing Format

    PubMed Central

    Bajrami, Besnik; Zhao, Linlin; Schenkman, John B.; Rusling, James F.

    2009-01-01

    Silica nanoparticle bioreactors featuring thin films of enzymes and polyions were utilized in a novel high-throughput 96-well plate format for drug metabolism profiling. The utility of the approach was illustrated by investigating the metabolism of the drugs diclofenac (DCF), troglitazone (TGZ) and raloxifene, for which we observed known metabolic oxidation and bioconjugation pathways and turnover rates. A broad range of enzymes was included by utilizing human liver (HLM), rat liver (RLM) and bicistronic human-cyt P450 3A4 (bicis.-3A4) microsomes as enzyme sources. This parallel approach significantly shortens sample preparation steps compared to an earlier manual processing with nanoparticle bioreactors, allowing a range of significant enzyme reactions to be processed simultaneously. Enzyme turnover rates using the microsomal bioreactors were 2-3 fold larger compared to using conventional microsomal dispersions, most likely because of better accessibility of the enzymes. Ketoconazole (KET) and quinidine (QIN), substrates specific to cyt P450 3A enzymes, were used to demonstrate applicability to establish potentially toxic drug-drug interactions involving enzyme inhibition and acceleration. PMID:19904994

  11. A parallel offline CFD and closed-form approximation strategy for computationally efficient analysis of complex fluid flows

    NASA Astrophysics Data System (ADS)

    Allphin, Devin

    Computational fluid dynamics (CFD) solution approximations for complex fluid flow problems have become a common and powerful engineering analysis technique. These tools, though qualitatively useful, remain limited in practice by their underlying inverse relationship between simulation accuracy and overall computational expense. While a great volume of research has focused on remedying these issues inherent to CFD, one traditionally overlooked area of resource reduction for engineering analysis concerns the basic definition and determination of functional relationships for the studied fluid flow variables. This artificial relationship-building technique, called meta-modeling or surrogate/offline approximation, uses design of experiments (DOE) theory to efficiently approximate non-physical coupling between the variables of interest in a fluid flow analysis problem. By mathematically approximating these variables, DOE methods can effectively reduce the required quantity of CFD simulations, freeing computational resources for other analytical focuses. An idealized interpretation of a fluid flow problem can also be employed to create suitably accurate approximations of fluid flow variables for the purposes of engineering analysis. When used in parallel with a meta-modeling approximation, a closed-form approximation can provide useful feedback concerning proper construction, suitability, or even necessity of an offline approximation tool. It also provides a short-circuit pathway for further reducing the overall computational demands of a fluid flow analysis, again freeing resources for otherwise unsuitable resource expenditures. To validate these inferences, a design optimization problem was presented requiring the inexpensive estimation of aerodynamic forces applied to a valve operating on a simulated piston-cylinder heat engine. The determination of these forces was to be found using parallel surrogate and exact approximation methods, thus evidencing the comparative
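
    The meta-modeling step described above amounts to fitting an inexpensive functional form to a small designed set of CFD samples and then querying the fit instead of the solver. A minimal sketch of that idea follows, with hypothetical (valve lift, flow velocity, force) sample values standing in for actual CFD results.

```python
import numpy as np

# Hypothetical DOE sample: (valve lift, flow velocity) design points together
# with the aerodynamic force each CFD run would have returned (made-up data).
X = np.array([[1.0, 10.0], [1.0, 20.0], [2.0, 10.0],
              [2.0, 20.0], [3.0, 15.0], [1.5, 25.0]])
y = np.array([4.1, 15.8, 8.3, 31.5, 27.0, 25.2])

def design_matrix(X):
    """Quadratic response-surface basis: 1, x1, x2, x1*x2, x1^2, x2^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# Least-squares fit of the surrogate to the sampled CFD points.
coeffs, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)

# The surrogate is now queried at untried designs for negligible cost,
# instead of launching another CFD run.
X_new = np.array([[2.5, 18.0]])
print(design_matrix(X_new) @ coeffs)
```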

  12. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization

    PubMed Central

    Cheng, Kin-On; Law, Ngai-Fong; Siu, Wan-Chi; Liew, Alan Wee-Chung

    2008-01-01

    Background The DNA microarray technology allows the measurement of expression levels of thousands of genes under tens/hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only [1]. Thus, biclustering which clusters genes and conditions simultaneously is preferred over the traditional clustering technique in discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of biclusters called additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis. Results We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as pCluster algorithm. By relaxing the constraint in homogeneity, the proposed algorithm has polynomial-time complexity in the worst case instead of exponential-time complexity as in the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. Comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm. Conclusion We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables an optimum
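
    The additive-model coherence that the pCluster algorithm and the greedy variant described here both build on is commonly quantified by the pScore of 2 × 2 submatrices: genes x, y are coherent on conditions a, b when |(d_xa − d_xb) − (d_ya − d_yb)| ≤ δ. Below is a minimal check of that criterion; the threshold δ and the toy expression matrix are illustrative, not parameters from the article.

```python
import numpy as np
from itertools import combinations

def is_additive_bicluster(sub, delta):
    """True if every 2x2 submatrix of `sub` has pScore <= delta.

    pScore({x, y}, {a, b}) = |(d_xa - d_xb) - (d_ya - d_yb)| is the
    coherence measure behind additive (shifting-pattern) biclusters.
    """
    n_rows, n_cols = sub.shape
    for x, y in combinations(range(n_rows), 2):
        for a, b in combinations(range(n_cols), 2):
            pscore = abs((sub[x, a] - sub[x, b]) - (sub[y, a] - sub[y, b]))
            if pscore > delta:
                return False
    return True

# Toy expression submatrix in which each gene is roughly a shifted copy
# of the first one (rows: genes, columns: conditions).
sub = np.array([[1.0, 3.0, 2.0],
                [2.1, 4.0, 3.0],
                [0.0, 2.0, 1.1]])
print(is_additive_bicluster(sub, delta=0.2))   # True for this toy matrix
```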

  13. A comparison of real-time blade-element and rotor-map helicopter simulations using parallel processing

    NASA Technical Reports Server (NTRS)

    Corliss, Lloyd; Du Val, Ronald W.; Gillman, Herbert, III; Huynh, Loc C.

    1990-01-01

    In recent efforts by NASA, the Army, and Advanced Rotorcraft Technology, Inc. (ART), the application of parallel processing techniques to real-time simulation have been studied. Traditionally, real-time helicopter simulations have omitted the modeling of high-frequency phenomena in order to achieve real-time operation on affordable computers. Parallel processing technology can now provide the means for significantly improving the fidelity of real-time simulation, and one specific area for improvement is the modeling of rotor dynamics. This paper focuses on the results of a piloted simulation in which a traditional rotor-map mathematical model was compared with a more sophisticated blade-element mathematical model that had been implemented using parallel processing hardware and software technology.

  14. A New Perspective on Visual Word Processing Efficiency

    PubMed Central

    Houpt, Joseph W.; Townsend, James T.; Donkin, Christopher

    2013-01-01

    As a fundamental part of our daily lives, visual word processing has received much attention in the psychological literature. Despite the well established advantage of perceiving letters in a word or in a pseudoword over letters alone or in random sequences using accuracy, a comparable effect using response times has been elusive. Some researchers continue to question whether the advantage due to word context is perceptual. We use the capacity coefficient, a well established, response time based measure of efficiency to provide evidence of word processing as a particularly efficient perceptual process to complement those results from the accuracy domain. PMID:24334151
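
    For readers unfamiliar with the measure, the capacity coefficient is conventionally defined (following Townsend and colleagues; the OR-task form below is the standard definition, not a formula reproduced from this article) as a ratio of integrated hazard functions of the response-time distributions, with H(t) = −ln S(t) the cumulative hazard:

    \[
    C(t) \;=\; \frac{H_{\text{whole}}(t)}{\sum_{i} H_{i}(t)}, \qquad H(t) = -\ln S(t),
    \]

    so values above 1 indicate super-capacity, i.e., a word is processed more efficiently than would be predicted from its letters presented in isolation.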

  15. A new perspective on visual word processing efficiency.

    PubMed

    Houpt, Joseph W; Townsend, James T; Donkin, Christopher

    2014-01-01

    As a fundamental part of our daily lives, visual word processing has received much attention in the psychological literature. Despite the well established advantage of perceiving letters in a word or in a pseudoword over letters alone or in random sequences using accuracy, a comparable effect using response times has been elusive. Some researchers continue to question whether the advantage due to word context is perceptual. We use the capacity coefficient, a well established, response time based measure of efficiency to provide evidence of word processing as a particularly efficient perceptual process to complement those results from the accuracy domain. PMID:24334151

  16. Information-Limited Parallel Processing in Difficult Heterogeneous Covert Visual Search

    ERIC Educational Resources Information Center

    Dosher, Barbara Anne; Han, Songmei; Lu, Zhong-Lin

    2010-01-01

    Difficult visual search is often attributed to time-limited serial attention operations, although neural computations in the early visual system are parallel. Using probabilistic search models (Dosher, Han, & Lu, 2004) and a full time-course analysis of the dynamics of covert visual search, we distinguish unlimited capacity parallel versus serial…

  17. Metastable states in the hierarchical Dyson model drive parallel processing in the hierarchical Hopfield network

    NASA Astrophysics Data System (ADS)

    Agliari, Elena; Barra, Adriano; Galluzzi, Andrea; Guerra, Francesco; Tantari, Daniele; Tavani, Flavia

    2015-01-01

    In this paper, we introduce and investigate the statistical mechanics of hierarchical neural networks. First, we approach these systems à la Mattis, by thinking of the Dyson model as a single-pattern hierarchical neural network. We also discuss the stability of different retrievable states as predicted by the related self-consistencies obtained both from a mean-field bound and from a bound that bypasses the mean-field limitation. The latter is worked out by properly reabsorbing the magnetization fluctuations related to higher levels of the hierarchy into effective fields for the lower levels. Remarkably, mixing Amit's ansatz technique for selecting candidate-retrievable states with the interpolation procedure for solving for the free energy of these states, we prove that, due to gauge symmetry, the Dyson model accomplishes both serial and parallel processing. We extend this scenario to multiple stored patterns by implementing the Hebb prescription for learning within the couplings. This results in Hopfield-like networks constrained on a hierarchical topology, for which, by restricting to the low-storage regime where the number of patterns grows at most logarithmically with the number of neurons, we prove the existence of the thermodynamic limit for the free energy, and we give an explicit expression of its mean-field bound and of its related improved bound. The resulting self-consistencies for the Mattis magnetizations, which act as order parameters, are studied, and the stability of solutions is analyzed to get a picture of the overall retrieval capabilities of the system according to both mean-field and non-mean-field scenarios. Our main finding is that embedding the Hebbian rule on a hierarchical topology allows the network to accomplish both serial and parallel processing. By tuning the level of fast noise affecting it or triggering the decay of the interactions with the distance among neurons, the system may switch from sequential retrieval to
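
    The two quantities named here have standard definitions, recalled for reference (standard Hopfield-model notation, not equations quoted from the paper): the Hebb prescription builds the couplings from the p stored patterns ξ, and the Mattis magnetizations measure the overlap of the network configuration σ with each pattern,

    \[
    J_{ij} \;=\; \frac{1}{N}\sum_{\mu=1}^{p}\xi_{i}^{\mu}\xi_{j}^{\mu},
    \qquad
    m_{\mu} \;=\; \frac{1}{N}\sum_{i=1}^{N}\xi_{i}^{\mu}\sigma_{i},
    \]

    with the low-storage regime corresponding to p growing at most logarithmically in N.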

  18. Distributed computing for signal processing: modeling of asynchronous parallel computation. Appendix C. Fault-tolerant interconnection networks and image-processing applications for the PASM parallel processing systems. Final report

    SciTech Connect

    Adams, G.B.

    1984-12-01

    The demand for very-high-speed data processing coupled with falling hardware costs has made large-scale parallel and distributed computer systems both desirable and feasible. Two modes of parallel processing are single-instruction stream-multiple data stream (SIMD) and multiple instruction stream - multiple data stream (MIMD). PASM, a partitionable SIMD/MIMD system, is a reconfigurable multimicroprocessor system being designed for image processing and pattern recognition. An important component of these systems is the interconnection network, the mechanism for communication among the computation nodes and memories. Assuring high reliability for such complex systems is a significant task. Thus, a crucial practical aspect of an interconnection network is fault tolerance. In answer to this need, the Extra Stage Cube (ESC), a fault-tolerant, multistage cube-type interconnection network, is defined. The fault tolerance of the ESC is explored for both single and multiple faults, routing tags are defined, and consideration is given to permuting data and partitioning the ESC in the presence of faults. The ESC is compared with other fault-tolerant multistage networks. Finally, reliability of the ESC and an enhanced version of it are investigated.

  19. Massively Parallel Geostatistical Inversion of Coupled Processes in Heterogeneous Porous Media

    NASA Astrophysics Data System (ADS)

    Ngo, A.; Schwede, R. L.; Li, W.; Bastian, P.; Ippisch, O.; Cirpka, O. A.

    2012-04-01

    The quasi-linear geostatistical approach is an inversion scheme that can be used to estimate the spatial distribution of a heterogeneous hydraulic conductivity field. The estimated parameter field is considered to be a random variable that varies continuously in space, meets the measurements of dependent quantities (such as the hydraulic head, the concentration of a transported solute or its arrival time) and shows the required spatial correlation (described by certain variogram models). This is a method of conditioning a parameter field to observations. Upon discretization, this results in as many parameters as elements of the computational grid. For a full three-dimensional representation of the heterogeneous subsurface, the model-domain resolutions achievable on a serial computer (up to about one million parameters) are hardly sufficient. The forward problems to be solved within the inversion procedure consist of the elliptic steady-state groundwater flow equation and the formally elliptic but nearly hyperbolic steady-state advection-dominated solute transport equation in a heterogeneous porous medium. Both equations are discretized by Finite Element Methods (FEM) using fully scalable domain decomposition techniques. Whereas standard conforming FEM is sufficient for the flow equation, for the advection-dominated transport equation, which raises well-known numerical difficulties at sharp fronts or boundary layers, we use the streamline diffusion approach. The arising linear systems are solved using efficient iterative solvers with an AMG (algebraic multigrid) pre-conditioner. During each iteration step of the inversion scheme one needs to solve a multitude of forward and adjoint problems in order to calculate the sensitivities of each measurement and the related cross-covariance matrix of the unknown parameters and the observations. In order to reduce interprocess communications and to improve the scalability of the code on larger clusters

  20. Application of Parallel Hybrid Algorithm in Massively Parallel GPGPU—The Improved Effective and Efficient Method for Calculating Coulombic Interactions in Simulations of Many Ions with SIMION

    NASA Astrophysics Data System (ADS)

    Saito, Kenichiro; Koizumi, Eiko; Koizumi, Hideya

    2012-09-01

    In our previous study, we introduced a new hybrid approach to effectively approximate the total force on each ion during a trajectory calculation in mass spectrometry device simulations, and the algorithm worked successfully with SIMION. We took one step further and applied the method in massively parallel general-purpose computing with GPU (GPGPU) to test its performance in simulations with thousands to over a million ions. We took extra care to minimize the barrier synchronization and data transfer between the host (CPU) and the device (GPU) memory, and took full advantage of the latency hiding. Parallel codes were written in CUDA C++ and implemented in SIMION via the user-defined Lua program. In this study, we tested the parallel hybrid algorithm with a couple of basic models and analyzed the performance by comparing it to that of the original, fully-explicit method written in serial code. The Coulomb explosion simulation with 128,000 ions was completed in 309 s, over 700 times faster than the 63 h taken by the original explicit method, in which we evaluated two-body Coulomb interactions explicitly between each ion and every other ion. The simulation of 1,024,000 ions was completed in 2650 s. In another example, we applied the hybrid method to a simulation of ions in a simple quadrupole ion storage model with 100,000 ions, and it took less than 10 d. Based on our estimate, the same simulation is expected to take 5-7 y by the explicit method in serial code.
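
    As a point of reference for the speed-ups quoted above, the fully explicit baseline evaluates every two-body Coulomb interaction directly, an O(N²) operation per time step. The vectorized sketch below illustrates that baseline only; the charges, positions, and array sizes are made up, and it is not the SIMION/CUDA implementation itself.

```python
import numpy as np

K_COULOMB = 8.9875517923e9  # Coulomb constant, N m^2 C^-2

def explicit_coulomb_forces(pos, q):
    """Explicit O(N^2) Coulomb forces on every ion from every other ion.

    pos: (N, 3) positions in metres, q: (N,) charges in coulombs.  This is
    the all-pairs evaluation whose cost the hybrid/GPU approach in the
    record above is designed to avoid recomputing in full.
    """
    diff = pos[:, None, :] - pos[None, :, :]        # r_i - r_j, shape (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                  # exclude self-interaction
    scale = K_COULOMB * q[:, None] * q[None, :] / dist**3
    return (scale[..., None] * diff).sum(axis=1)    # (N, 3) net force per ion

rng = np.random.default_rng(0)
pos = rng.normal(scale=1e-3, size=(1000, 3))        # 1000 ions in a ~mm cloud
q = np.full(1000, 1.602e-19)                        # singly charged ions
print(explicit_coulomb_forces(pos, q).shape)
```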

  1. Method and apparatus for optimizing the efficiency and quality of laser material processing

    DOEpatents

    Susemihl, Ingo

    1990-01-01

    The efficiency of laser welding and other laser material processing is optimized according to this invention by rotating the plane of polarization of a linearly polarized laser beam in relation to a work piece of the material being processed simultaneously and in synchronization with steering the laser beam over the work piece so as to keep the plane of polarization parallel to either the plane of incidence or the direction of travel of the beam in relation to the work piece. Also, depending to some extent on the particular processing being accomplished, such as welding or fusing, the angle of incidence of the laser beam on the work piece is kept at or near the polarizing or Brewster's angle. The combination of maintaining the plane of polarization parallel to plane of incidence while also maintaining the angle of incidence at or near the polarizing or Brewster's angle results in only minimal, if any, reflection losses during laser welding. Also, coordinating rotation of the plane of polarization with the translation or steering of a work piece under a laser cutting beam maximizes efficiency and kerf geometry, regardless of the direction of cut.
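
    For reference, the polarizing (Brewster's) angle invoked in the claim is fixed by the refractive indices of the two media (a textbook relation, not part of the patent text); with n_i the index of the incident medium and n_t that of the work piece surface,

    \[
    \theta_{B} \;=\; \arctan\!\left(\frac{n_{t}}{n_{i}}\right),
    \]

    the angle at which the reflectance of light polarized parallel to the plane of incidence vanishes, which is why locking the polarization plane and the angle of incidence together minimizes reflection losses.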

  2. Method and apparatus for optimizing the efficiency and quality of laser material processing

    DOEpatents

    Susemihl, I.

    1990-03-13

    The efficiency of laser welding and other laser material processing is optimized according to this invention by rotating the plane of polarization of a linearly polarized laser beam in relation to a work piece of the material being processed simultaneously and in synchronization with steering the laser beam over the work piece so as to keep the plane of polarization parallel to either the plane of incidence or the direction of travel of the beam in relation to the work piece. Also, depending to some extent on the particular processing being accomplished, such as welding or fusing, the angle of incidence of the laser beam on the work piece is kept at or near the polarizing or Brewster's angle. The combination of maintaining the plane of polarization parallel to plane of incidence while also maintaining the angle of incidence at or near the polarizing or Brewster's angle results in only minimal, if any, reflection losses during laser welding. Also, coordinating rotation of the plane of polarization with the translation or steering of a work piece under a laser cutting beam maximizes efficiency and kerf geometry, regardless of the direction of cut. 7 figs.

  3. Optimization of Parallel Legendre Transform using Graphics Processing Unit (GPU) for a Geodynamo Code

    NASA Astrophysics Data System (ADS)

    Lokavarapu, H. V.; Matsui, H.

    2015-12-01

    Convection and the magnetic field of the Earth's outer core are expected to have vast length scales. To resolve these flows, high-performance computing is required for geodynamo simulations using the spherical harmonic transform (SHT), in which a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters capable of computing on the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To further optimize, we investigate three different algorithms for the SHT using GPUs. One is to preemptively compute the Legendre polynomials on the CPU before executing the SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU simultaneously. In the third approach, we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. Thereafter, the partitioned work is computed simultaneously in the time integration loop. We examine the trade-offs between space and time, memory bandwidth, and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU-enabled Legendre transform. Furthermore, we compare and contrast the different algorithms in the context of GPUs.
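
    Schematically, the work being moved to the GPU is a dense contraction of spectral coefficients against precomputed associated Legendre functions at the grid colatitudes, repeated for every radial shell, which is why both the "precompute on the CPU" and "compute on the GPU" variants, as well as a CPU/GPU split of the radial grid, are viable. The toy sketch below shows the backward (spectral-to-grid) step with made-up array sizes; it does not reproduce Calypso's data layout.

```python
import numpy as np

n_theta, n_modes, n_radial = 96, 64, 32   # made-up resolution
rng = np.random.default_rng(1)

# Precomputed associated Legendre functions P_l^m(cos(theta_k)) for one
# harmonic order, tabulated on the grid colatitudes (random stand-in values).
P = rng.normal(size=(n_theta, n_modes))

# Spectral coefficients for every radial shell -- the per-shell work that
# the third approach in the record splits between the CPU and the GPU.
spec = rng.normal(size=(n_radial, n_modes))

# Backward Legendre transform: one dense matrix product per radial shell.
grid = spec @ P.T                          # shape (n_radial, n_theta)

# Hypothetical static partition of the radial shells between two devices.
part_a, part_b = np.array_split(spec, 2, axis=0)
grid_split = np.vstack([part_a @ P.T, part_b @ P.T])
assert np.allclose(grid, grid_split)       # the split reproduces the result
print(grid.shape)
```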

  4. A parallel process growth mixture model of conduct problems and substance use with risky sexual behavior.

    PubMed

    Wu, Johnny; Witkiewitz, Katie; McMahon, Robert J; Dodge, Kenneth A

    2010-10-01

    Conduct problems, substance use, and risky sexual behavior have been shown to coexist among adolescents, which may lead to significant health problems. The current study was designed to examine relations among these problem behaviors in a community sample of children at high risk for conduct disorder. A latent growth model of childhood conduct problems showed a decreasing trend from grades K to 5. During adolescence, four concurrent conduct problem and substance use trajectory classes were identified (high conduct problems and high substance use, increasing conduct problems and increasing substance use, minimal conduct problems and increasing substance use, and minimal conduct problems and minimal substance use) using a parallel process growth mixture model. Across all substances (tobacco, binge drinking, and marijuana use), higher levels of childhood conduct problems during kindergarten predicted a greater probability of classification into more problematic adolescent trajectory classes relative to less problematic classes. For tobacco and binge drinking models, increases in childhood conduct problems over time also predicted a greater probability of classification into more problematic classes. For all models, individuals classified into more problematic classes showed higher proportions of early sexual intercourse, infrequent condom use, receiving money for sexual services, and ever contracting an STD. Specifically, tobacco use and binge drinking during early adolescence predicted higher levels of sexual risk taking into late adolescence. Results highlight the importance of studying the conjoint relations among conduct problems, substance use, and risky sexual behavior in a unified model. PMID:20558013

  5. Evidence for parallel processing of sensory information controlling dauer formation in Caenorhabditis elegans.

    PubMed

    Thomas, J H; Birnby, D A; Vowels, J J

    1993-08-01

    Dauer formation in Caenorhabditis elegans is induced by chemosensation of high levels of a constitutively secreted pheromone. Seven genes defined by mutations that confer a dauer-formation constitutive phenotype (Daf-c) can be congruently divided into two groups by any of three criteria. Group 1 genes (daf-11 and daf-21) are (1) strongly synergistic with group 2 genes for their Daf-c phenotype, (2) incompletely suppressed by dauer-formation defective (Daf-d) mutations in the genes daf-3 and daf-5 and (3) strongly suppressed by Daf-d mutations in nine genes that affect the structure of chemosensory endings. Group 2 genes (daf-1, daf-4, daf-7, daf-8 and daf-14) are (1) strongly synergistic with group 1 genes for their Daf-c phenotype, (2) fully suppressed by Daf-d mutations in daf-3 and daf-5 and (3) not suppressed by Daf-d mutations in the nine genes that affect chemosensory ending structure. Mutations in each group of genes also cause distinct additional behavioral defects. We propose that these two groups of Daf-c genes act in parallel pathways that process sensory information. The two pathways are partially redundant with each other and normally act in concert to control dauer formation. PMID:8375650

  6. Transcranial direct current stimulation modulates efficiency of reading processes

    PubMed Central

    Thomson, Jennifer M.; Doruk, Deniz; Mascio, Bryan; Fregni, Felipe; Cerruti, Carlo

    2015-01-01

    Transcranial direct current stimulation (tDCS) is a neuromodulatory technique that offers promise as an investigative method for understanding complex cognitive operations such as reading. This study explores the ability of a single session of tDCS to modulate reading efficiency and phonological processing performance within a group of healthy adults. Half the group received anodal or cathodal stimulation, on two separate days, of the left temporo-parietal junction while the other half received anodal or cathodal stimulation of the right homologue area. Pre- and post-stimulation assessment of reading efficiency and phonological processing was carried out. A larger pre-post difference in reading efficiency was found for participants who received right anodal stimulation compared to participants who received left anodal stimulation. Further, there was a significant post-stimulation increase in phonological processing speed following right hemisphere anodal stimulation. Implications for models of reading and reading impairment are discussed. PMID:25852513

  7. High efficiency integration of three-dimensional functional microdevices inside a microfluidic chip by using femtosecond laser multifoci parallel microfabrication

    NASA Astrophysics Data System (ADS)

    Xu, Bing; Du, Wen-Qiang; Li, Jia-Wen; Hu, Yan-Lei; Yang, Liang; Zhang, Chen-Chu; Li, Guo-Qiang; Lao, Zhao-Xin; Ni, Jin-Cheng; Chu, Jia-Ru; Wu, Dong; Liu, Su-Ling; Sugioka, Koji

    2016-01-01

    High-efficiency fabrication and integration of three-dimensional (3D) functional devices in lab-on-a-chip systems are crucial for microfluidic applications. Here, a spatial light modulator (SLM)-based multifoci parallel femtosecond laser scanning technology was proposed to integrate microstructures inside a given ‘Y’-shaped microchannel. The key novelty of our approach lies in rapidly integrating 3D microdevices inside a microchip for the first time, which significantly reduces the fabrication time. The high-quality integration of various 2D-3D microstructures was ensured by quantitatively optimizing the experimental conditions, including prebaking time, laser power and developing time. To verify the designable and versatile capability of this method for integrating functional 3D microdevices in a microchannel, a series of microfilters with adjustable pore sizes from 12.2 μm to 6.7 μm were fabricated to demonstrate selective filtering of polystyrene (PS) particles and cancer cells of different sizes. The filter can be cleaned by reversing the flow and reused many times. This technology will advance the fabrication of 3D integrated microfluidic and optofluidic chips.

  8. High efficiency integration of three-dimensional functional microdevices inside a microfluidic chip by using femtosecond laser multifoci parallel microfabrication.

    PubMed

    Xu, Bing; Du, Wen-Qiang; Li, Jia-Wen; Hu, Yan-Lei; Yang, Liang; Zhang, Chen-Chu; Li, Guo-Qiang; Lao, Zhao-Xin; Ni, Jin-Cheng; Chu, Jia-Ru; Wu, Dong; Liu, Su-Ling; Sugioka, Koji

    2016-01-01

    High-efficiency fabrication and integration of three-dimensional (3D) functional devices in lab-on-a-chip systems are crucial for microfluidic applications. Here, a spatial light modulator (SLM)-based multifoci parallel femtosecond laser scanning technology was proposed to integrate microstructures inside a given 'Y'-shaped microchannel. The key novelty of our approach lies in rapidly integrating 3D microdevices inside a microchip for the first time, which significantly reduces the fabrication time. The high-quality integration of various 2D-3D microstructures was ensured by quantitatively optimizing the experimental conditions, including prebaking time, laser power and developing time. To verify the designable and versatile capability of this method for integrating functional 3D microdevices in a microchannel, a series of microfilters with adjustable pore sizes from 12.2 μm to 6.7 μm were fabricated to demonstrate selective filtering of polystyrene (PS) particles and cancer cells of different sizes. The filter can be cleaned by reversing the flow and reused many times. This technology will advance the fabrication of 3D integrated microfluidic and optofluidic chips. PMID:26818119

  9. High efficiency integration of three-dimensional functional microdevices inside a microfluidic chip by using femtosecond laser multifoci parallel microfabrication

    PubMed Central

    Xu, Bing; Du, Wen-Qiang; Li, Jia-Wen; Hu, Yan-Lei; Yang, Liang; Zhang, Chen-Chu; Li, Guo-Qiang; Lao, Zhao-Xin; Ni, Jin-Cheng; Chu, Jia-Ru; Wu, Dong; Liu, Su-Ling; Sugioka, Koji

    2016-01-01

    High-efficiency fabrication and integration of three-dimensional (3D) functional devices in lab-on-a-chip systems are crucial for microfluidic applications. Here, a spatial light modulator (SLM)-based multifoci parallel femtosecond laser scanning technology was proposed to integrate microstructures inside a given ‘Y’-shaped microchannel. The key novelty of our approach lies in rapidly integrating 3D microdevices inside a microchip for the first time, which significantly reduces the fabrication time. The high-quality integration of various 2D-3D microstructures was ensured by quantitatively optimizing the experimental conditions, including prebaking time, laser power and developing time. To verify the designable and versatile capability of this method for integrating functional 3D microdevices in a microchannel, a series of microfilters with adjustable pore sizes from 12.2 μm to 6.7 μm were fabricated to demonstrate selective filtering of polystyrene (PS) particles and cancer cells of different sizes. The filter can be cleaned by reversing the flow and reused many times. This technology will advance the fabrication of 3D integrated microfluidic and optofluidic chips. PMID:26818119

  10. Stochastic dynamics of small ensembles of non-processive molecular motors: The parallel cluster model

    NASA Astrophysics Data System (ADS)

    Erdmann, Thorsten; Albert, Philipp J.; Schwarz, Ulrich S.

    2013-11-01

    Non-processive molecular motors have to work together in ensembles in order to generate appreciable levels of force or movement. In skeletal muscle, for example, hundreds of myosin II molecules cooperate in thick filaments. In non-muscle cells, by contrast, small groups with few tens of non-muscle myosin II motors contribute to essential cellular processes such as transport, shape changes, or mechanosensing. Here we introduce a detailed and analytically tractable model for this important situation. Using a three-state crossbridge model for the myosin II motor cycle and exploiting the assumptions of fast power stroke kinetics and equal load sharing between motors in equivalent states, we reduce the stochastic reaction network to a one-step master equation for the binding and unbinding dynamics (parallel cluster model) and derive the rules for ensemble movement. We find that for constant external load, ensemble dynamics is strongly shaped by the catch bond character of myosin II, which leads to an increase of the fraction of bound motors under load and thus to firm attachment even for small ensembles. This adaptation to load results in a concave force-velocity relation described by a Hill relation. For external load provided by a linear spring, myosin II ensembles dynamically adjust themselves towards an isometric state with constant average position and load. The dynamics of the ensembles is now determined mainly by the distribution of motors over the different kinds of bound states. For increasing stiffness of the external spring, there is a sharp transition beyond which myosin II can no longer perform the power stroke. Slow unbinding from the pre-power-stroke state protects the ensembles against detachment.
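
    The "Hill relation" invoked for the concave force-velocity curve is the classic hyperbolic form (recalled here in textbook notation, without the parameter values of the model):

    \[
    (F + a)\,(v + b) \;=\; (F_{0} + a)\,b,
    \]

    so the ensemble velocity v falls hyperbolically from its unloaded value b F₀/a at F = 0 to zero at the stall force F₀.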

  11. The breakdown process in an atmospheric pressure nanosecond parallel-plate helium/argon mixture discharge

    NASA Astrophysics Data System (ADS)

    Huang, Bang-Dou; Takashima, Keisuke; Zhu, Xi-Ming; Pu, Yi-Kang

    2016-02-01

    The breakdown process in an atmospheric-pressure nanosecond helium/argon mixture discharge with parallel-plate electrodes is investigated by temporally and spatially resolved optical emission spectroscopy (OES). The spatially resolved electric field is obtained from the Stark splitting of the He I 492.1 nm line. Using the emissions from the He II 468.6 nm, He I 667.8 nm, and Ar I 750.4 nm lines and a collisional-radiative model, the spatially resolved T_e,high and T_e,low (representing the effective T_e in the high-energy and low-energy parts of the EEDF, respectively) are obtained. It is found that, compared with the average electric field provided by the external pulser, the electric field is greatly enhanced at certain locations and significantly weakened at others. This observation shows the effect of ionization wave propagation, as predicted in [1, 2]. The value of T_e,high is much larger than that of T_e,low, which indicates that an elevated high-energy tail in the EEDF is built up under the influence of the strong electric field during the breakdown process. Initially, the spatial distributions of T_e,low and T_e,high generally follow that of the electric field. However, at the end of the breakdown period, the locations of the highest T_e,low and T_e,high are shifted away from the cathode sheath, where the electric field is strongest. This indicates the existence of a non-local effect and is supported by the result of a simple Monte Carlo simulation.

  12. Stochastic dynamics of small ensembles of non-processive molecular motors: The parallel cluster model

    SciTech Connect

    Erdmann, Thorsten; Albert, Philipp J.; Schwarz, Ulrich S.

    2013-11-07

    Non-processive molecular motors have to work together in ensembles in order to generate appreciable levels of force or movement. In skeletal muscle, for example, hundreds of myosin II molecules cooperate in thick filaments. In non-muscle cells, by contrast, small groups with few tens of non-muscle myosin II motors contribute to essential cellular processes such as transport, shape changes, or mechanosensing. Here we introduce a detailed and analytically tractable model for this important situation. Using a three-state crossbridge model for the myosin II motor cycle and exploiting the assumptions of fast power stroke kinetics and equal load sharing between motors in equivalent states, we reduce the stochastic reaction network to a one-step master equation for the binding and unbinding dynamics (parallel cluster model) and derive the rules for ensemble movement. We find that for constant external load, ensemble dynamics is strongly shaped by the catch bond character of myosin II, which leads to an increase of the fraction of bound motors under load and thus to firm attachment even for small ensembles. This adaptation to load results in a concave force-velocity relation described by a Hill relation. For external load provided by a linear spring, myosin II ensembles dynamically adjust themselves towards an isometric state with constant average position and load. The dynamics of the ensembles is now determined mainly by the distribution of motors over the different kinds of bound states. For increasing stiffness of the external spring, there is a sharp transition beyond which myosin II can no longer perform the power stroke. Slow unbinding from the pre-power-stroke state protects the ensembles against detachment.

  13. Reconstruction for Time-Domain In Vivo EPR 3D Multigradient Oximetric Imaging—A Parallel Processing Perspective

    PubMed Central

    Dharmaraj, Christopher D.; Thadikonda, Kishan; Fletcher, Anthony R.; Doan, Phuc N.; Devasahayam, Nallathamby; Matsumoto, Shingo; Johnson, Calvin A.; Cook, John A.; Mitchell, James B.; Subramanian, Sankaran; Krishna, Murali C.

    2009-01-01

    Three-dimensional Oximetric Electron Paramagnetic Resonance Imaging using the Single Point Imaging modality generates unpaired spin density and oxygen images that can readily distinguish between normal and tumor tissues in small animals. It is also possible with fast imaging to track the changes in tissue oxygenation in response to the oxygen content in the breathing air. However, this involves dealing with gigabytes of data for each 3D oximetric imaging experiment involving digital band pass filtering and background noise subtraction, followed by 3D Fourier reconstruction. This process is rather slow in a conventional uniprocessor system. This paper presents a parallelization framework using OpenMP runtime support and parallel MATLAB to execute such computationally intensive programs. The Intel compiler is used to develop a parallel C++ code based on OpenMP. The code is executed on four Dual-Core AMD Opteron shared memory processors, to reduce the computational burden of the filtration task significantly. The results show that the parallel code for filtration has achieved a speed up factor of 46.66 as against the equivalent serial MATLAB code. In addition, a parallel MATLAB code has been developed to perform 3D Fourier reconstruction. Speedup factors of 4.57 and 4.25 have been achieved during the reconstruction process and oximetry computation, for a data set with 23 × 23 × 23 gradient steps. The execution time has been computed for both the serial and parallel implementations using different dimensions of the data and presented for comparison. The reported system has been designed to be easily accessible even from low-cost personal computers through local internet (NIHnet). The experimental results demonstrate that the parallel computing provides a source of high computational power to obtain biophysical parameters from 3D EPR oximetric imaging, almost in real-time. PMID:19672315
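
    The filtration stage described here, band-pass filtering and background subtraction applied independently to many pieces of a large acquisition, is an embarrassingly parallel pattern. The sketch below reproduces that pattern with Python's multiprocessing rather than the OpenMP/parallel-MATLAB stack used by the authors, and the per-chunk filter is a crude placeholder.

```python
import numpy as np
from multiprocessing import Pool

def filter_chunk(chunk):
    """Placeholder per-chunk processing: subtract the mean (background) and
    apply a crude band-pass by differencing short and long moving averages.
    Each chunk is processed independently, which is what makes this stage
    embarrassingly parallel."""
    x = chunk - chunk.mean()
    short = np.convolve(x, np.ones(5) / 5, mode="same")
    long_ = np.convolve(x, np.ones(51) / 51, mode="same")
    return short - long_

def parallel_filter(data, n_chunks=8, n_workers=4):
    chunks = np.array_split(data, n_chunks)
    with Pool(processes=n_workers) as pool:
        return np.concatenate(pool.map(filter_chunk, chunks))

if __name__ == "__main__":
    raw = np.random.default_rng(2).normal(size=2_000_000)   # stand-in raw data
    print(parallel_filter(raw).shape)
```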

  14. Reconstruction for time-domain in vivo EPR 3D multigradient oximetric imaging--a parallel processing perspective.

    PubMed

    Dharmaraj, Christopher D; Thadikonda, Kishan; Fletcher, Anthony R; Doan, Phuc N; Devasahayam, Nallathamby; Matsumoto, Shingo; Johnson, Calvin A; Cook, John A; Mitchell, James B; Subramanian, Sankaran; Krishna, Murali C

    2009-01-01

    Three-dimensional Oximetric Electron Paramagnetic Resonance Imaging using the Single Point Imaging modality generates unpaired spin density and oxygen images that can readily distinguish between normal and tumor tissues in small animals. It is also possible with fast imaging to track the changes in tissue oxygenation in response to the oxygen content in the breathing air. However, this involves dealing with gigabytes of data for each 3D oximetric imaging experiment involving digital band pass filtering and background noise subtraction, followed by 3D Fourier reconstruction. This process is rather slow in a conventional uniprocessor system. This paper presents a parallelization framework using OpenMP runtime support and parallel MATLAB to execute such computationally intensive programs. The Intel compiler is used to develop a parallel C++ code based on OpenMP. The code is executed on four Dual-Core AMD Opteron shared memory processors, to reduce the computational burden of the filtration task significantly. The results show that the parallel code for filtration has achieved a speed up factor of 46.66 as against the equivalent serial MATLAB code. In addition, a parallel MATLAB code has been developed to perform 3D Fourier reconstruction. Speedup factors of 4.57 and 4.25 have been achieved during the reconstruction process and oximetry computation, for a data set with 23 x 23 x 23 gradient steps. The execution time has been computed for both the serial and parallel implementations using different dimensions of the data and presented for comparison. The reported system has been designed to be easily accessible even from low-cost personal computers through local internet (NIHnet). The experimental results demonstrate that the parallel computing provides a source of high computational power to obtain biophysical parameters from 3D EPR oximetric imaging, almost in real-time. PMID:19672315

  15. Distinct lateral inhibitory circuits drive parallel processing of sensory information in the mammalian olfactory bulb

    PubMed Central

    Geramita, Matthew A; Burton, Shawn D; Urban, Nathan N

    2016-01-01

    Splitting sensory information into parallel pathways is a common strategy in sensory systems. Yet, how circuits in these parallel pathways are composed to maintain or even enhance the encoding of specific stimulus features is poorly understood. Here, we have investigated the parallel pathways formed by mitral and tufted cells of the olfactory system in mice and characterized the emergence of feature selectivity in these cell types via distinct lateral inhibitory circuits. We find differences in activity-dependent lateral inhibition between mitral and tufted cells that likely reflect newly described differences in the activation of deep and superficial granule cells. Simulations show that these circuit-level differences allow mitral and tufted cells to best discriminate odors in separate concentration ranges, indicating that segregating information about different ranges of stimulus intensity may be an important function of these parallel sensory pathways. DOI: http://dx.doi.org/10.7554/eLife.16039.001 PMID:27351103

  16. Distinct lateral inhibitory circuits drive parallel processing of sensory information in the mammalian olfactory bulb.

    PubMed

    Geramita, Matthew A; Burton, Shawn D; Urban, Nathan N

    2016-01-01

    Splitting sensory information into parallel pathways is a common strategy in sensory systems. Yet, how circuits in these parallel pathways are composed to maintain or even enhance the encoding of specific stimulus features is poorly understood. Here, we have investigated the parallel pathways formed by mitral and tufted cells of the olfactory system in mice and characterized the emergence of feature selectivity in these cell types via distinct lateral inhibitory circuits. We find differences in activity-dependent lateral inhibition between mitral and tufted cells that likely reflect newly described differences in the activation of deep and superficial granule cells. Simulations show that these circuit-level differences allow mitral and tufted cells to best discriminate odors in separate concentration ranges, indicating that segregating information about different ranges of stimulus intensity may be an important function of these parallel sensory pathways. PMID:27351103

  17. Applications of Parallel Process HiMAP for Large Scale Multidisciplinary Problems

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Potsdam, Mark; Rodriguez, David; Kwak, Dochay (Technical Monitor)

    2000-01-01

    HiMAP is a three level parallel middleware that can be interfaced to a large scale global design environment for code independent, multidisciplinary analysis using high fidelity equations. Aerospace technology needs are rapidly changing. Computational tools compatible with the requirements of national programs such as space transportation are needed. Conventional computation tools are inadequate for modern aerospace design needs. Advanced, modular computational tools are needed, such as those that incorporate the technology of massively parallel processors (MPP).

  18. Multiple-Instruction, Multiple-Data Path Computers: Parallel Processing Impact on Flight Simulation Software. Final Report.

    ERIC Educational Resources Information Center

    Lord, Robert E.; And Others

    The purpose of this study was to evaluate the parallel processing impact of multiple-instruction multiple-data path (MIMD) computers on flight simulation software. Basic mathematical functions and arithmetic expressions from typical flight simulation software were selected and run on an MIMD computer to evaluate the improvement in execution time…

  19. Parallel versus Serial Processing Dependencies in the Perisylvian Speech Network: A Granger Analysis of Intracranial EEG Data

    ERIC Educational Resources Information Center

    Gow, David W., Jr.; Keller, Corey J.; Eskandar, Emad; Meng, Nate; Cash, Sydney S.

    2009-01-01

    In this work, we apply Granger causality analysis to high spatiotemporal resolution intracranial EEG (iEEG) data to examine how different components of the left perisylvian language network interact during spoken language perception. The specific focus is on the characterization of serial versus parallel processing dependencies in the dominant…

  20. Testing the Theoretical Design of a Health Risk Message: Reexamining the Major Tenets of the Extended Parallel Process Model

    ERIC Educational Resources Information Center

    Gore, Thomas D.; Bracken, Cheryl Campanella

    2005-01-01

    This study examined the fear control/danger control responses that are predicted by the Extended Parallel Process Model (EPPM). In a campaign designed to inform college students about the symptoms and dangers of meningitis, participants were given either a high-threat/no-efficacy or high-efficacy/no-threat health risk message, thus testing the…

  1. SLAPP: A systolic linear algebra parallel processor

    SciTech Connect

    Drake, B.L.; Luk, F.T.; Speiser, J.M.; Symanski, J.J.

    1987-07-01

    Systolic array computer architectures provide a means for fast computation of the linear algebra algorithms that form the building blocks of many signal-processing algorithms, facilitating their real-time computation. For applications to signal processing, the systolic array operates on matrices, an inherently parallel view of the data, using numerical linear algebra algorithms that have been suitably parallelized to efficiently utilize the available hardware. This article describes work currently underway at the Naval Ocean Systems Center, San Diego, California, to build a two-dimensional systolic array, SLAPP, demonstrating efficient and modular parallelization of key matrix computations for real-time signal- and image-processing problems.

  2. Processes for producing low cost, high efficiency silicon solar cells

    DOEpatents

    Rohatgi, Ajeet; Doshi, Parag; Tate, John Keith; Mejia, Jose; Chen, Zhizhang

    1998-06-16

    Processes which utilize rapid thermal processing (RTP) are provided for inexpensively producing high efficiency silicon solar cells. The RTP processes preserve minority carrier bulk lifetime τ and permit selective adjustment of the depth of the diffused regions, including emitter and back surface field (bsf), within the silicon substrate. In a first RTP process, an RTP step is utilized to simultaneously diffuse phosphorus and aluminum into the front and back surfaces, respectively, of a silicon substrate. Moreover, an in situ controlled cooling procedure preserves the carrier bulk lifetime τ and permits selective adjustment of the depth of the diffused regions. In a second RTP process, both simultaneous diffusion of the phosphorus and aluminum as well as annealing of the front and back contacts are accomplished during the RTP step. In a third RTP process, the RTP step accomplishes simultaneous diffusion of the phosphorus and aluminum, annealing of the contacts, and annealing of a double-layer antireflection/passivation coating SiN/SiO_x. In a fourth RTP process, the process of applying front and back contacts is broken up into two separate respective steps, which enhances the efficiency of the cells, at a slight time expense. In a fifth RTP process, a second RTP step is utilized to fire and adhere the screen printed or evaporated contacts to the structure.

  3. Processes for producing low cost, high efficiency silicon solar cells

    DOEpatents

    Rohatgi, A.; Doshi, P.; Tate, J.K.; Mejia, J.; Chen, Z.

    1998-06-16

    Processes which utilize rapid thermal processing (RTP) are provided for inexpensively producing high efficiency silicon solar cells. The RTP processes preserve minority carrier bulk lifetime {tau} and permit selective adjustment of the depth of the diffused regions, including emitter and back surface field (bsf), within the silicon substrate. In a first RTP process, an RTP step is utilized to simultaneously diffuse phosphorus and aluminum into the front and back surfaces, respectively, of a silicon substrate. Moreover, an in situ controlled cooling procedure preserves the carrier bulk lifetime {tau} and permits selective adjustment of the depth of the diffused regions. In a second RTP process, both simultaneous diffusion of the phosphorus and aluminum as well as annealing of the front and back contacts are accomplished during the RTP step. In a third RTP process, the RTP step accomplishes simultaneous diffusion of the phosphorus and aluminum, annealing of the contacts, and annealing of a double-layer antireflection/passivation coating SiN/SiO{sub x}. In a fourth RTP process, the process of applying front and back contacts is broken up into two separate respective steps, which enhances the efficiency of the cells, at a slight time expense. In a fifth RTP process, a second RTP step is utilized to fire and adhere the screen printed or evaporated contacts to the structure. 28 figs.

  4. Improved motion contrast and processing efficiency in OCT angiography using complex-correlation algorithm

    NASA Astrophysics Data System (ADS)

    Guo, Li; Li, Pei; Pan, Cong; Liao, Rujia; Cheng, Yuxuan; Hu, Weiwei; Chen, Zhong; Ding, Zhihua; Li, Peng

    2016-02-01

    The complex-based OCT angiography (Angio-OCT) offers high motion contrast by combining both the intensity and phase information. However, due to involuntary bulk tissue motions, complex-valued OCT raw data are processed sequentially with different algorithms for correcting bulk image shifts (BISs), compensating global phase fluctuations (GPFs) and extracting flow signals. Such a complicated procedure results in massive computational load. To mitigate such a problem, in this work, we present an inter-frame complex-correlation (CC) algorithm. The CC algorithm is suitable for parallel processing of both flow signal extraction and BIS correction, and it does not need GPF compensation. This method provides high processing efficiency and shows superiority in motion contrast. The feasibility and performance of the proposed CC algorithm is demonstrated using both flow phantom and live animal experiments.
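
    The inter-frame complex correlation the algorithm relies on can be written compactly: for two repeated complex-valued frames A₁ and A₂, a normalized correlation near 1 indicates static tissue, while motion decorrelates both amplitude and phase, so 1 − |CC| serves as the flow contrast. The sketch below uses that generic definition on synthetic data; it is not the authors' implementation, which additionally handles bulk-motion correction in parallel.

```python
import numpy as np

def complex_correlation(a1, a2):
    """Normalized inter-frame complex correlation over a region.

    Values near 1 indicate static tissue; moving scatterers decorrelate
    amplitude and phase between repeated frames, so 1 - |cc| is used as
    the angiographic (flow) contrast.
    """
    num = np.abs(np.sum(a1 * np.conj(a2)))
    den = np.sqrt(np.sum(np.abs(a1) ** 2) * np.sum(np.abs(a2) ** 2)) + 1e-12
    return num / den

rng = np.random.default_rng(3)
static = rng.normal(size=(16, 16)) + 1j * rng.normal(size=(16, 16))
moving = rng.normal(size=(16, 16)) + 1j * rng.normal(size=(16, 16))

print(1 - complex_correlation(static, static))   # ~0: no decorrelation, no flow
print(1 - complex_correlation(static, moving))   # near 1: decorrelated, "flow"
```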

  5. Parallel Subconvolution Filtering Architectures

    NASA Technical Reports Server (NTRS)

    Gray, Andrew A.

    2003-01-01

    These architectures are based on methods of vector processing and the discrete-Fourier-transform/inverse-discrete-Fourier-transform (DFT-IDFT) overlap-and-save method, combined with time-block separation of digital filters into frequency-domain subfilters implemented by use of sub-convolutions. The parallel-processing method implemented in these architectures enables the use of relatively small DFT-IDFT pairs, while filter tap lengths are theoretically unlimited. The size of a DFT-IDFT pair is determined by the desired reduction in processing rate, rather than by the order of the filter that one seeks to implement. The emphasis in this report is on those aspects of the underlying theory and design rules that promote computational efficiency, parallel processing at reduced data rates, and simplification of the designs of very-large-scale integrated (VLSI) circuits needed to implement high-order filters and correlators.
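
    The overlap-and-save structure these architectures build on can be stated compactly in software: the input is processed in blocks of length N, each block keeps the last M − 1 samples of its predecessor, filtering is a pointwise multiplication in the DFT domain, and the first M − 1 samples of every inverse transform are discarded. The reference sketch below demonstrates that procedure; the block length and filter are arbitrary, and the hardware subfilter decomposition itself is not reproduced.

```python
import numpy as np

def overlap_save(x, h, block_len=256):
    """Linear convolution of x with FIR filter h via DFT/IDFT overlap-and-save."""
    m = len(h)
    step = block_len - (m - 1)                 # new samples consumed per block
    H = np.fft.fft(h, block_len)
    # Prepend m-1 zeros as the initial "history" and pad the tail.
    x_pad = np.concatenate([np.zeros(m - 1), x, np.zeros(step)])
    out = []
    for start in range(0, len(x), step):
        block = x_pad[start:start + block_len]
        y = np.fft.ifft(np.fft.fft(block) * H)
        out.append(y[m - 1:].real)             # discard the m-1 wrapped samples
    return np.concatenate(out)[:len(x)]

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
h = np.hanning(33)
h /= h.sum()                                   # arbitrary low-pass FIR filter
assert np.allclose(overlap_save(x, h), np.convolve(x, h)[:len(x)], atol=1e-10)
print("overlap-save output matches direct convolution")
```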

  6. Using Monte Carlo techniques and parallel processing for fragmentation analysis of explosive payloads

    SciTech Connect

    LaFarge, R.A.

    1992-01-01

    Sandia National Laboratories (SNL) launched the Los Alamos National Laboratory (LANL) sponsored ZEST flight test program from the SNL Kauai Test Facility (KTF) in the summer of 1991. The ZEST program had about 255 pounds of high explosive (HE) aboard a Talos-Castor launch vehicle. Naturally, such undertakings raise questions about the safety of personnel and the environment in the event of a premature detonation of the HE. These questions pertain not only to KTF and the island of Kauai but to the neighboring islands as well. The ability to determine realistically P{sub I}, the probability of an explosively generated fragment impacting a given exclusion area, is an important factor in the safety analysis of any flight test involving explosive payloads. Once P{sub I} is known, the casualty expectations C{sub E} can be computed based on local demographics. A set of two computer codes was developed to determine P{sub I} based on computed fragment impacts. One of these codes, SAFETIE1 (Sandia Analysis of FragmEnt TrajectorIEs), computes files of trajectory initial conditions generated in a Monte Carlo sense for a set of n explosions each containing m fragments. These initial condition files are then used to compute trajectories in a parallel processing environment using a local area network (LAN) of 40 Sun workstations. This approach saves the equivalent of 40 hours of Cray YMP time. The other code, SAFETIE2, is a postprocessor that uses an AMEER output file generated by SAFETIE1, to determine how many explosions have at least one fragment in a user defined exclusion area. The average number of fragments per explosion in the exclusion area ({bar N}) is also computed (for C{sub E} considerations). 12 refs.
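
    The quantity being estimated, P_I, is simply the fraction of simulated explosions that place at least one fragment inside the exclusion area, and the explosion-by-explosion trajectory batches are independent, which is what made the farm of 40 workstations effective. The sketch below gives a schematic Monte Carlo estimate of P_I and of the mean fragment count; the ballistics are reduced to a placeholder drag-free trajectory, and nothing here reproduces SAFETIE or AMEER.

```python
import numpy as np
from multiprocessing import Pool

G = 9.81  # m s^-2

def fragments_in_area(seed, n_fragments=200, area_radius=500.0,
                      area_center=(3000.0, 0.0)):
    """One simulated explosion: count fragments landing in the exclusion area.

    Fragment velocities are drawn at random and flown on drag-free ballistic
    arcs from the origin -- a deliberately crude placeholder for a real
    trajectory model.
    """
    rng = np.random.default_rng(seed)
    speed = rng.uniform(50.0, 400.0, n_fragments)
    elev = rng.uniform(0.0, np.pi / 2, n_fragments)
    azim = rng.uniform(0.0, 2 * np.pi, n_fragments)
    ground_range = speed * np.cos(elev) * 2 * speed * np.sin(elev) / G
    x = ground_range * np.cos(azim)
    y = ground_range * np.sin(azim)
    dist = np.hypot(x - area_center[0], y - area_center[1])
    return int(np.count_nonzero(dist < area_radius))

if __name__ == "__main__":
    n_explosions = 2000
    with Pool(processes=4) as pool:            # stand-in for the workstation LAN
        counts = np.array(pool.map(fragments_in_area, range(n_explosions)))
    p_impact = np.mean(counts > 0)             # P_I: >= 1 fragment in the area
    n_bar = counts.mean()                      # average fragments per explosion
    print(f"P_I ~ {p_impact:.3f}, N_bar ~ {n_bar:.3f}")
```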

  7. Parallel processing streams for motor output and sensory prediction during action preparation.

    PubMed

    Stenner, Max-Philipp; Bauer, Markus; Heinze, Hans-Jochen; Haggard, Patrick; Dolan, Raymond J

    2015-03-15

    Sensory consequences of one's own actions are perceived as less intense than identical, externally generated stimuli. This is generally taken as evidence for sensory prediction of action consequences. Accordingly, recent theoretical models explain this attenuation by an anticipatory modulation of sensory processing prior to stimulus onset (Roussel et al. 2013) or even action execution (Brown et al. 2013). Experimentally, prestimulus changes that occur in anticipation of self-generated sensations are difficult to disentangle from more general effects of stimulus expectation, attention and task load (performing an action). Here, we show that an established manipulation of subjective agency over a stimulus leads to a predictive modulation in sensory cortex that is independent of these factors. We recorded magnetoencephalography while subjects performed a simple action with either hand and judged the loudness of a tone caused by the action. Effector selection was manipulated by subliminal motor priming. Compatible priming is known to enhance a subjective experience of agency over a consequent stimulus (Chambon and Haggard 2012). In line with this effect on subjective agency, we found stronger sensory attenuation when the action that caused the tone was compatibly primed. This perceptual effect was reflected in a transient phase-locked signal in auditory cortex before stimulus onset and motor execution. Interestingly, this sensory signal emerged at a time when the hemispheric lateralization of motor signals in M1 indicated ongoing effector selection. Our findings confirm theoretical predictions of a sensory modulation prior to self-generated sensations and support the idea that a sensory prediction is generated in parallel to motor output (Walsh and Haggard 2010), before an efference copy becomes available. PMID:25540223

  8. Using a model of parallel distributed processing associated with data mining in the characterization of sexuality in a university population.

    PubMed

    de Azevedo Simões, Priscyla Waleska Targino; Martins, Paulo João; Casagrandre, Rogério Antônio; Madeira, Kristian; de Mattos, Merisandra Côrtes; Manenti, Sandra Aparecida; da Rosa, Maria Inês; Dal-Pizzol, Felipe; Venson, Ramon; Coral, Leandro Natal; de Souza, Gabriel Scheffer; Pandini, Jeison Cleiton; Cassettari Junior, José Márcio; Moretti, Gustavo Pasquali; Cesconetto, Samuel

    2013-01-01

    Using the Java Parallel Programming Framework, a framework for developing parallel applications, a performance analysis was conducted of an application that clusters data by fuzzy logic combined with the Gustafson-Kessel algorithm. In addition to runs in a distributed environment, processing times were also collected, for comparative purposes, in a single-Personal-Computer environment. With the timing results obtained, a statistical analysis was performed to validate the application and the algorithm, as well as the use of computational clustering as a way to increase application performance. PMID:23920909

  9. Numerical Processing Efficiency Improved in Experienced Mental Abacus Children

    ERIC Educational Resources Information Center

    Wang, Yunqi; Geng, Fengji; Hu, Yuzheng; Du, Fenglei; Chen, Feiyan

    2013-01-01

    Experienced mental abacus (MA) users are able to perform mental arithmetic calculations with unusual speed and accuracy. However, it remains unclear whether their extraordinary gains in mental arithmetic ability are accompanied by an improvement in numerical processing efficiency. To address this question, the present study, using a numerical…

  10. IMPROVING TACONITE PROCESSING PLANT EFFICIENCY BY COMPUTER SIMULATION, Final Report

    SciTech Connect

    William M. Bond; Salih Ersayin

    2007-03-30

    This project involved industrial scale testing of a mineral processing simulator to improve the efficiency of a taconite processing plant, namely the Minorca mine. The Concentrator Modeling Center at the Coleraine Minerals Research Laboratory, University of Minnesota Duluth, enhanced the capabilities of available software, Usim Pac, by developing mathematical models needed for accurate simulation of taconite plants. This project provided funding for this technology to prove itself in the industrial environment. As the first step, data representing existing plant conditions were collected by sampling and sample analysis. Data were then balanced and provided a basis for assessing the efficiency of individual devices and the plant, and also for performing simulations aimed at improving plant efficiency. Performance evaluation served as a guide in developing alternative process strategies for more efficient production. A large number of computer simulations were then performed to quantify the benefits and effects of implementing these alternative schemes. Modification of makeup ball size was selected as the most feasible option for the target performance improvement. This was combined with replacement of existing hydrocyclones with more efficient ones. After plant implementation of these modifications, plant sampling surveys were carried out to validate findings of the simulation-based study. Plant data showed very good agreement with the simulated data, confirming results of simulation. After the implementation of modifications in the plant, several upstream bottlenecks became visible. Despite these bottlenecks limiting full capacity, concentrator energy improvement of 7% was obtained. Further improvements in energy efficiency are expected in the near future. The success of this project demonstrated the feasibility of a simulation-based approach. Currently, the Center provides simulation-based service to all the iron ore mining companies operating in northern

  11. Comparison of cleaning processes with respect to cleaning efficiency

    NASA Astrophysics Data System (ADS)

    Nesladek, Pavel; Osborne, Steve; Rode, Thomas

    2011-03-01

    Photomask technology has attained feature sizes of about 50nm and below. Whereas the main feature size at the 20nm technology node, recently reported as developed e.g. by Toppan Printing Company, is still above 70-80nm, assist features for this node are in the range of 50-60nm. One of the critical aspects of this technology development is the cleaning process. Processes are supposed to clean off contamination and particles down to a defect size of about 40nm and at the same time prevent damage to assist features in the same size range. Due to the obvious trade-offs between cleaning power and Feature Damage Probability (FDP), this task becomes tricky. Improving cleaning processes by raising the power of megasonic (MS) cleaning, or by adjusting the speed and size of droplets for spray cleaning, occurs at the expense of increased feature damage. Prolonging physical cleaning steps does not necessarily lead to improved cleaning, as shown previously. Susceptibility to feature damage occurs predictably according to feature dimension and orientation. This allows us to extrapolate a Feature Damage Limit (FDL), which approximates the smallest feature size for which a process has an acceptable probability of success. In a practical sense, the most advantageous approach seems to be to adjust the cleaning power to the maximum allowed by the FDP and then optimize to the lowest process time necessary to reach the expected cleaning efficiency. Since there are several alternative physical cleaning principles, we have to pick the best one for a given application. At this point we have to raise the question of how to compare the cleaning efficiency of processes. The goal of this work is to provide a method for evaluating and comparing the cleaning efficiency of physical cleaning processes and to demonstrate the method on an example. We will focus on comparing two physical cleaning processes: 1MHz megasonic and a binary spray process.

  12. Increased Efficiencies in the INEEL SAR/TSR/USQ Process

    SciTech Connect

    Cole, N.E.

    2002-05-16

    The Idaho National Engineering and Environmental Laboratory (INEEL) has implemented a number of efficiencies to reduce the time and cost of preparing safety basis documents. The INEEL is continuing to look at other aspects of the safety basis process to identify other efficiencies that can be implemented and remain in compliance with Title 10 Code of Federal Regulations (CFR) Part 830. A six-sigma approach is used to identify areas to improve efficiencies and develop the action plan for implementation of the new process, as applicable. Three improvement processes have been implemented: The first was the development of standardized Documented Safety Analysis (DSA) and technical safety requirement (TSR) documents that all nuclear facilities use, by adding facility-specific details. The second is a material procurement process, which is based on safety systems specified in the individual safety basis documents. The third is a restructuring of the entire safety basis preparation and approval process. Significant savings in time to prepare safety basis document, cost of materials, and total cost of the documents are currently being realized.

  13. Increased Efficiencies in the INEEL SAR/TSR/USQ Process

    SciTech Connect

    Cole, Norman Edward

    2002-06-01

    The Idaho National Engineering and Environmental Laboratory (INEEL) has implemented a number of efficiencies to reduce the time and cost of preparing safety basis documents. The INEEL is continuing to look at other aspects of the safety basis process to identify other efficiencies that can be implemented and remain in compliance with Title 10 Code of Federal Regulations (CFR) Part 830. A six-sigma approach is used to identify areas to improve efficiencies and develop the action plan for implementation of the new process, as applicable. Three improvement processes have been implemented: The first was the development of standardized Documented Safety Analysis (DSA) and technical safety requirement (TSR) documents that all nuclear facilities use, by adding facility-specific details. The second is a material procurement process, which is based on safety systems specified in the individual safety basis documents. The third is a restructuring of the entire safety basis preparation and approval process. Significant savings in time to prepare safety basis document, cost of materials, and total cost of the documents are currently being realized.

  14. A parallel real-time computing-cluster implementation of spotlight SAR processing

    NASA Astrophysics Data System (ADS)

    Mathew, Bipin; Rabinkin, Daniel

    2005-05-01

    The high resolution imaging capability of Synthetic Aperture Radar (SAR) is largely unaffected by atmospheric conditions and has proven to be an indispensable asset in a variety of military and civilian applications. Applying SAR methodology to real-time imaging, however, carries with it the large computational complexity and storage requirements of the image-forming algorithms. Recently, however, the rapidly diminishing cost of computing hardware and the related ascent of cluster-based computing have made parallelization of these algorithms an appealing area of investigation. This paper describes a parallel SAR processor developed at MIT Lincoln Laboratory. Several novel technologies were employed in its implementation, including pMatlab, a parallel extension of standard Matlab that is also being developed at MIT Lincoln Laboratory. These technologies are described later in the document. We begin with a brief description of the basic SAR algorithm.

  15. Processing technology for high efficiency silicon solar cells

    NASA Technical Reports Server (NTRS)

    Spitzer, M. B.; Keavney, C. J.

    1985-01-01

    Recent advances in silicon solar cell processing have led to conversion efficiencies approaching 20%. The basic cell design is investigated and the features of greatest importance to achieving 20% efficiency are indicated. Experiments to separately optimize high efficiency design features in test structures are discussed. The integration of these features in a high efficiency cell is examined. Ion implantation has been used to achieve optimal emitter dopant concentrations and junction depth. The optimization reflects the trade-off between high sheet conductivity, necessary for high fill factor, and heavy doping effects, which must be minimized for high open circuit voltage. A second important aspect of the design experiments is the development of a passivation process to minimize front surface recombination velocity. The manner in which a thin SiO2 layer may be used for this purpose without increasing reflection losses, provided the antireflection coating is properly designed, is indicated. Details are presented of processing intended to reduce recombination at the contact/Si interface. Data on cell performance (including CZ and ribbon) and analysis of loss mechanisms are also presented.

  16. Energy and Resource Efficiency of Laser Cutting Processes

    NASA Astrophysics Data System (ADS)

    Kellens, Karel; Rodrigues, Goncalo Costa; Dewulf, Wim; Duflou, Joost R.

    Due to increasing energy and resource costs on the one hand and upcoming regulations on energy and resource efficiency on the other, a growing interest of machine tool builders in the environmental performance of their machine tools can be observed today. Over the last decade, academic as well as industrial research groups have started to assess the environmental aspects of discrete part manufacturing processes and have indicated a significant potential for improvement [1]. This paper provides an overview of the environmental performance (energy and resource efficiency) of different types of laser cutting systems and of derived performance-improving strategies.

  17. Parallel NPARC: Implementation and Performance

    NASA Technical Reports Server (NTRS)

    Townsend, S. E.

    1996-01-01

    Version 3 of the NPARC Navier-Stokes code includes support for large-grain (block level) parallelism using explicit message passing between a heterogeneous collection of computers. This capability has the potential for significant performance gains, depending upon the block data distribution. The parallel implementation uses a master/worker arrangement of processes. The master process assigns blocks to workers, controls worker actions, and provides remote file access for the workers. The processes communicate via explicit message passing using an interface library which provides portability to a number of message passing libraries, such as PVM (Parallel Virtual Machine). A Bourne shell script is used to simplify the task of selecting hosts, starting processes, retrieving remote files, and terminating a computation. This script also provides a simple form of fault tolerance. An analysis of the computational performance of NPARC is presented, using data sets from an F/A-18 inlet study and a Rocket Based Combined Cycle Engine analysis. Parallel speedup and overall computational efficiency were obtained for various NPARC run parameters on a cluster of IBM RS6000 workstations. The data show that although NPARC performance compares favorably with the estimated potential parallelism, typical data sets used with previous versions of NPARC will often need to be reblocked for optimum parallel performance. In one of the cases studied, reblocking increased peak parallel speedup from 3.2 to 11.8.
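
    A toy master/worker arrangement in the spirit of the description above can be sketched with Python multiprocessing queues standing in for PVM-style explicit message passing. The block contents and the per-block "solver" work are placeholders, and this is not the NPARC implementation; it only shows the master assigning blocks, workers reporting results, and shutdown via poison pills.

      import multiprocessing as mp

      def worker(task_q, result_q):
          """Worker: receive block assignments from the master, process them, report back."""
          while True:
              block = task_q.get()
              if block is None:          # poison pill -> shut down
                  break
              block_id, cells = block
              # stand-in for a flow-solver sweep over this grid block
              result_q.put((block_id, sum(cells) / len(cells)))

      if __name__ == "__main__":
          blocks = [(i, list(range(1, 1000 * (i + 1)))) for i in range(8)]  # uneven block sizes
          task_q, result_q = mp.Queue(), mp.Queue()
          n_workers = 4
          procs = [mp.Process(target=worker, args=(task_q, result_q)) for _ in range(n_workers)]
          for p in procs:
              p.start()
          for b in blocks:               # master assigns blocks to whichever worker is free
              task_q.put(b)
          for _ in procs:                # one poison pill per worker
              task_q.put(None)
          results = dict(result_q.get() for _ in blocks)
          for p in procs:
              p.join()
          print(results)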

  18. Parallel Processing of Numerical Tsunami Simulations on a High Performance Cluster based on the GDAL Library

    NASA Astrophysics Data System (ADS)

    Schroeder, Matthias; Jankowski, Cedric; Hammitzsch, Martin; Wächter, Joachim

    2014-05-01

    Thousands of numerical tsunami simulations allow the computation of inundation and run-up along the coast for vulnerable areas over time. A so-called Matching Scenario Database (MSDB) [1] contains this large number of simulations in text file format. In order to visualize these wave propagations, the scenarios have to be reprocessed automatically. In the TRIDEC project, funded by the Seventh Framework Programme of the European Union, a Virtual Scenario Database (VSDB) and a Matching Scenario Database (MSDB) were established, amongst others, by the working group of the University of Bologna (UniBo) [1]. One part of TRIDEC was the development of a new generation of Decision Support System (DSS) for tsunami Early Warning Systems (TEWS) [2]. A working group of the GFZ German Research Centre for Geosciences was responsible for developing the Command and Control User Interface (CCUI), the central software application that supports operator activities, incident management and message dissemination. For integration and visualization in the CCUI, the numerical tsunami simulations from the MSDB must be converted into the shapefile format. The use of shapefiles enables much easier integration into standard Geographic Information Systems (GIS); moreover, the CCUI is based on two widely used open source products (the GeoTools library and uDig), which provide shapefile integration a priori. In this case, several thousand tsunami variations were processed for an example area around the Western Iberian margin. Due to the mass of data, only a program-controlled process was conceivable. In order to optimize the computing effort and operating time, an existing GFZ High Performance Computing Cluster (HPC) was used. Thus, geospatial software capable of parallel processing was sought. The FOSS tool Geospatial Data Abstraction Library (GDAL/OGR) was used to match the coordinates with the wave heights and generates the
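
    A hedged sketch of the kind of program-controlled conversion described above: one scenario text file per worker, each converted to an ESRI shapefile of wave-height points using GDAL/OGR's Python bindings. The input directory name, the three-column file layout (lon, lat, height) and the WGS84 assumption are invented for illustration; the actual TRIDEC processing chain is not reproduced here.

      from multiprocessing import Pool
      from pathlib import Path

      import numpy as np
      from osgeo import ogr, osr

      def scenario_to_shapefile(txt_path):
          """Convert one simulation text file (lon, lat, wave_height per line) into
          an ESRI shapefile of points carrying the wave height as an attribute."""
          data = np.loadtxt(txt_path)                       # columns: lon, lat, height (assumed)
          out_path = str(Path(txt_path).with_suffix(".shp"))

          srs = osr.SpatialReference()
          srs.ImportFromEPSG(4326)                          # WGS84 lon/lat assumed

          driver = ogr.GetDriverByName("ESRI Shapefile")
          ds = driver.CreateDataSource(out_path)
          layer = ds.CreateLayer("wave_height", srs, ogr.wkbPoint)
          layer.CreateField(ogr.FieldDefn("height", ogr.OFTReal))

          for lon, lat, height in data:
              feat = ogr.Feature(layer.GetLayerDefn())
              point = ogr.Geometry(ogr.wkbPoint)
              point.AddPoint(float(lon), float(lat))
              feat.SetGeometry(point)
              feat.SetField("height", float(height))
              layer.CreateFeature(feat)
              feat = None
          ds = None                                         # flush and close the datasource
          return out_path

      if __name__ == "__main__":
          scenarios = sorted(Path("msdb_scenarios").glob("*.txt"))  # hypothetical input directory
          with Pool() as pool:                              # one scenario per worker, embarrassingly parallel
              shapefiles = pool.map(scenario_to_shapefile, scenarios)
          print(f"wrote {len(shapefiles)} shapefiles")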

  19. Working memory capacity and redundant information processing efficiency

    PubMed Central

    Endres, Michael J.; Houpt, Joseph W.; Donkin, Chris; Finn, Peter R.

    2015-01-01

    Working memory capacity (WMC) is typically measured by the amount of task-relevant information an individual can keep in mind while resisting distraction or interference from task-irrelevant information. The current research investigated the extent to which differences in WMC were associated with performance on a novel redundant memory probes (RMP) task that systematically varied the amount of to-be-remembered (targets) and to-be-ignored (distractor) information. The RMP task was designed to both facilitate and inhibit working memory search processes, as evidenced by differences in accuracy, response time, and Linear Ballistic Accumulator (LBA) model estimates of information processing efficiency. Participants (N = 170) completed standard intelligence tests and dual-span WMC tasks, along with the RMP task. As expected, accuracy, response-time, and LBA model results indicated memory search and retrieval processes were facilitated under redundant-target conditions, but also inhibited under mixed target/distractor and redundant-distractor conditions. Repeated measures analyses also indicated that, while individuals classified as high (n = 85) and low (n = 85) WMC did not differ in the magnitude of redundancy effects, groups did differ in the efficiency of memory search and retrieval processes overall. Results suggest that redundant information reliably facilitates and inhibits the efficiency or speed of working memory search, and these effects are independent of more general limits and individual differences in the capacity or space of working memory. PMID:26074828

  20. Parallel processing of real-time dynamic systems simulation on OSCAR (Optimally SCheduled Advanced multiprocessoR)

    NASA Technical Reports Server (NTRS)

    Kasahara, Hironori; Honda, Hiroki; Narita, Seinosuke

    1989-01-01

    Parallel processing of real-time dynamic systems simulation on a multiprocessor system named OSCAR is presented. In the simulation of dynamic systems, the same calculations are generally repeated every time step. However, Do-all or Do-across techniques cannot be applied to parallel processing of the simulation, since there exist data dependencies from the end of an iteration to the beginning of the next iteration, and furthermore data input and data output are required every sampling period. Therefore, parallelism inside the calculation required for a single time step, i.e. within a large basic block consisting of arithmetic assignment statements, must be exploited. In the proposed method, near-fine-grain tasks, each of which consists of one or more floating point operations, are generated to extract the parallelism from the calculation and are assigned to processors using optimal static scheduling at compile time, in order to reduce the large run-time overhead caused by the use of near-fine-grain tasks. The practicality of the scheme is demonstrated on OSCAR (Optimally SCheduled Advanced multiprocessoR), which has been developed to exploit the advantageous features of static scheduling algorithms to the maximum extent.
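
    To make the compile-time scheduling idea concrete, the sketch below performs a generic critical-path list scheduling of a small task graph onto a fixed number of processors. It is not the OSCAR scheduling algorithm: inter-processor data-transfer costs are ignored, the priority function and task costs are made up, and the function name static_list_schedule is hypothetical.

      def static_list_schedule(tasks, deps, n_proc):
          """Compile-time list scheduling of near-fine-grain tasks onto processors.

          tasks : dict  task -> cost (e.g. number of floating-point operations)
          deps  : dict  task -> set of predecessor tasks (data dependences)
          Returns (assignment, makespan). Priority is the critical-path length
          ("longest cost path to an exit task"), a common heuristic.
          """
          succ = {t: set() for t in tasks}
          for t, ps in deps.items():
              for p in ps:
                  succ[p].add(t)

          # bottom level = longest cost path from a task to an exit task
          blevel = {}
          def bl(t):
              if t not in blevel:
                  blevel[t] = tasks[t] + max((bl(s) for s in succ[t]), default=0)
              return blevel[t]
          for t in tasks:
              bl(t)

          proc_free = [0.0] * n_proc           # time each processor becomes idle
          finish = {}                          # task -> finish time
          assignment = {}
          ready = [t for t in tasks if not deps.get(t)]
          while ready:
              ready.sort(key=lambda t: -blevel[t])          # highest priority first
              t = ready.pop(0)
              p = min(range(n_proc), key=lambda i: proc_free[i])
              start = max([proc_free[p]] + [finish[d] for d in deps.get(t, ())])
              finish[t] = start + tasks[t]
              proc_free[p] = finish[t]
              assignment[t] = p
              for s in succ[t]:
                  if all(d in finish for d in deps[s]) and s not in finish and s not in ready:
                      ready.append(s)
          return assignment, max(finish.values())

      # toy dependence graph extracted from one simulation time step (made-up costs)
      tasks = {"a": 2, "b": 3, "c": 2, "d": 4, "e": 1}
      deps = {"c": {"a"}, "d": {"a", "b"}, "e": {"c", "d"}}
      assignment, makespan = static_list_schedule(tasks, deps, n_proc=2)
      print(assignment, makespan)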