Science.gov

Sample records for efficient parallel processing

  1. Efficient parallel implementation of polarimetric synthetic aperture radar data processing

    NASA Astrophysics Data System (ADS)

    Martinez, Sergio S.; Marpu, Prashanth R.; Plaza, Antonio J.

    2014-10-01

    This work investigates the parallel implementation of a polarimetric synthetic aperture radar (POLSAR) data processing chain. Such processing can be computationally expensive when large data sets are processed. However, the processing steps can be largely implemented in a high performance computing (HPC) environment. In this work, we studied different aspects of the computations involved in processing the POLSAR data and developed an efficient parallel scheme to achieve near-real-time performance. The algorithm is implemented using the message passing interface (MPI) framework in this work, but it can be easily adapted for other parallel architectures such as general purpose graphics processing units (GPGPUs).
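
    A rough sketch of the row-block scatter/gather pattern such an MPI chain typically uses (illustrative only, not the authors' code; the image size and the boxcar-style filter are stand-ins):

        # Hypothetical MPI (mpi4py) sketch: scatter row blocks of a POLSAR
        # channel, filter each block locally, gather the results on rank 0.
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        if rank == 0:
            image = np.random.rand(1024, 1024)      # stand-in for a POLSAR channel
            blocks = np.array_split(image, size, axis=0)
        else:
            blocks = None

        block = comm.scatter(blocks, root=0)        # one row block per rank

        def smooth(b):
            # placeholder for a real filtering step (e.g., boxcar speckle filtering)
            return (b + np.roll(b, 1, axis=1) + np.roll(b, -1, axis=1)) / 3.0

        parts = comm.gather(smooth(block), root=0)
        if rank == 0:
            result = np.vstack(parts)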

  2. Efficient multitasking: parallel versus serial processing of multiple tasks

    PubMed Central

    Fischer, Rico; Plessow, Franziska

    2015-01-01

    In the context of performance optimizations in multitasking, a central debate has unfolded in multitasking research around whether cognitive processes related to different tasks proceed only sequentially (one at a time), or can operate in parallel (simultaneously). This review features a discussion of theoretical considerations and empirical evidence regarding parallel versus serial task processing in multitasking. In addition, we highlight how methodological differences and theoretical conceptions determine the extent to which parallel processing in multitasking can be detected, to guide their employment in future research. Parallel and serial processing of multiple tasks are not mutually exclusive. Therefore, questions focusing exclusively on either task-processing mode are too simplified. We review empirical evidence and demonstrate that shifting between more parallel and more serial task processing critically depends on the conditions under which multiple tasks are performed. We conclude that efficient multitasking is reflected by the ability of individuals to adjust multitasking performance to environmental demands by flexibly shifting between different processing strategies of multiple task-component scheduling. PMID:26441742

  3. Efficient multitasking: parallel versus serial processing of multiple tasks.

    PubMed

    Fischer, Rico; Plessow, Franziska

    2015-01-01

    In the context of performance optimizations in multitasking, a central debate has unfolded in multitasking research around whether cognitive processes related to different tasks proceed only sequentially (one at a time), or can operate in parallel (simultaneously). This review features a discussion of theoretical considerations and empirical evidence regarding parallel versus serial task processing in multitasking. In addition, we highlight how methodological differences and theoretical conceptions determine the extent to which parallel processing in multitasking can be detected, to guide their employment in future research. Parallel and serial processing of multiple tasks are not mutually exclusive. Therefore, questions focusing exclusively on either task-processing mode are too simplified. We review empirical evidence and demonstrate that shifting between more parallel and more serial task processing critically depends on the conditions under which multiple tasks are performed. We conclude that efficient multitasking is reflected by the ability of individuals to adjust multitasking performance to environmental demands by flexibly shifting between different processing strategies of multiple task-component scheduling.

  4. The power and efficiency of advanced software and parallel processing

    NASA Technical Reports Server (NTRS)

    Singh, Ramen P.; Taylor, Lawrence W., Jr.

    1989-01-01

    Real-time simulation of flexible and articulating systems is difficult because of the computational burden of the time-varying calculations. Because the mobile servicing system of the NASA Space Station Freedom will handle heavy payloads by local arm manipulations and by translating along the spine of the Station, it is crucial to have real-time simulation available. To enable such a simulation to be of high fidelity and to be hosted on a modest computer, special care must be taken in formulating the structural dynamics. Frontal solution algorithms save considerable time in performing these calculations. In addition, the formulation must be compatible with parallel processing to take full advantage of both. An approach is offered which will result in high-fidelity, real-time simulation for flexible, articulating systems such as the Space Station remote servicing system.

  5. Efficient biased random bit generation for parallel processing

    SciTech Connect

    Slone, Dale M.

    1994-09-28

    A lattice gas automaton was implemented on a massively parallel machine (the BBN TC2000) and a vector supercomputer (the CRAY C90). The automaton models the Burgers equation ρ_t + ρρ_x = νρ_xx in 1 dimension. The lattice gas evolves by advecting and colliding pseudo-particles on a 1-dimensional, periodic grid. The specific rules for colliding particles are stochastic in nature and require the generation of many billions of random numbers to create the random bits necessary for the lattice gas. The goal of the thesis was to speed up the process of generating the random bits and thereby lessen the computational bottleneck of the automaton.
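
    As a generic illustration of such a kernel (not the thesis' generator), biased bits with P(bit = 1) = p can be produced in large vectorized chunks:

        # Generate biased random bits in chunks; p and the chunk size are arbitrary.
        import numpy as np

        def biased_bits(n, p, seed=0, chunk=1 << 20):
            rng = np.random.default_rng(seed)
            out = np.empty(n, dtype=np.uint8)
            for start in range(0, n, chunk):
                stop = min(start + chunk, n)
                out[start:stop] = rng.random(stop - start) < p
            return out

        bits = biased_bits(10_000_000, p=0.25)
        print(bits.mean())                          # ~0.25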

  6. Parallel processing for efficient 3D slope stability modelling

    NASA Astrophysics Data System (ADS)

    Marchesini, Ivan; Mergili, Martin; Alvioli, Massimiliano; Metz, Markus; Schneider-Muntau, Barbara; Rossi, Mauro; Guzzetti, Fausto

    2014-05-01

    We test the performance of the GIS-based, three-dimensional slope stability model r.slope.stability. The model was developed as a C- and Python-based raster module of the GRASS GIS software. It considers the three-dimensional geometry of the sliding surface, adopting a modification of the model proposed by Hovland (1977), and revised and extended by Xie and co-workers (2006). Given a terrain elevation map and a set of relevant thematic layers, the model evaluates the stability of slopes for a large number of randomly selected potential slip surfaces, ellipsoidal or truncated in shape. Any single raster cell may be intersected by multiple sliding surfaces, each associated with a value of the factor of safety, FS. For each pixel, the minimum value of FS and the depth of the associated slip surface are stored. This information is used to obtain a spatial overview of the potentially unstable slopes in the study area. We test the model in the Collazzone area, Umbria, central Italy, an area known to be susceptible to landslides of different types and sizes. Availability of a comprehensive and detailed landslide inventory map allowed for a critical evaluation of the model results. The r.slope.stability code automatically splits the study area into a defined number of tiles, with proper overlap in order to provide the same statistical significance for the entire study area. The tiles are then processed in parallel by a given number of processors, exploiting a multi-purpose computing environment at CNR IRPI, Perugia. The map of the FS is obtained by collecting the individual results, taking the minimum values on the overlapping cells. This procedure significantly reduces the processing time. We show how the gain in terms of processing time depends on the tile dimensions and on the number of cores.
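
    A toy version of the tile-and-merge strategy described above (the tile size, overlap, and per-tile FS computation are placeholders; r.slope.stability itself is a GRASS GIS module):

        # Split a raster into overlapping tiles, evaluate tiles in parallel,
        # and keep the minimum factor of safety (FS) on overlapping cells.
        import numpy as np
        from multiprocessing import Pool

        TILE, OVERLAP = 256, 32

        def fs_for_tile(args):
            y0, x0, tile = args
            fs = 1.0 + tile                 # placeholder for the 3D stability model
            return y0, x0, fs

        def parallel_fs(dem):
            ny, nx = dem.shape
            jobs = [(y, x, dem[y:y + TILE + OVERLAP, x:x + TILE + OVERLAP])
                    for y in range(0, ny, TILE) for x in range(0, nx, TILE)]
            out = np.full(dem.shape, np.inf)
            with Pool() as pool:
                for y0, x0, fs in pool.map(fs_for_tile, jobs):
                    h, w = fs.shape
                    win = (slice(y0, y0 + h), slice(x0, x0 + w))
                    out[win] = np.minimum(out[win], fs)   # min FS wins on overlaps
            return out

        if __name__ == "__main__":
            fs_map = parallel_fs(np.random.rand(1024, 1024))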

  7. Dynamic CT perfusion image data compression for efficient parallel processing.

    PubMed

    Barros, Renan Sales; Olabarriaga, Silvia Delgado; Borst, Jordi; van Walderveen, Marianne A A; Posthuma, Jorrit S; Streekstra, Geert J; van Herk, Marcel; Majoie, Charles B L M; Marquering, Henk A

    2016-03-01

    The increasing size of medical imaging data, in particular time series such as CT perfusion (CTP), requires new and fast approaches to deliver timely results for acute care. Cloud architectures based on graphics processing units (GPUs) can provide the processing capacity required for delivering fast results. However, the size of CTP datasets makes transfers to cloud infrastructures time-consuming and therefore not suitable in acute situations. To reduce this transfer time, this work proposes a fast and lossless compression algorithm for CTP data. The algorithm exploits redundancies in the temporal dimension and keeps random read-only access to the image elements directly from the compressed data on the GPU. To the best of our knowledge, this is the first work to present a GPU-ready method for medical image compression with random access to the image elements from the compressed data.
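
    One simple way to combine lossless compression with O(1) random access, assuming temporal differences fit in one byte (an illustration of the general idea, not the paper's algorithm):

        # Keep frame 0 at full precision; store later frames as int8 deltas
        # against it, so any element (t, z, y, x) is readable in O(1).
        import numpy as np

        def compress(ctp):                       # ctp: (t, z, y, x) int16 series
            base = ctp[0]
            delta = ctp[1:] - base[None]
            assert np.abs(delta).max() < 128, "deltas too large for int8"
            return base, delta.astype(np.int8)   # ~2x smaller for 16-bit input

        def read(base, delta, t, z, y, x):       # random access, no decompression
            v = int(base[z, y, x])
            return v if t == 0 else v + int(delta[t - 1, z, y, x])

        ctp = np.random.randint(0, 100, (30, 16, 64, 64)).astype(np.int16)
        base, delta = compress(ctp)
        assert read(base, delta, 7, 3, 10, 20) == ctp[7, 3, 10, 20]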

  8. A high resolution finite volume method for efficient parallel simulation of casting processes on unstructured meshes

    SciTech Connect

    Kothe, D.B.; Turner, J.A.; Mosso, S.J.; Ferrell, R.C.

    1997-03-01

    We discuss selected aspects of a new parallel three-dimensional (3-D) computational tool for the unstructured mesh simulation of Los Alamos National Laboratory (LANL) casting processes. This tool, known as Telluride, draws upon robust, high resolution finite volume solutions of metal alloy mass, momentum, and enthalpy conservation equations to model the filling, cooling, and solidification of LANL castings. We briefly describe the current Telluride physical models and solution methods, then detail our parallelization strategy as implemented with Fortran 90 (F90). This strategy has yielded straightforward and efficient parallelization on distributed and shared memory architectures, aided in large part by the new parallel libraries JTpack90 for Krylov-subspace iterative solution methods and PGSLib for efficient gather/scatter operations. We illustrate our methodology and current capabilities with source code examples and parallel efficiency results for a LANL casting simulation.

  9. Reliable and Efficient Parallel Processing Algorithms and Architectures for Modern Signal Processing. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Liu, Kuojuey Ray

    1990-01-01

    Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order-degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order-degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in detail. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on a triangular array and another based on a rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-phase operations. Performance issues are also considered.

  10. A new strategy for efficient solar energy conversion: Parallel-processing with surface plasmons

    NASA Technical Reports Server (NTRS)

    Anderson, L. M.

    1982-01-01

    This paper introduces an advanced concept for direct conversion of sunlight to electricity, which aims at high efficiency by tailoring the conversion process to separate energy bands within the broad solar spectrum. The objective is to obtain a high level of spectrum-splitting without sequential losses or unique materials for each frequency band. In this concept, sunlight excites a spectrum of surface plasma waves which are processed in parallel on the same metal film. The surface plasmons transport energy to an array of metal-barrier-semiconductor diodes, where energy is extracted by inelastic tunneling. Diodes are tuned to different frequency bands by selecting the operating voltage and geometry, but all diodes share the same materials.

  11. Special parallel processing workshop

    SciTech Connect

    1994-12-01

    This report contains viewgraphs from the Special Parallel Processing Workshop. These viewgraphs deal with topics such as parallel processing performance, message passing, queue structure, and other basic concepts dealing with parallel processing.

  12. Automatic analysis (aa): efficient neuroimaging workflows and parallel processing using Matlab and XML.

    PubMed

    Cusack, Rhodri; Vicente-Grabovetsky, Alejandro; Mitchell, Daniel J; Wild, Conor J; Auer, Tibor; Linke, Annika C; Peelle, Jonathan E

    2014-01-01

    Recent years have seen neuroimaging data sets becoming richer, with larger cohorts of participants, a greater variety of acquisition techniques, and increasingly complex analyses. These advances have made data analysis pipelines complicated to set up and run (increasing the risk of human error) and time consuming to execute (restricting what analyses are attempted). Here we present an open-source framework, automatic analysis (aa), to address these concerns. Human efficiency is increased by making code modular and reusable, and managing its execution with a processing engine that tracks what has been completed and what needs to be (re)done. Analysis is accelerated by optional parallel processing of independent tasks on cluster or cloud computing resources. A pipeline comprises a series of modules that each perform a specific task. The processing engine keeps track of the data, calculating a map of upstream and downstream dependencies for each module. Existing modules are available for many analysis tasks, such as SPM-based fMRI preprocessing, individual and group level statistics, voxel-based morphometry, tractography, and multi-voxel pattern analyses (MVPA). However, aa also allows for full customization, and encourages efficient management of code: new modules may be written with only a small code overhead. aa has been used by more than 50 researchers in hundreds of neuroimaging studies comprising thousands of subjects. It has been found to be robust, fast, and efficient, for simple single-subject studies up to multimodal pipelines on hundreds of subjects. It is attractive to both novice and experienced users. aa can reduce the amount of time neuroimaging laboratories spend performing analyses and reduce errors, expanding the range of scientific questions it is practical to address.
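
    The dependency-driven engine can be pictured with a small sketch: a toy scheduler that launches each module as soon as its upstream modules finish, running independent branches in parallel (module names and the thread-pool backend are illustrative; aa itself is Matlab/XML-based):

        # Toy dependency-driven pipeline engine.
        from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

        deps = {                                  # module -> upstream modules
            "realign": [], "normalize": ["realign"],
            "smooth": ["normalize"], "firstlevel": ["smooth"],
            "vbm": ["normalize"],                 # independent of smooth/firstlevel
        }

        def run(module):
            print("running", module)              # placeholder for the real step

        def execute(deps):
            done, running = set(), {}
            with ThreadPoolExecutor() as pool:
                while len(done) < len(deps):      # assumes deps form a DAG
                    for m in deps:
                        if m not in done and m not in running \
                                and all(d in done for d in deps[m]):
                            running[m] = pool.submit(run, m)
                    ready, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                    for m in [m for m, f in running.items() if f in ready]:
                        done.add(m)
                        running.pop(m)

        execute(deps)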

  13. Automatic analysis (aa): efficient neuroimaging workflows and parallel processing using Matlab and XML

    PubMed Central

    Cusack, Rhodri; Vicente-Grabovetsky, Alejandro; Mitchell, Daniel J.; Wild, Conor J.; Auer, Tibor; Linke, Annika C.; Peelle, Jonathan E.

    2015-01-01

    Recent years have seen neuroimaging data sets becoming richer, with larger cohorts of participants, a greater variety of acquisition techniques, and increasingly complex analyses. These advances have made data analysis pipelines complicated to set up and run (increasing the risk of human error) and time consuming to execute (restricting what analyses are attempted). Here we present an open-source framework, automatic analysis (aa), to address these concerns. Human efficiency is increased by making code modular and reusable, and managing its execution with a processing engine that tracks what has been completed and what needs to be (re)done. Analysis is accelerated by optional parallel processing of independent tasks on cluster or cloud computing resources. A pipeline comprises a series of modules that each perform a specific task. The processing engine keeps track of the data, calculating a map of upstream and downstream dependencies for each module. Existing modules are available for many analysis tasks, such as SPM-based fMRI preprocessing, individual and group level statistics, voxel-based morphometry, tractography, and multi-voxel pattern analyses (MVPA). However, aa also allows for full customization, and encourages efficient management of code: new modules may be written with only a small code overhead. aa has been used by more than 50 researchers in hundreds of neuroimaging studies comprising thousands of subjects. It has been found to be robust, fast, and efficient, for simple single-subject studies up to multimodal pipelines on hundreds of subjects. It is attractive to both novice and experienced users. aa can reduce the amount of time neuroimaging laboratories spend performing analyses and reduce errors, expanding the range of scientific questions it is practical to address. PMID:25642185

  14. Efficient parallel video processing techniques on GPU: from framework to implementation.

    PubMed

    Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan

    2014-01-01

    Through reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for the H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework with CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we proposed serial optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation achieves a speedup of 20 over the serial program and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR ranges from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computational power of the GPU. However, the performance of the control-intensive parts (CAVLC) is closely tied to memory bandwidth, which gives an insight for new architecture design.

  15. Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

    PubMed Central

    Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan

    2014-01-01

    Through reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for the H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework with CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we proposed serial optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation achieves a speedup of 20 over the serial program and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR ranges from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computational power of the GPU. However, the performance of the control-intensive parts (CAVLC) is closely tied to memory bandwidth, which gives an insight for new architecture design. PMID:24757432

  16. Efficient Process Migration for Parallel Processing on Non-Dedicated Networks of Workstations

    NASA Technical Reports Server (NTRS)

    Chanchio, Kasidit; Sun, Xian-He

    1996-01-01

    This paper presents the design and preliminary implementation of MpPVM, a software system that supports process migration for PVM application programs in a non-dedicated heterogeneous computing environment. New concepts of migration point as well as migration point analysis and necessary data analysis are introduced. In MpPVM, process migrations occur only at previously inserted migration points. Migration point analysis determines appropriate locations to insert migration points; whereas, necessary data analysis provides a minimum set of variables to be transferred at each migration point. A new methodology to perform reliable point-to-point data communications in a migration environment is also discussed. Finally, a preliminary implementation of MpPVM and its experimental results are presented, showing the correctness and promising performance of our process migration mechanism in a scalable non-dedicated heterogeneous computing environment. While MpPVM is developed on top of PVM, the process migration methodology introduced in this study is general and can be applied to any distributed software environment.

  17. Parallel processing ITS

    SciTech Connect

    Fan, W.C.; Halbleib, J.A. Sr.

    1996-09-01

    This report provides a users' guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.

  18. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. © 2001 The Willi Hennig Society.
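
    For reference, the quantities reported in studies like this are conventionally speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p, for example:

        # Speedup and parallel efficiency from serial and parallel wall times.
        def speedup(t1, tp):
            return t1 / tp

        def efficiency(t1, tp, p):
            return speedup(t1, tp) / p

        print(efficiency(t1=3200.0, tp=250.0, p=16))    # 0.8, i.e. 80% efficient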

  19. Efficiency of parallel direct optimization.

    PubMed

    Janies, D A; Wheeler, W C

    2001-03-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size.

  1. Speeding up parallel processing

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.

    1988-01-01

    In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.
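
    The two views in question are captured by Amdahl's fixed-size speedup S = 1 / ((1 - f) + f/p), where f is the parallel fraction of the work, and Gustafson's scaled speedup S = p - s(p - 1), where s is the serial fraction of the scaled run. A quick check shows the Sandia-class numbers are consistent with very small serial fractions:

        # Amdahl (fixed problem size) vs. Gustafson (scaled problem size).
        def amdahl(f, p):
            return 1.0 / ((1.0 - f) + f / p)

        def gustafson(s, p):
            return p - s * (p - 1)

        p = 1024
        print(amdahl(0.999, p))      # ~506: speedup >500 needs ~99.9% parallel work
        print(gustafson(0.02, p))    # ~1003: scaled speedup >1000 with 2% serial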

  2. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Lau, Sonie

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 90's cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert systems. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited.

  3. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Lau, Sonie; Yan, Jerry C.

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.

  4. Modular and efficient ozone systems based on massively parallel chemical processing in microchannel plasma arrays: performance and commercialization

    NASA Astrophysics Data System (ADS)

    Kim, M.-H.; Cho, J. H.; Park, S.-J.; Eden, J. G.

    2017-08-01

    Plasmachemical systems based on the production of a specific molecule (O3) in literally thousands of microchannel plasmas simultaneously have been demonstrated, developed and engineered over the past seven years, and commercialized. At the heart of this new plasma technology is the plasma chip, a flat aluminum strip fabricated by photolithographic and wet chemical processes and comprising 24-48 channels, micromachined into nanoporous aluminum oxide, with embedded electrodes. By integrating 4-6 chips into a module, the mass output of an ozone microplasma system is scaled linearly with the number of modules operating in parallel. A 115 g/hr (2.7 kg/day) ozone system, for example, is realized by the combined output of 18 modules comprising 72 chips and 1,800 microchannels. The implications of this plasma processing architecture for scaling ozone production capability, and reducing capital and service costs when introducing redundancy into the system, are profound. In contrast to conventional ozone generator technology, microplasma systems operate reliably (albeit with reduced output) in ambient air and humidity levels up to 90%, a characteristic attributable to the water adsorption/desorption properties and electrical breakdown strength of nanoporous alumina. Extensive testing has documented chip and system lifetimes (MTBF) beyond 5,000 hours, and efficiencies >130 g/kWh when oxygen is the feedstock gas. Furthermore, the weight and volume of microplasma systems are a factor of 3-10 lower than those for conventional ozone systems of comparable output. Massively-parallel plasmachemical processing offers functionality, performance, and commercial value beyond that afforded by conventional technology, and is currently in operation in more than 30 countries worldwide.

  5. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1991-01-01

    The main contribution of the effort in the last two years is the introduction of the MOPPS system. After an extensive literature search, we introduced the system, which is described next. MOPPS employs a new solution to the problem of managing programs which solve scientific and engineering applications in a distributed processing environment. Autonomous computers cooperate efficiently in solving large scientific problems with this solution. MOPPS has the advantage of not assuming the presence of any particular network topology or configuration, computer architecture, or operating system. It imposes little overhead on network and processor resources while efficiently managing programs concurrently. The core of MOPPS is an intelligent program manager that builds a knowledge base of the execution performance of the parallel programs it is managing under various conditions. The manager applies this knowledge to improve the performance of future runs. The program manager learns from experience.

  6. Parallelization of heterogeneous reactor calculations on a graphics processing unit

    NASA Astrophysics Data System (ADS)

    Malofeev, V. M.; Pal'shin, V. A.

    2016-12-01

    Parallelization is applied to the neutron calculations performed by the heterogeneous method on a graphics processing unit. The parallel algorithm of the modified TREC code is described. The efficiency of the parallel algorithm is evaluated.

  7. Parallelization of heterogeneous reactor calculations on a graphics processing unit

    SciTech Connect

    Malofeev, V. M.; Pal'shin, V. A.

    2016-12-15

    Parallelization is applied to the neutron calculations performed by the heterogeneous method on a graphics processing unit. The parallel algorithm of the modified TREC code is described. The efficiency of the parallel algorithm is evaluated.

  8. EFFICIENT SCHEDULING OF PARALLEL JOBS ON MASSIVELY PARALLEL SYSTEMS

    SciTech Connect

    F. PETRINI; W. FENG

    1999-09-01

    We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.

  9. Parallel processing of natural language

    SciTech Connect

    Chang, H.O.

    1986-01-01

    Two types of parallel natural language processing are studied in this work: (1) the parallelism between syntactic and nonsyntactic processing and (2) the parallelism within syntactic processing. It is recognized that a syntactic category can potentially be attached to more than one node in the syntactic tree of a sentence. Even if all the attachments are syntactically well-formed, nonsyntactic factors such as semantic and pragmatic considerations may require one particular attachment. Syntactic processing must synchronize and communicate with nonsyntactic processing. Two syntactic processing algorithms are proposed for use in a parallel environment: Earley's algorithm and the LR(k) algorithm. Conditions are identified to detect syntactic ambiguity and the algorithms are augmented accordingly. It is shown that by using nonsyntactic information during syntactic processing, backtracking can be reduced and the performance of the syntactic processor improved. For the second type of parallelism, it is recognized that one portion of a grammar can be isolated from the rest of the grammar and processed by a separate processor. A partial grammar of a larger grammar is defined. Parallel syntactic processing is achieved by using two processors concurrently: the main processor (mp) and the auxiliary processor (ap).

  10. Parallel processing for control applications

    SciTech Connect

    Telford, J. W.

    2001-01-01

    Parallel processing has been a topic of discussion in computer science circles for decades. Using more than one computer to control a process has many advantages that compensate for the additional cost. Initially, multiple computers were used to attain higher speeds. A single CPU could not perform all of the operations necessary for real-time operation. As technology progressed and CPUs became faster, the speed issue became less significant. The additional processing capability, however, continues to make high speed an attractive element of parallel processing. Another reason for multiple processors is reliability. For the purpose of this discussion, reliability and robustness will be the focal point. Most contemporary conceptions of parallel processing include visions of hundreds of single computers networked to provide 'computing power'. Indeed our own teraflop machines are built from large numbers of computers configured in a network (and thus limited by the network). There are many approaches to parallel configurations, and this presentation offers something slightly different from the contemporary networked model. In the world of embedded computers, which is a pervasive force in contemporary computer controls, there are many single-chip computers available. If one backs away from the PC-based parallel computing model and considers the possibilities of a parallel control device based on multiple single-chip computers, a new area of possibilities becomes apparent. This study will look at the use of multiple single-chip computers in a parallel configuration with emphasis placed on maximum reliability.

  11. Applications of Parallel Processing in Configuration Analyses

    NASA Technical Reports Server (NTRS)

    Sundaram, Pichuraman; Hager, James O.; Biedron, Robert T.

    1999-01-01

    The paper presents the recent progress made towards developing an efficient and user-friendly parallel environment for routine analysis of large CFD problems. The coarse-grain parallel version of the CFL3D Euler/Navier-Stokes analysis code, CFL3Dhp, has been ported onto most available parallel platforms. The CFL3Dhp solution accuracy on these parallel platforms has been verified against the CFL3D sequential analyses. User-friendly pre- and post-processing tools that enable a seamless transfer from sequential to parallel processing have been written. A static load balancing tool for CFL3Dhp analysis has also been implemented to achieve good parallel efficiency. For large problems, load balancing efficiency as high as 95% can be achieved even when a large number of processors is used. Linear scalability of the CFL3Dhp code with increasing number of processors has also been shown using a large installed transonic nozzle boattail analysis. To highlight the fast turn-around time of parallel processing, the Navier-Stokes drag polar of the TCA full configuration in sideslip at supersonic cruise has been obtained in a day. CFL3Dhp is currently being used as a production analysis tool.

  12. Knowledge representation into Ada parallel processing

    NASA Technical Reports Server (NTRS)

    Masotto, Tom; Babikyan, Carol; Harper, Richard

    1990-01-01

    The Knowledge Representation into Ada Parallel Processing project is a joint NASA and Air Force funded project to demonstrate the execution of intelligent systems in Ada on the Charles Stark Draper Laboratory fault-tolerant parallel processor (FTPP). Two applications were demonstrated: a portion of the adaptive tactical navigator and a real-time controller. Both systems are implemented as Activation Framework Objects on the Activation Framework intelligent scheduling mechanism developed by Worcester Polytechnic Institute. The implementations, results of performance analyses showing speedup due to parallelism and initial efficiency improvements are detailed and further areas for performance improvements are suggested.

  13. Parallel processing spacecraft communication system

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary S. (Inventor); Donaldson, James A. (Inventor); Luong, Huy H. (Inventor); Wood, Steven H. (Inventor)

    1998-01-01

    An uplink controlling assembly speeds data processing using a special parallel codeblock technique. A correct start sequence initiates processing of a frame. Two possible start sequences can be used, and the one which is used determines whether data polarity is inverted or non-inverted. Processing continues until uncorrectable errors are found. The frame ends by intentionally sending a block with an uncorrectable error. Each of the codeblocks in the frame has a channel ID. Each channel ID can be separately processed in parallel. This obviates the problem of waiting for error correction processing. If that channel number is zero, however, it indicates that the frame of data represents a critical command only. That data is handled in a special way, independent of the software. Otherwise, the processed data is further handled using special double buffering techniques to avoid problems from overrun. When overrun does occur, the system takes action to lose only the oldest data.

  14. Parallel strategies for SAR processing

    NASA Astrophysics Data System (ADS)

    Segoviano, Jesus A.

    2004-12-01

    This article proposes a series of strategies for improving the computer processing of the Synthetic Aperture Radar (SAR) signal, following the three usual lines of action for speeding up the execution of any computer program. On the one hand, the optimization of both the data structures and the application architecture is studied. On the other hand, hardware improvements are considered. For the former, the data structures usually employed in SAR processing are examined, the use of parallel ones is proposed, and the way the parallelization of the algorithms is implemented is described. In addition, the parallel application architecture classifies processes as fine- or coarse-grained; these are assigned to individual processors or divided among processors, each in its corresponding architecture. For the latter, the hardware employed in the parallel processing used for SAR is studied. The improvement here refers to the several kinds of platforms on which the SAR process is implemented: shared memory multiprocessors and distributed memory multicomputers. A comparison between them gives some guidelines to follow in order to obtain maximum throughput with minimum latency and maximum effectiveness with minimum cost, all with limited complexity. It is concluded that processing the algorithms in a GNU/Linux environment on a Beowulf cluster platform offers, under certain conditions, the best compromise between performance and cost, and promises the greatest development in the coming years for the computing-power-hungry applications of Synthetic Aperture Radar.

  15. Parallel processing of genomics data

    NASA Astrophysics Data System (ADS)

    Agapito, Giuseppe; Guzzi, Pietro Hiram; Cannataro, Mario

    2016-10-01

    The availability of high-throughput experimental platforms for the analysis of biological samples, such as mass spectrometry, microarrays and Next Generation Sequencing, has made it possible to analyze a whole genome in a single experiment. Such platforms produce an enormous volume of data per single experiment, thus the analysis of this enormous flow of data poses several challenges in terms of data storage, preprocessing, and analysis. To face those issues, efficient, possibly parallel, bioinformatics software needs to be used to preprocess and analyze data, for instance to highlight genetic variation associated with complex diseases. In this paper we present a parallel algorithm for the preprocessing and statistical analysis of genomics data, able to handle high-dimensional data with good response times. The proposed system is able to find statistically significant biological markers that discriminate classes of patients who respond to drugs in different ways. Experiments performed on real and synthetic genomic datasets show good speed-up and scalability.

  16. Multi-petascale highly efficient parallel supercomputer

    DOEpatents

    Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.; Blumrich, Matthias A.; Boyle, Peter; Brunheroto, Jose R.; Chen, Dong; Cher, Chen -Yong; Chiu, George L.; Christ, Norman; Coteus, Paul W.; Davis, Kristan D.; Dozsa, Gabor J.; Eichenberger, Alexandre E.; Eisley, Noel A.; Ellavsky, Matthew R.; Evans, Kahn C.; Fleischer, Bruce M.; Fox, Thomas W.; Gara, Alan; Giampapa, Mark E.; Gooding, Thomas M.; Gschwind, Michael K.; Gunnels, John A.; Hall, Shawn A.; Haring, Rudolf A.; Heidelberger, Philip; Inglett, Todd A.; Knudson, Brant L.; Kopcsay, Gerard V.; Kumar, Sameer; Mamidala, Amith R.; Marcella, James A.; Megerian, Mark G.; Miller, Douglas R.; Miller, Samuel J.; Muff, Adam J.; Mundy, Michael B.; O'Brien, John K.; O'Brien, Kathryn M.; Ohmacht, Martin; Parker, Jeffrey J.; Poole, Ruth J.; Ratterman, Joseph D.; Salapura, Valentina; Satterfield, David L.; Senger, Robert M.; Smith, Brian; Steinmacher-Burow, Burkhard; Stockdell, William M.; Stunkel, Craig B.; Sugavanam, Krishnan; Sugawara, Yutaka; Takken, Todd E.; Trager, Barry M.; Van Oosten, James L.; Wait, Charles D.; Walkup, Robert E.; Watson, Alfred T.; Wisniewski, Robert W.; Wu, Peng

    2015-07-14

    A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enable a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, each having full access to all system resources, enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application-by-application basis, and preferably enabling adaptive partitioning of functions in accordance with various algorithmic phases within an application; if I/O or other processors are underutilized, they can participate in computation or communication. Nodes are interconnected by a five-dimensional torus network with DMA that optimally maximizes the throughput of packet communications between nodes and minimizes latency.

  17. Scheduling Tasks In Parallel Processing

    NASA Technical Reports Server (NTRS)

    Price, Camille C.; Salama, Moktar A.

    1989-01-01

    Algorithms sought to minimize time and cost of computation. Report describes research on scheduling of computational tasks in system of multiple identical data processors operating in parallel. Computational intractability requires use of suboptimal heuristic algorithms. First algorithm, called "list heuristic", is variation of classical list scheduling. Second algorithm, called "cluster heuristic", is applied to tightly coupled tasks and consists of four phases. Third algorithm, called "exchange heuristic", is iterative-improvement algorithm beginning with initial feasible assignment of tasks to processors and periods of time. Fourth algorithm is iterative one for optimal assignment of tasks and based on concept called "simulated annealing" because of mathematical resemblance to aspects of physical annealing processes.
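
    A minimal version of the first idea, assuming independent tasks with known run times: sort tasks by decreasing time and always hand the next task to the processor that frees up earliest (classical longest-processing-time list scheduling, not necessarily the report's exact algorithm):

        # List-scheduling sketch for independent tasks on identical processors.
        import heapq

        def list_schedule(times, p):
            heap = [(0.0, proc) for proc in range(p)]     # (finish time, processor)
            assignment = {}
            for task, t in sorted(enumerate(times), key=lambda kv: -kv[1]):
                finish, proc = heapq.heappop(heap)        # earliest-free processor
                assignment[task] = proc
                heapq.heappush(heap, (finish + t, proc))
            makespan = max(f for f, _ in heap)
            return assignment, makespan

        assignment, makespan = list_schedule([5, 3, 8, 2, 7, 4], p=2)
        print(makespan)                                   # 15 for this instance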

  18. Parallel Processing at the High School Level.

    ERIC Educational Resources Information Center

    Sheary, Kathryn Anne

    This study investigated the ability of high school students to cognitively understand and implement parallel processing. Data indicates that most parallel processing is being taught at the university level. Instructional modules on C, Linux, and the parallel processing language, P4, were designed to show that high school students are highly…

  19. Coordination in serial-parallel image processing

    NASA Astrophysics Data System (ADS)

    Wójcik, Waldemar; Dubovoi, Vladymyr M.; Duda, Marina E.; Romaniuk, Ryszard S.; Yesmakhanova, Laura; Kozbakova, Ainur

    2015-12-01

    Serial-parallel systems are used to convert images, and controlling their operation leads to a coordination problem. The paper summarizes a model for coordinating resource allocation in relation to the task of synchronizing parallel processes; a genetic algorithm of coordination is developed, and its adequacy is verified for the process of parallel image processing.

  20. Parallel computation of Gaussian processes

    NASA Astrophysics Data System (ADS)

    Preuss, R.; von Toussaint, U.

    2017-06-01

    Within the Bayesian framework we utilize Gaussian processes for parametric studies of long running computer codes. Since the simulations are expensive it is necessary to exploit the computational budget in the best possible manner. Employing the sum over variances - being indicators for the quality of the fit - as the utility function we established an optimized and automated sequential parameter selection procedure. However, often it is also desirable to utilize the parallel running capabilities of present computer technology and abandon the sequential parameter selection for a faster overall turn-around time (wall-clock time). The paper proposes to achieve this by marginalizing over the expected outcomes at optimized test points in order to set up a pool of starting values for batch execution.
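
    The batch construction reads like a "kriging believer" scheme: repeatedly pick the most uncertain candidate, pretend its outcome equals the current GP mean, refit, and repeat. A sketch with scikit-learn (the toy objective, candidate grid, and batch size are placeholders, and equating this with the authors' exact procedure is an assumption):

        # Select a batch of test points for parallel execution by fantasizing
        # GP-mean outcomes at previously selected points.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        X = np.linspace(0, 1, 5)[:, None]
        y = np.sin(6 * X).ravel()                # stand-in for expensive code runs
        cand = np.linspace(0, 1, 200)[:, None]

        batch, Xf, yf = [], X.copy(), y.copy()
        for _ in range(4):                       # batch of 4 parallel starts
            gp = GaussianProcessRegressor(normalize_y=True).fit(Xf, yf)
            mu, sd = gp.predict(cand, return_std=True)
            i = int(np.argmax(sd))               # most informative candidate
            batch.append(float(cand[i, 0]))
            Xf = np.vstack([Xf, cand[i:i + 1]])  # believe the mean outcome
            yf = np.append(yf, mu[i])

        print(batch)                             # points to run concurrently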

  1. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Intel Paragon, IBM SP2, and Cray Origin2000, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan was to 1) develop highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, and 3) incorporate newly developed algorithms into actual simulation packages. The work plan has been well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) adopting a mathematical geometry which has a better capacity to describe the fluid, and (2) using a compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

  2. Efficient parallel CFD-DEM simulations using OpenMP

    NASA Astrophysics Data System (ADS)

    Amritkar, Amit; Deb, Surya; Tafti, Danesh

    2014-01-01

    The paper describes parallelization strategies for the Discrete Element Method (DEM) used for simulating dense particulate systems coupled to Computational Fluid Dynamics (CFD). While the field equations of CFD are best parallelized by spatial domain decomposition techniques, the N-body particulate phase is best parallelized over the number of particles. When the two are coupled together, both modes are needed for efficient parallelization. It is shown that under these requirements, OpenMP thread-based parallelization has advantages over MPI processes. Two representative examples, fairly typical of dense fluid-particulate systems, are investigated, including the validation of the DEM-CFD and thermal-DEM implementation with experiments. Fluidized bed calculations are performed on beds with uniform particle loading, parallelized with MPI and OpenMP. It is shown that as the number of processing cores and the number of particles increase, the communication overhead of building ghost particle lists at processor boundaries dominates time to solution, and OpenMP, which does not require this step, is about twice as fast as MPI. In rotary kiln heat transfer calculations, which are characterized by spatially non-uniform particle distributions, the low overhead of switching the parallelization mode in OpenMP eliminates the load imbalances, but introduces increased overheads in fetching non-local data. In spite of this, it is shown that OpenMP is between 50% and 90% faster than MPI.

  3. Efficient Parallel Engineering Computing on Linux Workstations

    NASA Technical Reports Server (NTRS)

    Lou, John Z.

    2010-01-01

    A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallel computing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).

  4. Parallel Activation in Bilingual Phonological Processing

    ERIC Educational Resources Information Center

    Lee, Su-Yeon

    2011-01-01

    In bilingual language processing, the parallel activation hypothesis suggests that bilinguals activate their two languages simultaneously during language processing. Support for the parallel activation mainly comes from studies of lexical (word-form) processing, with relatively less attention to phonological (sound) processing. According to…

  5. Partitioning in parallel processing of production systems

    SciTech Connect

    Oflazer, K.

    1987-01-01

    This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpreter with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.

  6. Parallel processing considerations for image recognition tasks

    NASA Astrophysics Data System (ADS)

    Simske, Steven J.

    2011-01-01

    Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows, as diverse as optical character recognition [OCR], document classification and barcode reading, to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be sub-divided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.
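
    Of the three categories, processing by image region maps most directly onto a data-parallel loop. The sketch below assumes a simple thresholding task on row strips; the task and image layout are placeholders, not drawn from the paper.

        #include <omp.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Parallelism "by image region": each thread applies the same
           operation (here a placeholder threshold) to its own strip
           of rows. */
        void threshold_by_region(unsigned char *img, int w, int h,
                                 unsigned char t)
        {
            #pragma omp parallel for schedule(static)
            for (int y = 0; y < h; y++)
                for (int x = 0; x < w; x++)
                    img[(size_t)y * w + x] =
                        img[(size_t)y * w + x] > t ? 255 : 0;
        }

        int main(void)
        {
            enum { W = 640, H = 480 };
            unsigned char *img = calloc((size_t)W * H, 1);
            if (!img) return 1;
            threshold_by_region(img, W, H, 128);
            printf("%d\n", img[0]);
            free(img);
            return 0;
        }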

  7. An efficient parallel termination detection algorithm

    SciTech Connect

    Baker, A. H.; Crivelli, S.; Jessup, E. R.

    2004-05-27

    Information local to any one processor is insufficient to monitor the overall progress of most distributed computations. Typically, a second distributed computation for detecting termination of the main computation is necessary. In order to be a useful computational tool, the termination detection routine must operate concurrently with the main computation, adding minimal overhead, and it must promptly and correctly detect termination when it occurs. In this paper, we present a new algorithm for detecting the termination of a parallel computation on distributed-memory MIMD computers that satisfies all of those criteria. A variety of termination detection algorithms have been devised. Of these, the algorithm presented by Sinha, Kale, and Ramkumar (henceforth, the SKR algorithm) is unique in its ability to adapt to the load conditions of the system on which it runs, thereby minimizing the impact of termination detection on performance. Because their algorithm also detects termination quickly, we consider it to be the most efficient practical algorithm presently available. The termination detection algorithm presented here was developed for use in the PMESC programming library for distributed-memory MIMD computers. Like the SKR algorithm, our algorithm adapts to system loads and imposes little overhead. Also like the SKR algorithm, ours is tree-based, and it does not depend on any assumptions about the physical interconnection topology of the processors or the specifics of the distributed computation. In addition, our algorithm is easier to implement and requires only half as many tree traversals as does the SKR algorithm. This paper is organized as follows. In section 2, we define our computational model. In section 3, we review the SKR algorithm. We introduce our new algorithm in section 4, and prove its correctness in section 5. We discuss its efficiency and present experimental results in section 6.
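
    The SKR and PMESC algorithms are adaptive and tree-based; the sketch below shows only the simpler counting idea that such algorithms refine, in which termination is declared once all processes are idle and global send/receive counts are balanced across two successive probes (guarding against messages still in flight). It is a textbook-style simplification, not the algorithm of this paper.

        #include <mpi.h>

        /* Returns 1 when the distributed computation has terminated.
           Each process contributes (non-idle?, sent, received); the
           sums are compared across two consecutive synchronous probes. */
        int check_termination(int idle, long sent, long recvd, MPI_Comm comm)
        {
            static long prev[3] = { -1, -1, -1 };
            long local[3] = { idle ? 0 : 1, sent, recvd }, glob[3];

            MPI_Allreduce(local, glob, 3, MPI_LONG, MPI_SUM, comm);

            int done = glob[0] == 0 && glob[1] == glob[2] &&
                       prev[0] == 0 && glob[1] == prev[1] &&
                       glob[2] == prev[2];
            prev[0] = glob[0]; prev[1] = glob[1]; prev[2] = glob[2];
            return done;
        }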

  8. Efficient parallel implicit methods for rotary-wing aerodynamics calculations

    NASA Astrophysics Data System (ADS)

    Wissink, Andrew M.

    Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited by the lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good

  9. Use of parallel computing in mass processing of laser data

    NASA Astrophysics Data System (ADS)

    Będkowski, J.; Bratuś, R.; Prochaska, M.; Rzonca, A.

    2015-12-01

    The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.

  10. A note on parallel efficiency of fire simulation on cluster

    NASA Astrophysics Data System (ADS)

    Valasek, L.; Glasa, J.

    2016-08-01

    Current HPC clusters are capable of reducing the execution time of parallelized tasks significantly. The paper discusses the use of two selected strategies for the allocation of cluster computational resources and their impact on the parallel efficiency of fire simulation. Simulation of a simple corridor fire scenario by Fire Dynamics Simulator, parallelized using the MPI programming model, is tested on the HPC cluster at the Institute of Informatics of the Slovak Academy of Sciences in Bratislava (Slovakia). The tests confirm that parallelization has great potential to reduce execution times, achieving promising values of parallel efficiency of the simulation; however, the results also show that increasing the number of computational meshes, and thus the number of computational cores used, does not necessarily decrease the execution time or the parallel efficiency of the simulation. The results obtained indicate that the simulation achieves different values of execution time and parallel efficiency depending on the strategy used for the allocation of cluster computational resources.

  11. Computer architecture and parallel processing

    SciTech Connect

    Hwang, K.; Briggs, F. A.

    1984-01-01

    The book is intended as a text to support two semesters of courses in computer architecture at the college senior and graduate levels. There are excellent problems for students at the end of each chapter. The authors have divided the use of computers into the following four levels of sophistication: data processing, information processing, knowledge processing, and intelligence processing.

  12. Computationally efficient implementation of combustion chemistry in parallel PDF calculations

    SciTech Connect

    Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.

    2009-08-20

    In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive

  13. Computationally efficient implementation of combustion chemistry in parallel PDF calculations

    NASA Astrophysics Data System (ADS)

    Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.

    2009-08-01

    In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel
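
    The dispatch rule distinguishing the fixed strategies can be made concrete. The sketch below shows only how PLP and URAN would choose the rank that evaluates a given chemistry query; it illustrates the strategy definitions in the abstract, not the interface of x2f_mpi. PREF, and the adaptive blend the papers settle on, would replace this rule with one that accounts for the overlap between ISAT tables on different processors.

        #include <stdlib.h>

        enum strategy { PLP, URAN };  /* two of the fixed strategies */

        /* Choose which rank evaluates a chemistry query.
           PLP:  purely local processing (no redistribution).
           URAN: uniformly random distribution across ranks. */
        int owner_rank(enum strategy s, int myrank, int nranks)
        {
            switch (s) {
            case PLP:  return myrank;
            case URAN: return rand() % nranks;
            }
            return myrank;
        }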

  14. Parallel processing for scientific computations

    NASA Technical Reports Server (NTRS)

    Alkhatib, Hasan S.

    1995-01-01

    The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.

  15. Efficient parallel simulation of CO2 geologic sequestration in saline aquifers

    SciTech Connect

    Zhang, Keni; Doughty, Christine; Wu, Yu-Shu; Pruess, Karsten

    2007-01-01

    An efficient parallel simulator for large-scale, long-term CO2 geologic sequestration in saline aquifers has been developed. The parallel simulator is a three-dimensional, fully implicit model that solves large, sparse linear systems arising from discretization of the partial differential equations for mass and energy balance in porous and fractured media. The simulator is based on the ECO2N module of the TOUGH2 code and inherits all the process capabilities of the single-CPU TOUGH2 code, including a comprehensive description of the thermodynamics and thermophysical properties of H2O-NaCl-CO2 mixtures, modeling single and/or two-phase isothermal or non-isothermal flow processes, two-phase mixtures, fluid phases appearing or disappearing, as well as salt precipitation or dissolution. The new parallel simulator uses MPI for parallel implementation, the METIS software package for simulation domain partitioning, and the iterative parallel linear solver package Aztec for solving linear equations by multiple processors. In addition, the parallel simulator has been implemented with an efficient communication scheme. Test examples show that a linear or super-linear speedup can be obtained on Linux clusters as well as on supercomputers. Because of the significant improvement in both simulation time and memory requirement, the new simulator provides a powerful tool for tackling larger scale and more complex problems than can be solved by single-CPU codes. A high-resolution simulation example is presented that models buoyant convection, induced by a small increase in brine density caused by dissolution of CO2.

  16. Parallel processing of a rotating shaft simulation

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.

    1989-01-01

    A FORTRAN program describing the vibration modes of a rotor-bearing system is analyzed for parallelism using a Pascal-like structured language. Potential vector operations are also identified. A critical path through the simulation is identified and used in conjunction with somewhat fictitious processor characteristics to determine the time to calculate the problem on a parallel processing system having those characteristics. A parallel processing overhead time is included as a parameter for proper evaluation of the gain over serial calculation. The serial calculation time is determined for the same fictitious system. An improvement of up to 640 percent is possible depending on the value of the overhead time. Based on the analysis, certain conclusions are drawn pertaining to the development needs of parallel processing technology, and to the specification of parallel processing systems to meet computational needs.

  17. New Bandwidth Efficient Parallel Concatenated Coding Schemes

    NASA Technical Reports Server (NTRS)

    Benedetto, S.; Divsalar, D.; Montorsi, G.; Pollara, F.

    1996-01-01

    We propose a new solution to parallel concatenation of trellis codes with multilevel amplitude/phase modulations and a suitable iterative decoding structure. Examples are given for throughputs of 2 bits/sec/Hz with 8PSK and 16QAM signal constellations.

  18. Highly Parallel Modern Signal Processing.

    DTIC Science & Technology

    1982-02-28

    looked at the application of these techniques to systems with coherent speckle noise, such as synthetic aperture radar (SAR) imagery and coherent sonar... partitioned matrix inversion, computation of cross-ambiguity functions, formation of outer products and skewed outer products, and multiplication of... operations are multiplication, inversion, and L-U decomposition. In signal processing such operations can be found in adaptive filtering, data...

  19. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  20. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-08-12

    Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  1. Casting pearls ballistically: Efficient massively parallel simulation of particle deposition

    SciTech Connect

    Lubachevsky, B.D.; Privman, V.; Roy, S.C.

    1996-06-01

    We simulate ballistic particle deposition wherein a large number of spherical particles are "cast" vertically over a planar horizontal surface. Upon first contact (with the surface or with a previously deposited particle) each particle stops. This model helps materials scientists to study adsorption and sediment formation. The model is sequential, with particles deposited one by one. We have found an equivalent formulation using a continuous time random process and we simulate the latter in parallel using a method similar to the one previously employed for simulating Ising spins. We augment the parallel algorithm for simulating Ising spins with several techniques aimed at increasing the efficiency of producing the particle configuration and statistics collection. Some of these techniques are similar to earlier ones. We implement the resulting algorithm on a 16K PE MasPar MP-1 and a 4K PE MasPar MP-2. The parallel code runs on MasPar computers nearly two orders of magnitude faster than an optimized sequential code runs on a fast workstation. 17 refs., 9 figs.

  2. A fast and efficient adaptive parallel ray tracing based model for thermally coupled surface radiation in casting and heat treatment processes

    NASA Astrophysics Data System (ADS)

    Fainberg, J.; Schaefer, W.

    2015-06-01

    A new algorithm for heat exchange between thermally coupled diffusely radiating interfaces is presented, which can be applied for closed and half open transparent radiating cavities. Interfaces between opaque and transparent materials are automatically detected and subdivided into elementary radiation surfaces named tiles. Contrary to the classical view factor method, the fixed unit sphere area subdivision oriented along the normal tile direction is projected onto the surrounding radiation mesh and not vice versa. Then, the total incident radiating flux of the receiver is approximated as a direct sum of radiation intensities of representative “senders” with the same weight factor. A hierarchical scheme for the space angle subdivision is selected in order to minimize the total memory and the computational demands during thermal calculations. Direct visibility is tested by means of a voxel-based ray tracing method accelerated by means of the anisotropic Chebyshev distance method, which reuses the computational grid as a Chebyshev one. The ray tracing algorithm is fully parallelized using MPI and takes advantage of the balanced distribution of all available tiles among all CPUs. This approach allows tracing of each particular ray without any communication. The algorithm has been implemented in a commercial casting process simulation software. The accuracy and computational performance of the new radiation model for heat treatment, investment and ingot casting applications is illustrated using industrial examples.

  3. The effects of parallel processing architectures on discrete event simulation

    NASA Astrophysics Data System (ADS)

    Cave, William; Slatt, Edward; Wassmer, Robert E.

    2005-05-01

    … represented. Models need not be changed to move from a single processor to parallel processor hardware architectures. * Knowledge of the architectural parallelism is stored within the system and used during run time to allocate processors to processes in a maximally efficient way.

  4. Improving operating room productivity via parallel anesthesia processing.

    PubMed

    Brown, Michael J; Subramanian, Arun; Curry, Timothy B; Kor, Daryl J; Moran, Steven L; Rohleder, Thomas R

    2014-01-01

    Parallel processing of regional anesthesia may improve operating room (OR) efficiency in patients undergoing upper extremity surgical procedures. The purpose of this paper is to evaluate whether performing regional anesthesia outside the OR in parallel increases total cases per day and improves efficiency and productivity. Data from all adult patients who underwent regional anesthesia as their primary anesthetic for upper extremity surgery over a one-year period were used to develop a simulation model. The model evaluated pure operating modes of regional anesthesia performed within and outside the OR in a parallel manner. The scenarios were used to evaluate how many surgeries could be completed in a standard work day (555 minutes) and, assuming a standard three cases per day, what the predicted end-of-day overtime would be. Modeling results show that parallel processing of regional anesthesia increases the average cases per day for all surgeons included in the study. The average increase was 0.42 surgeries per day. Where it was assumed that three cases per day would be performed by all surgeons, the number of days going to overtime was reduced by 43 percent with the parallel block. The overtime with parallel anesthesia was also projected to be 40 minutes less per day per surgeon. Key limitations include the assumption that all cases used regional anesthesia in the comparisons. Many days may have both regional and general anesthesia. Also, as a case study, single-center research may limit generalizability. Perioperative care providers should consider parallel administration of regional anesthesia where there is a desire to increase daily upper extremity surgical case capacity. Where there are sufficient resources to do parallel anesthesia processing, efficiency and productivity can be significantly improved. Simulation modeling can be an effective tool to show practice change effects at a system-wide level.

  5. Efficiency Evaluation of Cray XT Parallel IO Stack

    SciTech Connect

    Yu, Weikuan; Oral, H Sarp; Vetter, Jeffrey S; Barrett, Richard F

    2007-01-01

    PetaScale computing platforms need to be coupled with efficient IO subsystems that can deliver commensurate IO throughput to scientific applications. In order to gain insights into the deliverable IO efficiency on the Cray XT platform at ORNL, this paper presents an in-depth efficiency evaluation of its parallel IO software stack. Our evaluation covers the performance of a variety of parallel IO interfaces, including POSIX IO, MPI-IO, and HDF5. Moreover, we describe several tuning parameters for these interfaces and present their effectiveness in enhancing the IO efficiency.
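
    Of the interfaces evaluated, MPI-IO is the middle layer of the stack. Below is a minimal collective-write sketch in which each rank contributes one contiguous block to a shared file; the file name and data layout are arbitrary choices for illustration, not the paper's benchmark configuration.

        #include <mpi.h>

        /* Each rank writes n doubles at a rank-dependent offset using
           a collective MPI-IO call. */
        int write_block(const double *buf, int n, MPI_Comm comm)
        {
            int rank;
            MPI_File fh;
            MPI_Comm_rank(comm, &rank);
            MPI_File_open(comm, "out.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY,
                          MPI_INFO_NULL, &fh);
            MPI_File_write_at_all(fh,
                                  (MPI_Offset)rank * n * sizeof(double),
                                  buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
            return MPI_File_close(&fh);
        }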

  6. Parallel processing: The Cm* experience

    SciTech Connect

    Siewiorek, D.; Gehringer, E.; Segall, Z.

    1986-01-01

    This book describes the parallel-processing research with Cm* at Carnegie-Mellon University. Cm* is a tightly coupled 50-processor multiprocessing system that has been in operation since 1977. Two complete operating systems, StarOS and Medusa, are part of its development, along with a number of applications.

  7. Parallel and Serial Processes in Visual Search

    ERIC Educational Resources Information Center

    Thornton, Thomas L.; Gilden, David L.

    2007-01-01

    A long-standing issue in the study of how people acquire visual information centers around the scheduling and deployment of attentional resources: Is the process serial, or is it parallel? A substantial empirical effort has been dedicated to resolving this issue. However, the results remain largely inconclusive because the methodologies that have…

  8. Hypercluster parallel processing library user's manual

    NASA Technical Reports Server (NTRS)

    Quealy, Angela

    1990-01-01

    This User's Manual describes the Hypercluster Parallel Processing Library, composed of FORTRAN-callable subroutines which enable a FORTRAN programmer to manipulate and transfer information throughout the Hypercluster at NASA Lewis Research Center. Each subroutine and its parameters are described in detail. A simple heat flow application using Laplace's equation is included to demonstrate the use of some of the library's subroutines. The manual can be used initially as an introduction to the parallel features provided by the library. Thereafter it can be used as a reference when programming an application.

  9. Parallel processing for computer vision and display

    SciTech Connect

    Dew, P.M. . Dept. of Computer Studies); Earnshaw, R.A. ); Heywood, T.R. )

    1989-01-01

    The widespread availability of high performance computers has led to an increased awareness of the importance of visualization techniques, particularly in engineering and science. However, many visualization tasks involve processing large amounts of data or manipulating complex computer models of 3D objects. For example, in the field of computer aided engineering it is often necessary to display and edit a solid object (see Plate 1), which can take many minutes even on the fastest serial processors. Another example of a computationally intensive problem, this time from computer vision, is the recognition of objects in a 3D scene from a stereo image pair. To perform visualization tasks of this type in real and reasonable time it is necessary to exploit the advances in parallel processing that have taken place over the last decade. This book uniquely provides a collection of papers from leading visualization researchers with a common interest in the application and exploitation of parallel processing techniques.

  10. Parallel Processing of Affective Visual Stimuli

    PubMed Central

    Peyk, Peter; Schupp, Harald T.; Keil, Andreas; Elbert, Thomas; Junghöfer, Markus

    2009-01-01

    Event-related potential (ERP) studies of affective picture processing have demonstrated an early posterior negativity (EPN) for emotionally arousing pictures that are embedded in a rapid visual stream. The present study examined the selective processing of emotional pictures while systematically varying picture presentation rates between 1 and 16 Hz. Previous results with presentation rates up to 5 Hz were replicated in that emotional compared to neutral pictures were associated with a greater EPN. Discrimination among emotional and neutral contents was maintained up to 12 Hz. To explore the notion of parallel processing, convolution analysis was used: EPNs generated by linear superposition of slow rate ERPs explained 70-93% of the variance of measured EPNs, giving evidence for an impressive capacity of parallel affective discrimination in rapid serial picture presentation. PMID:19055507

  11. Efficient sequential and parallel algorithms for record linkage.

    PubMed

    Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

    2014-01-01

    Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and synthetic datasets ranging in size from 50,000 to 9,000,000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
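
    The connected-components step lends itself to a small illustration. In the union-find sketch below, every pair of records judged similar is linked, and records sharing a root afterwards are candidate duplicates; this illustrates the graph idea named in the abstract, not the authors' full clustering pipeline.

        /* Union-find over record indices with path halving. */
        enum { MAX_RECORDS = 1000000 };
        static int parent[MAX_RECORDS];

        void uf_init(int n)
        {
            for (int i = 0; i < n; i++)
                parent[i] = i;
        }

        int uf_find(int i)
        {
            while (parent[i] != i) {
                parent[i] = parent[parent[i]];  /* path halving */
                i = parent[i];
            }
            return i;
        }

        /* Call for every pair of records the similarity test links. */
        void uf_union(int a, int b)
        {
            parent[uf_find(a)] = uf_find(b);
        }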

  12. An efficient parallel algorithm for accelerating computational protein design

    PubMed Central

    Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang

    2014-01-01

    Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991

  13. Efficient clustering of large EST data sets on parallel computers

    PubMed Central

    Kalyanaraman, Anantharaman; Aluru, Srinivas; Kothari, Suresh; Brendel, Volker

    2003-01-01

    Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory efficient algorithms to reduce the memory required to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website. PMID:12771222

  14. Efficient sequential and parallel algorithms for record linkage

    PubMed Central

    Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

    2014-01-01

    Background and objective Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837

  15. An Expert System for the Development of Efficient Parallel Code

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.

  16. Parallel Processing for Computational Continuum Dynamics,

    DTIC Science & Technology

    1985-01-01

    ...Multiple Instruction stream, Multiple Data stream (MIMD). An example of a machine of this type is the HEP H1000 computer manufactured by Denelcor... parallel architecture in general and for the HEP H1000 computer in particular. The approach is a step-by-step procedure based on a progression from the... The HEP (Heterogeneous Element Processor) by Denelcor has MIMD architecture. The HEP computer is designed to combine from one up to 16 Process Execution Modules (PEMs)

  17. Competitive Parallel Processing For Compression Of Data

    NASA Technical Reports Server (NTRS)

    Diner, Daniel B.; Fender, Antony R. H.

    1990-01-01

    Momentarily-best compression algorithm selected. Proposed competitive-parallel-processing system compresses data for transmission in a channel of limited bandwidth. A likely application for compression lies in high-resolution, stereoscopic color-television broadcasting. Data from an information-rich source like a color-television camera are compressed by several processors, each operating with a different algorithm. A referee processor selects the momentarily-best compressed output.
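
    A sketch of the referee's selection rule follows. The compressor function type and the serial loop are stand-ins: in the proposed system each algorithm would run concurrently on its own processor, with the referee comparing the finished outputs.

        #include <stddef.h>

        /* A candidate compressor writes its output to out and returns
           the compressed length in bytes. */
        typedef size_t (*compressor)(const unsigned char *in, size_t n,
                                     unsigned char *out);

        /* Run every candidate on the same frame and return the index
           of the momentarily-best (smallest) output. */
        int referee_pick(const unsigned char *in, size_t n,
                         compressor algs[], int nalgs,
                         unsigned char *out[], size_t out_len[])
        {
            int best = 0;
            for (int i = 0; i < nalgs; i++) {
                out_len[i] = algs[i](in, n, out[i]);
                if (out_len[i] < out_len[best])
                    best = i;
            }
            return best;
        }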

  18. A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.; Markos, A. T.

    1975-01-01

    A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions ensuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.

  19. High-speed parallel-processing networks for advanced architectures

    SciTech Connect

    Morgan, D.R.

    1988-06-01

    This paper describes various parallel-processing architecture networks that are candidates for eventual airborne use. An attempt at projecting which type of network is suitable or optimum for specific metafunction or stand-alone applications is made. However, specific algorithms will need to be developed and benchmarks executed before firm conclusions can be drawn. Also, a conceptual projection of how these processors can be built in small, flyable units through the use of wafer-scale integration is offered. The use of the PAVE PILLAR system architecture to provide system level support for these tightly coupled networks is described. The author concludes that: (1) extremely high processing speeds implemented in flyable hardware are possible through parallel-processing networks if development programs are pursued; (2) dramatic speed enhancements through parallel processing require an excellent match between the algorithm and computer-network architecture; (3) matching several high speed parallel oriented algorithms across the aircraft system to a limited set of hardware modules may be the most cost-effective approach to achieving speed enhancements; and (4) software-development tools and improved operating systems will need to be developed to support efficient parallel-processor use.

  20. Massively parallel processing of remotely sensed hyperspectral images

    NASA Astrophysics Data System (ADS)

    Plaza, Javier; Plaza, Antonio; Valencia, David; Paz, Abel

    2009-08-01

    In this paper, we develop several parallel techniques for hyperspectral image processing that have been specifically designed to be run on massively parallel systems. The techniques developed cover the three relevant areas of hyperspectral image processing: 1) spectral mixture analysis, a popular approach to characterize mixed pixels in hyperspectral data addressed in this work via efficient implementation of a morphological algorithm for automatic identification of pure spectral signatures or endmembers from the input data; 2) supervised classification of hyperspectral data using multi-layer perceptron neural networks with back-propagation learning; and 3) automatic target detection in the hyperspectral data using orthogonal subspace projection concepts. The scalability of the proposed parallel techniques is investigated using Barcelona Supercomputing Center's MareNostrum facility, one of the most powerful supercomputers in Europe.

  1. A multiarchitecture parallel-processing development environment

    NASA Technical Reports Server (NTRS)

    Townsend, Scott; Blech, Richard; Cole, Gary

    1993-01-01

    A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems.

  2. Oxytocin: parallel processing in the social brain?

    PubMed

    Dölen, Gül

    2015-06-01

    Early studies attempting to disentangle the network complexity of the brain exploited the accessibility of sensory receptive fields to reveal circuits made up of synapses connected both in series and in parallel. More recently, extension of this organisational principle beyond the sensory systems has been made possible by the advent of modern molecular, viral and optogenetic approaches. Here, evidence supporting parallel processing of social behaviours mediated by oxytocin is reviewed. Understanding oxytocinergic signalling from this perspective has significant implications for the design of oxytocin-based therapeutic interventions aimed at disorders such as autism, where disrupted social function is a core clinical feature. Moreover, identification of opportunities for novel technology development will require a better appreciation of the complexity of the circuit-level organisation of the social brain. © 2015 The Authors. Journal of Neuroendocrinology published by John Wiley & Sons Ltd on behalf of British Society for Neuroendocrinology.

  3. Memory Scalability and Efficiency Analysis of Parallel Codes

    SciTech Connect

    Janjusic, Tommy; Kartsaklis, Christos

    2015-01-01

    Memory scalability is an enduring problem and bottleneck that plagues many parallel codes. Parallel codes designed for High Performance Systems are typically designed over the span of several, and in some instances 10+, years. As a result, optimization practices which were appropriate for earlier systems may no longer be valid and thus require careful optimization consideration. Specifically, parallel codes whose memory footprint is a function of their scalability must be carefully considered for future exa-scale systems. In this paper we present a methodology and tool to study the memory scalability of parallel codes. Using our methodology we evaluate an application's memory footprint as a function of scalability, which we coined memory efficiency, and describe our results. In particular, using our in-house tools we can pinpoint the specific application components which contribute to the application's overall memory footprint (application data structures, libraries, etc.).

  4. An efficient data dependence analysis for parallelizing compilers

    NASA Technical Reports Server (NTRS)

    Li, Zhiyuan; Yew, Pen-Chung; Zhu, Chuan-Qi

    1990-01-01

    A novel algorithm, called the lambda test, is presented for an efficient and accurate data dependence analysis of multidimensional array references. It extends the numerical methods to allow all dimensions of array references to be tested simultaneously. Hence, it combines the efficiency and the accuracy of both approaches. This algorithm has been implemented in PARAFRASE, a FORTRAN program parallelization restructurer developed at the University of Illinois at Urbana-Champaign. Some experimental results are presented to show its effectiveness.

  5. Efficient Thread Labeling for Monitoring Programs with Nested Parallelism

    NASA Astrophysics Data System (ADS)

    Ha, Ok-Kyoon; Kim, Sun-Sook; Jun, Yong-Kee

    It is difficult and cumbersome to detect data races occurring in an execution of parallel programs. Any on-the-fly race detection technique using Lamport's happened-before relation needs a thread labeling scheme for generating unique identifiers which maintain logical concurrency information for the parallel threads. NR labeling is an efficient thread labeling scheme for the fork-join program model with nested parallelism, because its efficiency depends only on the nesting depth for every fork and join operation. This paper presents an improved NR labeling, called e-NR labeling, in which every thread generates its label by inheriting the pointer to its ancestor list from the parent threads or by updating the pointer in a constant amount of time and space. This labeling is more efficient than NR labeling, because its efficiency does not depend on the nesting depth for every fork and join operation. Some experiments were performed with OpenMP programs having nesting depths of three or four and maximum parallelism varying from 10,000 to 1,000,000. The results show that e-NR is 5 times faster than NR labeling and 4.3 times faster than OS labeling in the average time for creating and maintaining the thread labels. In average space required for labeling, it is 3.5 times smaller than NR labeling and 3 times smaller than OS labeling.

  6. Parallel Processing for Computational Continuum Dynamics.

    DTIC Science & Technology

    1985-05-10

    Parallel Processing for Computational Continuum Dynamics: A Final Report. Joseph F. McGrath, KMS Fusion, Inc., P.O. Box 1567, Ann Arbor, MI 48106. Contract F49620-84-C-0111. Final report, 10 May 1985, 42 pages.

  7. Partitioning sparse rectangular matrices for parallel processing

    SciTech Connect

    Kolda, T.G.

    1998-05-01

    The authors are interested in partitioning sparse rectangular matrices for parallel processing. The partitioning problem has been well-studied in the square symmetric case, but the rectangular problem has received very little attention. They will formalize the rectangular matrix partitioning problem and discuss several methods for solving it. They will extend the spectral partitioning method for symmetric matrices to the rectangular case and compare this method to three new methods -- the alternating partitioning method and two hybrid methods. The hybrid methods will be shown to be best.

  8. Parallel processing for digital picture comparison

    NASA Technical Reports Server (NTRS)

    Cheng, H. D.; Kou, L. T.

    1987-01-01

    In picture processing an important problem is to identify two digital pictures of the same scene taken under different lighting conditions. This kind of problem can be found in remote sensing, satellite signal processing and related areas. The identification can be done by transforming the gray levels so that the gray level histograms of the two pictures are closely matched. The transformation problem can be solved by using the packing method. Researchers propose a VLSI architecture consisting of m x n processing elements with extensive parallel and pipelining computation capabilities to speed up the transformation with time complexity O(max(m,n)), where m and n are the numbers of gray levels of the input picture and the reference picture, respectively. If a uniprocessor and a dynamic programming algorithm are used, the time complexity will be O(m^3 x n). The algorithm partition problem, as an important issue in VLSI design, is discussed. Verification of the proposed architecture is also given.
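
    The gray-level transformation being accelerated can be stated compactly in serial form. The sketch below performs histogram matching by cumulative counts, a standard serial formulation included for orientation; the paper's packing method and its O(max(m,n)) systolic layout are not reproduced here.

        /* Map each of the m input gray levels to the reference level
           (of n) whose cumulative count first reaches the input's
           cumulative count, so the two histograms align closely. */
        void match_histograms(const long hin[], int m,
                              const long href[], int n, int map[])
        {
            long cin = 0, cref = 0;
            int j = 0;
            for (int i = 0; i < m; i++) {
                cin += hin[i];
                while (j < n - 1 && cref + href[j] < cin)
                    cref += href[j++];
                map[i] = j;
            }
        }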

  9. Efficient Parallel Algorithm For Direct Numerical Simulation of Turbulent Flows

    NASA Technical Reports Server (NTRS)

    Moitra, Stuti; Gatski, Thomas B.

    1997-01-01

    A distributed algorithm for a high-order-accurate finite-difference approach to the direct numerical simulation (DNS) of transition and turbulence in compressible flows is described. This work has two major objectives. The first objective is to demonstrate that parallel and distributed-memory machines can be successfully and efficiently used to solve computationally intensive and input/output intensive algorithms of the DNS class. The second objective is to show that the computational complexity involved in solving the tridiagonal systems inherent in the DNS algorithm can be reduced by algorithm innovations that obviate the need to use a parallelized tridiagonal solver.

  10. Organizing Compression of Hyperspectral Imagery to Allow Efficient Parallel Decompression

    NASA Technical Reports Server (NTRS)

    Klimesh, Matthew A.; Kiely, Aaron B.

    2014-01-01

    A family of schemes has been devised for organizing the output of an algorithm for predictive data compression of hyperspectral imagery so as to allow efficient parallelization in both the compressor and decompressor. In these schemes, the compressor performs a number of iterations, during each of which a portion of the data is compressed via parallel threads operating on independent portions of the data. The general idea is that for each iteration it is predetermined how much compressed data will be produced from each thread.
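
    One way to realize "predetermined output per thread" is to record per-thread chunk lengths ahead of the payload, so the decompressor can seek to each thread's stream independently. The header layout below is an illustrative assumption, not the scheme itself.

        #include <stdint.h>
        #include <string.h>

        /* Write a chunk count followed by per-thread compressed chunk
           lengths; the decompressor derives each chunk's offset by a
           prefix sum over this table and decodes chunks in parallel. */
        size_t write_header(uint8_t *dst, const uint32_t chunk_len[],
                            uint32_t nchunks)
        {
            memcpy(dst, &nchunks, sizeof nchunks);
            memcpy(dst + sizeof nchunks, chunk_len,
                   (size_t)nchunks * sizeof *chunk_len);
            return sizeof nchunks + (size_t)nchunks * sizeof *chunk_len;
        }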

  11. Parallel tools in HEVC for high-throughput processing

    NASA Astrophysics Data System (ADS)

    Zhou, Minhua; Sze, Vivienne; Budagavi, Madhukar

    2012-10-01

    HEVC (High Efficiency Video Coding) is the next-generation video coding standard being jointly developed by the ITU-T VCEG and ISO/IEC MPEG JCT-VC team. In addition to the high coding efficiency, which is expected to provide 50% more bit-rate reduction when compared to H.264/AVC, HEVC has built-in parallel processing tools to address bitrate, pixel-rate and motion estimation (ME) throughput requirements. This paper describes how CABAC, which is also used in H.264/AVC, has been redesigned for improved throughput, and how parallel merge/skip and tiles, which are new tools introduced for HEVC, enable high-throughput processing. CABAC has data dependencies which make it difficult to parallelize and thus limit its throughput. The prediction error/residual, represented as quantized transform coefficients, accounts for the majority of the CABAC workload. Various improvements have been made to the context selection and scans in transform coefficient coding that enable CABAC in HEVC to potentially achieve higher throughput and increased coding gains relative to H.264/AVC. The merge/skip mode is a coding efficiency enhancement tool in HEVC; the parallel merge/skip breaks dependency between the regular and merge/skip ME, which provides flexibility for high throughput and high efficiency HEVC encoder designs. For ultra high definition (UHD) video, such as 4kx2k and 8kx4k resolutions, low-latency and real-time processing may be beyond the capability of a single core codec. Tiles are an effective tool which enables pixel-rate balancing among the cores to achieve parallel processing with a throughput scalable implementation of multi-core UHD video codec. With the evenly divided tiles, a multi-core video codec can be realized by simply replicating single core codec and adding a tile boundary processing core on top of that. These tools illustrate that accounting for implementation cost when designing video coding algorithms can enable higher processing speed and reduce

  12. Processes parallel execution using Grid Wizard Enterprise.

    PubMed

    Ruiz, Marco

    2009-01-01

    The field of high-performance computing (HPC) has provided a wide array of strategies for supplying additional computing power toward the goal of reducing the total "clock time" required to complete various computational processes. These strategies range from the development of higher-performance hardware to the assembly of large networks of commodity computers, with each strategy designed to address a particular aspect and/or manifestation of a given computational problem. GWE (Grid Wizard Enterprise), in that regard, is an HPC distributed enterprise system aimed at providing a solution to the particular problem of running mutually independent computational processes faster by parallelizing their execution across a virtual grid of computational resources with a minimum of user intervention.

  13. Enjoying Sad Music: Paradox or Parallel Processes?

    PubMed

    Schubert, Emery

    2016-01-01

    Enjoyment of negative emotions in music is seen by many as a paradox. This article argues that the paradox exists because it is difficult to view the process that generates enjoyment as being part of the same system that also generates the subjective negative feeling. Compensation theories explain the paradox as the compensation of a negative emotion by the concomitant presence of one or more positive emotions. But compensation brings us no closer to explaining the paradox because it does not explain how experiencing sadness itself is enjoyed. The solution proposed is that an emotion is determined by three critical processes, labeled motivational action tendency (MAT), subjective feeling (SF) and Appraisal. For many emotions the MAT and SF processes are coupled in valence. For example, happiness has positive MAT and positive SF, annoyance has negative MAT and negative SF. However, it is argued that in an aesthetic context, such as listening to music, emotion processes can become decoupled. The decoupling is controlled by the Appraisal process, which can assess if the context of the sadness is real-life (where coupling occurs) or aesthetic (where decoupling can occur). In an aesthetic context sadness retains its negative SF but the aversive, negative MAT is inhibited, leaving sadness to still be experienced as a negatively valenced emotion, while contributing to the overall positive MAT. Individual differences, mood and previous experiences mediate the degree to which the aversive aspects of MAT are inhibited according to this Parallel Processing Hypothesis (PPH). The reasons for hesitancy in considering or testing the PPH, as well as the preponderance of research on sadness to the exclusion of other negative emotions, are discussed.

  15. Parallel approach in RDF query processing

    NASA Astrophysics Data System (ADS)

    Vajgl, Marek; Parenica, Jan

    2017-07-01

    Parallelism is nowadays a very cheap way to increase computational power, thanks to the availability of multithreaded computational units. Such hardware has become a standard component of personal computers and notebooks and is widely available. This contribution reports experiments on how the evaluation of a computationally complex inference algorithm over RDF data can be parallelized on graphics cards to decrease computation time.

  16. Serial Order: A Parallel Distributed Processing Approach.

    ERIC Educational Resources Information Center

    Jordan, Michael I.

    Human behavior shows a variety of serially ordered action sequences. This paper presents a theory of serial order which describes how sequences of actions might be learned and performed. In this theory, parallel interactions across time (coarticulation) and parallel interactions across space (dual-task interference) are viewed as two aspects of a…

  17. Development and Applications of a Modular Parallel Process for Large Scale Fluid/Structures Problems

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Byun, Chansup; Kwak, Dochan (Technical Monitor)

    2001-01-01

    A modular process that can efficiently solve large scale multidisciplinary problems using massively parallel supercomputers is presented. The process integrates disciplines with diverse physical characteristics by retaining the efficiency of individual disciplines. Computational domain independence of individual disciplines is maintained using a meta-programming approach. The process integrates disciplines without affecting the combined performance. Results are demonstrated for large scale aerospace problems on several supercomputers. The scalability and portability of the approach are demonstrated on several parallel computers.

  19. Message passing kernel for the hypercluster parallel processing test bed

    SciTech Connect

    Blech, R.A.; Quealy, A.; Cole, G.L.

    1989-01-01

    A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.
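
    A toy rendition of the routing rule described above, with plain Python objects standing in for the real transports (all names here are invented for illustration): same-node traffic is handed over through shared memory, off-node traffic falls back to message passing.

        from queue import Queue

        def send(dst, payload, my_node, node_of, shared_inbox, network_queues):
            if node_of[dst] == my_node:
                shared_inbox[dst].append(payload)                 # fast path: shared memory
            else:
                network_queues[node_of[dst]].put((dst, payload))  # slow path: message passing

        node_of = {"p0": 0, "p1": 0, "p2": 1}                     # two processors on node 0, one on node 1
        inbox = {p: [] for p in node_of}
        nets = {0: Queue(), 1: Queue()}
        send("p1", "hello", 0, node_of, inbox, nets)              # stays on node 0
        send("p2", "world", 0, node_of, inbox, nets)              # crosses to node 1
        print(inbox["p1"], nets[1].get())                         # ['hello'] ('p2', 'world')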

  20. Efficiently modeling neural networks on massively parallel computers

    NASA Technical Reports Server (NTRS)

    Farber, Robert M.

    1993-01-01

    Neural networks are a very useful tool for analyzing and modeling complex real-world systems. Applying neural network simulations to real-world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine-grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has sub-linear runtime growth, on the order of O(log(number of processors))). We can efficiently model very large neural networks with many neurons and interconnects, and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors via fast adjacent-processor communications. This paper considers the simulation of only feed-forward neural networks, although the method is extendable to recurrent networks.
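
    The O(log P) global summation mentioned above can be pictured as a pairwise tree reduction; this toy simulation combines adjacent partial sums round by round, halving the number of active "processors" each time.

        def tree_sum(values):
            vals, rounds = list(values), 0
            while len(vals) > 1:
                # adjacent pairs combine "in parallel" within one round
                tail = [vals[-1]] if len(vals) % 2 else []
                vals = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)] + tail
                rounds += 1
            return vals[0], rounds

        print(tree_sum(range(64)))  # (2016, 6): 64 partial sums combined in 6 rounds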

  1. Communication Improvement for the LU NAS Parallel Benchmark: A Model for Efficient Parallel Relaxation Schemes

    NASA Technical Reports Server (NTRS)

    Yarrow, Maurice; VanderWijngaart, Rob; Kutler, Paul (Technical Monitor)

    1997-01-01

    The first release of the MPI version of the LU NAS Parallel Benchmark (NPB2.0) performed poorly compared to its companion NPB2.0 codes. The later LU release (NPB2.1 & 2.2) runs up to two and a half times faster, thanks to a revised point access scheme and related communications scheme. The new scheme sends substantially fewer messages, is cache-friendly, and has a better load balance. We detail the observations and modifications that resulted in this efficiency improvement, and show that the poor behavior of the original code resulted from deriving a message passing scheme from an algorithm originally devised for a vector architecture.

  2. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  3. Partitioning And Packing Equations For Parallel Processing

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.; Milner, Edward J.

    1989-01-01

    An algorithm was developed to identify parallelism in a set of coupled ordinary differential equations that describe a physical system, and to divide the set into parallel computational paths along which parts of the solution proceed independently of one another during at least part of the time. The path-identifying algorithm creates a number of paths consisting of equations that must be computed serially, together with a table that gives the dependent and independent arguments and the "can start," "can end," and "must end" times of each equation. The "must end" time is used subsequently by the packing algorithm.
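
    The flavor of the path-identification step can be sketched with the standard library: dependencies among equations form a DAG, every equation in a ready set shares a "can start" time and can run in parallel, and each chain through the graph must be computed serially. The equation system here is made up for illustration.

        from graphlib import TopologicalSorter  # Python 3.9+

        # deps[e] = equations whose outputs e needs (a hypothetical system)
        deps = {"e1": set(), "e2": set(), "e3": {"e1"}, "e4": {"e1", "e2"}, "e5": {"e3", "e4"}}

        ts = TopologicalSorter(deps)
        ts.prepare()
        level = 0
        while ts.is_active():
            ready = list(ts.get_ready())          # these share a "can start" time
            print(f"level {level}: {sorted(ready)}")
            ts.done(*ready)                       # mark finished; successors unlock
            level += 1
        # level 0: ['e1', 'e2'] / level 1: ['e3', 'e4'] / level 2: ['e5']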

  5. The efficient parallel iterative solution of large sparse linear systems

    SciTech Connect

    Jones, M.T.; Plassmann, P.E.

    1992-06-01

    The development of efficient, general-purpose software for the iterative solution of sparse linear systems on a parallel MIMD computer requires an interesting combination of expertise. Parallel graph heuristics, convergence analysis, and basic linear algebra implementation issues must all be considered. In this paper, we discuss how we have incorporated recent results in these areas into a general-purpose iterative solver. First, we consider two recently developed parallel graph coloring heuristics. We show how the method proposed by Luby, based on determining maximal independent sets, can be modified to run in an asynchronous manner and give an expected running time bound for this modified heuristic. In addition, a number of graph reduction heuristics are described that are used in our implementation to improve the individual processor performance. The effect of these various graph reductions on the solution of sparse triangular systems is categorized. Finally, we discuss the performance of this solver from the perspective of two large-scale applications: a piezoelectric crystal finite-element modeling problem, and a nonlinear optimization problem to determine the minimum energy configuration of a three-dimensional, layered superconductor model.
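
    The Luby-style selection rule referred to above, simulated sequentially as a sketch: every remaining vertex draws a random number, local maxima join the independent set, and the winners plus their neighbors drop out each round (the asynchronous variant the authors analyze differs in scheduling, not in this basic rule).

        import random

        def luby_mis(adj):  # adj: dict mapping vertex -> set of neighbors
            remaining, mis = set(adj), set()
            while remaining:
                r = {v: random.random() for v in remaining}     # one draw per vertex
                winners = {v for v in remaining
                           if all(r[v] > r[u] for u in adj[v] if u in remaining)}
                mis |= winners                                  # local maxima join the set
                dropped = winners | {u for v in winners for u in adj[v]}
                remaining -= dropped                            # winners and neighbors leave
            return mis

        print(luby_mis({0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}))  # e.g. {0, 2}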

  6. Parallel Processing with Digital Signal Processing Hardware and Software

    NASA Technical Reports Server (NTRS)

    Swenson, Cory V.

    1995-01-01

    The assembling and testing of a parallel processing system is described which will allow a user to move a Digital Signal Processing (DSP) application from the design stage to the execution/analysis stage through the use of several software tools and hardware devices. The system will be used to demonstrate the feasibility of the Algorithm To Architecture Mapping Model (ATAMM) dataflow paradigm for static multiprocessor solutions of DSP applications. The individual components comprising the system are described followed by the installation procedure, research topics, and initial program development.

  7. An Efficient Parallel Garbage Collection System and Its Correctness Proof.

    DTIC Science & Technology

    1977-09-01

    AD-A046 626. Carnegie-Mellon Univ., Pittsburgh, PA, Dept. of Computer Science. An Efficient Parallel Garbage Collection System and Its Correctness Proof. Supported in part by the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422; the second author is supported in part by Fundacao de Amparo a Pesquisa do Estado de Sao Paulo. Approved for public release; distribution unlimited. Abstract: An efficient system to perform garbage collection …

  8. Controlled loading of oligodeoxyribonucleotide monolayers onto unoxidized crystalline silicon; fluorescence-based determination of the surface coverage and of the hybridization efficiency; parallel imaging of the process by Atomic Force Microscopy

    PubMed Central

    Cattaruzza, Fabrizio; Cricenti, Antonio; Flamini, Alberto; Girasole, Marco; Longo, Giovanni; Prosperi, Tommaso; Andreano, Giuseppina; Cellai, Luciano; Chirivino, Emanuele

    2006-01-01

    Unoxidized crystalline silicon, characterized by high purity, high homogeneity, sturdiness and an atomically flat surface, offers many advantages for the construction of electronic miniaturized biosensor arrays upon attachment of biomolecules (DNA, proteins or small organic compounds). This makes it possible to study the incidence of molecular interactions through the simultaneous analysis, within a single experiment, of a number of samples containing small quantities of potential targets, in the presence of thousands of variables. A simple, accurate and robust methodology, presented here, was established for assembling DNA sensors on the unoxidized, crystalline Si(100) surface by loading controlled amounts of a monolayer DNA-probe through a two-step procedure. First, a monolayer of a spacer molecule, such as 10-undecynoic acid, was deposited under optimized conditions via controlled cathodic electrografting; a synthetic DNA-probe was then anchored to it through amidation in aqueous solution. The surface coverage of several DNA-probes and the control of their efficiency in recognizing a complementary target-DNA upon hybridization were evaluated by fluorescence measurements. The whole process was also monitored in parallel by Atomic Force Microscopy (AFM). PMID:16507670

  9. Direct stereo radargrammetric processing using massively parallel processing

    NASA Astrophysics Data System (ADS)

    Balz, Timo; Zhang, Lu; Liao, Mingsheng

    2013-05-01

    Synthetic Aperture Radar (SAR) offers many ways to reconstruct digital surface models (DSMs). The two most commonly used methods are SAR interferometry (InSAR) and stereo radargrammetry. Stereo radargrammetry is a very stable and reliable process and is far less affected by temporal decorrelation compared with InSAR. It is therefore often used for DSM generation in heavily vegetated areas. However, stereo radargrammetry often produces rather noisy DSMs, sometimes containing large outliers. In this manuscript, we present a new approach for stereo radargrammetric processing, where the homologous points between the images are found by geocoding a large number of points. This offers a very flexible approach, allowing the simultaneous processing of multiple images and of cross-heading image pairs. Our approach relies on a good initial geocoding accuracy of the data and on very fast processing using a massively parallel implementation. The approach is demonstrated using TerraSAR-X images from Mount Song, China, and from Trento, Italy.

  10. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAP) have been in day to day use for ten years and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.

  11. Wavelet Transforms in Parallel Image Processing

    DTIC Science & Technology

    1994-01-27

    Subject terms: object segmentation, texture segmentation, image compression, image halftoning, neural networks, parallel algorithms, 2D and 3D … Report sections include vector quantization of wavelet transform coefficients and adaptive image halftoning based on wavelets. From the abstract: one application has been directed to adaptive image halftoning; the gray information at a pixel, including its gray value and gradient, is represented by …

  12. ENERGY EFFICIENT LAUNDRY PROCESS

    SciTech Connect

    Tim Richter

    2005-04-01

    With the rising cost of energy and increased concerns for pollution and greenhouse gas emissions from power generation, increased focus is being put on energy efficiency. This study looks at several approaches to reducing energy consumption in clothes care appliances by considering the appliances and laundry chemistry as a system, rather than individually.

  13. Efficient parallel algorithms for string editing and related problems

    NASA Technical Reports Server (NTRS)

    Apostolico, Alberto; Atallah, Mikhail J.; Larmore, Lawrence; Mcfaddin, H. S.

    1988-01-01

    The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x, or the substitution of a symbol of x with another symbol. This problem has a well-known O(|x|·|y|)-time sequential solution [25]. Efficient PRAM (parallel random-access machine) algorithms for the string editing problem are given. If m = min(|x|, |y|) and n = max(|x|, |y|), then the CREW bound is O(log m log n) time with O(mn/log m) processors. In all algorithms, space is O(mn).
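
    For reference, the O(|x|·|y|) sequential dynamic program has this classic form (unit costs shown; the paper treats general weighted edit operations):

        def edit_distance(x, y):
            m, n = len(x), len(y)
            d = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                d[i][0] = i                                  # delete all of x[:i]
            for j in range(n + 1):
                d[0][j] = j                                  # insert all of y[:j]
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    d[i][j] = min(d[i - 1][j] + 1,           # deletion
                                  d[i][j - 1] + 1,           # insertion
                                  d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # substitution
            return d[m][n]

        print(edit_distance("kitten", "sitting"))  # 3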

  14. Parallel approach to incorporating face image information into dialogue processing

    NASA Astrophysics Data System (ADS)

    Ren, Fuji

    2000-10-01

    There are many kinds of so-called irregular expressions in natural dialogues. Even if the content of a conversation is the same in words, different meanings can be interpreted from a person's feelings or facial expression. For a good understanding of dialogues, a flexible dialogue processing system is required to infer the speaker's view properly. However, it is difficult to obtain the meaning of the speaker's sentences in various scenes using traditional methods. In this paper, a new approach for dialogue processing that incorporates information from the speaker's face is presented. We first divide conversation statements into several simple tasks. Second, we process each simple task using an independent processor. Third, we employ the speaker's facial information to estimate the view of the speakers and to solve ambiguities in dialogues. The approach presented in this paper can work efficiently because independent processors run in parallel, writing partial results to a shared memory, incorporating partial results at appropriate points, and complementing each other. A parallel algorithm and a method for employing the face information in dialogue machine translation are discussed, and some results are included in this paper.

  15. A massively parallel solution strategy for efficient thermal radiation simulation

    NASA Astrophysics Data System (ADS)

    Nguyen, P. D.; Moureau, V.; Vervisch, L.; Perret, N.

    2012-06-01

    A novel and efficient methodology to solve the Radiative Transfer Equations (RTE) in thermal radiation is discussed. The BiCGStab(2) iterative solution method, as designed for non-symmetric linear equation systems, is used to solve the discretized RTE. The numerical upwind and central schemes are blended to provide a stable numerical scheme (MUCS) for interpolation of the cell facial radiation intensities in the finite volume formulation. The combination of the BiCGStab(2) and MUCS methods proved to be very efficient when coupled with the DOM approach to solve the RTE. A cost-effective tabulation technique for the gaseous radiative property model SNB-FSCK using a 7-point Gauss-Lobatto quadrature scheme is also introduced. The whole methodology is implemented into a massively parallel unstructured CFD code where the radiative and fluid flow solutions share the same domain decomposition, which is the bottleneck in current radiative solvers. The dual mesh decomposition at the cell groups level and processors level is adopted to optimize the CFD code for massively parallel computing. The whole method is applied to simulate the radiative heat transfer in a 3D rectangular enclosure containing non-isothermal CO2 and H2O mixtures. Two test cases are studied for homogeneous and inhomogeneous distributions of CO2 and H2O in the enclosure. Results are reported for the heat flux and radiation energy source, and a comparison is made between the present methodology BiCGStab(2)/MUCS/tabulated SNB-FSCK, the benchmark method SNB-CK (implemented at 25 cm^-1 narrow-band) and some other methods available in the literature. The present method (BiCGStab(2)/MUCS/tabulated SNB-FSCK) yields more accurate predictions, particularly for the radiation source term. When compared with the benchmark solution, the relative error of the radiation source term is remarkably reduced to less than 4% and the CPU time is drastically diminished.
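
    The authors use BiCGStab(2); as a simpler point of reference, SciPy ships plain BiCGStab for non-symmetric systems, sketched here on a small diagonally dominant tridiagonal matrix loosely mimicking an upwinded stencil (the matrix and sizes are invented for illustration).

        import numpy as np
        from scipy.sparse import diags
        from scipy.sparse.linalg import bicgstab

        n = 100
        A = diags([-1.3 * np.ones(n - 1),       # non-symmetric: sub- and super-diagonals differ
                   2.0 * np.ones(n),
                   -0.7 * np.ones(n - 1)], offsets=[-1, 0, 1], format="csr")
        b = np.ones(n)
        x, info = bicgstab(A, b)                # info == 0 signals convergence
        print(info, np.linalg.norm(A @ x - b))  # 0 and a small residual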

  16. Extraction of Hydrological Proximity Measures from DEMs using Parallel Processing

    SciTech Connect

    Tesfa, Teklu K.; Tarboton, David G.; Watson, Daniel W.; Schreuders, Kimberly A.; Baker, Matthew M.; Wallace, Robert M.

    2011-12-01

    A queue data structure is used to order the processing of cells so that each cell is visited only once, and the cross-process communications that are a standard part of MPI are handled in an efficient manner. This parallel approach allows analysis of much larger DEMs than the serial recursive algorithms. In this paper, we present the definitions of the HPMs, the serial and parallel algorithms used in their extraction, and their potential applications in hydrology, geomorphology and ecology.
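
    The queue pattern described above can be sketched in miniature for a flow-accumulation-style quantity: cells with no unprocessed upstream neighbors enter a FIFO, each cell is dequeued exactly once, and its value is pushed to its downstream cell. The structures here are simplified stand-ins for the gridded DEM.

        from collections import deque

        def accumulate(downstream, indegree, contrib):
            # downstream[c]: cell c drains to (None at outlets); indegree[c]: count of
            # upstream cells; contrib[c]: local contribution (e.g. 1 per cell)
            acc = dict(contrib)
            q = deque(c for c, d in indegree.items() if d == 0)   # ridge cells are ready
            while q:
                c = q.popleft()                                   # visited exactly once
                d = downstream.get(c)
                if d is not None:
                    acc[d] += acc[c]                              # pass the total downhill
                    indegree[d] -= 1
                    if indegree[d] == 0:
                        q.append(d)                               # now safe to process
            return acc

        # a 3-cell chain a -> b -> c: accumulation grows downstream
        print(accumulate({"a": "b", "b": "c", "c": None},
                         {"a": 0, "b": 1, "c": 1},
                         {"a": 1, "b": 1, "c": 1}))  # {'a': 1, 'b': 2, 'c': 3}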

  17. Holographic Routing Network For Parallel Processing Machines

    NASA Astrophysics Data System (ADS)

    Maniloff, Eric S.; Johnson, Kristina M.; Reif, John H.

    1989-10-01

    Dynamic holographic architectures for connecting processors in parallel computers have been generally limited by the response time of the holographic recording media. In this paper we present a different approach to dynamic optical interconnects involving spatial light modulators (SLMs) and volume holograms. Multiple-exposure holograms are stored in a volume recording medium; they associate the address of a destination processor encoded on a spatial light modulator with a distinct reference beam. A destination address programmed on the spatial light modulator is then holographically steered to the correct destination processor. We present the design and experimental results of a holographic router for connecting four originator processors to four destination processors.

  18. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  19. Bipartite memory network architectures for parallel processing

    SciTech Connect

    Smith, W.; Kale, L.V. . Dept. of Computer Science)

    1990-01-01

    Parallel architectures are broadly classified as either shared memory or distributed memory architectures. In this paper, the authors propose a third family of architectures, called bipartite memory network architectures. In this architecture, processors and memory modules constitute a bipartite graph, where each processor is allowed to access a small subset of the memory modules, and each memory module allows access from a small set of processors. The architecture is particularly suitable for computations requiring dynamic load balancing. The authors explore the properties of this architecture by examining the Perfect Difference set based topology for the graph. Extensions of this topology are also suggested.

  20. CRBLASTER: A Parallel-Processing Computational Framework for Embarrassingly-Parallel Image-Analysis Algorithms

    NASA Astrophysics Data System (ADS)

    Mighell, Kenneth John

    2011-11-01

    The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called CRBLASTER, which does cosmic-ray rejection of CCD (charge-coupled device) images using the embarrassingly-parallel L.A.COSMIC algorithm. CRBLASTER is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly-parallel algorithms. The CRBLASTER source code is freely available at the official application website at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800x800 pixel Hubble Space Telescope WFPC2 image takes 44 seconds with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors. CRBLASTER is 7.4 times faster processing the same image on a single core on the same machine. Processing the same image with CRBLASTER simultaneously on all 8 cores of the same machine takes 0.875 seconds -- which is a speedup factor of 50.3 times faster than the IRAF script. A detailed analysis is presented of the performance of CRBLASTER using between 1 and 57 processors on a low-power Tilera 700-MHz 64-core TILE64 processor.
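
    CRBLASTER itself is MPI-based; the shape of the embarrassingly parallel pattern can be sketched with the standard library alone: split the image into per-worker strips and filter each independently. (The real code adds overlapping border pixels along partition edges so no inter-process communication is needed; this toy filter skips that detail, and the filter itself is a placeholder, not L.A.COSMIC.)

        import numpy as np
        from multiprocessing import Pool

        def clean_strip(strip):
            med = np.median(strip)
            return np.where(strip > 5 * med, med, strip)   # toy outlier clip, not L.A.COSMIC

        def clean_image(img, workers=8):
            strips = np.array_split(img, workers, axis=0)  # independent pieces of work
            with Pool(workers) as pool:
                return np.vstack(pool.map(clean_strip, strips))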

  1. Efficiently parallelized modeling of tightly focused, large bandwidth laser pulses

    NASA Astrophysics Data System (ADS)

    Dumont, Joey; Fillion-Gourdeau, François; Lefebvre, Catherine; Gagnon, Denis; MacLean, Steve

    2017-02-01

    The Stratton–Chu integral representation of electromagnetic fields is used to study the spatio-temporal properties of large bandwidth laser pulses focused by high numerical aperture mirrors. We review the formal aspects of the derivation of diffraction integrals from the Stratton–Chu representation and discuss the use of the Hadamard finite part in the derivation of the physical optics approximation. By analyzing the formulation we show that, for the specific case of a parabolic mirror, the integrands involved in the description of the reflected field near the focal spot do not possess the strong oscillations characteristic of diffraction integrals. Consequently, the integrals can be evaluated with simple and efficient quadrature methods rather than with specialized, more costly approaches. We report on the development of an efficiently parallelized algorithm that evaluates the Stratton–Chu diffraction integrals for incident fields of arbitrary temporal and spatial dependence. This method has the advantage that its input is the unfocused field coming from the laser chain, which is experimentally known with high accuracy. We use our method to show that the reflection of a linearly polarized Gaussian beam of femtosecond duration off a high numerical aperture parabolic mirror induces ellipticity in the dominant field components and generates strong longitudinal components. We also estimate that future high-power laser facilities may reach intensities of 10^24 W cm^-2.

  2. Parafrase restructuring of FORTRAN code for parallel processing

    NASA Technical Reports Server (NTRS)

    Wadhwa, Atul

    1988-01-01

    Parafrase transforms a FORTRAN code, subroutine by subroutine, into a parallel code for a vector and/or shared-memory multiprocessor system. Parafrase is not a compiler; it transforms a code and provides information for a vector or concurrent process. Parafrase uses data dependency analysis to reveal parallelism among instructions. The data dependency test distinguishes between recurrences and statements that can be directly vectorized or parallelized. A number of transformations are required to build a data dependency graph.

  3. Parallel-Processing Test Bed For Simulation Software

    NASA Technical Reports Server (NTRS)

    Blech, Richard; Cole, Gary; Townsend, Scott

    1996-01-01

    Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-the-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).

  4. Roadmap for efficient parallelization of breast anatomy simulation

    NASA Astrophysics Data System (ADS)

    Chui, Joseph H.; Pokrajac, David D.; Maidment, Andrew D. A.; Bakic, Predrag R.

    2012-03-01

    A roadmap has been proposed to optimize the simulation of breast anatomy by parallel implementation, in order to reduce the time needed to generate software breast phantoms. The rapid generation of high resolution phantoms is needed to support virtual clinical trials of breast imaging systems. We have recently developed an octree-based recursive partitioning algorithm for breast anatomy simulation. The algorithm has good asymptotic complexity; however, its current MATLAB implementation cannot provide optimal execution times. The proposed roadmap for efficient parallelization includes the following steps: (i) migrate the current code to a C/C++ platform and optimize it for single-threaded implementation; (ii) modify the code to allow for multi-threaded CPU implementation; (iii) identify and migrate the code to a platform designed for multithreaded GPU implementation. In this paper, we describe our results in optimizing the C/C++ code for single-threaded and multi-threaded CPU implementations. As the first step of the proposed roadmap we have identified a bottleneck component in the MATLAB implementation using MATLAB's profiling tool, and created a single threaded CPU implementation of the algorithm using C/C++'s overloaded operators and standard template library. The C/C++ implementation has been compared to the MATLAB version in terms of accuracy and simulation time. A 520-fold reduction of the execution time was observed in a test of phantoms with 50-400 μm voxels. In addition, we have identified several places in the code which will be modified to allow for the next roadmap milestone of the multithreaded CPU implementation.

  5. Parallel processing research in the former Soviet Union

    SciTech Connect

    Dongarra, J.J.; Snyder, L.; Wolcott, P.

    1992-03-01

    This technical assessment report examines strengths and weaknesses of parallel processing research and development in the Soviet Union from the 1980s to June 1991. The assessment was carried out by a panel of US scientists who are experts on parallel processing hardware, software, algorithms, and applications, and on Soviet computing. Soviet computer research and development organizations have pursued many of the major avenues of inquiry related to parallel processing that the West has chosen to explore. But the limited size and substantial breadth of their effort have limited the collective depth of Soviet activity. Even more serious limitations (and delays) of Soviet achievement in parallel processing research can be traced to shortcomings of the Soviet computer industry, which was unable to supply adequate, reliable computer components. Without the ability to build, demonstrate, and test embodiments of their ideas in actual high-performance parallel hardware, both the scope of activity and the success of Soviet parallel processing researchers were severely limited. The quality of the Soviet parallel processing research assessed varied from very sound and interesting to pedestrian, with most of the groups at the major hardware and software centers to which the work is largely confined doing good (or at least serious) research. In a few instances, interesting and competent parallel language development work was found at institutions not associated with hardware development efforts. Unlike Soviet mainframe and minicomputer developers, Soviet parallel processing researchers have not concentrated their efforts on reverse-engineering specific Western systems. No evidence was found of successful Soviet attempts to use breakthroughs in parallel processing technology to "leapfrog" impediments and limitations that Soviet industrial weakness in microelectronics and other computer manufacturing areas impose on the performance of high-end Soviet computers.

  7. [CMACPAR: a modified parallel neuro-controller for control processes].

    PubMed

    Ramos, E; Surós, R

    1999-01-01

    CMACPAR is a parallel neurocontroller oriented to real-time systems such as process control. Its main characteristics are a fast learning algorithm, a reduced number of calculations, great generalization capacity, local learning and intrinsic parallelism. This type of neurocontroller is used in real-time applications required by refineries, hydroelectric plants, factories, etc. In this work we present the analysis and the parallel implementation of a modified scheme of the Cerebellar Model CMAC for n-dimensional space projection using a medium-granularity parallel neurocontroller. The proposed memory management allows for a significant reduction in training time and required memory size.

  8. An Airborne Onboard Parallel Processing Testbed

    NASA Technical Reports Server (NTRS)

    Mandl, Daniel J.

    2014-01-01

    This presentation provides information on the progress of the Intelligent Payload Module (IPM) development effort. In addition, a vision is presented on integration of the IPM architecture with the GeoSocial Application Program Interface (API) architecture to enable efficient distribution of satellite data products.

  9. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    PubMed

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  10. Time dependent processing in a parallel pipeline architecture.

    PubMed

    Biddiscombe, John; Geveci, Berk; Martin, Ken; Moreland, Kenneth; Thompson, David

    2007-01-01

    Pipeline architectures provide a versatile and efficient mechanism for constructing visualizations, and they have been implemented in numerous libraries and applications over the past two decades. In addition to allowing developers and users to freely combine algorithms, visualization pipelines have proven to work well when streaming data and scale well on parallel distributed-memory computers. However, current pipeline visualization frameworks have a critical flaw: they are unable to manage time varying data. As data flows through the pipeline, each algorithm has access to only a single snapshot in time of the data. This prevents the implementation of algorithms that do any temporal processing such as particle tracing; plotting over time; or interpolation, fitting, or smoothing of time series data. As data acquisition technology improves, as simulation time-integration techniques become more complex, and as simulations save less frequently and regularly, the ability to analyze the time-behavior of data becomes more important. This paper describes a modification to the traditional pipeline architecture that allows it to accommodate temporal algorithms. Furthermore, the architecture allows temporal algorithms to be used in conjunction with algorithms expecting a single time snapshot, thus simplifying software design and allowing adoption into existing pipeline frameworks. Our architecture also continues to work well in parallel distributed-memory environments. We demonstrate our architecture by modifying the popular VTK framework and exposing the functionality to the ParaView application. We use this framework to apply time-dependent algorithms on large data with a parallel cluster computer and thereby exercise a functionality that previously did not exist.

  11. schwimmbad: A uniform interface to parallel processing pools in Python

    NASA Astrophysics Data System (ADS)

    Price-Whelan, Adrian M.; Foreman-Mackey, Daniel

    2017-09-01

    Many scientific and computing problems require doing some calculation on all elements of some data set. If the calculations can be executed in parallel (i.e. without any communication between calculations), these problems are said to be perfectly parallel. On computers with multiple processing cores, these tasks can be distributed and executed in parallel to greatly improve performance. A common paradigm for handling these distributed computing problems is to use a processing "pool": the "tasks" (the data) are passed in bulk to the pool, and the pool handles distributing the tasks to a number of worker processes when available. schwimmbad provides a uniform interface to parallel processing pools and enables switching easily between local development (e.g., serial processing or with multiprocessing) and deployment on a cluster or supercomputer (via, e.g., MPI or JobLib).
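
    A minimal usage sketch following schwimmbad's documented pattern (the worker and task list are placeholders): the same script runs serially, with local processes, or under MPI by changing only the pool selection.

        from schwimmbad import choose_pool

        def worker(task):
            return task ** 2                     # stand-in for the real per-element calculation

        def main(pool):
            results = list(pool.map(worker, range(100)))
            print(sum(results))

        if __name__ == "__main__":
            # processes=4 selects a local multiprocessing pool;
            # mpi=True would instead run the same code under MPI on a cluster
            pool = choose_pool(mpi=False, processes=4)
            main(pool)
            pool.close()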

  12. High efficiency solar cell processing

    NASA Technical Reports Server (NTRS)

    Ho, F.; Iles, P. A.

    1985-01-01

    At the time of writing, cells made by several groups are approaching 19% efficiency. General aspects of the processing required for such cells are discussed. Most processing used for high-efficiency cells is derived from space-cell or concentrator-cell technology, and recent advances have been obtained from improved techniques rather than from better understanding of the limiting mechanisms. Theory and modeling are fairly well developed, and adequate to guide further asymptotic increases in the performance of near-conventional cells. There are several competitive cell designs with promise of higher performance (above 20%), but for these designs further improvements are required. The available cell processing technology to fabricate high-efficiency cells is examined.

  13. Parallel Processing of Large Scale Microphone Arrays for Sound Capture

    NASA Astrophysics Data System (ADS)

    Jan, Ea-Ee.

    1995-01-01

    Performance of microphone sound pickup is degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more prominent in a teleconferencing environment in which the microphone is positioned far away from the speaker. Besides, the ideal teleconference should feel as easy and natural as face-to-face communication with another person. This suggests hands-free sound capture with no tether or encumbrance by hand-held or body-worn sound equipment. Microphone arrays for this application represent an appropriate approach. This research develops new microphone array and signal processing techniques for high quality hands-free sound capture in noisy, reverberant enclosures. The new techniques combine matched-filtering of individual sensors and parallel processing to provide acute spatial volume selectivity which is capable of mitigating the deleterious effects of noise interference and multipath distortion. The new method outperforms traditional delay-and-sum beamformers which provide only directional spatial selectivity. The research additionally explores truncated matched-filtering and random distribution of transducers to reduce complexity and improve sound capture quality. All designs are first established by computer simulation of array performance in reverberant enclosures. The simulation is achieved by a room model which can efficiently calculate the acoustic multipath in a rectangular enclosure up to a prescribed order of images. It also calculates the incident angle of the arriving signal. Experimental arrays were constructed and their performance was measured in real rooms. Real room data were collected in a hard-walled laboratory and a controllable variable acoustics enclosure of similar size, approximately 6 x 6 x 3 m. An extensive speech database was also collected in these two enclosures for future research on microphone arrays. The simulation results are shown to be …
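
    For context, the delay-and-sum baseline that the proposed matched-filter array is compared against can be sketched as follows (sample rate, geometry, and the integer-sample shift are simplifying assumptions):

        import numpy as np

        def delay_and_sum(signals, delays_s, fs):
            # signals: (n_mics, n_samples); delays_s: steering delay per mic in seconds
            n_mics = signals.shape[0]
            out = np.zeros(signals.shape[1])
            for sig, d in zip(signals, delays_s):
                shift = int(round(d * fs))   # nearest-sample approximation
                out += np.roll(sig, -shift)  # advance each channel to align wavefronts
                                             # (np.roll wraps at the ends; fine for a sketch)
            return out / n_mics              # coherent average reinforces the look direction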

  14. Applying Parallel Processing Techniques to Tether Dynamics Simulation

    NASA Technical Reports Server (NTRS)

    Wells, B. Earl

    1996-01-01

    The focus of this research has been to determine the effectiveness of applying parallel processing techniques to a sizable real-world problem, the simulation of the dynamics associated with a tether which connects two objects in low earth orbit, and to explore the degree to which the parallelization process can be automated through the creation of new software tools. The goal has been to utilize this specific application problem as a base to develop more generally applicable techniques.

  15. Parallel astronomical data processing with Python: Recipes for multicore machines

    NASA Astrophysics Data System (ADS)

    Singh, Navtej; Browne, Lisa-Marie; Butler, Ray

    2013-08-01

    High performance computing has been used in various fields of astrophysical research. But most of it is implemented on massively parallel systems (supercomputers) or graphical processing unit clusters. With the advent of multicore processors in the last decade, many serial software codes have been re-implemented in parallel mode to utilize the full potential of these processors. In this paper, we propose parallel processing recipes for multicore machines for astronomical data processing. The target audience is astronomers who use Python as their preferred scripting language and who may be using PyRAF/IRAF for data processing. Three problems of varied complexity were benchmarked on three different types of multicore processors to demonstrate the benefits, in terms of execution time, of parallelizing data processing tasks. The native multiprocessing module available in Python makes it a relatively trivial task to implement the parallel code. We have also compared the three multiprocessing approaches: Pool/Map, Process/Queue and Parallel Python. Our test codes are freely available and can be downloaded from our website.
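
    The first two approaches compared above look like this on a toy task (Parallel Python is a third-party package and is omitted here; the task itself is a placeholder):

        from multiprocessing import Pool, Process, Queue

        def work(x):
            return x * x

        def pool_map_demo():                     # Pool/Map: declarative, uniform tasks
            with Pool(4) as pool:
                return pool.map(work, range(8))

        def queue_worker(tasks, results):
            while True:
                x = tasks.get()
                if x is None:                    # sentinel: no more work
                    break
                results.put(work(x))

        def process_queue_demo():                # Process/Queue: explicit worker loop
            tasks, results = Queue(), Queue()
            procs = [Process(target=queue_worker, args=(tasks, results)) for _ in range(4)]
            for p in procs:
                p.start()
            for x in range(8):
                tasks.put(x)
            for _ in procs:
                tasks.put(None)
            out = sorted(results.get() for _ in range(8))
            for p in procs:
                p.join()
            return out

        if __name__ == "__main__":
            print(pool_map_demo(), process_queue_demo())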

  16. Parallel Processing and Sentence Comprehension Difficulty

    ERIC Educational Resources Information Center

    Boston, Marisa Ferrara; Hale, John T.; Vasishth, Shravan; Kliegl, Reinhold

    2011-01-01

    Eye fixation durations during normal reading correlate with processing difficulty, but the specific cognitive mechanisms reflected in these measures are not well understood. This study finds support in German readers' eye fixations for two distinct difficulty metrics: surprisal, which reflects the change in probabilities across syntactic analyses…

  17. Image Processing Using a Parallel Architecture.

    DTIC Science & Technology

    1987-12-01

  18. Fault-tolerant parallel processing system

    SciTech Connect

    Harper, R.E.; Lala, J.H.

    1990-03-06

    This patent describes a fault-tolerant processing system for providing processing operations while tolerating f failures in the execution thereof. It comprises: at least (3f + 1) fault containment regions. Each of the regions includes a plurality of processors; network means connected to the processors and to the network means of the others of the fault containment regions; groups of one or more processors being configured to form redundant processing sites, at least one of the groups having (2f + 1) processors, each of the processors of a group being included in a different one of the fault containment regions. Each network means of a fault containment region includes means for providing communication operations between the network means and the network means of the others of the fault containment regions, each of the network means being connected to each other network means by at least (2f + 1) disjoint communication paths, a minimum of (f + 1) rounds of communication being provided among the network means of the fault containment regions in the execution of the processing operation; and means for synchronizing the communication operations of the network means with the communication operations of the network means of the other fault containment regions.

  19. Efficient parallel algorithms for optical computing with the discrete Fourier transform (DFT) primitive

    NASA Astrophysics Data System (ADS)

    Reif, John H.; Tyagi, Akhilesh

    1997-10-01

    Optical-computing technology offers new challenges to algorithm designers since it can perform an n-point discrete Fourier transform (DFT) computation in only unit time. Note that the DFT is a nontrivial computation in the parallel random-access machine model, a model of computing commonly used by parallel-algorithm designers. We develop two new models, the DFT VLSIO (very-large-scale integrated optics) and the DFT circuit, to capture this characteristic of optical computing. We also provide two paradigms for developing parallel algorithms in these models. Efficient parallel algorithms for many problems, including polynomial and matrix computations, sorting, and string matching, are presented. The sorting and string-matching algorithms are particularly noteworthy. Almost all these algorithms are within a polylog factor of the optical-computing (VLSIO) lower bounds derived by Barakat and Reif [Appl. Opt. 26, 1015 (1987)] and by Tyagi and Reif [Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing (Institute of Electrical and Electronics Engineers, New York, 1990), p. 14].
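
    The role of the DFT as a computational primitive is easy to see in polynomial multiplication, where convolution turns into pointwise products between transforms (NumPy's FFT stands in here for the unit-time optical DFT):

        import numpy as np

        def poly_mul(a, b):
            n = len(a) + len(b) - 1              # degree bound of the product
            fa = np.fft.rfft(a, n)               # transform both inputs, zero-padded
            fb = np.fft.rfft(b, n)
            return np.rint(np.fft.irfft(fa * fb, n)).astype(int)

        # (1 + 2x)(3 + 4x + 5x^2) = 3 + 10x + 13x^2 + 10x^3
        print(poly_mul([1, 2], [3, 4, 5]))       # [ 3 10 13 10]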

  20. Parallel-META 3: Comprehensive taxonomical and functional analysis platform for efficient comparison of microbial communities

    PubMed Central

    Jing, Gongchao; Sun, Zheng; Wang, Honglei; Gong, Yanhai; Huang, Shi; Ning, Kang; Xu, Jian; Su, Xiaoquan

    2017-01-01

    The number of metagenomes is increasing rapidly. However, current methods for metagenomic analysis are limited in their capability for in-depth data mining among large numbers of microbiomes, each of which carries a complex community structure. Moreover, the complexity of configuring and operating the computational pipeline also hinders efficient data processing for end users. In this work we introduce Parallel-META 3, a comprehensive and fully automatic computational toolkit for rapid data mining among metagenomic datasets, with advanced features including 16S rRNA extraction for shotgun sequences, 16S rRNA copy number calibration, 16S rRNA based functional prediction, diversity statistics, bio-marker selection, interaction network construction, vector-graph-based visualization and parallel computing. Application of Parallel-META 3 on 5,337 samples with 1,117,555,208 sequences from diverse studies and platforms showed that it could produce results similar to QIIME and PICRUSt with much faster speed and lower memory usage, demonstrating its ability to unravel the taxonomical and functional dynamics patterns across large datasets and elucidate ecological links between the microbiome and the environment. Parallel-META 3 is implemented in C/C++ and R, and integrated into an executive package for rapid installation and easy access under Linux and Mac OS X. Both binary and source code packages are available at http://bioinfo.single-cell.cn/parallel-meta.html. PMID:28079128

  1. Parallel-META 3: Comprehensive taxonomical and functional analysis platform for efficient comparison of microbial communities.

    PubMed

    Jing, Gongchao; Sun, Zheng; Wang, Honglei; Gong, Yanhai; Huang, Shi; Ning, Kang; Xu, Jian; Su, Xiaoquan

    2017-01-12

    The number of metagenomes is increasing rapidly. However, current methods for metagenomic analysis are limited in their capability for in-depth data mining among large numbers of microbiomes, each of which carries a complex community structure. Moreover, the complexity of configuring and operating a computational pipeline also hinders efficient data processing for end users. In this work we introduce Parallel-META 3, a comprehensive and fully automatic computational toolkit for rapid data mining among metagenomic datasets, with advanced features including 16S rRNA extraction for shotgun sequences, 16S rRNA copy number calibration, 16S rRNA based functional prediction, diversity statistics, bio-marker selection, interaction network construction, vector-graph-based visualization and parallel computing. Application of Parallel-META 3 on 5,337 samples with 1,117,555,208 sequences from diverse studies and platforms showed it could produce results similar to those of QIIME and PICRUSt with much faster speed and lower memory usage, which demonstrates its ability to unravel the taxonomical and functional dynamics patterns across large datasets and elucidate ecological links between microbiome and the environment. Parallel-META 3 is implemented in C/C++ and R, and integrated into an executive package for rapid installation and easy access under Linux and Mac OS X. Both binary and source code packages are available at http://bioinfo.single-cell.cn/parallel-meta.html.

  2. Solution-processed parallel tandem polymer solar cells using silver nanowires as intermediate electrode.

    PubMed

    Guo, Fei; Kubis, Peter; Li, Ning; Przybilla, Thomas; Matt, Gebhard; Stubhan, Tobias; Ameri, Tayebeh; Butz, Benjamin; Spiecker, Erdmann; Forberich, Karen; Brabec, Christoph J

    2014-12-23

    Tandem architecture is the most relevant concept to overcome the efficiency limit of single-junction photovoltaic solar cells. Series-connected tandem polymer solar cells (PSCs) have advanced rapidly during the past decade. In contrast, the development of parallel-connected tandem cells is lagging far behind due to the big challenge in establishing an efficient interlayer with high transparency and high in-plane conductivity. Here, we report all-solution fabrication of parallel tandem PSCs using silver nanowires as intermediate charge collecting electrode. Through a rational interface design, a robust interlayer is established, enabling the efficient extraction and transport of electrons from subcells. The resulting parallel tandem cells exhibit high fill factors of ∼60% and enhanced current densities which are identical to the sum of the current densities of the subcells. These results suggest that solution-processed parallel tandem configuration provides an alternative avenue toward high performance photovoltaic devices.

  3. CRBLASTER: A Parallel-Processing Computational Framework for Embarrassingly Parallel Image-Analysis Algorithms

    NASA Astrophysics Data System (ADS)

    Mighell, Kenneth John

    2010-10-01

    The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the
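
    The guard-band trick that eliminates interprocess communication can be sketched in a few lines. The helper below is a hypothetical illustration, not crblaster source, and uses one-dimensional strips for brevity where crblaster partitions in two dimensions: each subimage is padded with enough extra rows that a neighbourhood algorithm such as l.a.cosmic never needs pixels owned by another process.

    /* Hypothetical helper: compute the row range of subimage `rank` out
     * of `nprocs`, padded with `overlap` guard rows so the stencil can
     * run without communication. Rows are half-open: [row0, row1). */
    typedef struct { int row0, row1; } Strip;

    static Strip strip_with_halo(int rows, int nprocs, int rank, int overlap)
    {
        int base = rows / nprocs, rem = rows % nprocs;
        /* the first `rem` strips take one extra row to balance the load */
        int r0 = rank * base + (rank < rem ? rank : rem);
        int r1 = r0 + base + (rank < rem ? 1 : 0);
        Strip s = { r0 - overlap, r1 + overlap };
        if (s.row0 < 0)    s.row0 = 0;           /* clamp at image borders */
        if (s.row1 > rows) s.row1 = rows;
        return s;
    }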

  4. Parallel Signal Processing and System Simulation using aCe

    NASA Technical Reports Server (NTRS)

    Dorband, John E.; Aburdene, Maurice F.

    2003-01-01

    Recently, networked and cluster computation have become very popular for both signal processing and system simulation. The aCe language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, this new C-based parallel language for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we will focus on some fundamental features of aCe and present a signal processing application (FFT).

  5. Software Engineering Challenges for Parallel Processing Systems

    DTIC Science & Technology

    2008-05-02

    [Abstract unavailable; the record contains only fragmentary slide text. Recoverable content: examples of concurrency faults such as data races and deadlock, including the data race in the Therac-25 radiation therapy machine that the slides state resulted in 5 deaths, and execution-time charts comparing sequential and OpenMP versions of a Jacobi solver.]

  6. Sentence Comprehension: A Parallel Distributed Processing Approach

    DTIC Science & Technology

    1989-07-14

    [Abstract garbled in the source record. Recoverable fragments: the report argues against the suggestion that the model presented is a tabula rasa acquiring knowledge of language without any prior structure, and poses the questions of what is constructed mentally when we comprehend a sentence, how this constructive process occurs, and what role the semantic content of each part of an expression and the organization of its constituents play.]

  7. Digital signal processor and programming system for parallel signal processing

    SciTech Connect

    Van den Bout, D.E.

    1987-01-01

    This thesis describes an integrated assault upon the problem of designing high-throughput, low-cost digital signal-processing systems. The dual prongs of this assault consist of: (1) the design of a digital signal processor (DSP) which efficiently executes signal-processing algorithms in either a uniprocessor or multiprocessor configuration, (2) the PaLS programming system which accepts an arbitrary algorithm, partitions it across a group of DSPs, synthesizes an optimal communication link topology for the DSPs, and schedules the partitioned algorithm upon the DSPs. The results of applying a new quasi-dynamic analysis technique to a set of high-level signal-processing algorithms were used to determine the uniprocessor features of the DSP design. For multiprocessing applications, the DSP contains an interprocessor communications port (IPC) which supports simple, flexible, dataflow communications while allowing the total communication bandwidth to be incrementally allocated to achieve the best link utilization. The net result is a DSP with a simple architecture that is easy to program for both uniprocessor and multi-processor modes of operation. The PaLS programming system simplifies the task of parallelizing an algorithm for execution upon a multiprocessor built with the DSP.

  8. CRBLASTER: a fast parallel-processing program for cosmic ray rejection

    NASA Astrophysics Data System (ADS)

    Mighell, Kenneth J.

    2008-08-01

    Many astronomical image-analysis programs are based on algorithms that can be described as being embarrassingly parallel, where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. A major reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called crblaster which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. crblaster is written in C using the industry standard Message Passing Interface (MPI) library. Processing a single 800×800 HST WFPC2 image takes 1.87 seconds using 4 processes on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with 4 processors is 82%. The code can be used as a software framework for easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms; the biggest required modification is the replacement of the core image-processing function with an alternative image-analysis function based on a single-processor algorithm. I describe the design, implementation and performance of the program.

  9. Developing Software to Use Parallel Processing Effectively

    DTIC Science & Technology

    1988-10-01

    [Abstract garbled in the source record. Recoverable fragments list design considerations including granularity (grain size), the potential for interleaving processing and communication, and the memory access method (global versus local), and discuss a generic procedure G that works on two types S1 and S2 while depending only on some small common part of them, for example a sort routine; the fragments note that building such a hierarchy is problematical when S1 and S2 do not have exactly the operations G requires.]

  10. High power parallel ultrashort pulse laser processing

    NASA Astrophysics Data System (ADS)

    Gillner, Arnold; Gretzki, Patrick; Büsing, Lasse

    2016-03-01

    The class of ultra-short-pulse (USP) laser sources is used whenever high-precision, high-quality material processing is demanded. These laser sources deliver pulse durations in the range of ps to fs and are characterized by high peak intensities, leading to direct vaporization of the material with minimal thermal damage. With the availability of industrial laser sources with an average power of up to 1000 W, the main challenge consists of effective energy distribution and deposition. Using lasers with high repetition rates in the MHz region can cause thermal issues like overheating, melt production and low ablation quality. In this paper, we discuss different approaches to multibeam processing for the utilization of high pulse energies. The combination of diffractive optics and a conventional galvanometer scanner can be used for high-throughput laser ablation, but is limited in optical quality. We show which applications can benefit from this hybrid optic and which improvements in productivity are expected. In addition, the optical limitations of the system are compiled, in order to evaluate the suitability of this approach for any given application.

  11. An efficient parallel algorithm for matrix-vector multiplication

    SciTech Connect

    Hendrickson, B.; Leland, R.; Plimpton, S.

    1993-03-01

    The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log p), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
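
    For contrast with the hypercube-specific scheme, the sketch below shows the simplest distributed-memory version of the same kernel: a row-block decomposition in MPI, assuming for brevity that n divides evenly among the ranks. It illustrates the communication and computation structure only and is not the paper's algorithm.

    #include <mpi.h>
    #include <stdlib.h>

    /* Row-block parallel y = A*x: each rank owns nloc contiguous rows of
     * A (dense, row-major) and the matching slice of x; an MPI_Allgather
     * assembles the full x before the local multiply. Assumes n is a
     * multiple of the number of ranks, so nloc is equal everywhere. */
    void matvec(const double *Aloc, const double *xloc, double *yloc,
                int n, int nloc, MPI_Comm comm)
    {
        double *x = malloc(n * sizeof *x);
        MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);
        for (int i = 0; i < nloc; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += Aloc[i * n + j] * x[j];
            yloc[i] = s;
        }
        free(x);
    }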

  12. Processing data communications events by awakening threads in parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2016-03-15

    Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
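
    The wait-and-awaken cycle described in the claim maps naturally onto a condition variable. The sketch below is a hypothetical rendering of that pattern with POSIX threads, not the actual PAMI implementation or API; the type and function names are invented for illustration.

    #include <pthread.h>

    /* Hypothetical event context: an advance thread sleeps when no
     * events are pending and is awakened when one arrives. Initialize
     * the fields with PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER
     * and pending = 0 before use. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        int             pending;          /* count of actionable events */
    } Context;

    void advance(Context *c)
    {
        pthread_mutex_lock(&c->lock);
        while (c->pending == 0)           /* nothing actionable: wait */
            pthread_cond_wait(&c->ready, &c->lock);
        c->pending--;                     /* ...process one event here... */
        pthread_mutex_unlock(&c->lock);
    }

    void post_event(Context *c)           /* called when an event arrives */
    {
        pthread_mutex_lock(&c->lock);
        c->pending++;
        pthread_cond_signal(&c->ready);   /* awaken the waiting thread */
        pthread_mutex_unlock(&c->lock);
    }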

  13. Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

    1996-01-01

    This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamic and aeroacoustic properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP-2. The resultant far-field acoustic field is analyzed with state-of-the-art audio and video rendering of the propagating acoustic signals.

  14. Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing

    NASA Technical Reports Server (NTRS)

    Fricker, David M.

    1997-01-01

    The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.

  15. Highly efficient spatial data filtering in parallel using the opensource library CPPPO

    NASA Astrophysics Data System (ADS)

    Municchi, Federico; Goniva, Christoph; Radl, Stefan

    2016-10-01

    CPPPO is a compilation of parallel data processing routines developed with the aim to create a library for "scale bridging" (i.e. connecting different scales by means of closure models) in a multi-scale approach. CPPPO features a number of parallel filtering algorithms designed for use with structured and unstructured Eulerian meshes, as well as Lagrangian data sets. In addition, data can be processed on the fly, allowing the collection of relevant statistics without saving individual snapshots of the simulation state. Our library is provided with an interface to the widely-used CFD solver OpenFOAM®, and can be easily connected to any other software package via interface modules. Also, we introduce a novel, extremely efficient approach to parallel data filtering, and show that our algorithms scale super-linearly on multi-core clusters. Furthermore, we provide a guideline for choosing the optimal Eulerian cell selection algorithm depending on the number of CPU cores used. Finally, we demonstrate the accuracy and the parallel scalability of CPPPO in a showcase focusing on heat and mass transfer from a dense bed of particles.

  16. Efficient solid state NMR powder simulations using SMP and MPP parallel computation

    NASA Astrophysics Data System (ADS)

    Kristensen, Jørgen Holm; Farnan, Ian

    2003-04-01

    Methods for parallel simulation of solid state NMR powder spectra are presented for both shared and distributed memory parallel supercomputers. For shared memory architectures the performance of simulation programs implementing the OpenMP application programming interface is evaluated. It is demonstrated that the design of correct and efficient shared memory parallel programs is difficult as the performance depends on data locality and cache memory effects. The distributed memory parallel programming model is examined for simulation programs using the MPI message passing interface. The results reveal that both shared and distributed memory parallel computation are very efficient with an almost perfect application speedup and may be applied to the most advanced powder simulations.
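
    The orientation average at the heart of a powder simulation is exactly the kind of loop both programming models parallelize. The OpenMP sketch below is illustrative only: simulate_one() is a hypothetical stand-in for a single-crystallite lineshape routine, and the explicit per-thread accumulator addresses the data-locality concern the abstract raises.

    #include <omp.h>

    #define NBINS 2048

    /* Hypothetical stand-in for a single-orientation lineshape. */
    static void simulate_one(int orient, double spec[NBINS])
    {
        int peak = (orient * 37) % NBINS;  /* fake orientation-dependent shift */
        for (int b = 0; b < NBINS; b++) spec[b] = 0.0;
        spec[peak] = 1.0;
    }

    /* Powder averaging is embarrassingly parallel over crystallite
     * orientations; each thread sums into its own local spectrum and
     * the partial sums are merged once at the end. */
    void powder_spectrum(int norient, double spec[NBINS])
    {
        for (int b = 0; b < NBINS; b++) spec[b] = 0.0;

        #pragma omp parallel
        {
            double local[NBINS] = { 0.0 };     /* per-thread accumulator */
            double tmp[NBINS];
            #pragma omp for schedule(dynamic)
            for (int o = 0; o < norient; o++) {
                simulate_one(o, tmp);
                for (int b = 0; b < NBINS; b++) local[b] += tmp[b];
            }
            #pragma omp critical               /* merge per-thread sums */
            for (int b = 0; b < NBINS; b++) spec[b] += local[b];
        }
        for (int b = 0; b < NBINS; b++) spec[b] /= norient;
    }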

  17. Computational efficiency of parallel combinatorial OR-tree searches

    NASA Technical Reports Server (NTRS)

    Li, Guo-Jie; Wah, Benjamin W.

    1990-01-01

    The performance of parallel combinatorial OR-tree searches is analytically evaluated. This performance depends on the complexity of the problem to be solved, the error allowance function, the dominance relation, and the search strategies. The exact performance may be difficult to predict due to the nondeterminism and anomalies of parallelism. The authors derive the performance bounds of parallel OR-tree searches with respect to the best-first, depth-first, and breadth-first strategies, and verify these bounds by simulation. They show that a near-linear speedup can be achieved with respect to a large number of processors for parallel OR-tree searches. Using the bounds developed, the authors derive sufficient conditions for assuring that parallelism will not degrade performance and necessary conditions for allowing parallelism to have a speedup greater than the ratio of the numbers of processors. These bounds and conditions provide the theoretical foundation for determining the number of processors required to assure a near-linear speedup.

  18. Fault Tolerant Statistical Signal Processing Algorithms for Parallel Architectures.

    DTIC Science & Technology

    2014-09-26

    [Record garbled; only fields from the scanned report form survive. Title: Fault Tolerant Statistical Signal Processing Algorithms for Parallel Architectures; Johns Hopkins Univ., Baltimore, MD, Dept. of Electrical (truncated in source). Keywords: fault tolerance, signal processing, parallel architecture.]

  19. Parallel processing implementation for the coupled transport of photons and electrons using OpenMP

    NASA Astrophysics Data System (ADS)

    Doerner, Edgardo

    2016-05-01

    In this work the use of OpenMP to implement the parallel processing of the Monte Carlo (MC) simulation of the coupled transport of photons and electrons is presented. This implementation was carried out using a modified EGSnrc platform which enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the development tools available in the Intel Parallel Studio XE 2015 (XE2015). The performance study of this new implementation was carried out on a desktop PC with a multi-core CPU, taking as a reference the performance of the original platform. The results were satisfactory, both in terms of scalability and parallelization efficiency.
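
    Independent particle histories make MC transport a natural fit for OpenMP, provided each thread draws from its own random-number stream. The toy below is not EGSnrc code: it transports photons through a slab with attenuation only, uses POSIX rand_r() for a per-thread RNG state, and checks the result against the analytic Beer-Lambert answer.

    #include <math.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>   /* rand_r (POSIX), RAND_MAX */

    int main(void)
    {
        const long nhist = 10 * 1000 * 1000;  /* photon histories */
        const double mu = 2.0, depth = 1.0;   /* attenuation, slab depth */
        long through = 0;

        #pragma omp parallel reduction(+:through)
        {
            /* per-thread seed: histories stay statistically independent */
            unsigned int seed = 1234u + 7u * omp_get_thread_num();
            #pragma omp for
            for (long i = 0; i < nhist; i++) {
                double u = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 2.0);
                double path = -log(u) / mu;   /* sampled free path */
                if (path > depth) through++;
            }
        }
        printf("transmitted fraction = %f (expect %f)\n",
               (double)through / nhist, exp(-mu * depth));
        return 0;
    }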

  20. FPGA-Based Filterbank Implementation for Parallel Digital Signal Processing

    NASA Technical Reports Server (NTRS)

    Berner, Stephan; DeLeon, Phillip

    1999-01-01

    One approach to parallel digital signal processing decomposes a high bandwidth signal into multiple lower bandwidth (rate) signals by an analysis bank. After processing, the subband signals are recombined into a fullband output signal by a synthesis bank. This paper describes an implementation of the analysis and synthesis banks using Field Programmable Gate Arrays (FPGAs).

  1. The science of computing - The evolution of parallel processing

    NASA Technical Reports Server (NTRS)

    Denning, P. J.

    1985-01-01

    The present paper is concerned with the approaches to be employed to overcome the set of limitations in software technology which impedes currently an effective use of parallel hardware technology. The process required to solve the arising problems is found to involve four different stages. At the present time, Stage One is nearly finished, while Stage Two is under way. Tentative explorations are beginning on Stage Three, and Stage Four is more distant. In Stage One, parallelism is introduced into the hardware of a single computer, which consists of one or more processors, a main storage system, a secondary storage system, and various peripheral devices. In Stage Two, parallel execution of cooperating programs on different machines becomes explicit, while in Stage Three, new languages will make parallelism implicit. In Stage Four, there will be very high level user interfaces capable of interacting with scientists at the same level of abstraction as scientists do with each other.

  2. Automatic Mapping Of Large Signal Processing Systems To A Parallel Machine

    NASA Astrophysics Data System (ADS)

    Printz, Harry; Kung, H. T.; Mummert, Todd; Scherer, Paul M.

    1989-12-01

    Since the spring of 1988, Carnegie Mellon University and the Naval Air Development Center have been working together to implement several large signal processing systems on the Warp parallel computer. In the course of this work, we have developed a prototype of a software tool that can automatically and efficiently map signal processing systems to distributed-memory parallel machines, such as Warp. We have used this tool to produce Warp implementations of small test systems. The automatically generated programs compare favorably with hand-crafted code. We believe this tool will be a significant aid in the creation of high speed signal processing systems. We assume that signal processing systems have the following characteristics: • They can be described by directed graphs of computational tasks; these graphs may contain thousands of task vertices. • Some tasks can be parallelized in a systolic or data-partitioned manner, while others cannot be parallelized at all. • The side effects of each task, if any, are limited to changes in local variables. • Each task has a data-independent execution time bound, which may be expressed as a function of the way it is parallelized, and the number of processors it is mapped to. In this paper we describe techniques to automatically map such systems to Warp-like parallel machines. We identify and address key issues in gracefully combining different parallel programming styles, in allocating processor, memory and communication bandwidth, and in generating and scheduling efficient parallel code. When iWarp, the VLSI version of the Warp machine, becomes available in 1990, we will extend this tool to generate efficient code for very large applications, which may require as many as 3000 iWarp processors, with an aggregate peak performance of 60 gigaflops.

  3. Efficient parallel architecture for highly coupled real-time linear system applications

    NASA Technical Reports Server (NTRS)

    Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo

    1988-01-01

    A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.

  4. Latency and bandwidth considerations in parallel robotics image processing

    SciTech Connect

    Webb, J.A.

    1993-12-31

    Parallel image processing for robotics applications differs in a fundamental way from parallel scientific computing applications: the problem size is fixed, and latency requirements are tight. This brings Amdahl's law into effect with full force, so that message-passing latency and bandwidth severely restrict performance. In this paper the authors examine an application from this domain, stereo image processing, which has been implemented in Adapt, a niche language for parallel image processing implemented on the Carnegie Mellon-Intel Corporation iWarp. High performance has been achieved for this application. They show how an I/O building block approach on iWarp achieved this, and then examine the implications of this performance for more traditional machines that do not have iWarp's rich I/O primitive set.
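
    Amdahl's law for this fixed-problem-size regime is a one-line formula, and a quick calculation shows how sharply overhead caps the benefit of adding processors. In the snippet below the serial fraction s is taken to include the non-overlappable communication cost; the numbers are illustrative, not from the paper.

    #include <stdio.h>

    /* Amdahl's law for a fixed-size problem: with serial fraction s,
     * speedup on p processors is 1 / (s + (1 - s)/p). Communication
     * latency effectively inflates s, which is the abstract's point. */
    static double amdahl(double s, int p)
    {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void)
    {
        for (int p = 1; p <= 64; p *= 2)
            printf("p=%2d  s=0.05 -> %5.2fx   s=0.20 -> %5.2fx\n",
                   p, amdahl(0.05, p), amdahl(0.20, p));
        return 0;
    }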

  5. Efficient Parallel Learning of Hidden Markov Chain Models on SMPs

    NASA Astrophysics Data System (ADS)

    Li, Lei; Fu, Bin; Faloutsos, Christos

    Quad-core CPUs have become a common desktop configuration in today's offices. The increasing number of processors on a single chip opens new opportunities for parallel computing. Our goal is to make use of multi-core as well as multi-processor architectures to speed up large-scale data mining algorithms. In this paper, we present a general parallel learning framework, Cut-And-Stitch, for training hidden Markov chain models. In particular, we propose two model-specific variants, CAS-LDS for learning linear dynamical systems (LDS) and CAS-HMM for learning hidden Markov models (HMM). Our main contribution is a novel method to handle the data dependencies due to the chain structure of hidden variables, so as to parallelize the EM-based parameter learning algorithm. We implement CAS-LDS and CAS-HMM using OpenMP on two supercomputers and a quad-core commercial desktop. The experimental results show that parallel algorithms using Cut-And-Stitch achieve comparable accuracy and almost linear speedups over the traditional serial version.

  6. A single user efficiency measure for evaluation of parallel or pipeline computer architectures

    NASA Technical Reports Server (NTRS)

    Jones, W. P.

    1978-01-01

    A precise statement is developed of the relationship between sequential computation at one rate; parallel or pipeline computation at a much higher rate; the data movement rate between levels of memory; the fraction of inherently sequential operations or data that must be processed sequentially; the fraction of data to be moved that cannot be overlapped with computation; and the relative computational complexity of the algorithms for the two processes, scalar and vector. The relationship should be applied to the multirate processes that obtain in the employment of various new or proposed computer architectures for computational aerodynamics. The relationship, an efficiency measure that the single user of the computer system perceives, argues strongly in favor of separating scalar and vector processes, sometimes referred to as loosely coupled processes, to achieve optimum use of hardware.

  7. Mapping Pixel Windows To Vectors For Parallel Processing

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.

    1996-01-01

    Mapping performed by matrices of transistor switches. Arrays of transistor switches devised for use in forming simultaneous connections from square subarray (window) of n x n pixels within electronic imaging device containing np x np array of pixels to linear array of n² input terminals of electronic neural network or other parallel-processing circuit. Method helps to realize potential for rapidity in parallel processing for such applications as enhancement of images and recognition of patterns. In providing simultaneous connections, overcomes timing bottleneck of older multiplexing, serial-switching, and sample-and-hold methods.

  9. Parallel optical image processing with image-logic algebra and a polynomial approach.

    PubMed

    Bhattacharya, P

    1994-09-10

    An interesting relationship between an optical parallel-processing single-instruction-multiple-data generic language, called image-logic algebra, and a polynomial approach for processing binary images by electronic computers is shown. Using only two basic operations of the ILA, one can reformulate a number of algorithms developed earlier in the polynomial approach into algorithms in the ILA environment. Thus a large number of new algorithms for parallel optical processing of binary images can be developed in the ILA environment that are fast and efficient.
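
    On electronic hardware, the closest everyday analogue of such image-wide logic operations is word-parallel bitwise processing of packed binary images, where one machine instruction updates 64 pixels at once. The sketch below is an illustrative analogy only, not the ILA itself; a translation followed by an AND, for instance, is the building block of binary erosion.

    #include <stdint.h>
    #include <string.h>

    /* A packed binary image: 64 pixels per word, WPR words per row.
     * Image-wide logic operations then run one word (64 pixels) at a
     * time. Sizes are fixed here for simplicity. */
    #define W   256
    #define H   256
    #define WPR (W / 64)

    typedef uint64_t Img[H][WPR];

    void img_and(const Img a, const Img b, Img out)  /* pixelwise AND */
    {
        for (int r = 0; r < H; r++)
            for (int w = 0; w < WPR; w++)
                out[r][w] = a[r][w] & b[r][w];
    }

    void img_shift_down(const Img a, Img out)  /* translate by one row */
    {
        memset(out[0], 0, sizeof out[0]);      /* top row becomes empty */
        for (int r = 1; r < H; r++)
            memcpy(out[r], a[r - 1], sizeof out[r]);
    }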

  10. Efficient assignment of the temperature set for Parallel Tempering

    NASA Astrophysics Data System (ADS)

    Guidetti, M.; Rolando, V.; Tripiccione, R.

    2012-02-01

    We propose a simple algorithm able to identify a set of temperatures for a Parallel Tempering Monte Carlo simulation, that maximizes the probability that the configurations drift across all temperature values, from the coldest to the hottest ones, and vice versa. The proposed algorithm starts from data gathered from relatively short Monte Carlo simulations and is straightforward to implement. We assess its effectiveness on a test case simulation of an Edwards-Anderson spin glass on a lattice of 12³ sites.
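
    For context, the sketch below shows the two standard ingredients such a procedure starts from: the Metropolis acceptance rule for swapping neighbouring replicas, and a geometric temperature ladder as the initial guess that the pilot-run data would then refine. It is a generic illustration, not the paper's algorithm.

    #include <math.h>
    #include <stdlib.h>

    /* Swap acceptance for neighbouring replicas: accept with probability
     * min(1, exp((beta_i - beta_j) * (E_i - E_j))), where beta = 1/T. */
    static int accept_swap(double beta_i, double beta_j,
                           double E_i, double E_j)
    {
        double p = exp((beta_i - beta_j) * (E_i - E_j));
        return (double)rand() / RAND_MAX < p;  /* p > 1: always accepted */
    }

    /* Geometric ladder between Tmin and Tmax (n >= 2), the usual
     * starting point before any adaptive tuning. */
    static void geometric_ladder(double Tmin, double Tmax, int n, double *T)
    {
        double r = pow(Tmax / Tmin, 1.0 / (n - 1));
        T[0] = Tmin;
        for (int i = 1; i < n; i++) T[i] = T[i - 1] * r;
    }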

  11. Recent development for the ITS code system: Parallel processing and visualization

    SciTech Connect

    Fan, W.C.; Turner, C.D.; Halbleib, J.A. Sr.; Kensek, R.P.

    1996-03-01

    A brief overview is given for two software developments related to the ITS code system. These developments provide parallel processing and visualization capabilities and thus allow users to perform ITS calculations more efficiently. Timing results and a graphical example are presented to demonstrate these capabilities.

  12. Partitioning Rectangular and Structurally Nonsymmetric Sparse Matrices for Parallel Processing

    SciTech Connect

    B. Hendrickson; T.G. Kolda

    1998-09-01

    A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices.

  13. Parallel processing of atmospheric chemistry calculations: Preliminary considerations

    SciTech Connect

    Elliott, S.; Jones, P.

    1995-01-01

    Global climate calculations are already saturating the class of modern vector supercomputers with only a few central processing units. Increased resolution and inclusion of routines to deal with biogeochemical portions of the terrestrial climate system will soon demand massively parallel approaches. The atmospheric photochemistry ensemble is intimately linked to climate through the trace greenhouse gases ozone and methane, and modules for representing it are being attached to global three-dimensional transport and GCM frameworks. Atmospheric kinetics involve dozens of highly interactive tracers and so will accentuate the need for parallel processing of earth system simulations. In the present text we lay some of the groundwork for addition of atmospheric kinetics packages to GCM and global scale atmospheric models on massively parallel computers. The discussion is tailored for consumption by the photochemical modelling community. After a review of numerical atmospheric chemistry methods, we examine how kinetics can be implemented on a parallel computer. We concentrate especially on data layout and flexibility and how these can be implemented in various programming models. We conclude that chemistry can be implemented rather easily within existing frameworks of several parallel atmospheric models. However, memory limitations may preclude high resolution studies of global chemistry.

  14. Implementation and efficiency analysis of parallel computation using OpenACC: a case study using flow field simulations

    NASA Astrophysics Data System (ADS)

    Zhang, Shanghong; Yuan, Rui; Wu, Yu; Yi, Yujun

    2016-01-01

    The Open Accelerator (OpenACC) application programming interface is a relatively new parallel computing standard. In this paper, particle-based flow field simulations are examined as a case study of OpenACC parallel computation. The parallel conversion process of the OpenACC standard is explained, and further, the performance of the flow field parallel model is analysed using different directive configurations and grid schemes. With careful implementation and optimisation of the data transportation in the parallel algorithm, a speedup factor of 18.26× is possible. In contrast, a speedup factor of just 11.77× was achieved with the conventional Open Multi-Processing (OpenMP) parallel mode on a 20-core computer. These results demonstrate that optimised feature settings greatly influence the degree of speedup, and models involving larger numbers of calculations exhibit greater efficiency and higher speedup factors. In addition, the OpenACC parallel mode is found to have good portability, making it easy to implement parallel computation from the original serial model.
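
    The flavour of such a directive-based conversion can be shown in a few lines. The fragment below is illustrative, not the authors' code: a much-simplified particle update (each particle carries its own velocity, whereas a real flow model would interpolate velocities from a grid) in which a data region keeps the arrays resident on the accelerator across time steps, the kind of explicit data-transport management the abstract credits for the speedup.

    /* Simplified particle advection with OpenACC: only directives are
     * added to the serial loop nest; the data region avoids re-copying
     * arrays on every time step. */
    void advect(float *x, float *y, const float *u, const float *v,
                int n, float dt, int nsteps)
    {
        #pragma acc data copy(x[0:n], y[0:n]) copyin(u[0:n], v[0:n])
        for (int s = 0; s < nsteps; s++) {
            #pragma acc parallel loop
            for (int i = 0; i < n; i++) {   /* independent particle updates */
                x[i] += dt * u[i];
                y[i] += dt * v[i];
            }
        }
    }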

  15. A parallel and incremental algorithm for efficient unique signature discovery on DNA databases.

    PubMed

    Lee, Hsiao Ping; Sheu, Tzu-Fang; Tang, Chuan Yi

    2010-03-16

    DNA signatures are distinct short nucleotide sequences that provide valuable information used for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA databases, and then apply the signatures to microarray experiments. Such discovery algorithms require setting some input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, suggestions about how to select proper factor values are rare, especially when an unfamiliar DNA database is used. In most cases, biologists typically select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, so that those meeting the requirements can be selected from the results, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures. This work proposes two discovery algorithms - the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. The algorithm finds new results by using previously discovered results as candidates, rather than by using the whole database. The PISD algorithm further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures.
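
    The primitive underneath any (l, d) signature search is a mismatch-tolerant occurrence test: a candidate of length l is unique only if it has no near-occurrence elsewhere in the database. The brute-force check below illustrates that definition only; the function name is invented, and real discovery algorithms replace this scan with indexed and incremental search.

    #include <string.h>

    /* Does `sig` (length l) occur anywhere in `seq` with at most d
     * mismatches? Early exit as soon as a window exceeds d. */
    static int occurs_within_d(const char *sig, int l,
                               const char *seq, int d)
    {
        int n = (int)strlen(seq);
        for (int i = 0; i + l <= n; i++) {
            int mm = 0;
            for (int j = 0; j < l && mm <= d; j++)
                mm += (seq[i + j] != sig[j]);
            if (mm <= d)
                return 1;                  /* near-match found */
        }
        return 0;
    }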

  16. Parallel processing optimization strategy based on MapReduce model in cloud storage environment

    NASA Astrophysics Data System (ADS)

    Cui, Jianming; Liu, Jiayi; Li, Qiuyan

    2017-05-01

    Currently, a large number of documents in cloud storage are transmitted by packaging them only after all packets have been received. In this store-and-forward procedure from the local transmitter to the server, packing and unpacking consume a great deal of time, and transmission efficiency is low. A new parallel processing algorithm is proposed to optimize the transmission mode: following the MapReduce model, the Mapper and Reducer mechanisms are executed in parallel using MPI technology. Simulation experiments on the Hadoop cloud computing platform show that this algorithm can not only accelerate the file transfer rate but also shorten the waiting time of the Reducer mechanism. It breaks through the traditional sequential transmission constraint and reduces storage coupling to improve transmission efficiency.

  17. Using Motivational Interviewing Techniques to Address Parallel Process in Supervision

    ERIC Educational Resources Information Center

    Giordano, Amanda; Clarke, Philip; Borders, L. DiAnne

    2013-01-01

    Supervision offers a distinct opportunity to experience the interconnection of counselor-client and counselor-supervisor interactions. One product of this network of interactions is parallel process, a phenomenon by which counselors unconsciously identify with their clients and subsequently present to their supervisors in a similar fashion…

  18. Parallel Processing of Objects in a Naming Task

    ERIC Educational Resources Information Center

    Meyer, Antje S.; Ouellet, Marc; Hacker, Christine

    2008-01-01

    The authors investigated whether speakers who named several objects processed them sequentially or in parallel. Speakers named object triplets, arranged in a triangle, in the order left, right, and bottom object. The left object was easy or difficult to identify and name. During the saccade from the left to the right object, the right object shown…

  19. Postscript: Parallel Distributed Processing in Localist Models without Thresholds

    ERIC Educational Resources Information Center

    Plaut, David C.; McClelland, James L.

    2010-01-01

    The current authors reply to a response by Bowers on a comment by the current authors on the original article. Bowers (2010) mischaracterizes the goals of parallel distributed processing (PDP) research--explaining performance on cognitive tasks is the primary motivation. More important, his claim that localist models, such as the interactive…

  2. The Extended Parallel Process Model: Illuminating the Gaps in Research

    ERIC Educational Resources Information Center

    Popova, Lucy

    2012-01-01

    This article examines constructs, propositions, and assumptions of the extended parallel process model (EPPM). Review of the EPPM literature reveals that its theoretical concepts are thoroughly developed, but the theory lacks consistency in operational definitions of some of its constructs. Out of the 12 propositions of the EPPM, a few have not…

  5. On the Optimality of Serial and Parallel Processing in the Psychological Refractory Period Paradigm: Effects of the Distribution of Stimulus Onset Asynchronies

    ERIC Educational Resources Information Center

    Miller, Jeff; Ulrich, Rolf; Rolke, Bettina

    2009-01-01

    Within the context of the psychological refractory period (PRP) paradigm, we developed a general theoretical framework for deciding when it is more efficient to process two tasks in serial and when it is more efficient to process them in parallel. This analysis suggests that a serial mode is more efficient than a parallel mode under a wide variety…

  6. Parallel Processing Method for Airborne Laser Scanning Data Using a PC Cluster and a Virtual Grid.

    PubMed

    Han, Soo Hee; Heo, Joon; Sohn, Hong Gyoo; Yu, Kiyun

    2009-01-01

    In this study, a parallel processing method using a PC cluster and a virtual grid is proposed for the fast processing of enormous amounts of airborne laser scanning (ALS) data. The method creates a raster digital surface model (DSM) by interpolating point data with inverse distance weighting (IDW), and produces a digital terrain model (DTM) by local minimum filtering of the DSM. To make a consistent comparison of performance between sequential and parallel processing approaches, the means of dealing with boundary data and of selecting interpolation centers were controlled for each processing node in the parallel approach. To test the speedup, efficiency and linearity of the proposed algorithm, actual ALS data up to 134 million points were processed with a PC cluster consisting of one master node and eight slave nodes. The results showed that parallel processing provides better performance when the computational overhead, the number of processors, and the data size become large. It was verified that the proposed algorithm is a linear time operation and that the products obtained by parallel processing are identical to those produced by sequential processing.

  7. Parallel Processing Method for Airborne Laser Scanning Data Using a PC Cluster and a Virtual Grid

    PubMed Central

    Han, Soo Hee; Heo, Joon; Sohn, Hong Gyoo; Yu, Kiyun

    2009-01-01

    In this study, a parallel processing method using a PC cluster and a virtual grid is proposed for the fast processing of enormous amounts of airborne laser scanning (ALS) data. The method creates a raster digital surface model (DSM) by interpolating point data with inverse distance weighting (IDW), and produces a digital terrain model (DTM) by local minimum filtering of the DSM. To make a consistent comparison of performance between sequential and parallel processing approaches, the means of dealing with boundary data and of selecting interpolation centers were controlled for each processing node in the parallel approach. To test the speedup, efficiency and linearity of the proposed algorithm, actual ALS data up to 134 million points were processed with a PC cluster consisting of one master node and eight slave nodes. The results showed that parallel processing provides better performance when the computational overhead, the number of processors, and the data size become large. It was verified that the proposed algorithm is a linear time operation and that the products obtained by parallel processing are identical to those produced by sequential processing. PMID:22574032
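
    The per-cell kernel that each slave node runs over its tile of the virtual grid is a standard IDW estimate. The sketch below shows only that kernel, under the assumption that the caller has already selected the nearby points for the cell; boundary handling and the virtual-grid bookkeeping described in the paper are omitted.

    #include <math.h>

    typedef struct { double x, y, z; } Pt;

    /* Inverse-distance-weighted height at grid node (gx, gy) from n
     * nearby ALS points: z = sum(z_i / d_i^p) / sum(1 / d_i^p). */
    double idw(const Pt *pts, int n, double gx, double gy, double power)
    {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = pts[i].x - gx, dy = pts[i].y - gy;
            double d2 = dx * dx + dy * dy;
            if (d2 == 0.0)
                return pts[i].z;          /* grid node hits a data point */
            double w = 1.0 / pow(sqrt(d2), power);
            num += w * pts[i].z;
            den += w;
        }
        return num / den;
    }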

  8. Efficient separations & processing crosscutting program

    SciTech Connect

    1996-08-01

    The Efficient Separations and Processing Crosscutting Program (ESP) was created in 1991 to identify, develop, and perfect chemical and physical separations technologies and chemical processes which treat wastes and address environmental problems throughout the DOE complex. The ESP funds several multiyear tasks that address high-priority waste remediation problems involving high-level, low-level, transuranic, hazardous, and mixed (radioactive and hazardous) wastes. The ESP supports applied research and development (R & D) leading to the demonstration or use of these separations technologies by other organizations within the Department of Energy (DOE), Office of Environmental Management.

  9. The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce

    NASA Astrophysics Data System (ADS)

    Chen, Xi; Zhou, Liqing

    2015-12-01

    With the development of satellite remote sensing technology and the growth of remote sensing image data, traditional remote sensing image segmentation techniques cannot meet the processing and storage requirements of massive remote sensing imagery. This article applies cloud computing and parallel computing technology to the remote sensing image segmentation process, building a cheap and efficient computer cluster system that parallelizes the mean shift segmentation algorithm using the MapReduce model. This not only ensures the quality of remote sensing image segmentation but also improves segmentation speed, better meeting real-time requirements. The MapReduce-based parallelization of the mean shift segmentation algorithm thus shows clear practical value.

  10. A novel parallel architecture for real-time image processing

    NASA Astrophysics Data System (ADS)

    Hu, Junhong; Zhang, Tianxu; Zhong, Sheng; Chen, Xujun

    2009-10-01

    A novel DSP/FPGA-based parallel architecture for real-time image processing is presented in this paper. DSPs are the main processing units, and FPGAs serve as logic units for the image interface protocol, image processing, image display, the synchronization communication protocol of the DSPs, and the DSPs' 422/485 reprogramming interface. The presented architecture is composed of two modules: the preprocessing module and the processing module; the latter is extendable for better performance. Modules are connected by LINK communication ports, whose LVDS protocol has anti-jamming capability, and DSP programs can be updated easily over 422/485 from a PC's serial port. Analysis and experimental results show that the prototype with the proposed parallel architecture has many promising characteristics, such as powerful computing capability and broad data transfer bandwidth, and is easy to extend and update.

  11. Efficient parallel algorithms for elastic plastic finite element analysis

    NASA Astrophysics Data System (ADS)

    Ding, K. Z.; Qin, Q.-H.; Cardew-Hall, M.; Kalyanasundaram, S.

    2008-03-01

    This paper presents our new development of parallel finite element algorithms for elastic-plastic problems. The proposed method is based on dividing the original structure under consideration into a number of substructures which are treated as isolated finite element models via the interface conditions. Throughout the analysis, each processor stores only the information relevant to its substructure and generates the local stiffness matrix. A parallel substructure-oriented preconditioned conjugate gradient method, combined with MR smoothing and a diagonal storage scheme, is employed to solve the linear systems of equations. After the displacements of the problem under consideration have been obtained, a substepping scheme is used to integrate the elastic-plastic stress-strain relations. The procedure outlined controls the error of the computed stress by choosing each substep size automatically according to a prescribed tolerance. The combination of these algorithms shows good speedup as the number of processors increases and makes possible the effective solution of 3D elastic-plastic problems whose size is much too large for a single workstation.

  12. Overview of parallel processing approaches to image and video compression

    NASA Astrophysics Data System (ADS)

    Shen, Ke; Cook, Gregory W.; Jamieson, Leah H.; Delp, Edward J., III

    1994-05-01

    In this paper we present an overview of techniques used to implement various image and video compression algorithms using parallel processing. Approaches used can largely be divided into four areas. The first is the use of special purpose architectures designed specifically for image and video compression. An example of this is the use of an array of DSP chips to implement a version of MPEG1. The second approach is the use of VLSI techniques. These include various chip sets for JPEG and MPEG1. The third approach is algorithm driven, in which the structure of the compression algorithm describes the architecture, e.g. pyramid algorithms. The fourth approach is the implementation of algorithms on high performance parallel computers. Examples of this approach are the use of a massively parallel computer such as the MasPar MP-1 or the use of a coarse-grained machine such as the Intel Touchstone Delta.

  13. Automating the parallel processing of fluid and structural dynamics calculations

    NASA Technical Reports Server (NTRS)

    Arpasi, Dale J.; Cole, Gary L.

    1987-01-01

    The NASA Lewis Research Center is actively involved in the development of expert system technology to assist users in applying parallel processing to computational fluid and structural dynamic analysis. The goal of this effort is to eliminate the necessity for the physical scientist to become a computer scientist in order to effectively use the computer as a research tool. Programming and operating software utilities have previously been developed to solve systems of ordinary nonlinear differential equations on parallel scalar processors. Current efforts are aimed at extending these capabilities to systems of partial differential equations, that describe the complex behavior of fluids and structures within aerospace propulsion systems. This paper presents some important considerations in the redesign, in particular, the need for algorithms and software utilities that can automatically identify data flow patterns in the application program and partition and allocate calculations to the parallel processors. A library-oriented multiprocessing concept for integrating the hardware and software functions is described.

  15. Parallel, semiparallel, and serial processing of visual hyperacuity

    NASA Astrophysics Data System (ADS)

    Fahle, Manfred W.

    1990-10-01

    Humans can discriminate between certain elementary stimulus features in parallel, i.e., simultaneously over the visual field. I present evidence that, in man, vernier misalignments in the hyperacuity range, i.e., below the photoreceptor diameter, can also be detected in parallel. This indicates that the visual system performs some form of spatial interpolation beyond the photoreceptor spacing simultaneously over the visual field. Vernier offsets are detected in parallel even when orientation cues are masked: deviation from straightness is an elementary feature of visual perception. However, the identification process that classifies each vernier in a stimulus as being offset to the right (versus to the left) is serial and has to scan the visual field sequentially if orientation cues are masked. Therefore, reaction times and thresholds in vernier acuity tasks increase with the number of verniers presented simultaneously if classification of different features is required. Furthermore, when approaching vernier threshold, simple vernier detection is no longer parallel but becomes partially serial, or semi-parallel.

  15. Cloud parallel processing of tandem mass spectrometry based proteomics data.

    PubMed

    Mohammed, Yassene; Mostovenko, Ekaterina; Henneman, Alex A; Marissen, Rob J; Deelder, André M; Palmblad, Magnus

    2012-10-05

    Data analysis in mass spectrometry based proteomics struggles to keep pace with the advances in instrumentation and the increasing rate of data acquisition. Analyzing this data involves multiple steps requiring diverse software, using different algorithms and data formats. The speed and performance of mass spectral search engines are continuously improving, although not necessarily quickly enough to meet the challenges posed by the volume of acquired data. Improving and parallelizing the search algorithms is one possibility; data decomposition presents another, simpler strategy for introducing parallelism. We describe a general method for parallelizing identification of tandem mass spectra using data decomposition that keeps the search engine intact and wraps the parallelization around it. We introduce two algorithms for decomposing mzXML files and recomposing resulting pepXML files. This makes the approach applicable to different search engines, including those relying on sequence databases and those searching spectral libraries. We use cloud computing to deliver the computational power and scientific workflow engines to interface and automate the different processing steps. We show how to leverage these technologies to achieve faster data analysis in proteomics and present three scientific workflows for parallel database as well as spectral library search, using our data decomposition programs together with X!Tandem and SpectraST.
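
    A minimal sketch of the data-decomposition idea, with a hypothetical run_search function standing in for the actual X!Tandem/SpectraST invocations and plain Python lists in place of the mzXML/pepXML handling:

      # Decompose the spectra, search each chunk with the unmodified
      # "engine", then merge the partial results.
      from concurrent.futures import ProcessPoolExecutor

      def run_search(chunk):
          # Stand-in: a real wrapper would write a chunk file, invoke the
          # search engine as a subprocess, and parse its output.
          return [(spectrum_id, "best_match") for spectrum_id in chunk]

      def decompose(spectra, n_chunks):
          # Interleaved decomposition keeps chunk workloads comparable.
          return [spectra[i::n_chunks] for i in range(n_chunks)]

      def parallel_search(spectra, n_chunks=8):
          with ProcessPoolExecutor() as pool:
              partial = pool.map(run_search, decompose(spectra, n_chunks))
          return [hit for part in partial for hit in part]   # recompose

      if __name__ == "__main__":
          print(len(parallel_search(list(range(1000)))))   # 1000 hits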

  16. Parallel-Processing Software for Correlating Stereo Images

    NASA Technical Reports Server (NTRS)

    Klimeck, Gerhard; Deen, Robert; Mcauley, Michael; DeJong, Eric

    2007-01-01

    A computer program implements parallel-processing algorithms for correlating images of terrain acquired by stereoscopic pairs of digital stereo cameras on an exploratory robotic vehicle (e.g., a Mars rover). Such correlations are used to create three-dimensional computational models of the terrain for navigation. In this program, the scene viewed by the cameras is segmented into subimages. Each subimage is assigned to one of a number of central processing units (CPUs) operating simultaneously.

  17. Parallel Attack and the Enemy’s Decision Making Process

    DTIC Science & Technology

    1998-04-01

    [Only indexing fragments of this report are available: footnote and bibliography entries citing JaJa, Joseph, An Introduction to Parallel Algorithms, Addison-Wesley, 1992; Baker, Louis; and J.F.C. Fuller, The Foundations of the Science of War, London, Hutchinson and Company, 1925; together with the remark that the bureaucratic and organizational process models take the same amount of computational time within the hierarchy.]

  18. Parallel Processing of Broad-Band PPM Signals

    NASA Technical Reports Server (NTRS)

    Gray, Andrew; Kang, Edward; Lay, Norman; Vilnrotter, Victor; Srinivasan, Meera; Lee, Clement

    2010-01-01

    A parallel-processing algorithm and a hardware architecture to implement the algorithm have been devised for timeslot synchronization in the reception of pulse-position-modulated (PPM) optical or radio signals. As in the cases of some prior algorithms and architectures for parallel, discrete-time, digital processing of signals other than PPM, an incoming broadband signal is divided into multiple parallel narrower-band signals by means of sub-sampling and filtering. The number of parallel streams is chosen so that the frequency content of the narrower-band signals is low enough to enable processing by relatively low-speed complementary metal oxide semiconductor (CMOS) electronic circuitry. The algorithm and architecture are intended to satisfy requirements for time-varying time-slot synchronization and post-detection filtering, with correction of timing errors independent of estimation of timing errors. They are also intended to afford flexibility for dynamic reconfiguration and upgrading. The architecture is implemented in a reconfigurable CMOS processor in the form of a field-programmable gate array. The algorithm and its hardware implementation incorporate three separate time-varying filter banks for three distinct functions: correction of sub-sample timing errors, post-detection filtering, and post-detection estimation of timing errors. The design of the filter bank for correction of timing errors, the method of estimating timing errors, and the design of a feedback-loop filter are governed by a host of parameters, the most critical one, with regard to processing very broadband signals with CMOS hardware, being the number of parallel streams (equivalently, the rate-reduction parameter).
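
    The generic front end described above can be sketched briefly. The code below shows only the sub-sampling of a broadband stream into parallel lower-rate streams, with a placeholder moving-average filter rather than the paper's three time-varying filter banks; the stream count is an assumption.

      # Split a broadband sample stream into m parallel narrower-band
      # streams so each can be processed by slower hardware.
      import numpy as np

      def polyphase_split(x, m):
          n = len(x) - len(x) % m
          # Stream k gets samples k, k+m, k+2m, ... (rate reduced by m).
          return [x[k:n:m] for k in range(m)]

      def lowrate_filter(stream, taps=4):
          kernel = np.ones(taps) / taps        # placeholder filter
          return np.convolve(stream, kernel, mode="same")

      x = np.random.randn(1 << 14)             # broadband input samples
      streams = polyphase_split(x, m=16)       # 16 parallel streams
      filtered = [lowrate_filter(s) for s in streams]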

  19. Efficient relaxed-Jacobi smoothers for multigrid on parallel computers

    NASA Astrophysics Data System (ADS)

    Yang, Xiang; Mittal, Rajat

    2017-03-01

    In this Technical Note, we present a family of Jacobi-based multigrid smoothers suitable for the solution of discretized elliptic equations. These smoothers are based on the idea of scheduled-relaxation Jacobi proposed recently by Yang & Mittal (2014) [18] and employ two or three successive relaxed Jacobi iterations with relaxation factors derived so as to maximize the smoothing property of these iterations. The performance of these new smoothers, measured in terms of convergence acceleration and computational workload, is assessed for multi-domain implementations typical of parallelized solvers, and compared to the lexicographic point Gauss-Seidel smoother. The tests include the geometric multigrid method on structured grids as well as the algebraic multigrid method on unstructured grids. The tests demonstrate that unlike Gauss-Seidel, the convergence of these Jacobi-based smoothers is unaffected by domain decomposition, and furthermore, they outperform the lexicographic Gauss-Seidel by factors that increase with domain partition count.
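
    A minimal sketch of a two-step relaxed-Jacobi smoother for a 2D Poisson problem is shown below. The relaxation factors are illustrative assumptions; the paper derives factors chosen specifically to maximize the smoothing property.

      # Two successive relaxed Jacobi sweeps (an over- and an
      # under-relaxed step) on the 5-point Laplacian.
      import numpy as np

      def relaxed_jacobi_sweep(u, f, h, omega):
          unew = u.copy()
          unew[1:-1, 1:-1] = (1 - omega) * u[1:-1, 1:-1] + omega * 0.25 * (
              u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
              - h * h * f[1:-1, 1:-1])
          return unew

      def smooth(u, f, h, factors=(2.0, 0.8)):   # assumed two-factor schedule
          for omega in factors:
              u = relaxed_jacobi_sweep(u, f, h, omega)
          return u

      n = 65
      h = 1.0 / (n - 1)
      u = smooth(np.zeros((n, n)), np.ones((n, n)), h)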

  1. Study of parallel efficiency in message passing environments

    SciTech Connect

    Hanebutte, U.R.; Tatsumi, Masahiro

    1996-03-01

    A benchmark test using the Message Passing Interface (MPI, an emerging standard for writing message passing programs) has been developed to study parallel performance in message passing environments. The test is comprised of a computational task of independent calculations followed by a round-robin data communication step. Performance data as a function of computational granularity and message passing requirements are presented for the IBM SPx at Argonne National Laboratory and for a cluster of quasi-dedicated SUN SPARCstation 20s. In the latter portion of the paper, a widely accepted communication cost model combined with Amdahl's law is used to obtain performance predictions for unevenly distributed computational workloads.
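
    The structure of such a benchmark is easy to reconstruct in outline. The sketch below is a hedged approximation using mpi4py (the original used MPI directly), with arbitrary work and message sizes:

      # Independent computation followed by a round-robin (ring) exchange.
      # Run with, e.g.: mpiexec -n 4 python ring_benchmark.py
      import numpy as np
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank, size = comm.Get_rank(), comm.Get_size()

      work = 1 << 20                  # computational granularity (tunable)
      msg = np.ones(1 << 16)          # message size (tunable)

      t0 = MPI.Wtime()
      local = np.sin(np.arange(work)).sum()      # independent calculation
      recv = comm.sendrecv(msg,                  # round-robin data step
                           dest=(rank + 1) % size,
                           source=(rank - 1) % size)
      elapsed = MPI.Wtime() - t0
      print(f"rank {rank}: {elapsed:.4f} s, checksum {local + recv.sum():.2f}")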

  2. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, J.

    1999-01-01

    A new atmospheric objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 1 x 1 lat-lon grid with 18 levels of heights and winds and 10 levels of moisture) using 120,000 observations in 17 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g., MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system is totally portable and can run on several different architectures at once. In addition, the system can easily scale up to 100 or more CPUs. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from 1 to 32 CPUs is 18%. In addition, the analysis results are identical regardless of the number of processors used. This system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. Static tests with a 2 x 2.5 resolution version of this system showed its analysis increments are comparable to the latest NASA operational system, including maintenance of mass-wind balance. Results from several months of cycling tests in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (O-F statistics) as the current operational system.

  3. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, James G.

    1999-01-01

    A new objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 2 x 2.5 lat-lon grid with 20 levels of heights and winds and 10 levels of moisture) using 120,000 observations in less than 3 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g., MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system is totally portable and can run on several different architectures at once. In addition, the system can easily scale up to 100 or more CPUs. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from 1 to 32 CPUs is 18%. In addition, the analysis results are identical regardless of the number of processors used. This system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. It also includes a new quality control (buddy check) system. Static tests with the system showed its analysis increments are comparable to the latest NASA operational system, including maintenance of mass-wind balance. Results from a 2-month cycling test in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (O-F statistics) throughout the entire two months.

  4. CRBLASTER: A Fast Parallel-Processing Program for Cosmic Ray Rejection in Space-Based Observations

    NASA Astrophysics Data System (ADS)

    Mighell, K.

    Many astronomical image analysis tasks are based on algorithms that can be described as embarrassingly parallel, where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. One reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called CRBLASTER which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. CRBLASTER is written in C using the industry-standard Message Passing Interface library. Processing a single 800 x 800 Hubble Space Telescope Wide-Field Planetary Camera 2 (WFPC2) image takes 1.9 seconds using 4 processors on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 cores is 82%. The code has been designed to be used as a software framework for the easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; all that needs to be done is to replace the core image-processing task (in this case the C function that performs the L.A.Cosmic algorithm) with an alternative image-analysis task based on a single-processor algorithm. I describe the design and implementation of the program and then discuss how it could be used for time-critical analyses, such as those involved in space surveillance, or for complex calibration tasks that are part of the pipeline processing of images from large focal plane arrays.
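
    The pluggable-kernel pattern the abstract describes might look like the Python analogue below. CRBLASTER itself is written in C with MPI, so this is an illustration of the design rather than its code, and median_kernel is a hypothetical stand-in for the L.A.Cosmic step.

      # Fixed parallel scaffolding; only the per-subimage kernel changes.
      import numpy as np
      from multiprocessing import Pool

      def median_kernel(subimage):
          # Replaceable core task (the role L.A.Cosmic plays in CRBLASTER).
          return np.full_like(subimage, np.median(subimage))

      def run_framework(image, kernel, n_strips=4):
          strips = np.array_split(image, n_strips, axis=0)
          with Pool(n_strips) as pool:
              return np.vstack(pool.map(kernel, strips))

      if __name__ == "__main__":
          out = run_framework(np.random.rand(800, 800), median_kernel)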

  5. Efficiency of cellular information processing

    NASA Astrophysics Data System (ADS)

    Barato, Andre C.; Hartich, David; Seifert, Udo

    2014-10-01

    We show that a rate of conditional Shannon entropy reduction, characterizing the learning of an internal process about an external process, is bounded by the thermodynamic entropy production. This approach allows for the definition of an informational efficiency that can be used to study cellular information processing. We analyze three models of increasing complexity inspired by the Escherichia coli sensory network, where the external process is an external ligand concentration jumping between two values. We start with a simple model for which ATP must be consumed so that a protein inside the cell can learn about the external concentration. With a second model for a single receptor we show that the rate at which the receptor learns about the external environment can be nonzero even without any dissipation inside the cell since chemical work done by the external process compensates for this learning rate. The third model is more complete, also containing adaptation. For this model we show inter alia that a bacterium in an environment that changes at a very slow time-scale is quite inefficient, dissipating much more than it learns. Using the concept of a coarse-grained learning rate, we show for the model with adaptation that while the activity learns about the external signal the option of changing the methylation level increases the concentration range for which the learning rate is substantial.

  6. Airbreathing Propulsion System Analysis Using Multithreaded Parallel Processing

    NASA Technical Reports Server (NTRS)

    Schunk, Richard Gregory; Chung, T. J.; Rodriguez, Pete (Technical Monitor)

    2000-01-01

    In this paper, parallel processing is used to analyze the mixing and combustion behavior of hypersonic flow. Preliminary work for a sonic transverse hydrogen jet injected from a slot into a Mach 4 airstream in a two-dimensional duct combustor has been completed [Moon and Chung, 1996]. Our aim is to extend this work to a three-dimensional domain using multithreaded domain decomposition parallel processing based on the flowfield-dependent variation theory. Numerical simulations of chemically reacting flows are difficult because of the strong interactions between the turbulent hydrodynamic and chemical processes. The algorithm must provide an accurate representation of the flowfield, since unphysical flowfield calculations will lead to the faulty loss or creation of species mass fraction, or even premature ignition, which in turn alters the flowfield information. Another difficulty arises from the disparity in time scales between the flowfield and chemical reactions, which may require the use of finite rate chemistry. The situation is more complex when there is a disparity in the length scales involved in turbulence. In order to cope with these complicated physical phenomena, it is our plan to utilize the flowfield-dependent variation theory mentioned above, facilitated by large eddy simulation. Undoubtedly, the proposed computation requires the most sophisticated computational strategies. The multithreaded domain decomposition parallel processing will be necessary in order to reduce both computational time and storage. Without special treatments involved in computer engineering, our attempt to analyze airbreathing combustion appears to be difficult, if not impossible.

  7. Visual analysis of inter-process communication for large-scale parallel computing.

    PubMed

    Muelder, Chris; Gygi, Francois; Ma, Kwan-Liu

    2009-01-01

    In serial computation, program profiling is often helpful for optimization of key sections of code. When moving to parallel computation, not only does the code execution need to be considered but also communication between the different processes which can induce delays that are detrimental to performance. As the number of processes increases, so does the impact of the communication delays on performance. For large-scale parallel applications, it is critical to understand how the communication impacts performance in order to make the code more efficient. There are several tools available for visualizing program execution and communications on parallel systems. These tools generally provide either views which statistically summarize the entire program execution or process-centric views. However, process-centric visualizations do not scale well as the number of processes gets very large. In particular, the most common representation of parallel processes is a Gantt chart with a row for each process. As the number of processes increases, these charts can become difficult to work with and can even exceed screen resolution. We propose a new visualization approach that affords more scalability and then demonstrate it on systems running with up to 16,384 processes.

  8. An efficient parallel implementation of explicit multirate Runge–Kutta schemes for discontinuous Galerkin computations

    SciTech Connect

    Seny, Bruno; Lambrechts, Jonathan; Toulorge, Thomas; Legat, Vincent; Remacle, Jean-François

    2014-01-01

    Although explicit time integration schemes require small computational efforts per time step, their efficiency is severely restricted by their stability limits. Indeed, the multi-scale nature of some physical processes combined with highly unstructured meshes can lead some elements to impose a severely small stable time step for a global problem. Multirate methods offer a way to increase the global efficiency by gathering grid cells in appropriate groups under local stability conditions. These methods are well suited to the discontinuous Galerkin framework. The parallelization of the multirate strategy is challenging because grid cells have different workloads. The computational cost is different for each sub-time step depending on the elements involved, and a classical partitioning strategy is no longer adequate. In this paper, we propose a solution that makes use of multi-constraint mesh partitioning. It tends to minimize the inter-processor communications, while ensuring that the workload is almost equally shared by every computer core at every stage of the algorithm. Particular attention is given to the simplicity of the parallel multirate algorithm while minimizing computational and communication overheads. Our implementation makes use of the MeTiS library for mesh partitioning and the Message Passing Interface for inter-processor communication. Performance analyses for two- and three-dimensional practical applications confirm that multirate methods preserve important computational advantages of explicit methods up to a significant number of processors.

  9. Imaging Subcellular Dynamics with Fast and Light-Efficient Volumetrically Parallelized Microscopy

    PubMed Central

    Dean, Kevin M.; Roudot, Philippe; Welf, Erik S.; Pohlkamp, Theresa; Garrelts, Gerard; Herz, Joachim; Fiolka, Reto

    2017-01-01

    In fluorescence microscopy, the serial acquisition of 2D images to form a 3D volume limits the maximum imaging speed. This is particularly evident when imaging adherent cells in a light-sheet fluorescence microscopy format, as their elongated morphologies require ~200 image planes per image volume. Here, by illuminating the specimen with three light-sheets, each independently detected, we present a light-efficient, crosstalk-free, and volumetrically parallelized 3D microscopy technique that is optimized for high-speed (up to 14 Hz) subcellular (300 nm lateral, 600 nm axial resolution) imaging of adherent cells. We demonstrate 3D imaging of intracellular processes, including cytoskeletal dynamics in single cell migration and collective wound healing for 1500 and 1000 time points, respectively. Further, we capture rapid biological processes, including trafficking of early endosomes with velocities exceeding 10 microns per second and calcium signaling in primary neurons. PMID:28944279

  10. Parallel Flux Tensor Analysis for Efficient Moving Object Detection

    DTIC Science & Technology

    2011-07-01

    [Only text fragments of this report are available:] Keywords: convolution, power efficient. From the introduction: real-time persistent moving object detection and target tracking for surveillance and situational awareness ... For programming the Cell/B.E., some tools for mapping serial code in a semi-automatic fashion are in development [22]. ... a version for HD video using a single PS3 Cell/B.E. processor that is faster than real time for a range of filter configurations and video frame sizes.

  11. Digital image processing using parallel computing based on CUDA technology

    NASA Astrophysics Data System (ADS)

    Skirnevskiy, I. P.; Pustovit, A. V.; Abdrashitova, M. O.

    2017-01-01

    This article describes the expediency of using a graphics processing unit (GPU) for big data processing in the context of digital image processing. It provides a short description of parallel computing technology and its usage in different areas, a definition of image noise, and a brief overview of some noise removal algorithms. It also describes some basic requirements that a noise removal algorithm should meet for application to computed tomography. It compares performance with and without the GPU, as well as with different shares of the work assigned to the CPU and GPU.

  12. Digital intermediate frequency QAM modulator using parallel processing

    DOEpatents

    Pao, Hsueh-Yuan [Livermore, CA; Tran, Binh-Nien [San Ramon, CA

    2008-05-27

    The digital intermediate frequency (IF) modulator applies to various modulation types and offers a simple, low-cost method to implement a high-speed digital IF modulator using field-programmable gate arrays (FPGAs). The architecture eliminates multipliers and sequential processing by storing the pre-computed modulated cosine and sine carriers in ROM look-up tables (LUTs). The high-speed input data stream is processed in parallel using the corresponding LUTs, which reduces the required processing rate, allowing the use of low-cost FPGAs.
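
    The look-up-table idea reduces modulation to indexing: every (symbol, carrier-phase) combination is precomputed once. In the sketch below the 16-QAM constellation, the samples-per-symbol count, and the fs/4 carrier are illustrative assumptions, not parameters from the patent.

      # Multiplier-free IF modulation via a precomputed LUT.
      import numpy as np

      SPS = 8                                   # carrier samples per symbol
      levels = np.array([-3, -1, 1, 3]) / 3.0
      const = (levels[:, None] + 1j * levels[None, :]).ravel()   # 16-QAM

      n = np.arange(SPS)
      carrier = np.exp(2j * np.pi * 0.25 * n)   # IF carrier at fs/4
      # LUT[s, k] = modulated waveform sample k for symbol s (ROM contents).
      LUT = (const[:, None] * carrier[None, :]).real

      def modulate(symbols):
          # Each symbol indexes its own LUT row; rows are generated
          # independently, so banks of symbols can be processed in parallel.
          return LUT[symbols].ravel()

      wave = modulate(np.random.randint(0, 16, 1024))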

  13. Semi-automatic process partitioning for parallel computation

    NASA Technical Reports Server (NTRS)

    Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John

    1988-01-01

    On current multiprocessor architectures one must carefully distribute data in memory in order to achieve high performance. Process partitioning is the operation of rewriting an algorithm as a collection of tasks, each operating primarily on its own portion of the data, to carry out the computation in parallel. A semi-automatic approach to process partitioning is considered in which the compiler, guided by advice from the user, automatically transforms programs into such an interacting task system. This approach is illustrated with a picture processing example written in BLAZE, which is transformed into a task system maximizing locality of memory reference.

  14. A dataflow analysis tool for parallel processing of algorithms

    NASA Technical Reports Server (NTRS)

    Jones, Robert L., III

    1993-01-01

    A graph-theoretic design process and software tool is presented for selecting a multiprocessing scheduling solution for a class of computational problems. The problems of interest are those that can be described using a dataflow graph and are intended to be executed repetitively on a set of identical parallel processors. Typical applications include signal processing and control law problems. Graph analysis techniques are introduced and shown to effectively determine performance bounds, scheduling constraints, and resource requirements. The software tool is shown to facilitate the application of the design process to a given problem.

  15. Applications of massively parallel computers in telemetry processing

    NASA Technical Reports Server (NTRS)

    El-Ghazawi, Tarek A.; Pritchard, Jim; Knoble, Gordon

    1994-01-01

    Telemetry processing refers to the reconstruction of full-resolution raw instrumentation data, with the artifacts of space and ground recording and transmission removed. Being the first processing phase of satellite data, this process is also referred to as level-zero processing. This study is aimed at investigating the use of massively parallel computing technology in providing level-zero processing to spaceflights that adhere to the recommendations of the Consultative Committee on Space Data Systems (CCSDS). The workload characteristics of level-zero processing are used to identify processing requirements in high-performance computing systems. An example of level-zero functions on a SIMD MPP, such as the MasPar, is discussed. The requirements in this paper are based in part on the Earth Observing System (EOS) Data and Operation System (EDOS).

  16. Transactional memories: A new abstraction for parallel processing

    SciTech Connect

    Fasel, J.H.; Lubeck, O.M.; Agrawal, D.; Bruno, J.L.; El Abbadi, A.

    1997-12-01

    This is the final report of a three-year, Laboratory Directed Research and Development (LDRD) project at Los Alamos National Laboratory (LANL). Current distributed memory multiprocessor computer systems make the development of parallel programs difficult. From a programmer's perspective, it would be most desirable if the underlying hardware and software could provide the programming abstraction commonly referred to as sequential consistency--a single address space and multiple threads; but enforcement of sequential consistency limits opportunities for architectural and operating system performance optimizations, leading to poor performance. Recently, Herlihy and Moss have introduced a new abstraction called transactional memories for parallel programming. The programming model is shared memory with multiple threads. However, data consistency is obtained through the use of transactions rather than mutual exclusion based on locking. The transaction approach permits the underlying system to exploit the potential parallelism in transaction processing. The authors explore the feasibility of designing parallel programs using the transaction paradigm for data consistency and a barrier type of thread synchronization.
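
    A toy sketch of the transactional idea, reduced to a single shared cell with one version counter: read a snapshot, compute outside any lock, and commit only if no other transaction intervened, retrying on conflict. Real transactional memories track whole read/write sets; this conveys only the flavor of the abstraction.

      import threading

      class TransactionalCell:
          def __init__(self, value=0):
              self._value, self._version = value, 0
              self._commit_lock = threading.Lock()   # guards commit only

          def transact(self, fn):
              while True:                      # retry on conflict
                  snap_ver = self._version     # read version first...
                  snap_val = self._value       # ...then the value
                  new_val = fn(snap_val)       # compute outside the lock
                  with self._commit_lock:
                      if self._version == snap_ver:   # nobody intervened
                          self._value = new_val
                          self._version = snap_ver + 1
                          return new_val
                  # otherwise another transaction committed first; retry

      cell = TransactionalCell()
      threads = [threading.Thread(
                     target=lambda: [cell.transact(lambda v: v + 1)
                                     for _ in range(1000)])
                 for _ in range(4)]
      for t in threads: t.start()
      for t in threads: t.join()
      print(cell._value)   # 4000, with no lost updates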

  17. Morphological evidence for parallel processing of information in rat macula

    NASA Technical Reports Server (NTRS)

    Ross, M. D.

    1988-01-01

    Study of montages, tracings and reconstructions prepared from a series of 570 consecutive ultrathin sections shows that rat maculas are morphologically organized for parallel processing of linear acceleratory information. Type II cells of one terminal field distribute information to neighboring terminals as well. The findings are examined in light of physiological data which indicate that macular receptor fields have a preferred directional vector, and are interpreted by analogy to a computer technology known as an information network.

  18. Efficient data IO for a Parallel Global Cloud Resolving Model

    SciTech Connect

    Palmer, Bruce J.; Koontz, Annette S.; Schuchardt, Karen L.; Heikes, Ross P.; Randall, David A.

    2011-11-26

    Execution of a Global Cloud Resolving Model (GCRM) at target resolutions of 2-4 km will generate, at a minimum, tens of gigabytes of data per variable per snapshot. Writing this data to disk without creating a serious bottleneck in the execution of the GCRM code, while also supporting efficient post-execution data analysis, is a significant challenge. This paper discusses an input/output (IO) application programmer interface (API) for the GCRM that efficiently moves data from the model to disk while maintaining support for community standard formats, avoiding the creation of very large numbers of files, and supporting efficient analysis. Several aspects of the API are discussed in detail. First, we discuss the output data layout, which linearizes the data in a consistent way that is independent of the number of processors used to run the simulation and provides a convenient format for subsequent analyses of the data. Second, we discuss the flexible API interface that enables modelers to easily add variables to the output stream by specifying where in the GCRM code these variables are located, and to flexibly configure the choice of outputs and the distribution of data across files. The flexibility of the API is designed to allow model developers to add new data fields to the output as the model develops and new physics is added; it also provides a mechanism for users of the GCRM code to adjust the output frequency and the number of fields written depending on the needs of individual calculations. Third, we describe the mapping to the NetCDF data model with an emphasis on the grid description. Fourth, we describe our messaging algorithms and IO aggregation strategies that are used to achieve high bandwidth while writing concurrently from many processors to shared files. We conclude with initial performance results.
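
    The processor-count-independent layout can be miniaturized as follows. This is a hedged sketch using raw MPI-IO through mpi4py rather than the GCRM API's NetCDF target; the file name and sizes are invented.

      # Each rank owns a contiguous slice of one linearized global array,
      # computes its own byte offset, and writes collectively to a single
      # shared file, so the on-disk layout never depends on the rank count.
      import numpy as np
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank, size = comm.Get_rank(), comm.Get_size()

      cells_global = 1 << 20                       # linearized global cells
      counts = [cells_global // size + (r < cells_global % size)
                for r in range(size)]
      offset_cells = sum(counts[:rank])

      field = np.full(counts[rank], float(rank))   # this rank's slice

      fh = MPI.File.Open(comm, "field.bin",
                         MPI.MODE_WRONLY | MPI.MODE_CREATE)
      fh.Write_at_all(offset_cells * field.itemsize, field)
      fh.Close()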

  19. Computation of the Density Matrix in Electronic Structure Theory in Parallel on Multiple Graphics Processing Units.

    PubMed

    Cawkwell, M J; Wood, M A; Niklasson, Anders M N; Mniszewski, S M

    2014-12-09

    The algorithm developed in Cawkwell, M. J. et al. [J. Chem. Theory Comput. 2012, 8, 4094] for the computation of the density matrix in electronic structure theory on a graphics processing unit (GPU) using the second-order spectral projection (SP2) method [Niklasson, A. M. N. Phys. Rev. B 2002, 66, 155115] has been efficiently parallelized over multiple GPUs on a single compute node. The parallel implementation provides significant speed-ups with respect to the single GPU version with no loss of accuracy. The performance and accuracy of the parallel GPU-based algorithm is compared with the performance of the SP2 algorithm and traditional matrix diagonalization methods on a multicore central processing unit (CPU).
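
    The SP2 recursion itself is short; below is a serial NumPy sketch of the method the paper parallelizes across GPUs. The eigvalsh-based spectral bounds are a shortcut for illustration only (production codes estimate bounds cheaply, e.g., from Gershgorin circles).

      # SP2: build the density matrix from H by repeated X -> X^2 or
      # X -> 2X - X^2 steps that drive trace(X) to the occupation nocc.
      import numpy as np

      def sp2_density_matrix(H, nocc, tol=1e-10, max_iter=100):
          n = H.shape[0]
          emin, emax = np.linalg.eigvalsh(H)[[0, -1]]
          X = (emax * np.eye(n) - H) / (emax - emin)   # spectrum -> [0, 1]
          for _ in range(max_iter):
              X2 = X @ X                     # the dominant (GPU) kernel
              tr, tr2 = np.trace(X), np.trace(X2)
              if abs(tr2 - nocc) < abs(2 * tr - tr2 - nocc):
                  X = X2                     # lowers the trace
              else:
                  X = 2 * X - X2             # raises the trace
              if abs(np.trace(X) - nocc) < tol:
                  break
          return X

      H = np.random.rand(40, 40)
      H = 0.5 * (H + H.T)
      D = sp2_density_matrix(H, nocc=10)     # trace(D) ~ 10, D @ D ~ D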

  20. [Multi-DSP parallel processing technique of hyperspectral RX anomaly detection].

    PubMed

    Guo, Wen-Ji; Zeng, Xiao-Ru; Zhao, Bao-Wei; Ming, Xing; Zhang, Gui-Feng; Lü, Qun-Bo

    2014-05-01

    To satisfy the requirements of high speed, real-time operation, and mass data storage in RX anomaly detection for hyperspectral image data, the present paper proposes a multi-DSP parallel processing system for hyperspectral imagery based on the CPCI Express standard bus architecture. The hardware topology of the system combines the tight coupling of four DSPs, which share the data bus and memory unit, with the interconnection of Link ports. On this hardware platform, by assigning a parallel processing task to each DSP in light of the spectral RX anomaly detection algorithm and the 3D structure of the spectral image data, a four-DSP parallel processing technique is proposed that computes the mean and covariance matrix of the whole image by spatially partitioning it. The experimental results show that, with equivalent detection performance, the proposed four-DSP parallel implementation of the RX anomaly detection algorithm runs four times faster than a single DSP, overcoming the constraint that a single DSP's internal storage capacity places on the processing of very large images while meeting the real-time processing demands of spectral data.
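
    The partitioning scheme is straightforward to sketch with NumPy standing in for the four DSPs: each partition contributes partial sums that are combined into the global mean and covariance before the per-pixel Mahalanobis-distance scoring. Array sizes below are illustrative.

      import numpy as np

      def partial_stats(block):
          # block: (pixels, bands) chunk owned by one "DSP".
          return block.shape[0], block.sum(axis=0), block.T @ block

      def rx_scores(cube, n_parts=4):
          pixels = cube.reshape(-1, cube.shape[-1]).astype(float)
          parts = np.array_split(pixels, n_parts, axis=0)
          stats = [partial_stats(p) for p in parts]   # parallel on real HW
          n = sum(s[0] for s in stats)
          mean = sum(s[1] for s in stats) / n
          cov = sum(s[2] for s in stats) / n - np.outer(mean, mean)
          icov = np.linalg.inv(cov)
          d = pixels - mean
          return np.einsum("ij,jk,ik->i", d, icov, d).reshape(cube.shape[:2])

      scores = rx_scores(np.random.rand(64, 64, 32))   # rows x cols x bands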

  1. Regional-scale calculation of the LS factor using parallel processing

    NASA Astrophysics Data System (ADS)

    Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong

    2015-05-01

    With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementations of algorithms for computing the LS factor are becoming a bottleneck. In this paper, a parallel processing model based on the message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length, and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategies are designed according to the characteristics of the algorithms, including a decomposition method that maintains the integrity of the results, an optimized workflow that reduces the time spent exporting unnecessary intermediate data, and a buffer-communication-computation strategy that improves communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.

  2. Parallel-Processing Equalizers for Multi-Gbps Communications

    NASA Technical Reports Server (NTRS)

    Gray, Andrew; Ghuman, Parminder; Hoy, Scott; Satorius, Edgar H.

    2004-01-01

    Architectures have been proposed for the design of frequency-domain least-mean-square complex equalizers that would be integral parts of parallel-processing digital receivers of multi-gigahertz radio signals and other quadrature-phase-shift-keying (QPSK) or 16-quadrature-amplitude-modulation (16-QAM) data signals at rates of multiple gigabits per second. "Equalizers" as used here denotes receiver subsystems that compensate for distortions in the phase and frequency responses of the broad-band radio-frequency channels typically used to convey such signals. The proposed architectures are suitable for realization in very-large-scale integrated (VLSI) circuitry and, in particular, complementary metal oxide semiconductor (CMOS) application-specific integrated circuits (ASICs) operating at frequencies lower than modulation symbol rates. A digital receiver of the type to which the proposed architecture applies (see Figure 1) would include an analog-to-digital converter (A/D) operating at a rate, fs, of 4 samples per symbol period. To obtain the high speed necessary for sampling, the A/D and a 1:16 demultiplexer immediately following it would be constructed as GaAs integrated circuits. The parallel-processing circuitry downstream of the demultiplexer, including a demodulator followed by an equalizer, would operate at a rate of only fs/16 (in other words, at 1/4 of the symbol rate). The output from the equalizer would be four parallel streams of in-phase (I) and quadrature (Q) samples.

  3. Reducing neural network training time with parallel processing

    NASA Technical Reports Server (NTRS)

    Rogers, James L., Jr.; Lamarsh, William J., II

    1995-01-01

    Obtaining optimal solutions for engineering design problems is often expensive because the process typically requires numerous iterations involving analysis and optimization programs. Previous research has shown that a near-optimum solution can be obtained in less time by simulating a slow, expensive analysis with a fast, inexpensive neural network. A new approach has been developed to further reduce this time. This approach decomposes a large neural network into many smaller neural networks that can be trained in parallel. Guidelines are developed to avoid some of the pitfalls of training smaller neural networks in parallel. These guidelines allow the engineer to determine the number of nodes on the hidden layer of the smaller neural networks, to choose the initial training weights, and to select a network configuration that will capture the interactions among the smaller neural networks. This paper presents results describing how these guidelines are developed.

  4. Probabilistic structural mechanics research for parallel processing computers

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Martin, William R.

    1991-01-01

    Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods has been hampered by their computationally intensive nature. Solution of PSM problems requires repeated analyses of structures that are often large and exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make solution of large scale PSM problems practical.

  5. Stellar Evolution and Social Evolution: A Study in Parallel Processes

    NASA Astrophysics Data System (ADS)

    Carneiro, Robert L.

    From the beginning of anthropology, social evolution has been one of its major interests. However, in recent years the study of this process has languished. Accordingly, those anthropologists who still consider social evolution to be of central importance to their discipline, and who continue to pursue it, find their endeavor bolstered when parallel instances of evolutionary reconstructions can be demonstrated in other fields. Stellar evolution has long been a prime interest of astronomers, and their progress in deciphering its course has been truly remarkable. In examining astronomers' reconstructions of stellar evolution, I have been struck by a number of similarities between the ways stars and societies have evolved. The parallels actually begin with the method used by both disciplines, namely, the comparative method. In astronomy, the method involves plotting stars on a Hertzsprung-Russell diagram and interpreting, diachronically, the pattern made by essentially synchronic data used for plotting. The comparative method is particularly appropriate when one is studying a process that cannot be observed over its full range in the life of any single individual, be it a star or a society. Parallels also occur in that stars and societies have each followed distinctive stages in their evolution. These stages are, in both cases, sometimes unilinear and sometimes multilinear. Moreover, the distinction drawn by anthropologists between a pristine and a secondary state (which depends on whether the state so represented is the first such occurrence in an area, or was a later development derivative from earlier states) finds its astronomical parallel in the relationship existing between Population II and Population I stars. These and other similarities between stellar and social evolution will be cited and discussed.

  6. Intracellular signalling proteins as 'smart' agents in parallel distributed processes.

    PubMed

    Fisher, M J; Paton, R C; Matsuno, K

    1999-06-01

    In eucaryotic organisms, responses to external signals are mediated by a repertoire of intracellular signalling pathways that ultimately bring about the activation/inactivation of protein kinases and/or protein phosphatases. Until relatively recently, little thought had been given to the intracellular distribution of the components of these signalling pathways. However, experimental evidence from a diverse range of organisms indicates that rather than being freely distributed, many of the protein components of signalling cascades show a significant degree of spatial organisation. Here, we briefly review the roles of 'anchor', 'scaffold', and 'adaptor' proteins in the organisation and functioning of intracellular signalling pathways. We then consider some of the parallel distributed processing capacities of these adaptive systems. We focus on signalling proteins, both as individual 'devices' (agents) and as 'networks' (ecologies) of parallel processes. Signalling proteins are described as 'smart thermodynamic machines' which satisfy 'gluing' (functorial) roles in the information economy of the cell. This combines two information-processing views of signalling proteins. Individually, they show 'cognitive' capacities and collectively they integrate (cohere) cellular processes. We exploit these views by drawing comparisons between signalling proteins and verbs. This text/dialogical metaphor also helps refine our view of signalling proteins as context-sensitive information processing agents.

  7. Efficient p-value estimation in massively parallel testing problems

    PubMed Central

    Kustra, Rafal; Shi, Xiaofei; Murdoch, Duncan J.; Greenwood, Celia M. T.; Rangrej, Jagadish

    2008-01-01

    We present a new method to efficiently estimate very large numbers of p-values using empirically constructed null distributions of a test statistic. The need to evaluate a very large number of p-values is increasingly common with modern genomic data, and when interaction effects are of interest, the number of tests can easily run into billions. When the asymptotic distribution is not easily available, permutations are typically used to obtain p-values but these can be computationally infeasible in large problems. Our method constructs a prediction model to obtain a first approximation to the p-values and uses Bayesian methods to choose a fraction of these to be refined by permutations. We apply and evaluate our method on the study of association between 2-way interactions of genetic markers and colorectal cancer using the data from the first phase of a large, genome-wide case–control study. The results show enormous computational savings as compared to evaluating a full set of permutations, with little decrease in accuracy. PMID:18304995

  8. A General-purpose Framework for Parallel Processing of Large-scale LiDAR Data

    NASA Astrophysics Data System (ADS)

    Li, Z.; Hodgson, M.; Li, W.

    2016-12-01

    Light detection and ranging (LiDAR) technologies have proven efficient for quickly obtaining very detailed Earth surface data over large spatial extents. Such data are important for scientific discovery in the Earth and ecological sciences and for natural disaster and environmental applications. However, handling LiDAR data poses grand geoprocessing challenges due to both data intensity and computational intensity. Previous studies achieved notable success in parallel processing of LiDAR data to address these challenges. However, these studies either relied on high performance computers and specialized hardware (GPUs) or focused mostly on finding customized solutions for some specific algorithms. We developed a general-purpose scalable framework coupled with a sophisticated data decomposition and parallelization strategy to efficiently handle big LiDAR data. Specifically, 1) a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerant Hadoop distributed file system, 2) two spatial decomposition techniques are developed to enable efficient parallelization of different types of LiDAR processing tasks, and 3) by coupling existing LiDAR processing tools with Hadoop, this framework is able to conduct a variety of LiDAR data processing tasks in parallel in a highly scalable distributed computing environment. The performance and scalability of the framework are evaluated with a series of experiments conducted on a real LiDAR dataset using a proof-of-concept prototype system. The results show that the proposed framework 1) is able to handle massive LiDAR data more efficiently than standalone tools; and 2) provides almost linear scalability in terms of either increased workload (data volume) or increased computing nodes with both spatial decomposition strategies. We believe that the proposed framework provides valuable references for developing a collaborative cyberinfrastructure for processing big earth science data in a highly scalable environment.

  9. A learnable parallel processing architecture towards unity of memory and computing

    PubMed Central

    Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.

    2015-01-01

    Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area. PMID:26271243

  10. A Robust Parallel Framework for Massive Spatial Data Processing on High Performance Clusters

    NASA Astrophysics Data System (ADS)

    Guan, X.

    2012-07-01

    Massive spatial data requires considerable computing power for real-time processing. With the development of multicore technology and the reduction in computer component costs in recent years, high performance clusters have become the only economically viable solution for this requirement. Massive spatial data processing demands heavy I/O operations, however, and should be characterized as a data-intensive application. Data-intensive application parallelization strategies are incompatible with currently available processing frameworks, which are basically designed for traditional compute-intensive applications. In this paper we introduce a Split-and-Merge paradigm for spatial data processing and also propose a robust parallel framework in a cluster environment to support this paradigm. The Split-and-Merge paradigm efficiently exploits data parallelism for massive data processing. The proposed framework is based on the open-source TORQUE project and hosted on a multicore-enabled Linux cluster. One common LiDAR point cloud algorithm, Delaunay triangulation, was implemented on the proposed framework to evaluate its efficiency and scalability. Experimental results demonstrate that the system provides efficient performance speedup.

  11. A simple hyperbolic model for communication in parallel processing environments

    NASA Technical Reports Server (NTRS)

    Stoica, Ion; Sultan, Florin; Keyes, David

    1994-01-01

    We introduce a model for communication costs in parallel processing environments called the 'hyperbolic model,' which generalizes two-parameter dedicated-link models in an analytically simple way. Dedicated interprocessor links parameterized by a latency and a transfer rate that are independent of load are assumed by many existing communication models; such models are unrealistic for workstation networks. The communication system is modeled as a directed communication graph in which terminal nodes represent the application processes that initiate the sending and receiving of the information and in which internal nodes, called communication blocks (CBs), reflect the layered structure of the underlying communication architecture. The direction of graph edges specifies the flow of the information carried through messages. Each CB is characterized by a two-parameter hyperbolic function of the message size that represents the service time needed for processing the message. The parameters are evaluated in the limits of very large and very small messages. Rules are given for reducing a communication graph consisting of many CBs to an equivalent two-parameter form, while maintaining an approximation for the service time that is exact in both large and small limits. The model is validated on a dedicated Ethernet network of workstations by experiments with communication subprograms arising in scientific applications, for which a tight fit of the model predictions with actual measurements of the communication and synchronization time between end processes is demonstrated. The model is then used to evaluate the performance of two simple parallel scientific applications from partial differential equations: domain decomposition and time-parallel multigrid. In an appropriate limit, we also show the compatibility of the hyperbolic model with the recently proposed LogP model.

  12. A parallel and incremental algorithm for efficient unique signature discovery on DNA databases

    PubMed Central

    2010-01-01

    Background DNA signatures are distinct short nucleotide sequences that provide valuable information that is used for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA databases, and then apply the signatures to microarray experiments. Such discovery algorithms require setting some input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, suggestions about how to select proper factor values are rare, especially when an unfamiliar DNA database is used. In most cases, biologists typically select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, from which those meeting the requirements can then be selected, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures. Results This work proposes two discovery algorithms - the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. The algorithm finds new results by using previously discovered results as candidates, rather than by using the whole database. The PISD algorithm further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures.

  16. Studies of random number generators for parallel processing

    SciTech Connect

    Bowman, K.O.; Robinson, M.T.

    1986-09-01

    If Monte Carlo calculations are to be performed in a parallel processing environment, a method of generating appropriate sequences of pseudorandom numbers for each process must be available. Frederickson et al. proposed an elegant algorithm based on the concept of pseudorandom or Lehmer trees: the sequence of numbers from a linear congruential generator is divided into disjoint subsequences by the members of an auxiliary sequence. One subsequence can be assigned to each process. Extensive tests show the algorithm to suffer from correlations between the parallel subsequences: this is a result of the small number of bits which participate in the auxiliary sequence and illustrates the well-known discovery of Marsaglia. Two alternative algorithms are proposed, both of which appear to be free of interprocess correlations. One relaxes the conditions on the Lehmer tree by using an arbitrary auxiliary multiplier: it is not known to what extent the subsequences are disjoint. The other partitions the main sequence into disjoint subsequences by sending one member to each process in turn, minimizing interprocess communication by defining new sequence-generating parameters. 10 refs., 4 figs.
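
    The second alternative, dealing successive generator outputs to the processes in turn, is the classic leapfrog scheme. A minimal sketch, assuming a multiplicative linear congruential generator with the Lehmer/Park-Miller constants (chosen here purely for illustration):

        # Sketch of leapfrog partitioning for a multiplicative linear congruential
        # generator (LCG): process r of n takes members r+1, r+1+n, r+1+2n, ... of
        # the parent sequence, using a jump multiplier a^n mod m so that no
        # interprocess communication is needed.

        M = 2**31 - 1   # modulus
        A = 16807       # multiplier

        def make_stream(seed, rank, nprocs, a=A, m=M):
            """Yield the subsequence x[rank+1], x[rank+1+nprocs], ... of the parent LCG."""
            a_jump = pow(a, nprocs, m)     # advances nprocs steps per call
            x = seed
            for _ in range(rank + 1):      # walk to this process's first member
                x = (a * x) % m
            while True:
                yield x
                x = (a_jump * x) % m

        # Two processes split one parent sequence into disjoint subsequences:
        s0, s1 = make_stream(42, 0, 2), make_stream(42, 1, 2)
        print([next(s0) for _ in range(3)])
        print([next(s1) for _ in range(3)])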

  17. Parallel processing at the SSC: The fact and the fiction

    SciTech Connect

    Bourianoff, G.; Cole, B.

    1991-10-01

    Accurately modelling the behavior of particles circulating in accelerators is a computationally demanding task. The particle tracking code currently in use at the SSC is based upon a "thin element" analysis (TEAPOT). In this model each magnet in the lattice is described by a thin element at which the particle experiences an impulsive kick. Each kick requires approximately 200 floating-point operations ("FLOPs"). For the SSC collider lattice consisting of 10^4 elements, performing a tracking study of a set of 100 particles for 10^7 turns would require 2 × 10^15 FLOPs. Even on a machine capable of 100 MFLOP/sec (MFLOPS), this would require 2 × 10^7 seconds, and many such runs are necessary. It should be noted that the accuracy with which the kicks are to be calculated is important: the large number of iterations involved will magnify the effects of small errors. The inability of current computational resources to effectively perform the full calculation motivates the migration of this calculation to the most powerful computers available. A survey of current research into new supercomputing technologies reveals that the supercomputers of the future will be parallel in nature. Further, numerous such machines exist today, and are being used to solve other difficult problems. Thus it seems clear that it is not too early to begin developing tracking codes for parallel architectures. This report discusses implementing parallel processing on the SSC.
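
    As a rough illustration of the thin-element picture (a toy sketch, not TEAPOT), the code below applies alternating drifts and impulsive kicks to an ensemble of independent particles; since particles do not interact, the ensemble loop is trivially parallel. Lattice and kick values are made up.

        # Toy sketch of thin-element tracking: each magnet is a thin element that
        # applies an impulsive kick to the transverse momentum; particles are
        # independent, so the ensemble is trivially data-parallel.
        import numpy as np

        def track(x, px, lattice, nturns):
            """Advance (x, px) phase-space coordinates through a ring of thin elements."""
            for _ in range(nturns):
                for length, k in lattice:
                    x = x + length * px   # drift between elements
                    px = px - k * x       # impulsive quadrupole-like kick
            return x, px

        rng = np.random.default_rng(0)
        x = rng.normal(0.0, 1e-3, 100)    # 100 particles with slightly different orbits
        px = np.zeros(100)
        lattice = [(1.0, 0.02), (1.0, -0.02)] * 50   # toy alternating-gradient ring
        x, px = track(x, px, lattice, nturns=1000)
        print(float(np.max(np.abs(x))))   # motion stays bounded for this stable lattice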

  19. A Simple and Efficient Parallel Implementation of the Fast Marching Method

    NASA Astrophysics Data System (ADS)

    Yang, Jianming; Stern, Frederick

    2011-11-01

    The fast marching method is a widely used numerical method for solving the Eikonal equation arising from a variety of applications. However, this method is inherently serial and does not lend itself to straightforward parallelization. In this study, we present a simple and efficient algorithm for the parallel implementation of the fast marching method using a domain decomposition approach. Properties of the Eikonal equation are explored to greatly relax the serial interdependence of neighboring sub-domains. Overlapping sub-domains are employed to reduce communication overhead and improve parallelism among sub-domains. There are no iterative procedures or rollback operations involved in the present algorithm, and the changes to the serial version of the fast marching method are minimized. Numerical examples are presented to demonstrate the efficiency of our parallel fast marching method. This study was supported by ONR.

  20. High-Efficiency Nonfullerene Organic Solar Cells with a Parallel Tandem Configuration.

    PubMed

    Zuo, Lijian; Yu, Jiangsheng; Shi, Xueliang; Lin, Francis; Tang, Weihua; Jen, Alex K-Y

    2017-09-01

    In this work, a highly efficient parallel-connected tandem solar cell utilizing a nonfullerene acceptor is demonstrated. Guided by optical simulation, the active-layer thickness of each subcell is tuned to maximize its light trapping without the intense effort of matching photocurrents. Interestingly, a strong optical microcavity with dual oscillation centers is formed in the back subcell, which further enhances light absorption. The parallel tandem device shows an improved photon-to-electron response over the range between 450 and 800 nm, and a high short-circuit current density (JSC) of 17.92 mA cm^-2. In addition, the subcells show high fill factors due to reduced recombination loss under diluted light intensity. These merits enable an overall power conversion efficiency (PCE) of >10% for this tandem cell, which represents a ≈15% enhancement compared to the optimal single-junction device. Further application of the designed parallel tandem configuration to more efficient single-junction cells enables a PCE of >11%, which is the highest efficiency among all parallel-connected organic solar cells (OSCs). This work stresses the importance of employing a parallel tandem configuration for achieving efficient light harvesting in nonfullerene-based OSCs and provides a useful strategy for exploring the ultimate performance of organic solar cells. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  1. OASIS4: An Efficient Parallel Code Coupler for Earth System Modelling

    NASA Astrophysics Data System (ADS)

    Coquart, L.; Valcke, S.; Redler, R.; Ritzdorf, H.

    2009-04-01

    As a new development step of the OASIS coupler family, we present OASIS4 in its latest version. OASIS4 is software that allows synchronized exchanges of coupling information between numerical codes representing different components of the climate system. The concepts of portability, flexibility, parallelism and efficiency are the main drivers of the OASIS4 development, with which we target the needs of Earth system modelling in its full complexity. The development and maintenance of OASIS4 has been supported by EU and institutional funding within the PRISM Support Initiative for the past seven years. Here we present the latest version of the OASIS4 coupling software, which now includes the commonly known point-based 2D and 3D interpolation schemes (bilinear, trilinear, bicubic, nearest neighbour) and 2D conservative remapping. Furthermore, the new version of the software now provides a complete parallel search that takes into account specific requirements at process boundaries in order to provide identical search results independently of the domain partitioning. The parallel "multi-grid" search ensures low CPU cost for the neighbourhood search while showing good scalability when applied to grid-partitioned domains. OASIS4 is currently used in a few climate applications, such as in the FP6 European GEMS project for the 3D coupling between atmosphere and atmosphere chemistry, by the Swedish Meteorological and Hydrological Institute (SMHI) for regional models covering the Arctic Sea or the Baltic area, and by the Calcul Intensif pour le CLimat et l'Environment (CICLE) project funded by the French "Agence Nationale de la Recherche".
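
    As a small illustration of one listed ingredient, the sketch below performs plain bilinear remapping of a source field onto target points; it is generic NumPy code, not the OASIS4 interface.

        # Minimal sketch of bilinear remapping: interpolate a field defined on a
        # regular source grid at arbitrary target points (e.g., the cell centres
        # of another model component's grid).
        import numpy as np

        def bilinear(src_x, src_y, field, tx, ty):
            """Interpolate field (on the regular grid src_x, src_y) at points (tx, ty)."""
            i = np.clip(np.searchsorted(src_x, tx) - 1, 0, len(src_x) - 2)
            j = np.clip(np.searchsorted(src_y, ty) - 1, 0, len(src_y) - 2)
            wx = (tx - src_x[i]) / (src_x[i + 1] - src_x[i])
            wy = (ty - src_y[j]) / (src_y[j + 1] - src_y[j])
            return ((1 - wx) * (1 - wy) * field[i, j] + wx * (1 - wy) * field[i + 1, j]
                    + (1 - wx) * wy * field[i, j + 1] + wx * wy * field[i + 1, j + 1])

        x = y = np.linspace(0.0, 1.0, 11)
        f = np.add.outer(x, y)   # a linear field, which bilinear reproduces exactly
        print(bilinear(x, y, f, np.array([0.25]), np.array([0.35])))   # -> [0.6]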

  2. Parallel deterioration to language processing in a bilingual speaker.

    PubMed

    Druks, Judit; Weekes, Brendan Stuart

    2013-01-01

    The convergence hypothesis [Green, D. W. (2003). The neural basis of the lexicon and the grammar in L2 acquisition: The convergence hypothesis. In R. van Hout, A. Hulk, F. Kuiken, & R. Towell (Eds.), The interface between syntax and the lexicon in second language acquisition (pp. 197-218). Amsterdam: John Benjamins] assumes that the neural substrates of language representations are shared between the languages of a bilingual speaker. One prediction of this hypothesis is that neurodegenerative disease should produce parallel deterioration to lexical and grammatical processing in bilingual aphasia. We tested this prediction with a late bilingual Hungarian (first language, L1)-English (second language, L2) speaker J.B. who had nonfluent progressive aphasia (NFPA). J.B. had acquired L2 in adolescence but was premorbidly proficient and used English as his dominant language throughout adult life. Our investigations showed comparable deterioration to lexical and grammatical knowledge in both languages during a one-year period. Parallel deterioration to language processing in a bilingual speaker with NFPA challenges the assumption, made in some theories of bilingual language processing, that L1 and L2 rely on different brain mechanisms [Ullman, M. T. (2001). The neural basis of lexicon and grammar in first and second language: The declarative/procedural model. Bilingualism: Language and Cognition, 4(1), 105-122].

  3. Parallel-Processing Software for Creating Mosaic Images

    NASA Technical Reports Server (NTRS)

    Klimeck, Gerhard; Deen, Robert; McCauley, Michael; DeJong, Eric

    2008-01-01

    A computer program implements parallel processing for nearly real-time creation of panoramic mosaics of images of terrain acquired by video cameras on an exploratory robotic vehicle (e.g., a Mars rover). Because the original images are typically acquired at various camera positions and orientations, it is necessary to warp the images into the reference frame of the mosaic before stitching them together to create the mosaic. [Also see "Parallel-Processing Software for Correlating Stereo Images," Software Supplement to NASA Tech Briefs, Vol. 31, No. 9 (September 2007) page 26.] The warping algorithm in this computer program reflects the considerations that (1) for every pixel in the desired final mosaic, a good corresponding point must be found in one or more of the original images and (2) for this purpose, one needs a good mathematical model of the cameras and a good correlation of individual pixels with respect to their positions in three dimensions. The desired mosaic is divided into slices, each of which is assigned to one of a number of central processing units (CPUs) operating simultaneously. The results from the CPUs are gathered and placed into the final mosaic. The time taken to create the mosaic depends upon the number of CPUs, the speed of each CPU, and whether a local or a remote data-staging mechanism is used.
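
    A minimal sketch of the slice-per-CPU strategy, using Python's multiprocessing module in place of the program's actual parallel framework; the per-slice warp is replaced by a placeholder pixel function.

        # Divide the output mosaic into horizontal slices, render the slices in
        # worker processes, and gather the results into the final mosaic.
        import numpy as np
        from multiprocessing import Pool

        HEIGHT, WIDTH, NWORKERS = 512, 2048, 4

        def render_slice(bounds):
            """Fill one slice [row0, row1); stands in for warping/stitching source images."""
            row0, row1 = bounds
            rows = np.arange(row0, row1)[:, None]
            cols = np.arange(WIDTH)[None, :]
            return row0, ((rows + cols) % 256).astype(np.uint8)  # placeholder pixels

        if __name__ == "__main__":
            step = HEIGHT // NWORKERS
            slices = [(r, min(r + step, HEIGHT)) for r in range(0, HEIGHT, step)]
            mosaic = np.empty((HEIGHT, WIDTH), np.uint8)
            with Pool(NWORKERS) as pool:
                for row0, block in pool.map(render_slice, slices):
                    mosaic[row0:row0 + block.shape[0]] = block   # gather into final mosaic
            print(mosaic.shape)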

  4. Sentence comprehension: A parallel distributed processing approach. Technical report

    SciTech Connect

    McClelland, J.L.; St John, M.; Taraban, R.

    1989-07-14

    Basic aspects of conventional approaches to sentence comprehension are reviewed, and some of the difficulties faced by models that take these approaches are pointed out. An alternative approach, based on the principles of parallel distributed processing, is described, and it is shown how it offers different answers to basic questions about the nature of the language processing mechanism. An illustrative simulation model captures the key characteristics of the approach and illustrates how it can cope with the difficulties faced by conventional models. Alternative ways of conceptualizing basic aspects of language processing within this framework are then considered, along with how the approach can address several arguments that might be brought to bear against it; avenues for future development are suggested.

  5. Parallel processes and the problem of internal time

    NASA Astrophysics Data System (ADS)

    Gernert, Dieter

    2001-06-01

    It is the purpose of this paper to study the problem of internal (subjective) time by a novel approach. Until now, this topic has mainly been discussed under the aspects of philosophy, psychology, and neurobiology. By way of contrast, we consider processes within an (abstract) information-processing system that fulfills certain minimum requirements. In this context, a special operator algebra is formulated, which has similarities to those occurring in quantum theory. A specific problem related to parallel (concurrent) processes leads to a concept of time windows (time slices) within the context of the operator algebra; this theoretical finding is compatible with known results. Finally, some hints for possible experiments are given, and it will be briefly discussed how this concept is related to traditional ones.

  6. Parallel Latent Semantic Analysis using a Graphics Processing Unit

    SciTech Connect

    Cui, Xiaohui; Potok, Thomas E; Cavanagh, Joseph M

    2009-01-01

    Latent Semantic Analysis (LSA) can be used to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of data sets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. The Graphics Processing Unit (GPU) can solve some highly parallel problems much faster than the traditional sequential processor (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a computer cluster. In this paper, we present a parallel LSA implementation on the GPU, using NVIDIA Compute Unified Device Architecture (CUDA) and Compute Unified Basic Linear Algebra Subprograms (CUBLAS). The performance of this implementation is compared to a traditional LSA implementation on the CPU using an optimized Basic Linear Algebra Subprograms library. For large matrices that have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version.
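
    For reference, the algebraic core of LSA is a truncated singular value decomposition of the term-document matrix. A minimal CPU sketch with NumPy (the GPU implementation performs the same algebra through CUDA/CUBLAS):

        # Reduce a term-document count matrix to k latent dimensions via SVD.
        import numpy as np

        def lsa(term_doc, k):
            """Return rank-k document representations from a term-document matrix."""
            u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
            return (s[:k, None] * vt[:k]).T    # documents projected into k latent dims

        rng = np.random.default_rng(1)
        counts = rng.poisson(0.3, size=(1000, 200)).astype(float)  # 1000 terms, 200 docs
        docs_k = lsa(counts, k=50)
        print(docs_k.shape)   # (200, 50)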

  7. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    NASA Astrophysics Data System (ADS)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high-resolution and high-quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software-based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in an H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements, such as 10-bit pixel depth or a 4:2:2 chroma format, often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder can be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit depths and richer chroma subsampling formats such as YUV 4:2:2 or 4:4:4. Low-power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking

  8. An Optimization System with Parallel Processing for Reducing Common-Mode Current on Electronic Control Unit

    NASA Astrophysics Data System (ADS)

    Okazaki, Yuji; Uno, Takanori; Asai, Hideki

    In this paper, we propose an optimization system with parallel processing for reducing electromagnetic interference (EMI) on an electronic control unit (ECU). We adopt simulated annealing (SA), genetic algorithm (GA) and taboo search (TS) to seek optimal solutions, and a Spice-like circuit simulator to analyze the common-mode current. The proposed system can therefore determine, efficiently and practically, adequate combinations of the parasitic inductance and capacitance values on the printed circuit board (PCB) that reduce the EMI caused by the common-mode current. Finally, we apply the proposed system to an example circuit to verify its validity and efficiency.
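
    As an illustration of one of the named search strategies, the sketch below runs a generic simulated annealing loop over two component values against a stand-in cost function; the actual system would invoke the Spice-like simulator instead.

        # Generic simulated annealing over two component values; the cost function
        # is a placeholder for a circuit analysis of the common-mode current.
        import math, random

        def common_mode_cost(params):
            """Placeholder objective; a real system would run a circuit simulation."""
            l, c = params
            return (l - 3.3) ** 2 + (c - 0.47) ** 2

        def anneal(start, steps=5000, temp0=1.0):
            best = cur = start
            best_cost = cur_cost = common_mode_cost(cur)
            for i in range(steps):
                temp = temp0 * (1 - i / steps) + 1e-9
                cand = tuple(v + random.gauss(0, 0.1) for v in cur)
                cand_cost = common_mode_cost(cand)
                # Accept improvements always, uphill moves with Boltzmann probability.
                if cand_cost < cur_cost or random.random() < math.exp((cur_cost - cand_cost) / temp):
                    cur, cur_cost = cand, cand_cost
                    if cur_cost < best_cost:
                        best, best_cost = cur, cur_cost
            return best, best_cost

        print(anneal((0.0, 0.0)))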

  9. Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak

    1999-01-01

    The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.

  10. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

    PubMed Central

    Marçais, Guillaume; Kingsford, Carl

    2011-01-01

    Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish. Contact: gmarcais@umd.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21217122
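
    The counting task itself is simple to state. A serial Python sketch, with a dict standing in for Jellyfish's multithreaded lock-free hash table and canonical k-mers used as a common convention (not necessarily Jellyfish's default):

        # Count every k-mer in a DNA string, merging each k-mer with its reverse
        # complement into a canonical form.
        from collections import Counter

        COMP = str.maketrans("ACGT", "TGCA")

        def canonical(kmer):
            rc = kmer.translate(COMP)[::-1]
            return min(kmer, rc)

        def count_kmers(seq, k):
            counts = Counter()
            for i in range(len(seq) - k + 1):
                counts[canonical(seq[i:i + k])] += 1
            return counts

        print(count_kmers("ACGTACGTGACG", k=4).most_common(3))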

  11. MASSIVELY PARALLEL LATENT SEMANTIC ANALYSES USING A GRAPHICS PROCESSING UNIT

    SciTech Connect

    Cavanagh, J.; Cui, S.

    2009-01-01

    Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. A graphics processing unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor or central processing unit (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a PC cluster. Due to the GPU's application-specific architecture, harnessing the GPU's computational prowess for LSA is a great challenge. We presented a parallel LSA implementation on the GPU, using NVIDIA® Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms software. The performance of this implementation is compared to a traditional LSA implementation on a CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000x1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits of the GPU for matrices divisible by 16. It should be noted that the overall speed of the CPU version did not vary appreciably when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.

  12. Massively Parallel Processing for Fast and Accurate Stamping Simulations

    NASA Astrophysics Data System (ADS)

    Gress, Jeffrey J.; Xu, Siguang; Joshi, Ramesh; Wang, Chuan-tao; Paul, Sabu

    2005-08-01

    The competitive automotive market drives automotive manufacturers to speed up vehicle development cycles and reduce lead time. Fast tooling development is one of the key areas supporting fast and short vehicle development programs (VDP). In the past ten years, stamping simulation has become the most effective validation tool for predicting and resolving all potential formability and quality problems before the dies are physically made. Stamping simulation and formability analysis has become a critical business segment in GM's math-based die engineering process. As simulation becomes one of the major production tools in the engineering factory, simulation speed and accuracy are two of the most important measures of stamping simulation technology. The speed and time-in-system of forming analysis become even more critical for supporting fast VDP and tooling readiness. Since 1997, General Motors Die Center has been working jointly with our software vendor to develop and implement a parallel version of simulation software for mass-production analysis applications. By 2001, this technology had matured in the form of distributed memory processing (DMP) of draw die simulations in a networked distributed-memory computing environment. In 2004, this technology was refined to massively parallel processing (MPP) and extended to line die forming analysis (draw, trim, flange, and associated spring-back) running on a dedicated computing environment. The evolution of this technology and the insight gained through the implementation of DMP/MPP technology, as well as performance benchmarks, are discussed in this publication.

  13. Parallel information processing channels created in the retina

    PubMed Central

    Schiller, Peter H.

    2010-01-01

    In the retina, several parallel channels originate that extract different attributes from the visual scene. This review describes how these channels arise and what their functions are. Following the introduction, four sections deal with these channels. The first discusses the “ON” and “OFF” channels that have arisen for the purpose of rapidly processing images in the visual scene that become visible by virtue of either light increment or light decrement; the ON channel processes images that become visible by virtue of light increment and the OFF channel processes images that become visible by virtue of light decrement. The second section examines the midget and parasol channels. The midget channel processes fine detail, wavelength information, and stereoscopic depth cues; the parasol channel plays a central role in processing motion and flicker as well as motion parallax cues for depth perception. Both these channels have ON and OFF subdivisions. The third section describes the accessory optic system that receives input from the retinal ganglion cells of Dogiel; these cells play a central role, in concert with the vestibular system, in stabilizing images on the retina to prevent the blurring of images that would otherwise occur when an organism is in motion. The last section provides a brief overview of several additional channels that originate in the retina. PMID:20876118

  14. Parallel distributed processing: Implications for cognition and development. Technical report

    SciTech Connect

    McClelland, J.L.

    1988-07-11

    This paper provides a brief overview of the connectionist or parallel distributed processing framework for modeling cognitive processes, and considers the application of the connectionist framework to problems of cognitive development. Several aspects of cognitive development might result from the process of learning as it occurs in multi-layer networks. This learning process has the characteristic that it reduces the discrepancy between expected and observed events. As it does this, representations develop on hidden units which dramatically change both the way in which the network represents the environment from which it learns and the expectations that the network generates about environmental events. The learning process exhibits relatively abrupt transitions corresponding to stage shifts in cognitive development. These points are illustrated using a network that learns to anticipate which side of a balance beam will go down, based on the number of weights on each side of the fulcrum and their distance from the fulcrum on each side of the beam. The network is trained in an environment in which weight more frequently governs which side will go down. It recapitulates the states of development seen in children, as well as the stage transitions, as it learns to represent weight and distance information.

  15. Low-cost high-efficiency optical coupling using through-silicon-hole in parallel optical transceiver module

    NASA Astrophysics Data System (ADS)

    Li, Baoxia; Wan, Lixi; Lv, Yao; Gao, Wei; Yang, Chengyue; Li, Zhihua; Zhang, Xu

    2009-06-01

    We present a cost-efficient parallel optical transceiver module based on a 1×4 VCSEL array, a 1×4 PD array, and a 12-wide multimode fiber ribbon for very-short-reach applications. A passive alignment technique using through-silicon holes (TSH) has been developed to realize highly efficient butt-coupling between the optoelectronic arrays and the multimode fibers. In this paper, the detailed optical coupling structure, misalignment tolerance, micro-assembly process, and measurement results are discussed. Finally, lensed multimode fibers formed by chemical etching are proposed, which exhibit great potential for further improvement of the coupling performance.

  16. Parallel Processing of Adaptive Meshes with Load Balancing

    NASA Technical Reports Server (NTRS)

    Das, Sajal K.; Harvey, Daniel J.; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2001-01-01

    Many scientific applications involve grids that lack a uniform underlying structure. These applications are often also dynamic in nature in that the grid structure significantly changes between successive phases of execution. In parallel computing environments, mesh adaptation of unstructured grids through selective refinement/coarsening has proven to be an effective approach. However, achieving load balance while minimizing interprocessor communication and redistribution costs is a difficult problem. Traditional dynamic load balancers are mostly inadequate because they lack a global view of system loads across processors. In this paper, we propose a novel and general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication topology, and compare its performance with a successful global load balancing environment, called PLUM, specifically created to handle adaptive unstructured applications. Our experimental results on an IBM SP2 demonstrate that the SBN-based load balancer achieves lower redistribution costs than those under PLUM by overlapping processing and data migration.

  17. Applying the Extended Parallel Process Model to workplace safety messages.

    PubMed

    Basil, Michael; Basil, Debra; Deshpande, Sameer; Lavack, Anne M

    2013-01-01

    The extended parallel process model (EPPM) proposes that fear appeals are most effective when they combine threat and efficacy. Three studies conducted in the workplace safety context examine the use of various EPPM factors and their effects, especially multiplicative effects. Study 1 was a content analysis examining the use of EPPM factors in actual workplace safety messages. Study 2 experimentally tested these messages with 212 construction trainees. Study 3 replicated this experiment with 1,802 men across four English-speaking countries: Australia, Canada, the United Kingdom, and the United States. The results of these three studies (1) demonstrate the inconsistent use of EPPM components in real-world work safety communications, (2) support the necessity of self-efficacy for the effective use of threat, (3) show a multiplicative effect where communication effectiveness is maximized when all model components are present (severity, susceptibility, and efficacy), and (4) validate these findings with gory appeals across four English-speaking countries.

  18. "Let's Move" campaign: applying the extended parallel process model.

    PubMed

    Batchelder, Alicia; Matusitz, Jonathan

    2014-01-01

    This article examines Michelle Obama's health campaign, "Let's Move," through the lens of the extended parallel process model (EPPM). "Let's Move" aims to reduce the childhood obesity epidemic in the United States. Developed by Kim Witte, EPPM rests on the premise that people's attitudes can be changed when fear is exploited as a factor of persuasion. Fear appeals work best (a) when a person feels concern about the issue or situation, and (b) when he or she believes in his or her capability of dealing with that issue or situation. Overall, the analysis found that "Let's Move" builds on past health campaigns that have been successful. An important element of the campaign is the use of fear appeals (as postulated by EPPM). For example, part of the campaign's strategy is to explain the severity of the diseases associated with obesity. By looking at the steps of EPPM, readers can also understand the strengths and weaknesses of "Let's Move."

  19. Massively Parallel Latent Semantic Analyzes using a Graphics Processing Unit

    SciTech Connect

    Cavanagh, Joseph M; Cui, Xiaohui

    2009-01-01

    Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of data sets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. The Graphics Processing Unit (GPU) can solve some highly parallel problems much faster than the traditional sequential processor (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a computer cluster. Due to the GPU's application-specific architecture, harnessing the GPU's computational prowess for LSA is a great challenge. We present a parallel LSA implementation on the GPU, using NVIDIA Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms. The performance of this implementation is compared to a traditional LSA implementation on the CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000x1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits the GPU has for matrices divisible by 16. It should be noted that the overall speed of the CPU version did not vary appreciably when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.

  20. A piloted comparison of elastic and rigid blade-element rotor models using parallel processing technology

    NASA Technical Reports Server (NTRS)

    Hill, Gary; Du Val, Ronald W.; Green, John A.; Huynh, Loc C.

    1990-01-01

    A piloted comparison of rigid and aeroelastic blade-element rotor models was conducted at the Crew Station Research and Development Facility (CSRDF) at Ames Research Center. FLIGHTLAB, a new simulation development and analysis tool, was used to implement these models in real time using parallel processing technology. Pilot comments and quantitative analysis performed both on-line and off-line confirmed that elastic degrees of freedom significantly affect perceived handling qualities. Trim comparisons show improved correlation with flight test data when elastic modes are modeled. The results demonstrate the efficiency with which the mathematical modeling sophistication of existing simulation facilities can be upgraded using parallel processing, and the importance of these upgrades to simulation fidelity.

  1. Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme

    NASA Astrophysics Data System (ADS)

    Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.

    2015-09-01

    The planetary boundary layer (PBL) is the lowest part of the atmosphere; its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core when two K40 GPUs are applied.

  2. Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme

    NASA Astrophysics Data System (ADS)

    Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.

    2014-11-01

    The planetary boundary layer (PBL) is the lowest part of the atmosphere; its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using Graphics Processing Unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its CPU-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its Central Processing Unit (CPU) counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to one CPU core is only 3.5×. We can even boost the speedup to 360× with respect to one CPU core when two K40 GPUs are applied.

  3. Highly efficient and exact method for parallelization of grid-based algorithms and its implementation in DelPhi.

    PubMed

    Li, Chuan; Li, Lin; Zhang, Jie; Alexov, Emil

    2012-09-15

    The Gauss-Seidel (GS) method is a standard iterative numerical method widely used to solve a system of equations and, in general, is more efficient compared with other iterative methods, such as the Jacobi method. However, a standard implementation of the GS method restricts its utilization in parallel computing due to its requirement of using updated neighboring values (i.e., within the current iteration) as soon as they are available. Here, we report an efficient and exact (not requiring assumptions) method to parallelize iterations and to reduce the computational time as a linear/nearly linear function of the number of processes or computing units. In contrast to other existing solutions, our method does not require any assumptions and is equally applicable for solving linear and nonlinear equations. This approach is implemented in the DelPhi program, which is a finite-difference Poisson-Boltzmann equation solver used to model electrostatics in molecular biology. This development makes the iterative procedure for obtaining the electrostatic potential distribution in the parallelized DelPhi severalfold faster than that in the serial code. Further, we demonstrate the advantages of the new parallelized DelPhi by computing the electrostatic potential and the corresponding energies of large supramolecular structures.
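
    The paper's method is described as exact and assumption-free; for contrast, the sketch below shows the textbook red-black (checkerboard) ordering, a standard way to expose parallelism in Gauss-Seidel that illustrates the neighbor-dependency problem being addressed. It is not DelPhi's algorithm.

        # Red-black Gauss-Seidel for a 2-D Poisson problem with zero boundaries:
        # all "red" points depend only on "black" neighbours and vice versa, so
        # each colour can be updated concurrently.
        import numpy as np

        def redblack_gs(u, f, h, sweeps):
            """Gauss-Seidel sweeps over interior points, one colour at a time."""
            for _ in range(sweeps):
                for colour in (0, 1):
                    for i in range(1, u.shape[0] - 1):
                        start = 1 + (i + colour) % 2
                        j = np.arange(start, u.shape[1] - 1, 2)   # same-colour points in row i
                        u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                          + u[i, j - 1] + u[i, j + 1] - h * h * f[i, j])
            return u

        n, h = 33, 1.0 / 32
        u = np.zeros((n, n))
        f = -np.ones((n, n))          # uniform source term
        u = redblack_gs(u, f, h, sweeps=2000)
        print(u[n // 2, n // 2])      # converged centre value of the discrete solution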

  4. Mobile Devices and GPU Parallelism in Ionospheric Data Processing

    NASA Astrophysics Data System (ADS)

    Mascharka, D.; Pankratius, V.

    2015-12-01

    Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million datapoints in real-time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be interactively adjusted by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
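
    One step in such a pipeline is mapping slant TEC along each line of sight to vertical TEC. A sketch using the standard single-layer (thin-shell) mapping function, with an assumed shell height and no claim that this matches the Mahali app's exact models:

        # Map slant TEC to vertical TEC at the ionospheric pierce point using the
        # common thin-shell approximation; the shell height is an assumption.
        import math

        R_E = 6371.0      # Earth radius, km
        H_ION = 350.0     # assumed thin-shell ionosphere height, km

        def vertical_tec(slant_tec, elevation_deg):
            e = math.radians(elevation_deg)
            sin_zp = R_E * math.cos(e) / (R_E + H_ION)   # sine of zenith angle at the shell
            cos_zp = math.sqrt(1.0 - sin_zp ** 2)
            return slant_tec * cos_zp

        print(vertical_tec(40.0, 30.0))   # slant 40 TECU at 30 degrees elevation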

  5. Processing modes and parallel processors in producing familiar keying sequences.

    PubMed

    Verwey, Willem B

    2003-05-01

    Recent theorizing indicates that the acquisition of movement sequence skill involves the development of several independent sequence representations at the same time. To examine this for the discrete sequence production task, participants in Experiment 1 produced a highly practiced sequence of six key presses in two conditions that allowed little preparation so that interkey intervals were slowed. Analyses of the distributions of moderately slowed interkey intervals indicated that this slowing was caused by the occasional use of two slower processing modes, that probably rely on independent sequence representations, and by reduced parallel processing in the fastest processing mode. Experiment 2 addressed the role of intention for the fast production of familiar keying sequences. It showed that the participants, who were not aware they were executing familiar sequences in a somewhat different task, had no benefits of prior practice. This suggests that the mechanisms underlying sequencing skills are not automatically activated by mere execution of familiar sequences, and that some form of top-down, intentional control remains necessary.

  6. Efficient Parallel Implementation of the CCSD External Exchange Operator and the Perturbative Triples (T) Energy Calculation.

    PubMed

    Janowski, Tomasz; Pulay, Peter

    2008-10-14

    A new, efficient parallel algorithm is presented for the most expensive step in coupled cluster singles and doubles (CCSD) energy calculations, the external exchange operator (EEO). The new implementation requires much less input/output than our previous algorithm and takes better advantage of integral screening. It is formulated as a series of matrix multiplications. Both the atomic orbital integrals and the corresponding CC coefficients are broken up into smaller blocks to diminish the memory requirement. Integrals are presorted to make their sparsity pattern more regular. This allows the simultaneous use of two normally conflicting techniques for speeding up the CCSD procedure: the use of highly efficient dense matrix multiplication routines and the efficient utilization of sparsity. We also describe an efficient parallel implementation of the perturbative triples correction to CCSD and related methods. Using the Array Files tool for distributed filesystems, parallelization is straightforward and does not compromise efficiency. Representative timings are shown for calculations with 282-1528 atomic orbitals, 68-228 correlated electrons, and various symmetries, C1 to C2h.
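
    A generic sketch of the two techniques combined above, blocked dense matrix multiplication plus norm-based screening of negligible blocks; this illustrates the principle, not the EEO formulation itself.

        # Blocked, screened matrix multiplication: dense kernels run on surviving
        # blocks, while blocks below a screening threshold are skipped.
        import numpy as np

        def blocked_screened_matmul(a, b, bs=64, tau=1e-10):
            """C = A @ B computed block-by-block, skipping negligible A-blocks."""
            n, m = a.shape[0], b.shape[1]
            c = np.zeros((n, m))
            for i in range(0, n, bs):
                for k in range(0, a.shape[1], bs):
                    a_blk = a[i:i + bs, k:k + bs]
                    if np.max(np.abs(a_blk)) < tau:      # integral/amplitude screening
                        continue
                    c[i:i + bs] += a_blk @ b[k:k + bs]   # dense kernel on this block
            return c

        rng = np.random.default_rng(2)
        a = rng.normal(size=(256, 256)) * (rng.random((256, 256)) > 0.9)  # sparse-ish
        b = rng.normal(size=(256, 256))
        print(np.allclose(blocked_screened_matmul(a, b), a @ b))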

  7. Variation in efficiency of parallel algorithms. [for study of stiffness matrices in planar trusses

    NASA Technical Reports Server (NTRS)

    Hayashi, A.; Melosh, R. J.; Utku, S.; Salama, M.

    1985-01-01

    The objective of the present study is to investigate some iterative parallel-processor linear-equation-solving algorithms with respect to their efficiency for analyses of typical linear engineering systems. Attention is given to a set of n linear equations, Ku = p, where K = an n x n positive definite, sparsely populated, symmetric matrix, u = an n x 1 vector of unknown responses, and p = an n x 1 vector of prescribed constants. This study is concerned with a hybrid method in which iteration is used to solve the problem, while a direct method is used at the local processor level. Variations in the efficiency of parallel algorithms are explored. Measures of efficiency are based on computer experiments with the algorithms. For all the algorithms, the wall-clock time is found to decrease as the number of processors increases.
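
    As a concrete example of the iterative layer, a Jacobi sweep for Ku = p updates every component independently, which is what distributes naturally across processors. A minimal NumPy sketch on a 1-D spring-chain stiffness matrix:

        # Jacobi iteration for K u = p; every component update in a sweep is
        # independent, so sweeps parallelize trivially across processors.
        import numpy as np

        def jacobi(K, p, iters):
            d = np.diag(K)
            u = np.zeros_like(p)
            for _ in range(iters):
                u = (p - (K @ u - d * u)) / d    # all n updates are independent
            return u

        # A 1-D chain of springs gives a positive definite tridiagonal K.
        n = 50
        K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        p = np.ones(n)
        u = jacobi(K, p, iters=20000)
        print(np.linalg.norm(K @ u - p))   # residual is tiny after convergence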

  8. Human pattern recognition: parallel processing and perceptual learning.

    PubMed

    Fahle, M

    1994-01-01

    A new theory of visual object recognition by Poggio et al. that is based on multidimensional interpolation between stored templates requires fast, stimulus-specific learning in the visual cortex. Indeed, performance in a number of perceptual tasks improves as a result of practice. We distinguish between two phases of learning a vernier-acuity task, a fast one that takes place within less than 20 min and a slow phase that continues over 10 h of training and probably beyond. The improvement is specific for relatively 'simple' features, such as the orientation of the stimulus presented during training, for the position in the visual field, and for the eye through which learning occurred. Some of these results are simulated by means of a computer model that relies on object recognition by multidimensional interpolation between stored templates. Orientation specificity of learning is also found in a jump-displacement task. In a manner parallel to the improvement in performance, cortical potentials evoked by the jump displacement tend to decrease in latency and to increase in amplitude as a result of training. The distribution of potentials over the brain changes significantly as a result of repeated exposure to the same stimulus. The results both of psychophysical and of electrophysiological experiments indicate that some form of perceptual learning might occur very early during cortical information processing. The hypothesis that vernier breaks are detected 'early' during pattern recognition is supported by the fact that reaction times for the detection of verniers depend hardly at all on the number of stimuli presented simultaneously. Hence, vernier breaks can be detected in parallel at different locations in the visual field, indicating that deviation from straightness is an elementary feature for visual pattern recognition in humans that is detected at an early stage of pattern recognition. Several results obtained during the last few years are reviewed, some new

  9. On the parallel efficiency of the Frederickson-McBryan multigrid

    NASA Technical Reports Server (NTRS)

    Decker, Naomi H.

    1990-01-01

    To take full advantage of the parallelism in a standard multigrid algorithm requires as many processors as points. However, since coarse grids contain fewer points, most processors are idle during the coarse grid iterations. Frederickson and McBryan claim that retaining all points on all grid levels (using all processors) can lead to a superconvergent algorithm. The purpose of this work is to show that the parallel superconvergent multigrid (PSMG) algorithm of Frederickson and McBryan, though it achieves perfect processor utilization, is no more efficient than a parallel implementation of standard multigrid methods. PSMG is simply a new and perhaps simpler way of achieving the same results.

  10. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations

    NASA Technical Reports Server (NTRS)

    Stone, H. S.

    1971-01-01

    Tridiagonal linear systems of equations are solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computations on computers of the ILLIAC IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
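
    The recursive doubling idea can be shown on the first-order recurrence x_i = a_i x_{i-1} + b_i, to which such tridiagonal solvers reduce. The sketch below needs only log2 N vectorized sweeps, each of which would be a single parallel step on an ILLIAC-class machine:

        # Recursive doubling for x[i] = a[i]*x[i-1] + b[i] with zero initial state
        # (so x[0] = b[0]). Affine maps compose associatively, so composing at
        # doubling distances yields the full prefix in ceil(log2 N) sweeps.
        import numpy as np

        def recursive_doubling(a, b):
            a = a.astype(float).copy()
            x = b.astype(float).copy()
            d = 1
            while d < len(x):
                # All updates at one distance are independent of each other.
                x[d:] = x[d:] + a[d:] * x[:-d]
                a[d:] = a[d:] * a[:-d]
                d *= 2
            return x

        a = np.full(8, 0.5)
        b = np.arange(1.0, 9.0)
        print(recursive_doubling(a, b))

        # Check the last element against the serial recurrence:
        x = 0.0
        for i in range(8):
            x = a[i] * x + b[i]
        print(x)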

  11. Optoelectronic parallel processing with smart pixel arrays for automated screening of cervical smear imagery

    NASA Astrophysics Data System (ADS)

    Metz, John Langdon

    2000-10-01

    This thesis investigates the use of optoelectronic parallel processing systems with smart photosensor arrays (SPAs) to examine cervical smear images. The automation of cervical smear screening seeks to reduce human workload and improve the accuracy of detecting pre- cancerous and cancerous conditions. Increasing the parallelism of image processing improves the speed and accuracy of locating regions-of-interest (ROI) from images of the cervical smear for the first stage of a two-stage screening system. The two-stage approach first detects ROI optoelectronically before classifying them using more time consuming electronic algorithms. The optoelectronic hit/miss transform (HMT) is computed using gray scale modulation spatial light modulators in an optical correlator. To further the parallelism of this system, a novel CMOS SPA computes the post processing steps required by the HMT algorithm. The SPA reduces the subsequent bandwidth passed into the second, electronic image processing stage classifying the detected ROI. Limitations in the miss operation of the HMT suggest using only the hit operation for detecting ROI. This makes possible a single SPA chip approach using only the hit operation for ROI detection which may replace the optoelectronic correlator in the screening system. Both the HMT SPA postprocessor and the SPA ROI detector design provide compact, efficient, and low-cost optoelectronic solutions to performing ROI detection on cervical smears. Analysis of optoelectronic ROI detection with electronic ROI classification shows these systems have the potential to perform at, or above, the current error rates for manual classification of cervical smears.
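
    The digital analogue of the optical HMT is easy to demonstrate. The sketch below uses SciPy's binary_hit_or_miss to pick out the upper-left corner of a blob; the structuring elements are illustrative stand-ins for the thesis's ROI patterns.

        # Hit/miss transform: a pixel survives only where the "hit" element fits
        # the foreground and the "miss" element fits the background.
        import numpy as np
        from scipy.ndimage import binary_hit_or_miss

        image = np.zeros((7, 7), dtype=int)
        image[2:5, 2:5] = 1                                  # a 3x3 blob

        hit = np.array([[0, 0, 0], [0, 1, 1], [0, 1, 1]])    # must be foreground
        miss = np.array([[1, 1, 1], [1, 0, 0], [1, 0, 0]])   # must be background
        corners = binary_hit_or_miss(image, structure1=hit, structure2=miss)
        print(np.argwhere(corners))                          # -> [[2 2]], the upper-left corner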

  12. On the design and implementation of a parallel, object-oriented, image processing toolkit

    SciTech Connect

    Kamath, C; Baldwin, C; Fodor, I; Tang, N A

    2000-06-22

    Advances in technology have enabled us to collect data from observations, experiments, and simulations at an ever-increasing pace. As these data sets approach the terabyte and petabyte range, scientists are increasingly using semi-automated techniques from data mining and pattern recognition to find useful information in the data. In order for data mining to be successful, the raw data must first be processed into a form suitable for the detection of patterns. When the data is in the form of images, this can involve a substantial amount of processing on very large data sets. To help make this task more efficient, they are designing and implementing an object-oriented image processing toolkit that specifically targets massively parallel, distributed-memory architectures. They first show that it is possible to use object-oriented technology to effectively address the diverse needs of image applications. Next, they describe how they abstract out the similarities in image processing algorithms to enable re-use in the software. They also discuss the difficulties encountered in parallelizing image algorithms on massively parallel machines, as well as the bottlenecks to high performance. They demonstrate the work using images from an astronomical data set, and illustrate how techniques such as filters and denoising through the thresholding of wavelet coefficients can be applied when a large image is distributed across several processors.
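
    As an example of one named technique, the sketch below denoises an image by soft-thresholding its wavelet coefficients with the PyWavelets package; it is serial, whereas the toolkit additionally distributes the image across processors.

        # Wavelet denoising: keep the coarse approximation, shrink the detail
        # coefficients, then reconstruct.
        import numpy as np
        import pywt

        def wavelet_denoise(img, wavelet="db2", level=3, thresh=0.2):
            coeffs = pywt.wavedec2(img, wavelet, level=level)
            shrunk = [coeffs[0]] + [
                tuple(pywt.threshold(d, thresh, mode="soft") for d in detail)
                for detail in coeffs[1:]
            ]
            rec = pywt.waverec2(shrunk, wavelet)
            return rec[:img.shape[0], :img.shape[1]]   # guard against padding

        rng = np.random.default_rng(3)
        clean = np.outer(np.hanning(64), np.hanning(64))
        noisy = clean + rng.normal(0, 0.1, clean.shape)
        print(np.std(noisy - clean), np.std(wavelet_denoise(noisy) - clean))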

  13. Efficient Extraction of Regional Subsets from Massive Climate Datasets using Parallel IO

    NASA Astrophysics Data System (ADS)

    Daily, J.; Schuchardt, K.; Palmer, B. J.

    2010-12-01

    The size of datasets produced by current climate models is increasing rapidly to the scale of petabytes. To handle data at this scale, parallel analysis tools are required; however, the majority of climate analysis software is serial and remains at the scale of workstations. Further, many climate analysis tools are designed to process regularly gridded data but lack sufficient features to handle unstructured grids. This paper presents a data-parallel subsetter capable of correctly handling unstructured grids while scaling to over 2000 cores. The approach is based on the partitioned global address space (PGAS) parallel programming model and one-sided communication. The paper demonstrates that parallel analysis of climate data succeeds in practice, although IO remains the single greatest bottleneck.

  14. Resolving Multiscale Processes in Tropical Cyclogenesis Using Parallel EEMD

    NASA Astrophysics Data System (ADS)

    Wu, Y.; Shen, B. W.; Cheung, S.; Li, J. L. F.; Liu, Z.

    2014-12-01

    The recent advance in high-resolution global models has suggested that improved multiscale simulations of tropical waves may help extend the lead time of tropical cyclone (TC) formation prediction (e.g., Shen et al., 2010ab, 2012, 2013a). In previous efforts in the multiscale analysis of tropical waves, the Ensemble Empirical Mode Decomposition (EEMD) has been successfully parallelized and used to detect atmospheric wave signals on different spatial scales (e.g., Shen et al., 2013b) that include Mixed Rossby Gravity (MRG) waves, Western Wind Belt (WWB), African Easterly Waves (AEWs), etc. We now extend the related studies to examine the evolution of the large scale waves and their association with the formation of tropical cyclones in the Atlantic for an extensive time period spanning multiple years. Our goal is to analyze the multiscale interaction in the initiation and early intensification stage of an AEW and its subsequent impact on TC genesis that involves mainly the large scale downscaling processes. Specific focus is on the impact of barotropic instability and critical level (CL, or steering level) that may appear in association with the AEW. The presence of the CL is believed to play an important role in providing a favorable environment in the early TC-genesis stage in the marsupial paradigm scenario. Preliminary analysis of the satellite data obtained from the newly launched Global Precipitation Measurement (GPM) mission linked to the TC genesis processes will be included.

  15. Comparing Two Opacity Models in Monte Carlo Radiative Heat Transfer: Computational Efficiency and Parallel Load Balancing

    NASA Astrophysics Data System (ADS)

    Cleveland, Mathew A.; Palmer, Todd S.

    2013-09-01

    Thermal heating from radiative heat transfer can have a significant effect on combustion systems. A variety of models have been developed to represent the strongly varying opacities found in combustion gases (Goutiere et al., 2000). This work evaluates the computational efficiency and load balance issues associated with two opacity models implemented in a 3D parallel Monte Carlo solver: the spectral-line-based weighted sum of gray gases (SLW) (Denison and Webb, 1993) and the spectral line-by-line (LBL) (Wang and Modest, 2007) opacity models. The parallel performance of the opacity models is evaluated using the Su and Olson (1999) frequency-dependent semi-analytic benchmark problem. Weak scaling, strong scaling, and history scaling studies were performed and comparisons were made for each opacity model. Comparisons of load balance sensitivities to these types of scaling were also evaluated. It was found that the SLW model has some attributes that might be valuable in a select set of parallel problems.

  16. An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method

    PubMed Central

    Marucci, Evandro A.; Neves, Leandro A.; Valêncio, Carlo R.; Pinto, Alex R.; Cansian, Adriano M.; de Souza, Rogeria C. G.; Shiyou, Yang; Machado, José M.

    2014-01-01

    With the advance of genomic research, the number of sequences involved in comparative methods has grown immensely. Among them are methods for similarity calculation, which are used by many bioinformatics applications. Due to the huge amount of data, combining low-complexity methods with parallel computing is becoming desirable. k-mer counting is a very efficient method with good biological results. In this work, the development of a parallel algorithm for multiple sequence similarity calculation using the k-mer counting method is proposed. Tests show that the algorithm exhibits very good scalability and a nearly linear speedup; with 14 nodes, a 12x speedup was obtained. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE. PMID:25140318
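
    The serial core of the approach can be sketched as follows, with details assumed: similarity between two sequences computed from their k-mer count vectors. The paper's contribution, distributing the all-pairs similarity loop across nodes, is omitted.

```python
# Illustrative serial core (assumed details): sequence similarity from shared
# k-mer counts, normalized by the k-mer count of the shorter sequence.
from collections import Counter

def kmer_counts(seq: str, k: int = 6) -> Counter:
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(a: str, b: str, k: int = 6) -> float:
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Count shared k-mer occurrences over the k-mers both sequences contain.
    shared = sum(min(ca[m], cb[m]) for m in ca.keys() & cb.keys())
    return shared / (min(len(a), len(b)) - k + 1)

print(kmer_similarity("ACGTACGTACGT", "ACGTACGAACGT", k=4))
```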

  17. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-18

    Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.
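
    PAMI is an IBM-specific messaging layer, but the general idea of a non-blocking collective can be illustrated with standard MPI (a hedged analogy, not the patented mechanism): start the collective, overlap it with local work, and complete it later.

```python
# Analogy only: a non-blocking collective in standard MPI via mpi4py.
# The call returns immediately, local work proceeds, and Wait() completes it.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
send = np.full(4, comm.Get_rank(), dtype="d")
recv = np.empty(4, dtype="d")

req = comm.Iallreduce(send, recv, op=MPI.SUM)  # returns without blocking
local_result = send * 2.0                      # overlapped local computation
req.Wait()                                     # complete the collective
```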

  18. Parallel elastic elements improve energy efficiency on the STEPPR bipedal walking robot

    DOE PAGES

    Mazumdar, Anirban; Spencer, Steven J.; Hobart, Clinton; ...

    2016-11-23

    This study describes how parallel elastic elements can be used to reduce energy consumption in the electric motor driven, fully-actuated, STEPPR bipedal walking robot without compromising or significantly limiting locomotive behaviors. A physically motivated approach is used to illustrate how selectively-engaging springs for hip adduction and ankle flexion predict benefits for three different flat ground walking gaits: human walking, human-like robot walking and crouched robot walking. Based on locomotion data, springs are designed and substantial reductions in power consumption are demonstrated using a bench dynamometer. These lessons are then applied to STEPPR (Sandia Transmission-Efficient Prototype Promoting Research), a fully actuated bipedal robot designed to explore the impact of tailored joint mechanisms on walking efficiency. Featuring high-torque brushless DC motors, efficient low-ratio transmissions, and high fidelity torque control, STEPPR provides the ability to incorporate novel joint-level mechanisms without dramatically altering high level control. Unique parallel elastic designs are incorporated into STEPPR, and walking data shows that hip adduction and ankle flexion springs significantly reduce the required actuator energy at those joints for several gaits. These results suggest that parallel joint springs offer a promising means of supporting quasi-static joint torques due to body mass during walking, relieving motors of the need to support these torques and substantially improving locomotive energy efficiency.

  19. Parallel elastic elements improve energy efficiency on the STEPPR bipedal walking robot

    SciTech Connect

    Mazumdar, Anirban; Spencer, Steven J.; Hobart, Clinton; Salton, Jonathan; Quigley, Morgan; Wu, Tingfan; Bertrand, Sylvain; Pratt, Jerry; Buerger, Stephen P.

    2016-11-23

    This study describes how parallel elastic elements can be used to reduce energy consumption in the electric motor driven, fully-actuated, STEPPR bipedal walking robot without compromising or significantly limiting locomotive behaviors. A physically motivated approach is used to illustrate how selectively-engaging springs for hip adduction and ankle flexion predict benefits for three different flat ground walking gaits: human walking, human-like robot walking and crouched robot walking. Based on locomotion data, springs are designed and substantial reductions in power consumption are demonstrated using a bench dynamometer. These lessons are then applied to STEPPR (Sandia Transmission-Efficient Prototype Promoting Research), a fully actuated bipedal robot designed to explore the impact of tailored joint mechanisms on walking efficiency. Featuring high-torque brushless DC motors, efficient low-ratio transmissions, and high fidelity torque control, STEPPR provides the ability to incorporate novel joint-level mechanisms without dramatically altering high level control. Unique parallel elastic designs are incorporated into STEPPR, and walking data shows that hip adduction and ankle flexion springs significantly reduce the required actuator energy at those joints for several gaits. These results suggest that parallel joint springs offer a promising means of supporting quasi-static joint torques due to body mass during walking, relieving motors of the need to support these torques and substantially improving locomotive energy efficiency.
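
    The energy argument can be made concrete with a back-of-the-envelope sketch (illustrative numbers, not the authors' model): a parallel spring supplies the quasi-static, posture-dependent part of the joint torque, so the motor only carries the residual, and resistive winding losses, which scale roughly with torque squared, drop accordingly.

```python
# Back-of-the-envelope sketch with made-up numbers: a parallel spring offsets
# a quasi-static joint load that is roughly linear in joint angle.
import numpy as np

theta = np.linspace(-0.2, 0.3, 500)       # joint angle over a gait (rad)
tau_required = -300.0 * theta + 5.0       # assumed quasi-static load (Nm)

k, theta0 = 250.0, 0.0                    # assumed spring stiffness / rest angle
tau_spring = -k * (theta - theta0)        # torque supplied by the spring
tau_motor = tau_required - tau_spring     # residual torque for the motor

# Motor winding loss scales roughly with torque squared (I^2 R), so compare:
print("loss proxy, no spring:  ", float(np.sum(tau_required ** 2)))
print("loss proxy, with spring:", float(np.sum(tau_motor ** 2)))
```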

  20. Accelerating the Gillespie Exact Stochastic Simulation Algorithm using hybrid parallel execution on graphics processing units.

    PubMed

    Komarov, Ivan; D'Souza, Roshan M

    2012-01-01

    The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.

  1. Accelerating the Gillespie Exact Stochastic Simulation Algorithm Using Hybrid Parallel Execution on Graphics Processing Units

    PubMed Central

    Komarov, Ivan; D'Souza, Roshan M.

    2012-01-01

    The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×−120× performance gain over various state-of-the-art serial algorithms when simulating different types of models. PMID:23152751
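
    For orientation, the exact direct-method GSSA that the paper accelerates can be sketched serially as below; the model (a reversible A + B <-> C reaction) and rate constants are illustrative, and the warp-level GPU parallelization is not reproduced.

```python
# Serial direct-method Gillespie SSA, for orientation only. The paper's
# contribution (one trajectory per GPU warp, many warps at once) is omitted.
import numpy as np

def gillespie(x, stoich, rates, propensity, t_end, rng):
    t, traj = 0.0, [(0.0, x.copy())]
    while t < t_end:
        a = propensity(x, rates)           # propensity of each reaction
        a0 = a.sum()
        if a0 == 0.0:
            break                          # no reaction can fire
        t += rng.exponential(1.0 / a0)     # exponential time to next event
        j = np.searchsorted(np.cumsum(a), rng.random() * a0)
        x = x + stoich[j]                  # apply the chosen reaction
        traj.append((t, x.copy()))
    return traj

stoich = np.array([[-1, -1, +1], [+1, +1, -1]])   # A+B->C and C->A+B
prop = lambda x, k: np.array([k[0] * x[0] * x[1], k[1] * x[2]])
rng = np.random.default_rng(0)
run = gillespie(np.array([100, 80, 0]), stoich, [0.01, 0.1], prop, 10.0, rng)
```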

  2. Efficient parallel linear scaling construction of the density matrix for Born-Oppenheimer molecular dynamics.

    PubMed

    Mniszewski, S M; Cawkwell, M J; Wall, M E; Mohd-Yusof, J; Bock, N; Germann, T C; Niklasson, A M N

    2015-10-13

    We present an algorithm for the calculation of the density matrix that for insulators scales linearly with system size and parallelizes efficiently on multicore, shared memory platforms with small and controllable numerical errors. The algorithm is based on an implementation of the second-order spectral projection (SP2) algorithm [ Niklasson, A. M. N. Phys. Rev. B 2002 , 66 , 155115 ] in sparse matrix algebra with the ELLPACK-R data format. We illustrate the performance of the algorithm within self-consistent tight binding theory by total energy calculations of gas phase poly(ethylene) molecules and periodic liquid water systems containing up to 15,000 atoms on up to 16 CPU cores. We consider algorithm-specific performance aspects, such as local vs nonlocal memory access and the degree of matrix sparsity. Comparisons to sparse matrix algebra implementations using off-the-shelf libraries on multicore CPUs, graphics processing units (GPUs), and the Intel many integrated core (MIC) architecture are also presented. The accuracy and stability of the algorithm are illustrated with long duration Born-Oppenheimer molecular dynamics simulations of 1000 water molecules and a 303 atom Trp cage protein solvated by 2682 water molecules.
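
    The core SP2 recursion can be sketched with dense NumPy matrices (the paper's implementation is sparse, in the ELLPACK-R format, and threaded; spectral bounds would in practice come from cheap Gershgorin estimates rather than a full diagonalization):

```python
# Dense-matrix sketch of the SP2 spectral projection. H is a symmetric
# Hamiltonian, nocc the number of occupied states; the iteration drives the
# trace of X toward nocc while purifying X toward an idempotent projector.
import numpy as np

def sp2_density_matrix(H, nocc, tol=1e-8, max_iter=100):
    evals = np.linalg.eigvalsh(H)          # cheap bound estimates suffice
    e_min, e_max = evals[0], evals[-1]
    X = (e_max * np.eye(H.shape[0]) - H) / (e_max - e_min)
    for _ in range(max_iter):
        X2 = X @ X
        # Choose the projection that steers trace(X) toward nocc.
        if abs(np.trace(X2) - nocc) < abs(np.trace(2 * X - X2) - nocc):
            X = X2
        else:
            X = 2 * X - X2
        if abs(np.trace(X) - nocc) < tol:
            break
    return X
```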

  3. A message passing kernel for the hypercluster parallel processing test bed

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Quealy, Angela; Cole, Gary L.

    1989-01-01

    A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.

  4. Nonlinear structural response using adaptive dynamic relaxation on a massively-parallel-processing system

    NASA Technical Reports Server (NTRS)

    Oakley, David R.; Knight, Norman F., Jr.

    1994-01-01

    A parallel adaptive dynamic relaxation (ADR) algorithm has been developed for nonlinear structural analysis. This algorithm has minimal memory requirements, is easily parallelizable and scalable to many processors, and is generally very reliable and efficient for highly nonlinear problems. Performance evaluations on single-processor computers have shown that the ADR algorithm is reliable and highly vectorizable, and that it is competitive with direct solution methods for the highly nonlinear problems considered. The present algorithm is implemented on the 512-processor Intel Touchstone DELTA system at Caltech, and it is designed to minimize the extent and frequency of interprocessor communication. The algorithm has been used to solve for the nonlinear static response of two and three dimensional hyperelastic systems involving contact. Impressive relative speedups have been achieved and demonstrate the high scalability of the ADR algorithm. For the class of problems addressed, the ADR algorithm represents a very promising approach for parallel-vector processing.

  5. Parallel processing using an optical delay-based reservoir computer

    NASA Astrophysics Data System (ADS)

    Van der Sande, Guy; Nguimdo, Romain Modeste; Verschaffelt, Guy

    2016-04-01

    Delay systems subject to delayed optical feedback have recently shown great potential in solving computationally hard tasks. By implementing a neuro-inspired computational scheme relying on the transient response to optical data injection, high processing speeds have been demonstrated. However, the delay-based reservoir computing systems discussed in the literature are built by coupling many different stand-alone components, which leads to bulky, non-monolithic systems that lack long-term stability. Here we numerically investigate the possibility of implementing reservoir computing schemes based on semiconductor ring lasers (SRLs), semiconductor lasers in which the laser cavity consists of a ring-shaped waveguide. SRLs are highly integrable and scalable, making them ideal candidates for key components in photonic integrated circuits. SRLs can generate light in two counterpropagating directions, between which bistability has been demonstrated. We demonstrate that two independent machine learning tasks, even with input data signals of a different nature, can be computed simultaneously using a single photonic nonlinear node, relying on the parallelism offered by photonics. We illustrate the performance on simultaneous chaotic time series prediction and nonlinear channel equalization. We take advantage of the two directional modes to process the individual tasks: each directional mode processes one task, mitigating possible crosstalk between the tasks. Our results indicate that prediction/classification errors comparable to state-of-the-art performance can be obtained even in the presence of noise, despite the two tasks being computed simultaneously. We also find that good performance is obtained for both tasks over a broad range of parameters. The results are discussed in detail in [Nguimdo et al., IEEE Trans. Neural Netw. Learn. Syst. 26, pp. 3301-3307, 2015].
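
    While the reservoir itself is a physical laser system, the trainable part of any reservoir computer is a linear readout fitted to recorded node states. The sketch below shows that step generically with ridge-regularized least squares; the state and target arrays are stand-ins, not simulation output from the paper.

```python
# Generic reservoir-computing readout: ridge regression on recorded states.
# The states matrix is a stand-in for virtual-node responses of the laser.
import numpy as np

rng = np.random.default_rng(1)
T, N = 2000, 50                        # time steps, virtual nodes
states = rng.normal(size=(T, N))       # placeholder for recorded transients
targets = rng.normal(size=T)           # desired output (e.g., next value)

lam = 1e-4                             # ridge regularization strength
w = np.linalg.solve(states.T @ states + lam * np.eye(N), states.T @ targets)
predictions = states @ w
nmse = np.mean((predictions - targets) ** 2) / np.var(targets)
```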

  6. Seeing the forest for the trees: Networked workstations as a parallel processing computer

    NASA Technical Reports Server (NTRS)

    Breen, J. O.; Meleedy, D. M.

    1992-01-01

    Unlike traditional 'serial' processing computers, in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system cannot rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.

  7. Fast phase processing in off-axis holography by CUDA including parallel phase unwrapping.

    PubMed

    Backoach, Ohad; Kariv, Saar; Girshovitz, Pinhas; Shaked, Natan T

    2016-02-22

    We present a parallel processing implementation for rapid extraction of quantitative phase maps from off-axis holograms on the Graphics Processing Unit (GPU) of the computer using compute unified device architecture (CUDA) programming. To obtain an efficient implementation, we parallelized both the wrapped phase map extraction algorithm and the two-dimensional phase unwrapping algorithm. In contrast to previous implementations, we utilized an unweighted least-squares phase unwrapping algorithm that better suits parallelism. We compared the proposed algorithm's run times on the CPU and the GPU of the computer for various sizes of off-axis holograms. Using the GPU implementation, we extracted the unwrapped phase maps from the recorded off-axis holograms at 35 frames per second (fps) for 4-megapixel holograms, and at 129 fps for 1-megapixel holograms, which represents the fastest processing frame rates obtained so far, to the best of our knowledge. We then used common-path off-axis interferometric imaging to quantitatively capture the phase maps of a micro-organism with rapid flagellum movements.
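
    The unweighted least-squares unwrapping that the authors parallelize has a well-known serial form (Ghiglia-Romero style): accumulate the divergence of the wrapped phase gradients, then solve the resulting Poisson equation with a DCT. A NumPy/SciPy sketch, for illustration only:

```python
# Serial sketch of DCT-based unweighted least-squares phase unwrapping.
# psi is the wrapped phase (float array, radians); illustration only.
import numpy as np
from scipy.fft import dctn, idctn

def unwrap_least_squares(psi):
    M, N = psi.shape
    wrap = lambda d: (d + np.pi) % (2.0 * np.pi) - np.pi
    dx = wrap(np.diff(psi, axis=1))      # wrapped horizontal differences
    dy = wrap(np.diff(psi, axis=0))      # wrapped vertical differences
    rho = np.zeros_like(psi)             # divergence of the wrapped gradient
    rho[:, :-1] += dx; rho[:, 1:] -= dx
    rho[:-1, :] += dy; rho[1:, :] -= dy
    rho_hat = dctn(rho, type=2, norm="ortho")
    i = np.arange(M)[:, None]; j = np.arange(N)[None, :]
    denom = 2.0 * (np.cos(np.pi * i / M) + np.cos(np.pi * j / N) - 2.0)
    denom[0, 0] = 1.0                    # avoid division by zero (DC term)
    rho_hat[0, 0] = 0.0                  # solution defined up to a constant
    return idctn(rho_hat / denom, type=2, norm="ortho")
```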

  8. Techniques for Real-Time Parallel Processing: Sensor Processing Case Studies

    DTIC Science & Technology

    1994-04-01

    probabilistic data association (IPDA) algorithm. Taken together, this is an example of a class 3 algorithm whose running time is strongly dependent on...horizontally and solve a factor of 128/m more problems, but this was not tested.) The code that computes the MGS in MPL is shown in Figure 5. There is a single...designed so that it could be implemented efficiently on both parallel and sequential machines. Our tests of the SISAL version of JPDA were performed on a

  9. Parallel processing methods for space based power systems

    NASA Technical Reports Server (NTRS)

    Berry, F. C.

    1993-01-01

    This report presents a method for doing load-flow analysis of a power system by using a decomposition approach. The power system for the Space Shuttle is used as a basis to build a model for the load-flow analysis. To test the decomposition method for doing load-flow analysis, simulations were performed on power systems of 16, 25, 34, 43, 52, 61, 70, and 79 nodes. Each of the power systems was divided into subsystems and simulated under steady-state conditions. The results from these tests have been found to be as accurate as tests performed using a standard serial simulator. The division of the power systems into different subsystems was done by assigning a processor to each area. There were 13 transputers available; therefore, up to 13 different subsystems could be simulated at the same time. This report presents preliminary results for a load-flow analysis using a decomposition principle. The report shows that the decomposition algorithm for load-flow analysis is well suited for parallel processing and provides increases in the speed of execution.

  10. Parallel Computations in Insect and Mammalian Visual Motion Processing.

    PubMed

    Clark, Damon A; Demb, Jonathan B

    2016-10-24

    Sensory systems use receptors to extract information from the environment and neural circuits to perform subsequent computations. These computations may be described as algorithms composed of sequential mathematical operations. Comparing these operations across taxa reveals how different neural circuits have evolved to solve the same problem, even when using different mechanisms to implement the underlying math. In this review, we compare how insect and mammalian neural circuits have solved the problem of motion estimation, focusing on the fruit fly Drosophila and the mouse retina. Although the two systems implement computations with grossly different anatomy and molecular mechanisms, the underlying circuits transform light into motion signals with strikingly similar processing steps. These similarities run from photoreceptor gain control and spatiotemporal tuning to ON and OFF pathway structures, motion detection, and computed motion signals. The parallels between the two systems suggest that a limited set of algorithms for estimating motion satisfies both the needs of sighted creatures and the constraints imposed on them by metabolism, anatomy, and the structure and regularities of the visual world.

  11. An integrated approach to improving the parallel applications development process

    SciTech Connect

    Rasmussen, Craig E; Watson, Gregory R; Tibbitts, Beth R

    2009-01-01

    The development of parallel applications is becoming increasingly important to a broad range of industries. Traditionally, parallel programming was a niche area that was primarily exploited by scientists trying to model extremely complicated physical phenomena. It is becoming increasingly clear, however, that continued hardware performance improvements through clock scaling and feature-size reduction are simply not going to be achievable for much longer. The hardware vendors' approach to addressing this issue is to employ parallelism through multi-processor and multi-core technologies. While there is little doubt that this approach produces scaling improvements, there are still many significant hurdles to be overcome before parallelism can be employed as a general replacement for more traditional programming techniques. The Parallel Tools Platform (PTP) Project was created in 2005 in an attempt to provide developers with new tools aimed at addressing some of the parallel development issues. Since then, the introduction of a new generation of peta-scale and multi-core systems has highlighted the need for such a platform. In this paper, we describe some of the challenges facing parallel application developers, present the current state of PTP, and provide a simple case study that demonstrates how PTP can be used to locate a potential deadlock situation in an MPI code.

  12. Introducing data parallelism into climate model post-processing through a parallel version of the NCAR Command Language (NCL)

    NASA Astrophysics Data System (ADS)

    Jacob, R. L.; Xu, X.; Krishna, J.; Tautges, T.

    2011-12-01

    The relationship between the needs of post-processing climate model output and the capability of the available tools has reached a crisis point. The large volume of data currently produced by climate models is overwhelming the current, decades-old analysis workflow. The tools used to implement that workflow are now a bottleneck in the climate science discovery process. This crisis will only worsen as ultra-high-resolution global climate models with horizontal scales of 4 km or smaller, running on leadership computing facilities, begin to produce tens to hundreds of terabytes for a single, hundred-year climate simulation. While climate models have used parallelism for several years, the post-processing tools are still mostly single-threaded applications. We have created a Parallel Climate Analysis Library (ParCAL) which implements many common climate analysis operations in a data-parallel fashion using the Message Passing Interface. ParCAL has in turn been built on sophisticated packages for describing grids in parallel (the Mesh-Oriented database, MOAB) and for performing vector operations on arbitrary grids (Intrepid). ParCAL also uses parallel I/O through the PnetCDF library. ParCAL has been used to implement a parallel version of the NCAR Command Language (NCL). ParNCL/ParCAL not only speeds up analysis of large datasets but also allows operations to be performed on native grids, eliminating the need to transform everything to latitude-longitude grids. In most cases, users' NCL scripts can run unaltered in parallel using ParNCL.

  13. Parallel-processing with surface plasmons, a new strategy for converting the broad solar spectrum

    NASA Technical Reports Server (NTRS)

    Anderson, L. M.

    1982-01-01

    A new strategy for efficient solar-energy conversion is based on parallel processing with surface plasmons: guided electromagnetic waves supported on thin films of common metals like aluminum or silver. The approach is unique in identifying a broadband carrier with suitable range for energy transport and an inelastic tunneling process which can be used to extract more energy from the more energetic carriers without requiring different materials for each frequency band. The aim is to overcome the fundamental 56-percent loss associated with mismatch between the broad solar spectrum and the monoenergetic conduction electrons used to transport energy in conventional silicon solar cells. This paper presents a qualitative discussion of the unknowns and barrier problems, including ideas for coupling surface plasmons into the tunnels, a step which has been the weak link in the efficiency chain.

  14. Parallel-processing with surface plasmons, a new strategy for converting the broad solar spectrum

    NASA Technical Reports Server (NTRS)

    Anderson, L. M.

    1982-01-01

    A new strategy for efficient solar-energy conversion is based on parallel processing with surface plasmons: guided electromagnetic waves supported on thin films of common metals like aluminum or silver. The approach is unique in identifying a broadband carrier with suitable range for energy transport and an inelastic tunneling process which can be used to extract more energy from the more energetic carriers without requiring different materials for each frequency band. The aim is to overcome the fundamental 56-percent loss associated with mismatch between the broad solar spectrum and the monoenergetic conduction electrons used to transport energy in conventional silicon solar cells. This paper presents a qualitative discussion of the unknowns and barrier problems, including ideas for coupling surface plasmons into the tunnels, a step which has been the weak link in the efficiency chain.

  15. A new method to customize protein expression vectors for fast, efficient and background free parallel cloning.

    PubMed

    Scholz, Judith; Besir, Hüseyin; Strasser, Claudia; Suppmann, Sabine

    2013-02-14

    Expression and purification of correctly folded proteins typically require screening of different parameters such as protein variants, solubility-enhancing tags, or expression hosts. Parallel vector series that cover all variations are available, but not without compromise. We have established a fast, efficient and absolutely background-free cloning approach that can be applied to any selected vector. Here we describe a method to tailor selected expression vectors for parallel Sequence and Ligation Independent Cloning (SLIC). SLIC cloning enables precise and sequence-independent engineering and is based on joining vector and insert with 15-25 bp homologies on both DNA ends by homologous recombination. We modified expression vectors based on pET, pFastBac and pTT backbones for parallel PCR-based cloning and screening in E. coli, insect cells and HEK293E cells, respectively. We introduced the toxic ccdB gene under control of a strong constitutive promoter for counterselection of insert-less vectors. In contrast to DpnI treatment commonly used to reduce vector background, ccdB used in our vector series is 100% efficient in killing cells carrying the parental vector and reduces vector background to zero. In addition, the 3' end of ccdB functions as a primer binding site common to all vectors. The second shared primer binding site is provided by an HRV 3C protease cleavage site located downstream of purification and solubility-enhancing tags, for tag removal. We have so far generated more than 30 different parallel expression vectors, and successfully cloned and expressed more than 250 genes with this vector series. There is no size restriction for gene insertion, and cloning efficiency is >95% with clone numbers up to 200. The procedure is simple, fast, efficient and cost-effective. All expression vectors showed efficient expression of eGFP and different target proteins requested to be produced and purified at our Core Facility services. This new expression vector series allows efficient

  16. Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR

    NASA Astrophysics Data System (ADS)

    Carroll-Nellenback, Jonathan J.; Shroyer, Brandon; Frank, Adam; Ding, Chen

    2013-03-01

    Current adaptive mesh refinement (AMR) simulations require algorithms that are highly parallelized and manage memory efficiently. As compute engines grow larger, AMR simulations will require algorithms that achieve new levels of efficient parallelization and memory management. We have attempted to employ new techniques to achieve both of these goals. Patch- or grid-based AMR often employs ghost cells to decouple the hyperbolic advances of each grid on a given refinement level. This decoupling allows each grid to be advanced independently. In AstroBEAR we utilize this independence by threading the grid advances on each level, with preference going to the finer-level grids. This allows for global load balancing instead of level-by-level load balancing and allows for greater parallelization across both physical space and AMR level. Threading of level advances can also improve performance by interleaving communication with computation, especially in deep simulations with many levels of refinement. While we see improvements of up to 30% on deep simulations run on a few cores, the speedup is typically more modest (5-20%) for larger-scale simulations. To improve memory management we have employed a distributed tree algorithm that requires processors to store and communicate only local sections of the AMR tree structure with neighboring processors. Using this distributed approach we are able to get reasonable scaling efficiency (>80%) out to 12288 cores and up to 8 levels of AMR, independent of the use of threading.

  17. The finite element machine: An experiment in parallel processing

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Peebles, S. W.; Crockett, T. W.; Knott, J. D.; Adams, L.

    1982-01-01

    The finite element machine is a prototype computer designed to support parallel solutions to structural analysis problems. The hardware architecture and support software for the machine, initial solution algorithms and test applications, and preliminary results are described.

  18. Parameters that affect parallel processing for computational electromagnetic simulation codes on high performance computing clusters

    NASA Astrophysics Data System (ADS)

    Moon, Hongsik

    What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research, and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited from the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization, and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared using benchmark software, and the metric was FLoating-point Operations Per Second (FLOPS), which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore systems? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to the type and utilization of the hardware, such as the CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPS, and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the

  19. Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.

    PubMed

    Kundeti, Vamsi K; Rajasekaran, Sanguthevar; Dinh, Hieu; Vaughn, Matthew; Thapar, Vishal

    2010-11-15

    Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories based on the data structures they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However, with the recent advances in short-read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm was given for this problem, where n is the size of the input and p is the number of processors. That algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend even to the out-of-core model, in which case it has an optimal I/O complexity of Θ((n log(n/B)) / (B log(M/B))) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on an SGI/Altix computer. A comparison of our algorithm with previous approaches reveals that our algorithm is faster, both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. The bi-directed de Bruijn graph is a fundamental data structure for
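
    To fix ideas, a toy serial construction of a (uni-directed) de Bruijn graph from reads is shown below; the paper's contribution, building the bi-directed version at scale with communication costs equal to parallel sorting, is not reproduced.

```python
# Toy serial de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)     # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

g = de_bruijn(["ACGTACG", "GTACGTT"], k=4)
for node, succs in sorted(g.items()):
    print(node, "->", sorted(succs))
```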

  20. Parallel processing of general and specific threat during early stages of perception

    PubMed Central

    2016-01-01

    Differential processing of threat can consummate as early as 100 ms post-stimulus. Moreover, early perception not only differentiates threat from non-threat stimuli but also distinguishes among discrete threat subtypes (e.g. fear, disgust and anger). Combining spatial-frequency-filtered images of fear, disgust and neutral scenes with high-density event-related potentials and intracranial source estimation, we investigated the neural underpinnings of general and specific threat processing in early stages of perception. Conveyed in low spatial frequencies, fear and disgust images evoked convergent visual responses with similarly enhanced N1 potentials and dorsal visual (middle temporal gyrus) cortical activity (relative to neutral cues; peaking at 156 ms). Nevertheless, conveyed in high spatial frequencies, fear and disgust elicited divergent visual responses, with fear enhancing and disgust suppressing P1 potentials and ventral visual (occipital fusiform) cortical activity (peaking at 121 ms). Therefore, general and specific threat processing operates in parallel in early perception, with the ventral visual pathway engaged in specific processing of discrete threats and the dorsal visual pathway in general threat processing. Furthermore, selectively tuned to distinctive spatial-frequency channels and visual pathways, these parallel processes underpin dimensional and categorical threat characterization, promoting efficient threat response. These findings thus lend support to hybrid models of emotion. PMID:26412811

  1. Parallel processing of general and specific threat during early stages of perception.

    PubMed

    You, Yuqi; Li, Wen

    2016-03-01

    Differential processing of threat can consummate as early as 100 ms post-stimulus. Moreover, early perception not only differentiates threat from non-threat stimuli but also distinguishes among discrete threat subtypes (e.g. fear, disgust and anger). Combining spatial-frequency-filtered images of fear, disgust and neutral scenes with high-density event-related potentials and intracranial source estimation, we investigated the neural underpinnings of general and specific threat processing in early stages of perception. Conveyed in low spatial frequencies, fear and disgust images evoked convergent visual responses with similarly enhanced N1 potentials and dorsal visual (middle temporal gyrus) cortical activity (relative to neutral cues; peaking at 156 ms). Nevertheless, conveyed in high spatial frequencies, fear and disgust elicited divergent visual responses, with fear enhancing and disgust suppressing P1 potentials and ventral visual (occipital fusiform) cortical activity (peaking at 121 ms). Therefore, general and specific threat processing operates in parallel in early perception, with the ventral visual pathway engaged in specific processing of discrete threats and the dorsal visual pathway in general threat processing. Furthermore, selectively tuned to distinctive spatial-frequency channels and visual pathways, these parallel processes underpin dimensional and categorical threat characterization, promoting efficient threat response. These findings thus lend support to hybrid models of emotion.

  2. Parallelized CCHE2D flow model with CUDA Fortran on Graphics Process Units

    USDA-ARS?s Scientific Manuscript database

    This paper presents the CCHE2D implicit flow model parallelized using the CUDA Fortran programming technique on Graphics Processing Units (GPUs). A parallelized implicit Alternating Direction Implicit (ADI) solver using the Parallel Cyclic Reduction (PCR) algorithm on the GPU is developed and tested. This solve...

  3. Parallel processing for nonlinear dynamics simulations of structures including rotating bladed-disk assemblies

    NASA Technical Reports Server (NTRS)

    Hsieh, Shang-Hsien

    1993-01-01

    The principal objective of this research is to develop, test, and implement coarse-grained, parallel-processing strategies for nonlinear dynamic simulations of practical structural problems. There are contributions to four main areas: finite element modeling and analysis of rotational dynamics, numerical algorithms for parallel nonlinear solutions, automatic partitioning techniques to effect load-balancing among processors, and an integrated parallel analysis system.

  4. Thermal efficiency of arc welding processes

    SciTech Connect

    DuPont, J.N.; Marder, A.R.

    1995-12-01

    A study was conducted on the arc and melting efficiency of the plasma arc, gas tungsten arc, gas metal arc, and submerged arc welding processes. The results of this work are extended to develop a quantitative method for estimating weld metal dilution in a companion paper. Arc efficiency was determined as a function of current for each process using A36 steel base metal. Melting efficiency was evaluated with variations in arc power and travel speed during deposition of austenitic stainless steel filler metal onto A36 steel substrates. The arc efficiency did not vary significantly within a given process over the range of currents investigated. A semi-empirical relation was developed for the melting efficiency as a function of net arc power and travel speed, which described the experimental data well. An interaction was observed between the arc and melting efficiency. A low arc efficiency factor limits the power delivered to the substrate which, in turn, limits the maximum travel speed for a given set of conditions. High melting efficiency is favored by high arc powers and travel speeds. As a result, a low arc efficiency can limit the maximum obtainable melting efficiency.

  5. PPM A highly efficient parallel particle mesh library for the simulation of continuum systems

    NASA Astrophysics Data System (ADS)

    Sbalzarini, I. F.; Walther, J. H.; Bergdorf, M.; Hieber, S. E.; Kotsalis, E. M.; Koumoutsakos, P.

    2006-07-01

    This paper presents a highly efficient parallel particle-mesh (PPM) library, based on a unifying particle formulation for the simulation of continuous systems. In this formulation, the grid-free character of particle methods is relaxed by the introduction of a mesh for the reinitialization of the particles, the computation of the field equations, and the discretization of differential operators. The present utilization of the mesh does not detract from the adaptivity, the efficient handling of complex geometries, the minimal dissipation, and the good stability properties of particle methods. The coexistence of meshes and particles allows for the development of a consistent and adaptive numerical method, but it presents a set of challenging parallelization issues that have in the past hindered the broader use of particle methods. The present library solves the key parallelization issues involving particle-mesh interpolations and the balancing of processor particle loading, using a novel adaptive tree for mixed domain decompositions along with a coloring scheme for the particle-mesh interpolation. The high parallel efficiency of the library is demonstrated in a series of benchmark tests on distributed-memory machines and on a shared-memory vector architecture. The modularity of the method is shown by a range of simulations, from compressible vortex rings using a novel formulation of smooth particle hydrodynamics, to simulations of diffusion in real biological cell organelles. The present library enables large-scale simulations of diverse physical problems using adaptive particle methods and provides a computational tool that is a viable alternative to mesh-based methods.

  6. A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale

    SciTech Connect

    Moreland, Kenneth; Geveci, Berk

    2014-11-01

    The evolution of the computing world from teraflop to petaflop has been relatively effortless, with several of the existing programming models scaling effectively to the petascale. The migration to exascale, however, poses considerable challenges. All industry trends suggest that the exascale machine will be built using processors containing hundreds to thousands of cores per chip. It can be inferred that efficient concurrency on exascale machines requires a massive number of concurrent threads, each performing many operations on a localized piece of data. Currently, visualization libraries and applications are based on what is known as the visualization pipeline. In the pipeline model, algorithms are encapsulated as filters with inputs and outputs. These filters are connected by setting the output of one component to the input of another. Parallelism in the visualization pipeline is achieved by replicating the pipeline for each processing thread. This works well for today's distributed-memory parallel computers but cannot be sustained when operating on processors with thousands of cores. Our project investigates a new visualization framework designed to exhibit the pervasive parallelism necessary for extreme-scale machines. Our framework achieves this by defining algorithms in terms of worklets, which are localized, stateless operations. Worklets are atomic operations that execute when invoked, unlike filters, which execute when a pipeline request occurs. The worklet design allows execution on a massive number of lightweight threads with minimal overhead. Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale machine.
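
    The worklet idea can be shown in miniature (a hedged sketch, not the project's actual API): the algorithm is a stateless per-element function, so the framework is free to schedule it over the data with as many lightweight threads as the hardware offers.

```python
# Miniature worklet sketch: a stateless per-element operation mapped over
# chunks of an array by a thread pool (NumPy releases the GIL internally).
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def magnitude_worklet(chunk: np.ndarray) -> np.ndarray:
    # Stateless: the output depends only on the input elements, so the
    # scheduler may run any number of these concurrently, in any order.
    return np.sqrt((chunk ** 2).sum(axis=1))

vectors = np.random.rand(1_000_000, 3)          # example vector field data
with ThreadPoolExecutor(max_workers=8) as pool:
    pieces = pool.map(magnitude_worklet, np.array_split(vectors, 8))
    out = np.concatenate(list(pieces))
```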

  7. Parallel processing architecture for computing inverse differential kinematic equations of the PUMA arm

    NASA Technical Reports Server (NTRS)

    Hsia, T. C.; Lu, G. Z.; Han, W. H.

    1987-01-01

    In advanced robot control problems, on-line computation of the inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be implemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.

  8. Adaptive plasticity and genetic divergence in feeding efficiency during parallel adaptive radiation of whitefish (Coregonus spp.).

    PubMed

    Lundsgaard-Hansen, B; Matthews, B; Vonlanthen, P; Taverna, A; Seehausen, O

    2013-03-01

    Parallel phenotypic divergence in replicated adaptive radiations could result either from parallel genetic divergence in response to similar divergent selection regimes or from equivalent phenotypically plastic responses to the repeated occurrence of contrasting environments. In post-glacial fish, replicated divergence in phenotypes along the benthic-limnetic habitat axis is commonly observed. Here, we use two benthic-limnetic species pairs of whitefish from two Swiss lakes, raised in a common garden design, with reciprocal food treatments in one species pair, to experimentally measure whether feeding efficiency on benthic prey has a genetic basis, is phenotypically plastic, or both. To do so, we offered experimental fish mosquito larvae, partially buried in sand, and measured multiple feeding efficiency variables. Our results reveal both genetic divergence and phenotypically plastic divergence in feeding efficiency, with the phenotypically benthic species raised on benthic food being the most efficient forager on benthic prey. This indicates that both divergent natural selection on genetically heritable traits and adaptive phenotypic plasticity are likely important mechanisms driving phenotypic divergence in adaptive radiation.

  9. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E

    2014-11-11

    Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

  10. A parallel-series-fed microstrip array with high efficiency and low cross-polarization

    NASA Technical Reports Server (NTRS)

    Huang, John

    1992-01-01

    The requirements for a microstrip array with a vertically polarized fan beam, corresponding to its use in C-band interferometric SAR, are addressed. A combination of parallel- and series-feed techniques is utilized in an array design with a three-stage parallel-fed configuration to enhance bandwidth performance. The linearly polarized traveling-wave microstrip array antenna is fed by microstrip transmission lines in two rows of 36 elements that resonate at 5.30 GHz. The transmission lines are impedance-matched at every junction for all the waves that travel toward the two ends of the array. The two measured principal-plane patterns are shown, and the measured narrow-beam pattern is found to agree with the calculated values. The VSWR bandwidths and narrow and broad beamwidths of the antenna are found to permit efficient performance. The efficiency is attributed to the parallel- and series-feed configuration, which allows proper impedance matching, and the low cross-polarization is a result of the antiphase feed technique employed in the configuration.

  11. Graphical representation of parallel algorithmic processes. Master's thesis

    SciTech Connect

    Williams, E.M.

    1990-12-01

    Algorithm animation is a visualization method used to enhance understanding of the functioning of an algorithm or program. Visualization is used for many purposes, including education, algorithm research, performance analysis, and program debugging. This research applies algorithm animation techniques to programs developed for parallel architectures, with specific focus on the Intel iPSC/2 hypercube. While both P-time and NP-time algorithms can potentially benefit from visualization techniques, the set of NP-complete problems provides fertile ground for developing parallel applications, since the combinatoric nature of the problems makes finding the optimum solution impractical. The primary goals for this visualization system are: data should be displayed as it is generated; the interface to the target program should be transparent, allowing the animation of existing programs; and flexibility - the system should be able to animate any algorithm. The resulting system incorporates and extends two AFIT products: the AFIT Algorithm Animation Research Facility (AAARF) and the Parallel Resource Analysis Software Environment (PRASE). AAARF is an algorithm animation system developed primarily for sequential programs, but is easily adaptable for use with parallel programs. PRASE is an instrumentation package that extracts system performance data from programs on the Intel hypercubes. Since performance data is an essential part of analyzing any parallel program, views of the performance data are provided as an elementary part of the system. Custom software is designed to interface these systems and to display the program data. The program chosen as the example for this study is a member of the NP-complete problem set; it is a parallel implementation of a general.

  12. Efficiency analysis of parallelized wavelet-based FDTD model for simulating high-index optical devices

    NASA Astrophysics Data System (ADS)

    Ren, Rong; Wang, Jin; Jiang, Xiyan; Lu, Yunqing; Xu, Ji

    2014-10-01

    The finite-difference time-domain (FDTD) method, which solves time-dependent Maxwell's curl equations numerically, has proved to be a highly efficient technique for numerous applications in electromagnetics. Despite its simplicity, the FDTD method suffers from serious limitations when substantial computer resources are required to solve electromagnetic problems of medium or large computational size, for example in high-index optical devices. In our work, an efficient wavelet-based FDTD model has been implemented and extended in a parallel computing environment to analyze high-index optical devices. This model is based on Daubechies compactly supported orthogonal wavelets and Deslauriers-Dubuc interpolating functions as biorthogonal wavelet bases, and is thus a very efficient algorithm for solving differential equations numerically. This wavelet-based FDTD model is in effect a high-spatial-order FDTD. Because of its highly linear numerical dispersion properties, the required discretization can be coarser than that required in the standard FDTD method. In our work, this wavelet-based FDTD model achieved a significant reduction in the number of cells, i.e., used memory. Also, as different segments of the optical device can be computed simultaneously, there was a significant gain in computation time; we achieved speed-up factors higher than 30 in comparison to using a single processor. Furthermore, aspects of the efficiency of the parallelized computation, such as the influence of the discretization and the load sharing between different processors, were analyzed. In conclusion, this parallel-computing model is promising for analyzing more complicated optical devices with large dimensions.

  13. Parallel processing implementations of a contextual classifier for multispectral remote sensing data

    NASA Technical Reports Server (NTRS)

    Siegel, H. J.; Swain, P. H.; Smith, B. W.

    1980-01-01

    The applicability of parallel processing schemes to the implementation of a contextual classification algorithm which exploits the spatial and spectral context of a multispectral remote sensing pixel to achieve classification is examined. Two algorithms for classifying each multivariate pixel taking into account the probable classifications of neighboring pixels are presented, which make use of a size-three horizontally linear neighborhood, and the serial computational complexity of the more efficient algorithm is shown to grow in proportion to the number of pixels and the cube of the number of possible categories. The implementation of the more efficient algorithm on a CDC Flexible Processor system and on a multimicroprocessor system such as the proposed PASM is then discussed. It is noted that the use of N processors to perform the calculations N times faster than a single processor overcomes the principal disadvantage of contextual classifiers, i.e., their computational complexity.
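
    A toy version of a size-three horizontally linear contextual rule is sketched below (details assumed): the label at pixel j is scored by summing, over the possible labels of its left and right neighbors, a context prior weighted by the three class-conditional likelihoods. Since each pixel's score depends only on fixed inputs, the loop over pixels is trivially parallel.

```python
# Toy contextual classification over a size-three horizontal neighborhood.
# Likelihoods and the context prior are random stand-ins for illustration.
import numpy as np

C = 3                                  # number of classes
rng = np.random.default_rng(0)
like = rng.random((100, C))            # per-pixel class likelihoods f(x_i | c)
G = rng.random((C, C, C))              # context prior g(c_left, c, c_right)

def classify(j: int) -> int:
    # score[b] = sum_{a,c} like[j-1][a] * G[a,b,c] * like[j+1][c] * like[j][b]
    score = np.einsum("a,abc,c->b", like[j - 1], G, like[j + 1]) * like[j]
    return int(np.argmax(score))

labels = [classify(j) for j in range(1, 99)]   # parallelizable over j
```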

  14. Parallel Processing Creates a Low-Cost Growth Path.

    ERIC Educational Resources Information Center

    Shekhel, Alex; Freeman, Eva

    1987-01-01

    Discusses the advantages of parallel processor computers in terms of expandability, cost, performance, and reliability, and suggests that such computers be used in library automation systems as a cost-effective approach to planning for the growth of information services and computer applications. (CLB)

  15. Large Scale Finite Element Modeling Using Scalable Parallel Processing

    NASA Technical Reports Server (NTRS)

    Cwik, T.; Katz, D.; Zuffada, C.; Jamnejad, V.

    1995-01-01

    An iterative solver for use with finite element codes was developed for the Cray T3D massively parallel processor at the Jet Propulsion Laboratory. Finite element modeling is useful for simulating scattered or radiated electromagnetic fields from complex three-dimensional objects with geometry variations smaller than an electrical wavelength.

  16. Listening beneath the Words: Parallel Processes in Music and Psychotherapy

    ERIC Educational Resources Information Center

    Shapiro, Yakov; Marks-Tarlow, Terry; Fridman, Joseph

    2017-01-01

    The authors investigate the parallels between musical performance and psychoanalytical therapy, using the former as a metaphor for the way therapist and patient jointly compose the therapeutic experience and improve the treatment it offers. [Note: The volume and issue number (v9 n1) shown on this PDF is incorrect. The correct citation is v9 n2.]

  17. Optical Digital Parallel Truth-Table Look-Up Processing

    NASA Astrophysics Data System (ADS)

    Mirsalehi, Mir Mojtaba

    During the last decade, a number of optical digital processors have been proposed that combine the parallelism and speed of optics with the accuracy and flexibility of a digital representation. In this thesis, two types of such processors (an EXCLUSIVE OR-based processor and a NAND-based processor) that function as content-addressable memories (CAMs) are analyzed. The main factors that affect the performance of the EXCLUSIVE OR-based processor are found to be the Gaussian nature of the reference beam and the finite square aperture of the crystal. A quasi-one-dimensional model is developed to analyze the effect of the Gaussian reference beam, and a circular aperture is used to increase the dynamic range in the output power. The main factors that affect the performance of the NAND-based processor are found to be the variations in the amplitudes and the relative phase of the laser beams during the recording process. A mathematical model is developed for analyzing the probability of error in the output of the processor. Using this model, the performance of the processor for some practical cases is analyzed. Techniques that have previously been used to reduce the number of reference patterns in a CAM include using the residue number system and applying logical minimization methods. In the present work, these and additional techniques are investigated. A systematic procedure is developed for selecting the optimum set of moduli. The effect of coding is investigated, and it is shown that multi-level coding, when used in conjunction with logical minimization techniques, significantly reduces the number of reference patterns. The Quine-McCluskey method is extended to multiple-valued logic, and a computer program based on this extension is used for logical minimization. The results show that for moduli expressible as p^n, where p is a prime number and n is an integer greater than one, p-level coding provides significant reduction. The NAND-based processor is modified for
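
    As a hedged illustration of the residue number system technique mentioned above (moduli and values chosen arbitrarily, not the thesis's optimum set): a number is represented by its residues modulo pairwise-coprime moduli, and arithmetic proceeds independently channel by channel, which is what makes per-channel truth-table look-up attractive.

        from math import prod

        MODULI = (5, 7, 9, 11, 13)   # pairwise coprime; range = their product

        def to_rns(x):
            return tuple(x % m for m in MODULI)

        def rns_add(a, b):
            # Digit-wise, carry-free addition: each modulus channel is
            # independent, so each can be handled by a small look-up table.
            return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

        x, y = 1234, 5678
        assert to_rns((x + y) % prod(MODULI)) == rns_add(to_rns(x), to_rns(y))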

  18. Parallel design of JPEG-LS encoder on graphics processing units

    NASA Astrophysics Data System (ADS)

    Duan, Hao; Fang, Yong; Huang, Bormin

    2012-01-01

    With recent technical advances in graphics processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high-performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve the compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels, and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on an NVIDIA GPU using the compute unified device architecture (CUDA) programming technology. We use the block-parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance, with a 26.3x speedup over the original CPU code.
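
    The CUDA JPEG-LS kernels themselves are not reproduced here; the following minimal Python sketch (with zlib standing in for the lossless codec) illustrates only the block-parallel strategy of compressing independent 64x64 blocks concurrently.

        import zlib
        import numpy as np
        from multiprocessing import Pool

        def compress_block(block):
            # Stand-in lossless codec; the record uses JPEG-LS kernels in CUDA.
            return zlib.compress(block.tobytes())

        def block_parallel_compress(image, bs=64, workers=4):
            h, w = image.shape
            blocks = [image[i:i + bs, j:j + bs]
                      for i in range(0, h, bs) for j in range(0, w, bs)]
            with Pool(workers) as pool:                  # blocks are independent,
                return pool.map(compress_block, blocks)  # so they compress in parallel

        if __name__ == '__main__':
            img = np.random.randint(0, 4096, (512, 512), dtype=np.uint16)
            chunks = block_parallel_compress(img)
            print(len(chunks), 'blocks,', sum(map(len, chunks)), 'bytes total')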

  19. Control of automatic processes: A parallel distributed-processing model of the stroop effect. Technical report

    SciTech Connect

    Cohen, J.D.; Dunbar, K.; McClelland, J.L.

    1988-06-16

    A growing body of evidence suggests that traditional views of automaticity are in need of revision. For example, automaticity has often been treated as an all-or-none phenomenon, and traditional theories have held that automatic processes are independent of attention. Yet recent empirical data suggest that automatic processes are continuous, and furthermore are subject to attentional control. In this paper we present a model of attention which addresses these issues. Using a parallel distributed processing framework, we propose that the attributes of automaticity depend upon the strength of a process and that strength increases with training. Using the Stroop effect as an example, we show how automatic processes are continuous and emerge gradually with practice. Specifically, we present a computational model of the Stroop task which simulates the time course of processing as well as the effects of learning.

  20. A 16-bit parallel processing in a molecular assembly

    PubMed Central

    Bandyopadhyay, Anirban; Acharya, Somobrata

    2008-01-01

    A machine assembly consisting of 17 identical molecules of 2,3,5,6-tetramethyl-1,4-benzoquinone (DRQ) executes 16 instructions at a time. A single DRQ is positioned at the center of a circular ring formed by 16 other DRQs, controlling their operation in parallel through hydrogen-bond channels. Each molecule is a logic machine and generates four instructions by rotating its alkyl groups. A single instruction executed by a scanning tunneling microscope tip on the central molecule can change the decisions of the 16 machines simultaneously, in over four billion (4^16) ways. This parallel communication represents a significant conceptual advance relative to today's fastest processors, which execute only one instruction at a time. PMID:18332437

  1. A Parallel Processing Algorithm for Remote Sensing Classification

    NASA Technical Reports Server (NTRS)

    Gualtieri, J. Anthony

    2005-01-01

    A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity, general-purpose, workstation-level computers running the Linux operating system. For example, on the Medusa cluster at NASA/GSFC, this provides supercomputing performance of 130 Gflops (Linpack benchmark) at moderate cost ($370K). However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering Earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but which with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular, I use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach is to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) is described, and then details specific to the implementation are given. Timing results are then reported to show what speedups are possible using parallel computation. The paper closes with a discussion of the results.
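
    A minimal sketch of the task decomposition described above, with a trivial stand-in classifier in place of the trained SVM and illustrative tile counts; on a cluster, the local worker pool would be replaced by processes on separate nodes.

        import numpy as np
        from multiprocessing import Pool

        def classify_tile(tile):
            # Stand-in per-tile classifier; the record trains an SVM on ground
            # truth and applies it to the hyperspectral pixels of each tile.
            return (tile.mean(axis=1) > 0.5).astype(np.uint8)

        def parallel_classify(cube, n_tiles=8, workers=4):
            tiles = np.array_split(cube, n_tiles)   # independent tasks
            with Pool(workers) as pool:             # evaluated in parallel
                return np.concatenate(pool.map(classify_tile, tiles))

        if __name__ == '__main__':
            cube = np.random.rand(10000, 32)        # pixels x spectral bands
            print(parallel_classify(cube).shape)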

  2. An approach to real-time simulation using parallel processing

    NASA Technical Reports Server (NTRS)

    Blech, R. A.; Arpasi, D. J.

    1981-01-01

    Current applications of real-time simulations to the development of complex aircraft propulsion system controls have demonstrated the need for accurate, portable, and low-cost simulators. This paper presents a preliminary simulator design that uses a parallel computer organization to provide these features. The hardware and software for this prototype simulator are discussed. A detailed discussion of the inter-computer data transfer mechanism is also presented.

  3. Efficient uncertainty quantification of large two-dimensional optical systems with a parallelized stochastic Galerkin method.

    PubMed

    Zubac, Z; Fostier, J; De Zutter, D; Vande Ginste, D

    2015-11-30

    It is well known that geometrical variations due to manufacturing tolerances can degrade the performance of optical devices. In recent literature, polynomial chaos expansion (PCE) methods were proposed to model this statistical behavior. Nonetheless, traditional PCE solvers require a lot of memory, and their computational complexity leads to prohibitively long simulation times, making these methods intractable for large optical systems. The uncertainty quantification (UQ) of various types of large, two-dimensional lens systems is presented in this paper, leveraging a novel parallelized Multilevel Fast Multipole Method (MLFMM) based Stochastic Galerkin Method (SGM). It is demonstrated that this technique can handle large optical structures in reasonable time; e.g., a stochastic lens system with more than 10 million unknowns was solved in less than an hour using 3 compute nodes. The SGM, an intrusive PCE method, guarantees the accuracy of the results. In conjunction with the MLFMM, the use of a preconditioner, and the construction and implementation of a parallelized algorithm, high efficiency is achieved. This is demonstrated with parallel scalability graphs. The novel approach is illustrated for different types of lens systems, and numerical results are validated against a collocation method, which relies on reusing a traditional deterministic solver. The last example concerns a Cassegrain system with five random variables, for which a speed-up of more than 12x compared to the collocation method is achieved.

  4. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, D.B.

    1996-12-31

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system with a serial communication network and a high-speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high-density data among nodes, and to monitor each slave processor's status. The high-speed parallel processing network is used to effect the transmission of high-density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high-density data arrives or leaves the node. 6 figs.

  5. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, Dario B.

    1996-01-01

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.

  6. Parallel damage in mitochondrial and lysosomal compartments promotes efficient cell death with autophagy: The case of the pentacyclic triterpenoids

    PubMed Central

    Martins, Waleska K.; Costa, Érico T.; Cruz, Mário C.; Stolf, Beatriz S.; Miotto, Ronei; Cordeiro, Rodrigo M.; Baptista, Maurício S.

    2015-01-01

    The role of autophagy in cell death is still controversial, and much debate has concerned the transition from its pro-survival to its pro-death roles. The similar structures of the triterpenoids betulinic acid (BA) and oleanolic acid (OA) allowed us to prove that this transition involves parallel damage to mitochondria and lysosomes. After treating immortalized human skin keratinocytes (HaCaT) with either BA or OA, we evaluated cell viability, proliferation and mechanism of cell death, function and morphology of mitochondria and lysosomes, and the status of the autophagy flux. We also quantified the interactions of BA and OA with membrane mimics, both in vitro and in silico. Essentially, OA caused mitochondrial damage that relied on autophagy to rescue cellular homeostasis, which failed upon lysosomal inhibition by chloroquine or bafilomycin-A1. BA caused parallel damage to mitochondria and lysosomes, turning autophagy into a destructive process. The higher cytotoxicity of BA correlated with its stronger efficiency in damaging membrane mimics. Based on these findings, we underline the concept that autophagy will turn into a destructive outcome when there is parallel damage to the mitochondrial and lysosomal membranes. We trust that this concept will help the development of new drugs against aggressive cancers. PMID:26213355

  7. Parallel damage in mitochondrial and lysosomal compartments promotes efficient cell death with autophagy: The case of the pentacyclic triterpenoids.

    PubMed

    Martins, Waleska K; Costa, Érico T; Cruz, Mário C; Stolf, Beatriz S; Miotto, Ronei; Cordeiro, Rodrigo M; Baptista, Maurício S

    2015-07-27

    The role of autophagy in cell death is still controversial, and much debate has concerned the transition from its pro-survival to its pro-death roles. The similar structures of the triterpenoids betulinic acid (BA) and oleanolic acid (OA) allowed us to prove that this transition involves parallel damage to mitochondria and lysosomes. After treating immortalized human skin keratinocytes (HaCaT) with either BA or OA, we evaluated cell viability, proliferation and mechanism of cell death, function and morphology of mitochondria and lysosomes, and the status of the autophagy flux. We also quantified the interactions of BA and OA with membrane mimics, both in vitro and in silico. Essentially, OA caused mitochondrial damage that relied on autophagy to rescue cellular homeostasis, which failed upon lysosomal inhibition by chloroquine or bafilomycin-A1. BA caused parallel damage to mitochondria and lysosomes, turning autophagy into a destructive process. The higher cytotoxicity of BA correlated with its stronger efficiency in damaging membrane mimics. Based on these findings, we underline the concept that autophagy will turn into a destructive outcome when there is parallel damage to the mitochondrial and lysosomal membranes. We trust that this concept will help the development of new drugs against aggressive cancers.

  8. Toward a Model Framework of Generalized Parallel Componential Processing of Multi-Symbol Numbers

    ERIC Educational Resources Information Center

    Huber, Stefan; Cornelsen, Sonja; Moeller, Korbinian; Nuerk, Hans-Christoph

    2015-01-01

    In this article, we propose and evaluate a new model framework of parallel componential multi-symbol number processing, generalizing the idea of parallel componential processing of multi-digit numbers to the case of negative numbers by considering the polarity signs similar to single digits. In a first step, we evaluated this account by defining…

  10. Correlated activity supports efficient cortical processing

    PubMed Central

    Hung, Chou P.; Cui, Ding; Chen, Yueh-peng; Lin, Chia-pei; Levine, Matthew R.

    2015-01-01

    Visual recognition is a computational challenge that is thought to occur via efficient coding. An important concept is sparseness, a measure of coding efficiency. The prevailing view is that sparseness supports efficiency by minimizing redundancy and correlations in spiking populations. Yet, we recently reported that “choristers”, neurons that behave more similarly (have correlated stimulus preferences and spontaneous coincident spiking), carry more generalizable object information than uncorrelated neurons (“soloists”) in macaque inferior temporal (IT) cortex. The rarity of choristers (as low as 6% of IT neurons) indicates that they were likely missed in previous studies. Here, we report that correlation strength is distinct from sparseness (choristers are not simply broadly tuned neurons), that choristers are located in non-granular output layers, and that correlated activity predicts human visual search efficiency. These counterintuitive results suggest that a redundant correlational structure supports efficient processing and behavior. PMID:25610392

  11. HHFW Heating Efficiency on NSTX versus B_phi and Antenna k_parallel

    SciTech Connect

    Hosea, J.; Bell, R.; Bernabei, S.; LeBlanc, B.; Phillips, C. K.; Wilson, J. R.; Delgado-Aparicio, L.; Tritz, K.; Ryan, P.; Wilgen, J.; Sabbagh, S.; Yuh, H.

    2007-09-28

    HHFW RF power delivered to the core plasma of NSTX is strongly reduced as the launched wavelength is increased -- for B_phi = 4.5 kG, heating is ~1/2 as effective at k_phi = -7 m^-1 as at 14 m^-1, and ~1/10 as effective at -3 m^-1. Measured edge ion heating, attributable to parametric decay instability (PDI), increases with wavelength but not fast enough to account for the observed power loss. Surface fast waves (FW) may enhance both PDI and losses in sheaths and structures around the machine -- FW fields propagate closer to the wall with decreasing B_phi and k_parallel (onset n_e ∝ B_phi × k_parallel^2). A dramatic increase in core heating efficiency is observed at -7 m^-1 when B_phi is increased to 5.5 kG -- central T_e near 4 keV at P_RF = 2 MW. Also, the PDI losses are a weak function of B_phi and k_parallel, whereas the far-field RF poloidal magnetic field (at 5.5 kG) increases by a factor of ~3 when k_parallel is reduced from 14 m^-1 to -3 m^-1, suggesting a large increase in wall/sheath power loss and a major effect of surface fast waves on edge losses.

  12. Studies in optical parallel processing. [All optical and electro-optic approaches

    NASA Technical Reports Server (NTRS)

    Lee, S. H.

    1978-01-01

    Threshold and A/D devices for converting a gray-scale image into a binary one were investigated for all-optical and opto-electronic approaches to parallel processing. Integrated optical logic circuits (IOC) and optical parallel logic devices (OPAL) were studied as approaches to processing optical binary signals. In the IOC logic scheme, a single row of an optical image is coupled into the IOC substrate at a time through an array of optical fibers. Parallel processing is carried out on each image element of these rows in the IOC substrate, and the resulting output exits via a second array of optical fibers. The OPAL system for parallel processing, which uses a Fabry-Perot interferometer for image thresholding and analog-to-digital conversion, achieves a higher degree of parallel processing than is possible with IOC.

  13. New process technologies improve IGBT module efficiency

    SciTech Connect

    Motto, E.R.; Donlon, J.F.; Mori, Satoshi; Iida, Takahiko

    1995-12-31

    New process technologies are extending the application range of IGBT modules. A 1,400V IGBT with significantly improved efficiency has been developed using an optimized epitaxial (punch-through) process. This new 1,400V device has a square turn-off switching SOA making it suitable for 575/600 VAC inverter applications. A very low saturation voltage 250V IGBT has been developed using a trench gate structure. This new 250V device offers significant size and efficiency advantages in battery powered applications including fork lift truck and UPS inverters.

  14. Efficient sequential and parallel algorithms for finding edit distance based motifs.

    PubMed

    Pal, Soumitra; Xiao, Peng; Rajasekaran, Sanguthevar

    2016-08-18

    Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable, and there is a pressing need to develop efficient, exact and approximation algorithms to solve this problem. In this paper, we present several novel, exact, sequential and parallel algorithms for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d errors of the types substitution, insertion and deletion. One popular technique to solve the problem is to explore, for each input string, the set of all possible l-mers that belong to the d-neighborhood of any substring of the input string, and to output those which are common to all input strings. We introduce a novel and provably efficient neighborhood exploration technique. We show that it is enough to consider the candidates in the neighborhood that are at distance exactly d. We compactly represent these candidate motifs using wildcard characters and efficiently explore them with very few repetitions. Our sequential algorithm uses a trie-based data structure to efficiently store and sort the candidate motifs. Our parallel algorithm, in a multi-core shared-memory setting, uses arrays for storing and a novel modification of radix sort for sorting the candidate motifs. Algorithms for EMS are customarily evaluated on several challenging instances such as (8,1), (12,2), (16,3), (20,4), and so on. The best previously known algorithm, EMS1, is sequential and solves instances up to (16,3) in an estimated 3 days. Our sequential algorithms are more than 20 times faster on (16,3). On other hard instances, such as (9,2), (11,3), and (13,4), our algorithms are much faster. Our parallel algorithm achieves more than 600% scaling performance while using 16 threads. Our algorithms have pushed up the state-of-the-art of EMS solvers, and we believe that the techniques introduced in
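
    For orientation, a naive (hedged) sketch of d-neighborhood generation over the DNA alphabet; the paper's contribution is precisely to avoid this brute-force enumeration by considering only candidates at distance exactly d, compactly encoded with wildcards.

        SIGMA = 'ACGT'

        def neighborhood(s, d):
            """All strings within edit distance d of s (naive recursion).
            The paper's algorithm instead enumerates only candidates at
            distance exactly d, with wildcards, to avoid repetition."""
            if d == 0:
                return {s}
            out = set()
            for t in neighborhood(s, d - 1):
                for i in range(len(t)):
                    out.add(t[:i] + t[i + 1:])              # deletion
                    for a in SIGMA:
                        out.add(t[:i] + a + t[i + 1:])      # substitution
                        out.add(t[:i] + a + t[i:])          # insertion
                for a in SIGMA:
                    out.add(t + a)                          # append
            return out | {s}

        print(len(neighborhood('ACGTACGT', 1)))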

  15. Parallel architectures for image processing; Proceedings of the Meeting, Santa Clara, CA, Feb. 14, 15, 1990

    SciTech Connect

    Ghosh, J.; Harrison, C.G.

    1990-01-01

    The present conference discusses topics in the fields of VLSI-based and real-time image-processing systems, parallel architectures for image processing, image-processing algorithms, and image processing on the basis of artificial neural networks. Attention is given to a fixed-point VLSI architecture for high-speed image reconstruction, an orthogonal multiprocessor for image processing with neural networks, massively parallel processors in real-time applications, the use of the adiabatic approximation as a tool in image estimation, parallel algorithms for contour-extraction and coding, and a parallel architecture for multidimensional image processing. Also discussed are concurrent image-processing on hypercube multicomputers, neural-network simulation on a reduced-mesh-of-trees organization, and a goal-seeking neural net for recall and recognition.

  16. Efficient time-dependent density functional theory approximations for hybrid density functionals: Analytical gradients and parallelization

    NASA Astrophysics Data System (ADS)

    Petrenko, Taras; Kossmann, Simone; Neese, Frank

    2011-02-01

    In this paper, we present the implementation of efficient approximations to time-dependent density functional theory (TDDFT) within the Tamm-Dancoff approximation (TDA) for hybrid density functionals. For the calculation of the TDDFT/TDA excitation energies and analytical gradients, we combine the resolution of identity (RI-J) algorithm for the computation of the Coulomb terms and the recently introduced "chain of spheres exchange" (COSX) algorithm for the calculation of the exchange terms. It is shown that for extended basis sets, the RIJCOSX approximation leads to speedups of up to 2 orders of magnitude compared to traditional methods, as demonstrated for hydrocarbon chains. The accuracy of the adiabatic transition energies, excited state structures, and vibrational frequencies is assessed on a set of 27 excited states for 25 molecules with the configuration interaction singles and hybrid TDDFT/TDA methods using various basis sets. Compared to the canonical values, the typical error in transition energies is of the order of 0.01 eV. Similar to the ground-state results, excited state equilibrium geometries differ by less than 0.3 pm in the bond distances and 0.5° in the bond angles from the canonical values. The typical error in the calculated excited state normal coordinate displacements is of the order of 0.01, and relative error in the calculated excited state vibrational frequencies is less than 1%. The errors introduced by the RIJCOSX approximation are, thus, insignificant compared to the errors related to the approximate nature of the TDDFT methods and basis set truncation. For TDDFT/TDA energy and gradient calculations on Ag-TB2-helicate (156 atoms, 2732 basis functions), it is demonstrated that the COSX algorithm parallelizes almost perfectly (speedup ~26-29 for 30 processors). The exchange-correlation terms also parallelize well (speedup ~27-29 for 30 processors). The solution of the Z-vector equations shows a speedup of ~24 on 30 processors. The

  17. Efficient time-dependent density functional theory approximations for hybrid density functionals: analytical gradients and parallelization.

    PubMed

    Petrenko, Taras; Kossmann, Simone; Neese, Frank

    2011-02-07

    In this paper, we present the implementation of efficient approximations to time-dependent density functional theory (TDDFT) within the Tamm-Dancoff approximation (TDA) for hybrid density functionals. For the calculation of the TDDFT/TDA excitation energies and analytical gradients, we combine the resolution of identity (RI-J) algorithm for the computation of the Coulomb terms and the recently introduced "chain of spheres exchange" (COSX) algorithm for the calculation of the exchange terms. It is shown that for extended basis sets, the RIJCOSX approximation leads to speedups of up to 2 orders of magnitude compared to traditional methods, as demonstrated for hydrocarbon chains. The accuracy of the adiabatic transition energies, excited state structures, and vibrational frequencies is assessed on a set of 27 excited states for 25 molecules with the configuration interaction singles and hybrid TDDFT/TDA methods using various basis sets. Compared to the canonical values, the typical error in transition energies is of the order of 0.01 eV. Similar to the ground-state results, excited state equilibrium geometries differ by less than 0.3 pm in the bond distances and 0.5° in the bond angles from the canonical values. The typical error in the calculated excited state normal coordinate displacements is of the order of 0.01, and relative error in the calculated excited state vibrational frequencies is less than 1%. The errors introduced by the RIJCOSX approximation are, thus, insignificant compared to the errors related to the approximate nature of the TDDFT methods and basis set truncation. For TDDFT/TDA energy and gradient calculations on Ag-TB2-helicate (156 atoms, 2732 basis functions), it is demonstrated that the COSX algorithm parallelizes almost perfectly (speedup ~26-29 for 30 processors). The exchange-correlation terms also parallelize well (speedup ~27-29 for 30 processors). The solution of the Z-vector equations shows a speedup of ~24 on 30 processors. The

  18. Signal processing applications of massively parallel charge domain computing devices

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Barhen, Jacob (Inventor); Toomarian, Nikzad (Inventor)

    1999-01-01

    The present invention is embodied in a charge coupled device (CCD)/charge injection device (CID) architecture capable of performing a Fourier transform by simultaneous matrix-vector multiplication (MVM) operations in respective plural CCD/CID arrays in parallel in O(1) steps. For example, in one embodiment, a first CCD/CID array stores charge packets representing a first matrix operator based upon permutations of a Hartley transform, and a second CCD/CID array stores charge packets representing a second matrix operator based upon different permutations of a Hartley transform. The incoming vector is applied to the inputs of the two CCD/CID arrays simultaneously, and the real and imaginary parts of the Fourier transform are produced simultaneously in the time required to perform a single MVM operation in a CCD/CID array.
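
    A numerical sketch (ordinary floating-point matrix-vector products, not charge-domain hardware) of how two permutation-related Hartley operators can yield the real and imaginary parts of the Fourier transform, assuming the standard cas kernel; the specific operators of the patent may differ.

        import numpy as np

        N = 8
        k = np.arange(N)
        theta = 2 * np.pi * np.outer(k, k) / N
        H1 = np.cos(theta) + np.sin(theta)      # Hartley (cas) kernel
        H2 = H1[(-k) % N, :]                    # row-permuted kernel, gives H(-k)

        x = np.random.rand(N)
        h1, h2 = H1 @ x, H2 @ x                 # the two simultaneous MVMs
        F = (h1 + h2) / 2 + 1j * (h2 - h1) / 2  # real and imaginary parts

        assert np.allclose(F, np.fft.fft(x))    # matches the DFT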

  19. Quality and efficiency successes leveraging IT and new processes.

    PubMed

    Chaiken, Barry P; Christian, Charles E; Johnson, Liz

    2007-01-01

    Today, healthcare annually invests billions of dollars in information technology, including clinical systems, electronic medical records and interoperability platforms. While continued investment and parallel development of standards are critical to secure exponential benefits from clinical information technology, intelligent and creative redesign of processes through path innovation is necessary to deliver meaningful value. The reports from two organizations included here review the steps taken to reinvent clinical processes that best leverage information technology to deliver safer and more efficient care. Good Samaritan Hospital, Vincennes, Indiana, implemented electronic charting, point-of-care bar coding of medications prior to administration, and integrated clinical documentation for nursing, laboratory, radiology and pharmacy. Tenet Healthcare, during its implementation and deployment of multiple clinical systems across several hospitals, focused on planning that included team-based process redesign. In addition, Tenet constructed valuable and measurable metrics that link outcomes with its strategic goals.

  20. Creation of the BMA ensemble for SST using a parallel processing technique

    NASA Astrophysics Data System (ADS)

    Kim, Kwangjin; Lee, Yang Won

    2013-10-01

    Despite serving the same purpose, each satellite product has a different value because of its inherent uncertainty. Moreover, satellite products have been accumulated over a long period, and their variety and volume are enormous, so efforts to reduce this uncertainty and to handle such large data volumes are necessary. In this paper, we create an ensemble Sea Surface Temperature (SST) product using MODIS Aqua, MODIS Terra, and COMS (Communication, Ocean and Meteorological Satellite). We used Bayesian Model Averaging (BMA) as the ensemble method. The principle of BMA is to synthesize the conditional probability density function (PDF) using posterior probabilities as weights; the posterior probabilities are estimated using the EM algorithm, and the BMA PDF is obtained as a weighted average. As a result, the ensemble SST showed the lowest RMSE and MAE, which demonstrates the applicability of BMA to satellite data ensembles. As future work, parallel processing techniques using the Hadoop framework will be adopted for more efficient computation of very large satellite datasets.
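
    A hedged miniature of the BMA computation described above, with synthetic data in place of the satellite SSTs: Gaussian member kernels, EM-estimated weights, and the ensemble obtained as a weighted average. All names and values are illustrative.

        import numpy as np

        def bma_em(y, F, iters=50):
            """y: (n,) validation SSTs; F: (n, K) member estimates.
            Returns weights w and common variance s2 via a small EM loop,
            following the usual BMA mixture-of-Gaussians formulation."""
            n, K = F.shape
            w = np.full(K, 1.0 / K)
            s2 = np.var(y[:, None] - F)
            for _ in range(iters):
                dens = w * np.exp(-0.5 * (y[:, None] - F) ** 2 / s2) \
                         / np.sqrt(2 * np.pi * s2)
                z = dens / dens.sum(axis=1, keepdims=True)   # E-step
                w = z.mean(axis=0)                           # M-step: weights
                s2 = np.sum(z * (y[:, None] - F) ** 2) / n   # M-step: variance
            return w, s2

        rng = np.random.default_rng(1)
        truth = 20 + rng.standard_normal(500)
        F = np.column_stack([truth + rng.normal(0, s, 500) for s in (0.3, 0.6, 1.0)])
        w, s2 = bma_em(truth, F)
        print('weights:', w.round(3))    # the BMA ensemble mean is F @ w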

  1. Parallel-Processing CMOS Circuitry for M-QAM and 8PSK TCM

    NASA Technical Reports Server (NTRS)

    Gray, Andrew; Lee, Dennis; Hoy, Scott; Fisher, Dave; Fong, Wai; Ghuman, Parminder

    2009-01-01

    There has been some additional development of parts reported in "Multi-Modulator for Bandwidth-Efficient Communication" (NPO-40807), NASA Tech Briefs, Vol. 32, No. 6 (June 2009), page 34. The focus was on (1) the generation of M-order quadrature amplitude modulation (M-QAM) and octonary phase-shift-keying, trellis-coded modulation (8PSK TCM); (2) the use of square-root raised-cosine pulse-shaping filters; (3) a parallel-processing architecture that enables low-speed complementary metal oxide/semiconductor (CMOS) circuitry to perform the coding, modulation, and pulse-shaping computations at a high rate; and (4) implementation of the architecture in a CMOS field-programmable gate array.

  2. Parallel processing of remotely sensed data: Application to the ATSR-2 instrument

    NASA Astrophysics Data System (ADS)

    Simpson, J.; McIntire, T.; Berg, J.; Tsou, Y.

    2007-01-01

    Massively parallel computational paradigms can mitigate many issues associated with the analysis of large and complex remotely sensed data sets. Recently, the Beowulf cluster has emerged as the most attractive, massively parallel architecture due to its low cost and high performance. Whereas most Beowulf designs have emphasized numerical modeling applications, the Parallel Image Processing Environment (PIPE) specifically addresses the unique requirements of remote sensing applications. Automated, parallelization of user-defined analyses is fully supported. A neural network application, applied to Along Track Scanning Radiometer-2 (ATSR-2) data shows the advantages and performance characteristics of PIPE.

  3. Arts Integration Parallels Between Music and Reading: Process, Product and Affective Response.

    ERIC Educational Resources Information Center

    Merrion, Margaret Dee

    The process of aesthetic education is not limited to the fine arts. Parallels may be identified in the language arts and particularly in the art of creative reading. As in a musical experience, a creative reader will apprehend the content of the literature and couple personal feelings with the events of the reading experience. Parallel brain…

  4. BioFVM: an efficient, parallelized diffusive transport solver for 3-D biological simulations

    PubMed Central

    Ghaffarizadeh, Ahmadreza; Friedman, Samuel H.; Macklin, Paul

    2016-01-01

    Motivation: Computational models of multicellular systems require solving systems of PDEs for the release, uptake, decay and diffusion of multiple substrates in 3D, particularly when incorporating the impact of drugs, growth substrates and signaling factors on cell receptors and subcellular systems biology. Results: We introduce BioFVM, a diffusive transport solver tailored to biological problems. BioFVM can simulate the release and uptake of many substrates by cell and bulk sources, and diffusion and decay in large 3-D domains. It has been parallelized with OpenMP, allowing efficient simulations on desktop workstations or single supercomputer nodes. The code is stable even for large time steps, with linear computational cost scaling. Solutions are first-order accurate in time and second-order accurate in space. The code can be run by itself or as part of a larger simulator. Availability and implementation: BioFVM is written in C++ with parallelization in OpenMP. It is maintained and available for download at http://BioFVM.MathCancer.org and http://BioFVM.sf.net under the Apache License (v2.0). Contact: paul.macklin@usc.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26656933
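
    BioFVM itself uses an implicit, operator-split scheme for stability at large time steps; the following explicit 1-D sketch (illustrative constants, periodic boundaries) shows only the diffusion-decay-source balance being solved.

        import numpy as np

        def step(rho, D, lam, S, dx, dt):
            """One explicit finite-volume update of
            d(rho)/dt = D * laplacian(rho) - lam * rho + S."""
            lap = (np.roll(rho, 1) - 2 * rho + np.roll(rho, -1)) / dx ** 2
            return rho + dt * (D * lap - lam * rho + S)

        rho = np.zeros(100)
        S = np.zeros(100); S[50] = 1.0    # point source (e.g., a secreting cell)
        for _ in range(1000):             # D*dt/dx^2 = 0.1, explicitly stable
            rho = step(rho, D=1e-2, lam=0.01, S=S, dx=0.1, dt=0.1)
        print(rho.max())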

  5. Parallel and series fed microstrip array with high efficiency and low cross polarization

    NASA Technical Reports Server (NTRS)

    Huang, John (Inventor)

    1995-01-01

    A microstrip array antenna providing a vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications, with a physical area of 1.7 m by 0.17 m, comprises two rows of patch elements and employs a parallel feed to the left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel, with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have a tapered distribution with identical transmission-line sections, are terminated with half-wavelength-long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

  6. An efficient and fast parallel method for Volterra integral equations of Abel type

    NASA Astrophysics Data System (ADS)

    Capobianco, Giovanni; Conte, Dajana

    2006-05-01

    In this paper we present an efficient and fast parallel waveform relaxation method for Volterra integral equations of Abel type, obtained by reformulating a nonstationary waveform relaxation method for systems of equations with a linear constant-coefficient kernel. To this aim we consider the Laplace transform of the equation, and there we apply the recurrence relation given by the Chebyshev polynomial acceleration for algebraic linear systems. Back in the time domain, we obtain a three-term recursion which requires, at each iteration, the evaluation of convolution integrals, where only the Laplace transform of the kernel is known. For this calculation we can use a fast convolution algorithm. Numerical experiments have also been carried out on problems where it is not possible to use the original nonstationary method, obtaining good improvements in the rate of convergence with respect to the stationary method.

  7. A framework for parallelized efficient global optimization with application to vehicle crashworthiness optimization

    NASA Astrophysics Data System (ADS)

    Hamza, Karim; Shalaby, Mohamed

    2014-09-01

    This article presents a framework for simulation-based design optimization of computationally expensive problems, where economizing the generation of sample designs is highly desirable. One popular approach for such problems is efficient global optimization (EGO), where an initial set of design samples is used to construct a kriging model, which is then used to generate new 'infill' sample designs in regions of the search space with a high expectancy of improvement. This article attempts to address one of the limitations of EGO, namely that the generation of infill samples can become a difficult optimization problem in its own right, and to allow the generation of multiple samples at a time in order to take advantage of parallel computing in the evaluation of the new samples. The proposed approach is tested on analytical functions and then applied to the vehicle crashworthiness design of a full Geo Metro model undergoing frontal crash conditions.
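
    A hedged, single-point-infill sketch of the EGO idea using scikit-learn's Gaussian process as the kriging model; the article's parallel multi-sample infill machinery is not reproduced, and the toy objective and settings are illustrative.

        import numpy as np
        from scipy.stats import norm
        from sklearn.gaussian_process import GaussianProcessRegressor

        def expected_improvement(Xc, gp, f_best):
            mu, sd = gp.predict(Xc, return_std=True)
            sd = np.maximum(sd, 1e-12)
            z = (f_best - mu) / sd                    # minimization convention
            return (f_best - mu) * norm.cdf(z) + sd * norm.pdf(z)

        f = lambda x: np.sin(3 * x) + x ** 2          # toy "expensive" objective
        X = np.random.rand(6, 1) * 2 - 1              # initial design samples
        y = f(X).ravel()
        for _ in range(10):                           # sequential infill loop
            gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
            Xc = np.linspace(-1, 1, 400).reshape(-1, 1)
            x_new = Xc[np.argmax(expected_improvement(Xc, gp, y.min()))]
            X = np.vstack([X, x_new]); y = np.append(y, f(x_new))
        print('best objective found:', y.min())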

  8. Efficient Massively-Parallel Approach for Solving the Time-Dependent Schrödinger Equation

    NASA Astrophysics Data System (ADS)

    Schneider, B. I.; Hu, S. X.; Collins, L. A.

    2006-05-01

    A variety of problems in physics and chemistry require the solution of the time-dependent Schrödinger equation (TDSE), including atoms and molecules in oscillating electromagnetic fields, atomic collisions, ultracold systems, and materials subjected to external forces. We describe an approach in which the Finite Element Discrete Variable Representation (FEDVR) is combined with the Real-Space Product (RSP) algorithm to generate an efficient and highly accurate method for the solution of both the linear and nonlinear TDSE. The FEDVR provides a highly accurate spatial representation using a minimum number of grid points (N), while the RSP algorithm propagates the wavefunction in O(N) operations per time step. Parallelization of the method is transparent and is implemented by distributing one or two spatial dimensions across the available processors within the Message Passing Interface (MPI) scheme. The complete formalism and a number of three-dimensional (3D) examples are given.
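
    The FEDVR/RSP machinery is not reproduced here; as a generic stand-in, this FFT split-operator sketch (harmonic potential, atomic units, illustrative grid) shows the per-step structure shared by such product-formula propagators.

        import numpy as np

        # Generic split-operator step for i d(psi)/dt = (-1/2 d2/dx2 + V) psi.
        # The FEDVR/RSP method of the record achieves similar per-step cost
        # on a finite-element grid rather than with FFTs.
        N, L, dt = 512, 40.0, 0.01
        x = np.linspace(-L / 2, L / 2, N, endpoint=False)
        k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)
        V = 0.5 * x ** 2                              # harmonic trap
        psi = np.exp(-(x - 1.0) ** 2)                 # displaced Gaussian
        psi /= np.linalg.norm(psi)

        expV = np.exp(-0.5j * V * dt)                 # half-step in potential
        expT = np.exp(-0.5j * k ** 2 * dt)            # full kinetic step
        for _ in range(1000):
            psi = expV * np.fft.ifft(expT * np.fft.fft(expV * psi))
        print('norm preserved:', np.linalg.norm(psi))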

  9. Neural processes in symmetry perception: a parallel spatio-temporal model.

    PubMed

    Zhu, Tao

    2014-04-01

    Symmetry is usually computationally expensive to detect reliably, while it is relatively easy to perceive. In spite of many attempts to understand the neurofunctional properties of symmetry processing, no symmetry-specific activation was found in earlier cortical areas. Psychophysical evidence relating to the processing mechanisms suggests that the basic processes of symmetry perception would not perform a serial, point-by-point comparison of structural features but rather operate in parallel. Here, modeling of neural processes in psychophysical detection of bilateral texture symmetry is considered. A simple fine-grained algorithm that is capable of performing symmetry estimation without explicit comparison of remote elements is introduced. A computational model of symmetry perception is then described to characterize the underlying mechanisms as one-dimensional spatio-temporal neural processes, each of which is mediated by intracellular horizontal connections in primary visual cortex and adopts the proposed algorithm for the neural computation. Simulated experiments have been performed to show the efficiency and the dynamics of the model. Model and human performances are comparable for symmetry perception of intensity images. Interestingly, the responses of V1 neurons to propagation activities reflecting higher-order perceptual computations have been reported in neurophysiologic experiments.

  10. Parallel Block Structured Adaptive Mesh Refinement on Graphics Processing Units

    SciTech Connect

    Beckingsale, D. A.; Gaudin, W. P.; Hornung, R. D.; Gunney, B. T.; Gamblin, T.; Herdman, J. A.; Jarvis, S. A.

    2014-11-17

    Block-structured adaptive mesh refinement is a technique that can be used when solving partial differential equations to reduce the number of zones necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a native GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an eight-node cluster, and over four thousand nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and has been scaled to over four thousand GPUs using a combination of MPI and CUDA.

  11. Parallel systems of error processing in the brain.

    PubMed

    Yordanova, Juliana; Falkenstein, Michael; Hohnsbein, Joachim; Kolev, Vasil

    2004-06-01

    Major neurophysiological principles of performance monitoring are not precisely known. It is a current debate in cognitive neuroscience if an error-detection neural system is involved in behavioral control and adaptation. Such a system should generate error-specific signals, but their existence is questioned by observations that correct and incorrect reactions may elicit similar neuroelectric potentials. A new approach based on a time-frequency decomposition of event-related brain potentials was applied to extract covert sub-components from the classical error-related negativity (Ne) and correct-response-related negativity (Nc) in humans. A unique error-specific sub-component from the delta (1.5-3.5 Hz) frequency band was revealed only for Ne, which was associated with error detection at the level of overall performance monitoring. A sub-component from the theta frequency band (4-8 Hz) was associated with motor response execution, but this sub-component also differentiated error from correct reactions indicating error detection at the level of movement monitoring. It is demonstrated that error-specific signals do exist in the brain. More importantly, error detection may occur in multiple functional systems operating in parallel at different levels of behavioral control.

  12. Advantages of Parallel Processing and the Effects of Communications Time

    NASA Technical Reports Server (NTRS)

    Eddy, Wesley M.; Allman, Mark

    2000-01-01

    Many computing tasks involve heavy mathematical calculations, or analyzing large amounts of data. These operations can take a long time to complete using only one computer. Networks such as the Internet provide many computers with the ability to communicate with each other. Parallel or distributed computing takes advantage of these networked computers by arranging them to work together on a problem, thereby reducing the time needed to obtain the solution. The drawback to using a network of computers to solve a problem is the time wasted in communicating between the various hosts. The application of distributed computing techniques to a space environment or to use over a satellite network would therefore be limited by the amount of time needed to send data across the network, which would typically take much longer than on a terrestrial network. This experiment shows how much faster a large job can be performed by adding more computers to the task, what role communications time plays in the total execution time, and the impact a long-delay network has on a distributed computing system.
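
    A toy (hedged) model of the trade-off this experiment measures: fixed serial work, parallel work divided among N hosts, and communication cost that grows with N and with per-message latency. All numbers below are illustrative, not the report's measurements.

        # Toy model: total time = serial part + parallel part / N + comm(N).
        def total_time(N, t_serial=5.0, t_parallel=95.0, latency=0.05, rounds=100):
            comm = rounds * latency * N     # e.g., results funneled to one master
            return t_serial + t_parallel / N + comm

        for latency in (0.001, 0.05, 0.5):  # LAN vs WAN vs satellite-like delay
            best = min(range(1, 65), key=lambda N: total_time(N, latency=latency))
            t1 = total_time(1, latency=latency)
            tb = total_time(best, latency=latency)
            print(f'latency {latency:5.3f}s: best N={best:2d}, '
                  f'speedup {t1 / tb:4.1f}x')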

  13. Integration of optoelectronic technologies for chip-to- chip interconnections and parallel pipeline processing

    NASA Astrophysics Data System (ADS)

    Wu, Jenming

    Digital information services such as multimedia systems and data communications require the processing and transfer of tremendous amounts of data. These data need to be stored, accessed, and delivered efficiently and reliably at high speed for various user applications. This represents a great challenge for current electronic systems. Electronics is effective in providing high-performance processing and computation, but its input/output (I/O) bandwidth is unable to scale with its processing power. Signal I/Os or interconnections are needed between processors and input devices, between processors in multiprocessor systems, and between processors and storage devices. Novel chip-to-chip interconnect technologies are needed to meet this challenge. This work integrates optoelectronic technologies for chip-to-chip interconnects and parallel pipeline processing. Photonic and electronic technologies are complementary in the sense that electronics is more suitable for high-speed, low-cost computation, and photonics is more suitable for high-bandwidth information transmission. Smart pixel technology uses electronics for logic switching and optics for chip-to-chip interconnects, thus combining the abilities of photonics and electronics nicely. This work describes both vertical and horizontal integration of smart pixel technologies for chip-to-chip optical interconnects and their applications. We present smart pixel VLSI designs in both hybrid CMOS/MQW smart pixel and monolithic GaAs smart pixel technologies. We use the CMOS/MQW technology for smart pixel array cellular logic (SPARCL) processors for SIMD parallel pipeline processing. We have tested the chip and constructed a prototype system for device characterization and system demonstration. We have verified the functionality of the system and characterized the electrical functions of the chip and the optoelectronic properties of the MQW devices. We have developed algorithms that utilize SPARCL for various

  14. Design of a dataway processor for a parallel image signal processing system

    NASA Astrophysics Data System (ADS)

    Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

    1995-04-01

    Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called the 'dataway processor', designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates 8 bits in parallel in full-duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON'. The hardware is fabricated using 0.5-micrometer CMOS technology and comprises about 200 K gates.

  15. A generic simulation cell method for developing extensible, efficient and readable parallel computational models

    NASA Astrophysics Data System (ADS)

    Honkonen, I.

    2015-03-01

    I present a method for developing extensible and modular computational models without sacrificing serial or parallel performance or source code readability. By using a generic simulation cell method I show that it is possible to combine several distinct computational models to run in the same computational grid without requiring modification of existing code. This is an advantage for the development and testing of, e.g., geoscientific software as each submodel can be developed and tested independently and subsequently used without modification in a more complex coupled program. An implementation of the generic simulation cell method presented here, generic simulation cell class (gensimcell), also includes support for parallel programming by allowing model developers to select which simulation variables of, e.g., a domain-decomposed model to transfer between processes via a Message Passing Interface (MPI) library. This allows the communication strategy of a program to be formalized by explicitly stating which variables must be transferred between processes for the correct functionality of each submodel and the entire program. The generic simulation cell class requires a C++ compiler that supports a version of the language standardized in 2011 (C++11). The code is available at https://github.com/nasailja/gensimcell for everyone to use, study, modify and redistribute; those who do are kindly requested to acknowledge and cite this work.

  16. The generic simulation cell method for developing extensible, efficient and readable parallel computational models

    NASA Astrophysics Data System (ADS)

    Honkonen, I.

    2014-07-01

    I present a method for developing extensible and modular computational models without sacrificing serial or parallel performance or source code readability. By using a generic simulation cell method I show that it is possible to combine several distinct computational models to run in the same computational grid without requiring any modification of existing code. This is an advantage for the development and testing of computational modeling software as each submodel can be developed and tested independently and subsequently used without modification in a more complex coupled program. Support for parallel programming is also provided by allowing users to select which simulation variables to transfer between processes via a Message Passing Interface library. This allows the communication strategy of a program to be formalized by explicitly stating which variables must be transferred between processes for the correct functionality of each submodel and the entire program. The generic simulation cell class presented here requires a C++ compiler that supports variadic templates which were standardized in 2011 (C++11). The code is available at: https://github.com/nasailja/gensimcell for everyone to use, study, modify and redistribute; those that do are kindly requested to cite this work.

  17. Parallel plan execution with self-processing networks

    NASA Technical Reports Server (NTRS)

    D'Autrechy, C. Lynne; Reggia, James A.

    1989-01-01

    A critical issue for space operations is how to develop and apply advanced automation techniques to reduce the cost and complexity of working in space. In this context, it is important to examine how recent advances in self-processing networks can be applied for planning and scheduling tasks. For this reason, the feasibility of applying self-processing network models to a variety of planning and control problems relevant to spacecraft activities is being explored. Goals are to demonstrate that self-processing methods are applicable to these problems, and that MIRRORS/II, a general purpose software environment for implementing self-processing models, is sufficiently robust to support development of a wide range of application prototypes. Using MIRRORS/II and marker passing modelling techniques, a model of the execution of a Spaceworld plan was implemented. This is a simplified model of the Voyager spacecraft which photographed Jupiter, Saturn, and their satellites. It is shown that plan execution, a task usually solved using traditional artificial intelligence (AI) techniques, can be accomplished using a self-processing network. The fact that self-processing networks were applied to other space-related tasks, in addition to the one discussed here, demonstrates the general applicability of this approach to planning and control problems relevant to spacecraft activities. It is also demonstrated that MIRRORS/II is a powerful environment for the development and evaluation of self-processing systems.

  19. Control of automatic processes: A parallel distributed-processing account of the Stroop effect. Technical report

    SciTech Connect

    Cohen, J.D.; Dunbar, K.; McClelland, J.L.

    1989-11-22

    A growing body of evidence suggests that traditional views of automaticity are in need of revision. For example, automaticity has often been treated as an all-or-none phenomenon, and traditional theories have held that automatic processes are independent of attention. Yet recent empirical data suggest that automatic processes are continuous, and furthermore are subject to attentional control. In this paper we present a model of attention which addresses these issues. Using a parallel distributed processing framework we propose that the attributes of automaticity depend upon the strength of a processing pathway and that strength increases with training. Using the Stroop effect as an example, we show how automatic processes are continuous and emerge gradually with practice. Specifically, we present a computational model of the Stroop task which simulates the time course of processing as well as the effects of learning. This was accomplished by combining the cascade mechanism described by McClelland (1979) with the back propagation learning algorithm (Rumelhart, Hinton, Williams, 1986). The model is able to simulate performance in the standard Stroop task, as well as aspects of performance in variants of this task which manipulate SOA, response set, and degree of practice. In the discussion we contrast our model with other models, and indicate how it relates to many of the central issues in the literature on attention, automaticity, and interference.
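
    The cascade mechanism at the heart of the model admits a compact statement; the sketch below shows the activation update only (illustrative: the full model adds trained connection weights, logistic units, and task-demand inputs):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the cascade activation rule of McClelland (1979): each unit's
// activation moves a fraction k toward its current net input per time
// step, so outputs build up gradually rather than instantaneously.
void cascade_step(std::vector<double>& act,
                  const std::vector<double>& net_input,
                  double k /* rate constant, 0 < k <= 1 */) {
    for (std::size_t i = 0; i < act.size(); ++i)
        act[i] = (1.0 - k) * act[i] + k * net_input[i];
}
```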

  20. An iterative expanding and shrinking process for processor allocation in mixed-parallel workflow scheduling.

    PubMed

    Huang, Kuo-Chan; Wu, Wei-Ya; Wang, Feng-Jian; Liu, Hsiao-Ching; Hung, Chun-Hao

    2016-01-01

    Parallel computation has been widely applied in a variety of large-scale scientific and engineering applications. Many studies indicate that exploiting both task and data parallelism, i.e. mixed-parallel workflows, to solve large computational problems achieves better efficiency than either pure task parallelism or pure data parallelism. Scheduling traditional workflows of pure task parallelism on parallel systems has long been known to be an NP-complete problem. Mixed-parallel workflow scheduling has to deal with the additional challenge of processor allocation. In this paper, we explore the processor allocation issue in scheduling mixed-parallel workflows of moldable tasks (M-tasks) and propose an Iterative Allocation Expanding and Shrinking (IAES) approach. Compared to previous approaches, our IAES has two distinguishing features. The first is allocating more processors to the tasks on allocated critical paths for effectively reducing the makespan of workflow execution. The second is allowing the processor allocation of an M-task to shrink during the iterative procedure, resulting in a more flexible and effective process for finding better allocations. The proposed IAES approach has been evaluated with a series of simulation experiments and compared to several well-known previous methods, including CPR, CPA, MCPA, and MCPA2. The experimental results indicate that our IAES approach outperforms those previous methods significantly in most situations, especially when nodes in the same layer of a workflow have unequal workloads.
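
    A schematic of the expand/shrink iteration, as we read it from the abstract (not the authors' published pseudocode; makespan and critical_tasks are assumed supplied by the scheduler, and capacity checks are omitted):

```cpp
#include <functional>
#include <vector>

// Schematic of an iterative expand/shrink allocation loop for moldable
// tasks. alloc[t] is task t's processor count; makespan evaluates a
// schedule for a given allocation; critical_tasks returns the tasks on
// allocated critical paths.
void iaes(std::vector<int>& alloc,
          const std::function<double(const std::vector<int>&)>& makespan,
          const std::function<std::vector<int>(const std::vector<int>&)>& critical_tasks) {
    double best = makespan(alloc);
    for (bool improved = true; improved; ) {
        improved = false;
        for (int t : critical_tasks(alloc)) {            // expand critical tasks
            alloc[t] += 1;
            const double m = makespan(alloc);
            if (m < best) { best = m; improved = true; }
            else          { alloc[t] -= 1; }             // revert if no gain
        }
        for (std::size_t t = 0; t < alloc.size(); ++t) { // allow shrinking, too
            if (alloc[t] <= 1) continue;
            alloc[t] -= 1;
            const double m = makespan(alloc);
            if (m < best) { best = m; improved = true; }
            else          { alloc[t] += 1; }             // revert if no gain
        }
    }
}
```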

  1. Next Generation Parallelization Systems for Processing and Control of PDS Image Node Assets

    NASA Astrophysics Data System (ADS)

    Verma, R.

    2017-06-01

    We present next-generation parallelization tools to help Planetary Data System (PDS) Imaging Node (IMG) better monitor, process, and control changes to nearly 650 million file assets and over a dozen machines on which they are referenced or stored.

  2. Real-time parallel implementation of Pulse-Doppler radar signal processing chain on a massively parallel machine based on multi-core DSP and Serial RapidIO interconnect

    NASA Astrophysics Data System (ADS)

    Klilou, Abdessamad; Belkouch, Said; Elleaume, Philippe; Le Gall, Philippe; Bourzeix, François; Hassani, Moha M'Rabet

    2014-12-01

    Pulse-Doppler radars require high-computing power. A massively parallel machine has been developed in this paper to implement a Pulse-Doppler radar signal processing chain in real-time fashion. The proposed machine consists of two C6678 digital signal processors (DSPs), each with eight DSP cores, interconnected with Serial RapidIO (SRIO) bus. In this study, each individual core is considered as the basic processing element; hence, the proposed parallel machine contains 16 processing elements. A straightforward model has been adopted to distribute the Pulse-Doppler radar signal processing chain. This model provides low latency, but communication inefficiency limits system performance. This paper proposes several optimizations that greatly reduce the inter-processor communication in a straightforward model and improves the parallel efficiency of the system. A use case of the Pulse-Doppler radar signal processing chain has been used to illustrate and validate the concept of the proposed mapping model. Experimental results show that the parallel efficiency of the proposed parallel machine is about 90%.

  3. Probabilistic Modeling of Tephra Dispersion using Parallel Processing

    NASA Astrophysics Data System (ADS)

    Hincks, T.; Bonadonna, C.; Connor, L.; Connor, C.; Sparks, S.

    2002-12-01

    Numerical models of tephra accumulation are important tools in assessing hazards of volcanic eruptions. Such tools can be used far in advance of future eruptions to calculate possible hazards as conditional probabilities. For example, given that a volcanic eruption occurs, what is the expected range of tephra deposition in a specific location or across a region? An empirical model is presented that uses physical characteristics (e.g., volume, column height, particle size distribution) of a volcanic eruption to calculate expected tephra accumulation at geographic locations distant from the vent. This model results from the combination of the Connor et al. (2001) and Bonadonna et al. (1998, 2002) numerical approaches and is based on application of the diffusion advection equation using a stratified atmosphere and particle fall velocities that account for particle shape, density, and variation in Reynolds number along the path of descent. Distribution of particles in the eruption column is a major source of uncertainty in estimation of tephra hazards. We adopt an approach in which several models of the volcanic column may be used and the impact of these various source-term models on the hazard estimated. Cast probabilistically, this model can use characteristics of historical eruptions, or data from analogous eruptions, to predict the expected tephra deposition from future eruptions. Application of such a model for computing a large number of events over a grid of many points is computationally expensive. In fact, the utility of the model for stochastic simulations of volcanic eruptions was limited by long execution time. To address this concern, we created a parallel version in C and MPI, a message passing interface, to run on a Beowulf cluster, a private network of reasonably high performance computers. We have discovered that grid or input decomposition and self-scheduling techniques lead to essentially linear speed-up in the code. This means that the code is readily
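
    The self-scheduling technique mentioned above is the classic master/worker pattern; a minimal MPI sketch follows (compute_block is a hypothetical stand-in for the per-block hazard computation):

```cpp
#include <mpi.h>

// Minimal self-scheduling (master/worker) skeleton: rank 0 hands out one
// grid block at a time, so faster workers automatically receive more
// work, giving near-linear speed-up for independent blocks.
void self_schedule(int n_blocks) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int DONE = -1;
    if (rank == 0) {                                    // master
        int next = 0, active = size - 1, block;
        MPI_Status st;
        while (active > 0) {
            MPI_Recv(&block, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            int task = (next < n_blocks) ? next++ : DONE;
            if (task == DONE) --active;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    } else {                                            // worker
        int task = 0;                                   // initial "ready" message
        for (;;) {
            MPI_Send(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task == DONE) break;
            // compute_block(task);                     // hazard calc for this block
        }
    }
}
```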

  4. Parallel Processing Response Times and Experimental Determination of the Stopping Rule

    PubMed

    Townsend; Colonius

    1997-12-01

    It was formerly demonstrated that virtually all reasonable exhaustive serial models, and a more constrained set of exhaustive parallel models, cannot predict critical effects associated with self-terminating models. The present investigation greatly generalizes the parallel class of models covered by similar "impossibility" theorems. Specifically, we prove that if an exhaustive parallel model is not super capacity, and if targets are processed at least as fast as non-targets, then it cannot predict such (self-terminating) effects. Such effects are ubiquitous in the experimental literature, offering strong confirmation for self-terminating processing. Copyright 1997 Academic Press.

  5. Performance of a VME-based parallel processing LIDAR data acquisition system (summary)

    SciTech Connect

    Moore, K.; Buttler, B.; Caffrey, M.; Soriano, C.

    1995-05-01

    It may be possible to make accurate real time, autonomous, 2 and 3 dimensional wind measurements remotely with an elastic backscatter Light Detection and Ranging (LIDAR) system by incorporating digital parallel processing hardware into the data acquisition system. In this paper, we report the performance of a commercially available digital parallel processing system in implementing the maximum correlation technique for wind sensing using actual LIDAR data. Timing and numerical accuracy are benchmarked against a standard microprocessor implementation.

  6. Efficient iteration in data-parallel programs with irregular and dynamically distributed data structures

    SciTech Connect

    Littlefield, R.J.

    1990-02-01

    To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. This provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using it can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.

  7. An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

    NASA Astrophysics Data System (ADS)

    Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

    2015-07-01

    This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for the current class of multicore architectures. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism; however, for 3D data migration, the resource requirements of the algorithm grow with the data size, which challenges its practical implementation even on current-generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute-intensive part of the Kirchhoff depth migration algorithm is the calculation of traveltime tables, due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for poststack and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations that depends upon the available node memory and the size of the data to be migrated at runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, making it advantageous over conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM-series supercomputer. Optimization, performance and scalability experiment results, along with the migration outcome, show the effectiveness of the parallel algorithm.
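
    A schematic of a flexi-depth loop as we read the idea (helper names are hypothetical): the volume is migrated in depth slabs sized to available node memory, with OpenMP threads sharing each slab; the MPI distribution of the imaging space across nodes is omitted here.

```cpp
#include <algorithm>

// Schematic only: migrate the image volume in depth slabs so that
// traveltime tables are computed or loaded per slab rather than for the
// whole volume at once.
void migrate_flexi_depth(int n_depths, int slab_depths) {
    for (int z0 = 0; z0 < n_depths; z0 += slab_depths) {  // one depth iteration
        const int z1 = std::min(z0 + slab_depths, n_depths);
        // build_traveltime_tables(z0, z1);               // only depths [z0, z1)
        #pragma omp parallel for schedule(dynamic)
        for (int z = z0; z < z1; ++z) {
            // migrate_depth(z);                          // Kirchhoff summation at depth z
        }
    }
}
```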

  8. Parallel computing for simultaneous iterative tomographic imaging by graphics processing units

    NASA Astrophysics Data System (ADS)

    Bello-Maldonado, Pedro D.; López, Ricardo; Rogers, Colleen; Jin, Yuanwei; Lu, Enyue

    2016-05-01

    In this paper, we address the problem of accelerating inversion algorithms for nonlinear acoustic tomographic imaging by parallel computing on graphics processing units (GPUs). Nonlinear inversion algorithms for tomographic imaging often rely on iterative algorithms for solving an inverse problem and are thus computationally intensive. We study the simultaneous iterative reconstruction technique (SIRT) for the multiple-input-multiple-output (MIMO) tomography algorithm, which enables parallel computation of the grid points as well as parallel execution of multiple source excitations. Using GPUs and the Compute Unified Device Architecture (CUDA) programming model, an overall speedup of 26.33x was achieved when combining both approaches, compared with sequential algorithms. Furthermore, we propose an adaptive iterative relaxation factor and the use of non-uniform weights to improve the overall convergence of the algorithm. Using these techniques, fast computations can be performed in parallel without loss of image quality during the reconstruction process.
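
    For reference, one serial SIRT-style sweep can be sketched as follows; the paper's GPU parallelization, adaptive relaxation, and non-uniform weights are omitted (illustrative only):

```cpp
#include <cstddef>
#include <vector>

// One SIRT-style sweep: each grid cell is corrected by the average of the
// normalized residuals of all rays crossing it, scaled by a relaxation
// factor lambda.
void sirt_iteration(const std::vector<std::vector<double>>& A, // rays x cells
                    const std::vector<double>& b,              // measured data
                    std::vector<double>& x, double lambda) {
    const std::size_t rays = A.size(), cells = x.size();
    std::vector<double> update(cells, 0.0), weight(cells, 0.0);
    for (std::size_t r = 0; r < rays; ++r) {
        double Ax = 0.0, row_norm = 0.0;
        for (std::size_t c = 0; c < cells; ++c) {
            Ax += A[r][c] * x[c];
            row_norm += A[r][c] * A[r][c];
        }
        if (row_norm == 0.0) continue;
        const double resid = (b[r] - Ax) / row_norm;   // normalized residual
        for (std::size_t c = 0; c < cells; ++c) {
            update[c] += A[r][c] * resid;
            if (A[r][c] != 0.0) weight[c] += 1.0;      // rays crossing cell c
        }
    }
    for (std::size_t c = 0; c < cells; ++c)
        if (weight[c] > 0.0) x[c] += lambda * update[c] / weight[c];
}
```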

  9. Parallel Distributed Processing: Implications for Cognition and Development

    DTIC Science & Technology

    1988-07-11

    cope with these kinds of compensation relations between variables. [Figure 4: balance beam of the kind first used by Inhelder and Piaget (1958).] ...developed by Inhelder and Piaget (1958), the so-called balance-beam task is illustrated in Figure 4. In this task, the child is shown a balance beam with pegs... Piaget stressed the continuity of the accommodation process, in spite of the overtly stage-like character of development, though he never gave a

  10. Harmony Theory: A Mathematical Framework for Stochastic Parallel Processing.

    DTIC Science & Technology

    1983-12-01

    Performing organization: Center for Human Information Processing. ...his ideas pervade this research. The research reported here was conducted under Contract N00014-79-C-0323, NR 667-437 with the Personnel and... perspective, Hofstadter (1983) is pursuing a related approach to perceptual grouping; his ideas have been inspirational for my work (Hofstadter, 1979). An e...

  11. A VLSI Architecture for Output Probability Computations of HMM-Based Recognition Systems with Store-Based Block Parallel Processing

    NASA Astrophysics Data System (ADS)

    Nakamura, Kazuhiro; Yamamoto, Masatoshi; Takagi, Kazuyoshi; Takagi, Naofumi

    In this paper, a fast and memory-efficient VLSI architecture for output probability computations of continuous Hidden Markov Models (HMMs) is presented. These computations are the most time-consuming part of HMM-based recognition systems. High-speed VLSI architectures with small registers and low-power dissipation are required for the development of mobile embedded systems with capable human interfaces. We demonstrate store-based block parallel processing (StoreBPP) for output probability computations and present a VLSI architecture that supports it. When the number of HMM states is adequate for accurate recognition, compared with conventional stream-based block parallel processing (StreamBPP) architectures, the proposed architecture requires fewer registers and processing elements and less processing time. The processing elements used in the StreamBPP architecture are identical to those used in the StoreBPP architecture. From a VLSI architectural viewpoint, a comparison shows the efficiency of the proposed architecture through efficient use of registers for storing input feature vectors and intermediate results during computation.
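
    The kernel being accelerated is, at its core, a Gaussian log-likelihood per HMM state; a minimal serial sketch (a single diagonal-covariance component; real recognizers sum over mixture components and precompute the constants per state):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Log output probability of a D-dimensional feature vector o under one
// HMM state modeled as a diagonal-covariance Gaussian:
//   log N(o; mean, var) = -0.5 * sum_d [ log(2*pi*var_d) + (o_d - mean_d)^2 / var_d ]
double log_output_prob(const std::vector<double>& o,
                       const std::vector<double>& mean,
                       const std::vector<double>& var) {
    const double two_pi = 6.283185307179586;
    double lp = 0.0;
    for (std::size_t d = 0; d < o.size(); ++d) {
        const double diff = o[d] - mean[d];
        lp += std::log(two_pi * var[d]) + diff * diff / var[d];
    }
    return -0.5 * lp;
}
```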

  12. Connectionism, parallel constraint satisfaction processes, and gestalt principles: (re) introducing cognitive dynamics to social psychology.

    PubMed

    Read, S J; Vanman, E J; Miller, L C

    1997-01-01

    We argue that recent work in connectionist modeling, in particular the parallel constraint satisfaction processes that are central to many of these models, has great importance for understanding issues of both historical and current concern for social psychologists. We first provide a brief description of connectionist modeling, with particular emphasis on parallel constraint satisfaction processes. Second, we examine the tremendous similarities between parallel constraint satisfaction processes and the Gestalt principles that were the foundation for much of modern social psychology. We propose that parallel constraint satisfaction processes provide a computational implementation of the principles of Gestalt psychology that were central to the work of such seminal social psychologists as Asch, Festinger, Heider, and Lewin. Third, we then describe how parallel constraint satisfaction processes have been applied to three areas that were key to the beginnings of modern social psychology and remain central today: impression formation and causal reasoning, cognitive consistency (balance and cognitive dissonance), and goal-directed behavior. We conclude by discussing implications of parallel constraint satisfaction principles for a number of broader issues in social psychology, such as the dynamics of social thought and the integration of social information within the narrow time frame of social interaction.

  13. Architecture and design of a 500-MHz gallium-arsenide processing element for a parallel supercomputer

    NASA Technical Reports Server (NTRS)

    Fouts, Douglas J.; Butner, Steven E.

    1991-01-01

    The design of the processing element of GASP, a GaAs supercomputer with a 500-MHz instruction issue rate and 1-GHz subsystem clocks, is presented. The novel, functionally modular, block data flow architecture of GASP is described. The architecture and design of a GASP processing element is then presented. The processing element (PE) is implemented in a hybrid semiconductor module with 152 custom GaAs ICs of eight different types. The effects of the implementation technology on both the system-level architecture and the PE design are discussed. SPICE simulations indicate that parts of the PE are capable of being clocked at 1 GHz, while the rest of the PE uses a 500-MHz clock. The architecture utilizes data flow techniques at a program block level, which allows efficient execution of parallel programs while maintaining reasonably good performance on sequential programs. A simulation study of the architecture indicates that an instruction execution rate of over 30,000 MIPS can be attained with 65 PEs.

  15. Multi-digit number processing beyond the two-digit number range: a combination of sequential and parallel processes.

    PubMed

    Meyerhoff, Hauke S; Moeller, Korbinian; Debus, Kolja; Nuerk, Hans-Christoph

    2012-05-01

    Investigations of multi-digit number processing typically focus on two-digit numbers. Here, we aim to investigate the generality of results from two-digit numbers for four- and six-digit numbers. Previous studies on two-digit numbers mostly suggested a parallel processing of tens and units. In contrast, the few studies examining the processing of larger numbers suggest sequential processing of the individual constituting digits. In this study, we combined the methodological approaches of studies implying either parallel or sequential processing. Participants completed a number magnitude comparison task on two-, four-, and six-digit numbers including unit-decade compatible and incompatible differing digit pairs (e.g., 32_47, 3<4 and 2<7 vs. 37_52, 3<5 but 7>2, respectively) at all possible digit positions. Response latencies and fixation behavior indicated that sequential and parallel decomposition is not exclusive in multi-digit number processing. Instead, our results clearly suggested that sequential and parallel processing strategies seem to be combined when processing multi-digit numbers beyond the two-digit number range. To account for the results, we propose a chunking hypothesis claiming that multi-digit numbers are separated into chunks of shorter digit strings. While the different chunks are processed sequentially digits within these chunks are processed in parallel. Copyright © 2012 Elsevier B.V. All rights reserved.

  16. A VLSI Architecture with Multiple Fast Store-Based Block Parallel Processing for Output Probability and Likelihood Score Computations in HMM-Based Isolated Word Recognition

    NASA Astrophysics Data System (ADS)

    Nakamura, Kazuhiro; Shimazaki, Ryo; Yamamoto, Masatoshi; Takagi, Kazuyoshi; Takagi, Naofumi

    This paper presents a memory-efficient VLSI architecture for output probability computations (OPCs) of continuous hidden Markov models (HMMs) and likelihood score computations (LSCs). These computations are the most time consuming part of HMM-based isolated word recognition systems. We demonstrate multiple fast store-based block parallel processing (MultipleFastStoreBPP) for OPCs and LSCs and present a VLSI architecture that supports it. Compared with conventional fast store-based block parallel processing (FastStoreBPP) and stream-based block parallel processing (StreamBPP) architectures, the proposed architecture requires fewer registers and less processing time. The processing elements (PEs) used in the FastStoreBPP and StreamBPP architectures are identical to those used in the MultipleFastStoreBPP architecture. From a VLSI architectural viewpoint, a comparison shows that the proposed architecture is an improvement over the others, through efficient use of PEs and registers for storing input feature vectors.

  17. An Integrated Circuit Design of High Efficiency Parallel-SSHI Rectifier for Piezoelectric Energy Harvesting

    NASA Astrophysics Data System (ADS)

    Hsieh, Y. C.; Chen, J. J.; Chen, H. S.; Wu, W. J.

    2016-11-01

    This paper presents the design and implementation of a rectifier for piezoelectric energy harvesting based on the parallel-synchronized-switch harvesting-on-inductor (P-SSHI) technique, also known as a bias-flip circuit [1]. The circuit is implemented in a 0.25 μm high-voltage CMOS process with a chip area of only 0.9648 mm². Post-layout simulation shows that the circuit extracts 336% more power than a full-bridge rectifier. The system's average control power loss is 26 μW while operating with a self-made MEMS piezoelectric transducer with an output current of 25 μA at 120 Hz and an internal capacitance of 6.45 nF. The output power is 43.42 μW under an optimal load of 1.5 MΩ.

  18. Sculpting in cyberspace: Parallel processing the development of new software

    NASA Technical Reports Server (NTRS)

    Fisher, Rob

    1993-01-01

    Stimulating creativity in problem solving, particularly where software development is involved, is applicable to many disciplines. Metaphorical thinking keeps the problem in focus but in a different light, jarring people out of their mental ruts and sparking fresh insights. It forces the mind to stretch to find patterns between dissimilar concepts, in the hope of discovering unusual ideas in odd associations (Technology Review January 1993, p. 37). With a background in Engineering and Visual Design from MIT, I have for the past 30 years pursued a career as a sculptor of interdisciplinary monumental artworks that bridge the fields of science, engineering and art. Since 1979, I have pioneered the application of computer simulation to solve the complex problems associated with these projects. A recent project for the roof of the Carnegie Science Center in Pittsburgh made particular use of the metaphoric creativity technique described above. The problem-solving process led to the creation of hybrid software combining scientific, architectural and engineering visualization techniques. David Steich, a Doctoral Candidate in Electrical Engineering at Penn State, was commissioned to develop special software that enabled me to create innovative free-form sculpture. This paper explores the process of inventing the software through a detailed analysis of the interaction between an artist and a computer programmer.

  19. Parallel Processing in the Corticogeniculate Pathway of the Macaque Monkey

    PubMed Central

    Briggs, Farran; Usrey, W. Martin

    2009-01-01

    Summary Although corticothalamic feedback is ubiquitous across species and modalities, its role in sensory processing is unclear. This study provides the first detailed description of the visual physiology of corticogeniculate neurons in the primate. Using electrical stimulation to identify corticogeniculate neurons, we distinguish three groups of neurons with response properties that closely resemble those of neurons in the magnocellular, parvocellular and koniocellular layers of their target structure, the lateral geniculate nucleus (LGN) of the thalamus. Our results indicate that corticogeniculate feedback in the primate is stream-specific and provide strong evidence in support of the view that corticothalamic feedback can influence the transmission of sensory information from the thalamus to the cortex in a stream-selective manner. PMID:19376073

  20. Parallel versus sequential processing in print and braille reading.

    PubMed

    Veispak, Anneli; Boets, Bart; Ghesquière, Pol

    2012-01-01

    In the current study we investigated word, pseudoword and story reading in Dutch-speaking braille and print readers. To examine developmental patterns, these reading skills were assessed in both children and adults. The results reveal that braille readers read less accurately and less quickly than print readers. While item length has no impact on word reading accuracy and speed in the group of print readers, it has a significant impact on reading accuracy and speed in the group of braille readers, particularly in the younger sample. This suggests that braille readers rely more strongly on an enduring sequential reading strategy. Comparison of the different reading tasks suggests that the advantage in accuracy and speed of reading in adult as compared to young braille readers is achieved through semantic top-down processing. Copyright © 2012 Elsevier Ltd. All rights reserved.

  1. Non-parallel processing: Gendered attrition in academic computer science

    NASA Astrophysics Data System (ADS)

    Cohoon, Joanne Louise Mcgrath

    2000-10-01

    This dissertation addresses the issue of disproportionate female attrition from computer science as an instance of gender segregation in higher education. By adopting a theoretical framework from organizational sociology, it demonstrates that the characteristics and processes of computer science departments strongly influence female retention. The empirical data identifies conditions under which women are retained in the computer science major at comparable rates to men. The research for this dissertation began with interviews of students, faculty, and chairpersons from five computer science departments. These exploratory interviews led to a survey of faculty and chairpersons at computer science and biology departments in Virginia. The data from these surveys are used in comparisons of the computer science and biology disciplines, and for statistical analyses that identify which departmental characteristics promote equal attrition for male and female undergraduates in computer science. This three-pronged methodological approach of interviews, discipline comparisons, and statistical analyses shows that departmental variation in gendered attrition rates can be explained largely by access to opportunity, relative numbers, and other characteristics of the learning environment. Using these concepts, this research identifies nine factors that affect the differential attrition of women from CS departments. These factors are: (1) The gender composition of enrolled students and faculty; (2) Faculty turnover; (3) Institutional support for the department; (4) Preferential attitudes toward female students; (5) Mentoring and supervising by faculty; (6) The local job market, starting salaries, and competitiveness of graduates; (7) Emphasis on teaching; and (8) Joint efforts for student success. This work contributes to our understanding of the gender segregation process in higher education. In addition, it contributes information that can lead to effective solutions for an

  2. Distributed Computing for Signal Processing: Modeling of Asynchronous Parallel Computation. Appendix F. Studies in Parallel Image Processing.

    DTIC Science & Technology

    1984-08-01

    ...represented in the image, the greater the potential benefits to be derived from SIMD implementation of the process. This section begins with an... Ph.D. thesis by Gie-Hing Lin.

  3. The role of parallelism in the real-time processing of anaphora

    PubMed Central

    Poirier, Josée; Walenski, Matthew; Shapiro, Lewis P.

    2012-01-01

    Parallelism effects refer to the facilitated processing of a target structure when it follows a similar, parallel structure. In coordination, a parallelism-related conjunction triggers the expectation that a second conjunct with the same structure as the first conjunct should occur. It has been proposed that parallelism effects reflect the use of the first structure as a template that guides the processing of the second. In this study, we examined the role of parallelism in real-time anaphora resolution by charting activation patterns in coordinated constructions containing anaphora, Verb-Phrase Ellipsis (VPE) and Noun-Phrase Traces (NP-traces). Specifically, we hypothesised that an expectation of parallelism would incite the parser to assume a structure similar to the first conjunct in the second, anaphora-containing conjunct. The speculation of a similar structure would result in early postulation of covert anaphora. Experiment 1 confirms that following a parallelism-related conjunction, first-conjunct material is activated in the second conjunct. Experiment 2 reveals that an NP-trace in the second conjunct is posited immediately where licensed, which is earlier than previously reported in the literature. In light of our findings, we propose an intricate relation between structural expectations and anaphor resolution. PMID:23741080

  4. Comparisons of elastic and rigid blade-element rotor models using parallel processing technology for piloted simulations

    NASA Technical Reports Server (NTRS)

    Hill, Gary; Duval, Ronald W.; Green, John A.; Huynh, Loc C.

    1991-01-01

    A piloted comparison of rigid and aeroelastic blade-element rotor models was conducted at the Crew Station Research and Development Facility (CSRDF) at Ames Research Center. A simulation development and analysis tool, FLIGHTLAB, was used to implement these models in real time using parallel processing technology. Pilot comments and quantitative analysis performed both on-line and off-line confirmed that elastic degrees of freedom significantly affect perceived handling qualities. Trim comparisons show improved correlation with flight test data when elastic modes are modeled. The results demonstrate the efficiency with which the mathematical modeling sophistication of existing simulation facilities can be upgraded using parallel processing, and the importance of these upgrades to simulation fidelity.

  6. Adapting high-level language programs for parallel processing using data flow

    NASA Technical Reports Server (NTRS)

    Standley, Hilda M.

    1988-01-01

    EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.

  7. Parallel Numerical Solution Process of a Two Dimensional Time Dependent Nonlinear Partial Differential Equation

    NASA Astrophysics Data System (ADS)

    Martin, I.; Tirado, F.; Vazquez, L.

    We present a process to achieve the solution of the two dimensional nonlinear Schrödinger equation using a multigrid technique on a distributed memory machine. Some features of the multigrid technique, such as its good convergence and parallel properties, are explained in this paper. These make the multigrid method the optimal one to solve the systems of equations arising at each time step from an implicit numerical scheme. We give some experimental results about the parallel numerical simulation of this equation on a message passing parallel machine.
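
    For readers unfamiliar with the ingredients, a two-grid correction cycle for a 1D model problem looks like the sketch below (illustrative only; the paper applies the same structure, recursively and in parallel, to the 2D systems arising from its implicit scheme):

```cpp
#include <cstddef>
#include <vector>

// Two-grid cycle for -u'' = f with zero boundary values, grid size 2^k + 1.
using Vec = std::vector<double>;

// Damped-Jacobi smoother for -u'' = f on spacing h.
static void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    const double w = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        const Vec v = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = (1 - w) * v[i] + w * 0.5 * (v[i-1] + v[i+1] + h * h * f[i]);
    }
}

void two_grid_cycle(Vec& u, const Vec& f, double h) {
    smooth(u, f, h, 3);                               // pre-smoothing
    Vec r(u.size(), 0.0);                             // fine-grid residual
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2 * u[i] - u[i-1] - u[i+1]) / (h * h);
    const std::size_t nc = (u.size() - 1) / 2 + 1;    // coarse grid size
    Vec rc(nc, 0.0), ec(nc, 0.0);
    for (std::size_t i = 1; i + 1 < nc; ++i)          // full-weighting restriction
        rc[i] = 0.25 * r[2*i - 1] + 0.5 * r[2*i] + 0.25 * r[2*i + 1];
    smooth(ec, rc, 2 * h, 50);                        // coarse solve (stand-in)
    for (std::size_t i = 1; i + 1 < nc; ++i)          // prolongate and correct
        u[2*i] += ec[i];
    for (std::size_t i = 1; i < nc; ++i)
        u[2*i - 1] += 0.5 * (ec[i-1] + ec[i]);
    smooth(u, f, h, 3);                               // post-smoothing
}
```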

  8. Seventh SIAM Conference on Parallel Processing for Scientific Computing. Final technical report

    SciTech Connect

    1996-10-01

    The Seventh SIAM Conference on Parallel Processing for Scientific Computing was held in downtown San Francisco on the dates above. More than 400 people attended the meeting. This SIAM conference is, in this organizer's opinion, the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Other, related areas, most notably parallel software and applications, are also well represented. The strong contributed sessions and minisymposia at the meeting attest to these claims.

  9. Parallel computer processing and modeling: applications for the ICU

    NASA Astrophysics Data System (ADS)

    Baxter, Grant; Pranger, L. Alex; Draghic, Nicole; Sims, Nathaniel M.; Wiesmann, William P.

    2003-07-01

    Current patient monitoring procedures in hospital intensive care units (ICUs) generate vast quantities of medical data, much of which is considered extemporaneous and not evaluated. Although sophisticated monitors to analyze individual types of patient data are routinely used in the hospital setting, this equipment lacks high order signal analysis tools for detecting long-term trends and correlations between different signals within a patient data set. Without the ability to continuously analyze disjoint sets of patient data, it is difficult to detect slow-forming complications. As a result, the early onset of conditions such as pneumonia or sepsis may not be apparent until the advanced stages. We report here on the development of a distributed software architecture test bed and software medical models to analyze both asynchronous and continuous patient data in real time. Hardware and software has been developed to support a multi-node distributed computer cluster capable of amassing data from multiple patient monitors and projecting near and long-term outcomes based upon the application of physiologic models to the incoming patient data stream. One computer acts as a central coordinating node; additional computers accommodate processing needs. A simple, non-clinical model for sepsis detection was implemented on the system for demonstration purposes. This work shows exceptional promise as a highly effective means to rapidly predict and thereby mitigate the effect of nosocomial infections.

  10. Integration Of Parallel Image Processing With Symbolic And Neural Computations For Imagery Exploitation

    NASA Astrophysics Data System (ADS)

    Roman, Evelyn

    1990-02-01

    In this paper we discuss the work being done at Itek combining parallel, symbolic, and neural methodologies at different stages of processing for imagery exploitation. We describe a prototype system we have been implementing combining real-time parallel image processing on an 8-stage parallel image-processing engine (PIPE) computer with expert system software such as our Multi-Sensor Exploitation Assistant system on the Symbolics LISP machine and with neural computations on the PIPE and on its host IBM AT for target recognition and change detection applications. We also provide a summary of basic neural concepts, and show the commonality between neural nets and related mathematics, artificial intelligence, and traditional image processing concepts. This provides us with numerous choices for the implementation of constraint satisfaction, transformational invariance, inference and representational mechanisms, and software lifecycle engineering methodologies in the different computational layers. Our future work may include optical processing as well, for a real-time capability complementing the PIPE's.

  11. Nonlinear vector eigen-solver and parallel reassembly processing for structural nonlinear vibration

    NASA Astrophysics Data System (ADS)

    Xue, D. Y.; Mei, Chuh

    1993-12-01

    In the frequency domain solution of large amplitude nonlinear vibration, two operations are computationally costly. They are: (1) the iterative eigen-solution and (2) the iterative nonlinear matrix reassembly. This study introduces a nonlinear eigen-solver which greatly speeds up the solution procedure by using a combination of vector iteration and nonlinear matrix updating. A feature of this new method is that it avoids repeatedly using a costly eigen-solver or equation solver. This solution procedure has also been parallelized to further speed up the computation. Parallel nonlinear matrix reassembly is the main interest in this parallel processing. The Force macro package is used in the parallel program on a CRAY-2S supercomputer.

  12. Industrial burner and process efficiency program

    NASA Astrophysics Data System (ADS)

    Huebner, S. R.; Prakash, S. N.

    1981-03-01

    A laboratory prototype burner was developed that is compatible with an FM (frequency modulation) combustion control system, in which temperature control is accomplished by regulating the ratio of burner on-time to burner off-time. This multifuel (natural gas and No. 2 fuel oil) high velocity burner is capable of repeated pulse ignition at maximum rated capability (1 million Btu/hour) with preheated air (from ambient to 1100 °F). A digital control in the FM mode was developed. Experimental data from tests in a laboratory furnace indicated that when applied to a batch-type thermal process where appreciable turndown is presently obtained by excess-air operation, the FM combustion system provides improvements in process fuel efficiency and gains in productivity.

  13. A computationally efficient parallel Levenberg-Marquardt algorithm for highly parameterized inverse model analyses

    NASA Astrophysics Data System (ADS)

    Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.

    2016-09-01

    Inverse modeling seeks model parameters given a set of observations. However, for practical problems, because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace such that the dimensionality of the problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2-D and a random hydraulic conductivity field in 3-D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10¹ to ~10² in a multicore computational environment. Therefore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate to large-scale problems.
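
    The recycling step rests on a standard property of Krylov subspaces, namely their invariance under diagonal shifts; in our notation (a sketch of the idea, not necessarily the paper's exact formulation):

```latex
% One Levenberg-Marquardt step solves the damped normal equations:
\[
  (J^\top J + \lambda I)\,\delta = J^\top r .
\]
% Krylov subspaces are invariant under diagonal shifts:
\[
  \mathcal{K}_k\!\left(J^\top J + \lambda I,\; J^\top r\right)
  = \mathcal{K}_k\!\left(J^\top J,\; J^\top r\right)
  \quad \text{for all } \lambda \ge 0 ,
\]
% so one basis V_k serves every damping parameter; each lambda only
% requires the small projected system:
\[
  \left(V_k^\top J^\top J\, V_k + \lambda I\right) y = V_k^\top J^\top r ,
  \qquad \delta \approx V_k\, y .
\]
```

    Since the basis does not depend on the damping parameter, it is built once per Levenberg-Marquardt iteration and reused as the damping is adjusted.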

  17. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples

    PubMed Central

    Smith, Andrew M.; Heisler, Lawrence E.; St.Onge, Robert P.; Farias-Hesson, Eveline; Wallace, Iain M.; Bodeau, John; Harris, Adam N.; Perry, Kathleen M.; Giaever, Guri; Pourmand, Nader; Nislow, Corey

    2010-01-01

    Next-generation sequencing has proven an extremely effective technology for molecular counting applications where the number of sequence reads provides a digital readout for RNA-seq, ChIP-seq, Tn-seq and other applications. The extremely large number of sequence reads that can be obtained per run permits the analysis of increasingly complex samples. For lower complexity samples, however, a point of diminishing returns is reached when the number of counts per sequence results in oversampling with no increase in data quality. A solution to making next-generation sequencing as efficient and affordable as possible involves assaying multiple samples in a single run. Here, we report the successful 96-plexing of complex pools of DNA barcoded yeast mutants and show that such ‘Bar-seq’ assessment of these samples is comparable with data provided by barcode microarrays, the current benchmark for this application. The cost reduction and increased throughput permitted by highly multiplexed sequencing will greatly expand the scope of chemogenomics assays and, equally importantly, the approach is suitable for other sequence counting applications that could benefit from massive parallelization. PMID:20460461

  18. Parallel cascade selection molecular dynamics for efficient conformational sampling and free energy calculation of proteins

    NASA Astrophysics Data System (ADS)

    Kitao, Akio; Harada, Ryuhei; Nishihara, Yasutaka; Tran, Duy Phuoc

    2016-12-01

    Parallel Cascade Selection Molecular Dynamics (PaCS-MD) was proposed as an efficient conformational sampling method to investigate conformational transition pathways of proteins. In PaCS-MD, cycles of (i) selection of initial structures for multiple independent MD simulations and (ii) conformational sampling by independent MD simulations are repeated until the sampling converges. The selection is conducted so that the protein conformation gradually approaches a target. The selection of snapshots is key to enhancing conformational changes by increasing the probability of rare-event occurrence. Since the procedure of PaCS-MD is simple, no modification of MD programs is required; the selection of initial structures and the restart of the next cycle can be handled with relatively simple scripts and a straightforward implementation. Trajectories generated by PaCS-MD were further analyzed by the Markov state model (MSM), which enables calculation of the free energy landscape. The combination of PaCS-MD and MSM is reported in this work.

  19. Parallel artificial liquid membrane extraction as an efficient tool for removal of phospholipids from human plasma.

    PubMed

    Ask, Kristine Skoglund; Bardakci, Turgay; Parmer, Marthe Petrine; Halvorsen, Trine Grønhaug; Øiestad, Elisabeth Leere; Pedersen-Bjergaard, Stig; Gjelstad, Astrid

    2016-09-10

    Generic Parallel Artificial Liquid Membrane Extraction (PALME) methods for non-polar basic and non-polar acidic drugs from human plasma were investigated with respect to phospholipid removal. In both cases, extractions in 96-well format were performed from plasma (125 μL), through 4 μL of organic solvent used as supported liquid membranes (SLMs), and into 50 μL aqueous acceptor solutions. The acceptor solutions were subsequently analysed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) using in-source fragmentation and monitoring the m/z 184→184 transition for investigation of phosphatidylcholines (PC), sphingomyelins (SM), and lysophosphatidylcholines (Lyso-PC). In both generic methods, no phospholipids were detected in the acceptor solutions. Thus, PALME appeared to be highly efficient for phospholipid removal. To further support this, qualitative (post-column infusion) and quantitative matrix effects were investigated with fluoxetine, fluvoxamine, and quetiapine as model analytes. No signs of matrix effects were observed. Finally, PALME was evaluated for the aforementioned drug substances, and data were in accordance with European Medicines Agency (EMA) guidelines. Copyright © 2016 Elsevier B.V. All rights reserved.

  20. Efficient massively parallel simulation of dynamic channel assignment schemes for wireless cellular communications

    NASA Technical Reports Server (NTRS)

    Greenberg, Albert G.; Lubachevsky, Boris D.; Nicol, David M.; Wright, Paul E.

    1994-01-01

    Fast, efficient parallel algorithms are presented for discrete event simulations of dynamic channel assignment schemes for wireless cellular communication networks. The driving events are call arrivals and departures, in continuous time, to cells geographically distributed across the service area. A dynamic channel assignment scheme decides which call arrivals to accept, and which channels to allocate to the accepted calls, attempting to minimize call blocking while ensuring co-channel interference is tolerably low. Specifically, the scheme ensures that the same channel is used concurrently at different cells only if the pairwise distances between those cells are sufficiently large. Much of the complexity of the system comes from ensuring this separation. The network is modeled as a system of interacting continuous time automata, each corresponding to a cell. To simulate the model, conservative methods are used; i.e., methods in which no errors occur in the course of the simulation and so no rollback or relaxation is needed. Implemented on a 16K processor MasPar MP-1, an elegant and simple technique provides speedups of about 15 times over an optimized serial simulation running on a high speed workstation. A drawback of this technique, typical of conservative methods, is that processor utilization is rather low. To overcome this, new methods were developed that exploit slackness in event dependencies over short intervals of time, thereby raising the utilization to above 50 percent and the speedup over the optimized serial code to about 120 times.
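
    The co-channel separation constraint at the heart of the model can be stated as a simple predicate; a sketch (illustrative, not the authors' code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A channel may be assigned at a cell only if every cell currently using
// that channel lies at least reuse_dist away.
struct CellPos { double x, y; };

bool channel_ok(int channel, std::size_t cell,
                const std::vector<CellPos>& pos,
                const std::vector<std::vector<int>>& in_use, // channels per cell
                double reuse_dist) {
    for (std::size_t c = 0; c < pos.size(); ++c) {
        if (c == cell) continue;
        for (int ch : in_use[c]) {
            if (ch != channel) continue;
            if (std::hypot(pos[c].x - pos[cell].x,
                           pos[c].y - pos[cell].y) < reuse_dist)
                return false;                            // too close for reuse
        }
    }
    return true;
}
```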

  1. High Speed Publication Subscription Brokering Through Highly Parallel Processing on Field Programmable Gate Array (FPGA)

    DTIC Science & Technology

    2010-01-01

    ...and that Unix-style newlines are being used. Section 2, Hardware Required for a Single Node: all of the information in the multi-node hardware... (AFRL-RI-RS-TR-2010-29, Final Technical Report, January 2010; reporting period September 2007 – August 2009.)

  2. Design and implementation of the parallel processing system of multi-channel polarization images

    NASA Astrophysics Data System (ADS)

    Li, Zhi-yong; Huang, Qin-chao

    2013-08-01

    Compared with traditional optical intensity image processing, polarization image processing poses two main problems: the amount of data is larger, and the processing tasks are more complex. To resolve these problems, a parallel processing system for multi-channel polarization images was designed using a multi-DSP technique. It contains a communication control unit (CCU) and a data processing array (DPA). The CCU controls communications inside and outside the system; its logic is implemented in an FPGA. The DPA is made up of four digital signal processor (DSP) chips, which are loosely coupled. The DPA implements processing tasks, including image registration and image synthesis, by parallel processing methods. The polarization image parallel processing model is designed on multiple levels, including the system task, the algorithm, and the operation, and its program is written in assembly language. In the experiment, the polarization image resolution was 782×582 pixels and the pixel data length 12 bits. After receiving three channels of polarization images simultaneously, the system performs parallel tasks to acquire the target polarization characteristics. Experimental results show that the system has good real-time performance and reliability: the processing time of image registration is 293.343 ms with a registration accuracy of 0.5 pixel, and the processing time of image synthesis is 3.199 ms.

  3. New optical scheme for parallel processing of 1D gray images

    NASA Astrophysics Data System (ADS)

    Huang, Guoliang; Jin, Guofan; Wu, Minxian; Yan, Yingbai

    1994-06-01

    Based on mathematical morphology and a digital umbra shading and shadowing algorithm, a new scheme for realizing the fundamental morphological operations on one-dimensional gray images is proposed. The mathematical formulae for the parallel processing of 1D gray images are summarized, and some important conclusions on extending morphological processing from binary images to gray images are obtained. The advantages of this scheme are its simple structure, high gray-level resolution, and good parallelism. It can greatly increase the speed of morphological processing of gray images and obtain more accurate results.
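
    For readers unfamiliar with gray-scale morphology, the following is a minimal sketch of the 1D gray dilation and erosion that umbra-based schemes realize, (f ⊕ b)(x) = max_k [f(x−k) + b(k)] and (f ⊖ b)(x) = min_k [f(x+k) − b(k)]; the signal and structuring element values are made-up examples, and the optical parallelism of the proposed scheme is of course not captured.

```python
import numpy as np

# 1D gray-scale dilation and erosion with a centered structuring element b.
def gray_dilate(f, b):
    r = len(b) // 2
    pad = np.pad(f, r, constant_values=-np.inf)
    # (f dilate b)(x) = max_k f(x-k) + b(k); the reversal realizes the -k
    return np.array([np.max(pad[i:i + len(b)] + b[::-1]) for i in range(len(f))])

def gray_erode(f, b):
    r = len(b) // 2
    pad = np.pad(f, r, constant_values=np.inf)
    # (f erode b)(x) = min_k f(x+k) - b(k)
    return np.array([np.min(pad[i:i + len(b)] - b) for i in range(len(f))])

f = np.array([0, 1, 3, 2, 8, 2, 1, 0], dtype=float)   # toy 1D gray image
b = np.array([0, 1, 0], dtype=float)                  # toy structuring element
print(gray_dilate(f, b))
print(gray_erode(f, b))
```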

  4. Efficient algorithms for processing remotely sensed imagery

    NASA Astrophysics Data System (ADS)

    Zhang, Zengyan

    At regional and global scales, satellite-based sensors are the primary source of information for studying the Earth's environment, as they provide the needed dynamic temporal view of the Earth's surface. This dissertation focuses on the development of efficient methodologies and algorithms to generate custom-tailored data products using Global Area Coverage (GAC) data from the NOAA Advanced Very High Resolution Radiometer (AVHRR) sensor. Furthermore, we show the retrieval of the global Bidirectional Reflectance Distribution Function (BRDF) and albedo of the Earth's land surface using Pathfinder AVHRR Land (PAL) data sets, which can be generated using our system. These are the first algorithms to retrieve such land surface properties on a global scale. We start by describing a software system, Kronos, which allows the generation of a rich set of data products that can be easily specified through a Java interface by scientists wishing to carry out Earth system modeling or analysis. Kronos is based on a flexible methodology and consists of four major components: ingest and preprocessing, indexing and storage, a search and processing engine, and a Java interface. Efficient algorithms for the generation of custom-tailored AVHRR data products are then developed, implemented, and tested on the UMIACS high performance computer systems. A major challenge lies in the efficient processing, storage, and archival of the large amounts of available data in such a way that pertinent information can be extracted with ease. Finally, high performance algorithms are developed to retrieve global BRDF and albedo in the red and near-infrared wavelengths using three widely different models with multiangular, multi-temporal, and multi-band PAL data. Given the volume of data involved (about 27 GBytes), I/O time as well as the overall computational complexity are minimized. The algorithms access the global data once, followed by a redistribution of land pixel data to balance the computational loads among the processors.

  5. A Novel and Efficient One-Step Parallel Synthesis of Dibenzopyranones via Suzuki-Miyaura Cross Coupling

    PubMed Central

    Vishnumurthy, Kodumuru; Makriyannis, Alexandros

    2010-01-01

    Microwave promoted novel and efficient one-step parallel synthesis of dibenzopyranones and heterocyclic analogues from bromo arylcarboxylates and o-hydroxyarylboronic acids via Suzuki-Miyaura cross coupling reaction is described. Spontaneous lactonization gave dibenzopyranones and heterocyclic analogues bearing electron donating and withdrawing groups on both aromatic rings in good to excellent yields. PMID:20831265

  6. Parallel pipeline networking and signal processing with field-programmable gate arrays (FPGAs) and VCSEL-MSM smart pixels

    NASA Astrophysics Data System (ADS)

    Kuznia, C. B.; Sawchuk, Alexander A.; Zhang, Liping; Hoanca, Bogdan; Hong, Sunkwang; Min, Chris; Pansatiankul, Dhawat E.; Alpaslan, Zahir Y.

    2000-05-01

    We present a networking and signal processing architecture called Transpar-TR (Translucent Smart Pixel Array-Token- Ring) that utilizes smart pixel technology to perform 2D parallel optical data transfer between digital processing nodes. Transpar-TR moves data through the network in the form of 3D packets (2D spatial and 1D time). By utilizing many spatial parallel channels, Transpar-TR can achieve high throughput, low latency communication between nodes, even with each channel operating at moderate data rates. The 2D array of optical channels is created by an array of smart pixels, each with an optical input and optical output. Each smart pixel consists of two sections, an optical network interface and ALU-based processor with local memory. The optical network interface is responsible for transmitting and receiving optical data packets using a slotted token ring network protocol. The smart pixel array operates as a single-instruction multiple-data processor when processing data. The Transpar-TR network, consisting of networked smart pixel arrays, can perform pipelined parallel processing very efficiently on 2D data structures such as images and video. This paper discusses the Transpar-TR implementation in which each node is the printed circuit board integration of a VCSEL-MSM chip, a transimpedance receiver array chip and an FPGA chip.

  7. Application of bistable optical logic gate arrays to all-optical digital parallel processing

    NASA Astrophysics Data System (ADS)

    Walker, A. C.

    1986-05-01

    Arrays of bistable optical gates can form the basis of an all-optical digital parallel processor. Two classes of signal input geometry exist - on- and off-axis - and lead to distinctly different device characteristics. The optical implementation of multisignal fan-in to an array of intrinsically bistable optical gates using the more efficient off-axis option is discussed together with the construction of programmable read/write memories from optically bistable devices. Finally the design of a demonstration all-optical parallel processor incorporating these concepts is presented.

  8. Relative Language Exposure, Processing Efficiency and Vocabulary in Spanish-English Bilingual Toddlers

    ERIC Educational Resources Information Center

    Hurtado, Nereyda; Gruter, Theres; Marchman, Virginia A.; Fernald, Anne

    2014-01-01

    Research with monolingual children has shown that early efficiency in real-time word recognition predicts later language and cognitive outcomes. In parallel research with young bilingual children, processing ability and vocabulary size are closely related within each language, although not across the two languages. For children in dual-language…

  9. Friction Stir Processing for Efficient Manufacturing

    SciTech Connect

    Mr. Christopher B. Smith; Dr. Oyelayo Ajayi

    2012-01-31

    Friction at contacting surfaces in relative motion is a major source of parasitic energy loss in machine systems and manufacturing processes. Consequently, friction reduction usually translates to efficiency gain and reduction in energy consumption. Furthermore, friction at surfaces eventually leads to wear and failure of the components, thereby compromising reliability and durability. In order to reduce friction and wear in tribological components, material surfaces are often hardened by a variety of methods, including conventional heat treatment, laser surface hardening, and thin-film coatings. While these surface treatments are effective when used in conjunction with lubrication to prevent failure, they are all energy intensive and can add significant cost. A new concept for surface hardening of metallic materials and components is Friction Stir Processing (FSP). Compared to current surface hardening technologies, FSP is more energy efficient, produces no emissions or waste by-products, and may result in better tribological performance. FSP involves plunging a rotating tool to a predetermined depth (the case layer thickness) and translating the FSP tool along the area to be processed. This action of the tool produces heating and severe plastic deformation of the processed area. For steel, the temperature is high enough to cause phase transformation, ultimately forming a hard martensitic phase. Indeed, FSP has been used for surface modification of several metals and alloys to homogenize the microstructure and refine the grain size, both of which lead to improved fatigue and corrosion resistance. Based on the effect of FSP on the near-surface layer material, it was expected to have beneficial effects on the friction and wear performance of metallic materials. However, little or no knowledge existed on the impact of FSP on friction and wear performance, the subject of this project and final report. Specifically for steel, which is the most dominant

  10. Parallel processing for large-scale nonlinear control experiments in economics

    SciTech Connect

    Amman, H.M.; Kendrick, D.A.

    1991-01-01

    In general, the econometric models relevant for purposes of evaluating economic policy contain a large number of nonlinear equations. Therefore, in applying optimal control techniques, computational difficulties are encountered. This paper presents the most common algorithm for computing nonlinear control problems and investigates the degree to which vector processing and parallel processing can facilitate optimal control experiments.

  11. SURVEY OF HIGHLY PARALLEL INFORMATION PROCESSING TECHNOLOGY AND SYSTEMS. PHASE I OF AN IMPLICATIONS STUDY,

    DTIC Science & Technology

    The purpose of this report is to present the results of a survey of highly parallel information processing technology and systems. Completion of this study will require a survey of naval systems to determine which can benefit from the technology.

  12. Parallel Processing of the Target Language during Source Language Comprehension in Interpreting

    ERIC Educational Resources Information Center

    Dong, Yanping; Lin, Jiexuan

    2013-01-01

    Two experiments were conducted to test the hypothesis that the parallel processing of the target language (TL) during source language (SL) comprehension in interpreting may be influenced by two factors: (i) link strength from SL to TL, and (ii) the interpreter's cognitive resources supplement to TL processing during SL comprehension. The…

  13. Parallels between a Collaborative Research Process and the Middle Level Philosophy

    ERIC Educational Resources Information Center

    Dever, Robin; Ross, Diane; Miller, Jennifer; White, Paula; Jones, Karen

    2014-01-01

    The characteristics of the middle level philosophy as described in This We Believe closely parallel the collaborative research process. The journey of one research team is described in relationship to these characteristics. The collaborative process includes strengths such as professional relationships, professional development, courageous…

  14. A Hadoop-Based Distributed Framework for Efficient Managing and Processing Big Remote Sensing Images

    NASA Astrophysics Data System (ADS)

    Wang, C.; Hu, F.; Hu, X.; Zhao, S.; Wen, W.; Yang, C.

    2015-07-01

    Various sensors on airborne and satellite platforms are producing large volumes of remote sensing images for mapping, environmental monitoring, disaster management, military intelligence, and other applications. However, it is challenging to efficiently store, query, and process such big data due to its data- and computing-intensive nature. In this paper, a Hadoop-based framework is proposed to manage and process big remote sensing data in a distributed and parallel manner. In particular, remote sensing data can be directly fetched from other data platforms into the Hadoop Distributed File System (HDFS). The Orfeo toolbox, a ready-to-use tool for large image processing, is integrated into MapReduce to provide a rich set of image processing operations. With the integration of HDFS, the Orfeo toolbox, and MapReduce, remote sensing images can be processed directly, in parallel, in a scalable computing environment. The experimental results show that the proposed framework can efficiently manage and process such big remote sensing data.
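
    The map-reduce pattern the framework relies on can be sketched in a few lines. The following toy version uses Python's multiprocessing in place of Hadoop and a fake two-band scene in place of real imagery; the tile size, band layout, and NDVI-like statistic are all illustrative assumptions rather than the framework's actual Orfeo-based operators.

```python
from multiprocessing import Pool
import numpy as np

# "Map" an image operation over tiles in parallel, then "reduce" the
# per-tile partial results into a scene-wide answer.
def map_tile(tile):
    red, nir = tile[..., 0], tile[..., 1]
    ndvi = (nir - red) / np.clip(nir + red, 1e-6, None)
    return ndvi.mean(), ndvi.size                 # partial result per tile

def reduce_results(parts):
    total = sum(m * n for m, n in parts)
    return total / sum(n for _, n in parts)       # scene-wide mean NDVI

if __name__ == "__main__":
    scene = np.random.rand(1024, 1024, 2).astype(np.float32)  # fake 2-band scene
    tiles = [scene[r:r + 256, c:c + 256]
             for r in range(0, 1024, 256) for c in range(0, 1024, 256)]
    with Pool(4) as pool:
        parts = pool.map(map_tile, tiles)         # parallel "map" phase
    print("mean NDVI:", reduce_results(parts))
```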

  15. Efficient Separations and Processing Integrated Program

    SciTech Connect

    Fryberger, T.

    1994-12-31

    The Efficient Separations and Processing Integrated Program (ESP-IP) was created in 1991 to support the Department of Energy's (DOE) weapons complex cleanup effort. The Office within DOE responsible for the cleanup is the Office of Environmental Restoration and Waste Management (EM). The goal of EM is to minimize risks to human health and the environment, and bring DOE sites into regulatory compliance by the year 2019. EM established the Office of Technology Development (OTD) to develop new technologies that are safer, faster, more effective, and less expensive than current methods. In an effort to focus resources and address opportunities, OTD has developed Integrated Programs (IPs) to support applied research activities in key application areas required in each stage of the remediation process (e.g., characterization, treatment, disposal). The mission of the ESP-IP is to identify, develop, and coordinate technologies for economical separation of toxic components from wastes that have accumulated at DOE sites since the mid-1940s. These wastes and environmental problems, located at more than 100 contaminated installations in 36 states and territories, are the result of half a century of nuclear processing activities by DOE and its predecessor organizations. These wastes include: high-level radioactive wastes stored in underground tanks, transuranic wastes, low-level radioactive wastes and mixed wastes buried in landfills, contaminated soils, and contaminated groundwater.

  16. The hypercluster: A parallel processing test-bed architecture for computational mechanics applications

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1987-01-01

    The development of numerical methods and software tools for parallel processors can be aided through the use of a hardware test-bed. The test-bed architecture must be flexible enough to support investigations into architecture-algorithm interactions. One way to implement a test-bed is to use a commercial parallel processor. Unfortunately, most commercial parallel processors are fixed in their interconnection and/or processor architecture. In this paper, we describe a modified n-cube architecture, called the hypercluster, which is a superset of many other processor and interconnection architectures. The hypercluster is intended to support research into parallel processing of computational fluid and structural mechanics problems which may require a number of different architectural configurations. An example of how a typical partial differential equation solution algorithm maps onto the hypercluster is given.
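
    A typical example of how a partial differential equation solver maps onto connected processors is a halo-exchange Jacobi sweep, sketched below with mpi4py; this is a generic illustration under assumed grid sizes and boundary values, not the hypercluster's own programming model. Run with, e.g., `mpiexec -n 4 python jacobi.py` (the file name is hypothetical).

```python
# Generic halo-exchange Jacobi sweep (assumes mpi4py is installed).
# Each rank owns a strip of the 1D grid plus two ghost cells, and swaps
# boundary values with its neighbours every iteration.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 64                                    # interior points per rank
u = np.zeros(n_local + 2)                       # +2 ghost cells
if rank == 0:
    u[0] = 1.0                                  # assumed boundary condition

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for _ in range(1000):
    # paired send/receive keeps the halo exchange deadlock-free
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    u[1:-1] = 0.5 * (u[:-2] + u[2:])            # Jacobi update of interior

print(rank, float(u[1]), float(u[-2]))
```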

  17. Compressed bitmap indices for efficient query processing

    SciTech Connect

    Wu, Kesheng; Otoo, Ekow; Shoshani, Arie

    2001-09-30

    Many database applications make extensive use of bitmap indexing schemes. In this paper, we study how to improve the efficiency of these indexing schemes by proposing new compression schemes for the bitmaps. Most compression schemes are designed primarily to achieve good compression, and during query processing they can be orders of magnitude slower than their uncompressed counterparts. The new schemes are designed to bridge this performance gap by trading some compression effectiveness for improved operation speed. In a number of tests on both synthetic data and real application data, we found that the new schemes significantly outperform the well-known compression schemes while using only modestly more space. For example, compared to the Byte-aligned Bitmap Code, the new schemes are 12 times faster while using only 50 percent more space. The new schemes use much less space (<30 percent) than the uncompressed scheme and are faster in the majority of the test cases.
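
    The query-side bit operations that make bitmap indices attractive can be sketched with Python integers as uncompressed bitsets; the data values are made up, and the compression schemes the paper actually studies are deliberately left out.

```python
# Toy bitmap index: one bitmap per attribute value, stored as a Python
# integer bitset; a range query ORs the qualifying bitmaps.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]            # made-up attribute values

bitmaps = {}                                      # value -> bitset of row ids
for row, v in enumerate(data):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << row)

def query_range(lo, hi):
    """Rows whose value satisfies lo <= value <= hi."""
    bits = 0
    for v, bm in bitmaps.items():
        if lo <= v <= hi:
            bits |= bm                            # one OR per matching value
    return [r for r in range(len(data)) if bits >> r & 1]

print(query_range(2, 5))                          # rows holding values 2..5
```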

  18. Rapid Pattern Recognition of Three Dimensional Objects Using Parallel Processing Within a Hierarchy of Hexagonal Grids

    NASA Astrophysics Data System (ADS)

    Tang, Haojun

    1995-01-01

    This thesis describes the use of parallel processing within a hierarchy of hexagonal grids to achieve rapid recognition of patterns. A seven-pixel basic hexagonal neighborhood, a sixty-one-pixel superneighborhood, and pyramids with a 2-to-4 area ratio are employed. The hexagonal network achieves improved accuracy over the square network for object boundaries. The hexagonal grid, with less directional sensitivity, is a better approximation of the human vision grid, is more suited to natural scenes than the square grid, and avoids the 4-neighbor/8-neighbor problem. Parallel processing in image analysis saves considerable time versus the traditional line-by-line method. Hexagonal parallel processing combines the optimal hexagonal geometry with the parallel structure. Our work surveys the behavior and internal properties needed to construct the image at different levels of the hexagonal pixel grid in a parallel computation scheme. A computer code has been developed to detect edges of digital images of real objects, taken with a CCD camera, within a hexagonal grid at any level. The algorithm uses the differences between a pixel's gray level and those of its six neighbors, and is able to determine the boundary of a digital image in parallel. A series of algorithms and techniques has also been built up to manage edge linking, feature extraction, etc. The digital images obtained from the improved CRS digital image processing system are a good approximation to the images that would be obtained with a real physical hexagonal grid. We envision that our work in this little-known area will have important applications in real-time machine vision. A parallel two-layer hexagonal-array retina has been designed to do pattern recognition using simple operations such as differencing, ratioing, and thresholding, which may occur in the human retina and other biological vision systems.
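
    The per-pixel edge test described here, comparing a pixel's gray level against its six hexagonal neighbours, is independent for every pixel and hence trivially parallel. A minimal sketch using axial hexagonal coordinates follows; the image representation, threshold, and synthetic patch are illustrative assumptions, not the thesis's CRS system.

```python
# Edge detection on a hexagonal grid in axial coordinates: a pixel is an
# edge when the gray-level difference to any of its six neighbours exceeds
# a threshold.  Each pixel's test is independent, so the loop parallelizes.
HEX_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def hex_edges(img, thresh):
    """img: {(q, r): gray}; returns the set of edge pixels."""
    edges = set()
    for (q, r), g in img.items():
        for dq, dr in HEX_DIRS:
            nb = img.get((q + dq, r + dr))
            if nb is not None and abs(g - nb) > thresh:
                edges.add((q, r))
                break
    return edges

# tiny synthetic patch: a bright spot on a dark background
img = {(q, r): 0.0 for q in range(-3, 4) for r in range(-3, 4)}
img[(0, 0)] = 1.0
print(sorted(hex_edges(img, 0.5)))
```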

  1. Advancing the extended parallel process model through the inclusion of response cost measures.

    PubMed

    Rintamaki, Lance S; Yang, Z Janet

    2014-01-01

    This study advances the Extended Parallel Process Model through the inclusion of response cost measures, which are drawbacks associated with a proposed response to a health threat. A sample of 502 college students completed a questionnaire on perceptions regarding sexually transmitted infections and condom use after reading information from the Centers for Disease Control and Prevention on the health risks of sexually transmitted infections and the utility of latex condoms in preventing sexually transmitted infection transmission. The questionnaire included standard Extended Parallel Process Model assessments of perceived threat and efficacy, as well as questions pertaining to response costs associated with condom use. Results from hierarchical ordinary least squares regression demonstrated how the addition of response cost measures improved the predictive power of the Extended Parallel Process Model, supporting the inclusion of this variable in the model.

  2. Toward a model framework of generalized parallel componential processing of multi-symbol numbers.

    PubMed

    Huber, Stefan; Cornelsen, Sonja; Moeller, Korbinian; Nuerk, Hans-Christoph

    2015-05-01

    In this article, we propose and evaluate a new model framework of parallel componential multi-symbol number processing, generalizing the idea of parallel componential processing of multi-digit numbers to the case of negative numbers by treating the polarity signs similarly to single digits. In a first step, we evaluated this account by defining and investigating a sign-decade compatibility effect for the comparison of positive and negative numbers, which extends the unit-decade compatibility effect in 2-digit number processing. Then, we evaluated whether the model is capable of accounting for previous findings in negative number processing. In a magnitude comparison task, in which participants had to single out the larger of 2 integers, we observed a reliable sign-decade compatibility effect with prolonged reaction times for incompatible (e.g., -97 vs. +53; in which the number with the larger decade digit has the smaller, i.e., negative polarity sign) as compared with sign-decade compatible number pairs (e.g., -53 vs. +97). Moreover, an analysis of participants' eye fixation behavior corroborated our model of parallel componential processing of multi-symbol numbers. These results are discussed in light of concurrent theoretical notions about negative number processing. On the basis of the present results, we propose a generalized integrated model framework of parallel componential multi-symbol processing. (c) 2015 APA, all rights reserved.

  3. An efficient highly parallel implementation of a large air pollution model on an IBM blue gene supercomputer

    NASA Astrophysics Data System (ADS)

    Ostromsky, Tz.; Georgiev, K.; Zlatev, Z.

    2012-10-01

    In this paper we discuss the efficient distributed-memory parallelization strategy of the Unified Danish Eulerian Model (UNI-DEM). We apply an improved decomposition strategy to the spatial domain in order to get more parallel tasks (based on the larger number of subdomains) with fewer communications between them (due to optimization of the overlapping area when the advection-diffusion problem is solved numerically). This kind of rectangular block partitioning (with a square-shape trend) allows us not only to increase significantly the number of potential parallel tasks, but also to reduce the local memory requirements per task, which is critical for the distributed-memory implementation of the higher-resolution/finer-grid versions of UNI-DEM on some parallel systems, and particularly on the IBM BlueGene/P platform - our target hardware. We show by experiments that our new parallel implementation can use the resources of the powerful IBM BlueGene/P supercomputer, the largest in Bulgaria, rather efficiently, up to its full capacity. It turned out to be extremely useful in the large and computationally expensive numerical experiments carried out to calculate some initial data for sensitivity analysis of the Danish Eulerian model.
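
    The rectangular block partitioning idea can be sketched independently of UNI-DEM: choose a near-square process grid so the subdomain perimeters (and hence the overlap communication) stay small relative to their areas. The grid dimensions and task count below are made-up values.

```python
import math

# Split an nx-by-ny grid among p tasks into blocks with a near-square
# aspect ratio; px is the largest divisor of p not exceeding sqrt(p).
def block_partition(nx, ny, p):
    px = max(d for d in range(1, math.isqrt(p) + 1) if p % d == 0)
    py = p // px                                  # px * py == p, near-square
    blocks = []
    for i in range(px):
        for j in range(py):
            x0, x1 = i * nx // px, (i + 1) * nx // px
            y0, y1 = j * ny // py, (j + 1) * ny // py
            blocks.append(((x0, x1), (y0, y1)))
    return blocks

for b in block_partition(480, 480, 8):            # 8 blocks of a 480x480 grid
    print(b)
```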

  4. A Generic and Efficient E-field Parallel Imaging Correlator for Next-Generation Radio Telescopes

    NASA Astrophysics Data System (ADS)

    Thyagarajan, Nithyanandan; Beardsley, Adam P.; Bowman, Judd D.; Morales, Miguel F.

    2017-05-01

    Modern radio telescopes are favouring densely packed array layouts with large numbers of antennas (N_A ≳ 1000). Since the complexity of traditional correlators scales as O(N_A^2), there will be a steep cost for realizing the full imaging potential of these powerful instruments. Through our generic and efficient E-field Parallel Imaging Correlator (EPIC), we present the first software demonstration of a generalized direct imaging algorithm, namely the Modular Optimal Frequency Fourier (MOFF) imager. Not only does it bring down the cost for dense layouts to O(N_A log_2 N_A), but it can also image from irregular layouts and heterogeneous arrays of antennas. EPIC is highly modular, parallelizable, implemented in object-oriented Python, and publicly available. We have verified the images produced to be equivalent to those from traditional techniques to within a precision set by gridding coarseness. We have also validated our implementation on data observed with the Long Wavelength Array (LWA1). We provide a detailed framework for imaging with heterogeneous arrays and show that EPIC robustly estimates the input sky model for such arrays. Antenna layouts with dense filling factors consisting of a large number of antennas, such as LWA, the Square Kilometre Array, the Hydrogen Epoch of Reionization Array, and the Canadian Hydrogen Intensity Mapping Experiment, will gain significant computational advantage by deploying an optimized version of EPIC. The algorithm is a strong candidate for instruments targeting transient searches of fast radio bursts as well as planetary and exoplanetary phenomena, due to the availability of high-speed calibrated time-domain images and low output bandwidth relative to visibility-based systems.
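
    The cost saving comes from replacing pairwise cross-correlation with gridding followed by a single 2D FFT per time sample. The sketch below shows only that core idea on synthetic data; the antenna positions, grid size, and nearest-cell gridding kernel are toy assumptions and bear no relation to the published EPIC code.

```python
import numpy as np

# Toy direct imaging: grid antenna E-field samples onto an aperture grid,
# FFT once, and square -- instead of O(N_A^2) pairwise correlations.
# Positions, grid size, and cell size (in wavelengths) are made-up values.
rng = np.random.default_rng(0)
n_ant, grid, dx = 64, 128, 0.5
xy = rng.uniform(0, grid * dx, size=(n_ant, 2))
efield = rng.standard_normal(n_ant) + 1j * rng.standard_normal(n_ant)

aperture = np.zeros((grid, grid), dtype=complex)
for (x, y), e in zip(xy, efield):
    i, j = int(x / dx) % grid, int(y / dx) % grid
    aperture[i, j] += e                           # nearest-cell gridding

image = np.abs(np.fft.fftshift(np.fft.fft2(aperture))) ** 2
print("image shape:", image.shape, "peak:", float(image.max()))
```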

  5. A Generic and Efficient E-field Parallel Imaging Correlator for Next-Generation Radio Telescopes

    NASA Astrophysics Data System (ADS)

    Thyagarajan, Nithyanandan; Beardsley, Adam P.; Bowman, Judd D.; Morales, Miguel F.

    2017-01-01

    Modern radio telescopes are favouring densely packed array layouts with large numbers of antennas (N_A ≳ 1000). Since the complexity of traditional correlators scales as O(N_A^2), there will be a steep cost for realizing the full imaging potential of these powerful instruments. Through our generic and efficient E-field Parallel Imaging Correlator (EPIC), we present the first software demonstration of a generalized direct imaging algorithm, namely, the Modular Optimal Frequency Fourier (MOFF) imager. Not only does it bring down the cost for dense layouts to O(N_A log_2 N_A), but it can also image from irregular layouts and heterogeneous arrays of antennas. EPIC is highly modular, parallelizable, implemented in object-oriented Python, and publicly available. We have verified the images produced to be equivalent to those from traditional techniques to within a precision set by gridding coarseness. We have also validated our implementation on data observed with the Long Wavelength Array (LWA1). We provide a detailed framework for imaging with heterogeneous arrays and show that EPIC robustly estimates the input sky model for such arrays. Antenna layouts with dense filling factors consisting of a large number of antennas, such as LWA, the Square Kilometre Array, the Hydrogen Epoch of Reionization Array, and the Canadian Hydrogen Intensity Mapping Experiment, will gain significant computational advantage by deploying an optimized version of EPIC. The algorithm is a strong candidate for instruments targeting transient searches of Fast Radio Bursts (FRBs) as well as planetary and exoplanetary phenomena, due to the availability of high-speed calibrated time-domain images and low output bandwidth relative to visibility-based systems.

  6. Managing internode data communications for an uninitialized process in a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E

    2014-05-20

    A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.

  7. The effect of preventive educational program in cigarette smoking: Extended Parallel Process Model

    PubMed Central

    Gharlipour, Zabihollah; Hazavehei, Seyed Mohammad Mehdi; Moeini, Babak; Nazari, Mahin; Beigi, Abbas Moghim; Tavassoli, Elahe; Heydarabadi, Akbar Babaei; Reisi, Mahnoush; Barkati, Hasan

    2015-01-01

    Background: Cigarette smoking is one of the preventable causes of disease and death. The most important preventive measure is the skill to resist peer pressure. Any educational program should be designed with an emphasis on theories of behavioral change and built on effective educational programming. To investigate educational interventions for the prevention of cigarette smoking, this paper used the Extended Parallel Process Model (EPPM). Materials and Methods: This is a quasi-experimental study. Two middle schools were randomly selected from among schools for male students in Shiraz; 120 students were randomly assigned to the experimental group and 120 students to the control group. After a diagnostic evaluation, educational interventions on the consequences of smoking and on preventive skills were applied. Results: Our results indicated that there was a significant difference between students in the control and experimental groups in the means of perceived susceptibility (P < 0.000, t = 6.84), perceived severity (P < 0.000, t = −11.46), perceived response efficacy (P < 0.000, t = −7.07), perceived self-efficacy (P < 0.000, t = −11.64), and preventive behavior (P < 0.000, t = −24.36). Conclusions: The EPPM, along with teaching the skills necessary to resist peer pressure, was significantly effective in improving preventive behavior against cigarette smoking among adolescents. However, this study recommends further studies on ways of increasing perceived susceptibility to cigarette smoking among adolescents. PMID:25767815

  8. The effect of preventive educational program in cigarette smoking: Extended Parallel Process Model.

    PubMed

    Gharlipour, Zabihollah; Hazavehei, Seyed Mohammad Mehdi; Moeini, Babak; Nazari, Mahin; Beigi, Abbas Moghim; Tavassoli, Elahe; Heydarabadi, Akbar Babaei; Reisi, Mahnoush; Barkati, Hasan

    2015-01-01

    Cigarette smoking is one of the preventable causes of disease and death. The most important preventive measure is the skill to resist peer pressure. Any educational program should be designed with an emphasis on theories of behavioral change and built on effective educational programming. To investigate educational interventions for the prevention of cigarette smoking, this paper used the Extended Parallel Process Model (EPPM). This is a quasi-experimental study. Two middle schools were randomly selected from among schools for male students in Shiraz; 120 students were randomly assigned to the experimental group and 120 students to the control group. After a diagnostic evaluation, educational interventions on the consequences of smoking and on preventive skills were applied. Our results indicated that there was a significant difference between students in the control and experimental groups in the means of perceived susceptibility (P < 0.000, t = 6.84), perceived severity (P < 0.000, t = -11.46), perceived response efficacy (P < 0.000, t = -7.07), perceived self-efficacy (P < 0.000, t = -11.64), and preventive behavior (P < 0.000, t = -24.36). The EPPM, along with teaching the skills necessary to resist peer pressure, was significantly effective in improving preventive behavior against cigarette smoking among adolescents. However, this study recommends further studies on ways of increasing perceived susceptibility to cigarette smoking among adolescents.

  9. Development of Power Supply System with Distributed Generators using Parallel Processing Method

    NASA Astrophysics Data System (ADS)

    Hirose, Kenichi; Takeda, Takashi; Okui, Yoshiaki; Yukita, Kazuto; Goto, Yasuyuki; Ichiyanagi, Katsuhiro; Matsumura, Toshiro

    This paper describes a novel power system that consists of distributed energy resources (DER) with a static switch at the point of common coupling. Use of the static switch with parallel processing control is a new application of the line-interactive type of uninterruptible power supply (UPS). In recent years, various design, operation, and control methods have been studied in order to find more effective ways to utilize renewable energy and to reduce environmental impact. One feature of the proposed power system is that it can interconnect to the existing utility grid without interruption: electrical power distribution to the loads can continue seamlessly between interconnected and isolated operation. The power system has other benefits, such as higher efficiency, demand-side management, easier control of the power system internals, improved reliability of power distribution, and a minimal requirement for protection relays for grid interconnection. The proposed power system has been operating with actual loads of 20 kW on the campus of the Aichi Institute of Technology since 2007.

  10. Application of integration algorithms in a parallel processing environment for the simulation of jet engines

    SciTech Connect

    Krosel, S.M.; Milner, E.J.

    1982-01-01

    Illustrates the application of predictor-corrector integration algorithms developed for the digital parallel processing environment. The algorithms are implemented and evaluated through the use of a software simulator which provides an approximate representation of the parallel processing hardware. Test cases which focus on the use of the algorithms are presented and a specific application using a linear model of a turbofan engine is considered. Results are presented showing the effects of integration step size and the number of processors on simulation accuracy. Real-time performance, inter-processor communication and algorithm startup are also discussed. 10 references.

  11. Application of integration algorithms in a parallel processing environment for the simulation of jet engines

    NASA Technical Reports Server (NTRS)

    Krosel, S. M.; Milner, E. J.

    1982-01-01

    The application of predictor-corrector integration algorithms developed for the digital parallel processing environment is investigated. The algorithms are implemented and evaluated through the use of a software simulator which provides an approximate representation of the parallel processing hardware. Test cases which focus on the use of the algorithms are presented, and a specific application using a linear model of a turbofan engine is considered. Results are presented showing the effects of integration step size and the number of processors on simulation accuracy. Real-time performance, interprocessor communication, and algorithm startup are also discussed.

  12. Parallel implementation of efficient preconditioned linear solver for grid-based applications in chemical physics. II: QMR linear solver

    NASA Astrophysics Data System (ADS)

    Chen, Wenwu; Poirier, Bill

    2006-11-01

    Linear systems in chemical physics often involve matrices with a certain sparse block structure. These can often be solved very effectively using iterative methods (sequence of matrix-vector products) in conjunction with a block Jacobi preconditioner [B. Poirier, Numer. Linear Algebra Appl. 7 (2000) 715]. In a two-part series, we present an efficient parallel implementation, incorporating several additional refinements. The present study (paper II) indicates that the basic parallel sparse matrix-vector product operation itself is the overall scalability bottleneck, faring much more poorly than the specialized, block Jacobi routines considered in a companion paper (paper I). However, a simple dimensional combination scheme is found to alleviate this difficulty.
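
    A block Jacobi preconditioner of the kind referenced here can be sketched directly: invert each diagonal block of the matrix once, then apply the block inverses to the residual inside the iterative solver (QMR in the paper). The matrix, block size, and dense inverses below are toy assumptions; a production code would exploit the sparse block structure.

```python
import numpy as np

# Block Jacobi: factor the diagonal blocks once, then apply them repeatedly.
def block_jacobi_factor(A, bs):
    n = A.shape[0]
    return [np.linalg.inv(A[i:i + bs, i:i + bs]) for i in range(0, n, bs)]

def block_jacobi_apply(blocks, r, bs):
    z = np.empty_like(r)
    for k, inv in enumerate(blocks):              # each block is independent,
        i = k * bs                                # hence easy to parallelize
        z[i:i + bs] = inv @ r[i:i + bs]
    return z

n, bs = 12, 3                                     # toy tridiagonal system
A = (np.diag(np.full(n, 4.0)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1))
blocks = block_jacobi_factor(A, bs)
r = np.ones(n)
print(block_jacobi_apply(blocks, r, bs))          # preconditioned residual
```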

  13. Next generation Purex modeling by way of parallel processing with high performance computers

    SciTech Connect

    DeMuth, S.F.

    1993-08-01

    The Plutonium and Uranium Extraction (Purex) process is the predominant method used worldwide for solvent extraction in reprocessing spent nuclear fuels. Proper flowsheet design has a significant impact on the character of the process waste. Past Purex flowsheet modeling has been based on equilibrium conditions. It can be shown for the Purex process that optimum separation does not necessarily occur at equilibrium conditions. The next generation Purex flowsheet models should incorporate the fundamental diffusion and chemical kinetic processes required to study time-dependent behavior. Use of parallel processing with high-performance computers will permit transient multistage and multispecies design calculations based on mass transfer with simultaneous chemical reaction models. This paper presents an applicable mass transfer with chemical reaction model for the Purex system and presents a parallel processing solution methodology.

  14. Processing communications events in parallel active messaging interface by awakening thread from wait state

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-10-22

    Processing data communications events in a parallel active messaging interface ('PAMI') of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
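
    The wait/awaken pattern claimed here corresponds, in conventional threading terms, to a condition variable: the advance thread sleeps while no events are pending and is notified when one arrives. The sketch below is a generic Python rendering under assumed names, not the patented PAMI implementation.

```python
import threading, queue, time

events = queue.Queue()
cond = threading.Condition()

def advance():
    while True:
        with cond:
            while events.empty():                 # no actionable events: wait
                cond.wait()
        ev = events.get()
        if ev == "shutdown":
            return
        print("processing event:", ev)

def deliver(ev):
    with cond:
        events.put(ev)
        cond.notify()                             # awaken the waiting thread

t = threading.Thread(target=advance)
t.start()
for ev in ["recv-done", "send-done", "shutdown"]:
    time.sleep(0.1)
    deliver(ev)
t.join()
```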

  15. Toward energy-efficient and distributed mobile health monitoring using parallel offloading.

    PubMed

    Ahnn, Jong Hoon; Potkonjak, Miodrag

    2013-01-01

    Although mobile health monitoring, where mobile sensors continuously gather, process, and update sensor readings (e.g., vital signs) from a patient's sensors, is emerging, little effort has been invested in the energy-efficient management of sensor information gathering and processing. Mobile health monitoring with a focus on energy consumption may instead be holistically analyzed and systematically designed as a global solution to optimization subproblems. We propose a distributed and energy-saving mobile health platform, called mHealthMon, where mobile users publish/access sensor data via a cloud computing-based distributed P2P overlay network. The key objective is to satisfy the mobile health monitoring application's quality of service requirements by modeling each subsystem: mobile clients with medical sensors, the wireless network medium, and distributed cloud services. Through simulations based on experimental data, we show that the proposed system can be up to 10.1 times more energy-efficient and 20.2 times faster than a standalone mobile health monitoring application in various mobile health monitoring scenarios, applying a realistic mobility model.

  16. mHealthMon: toward energy-efficient and distributed mobile health monitoring using parallel offloading.

    PubMed

    Ahnn, Jong Hoon; Potkonjak, Miodrag

    2013-10-01

    Although mobile health monitoring, where mobile sensors continuously gather, process, and update sensor readings (e.g., vital signs) from a patient's sensors, is emerging, little effort has been invested in the energy-efficient management of sensor information gathering and processing. Mobile health monitoring with a focus on energy consumption may instead be holistically analyzed and systematically designed as a global solution to optimization subproblems. This paper presents an attempt to decompose the very complex mobile health monitoring system such that each layer in the system corresponds to a decomposed subproblem, with the interfaces between them quantified as functions of the optimization variables in order to orchestrate the subproblems. We propose a distributed and energy-saving mobile health platform, called mHealthMon, where mobile users publish/access sensor data via a cloud computing-based distributed P2P overlay network. The key objective is to satisfy the mobile health monitoring application's quality of service requirements by modeling each subsystem: mobile clients with medical sensors, the wireless network medium, and distributed cloud services. Through simulations based on experimental data, we show that the proposed system can be up to 10.1 times more energy-efficient and 20.2 times faster than a standalone mobile health monitoring application in various mobile health monitoring scenarios, applying a realistic mobility model.

  17. Parallel and serial processes in the human oculomotor system: bimodal integration and express saccades.

    PubMed

    Nozawa, G; Reuter-Lorenz, P A; Hughes, H C

    1994-01-01

    Saccadic reaction times (SRTs) were analyzed in the context of stochastic models of information processing (e.g., Townsend and Ashby 1983) to reveal the processing architecture(s) underlying integrative interactions between visual and auditory inputs and the mechanisms of express saccades. The results support the following conclusions. Bimodal (visual+auditory) targets are processed in parallel, and facilitate SRT to an extent that exceeds levels attainable by probability summation. This strongly implies neural summation between elements responding to spatially aligned visual and auditory inputs in the human oculomotor system. Second, express saccades are produced within a separable processing stage that is organized in series with that responsible for intersensory integration. A model is developed that implements this combination of parallel and serial processing. The activity in parallel input channels is summed within a sensory stage which is organized in series with a pre-motor and motor stage. The time course of each subprocess is considered a random variable, and different experimental manipulations can selectively influence different stages. Parallels between the model and physiological data are explored.

  18. Rapid attentional selection processes operate independently and in parallel for multiple targets.

    PubMed

    Grubert, Anna; Eimer, Martin

    2016-12-01

    The question whether multiple objects are selected serially or in parallel remains contentious. Previous studies employed the N2pc component as a marker of attentional selection to show that multiple selection processes can be activated concurrently. The present study demonstrates that the concurrent selection of multiple targets reflects genuinely parallel processing that is unaffected by whether or when an additional selection process is elicited simultaneously for another target. Experiment 1 showed that N2pc components triggered during the selection of a colour-defined target were not modulated by the presence versus absence of a second target that appeared in close temporal proximity. Experiment 2 revealed that the same rapid parallel selection processes were elicited regardless of whether two targets appeared simultaneously or in two successive displays. Results show that rapid attentional selection processes within the first 200ms after stimulus onset can be triggered in parallel for multiple objects in the visual field. Copyright © 2016 Elsevier B.V. All rights reserved.

  19. Bin-Hash Indexing: A Parallel Method for Fast Query Processing

    SciTech Connect

    Bethel, Edward W; Gosink, Luke J.; Wu, Kesheng; Bethel, Edward Wes; Owens, John D.; Joy, Kenneth I.

    2008-06-27

    This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency and is therefore well-suited for emerging commodity parallel processors, such as multi-cores, cell processors, and general purpose graphics processing units (GPUs). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that cannot be resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records can be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism; over 12,000 records can be evaluated simultaneously at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.
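
    The two-step query structure, resolving most records from bin boundaries alone and checking exact values only in the single boundary bin, can be sketched as follows; the synthetic data, equal-width bins, and the plain dictionary standing in for the perfect spatial hash are simplifying assumptions.

```python
import numpy as np

# Toy Bin-Hash: coarse bins resolve most records from boundaries alone;
# only the boundary bin needs exact values from its per-bin table.
rng = np.random.default_rng(1)
data = rng.uniform(0, 100, 10_000)
edges = np.linspace(0, 100, 33)                   # 32 equal-width bins
bin_of = np.digitize(data, edges) - 1
tables = {b: {i: data[i] for i in np.flatnonzero(bin_of == b)}
          for b in range(32)}                     # per-bin value lookup

def query_less_than(x):
    b = int(np.digitize(x, edges)) - 1
    hits = list(np.flatnonzero(bin_of < b))       # bins fully below x: resolved
    hits += [i for i, v in tables[b].items() if v < x]  # boundary bin: check values
    return hits

print(len(query_less_than(37.5)), "records < 37.5")
```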

  20. An efficient parallel algebraic multigrid method for 3D injection moulding simulation based on finite volume method

    NASA Astrophysics Data System (ADS)

    Hu, Zixiang; Zhang, Yun; Liang, Junjie; Shi, Songxin; Zhou, Huamin

    2014-07-01

    Elapsed time is always one of the most important performance measures for polymer injection moulding simulation. Solving the pressure correction equations is the most time-consuming part of mould filling simulation using the finite volume method with SIMPLE-like algorithms. Algebraic multigrid (AMG) is one of the most promising methods for this type of elliptic equation: it performs better than common one-level iterative methods, especially for large problems, and it is also suitable for parallel computing. However, AMG is not easy to apply due to its complex theory and poor generality across the large range of computational fluid dynamics applications. This paper presents a robust and efficient parallel AMG solver, A1-pAMG, for 3D mould filling simulation of injection moulding. Numerical experiments demonstrate that A1-pAMG has better parallel performance than classical AMG, and also has algorithmic scalability in the context of 3D unstructured problems.

  1. Parallel Digital Watermarking Process on Ultrasound Medical Images in Multicores Environment

    PubMed Central

    Khor, Hui Liang; Liew, Siau-Chuin; Zain, Jasni Mohd.

    2016-01-01

    With the advancement of communication network technology, digital medical images can be transmitted to healthcare professionals via internal networks or public networks (e.g., the Internet). This, however, also exposes the transmitted images to security threats, such as tampering or the insertion of false data, which may cause inaccurate diagnosis and treatment. Medical image distortion cannot be tolerated for diagnostic purposes; thus, digital watermarking of medical images has been introduced. So far, most watermarking research has been done on single-frame medical images, which is impractical in real environments. In this paper, digital watermarking of multiframe medical images is proposed. In order to speed up multiframe watermarking processing time, parallel watermark processing of medical images utilizing multicore technology is introduced. Experimental results show that the elapsed time for parallel watermark processing is much shorter than for sequential watermark processing. PMID:26981111
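
    Because the frames of a multiframe study are independent, the parallelization is a straightforward pool map over frames. The sketch below uses a simplified LSB embedding as a stand-in for the paper's actual watermarking scheme; the frame sizes and payload are made-up values.

```python
from multiprocessing import Pool
import numpy as np

# Frames are independent, so embedding is a pool map over frames.
def embed(args):
    frame, wm = args
    out = frame.copy()
    out.flat[:wm.size] = (out.flat[:wm.size] & 0xFE) | wm   # overwrite LSBs
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 256, (512, 512), dtype=np.uint8)
              for _ in range(16)]                 # fake ultrasound frames
    watermark = rng.integers(0, 2, 1024, dtype=np.uint8)    # 1024-bit payload
    with Pool() as pool:
        marked = pool.map(embed, [(f, watermark) for f in frames])
    ok = ((marked[0].flat[:1024] & 1) == watermark).all()
    print(len(marked), "frames watermarked; payload intact:", bool(ok))
```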

  2. Improving computational efficiency and tractability of protein design using a piecemeal approach. A strategy for parallel and distributed protein design.

    PubMed

    Pitman, Derek J; Schenkelberg, Christian D; Huang, Yao-Ming; Teets, Frank D; DiTursi, Daniel; Bystroff, Christopher

    2014-04-15

    Accuracy in protein design requires a fine-grained rotamer search, multiple backbone conformations, and a detailed energy function, creating a burden in runtime and memory requirements. A design task may be split into manageable pieces in both three-dimensional space and in the rotamer search space to produce small, fast jobs that are easily distributed. However, these jobs must overlap, presenting a problem in resolving conflicting solutions in the overlap regions. Piecemeal design, in which the design space is split into overlapping regions and rotamer search spaces, accelerates the design process whether jobs are run in series or in parallel. Large jobs that cannot fit in memory were made possible by splitting. Accepting the consensus amino acid selection in conflict regions led to non-optimal choices. Instead, conflicts were resolved using a second pass, in which the split regions were re-combined and designed as one, producing results that were closer to optimal with a minimal increase in runtime over the consensus strategy. Splitting the search space at the rotamer level instead of at the amino acid level further improved the efficiency by reducing the search space in the second pass. Programs for splitting protein design expressions are available at www.bioinfo.rpi.edu/tools/piecemeal.html. Contact: bystrc@rpi.edu. Supplementary data are available at Bioinformatics online. © The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  3. The Large Laboratory Course: Organize It to Parallel Industrial Process Development.

    ERIC Educational Resources Information Center

    Eckert, Roger E.; Ybarra, Robert M.

    1988-01-01

    Describes a senior level chemical engineering course at Purdue University that parallels an industrial process development department. Stresses the course organization, manager-engineer contract, evaluation of students, course evaluation, and gives examples of course improvements made during the course. (CW)

  4. Parallel processing in the honeybee olfactory pathway: structure, function, and evolution.

    PubMed

    Rössler, Wolfgang; Brill, Martin F

    2013-11-01

    Animals face highly complex and dynamic olfactory stimuli in their natural environments, which require fast and reliable olfactory processing. Parallel processing is a common principle of sensory systems supporting this task, for example in visual and auditory systems, but its role in olfaction remained unclear. Studies in the honeybee focused on a dual olfactory pathway. Two sets of projection neurons connect glomeruli in two antennal-lobe hemilobes via lateral and medial tracts in opposite sequence with the mushroom bodies and lateral horn. Comparative studies suggest that this dual-tract circuit represents a unique adaptation in Hymenoptera. Imaging studies indicate that glomeruli in both hemilobes receive redundant sensory input. Recent simultaneous multi-unit recordings from projection neurons of both tracts revealed widely overlapping response profiles strongly indicating parallel olfactory processing. Whereas lateral-tract neurons respond fast with broad (generalistic) profiles, medial-tract neurons are odorant specific and respond slower. In analogy to "what-" and "where" subsystems in visual pathways, this suggests two parallel olfactory subsystems providing "what-" (quality) and "when" (temporal) information. Temporal response properties may support across-tract coincidence coding in higher centers. Parallel olfactory processing likely enhances perception of complex odorant mixtures to decode the diverse and dynamic olfactory world of a social insect.

  5. An Inconvenient Truth: An Application of the Extended Parallel Process Model

    ERIC Educational Resources Information Center

    Goodall, Catherine E.; Roberto, Anthony J.

    2008-01-01

    "An Inconvenient Truth" is an Academy Award-winning documentary about global warming presented by Al Gore. This documentary is appropriate for a lesson on fear appeals and the extended parallel process model (EPPM). The EPPM is concerned with the effects of perceived threat and efficacy on behavior change. Perceived threat is composed of an…

  6. Tracking the Continuity of Language Comprehension: Computer Mouse Trajectories Suggest Parallel Syntactic Processing

    ERIC Educational Resources Information Center

    Farmer, Thomas A.; Cargill, Sarah A.; Hindy, Nicholas C.; Dale, Rick; Spivey, Michael J.

    2007-01-01

    Although several theories of online syntactic processing assume the parallel activation of multiple syntactic representations, evidence supporting simultaneous activation has been inconclusive. Here, the continuous and non-ballistic properties of computer mouse movements are exploited, by recording their streaming x, y coordinates to procure…

  7. User's guide to the Parallel Processing Extension of the Prognosis Model

    Treesearch

    Nicholas L. Crookston; Albert R. Stage

    1991-01-01

    The Parallel Processing Extension (PPE) of the Prognosis Model was designed to analyze responses of numerous stands to coordinated management and pest impacts that operate at the landscape level of forests. Vegetation-related resource supply analysis can be readily performed for a thousand or more sample stands for projections 400 years into the future. Capabilities...

  8. Parallel Distributed Processing at 25: Further Explorations in the Microstructure of Cognition

    ERIC Educational Resources Information Center

    Rogers, Timothy T.; McClelland, James L.

    2014-01-01

    This paper introduces a special issue of "Cognitive Science" initiated on the 25th anniversary of the publication of "Parallel Distributed Processing" (PDP), a two-volume work that introduced the use of neural network models as vehicles for understanding cognition. The collection surveys the core commitments of the PDP…

  9. Using the Extended Parallel Process Model to Examine Teachers' Likelihood of Intervening in Bullying

    ERIC Educational Resources Information Center

    Duong, Jeffrey; Bradshaw, Catherine P.

    2013-01-01

    Background: Teachers play a critical role in protecting students from harm in schools, but little is known about their attitudes toward addressing problems like bullying. Previous studies have rarely used theoretical frameworks, making it difficult to advance this area of research. Using the Extended Parallel Process Model (EPPM), we examined the…

  10. Real-time SHVC software decoding with multi-threaded parallel processing

    NASA Astrophysics Data System (ADS)

    Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

    2014-09-01

    This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of the SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two-layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high-level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipelined with multiple threads based on groups of coding tree units (CTUs). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7-2600 processor running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for bitstreams generated under the SHVC common test conditions of the JCT-VC standardization group. The decoding performance at various bitrates, with different optimization technologies and different numbers of threads, is compared in terms of decoding speed and resource usage, including processor and memory.
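
    The CTU-group pipeline can be sketched with bounded queues connecting the three stages, so that each stage works on one group while its neighbours work on others; the bounded queue size models the parallelism/synchronization trade-off. The stage bodies below are placeholders, not real SHVC decoding.

```python
import threading, queue

# Three pipeline stages connected by bounded queues; a None "poison pill"
# shuts the pipeline down in order.
def stage(name, inp, out):
    while True:
        item = inp.get()
        if item is None:                          # poison pill: stop stage
            if out:
                out.put(None)
            return
        if out:
            out.put(f"{item}->{name}")
        else:
            print("done:", f"{item}->{name}")

q1, q2, q3 = queue.Queue(2), queue.Queue(2), queue.Queue(2)
threads = [threading.Thread(target=stage, args=a) for a in
           [("entropy", q1, q2), ("reconstruct", q2, q3), ("loopfilter", q3, None)]]
for t in threads:
    t.start()
for g in range(4):
    q1.put(f"ctu_group{g}")                       # feed CTU groups into the pipe
q1.put(None)
for t in threads:
    t.join()
```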

  12. High Performance Parallel Processing Project: Industrial computing initiative. Progress reports for fiscal year 1995

    SciTech Connect

    Koniges, A.

    1996-02-09

    This project is a package of 11 individual CRADAs plus hardware. This innovative project established a three-year multi-party collaboration that is significantly accelerating the availability of commercial massively parallel processing computing software technology to U.S. government, academic, and industrial end-users. This report contains individual presentations from nine principal investigators along with overall program information.

  13. Using the Extended Parallel Process Model to Examine Teachers' Likelihood of Intervening in Bullying

    ERIC Educational Resources Information Center

    Duong, Jeffrey; Bradshaw, Catherine P.

    2013-01-01

    Background: Teachers play a critical role in protecting students from harm in schools, but little is known about their attitudes toward addressing problems like bullying. Previous studies have rarely used theoretical frameworks, making it difficult to advance this area of research. Using the Extended Parallel Process Model (EPPM), we examined the…

  14. Parallel Process and Isomorphism: A Model for Decision Making in the Supervisory Triad

    ERIC Educational Resources Information Center

    Koltz, Rebecca L.; Odegard, Melissa A.; Feit, Stephen S.; Provost, Kent; Smith, Travis

    2012-01-01

    Parallel process and isomorphism are two supervisory concepts that are often discussed independently but rarely discussed in connection with each other. These two concepts, philosophically, have different historical roots, as well as different implications for interventions with regard to the supervisory triad. The authors examine the difference…

  15. Cocaine Use and Delinquent Behavior among High-Risk Youths: A Growth Model of Parallel Processes

    ERIC Educational Resources Information Center

    Dembo, Richard; Sullivan, Christopher

    2009-01-01

    We report the results of a parallel-process, latent growth model analysis examining the relationships between cocaine use and delinquent behavior among youths. The study examined a sample of 278 justice-involved juveniles completing at least one of three follow-up interviews as part of a National Institute on Drug Abuse-funded study. The results…

  16. A Neurally Plausible Parallel Distributed Processing Model of Event-Related Potential Word Reading Data

    ERIC Educational Resources Information Center

    Laszlo, Sarah; Plaut, David C.

    2012-01-01

    The Parallel Distributed Processing (PDP) framework has significant potential for producing models of cognitive tasks that approximate how the brain performs the same tasks. To date, however, there has been relatively little contact between PDP modeling and data from cognitive neuroscience. In an attempt to advance the relationship between…

  17. One Factor or Two Parallel Processes? Comorbidity and Development of Adolescent Anxiety and Depressive Disorder Symptoms

    ERIC Educational Resources Information Center

    Hale, William W., III; Raaijmakers, Quinten A. W.; Muris, Peter; van Hoof, Anne; Meeus, Wim H. J.

    2009-01-01

    Background: This study investigates whether anxiety and depressive disorder symptoms of adolescents from the general community are best described by a model that assumes they are indicative of one general factor or by a model that assumes they are two distinct disorders with parallel growth processes. Additional analyses were conducted to explore…

  18. Parallel Process and Isomorphism: A Model for Decision Making in the Supervisory Triad

    ERIC Educational Resources Information Center

    Koltz, Rebecca L.; Odegard, Melissa A.; Feit, Stephen S.; Provost, Kent; Smith, Travis

    2012-01-01

    Parallel process and isomorphism are two supervisory concepts that are often discussed independently but rarely discussed in connection with each other. These two concepts, philosophically, have different historical roots, as well as different implications for interventions with regard to the supervisory triad. The authors examine the difference…

  19. A Neurally Plausible Parallel Distributed Processing Model of Event-Related Potential Word Reading Data

    ERIC Educational Resources Information Center

    Laszlo, Sarah; Plaut, David C.

    2012-01-01

    The Parallel Distributed Processing (PDP) framework has significant potential for producing models of cognitive tasks that approximate how the brain performs the same tasks. To date, however, there has been relatively little contact between PDP modeling and data from cognitive neuroscience. In an attempt to advance the relationship between…

  20. An Inconvenient Truth: An Application of the Extended Parallel Process Model

    ERIC Educational Resources Information Center

    Goodall, Catherine E.; Roberto, Anthony J.

    2008-01-01

    "An Inconvenient Truth" is an Academy Award-winning documentary about global warming presented by Al Gore. This documentary is appropriate for a lesson on fear appeals and the extended parallel process model (EPPM). The EPPM is concerned with the effects of perceived threat and efficacy on behavior change. Perceived threat is composed of an…

  1. Cocaine Use and Delinquent Behavior among High-Risk Youths: A Growth Model of Parallel Processes

    ERIC Educational Resources Information Center

    Dembo, Richard; Sullivan, Christopher

    2009-01-01

    We report the results of a parallel-process, latent growth model analysis examining the relationships between cocaine use and delinquent behavior among youths. The study examined a sample of 278 justice-involved juveniles completing at least one of three follow-up interviews as part of a National Institute on Drug Abuse-funded study. The results…

  2. Tracking the Continuity of Language Comprehension: Computer Mouse Trajectories Suggest Parallel Syntactic Processing

    ERIC Educational Resources Information Center

    Farmer, Thomas A.; Cargill, Sarah A.; Hindy, Nicholas C.; Dale, Rick; Spivey, Michael J.

    2007-01-01

    Although several theories of online syntactic processing assume the parallel activation of multiple syntactic representations, evidence supporting simultaneous activation has been inconclusive. Here, the continuous and non-ballistic properties of computer mouse movements are exploited, by recording their streaming x, y coordinates to procure…

  3. Efficient audio signal processing for embedded systems

    NASA Astrophysics Data System (ADS)

    Chiu, Leung Kin

    As mobile platforms continue to pack on more computational power, electronics manufacturers have started to differentiate their products by enhancing the audio features. However, consumers also demand smaller devices that can operate for a longer time, hence imposing design constraints. In this research, we investigate two design strategies that allow us to efficiently process audio signals on embedded systems such as mobile phones and portable electronics. In the first strategy, we exploit properties of the human auditory system to process audio signals. We designed a sound enhancement algorithm to make piezoelectric loudspeakers sound "richer" and "fuller." Piezoelectric speakers have a small form factor but exhibit poor response in the low-frequency region. In the algorithm, we combine psychoacoustic bass extension and dynamic range compression to improve the perceived bass coming out of the tiny speakers. We also developed an audio energy reduction algorithm for loudspeaker power management. The perceptually transparent algorithm extends the battery life of mobile devices and prevents thermal damage in speakers. This method is similar to audio compression algorithms, which encode audio signals in such a way that the compression artifacts are not easily perceivable. Instead of reducing the storage space, however, we suppress the audio contents that are below the hearing threshold, therefore reducing the signal energy. In the second strategy, we use low-power analog circuits to process the signal before digitizing it. We designed an analog front-end for sound detection and implemented it on a field-programmable analog array (FPAA). The system is an example of an analog-to-information converter. The sound classifier front-end can be used in a wide range of applications because programmable floating-gate transistors are employed to store classifier weights. Moreover, we incorporated a feature selection algorithm to simplify the analog front-end. A machine

  4. Highly efficient and exact method for parallelization of grid-based algorithms and its implementation in DelPhi

    PubMed Central

    Li, Chuan; Li, Lin; Zhang, Jie; Alexov, Emil

    2012-01-01

    The Gauss-Seidel method is a standard iterative numerical method widely used to solve a system of equations and, in general, is more efficient compared to other iterative methods, such as the Jacobi method. However, the standard implementation of the Gauss-Seidel method restricts its utilization in parallel computing due to its requirement of using updated neighboring values (i.e., in the current iteration) as soon as they are available. Here we report an efficient and exact (not requiring assumptions) method to parallelize iterations and to reduce the computational time as a linear/nearly linear function of the number of CPUs. In contrast to other existing solutions, our method does not require any assumptions and is equally applicable for solving linear and nonlinear equations. This approach is implemented in the DelPhi program, a finite-difference Poisson-Boltzmann equation solver used to model electrostatics in molecular biology. This development makes the iterative procedure for obtaining the electrostatic potential distribution in the parallelized DelPhi severalfold faster than in the serial code. Further, we demonstrate the advantages of the new parallelized DelPhi by computing the electrostatic potential and the corresponding energies of large supramolecular structures. PMID:22674480
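
    The data dependency that blocks naive parallelization of Gauss-Seidel can be made concrete with a small sketch. The snippet below uses red-black (checkerboard) ordering, a standard reordering in which all same-colored nodes can be updated simultaneously; note that this reordering is not the exact method reported for DelPhi above, it is shown only to illustrate the dependency problem. It solves a 2D Laplace problem with numpy:

        import numpy as np

        def red_black_gauss_seidel(u, sweeps=200):
            """Gauss-Seidel sweeps in red-black order on a 2D Laplace problem."""
            for _ in range(sweeps):
                for color in (0, 1):                     # 0 = "red", 1 = "black"
                    for i in range(1, u.shape[0] - 1):
                        j0 = 1 + (i + color) % 2         # checkerboard column offset
                        # all cells of one color in this row update independently
                        u[i, j0:-1:2] = 0.25 * (u[i - 1, j0:-1:2] + u[i + 1, j0:-1:2]
                                                + u[i, j0 - 1:-2:2] + u[i, j0 + 1::2])
            return u

        u = np.zeros((32, 32))
        u[0, :] = 1.0                                    # fixed hot top boundary
        print(red_black_gauss_seidel(u)[16, 16])         # interior value after sweeps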

  5. Parallel pulse processing and data acquisition for high speed, low error flow cytometry

    DOEpatents

    Engh, G.J. van den; Stokdijk, W.

    1992-09-22

    A digitally synchronized parallel pulse processing and data acquisition system for a flow cytometer has multiple parallel input channels with independent pulse digitization and FIFO storage buffer. A trigger circuit controls the pulse digitization on all channels. After an event has been stored in each FIFO, a bus controller moves the oldest entry from each FIFO buffer onto a common data bus. The trigger circuit generates an ID number for each FIFO entry, which is checked by an error detection circuit. The system has high speed and low error rate. 17 figs.

  6. Parallel pulse processing and data acquisition for high speed, low error flow cytometry

    DOEpatents

    van den Engh, Gerrit J.; Stokdijk, Willem

    1992-01-01

    A digitally synchronized parallel pulse processing and data acquisition system for a flow cytometer has multiple parallel input channels with independent pulse digitization and FIFO storage buffer. A trigger circuit controls the pulse digitization on all channels. After an event has been stored in each FIFO, a bus controller moves the oldest entry from each FIFO buffer onto a common data bus. The trigger circuit generates an ID number for each FIFO entry, which is checked by an error detection circuit. The system has high speed and low error rate.
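
    A software analogue of the patented architecture, purely for illustration: one FIFO per channel, a trigger that stamps each digitized event with an ID, and a bus controller that pops the oldest entry from every FIFO and verifies the IDs agree (the error-detection step). All values here are simulated:

        from collections import deque
        import random

        N_CHANNELS = 4
        fifos = [deque() for _ in range(N_CHANNELS)]

        def trigger(event_id):
            """Digitize one pulse per channel and stamp it with the event ID."""
            for fifo in fifos:
                sample = random.random()          # stand-in for a digitized pulse
                fifo.append((event_id, sample))

        def bus_controller():
            """Move the oldest entry of each FIFO onto the 'bus'; check IDs match."""
            entries = [fifo.popleft() for fifo in fifos]
            ids = {event_id for event_id, _ in entries}
            if len(ids) != 1:                     # error-detection circuit
                raise RuntimeError(f"event ID mismatch: {ids}")
            return entries

        for event_id in range(10):
            trigger(event_id)
        while fifos[0]:
            print(bus_controller())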

  7. Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

    NASA Astrophysics Data System (ADS)

    Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

    2016-04-01

    Advances in graphics processing unit technology towards parallel architectures [1], comprising thousands of cores and multiples of parallel threads, provide the hardware foundation for the rapid processing of various parallel applications in seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using CUDA C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel for comparison, are: fuzzy k-means clustering with expert knowledge [7] in assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, CUDA C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering. References: [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
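
    The data-parallel core that such GPU implementations exploit is that every event-to-centroid distance can be computed independently. Below is a numpy sketch of plain k-means on fake epicentre coordinates, vectorized on the CPU; the fuzzy memberships, expert priors, and CUDA C kernels of the actual study are omitted:

        import numpy as np

        def kmeans(events, k=3, iters=50, seed=0):
            rng = np.random.default_rng(seed)
            centroids = events[rng.choice(len(events), k, replace=False)]
            for _ in range(iters):
                # all (n_events x k) distances computed in one vectorized step
                d = np.linalg.norm(events[:, None, :] - centroids[None, :, :], axis=2)
                labels = d.argmin(axis=1)
                for c in range(k):
                    if (labels == c).any():
                        centroids[c] = events[labels == c].mean(axis=0)
            return labels, centroids

        events = np.random.default_rng(1).random((500, 2))   # fake (lat, lon) pairs
        labels, cents = kmeans(events)
        print(cents)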

  8. Parallel multigrid solver of radiative transfer equation for photon transport via graphics processing unit.

    PubMed

    Gao, Hao; Phan, Lan; Lin, Yuting

    2012-09-01

    A graphics processing unit-based parallel multigrid solver for a radiative transfer equation with vacuum boundary condition or reflection boundary condition is presented for heterogeneous media with complex geometry based on two-dimensional triangular meshes or three-dimensional tetrahedral meshes. The computational complexity of this parallel solver is linearly proportional to the degrees of freedom in both angular and spatial variables, while the full multigrid method is utilized to minimize the number of iterations. The overall gain of speed is roughly 30 to 300 fold with respect to our prior multigrid solver, which depends on the underlying regime and the parallelization. The numerical validations are presented with the MATLAB codes at https://sites.google.com/site/rtefastsolver/.

  9. Parallel multigrid solver of radiative transfer equation for photon transport via graphics processing unit

    PubMed Central

    Phan, Lan; Lin, Yuting

    2012-01-01

    A graphics processing unit–based parallel multigrid solver for a radiative transfer equation with vacuum boundary condition or reflection boundary condition is presented for heterogeneous media with complex geometry based on two-dimensional triangular meshes or three-dimensional tetrahedral meshes. The computational complexity of this parallel solver is linearly proportional to the degrees of freedom in both angular and spatial variables, while the full multigrid method is utilized to minimize the number of iterations. The overall gain of speed is roughly 30 to 300 fold with respect to our prior multigrid solver, which depends on the underlying regime and the parallelization. The numerical validations are presented with the MATLAB codes at https://sites.google.com/site/rtefastsolver/. PMID:23085905

  10. A domain decomposition parallel processing algorithm for molecular dynamics simulations of polymers

    NASA Astrophysics Data System (ADS)

    Brown, David; Clarke, Julian H. R.; Okuda, Motoi; Yamazaki, Takao

    1994-10-01

    We describe in this paper a domain decomposition molecular dynamics algorithm for use on distributed memory parallel computers which is capable of handling systems containing rigid bond constraints and three- and four-body potentials as well as non-bonded potentials. The algorithm has been successfully implemented on the Fujitsu 1024 processor element AP1000 machine. The performance has been compared with and benchmarked against the alternative cloning method of parallel processing [D. Brown, J.H.R. Clarke, M. Okuda and T. Yamazaki, J. Chem. Phys., 100 (1994) 1684] and results obtained using other scalar and vector machines. Two parallel versions of the SHAKE algorithm, which solves the bond length constraints problem, have been compared with regard to optimising the performance of this procedure.
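
    The essence of domain decomposition can be sketched in a few lines: particles are binned into slab-shaped spatial domains, and each worker process computes interactions among its own particles. Cross-boundary pairs, which in a real molecular dynamics code require ghost-particle exchange, are omitted here, and the cutoff and potential are illustrative stand-ins:

        import numpy as np
        from multiprocessing import Pool

        BOX, N_DOMAINS = 10.0, 4

        def local_energy(pts):
            """Lennard-Jones energy (reduced units) among particles of one domain."""
            e = 0.0
            for i in range(len(pts)):
                for j in range(i + 1, len(pts)):
                    r = np.linalg.norm(pts[i] - pts[j])
                    if r < 2.5:                        # cutoff radius (assumed)
                        e += 4.0 * (r**-12 - r**-6)
            return e

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            pos = rng.uniform(0, BOX, size=(400, 3))
            # bin particles into slab-shaped domains along x, one slab per worker
            dom = (pos[:, 0] // (BOX / N_DOMAINS)).astype(int).clip(0, N_DOMAINS - 1)
            slabs = [pos[dom == d] for d in range(N_DOMAINS)]
            with Pool(N_DOMAINS) as pool:
                print("intra-domain energy:", sum(pool.map(local_energy, slabs)))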

  11. Solving the Quadratic Assignment Problems using Parallel ACO with Symmetric Multi Processing

    NASA Astrophysics Data System (ADS)

    Tsutsui, Shigeyoshi

    In this paper, we propose several types of parallel ant colony optimization algorithms with symmetric multiprocessing for solving the quadratic assignment problem (QAP). These models include master-slave models and island models. As the base ant colony optimization algorithm, we used the cunning Ant System (cAS), which showed promising performance in our previous studies. We evaluated each parallel algorithm under the condition that the run time for each parallel algorithm and for the base sequential algorithm is the same. The results suggest that using the master-slave model with an increased number of iterations of the ant colony optimization algorithm is promising for solving quadratic assignment problems on real or real-like instances.
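
    A minimal master-slave sketch in Python: the master builds candidate permutations (random here, standing in for cAS's pheromone-guided construction) and worker processes evaluate the QAP cost in parallel. The flow and distance matrices are random stand-ins:

        import numpy as np
        from multiprocessing import Pool

        N = 12
        rng = np.random.default_rng(1)
        FLOW = rng.integers(1, 10, (N, N))
        DIST = rng.integers(1, 10, (N, N))

        def qap_cost(perm):
            """Cost of assigning facility i to location perm[i]."""
            p = np.asarray(perm)
            return int((FLOW * DIST[np.ix_(p, p)]).sum())

        if __name__ == "__main__":
            ants = [rng.permutation(N) for _ in range(64)]   # master builds solutions
            with Pool(4) as pool:                            # slaves evaluate them
                costs = pool.map(qap_cost, ants)
            best = min(zip(costs, map(tuple, ants)))
            print("best cost:", best[0])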

  12. Hippocampal-prefrontal dynamics in spatial working memory: interactions and independent parallel processing.

    PubMed

    Churchwell, John C; Kesner, Raymond P

    2011-12-01

    Memory processes may be independent, compete, operate in parallel, or interact. In accordance with this view, behavioral studies suggest that the hippocampus (HPC) and prefrontal cortex (PFC) may act as an integrated circuit during performance of tasks that require working memory over longer delays, whereas during short delays the HPC and PFC may operate in parallel or have completely dissociable functions. In the present investigation we tested rats in a spatial delayed non-match to sample working memory task using short and long time delays to evaluate the hypothesis that intermediate CA1 region of the HPC (iCA1) and medial PFC (mPFC) interact and operate in parallel under different temporal working memory constraints. In order to assess the functional role of these structures, we used an inactivation strategy in which each subject received bilateral chronic cannula implantation of the iCA1 and mPFC, allowing us to perform bilateral, contralateral, ipsilateral, and combined bilateral inactivation of structures and structure pairs within each subject. This novel approach allowed us to test for circuit-level systems interactions, as well as independent parallel processing, while we simultaneously parametrically manipulated the temporal dimension of the task. The current results suggest that, at longer delays, iCA1 and mPFC interact to coordinate retrospective and prospective memory processes in anticipation of obtaining a remote goal, whereas at short delays either structure may independently represent spatial information sufficient to successfully complete the task.

  13. A multi-satellite orbit determination problem in a parallel processing environment

    NASA Technical Reports Server (NTRS)

    Deakyne, M. S.; Anderle, R. J.

    1988-01-01

    The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube parallel processor to investigate performance and gain experience with parallel processors on a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube, so that problems encountered or successes achieved in addressing the orbit determination problem would be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm so as to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems, and the facility was used to bring a realism to the study that would highlight problems which might not otherwise be anticipated. Secondary objectives were to gain experience with a non-trivial problem in a parallel processor environment; to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies; and to explore granularity (coarse vs. fine grain), discovering the limit above which there would be a risk of starvation, with the majority of nodes idle, and below which the overhead associated with splitting the problem may require more work and communication time than is useful.

  14. Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography

    NASA Astrophysics Data System (ADS)

    Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo

    2017-08-01

    We present an application of massively parallel processing to quantitative flow measurement data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150-fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements.
    Catalogue identifier: AFBT_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GNU GPLv3
    No. of lines in distributed program, including test data, etc.: 913552
    No. of bytes in distributed program, including test data, etc.: 270876249
    Distribution format: tar.gz
    Programming language: CUDA/C, MATLAB
    Computer: Intel x64 CPU, GPU supporting CUDA technology
    Operating system: 64-bit Windows 7 Professional
    Has the code been vectorized or parallelized?: Yes; CPU code has been vectorized in MATLAB, CUDA code has been parallelized
    RAM: Dependent on the user's parameters, typically between several gigabytes and several tens of gigabytes
    Classification: 6.5, 18
    Nature of problem: Speed-up of data processing in optical coherence microscopy
    Solution method: Utilization of GPU for massively parallel data processing
    Additional comments: Compiled DLL library with source code and documentation; example of utilization (MATLAB script with raw data)
    Running time: 1.8 s for one B-scan (150× faster in comparison to the CPU

  15. Parallel processing in a host plus multiple array processor system for radar

    NASA Technical Reports Server (NTRS)

    Barkan, B. Z.

    1983-01-01

    Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

  16. Future of Power Efficient Processing (BRIEFING CHARTS)

    DTIC Science & Technology

    2007-03-07

    [Briefing charts; the slide text is only partially recoverable. Legible fragments mention 65-nm CMOS (TI fab, Dr. Dennis Buss), a plot of module heat flux (W/cm²) versus year suggesting a new transistor paradigm, the decomposition Power = Dynamic + Static, power management via sub-threshold operation, parallelism and 3D integration ('ESE', '3D-IC'), and ultra-low-power CMOS with tunable thresholds and steeper sub-threshold slopes.]

  17. An automated workflow for parallel processing of large multiview SPIM recordings

    PubMed Central

    Schmied, Christopher; Steinbach, Peter; Pietzsch, Tobias; Preibisch, Stephan; Tomancak, Pavel

    2016-01-01

    Summary: Selective Plane Illumination Microscopy (SPIM) allows imaging of developing organisms in 3D at unprecedented temporal resolution over long periods of time. The resulting massive amounts of raw image data require extensive processing, currently performed interactively via dedicated graphical user interface (GUI) applications. The consecutive processing steps can be easily automated, and the individual time points can be processed independently, which lends itself to trivial parallelization on a high performance computing (HPC) cluster. Here, we introduce an automated workflow for processing large multiview, multichannel, multiillumination time-lapse SPIM data on a single workstation or in parallel on an HPC cluster. The pipeline relies on snakemake to resolve dependencies among consecutive processing steps and can be easily adapted to any cluster environment for processing SPIM data in a fraction of the time required to collect it. Availability and implementation: The code is distributed free and open source under the MIT license http://opensource.org/licenses/MIT. The source code can be downloaded from github: https://github.com/mpicbg-scicomp/snakemake-workflows. Documentation can be found here: http://fiji.sc/Automated_workflow_for_parallel_Multiview_Reconstruction. Contact: schmied@mpi-cbg.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26628585

  18. pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs

    SciTech Connect

    Wu, Changjun; Kalyanaraman, Anantharaman; Cannon, William R.

    2012-09-15

    Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for detecting homology on large data sets using distributed-memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and the producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048-processor distributed-memory cluster for a wide range of inputs ranging from as small as 20,000 sequences to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.
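
    The producer-consumer skeleton underlying such designs can be sketched with the standard library: a producer enumerates candidate sequence pairs while a pool of consumers scores them. The scoring function below is a toy stand-in for a real alignment kernel such as Smith-Waterman, and the threshold is arbitrary:

        from multiprocessing import Process, Queue

        def toy_score(a, b):
            """Count matching characters at equal offsets (NOT a real aligner)."""
            return sum(x == y for x, y in zip(a, b))

        def consumer(tasks, results):
            while True:
                pair = tasks.get()
                if pair is None:                 # poison pill: no more work
                    break
                (i, a), (j, b) = pair
                s = toy_score(a, b)
                if s >= 3:                       # homology threshold (assumed)
                    results.put((i, j, s))

        if __name__ == "__main__":
            seqs = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC", "ACGTACGA"]
            tasks, results = Queue(), Queue()
            workers = [Process(target=consumer, args=(tasks, results))
                       for _ in range(3)]
            for w in workers:
                w.start()
            for i in range(len(seqs)):           # producer: all unordered pairs
                for j in range(i + 1, len(seqs)):
                    tasks.put(((i, seqs[i]), (j, seqs[j])))
            for _ in workers:
                tasks.put(None)
            for w in workers:
                w.join()
            while not results.empty():
                print("edge:", results.get())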

  19. Realizing high photovoltaic efficiency with parallel multijunction solar cells based on spectrum-splitting and -concentrating diffractive optical element

    NASA Astrophysics Data System (ADS)

    Wang, Jin-Ze; Huang, Qing-Li; Xu, Xin; Quan, Bao-Gang; Luo, Jian-Heng; Zhang, Yan; Ye, Jia-Sheng; Li, Dong-Mei; Meng, Qing-Bo; Yang, Guo-Zhen

    2015-05-01

    Based on the facts that multijunction solar cells can increase the efficiency and concentration can reduce the cost dramatically, a special design of parallel multijunction solar cells was presented. The design employed a diffractive optical element (DOE) to split and concentrate the sunlight. A rainbow region and a zero-order diffraction region were generated on the output plane where solar cells with corresponding band gaps were placed. An analytical expression of the light intensity distribution on the output plane of the special DOE was deduced, and the limiting photovoltaic efficiency of such parallel multijunction solar cells was obtained based on Shockley-Queisser’s theory. An efficiency exceeding the Shockley-Queisser limit (33%) can be expected using multijunction solar cells consisting of separately fabricated subcells. The results provide an important alternative approach to realize high photovoltaic efficiency without the need for expensive epitaxial technology widely used in tandem solar cells, thus stimulating the research and application of high efficiency and low cost solar cells. Project supported by the National Natural Science Foundation of China (Grant Nos. 91233202, 21173260, and 51072221) and the National Basic Research Program of China (Grant No. 2012CB932903).

  20. XPP A High Performance Parallel Signal Processing Platform for Space Applications

    NASA Astrophysics Data System (ADS)

    Schueler, E.; Syed, M.; Helfers, T.

    This document presents the eXtreme Processing Platform (XPP), a new runtime-reconfigurable data processing technology developed by PACT GmbH which combines the performance of ASICs with the flexibility of DSPs. The XPP is built using a scalable array of arithmetic processing elements, embedded memories, high-bandwidth I/O, an auto-synchronizing packet-oriented data network, and an internal event network that enables control of program flow using execution flags, and it is designed to support different types of parallelism, such as multithreading, multitasking and multiple parallel instances. The technology promises to provide highly flexible payloads for future telecommunication satellites and scientific missions, and a short time to market, much needed to cope with the significant changes in the telecommunication market and rapidly changing customer needs.

  1. Application of parallel computing to seismic damage process simulation of an arch dam

    NASA Astrophysics Data System (ADS)

    Zhong, Hong; Lin, Gao; Li, Jianbo

    2010-06-01

    The simulation of the damage process of a high arch dam subjected to strong earthquake shocks is significant for the evaluation of its performance and seismic safety, considering the catastrophic effect of dam failure. However, such numerical simulation requires rigorous computational capacity. Conventional serial computing falls short of that, and parallel computing is a fairly promising solution to this problem. The parallel finite element code PDPAD was developed for the damage prediction of arch dams, utilizing a damage model that takes the heterogeneity of concrete into account. Developed with the programming language Fortran, the code uses a master/slave mode for programming, the domain decomposition method for allocation of tasks, MPI (Message Passing Interface) for communication, and solvers from the AZTEC library for the solution of large-scale equations. A speedup test showed that the performance of PDPAD was quite satisfactory. The code was employed to study the damage process of an arch dam under construction on a 4-node PC cluster, with more than one million degrees of freedom considered. The obtained damage mode was quite similar to that of a shaking table test, indicating that the proposed procedure and the parallel code PDPAD have good potential for simulating the seismic damage mode of arch dams. With the rapidly growing need for massive computation emerging from engineering problems, parallel computing will find more and more applications in pertinent areas.

  2. cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing.

    PubMed

    Takeuchi, Toshiki; Yamada, Atsuo; Aoki, Takashi; Nishimura, Kunihiro

    2016-01-01

    Next-generation sequencing can determine DNA bases, and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format or its compressed binary version (BAM). SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and executes quickly. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. Given the accumulation of next-generation sequencing data, a simple parallelization program that can support cloud and PC cluster environments is required. We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.
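
    The chunked parallelism described above can be imitated in a few lines of Python (rather than Clojure): SAM records are split into blocks and a process pool counts mapped reads, i.e., records whose FLAG field (SAM column 2) lacks the 0x4 "segment unmapped" bit, in each block independently:

        from multiprocessing import Pool

        def count_mapped(lines):
            n = 0
            for line in lines:
                if line.startswith("@"):         # header line
                    continue
                flag = int(line.split("\t")[1])  # SAM column 2 is the bitwise FLAG
                if not flag & 0x4:               # 0x4 = segment unmapped
                    n += 1
            return n

        if __name__ == "__main__":
            sam = ["@HD\tVN:1.6",
                   "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\t*",
                   "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\t*",
                   "r3\t16\tchr2\t200\t60\t8M\t*\t0\t0\tACGTACGT\t*"]
            chunks = [sam[i::2] for i in range(2)]   # toy 2-way split
            with Pool(2) as pool:
                print("mapped reads:", sum(pool.map(count_mapped, chunks)))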

  3. ISP: an optimal out-of-core image-set processing streaming architecture for parallel heterogeneous systems.

    PubMed

    Ha, Linh Khanh; Krüger, Jens; Dihl Comba, João Luiz; Silva, Cláudio T; Joshi, Sarang

    2012-06-01

    Image population analysis is the class of statistical methods that plays a central role in understanding the development, evolution, and disease of a population. However, these techniques often require excessive computational power and memory that are compounded with a large number of volumetric inputs. Restricted access to supercomputing power limits its influence in general research and practical applications. In this paper we introduce ISP, an Image-Set Processing streaming framework that harnesses the processing power of commodity heterogeneous CPU/GPU systems and attempts to solve this computational problem. In ISP, we introduce specially designed streaming algorithms and data structures that provide an optimal solution for out-of-core multiimage processing problems both in terms of memory usage and computational efficiency. ISP makes use of the asynchronous execution mechanism supported by parallel heterogeneous systems to efficiently hide the inherent latency of the processing pipeline of out-of-core approaches. Consequently, with computationally intensive problems, the ISP out-of-core solution can achieve the same performance as the in-core solution. We demonstrate the efficiency of the ISP framework on synthetic and real datasets.

  4. Parallel and Space-Efficient Construction of Burrows-Wheeler Transform and Suffix Array for Big Genome Data.

    PubMed

    Liu, Yongchao; Hankeln, Thomas; Schmidt, Bertil

    2016-01-01

    Next-generation sequencing technologies have led to the sequencing of more and more genomes, propelling related research into the era of big data. In this paper, we present ParaBWT, a parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data. In ParaBWT, we have investigated a progressive construction approach to constructing the BWT of single genome sequences in linear space complexity, but with a small constant factor. This approach has been further parallelized using multi-threading based on a master-slave coprocessing model. Once the BWT has been obtained, the suffix array is constructed in a memory-efficient manner. The performance of ParaBWT has been evaluated using two sequences generated from two human genome assemblies: the Ensembl Homo sapiens assembly and the human reference genome. Our performance comparison to FMD-index and Bwt-disk reveals that on 12 CPU cores, ParaBWT runs up to 2.2× faster than FMD-index and up to 99.0× faster than Bwt-disk. BWT construction algorithms for very long genomic sequences are time consuming and (due to their incremental nature) inherently difficult to parallelize. Thus, their parallelization is challenging, and even relatively small speedups like those of our method over FMD-index are of high importance to research. ParaBWT is written in C++, and is freely available at http://parabwt.sourceforge.net.
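
    For reference, what a BWT and suffix array construction computes can be shown with a naive quadratic-time sketch; this is nothing like ParaBWT's linear-space progressive algorithm, just a precise statement of the output:

        def bwt(text):
            s = text + "$"                               # unique terminator
            sa = sorted(range(len(s)), key=lambda i: s[i:])
            return sa, "".join(s[i - 1] for i in sa)     # char preceding each suffix

        sa, transformed = bwt("banana")
        print(sa)            # suffix array of "banana$": [6, 5, 3, 1, 0, 4, 2]
        print(transformed)   # -> "annb$aa"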

  5. Parallel Processing of Big Point Clouds Using Z-Order Partitioning

    NASA Astrophysics Data System (ADS)

    Alis, C.; Boehm, J.; Liu, K.

    2016-06-01

    As laser scanning technology improves and costs are coming down, the amount of point cloud data being generated can be prohibitively difficult and expensive to process on a single machine. This data explosion is not only limited to point cloud data. Voluminous amounts of high-dimensionality and quickly accumulating data, collectively known as Big Data, such as those generated by social media, Internet of Things devices and commercial transactions, are becoming more prevalent as well. New computing paradigms and frameworks are being developed to efficiently handle the processing of Big Data, many of which utilize a compute cluster composed of several commodity grade machines to process chunks of data in parallel. A central concept in many of these frameworks is data locality. By its nature, Big Data is large enough that the entire dataset would not fit on the memory and hard drives of a single node hence replicating the entire dataset to each worker node is impractical. The data must then be partitioned across worker nodes in a manner that minimises data transfer across the network. This is a challenge for point cloud data because there exist different ways to partition data and they may require data transfer. We propose a partitioning based on Z-order which is a form of locality-sensitive hashing. The Z-order or Morton code is computed by dividing each dimension to form a grid then interleaving the binary representation of each dimension. For example, the Z-order code for the grid square with coordinates (x = 1 = 01₂, y = 3 = 11₂) is 1011₂ = 11. The number of points in each partition is controlled by the number of bits per dimension: the more bits, the fewer the points. The number of bits per dimension also controls the level of detail with more bits yielding finer partitioning. We present this partitioning method by implementing it on Apache Spark and investigating how different parameters affect the accuracy and running time of the k nearest neighbour algorithm
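
    The interleaving itself is a few lines of code. The sketch below reproduces the example from the abstract, taking y as the higher bit of each interleaved pair:

        def morton2d(x, y, bits):
            """Interleave the bits of x and y into a Z-order (Morton) code."""
            code = 0
            for b in range(bits):
                code |= ((x >> b) & 1) << (2 * b)        # x supplies the even bits
                code |= ((y >> b) & 1) << (2 * b + 1)    # y supplies the odd bits
            return code

        print(morton2d(1, 3, bits=2))   # -> 11, i.e. 0b1011, as in the example above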

  6. Library preparation and multiplex capture for massive parallel sequencing applications made efficient and easy.

    PubMed

    Neiman, Mårten; Sundling, Simon; Grönberg, Henrik; Hall, Per; Czene, Kamila; Lindberg, Johan; Klevebring, Daniel

    2012-01-01

    During recent years, the rapid development of sequencing technologies and a competitive market have enabled researchers to perform massive sequencing projects at a reasonable cost. As the price for the actual sequencing reactions drops, enabling more samples to be sequenced, the relative price for preparing libraries gets larger and the practical laboratory work becomes complex and tedious. We present a cost-effective strategy for simplified library preparation compatible with both whole-genome and targeted sequencing experiments. An optimized enzyme composition and reaction buffer reduces the number of required clean-up steps and allows for the use of bulk enzymes, which makes the whole process cheap, efficient and simple. We also present a two-tagging strategy, which allows for multiplex sequencing of targeted regions. To prove our concept, we have prepared libraries for low-pass sequencing from 100 ng DNA, performed 2-, 4- and 8-plex exome capture and a 96-plex capture of a 500 kb region. In all samples we see a high concordance (>99.4%) of SNP calls when compared to commercially available SNP-chip platforms.

  7. Efficient Process for Making Polycrystalline Silicon

    NASA Technical Reports Server (NTRS)

    Mccormick, J. R.; Plahutnik, F. JR.; Sawyer, D. H.; Arvidson, A. N.; Goldfarb, S. M.

    1985-01-01

    Solar cells made with lower capital and operating costs. Process based on chemical-vapor deposition (CVD) of dichlorosilane produces high-grade polycrystalline silicon for solar cells. Process has potential as cost-effective replacement for CVD of trichlorosilane.

  8. Serial processing in reading aloud: no challenge for a parallel model.

    PubMed

    Zorzi, M

    2000-04-01

    K. Rastle and M. Coltheart (1999) challenged parallel models of reading by showing that the cost of irregularity in low-frequency exception words was modulated by the position of the irregularity in the word. This position-of-irregularity effect was taken as strong evidence of serial processing in reading. This article refutes Rastle and Coltheart's theoretical conclusions in 3 ways: First, a parallel model, the connectionist dual process model (M. Zorzi, G. Houghton, & B. Butterworth, 1998b), produces a position-of-irregularity effect. Second, the supposed serial effect can be reduced to a position-specific grapheme-phoneme consistency effect. Third, the position-of-irregularity effect vanishes when the experimental data are reanalyzed using grapheme-phoneme consistency as the covariate. This demonstration has broader implications for studies aiming at adjudicating between models: Strong inferences should be avoided until the computational models are actually tested.

  9. Serial and parallel processes in eye movement control: current controversies and future directions.

    PubMed

    Murray, Wayne S; Fischer, Martin H; Tatler, Benjamin W

    2013-01-01

    In this editorial for the special issue on serial and parallel processing in reading we explore the background to the current debate concerning whether the word recognition processes in reading are strictly serial-sequential or take place in an overlapping parallel fashion. We consider the history of the controversy and some of the underlying assumptions, together with an analysis of the types of evidence and arguments that have been adduced to both sides of the debate, concluding that both accounts necessarily presuppose some weakening of, or elasticity in, the eye-mind assumption. We then consider future directions, both for reading research and for scene viewing, and wrap up the editorial with a brief overview of the following articles and their conclusions.

  10. Timescale-invariant pattern recognition by feedforward inhibition and parallel signal processing.

    PubMed

    Creutzig, Felix; Benda, Jan; Wohlgemuth, Sandra; Stumpner, Andreas; Ronacher, Bernhard; Herz, Andreas V M

    2010-06-01

    The timescale-invariant recognition of temporal stimulus sequences is vital for many species and poses a challenge for their sensory systems. Here we present a simple mechanistic model to address this computational task, based on recent observations in insects that use rhythmic acoustic communication signals for mate finding. In the model framework, feedforward inhibition leads to burst-like response patterns in one neuron of the circuit. Integrating these responses over a fixed time window by a readout neuron creates a timescale-invariant stimulus representation. Only two additional processing channels, each with a feature detector and a readout neuron, plus one final coincidence detector for all three parallel signal streams, are needed to account for the behavioral data. In contrast to previous solutions to the general time-warp problem, no time delay lines or sophisticated neural architectures are required. Our results suggest a new computational role for feedforward inhibition and underscore the power of parallel signal processing.

  11. The Masterson Approach with play therapy: a parallel process between mother and child.

    PubMed

    Mulherin, M A

    2001-01-01

    This paper discusses a case in which the Masterson Approach was used with play therapy to treat a child with a developing personality disorder. It describes the parallel progression of the child and mother in adjunct therapy throughout a six-year period. The unique value of the Masterson Approach is that it provides the therapist with a framework and tool to diagnose and treat a child during the dynamic process of play. The case describes the mother-child dyad throughout therapy. It traces their parallel processes that involve separation, individuation, rapprochement, and the recovery of real self-capacities. Each stage of treatment is described, including verbal interventions. The child's internal affective state and intrapsychic structure during the various stages of treatment are illustrated by representative pictures.

  12. Parallel, multi-stage processing of colors, faces and shapes in macaque inferior temporal cortex

    PubMed Central

    Lafer-Sousa, Rosa; Conway, Bevil R.

    2014-01-01

    Visual-object processing culminates in inferior temporal (IT) cortex. To assess the organization of IT, we measured fMRI responses in alert monkey to achromatic images (faces, fruit, bodies, places) and colored gratings. IT contained multiple color-biased regions, which were typically ventral to face patches and, remarkably, yoked to them, spaced regularly at four locations predicted by known anatomy. Color and face selectivity increased for more anterior regions, indicative of a broad hierarchical arrangement. Responses to non-face shapes were found across IT, but were stronger outside color-biased regions and face patches, consistent with multiple parallel streams. IT also contained multiple coarse eccentricity maps: face patches overlapped central representations; color-biased regions spanned mid-peripheral representations; and place-biased regions overlapped peripheral representations. These results suggest that IT comprises parallel, multi-stage processing networks subject to one organizing principle. PMID:24141314

  13. Efficient high-throughput biological process characterization: Definitive screening design with the ambr250 bioreactor system.

    PubMed

    Tai, Mitchell; Ly, Amanda; Leung, Inne; Nayar, Gautam

    2015-01-01

    The burgeoning pipeline for new biologic drugs has increased the need for high-throughput process characterization to efficiently use process development resources. Breakthroughs in highly automated and parallelized upstream process development have led to technologies such as the 250-mL automated mini bioreactor (ambr250™) system. Furthermore, developments in modern design of experiments (DoE) have promoted the use of definitive screening design (DSD) as an efficient method to combine factor screening and characterization. Here we utilize the 24-bioreactor ambr250™ system with 10-factor DSD to demonstrate a systematic experimental workflow to efficiently characterize an Escherichia coli (E. coli) fermentation process for recombinant protein production. The generated process model is further validated by laboratory-scale experiments and shows how the strategy is useful for quality by design (QbD) approaches to control strategies for late-stage characterization. © 2015 American Institute of Chemical Engineers.

  14. Graphics processing unit parallel accelerated solution of the discrete ordinates for photon transport in biological tissues.

    PubMed

    Peng, Kuan; Gao, Xinbo; Qu, Xiaochao; Ren, Nunu; Chen, Xueli; He, Xiaowei; Wang, Xiaorei; Liang, Jimin; Tian, Jie

    2011-07-20

    As a widely used numerical solution for the radiation transport equation (RTE), the discrete ordinates can predict the propagation of photons through biological tissues more accurately than the diffusion equation. The discrete ordinates reduce the RTE to a series of differential equations that can be solved by source iteration (SI). However, the tremendous time consumption of SI, which is partly caused by the expensive computation of each SI step, limits its applications. In this paper, we present a graphics processing unit (GPU) parallel accelerated SI method for discrete ordinates. Utilizing the calculation independence on the levels of the discrete ordinate equation and the spatial element, the proposed method reduces the time cost of each SI step by parallel calculation. The photon reflection at the boundary was calculated based on the results of the last SI step to ensure the calculation independence on the level of the discrete ordinate equation. An element sweeping strategy was proposed to detect the calculation independence on the level of the spatial element. A GPU parallel framework, the compute unified device architecture (CUDA), was employed to carry out the parallel computation. The simulation experiments, which were carried out with a cylindrical phantom and a numerical mouse, indicated that the time cost of each SI step can be reduced by up to a factor of 228 by the proposed method with a GTX 260 graphics card. © 2011 Optical Society of America

  15. Parallelizing Serial Code for a Distributed Processing Environment with an Application to High Frequency Electromagnetic Scattering

    DTIC Science & Technology

    1991-12-01

    1,k) 2p(i+’/2,j+ /,k) 1+++ At •1(22) B(i +l/’j +/2’k)58 1+ 0.m(i +𔃼,j +/2,k) 2p(i+ 1A ,j+1/,k) ,[ ZEn(i+/2J+1’k)-E(i ’ ij k] where 5 is the lattice...1991. 19. Tipler , Paul A. Physics. New York: Worth Publishers Inc., 1976. 20. Work, Paul Rt and Gary B. Lamont. "Efficient Parallelization of Serial

  16. Queueing Network Models for Parallel Processing of Task Systems: an Operational Approach

    NASA Technical Reports Server (NTRS)

    Mak, Victor W. K.

    1986-01-01

    Computer performance modeling of possibly complex computations running on highly concurrent systems is considered. Earlier works in this area either dealt with a very simple program structure or resulted in methods with exponential complexity. An efficient procedure is developed to compute the performance measures for series-parallel-reducible task systems using queueing network models. The procedure is based on the concept of hierarchical decomposition and a new operational approach. Numerical results for three test cases are presented and compared to those of simulations.

  17. Modulator and VCSEL-MSM smart pixels for parallel pipeline networking and signal processing

    NASA Astrophysics Data System (ADS)

    Chen, C.-H.; Hoanca, Bogdan; Kuznia, C. B.; Pansatiankul, Dhawat E.; Zhang, Liping; Sawchuk, Alexander A.

    1999-07-01

    TRANslucent Smart Pixel Array (TRANSPAR) systems perform high-performance parallel pipeline networking and signal processing based on optical propagation of 3D data packets. The TRANSPAR smart pixel devices use either self-electro-optic effect GaAs multiple quantum well modulators or CMOS-VCSEL-MSM (CMOS-Vertical Cavity Surface Emitting Laser-Metal-Semiconductor-Metal) technology. The data packets transfer among high-throughput photonic network nodes using multiple access/collision detection or token-ring protocols.

  18. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    SciTech Connect

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen lively and fruitful progress over the past decade and remains robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  19. Some computational challenges of developing efficient parallel algorithms for data-dependent computations in thermal-hydraulics supercomputer applications

    SciTech Connect

    Woodruff, S.B.

    1992-01-01

    The Transient Reactor Analysis Code (TRAC), which features a two-fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local, the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data-parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers should consider algorithms, such as a neural net representation, that do not exhibit load-balancing problems.

  20. Parallel particle swarm optimization on a graphics processing unit with application to trajectory optimization

    NASA Astrophysics Data System (ADS)

    Wu, Q.; Xiong, F.; Wang, F.; Xiong, Y.

    2016-10-01

    In order to reduce the computational time, a fully parallel implementation of the particle swarm optimization (PSO) algorithm on a graphics processing unit (GPU) is presented. Instead of being executed on the central processing unit (CPU) sequentially, PSO is executed in parallel via the GPU on the compute unified device architecture (CUDA) platform. The processes of fitness evaluation, updating of velocity and position of all particles are all parallelized and introduced in detail. Comparative studies on the optimization of four benchmark functions and a trajectory optimization problem are conducted by running PSO on the GPU (GPU-PSO) and CPU (CPU-PSO). The impact of design dimension, number of particles and size of the thread-block in the GPU and their interactions on the computational time is investigated. The results show that the computational time of the developed GPU-PSO is much shorter than that of CPU-PSO, with comparable accuracy, which demonstrates the remarkable speed-up capability of GPU-PSO.
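
    The parallel formulation maps naturally onto array operations. The numpy sketch below updates the velocities and positions of all particles at once, the same data-parallel structure that a CUDA implementation distributes across thread blocks; the constants are typical textbook values and the benchmark is the sphere function:

        import numpy as np

        def pso(f, dim=10, n_particles=256, iters=500, w=0.7, c1=1.5, c2=1.5):
            rng = np.random.default_rng(0)
            x = rng.uniform(-5, 5, (n_particles, dim))
            v = np.zeros_like(x)
            pbest, pbest_val = x.copy(), f(x)
            g = pbest[pbest_val.argmin()].copy()         # global best position
            for _ in range(iters):
                r1, r2 = rng.random(x.shape), rng.random(x.shape)
                # velocity and position of ALL particles updated in one step
                v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
                x = x + v
                val = f(x)
                improved = val < pbest_val
                pbest[improved], pbest_val[improved] = x[improved], val[improved]
                g = pbest[pbest_val.argmin()].copy()
            return g, pbest_val.min()

        sphere = lambda x: (x ** 2).sum(axis=1)          # benchmark function
        g, best = pso(sphere)
        print("best value:", best)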