Sample records for bit-wise parallel algorithm

  1. Data communications for a collective operation in a parallel active messaging interface of a parallel computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Faraj, Daniel A.

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  2. Data communications for a collective operation in a parallel active messaging interface of a parallel computer

    DOEpatents

    Faraj, Daniel A

    2013-07-16

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
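
    The selection mechanism described in these two records pairs each registered algorithm with a bit mask and matches an instruction-derived mask against them. The sketch below illustrates that idea only; the attribute bits, thresholds, and function names are hypothetical and are not the PAMI API.

    ```python
    # Hypothetical sketch of mask-based algorithm selection (names are illustrative,
    # not the PAMI API): each registered algorithm carries a bit mask describing the
    # collective instructions it supports; the constructed instruction mask is matched
    # against those masks to pick an algorithm.

    # Illustrative attribute bits for a collective instruction.
    SMALL_MESSAGE   = 0b0001
    CONTIGUOUS_DATA = 0b0010
    SAME_CONTEXT    = 0b0100
    BROADCAST_OP    = 0b1000

    # (mask, algorithm name) associations registered in the messaging layer.
    ALGORITHMS = [
        (SMALL_MESSAGE | BROADCAST_OP,                   "short-message broadcast"),
        (CONTIGUOUS_DATA | BROADCAST_OP,                 "rendezvous broadcast"),
        (SMALL_MESSAGE | CONTIGUOUS_DATA | SAME_CONTEXT, "eager point-to-point"),
    ]

    def construct_mask(size_bytes, contiguous, same_context, is_broadcast):
        """Build a bit mask for a received collective instruction."""
        mask = 0
        if size_bytes <= 512:
            mask |= SMALL_MESSAGE
        if contiguous:
            mask |= CONTIGUOUS_DATA
        if same_context:
            mask |= SAME_CONTEXT
        if is_broadcast:
            mask |= BROADCAST_OP
        return mask

    def select_algorithm(mask):
        """Pick the first algorithm whose required mask bits are all present."""
        for required, name in ALGORITHMS:
            if mask & required == required:
                return name
        return "default algorithm"

    print(select_algorithm(construct_mask(256, True, False, True)))  # short-message broadcast
    ```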

  3. A gossip based information fusion protocol for distributed frequent itemset mining

    NASA Astrophysics Data System (ADS)

    Sohrabi, Mohammad Karim

    2018-07-01

    The computational complexity, huge memory requirement, and time-consuming nature of the frequent pattern mining process are the most important motivations for distributing and parallelizing this mining process. On the other hand, the emergence of distributed computational and operational environments, which leads to data being produced and maintained on different distributed data sources, makes the parallelization and distribution of the knowledge discovery process inevitable. In this paper, a gossip based distributed itemset mining (GDIM) algorithm is proposed to extract frequent itemsets, which are special types of frequent patterns, in a wireless sensor network environment. In this algorithm, local frequent itemsets of each sensor are extracted using a bit-wise horizontal approach (LHPM) from nodes clustered with a LEACH-based protocol. Cluster heads use a gossip based protocol to communicate with each other and find the patterns whose global support is equal to or greater than the specified support threshold. Experimental results show that the proposed algorithm outperforms the best existing gossip based algorithm in terms of execution time.
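
    The bit-wise horizontal representation mentioned above lends itself to a simple illustration: if each item's occurrences are stored as a bitmap over transactions, the support of an itemset is the popcount of the bit-wise AND of its bitmaps. The sketch below shows only this counting idea, not the LHPM/GDIM algorithms themselves.

    ```python
    # Illustrative sketch (not the LHPM/GDIM implementation): each item is represented
    # by a bitmap over transactions, so the support of an itemset is the popcount of
    # the bit-wise AND of its item bitmaps.

    transactions = [
        {"a", "b", "c"},
        {"a", "c"},
        {"b", "c"},
        {"a", "b", "c"},
    ]

    def item_bitmaps(transactions):
        """One integer bitmap per item; bit i is set if transaction i contains the item."""
        bitmaps = {}
        for i, t in enumerate(transactions):
            for item in t:
                bitmaps[item] = bitmaps.get(item, 0) | (1 << i)
        return bitmaps

    def support(itemset, bitmaps):
        """Number of transactions containing every item in the itemset."""
        acc = (1 << len(transactions)) - 1      # all-ones mask
        for item in itemset:
            acc &= bitmaps.get(item, 0)
        return bin(acc).count("1")              # popcount

    bm = item_bitmaps(transactions)
    print(support({"a", "c"}, bm))  # 3: transactions 0, 1 and 3
    ```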

  4. Real-time software receiver

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L. (Inventor); Kintner, Jr., Paul M. (Inventor); Ledvina, Brent M. (Inventor); Powell, Steven P. (Inventor)

    2007-01-01

    A real-time software receiver that executes on a general purpose processor. The software receiver includes data acquisition and correlator modules that perform, in place of hardware correlation, baseband mixing and PRN code correlation using bit-wise parallelism.

  5. Real-time software receiver

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L. (Inventor); Ledvina, Brent M. (Inventor); Powell, Steven P. (Inventor); Kintner, Jr., Paul M. (Inventor)

    2006-01-01

    A real-time software receiver that executes on a general purpose processor. The software receiver includes data acquisition and correlator modules that perform, in place of hardware correlation, baseband mixing and PRN code correlation using bit-wise parallelism.
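
    The bit-wise parallelism referred to in these two receiver records packs many one-bit samples and PRN code chips into machine words so that one XOR plus a population count correlates a whole word at once. The sketch below illustrates that idea with plain Python integers; it is not the patented receiver code.

    ```python
    # Minimal sketch of bit-wise parallel correlation (not the patented receiver code):
    # the sign bits of many 1-bit samples and the corresponding PRN code chips are packed
    # into a single integer, so one XOR plus a popcount correlates a whole word of samples.

    def pack_bits(bits):
        """Pack a list of 0/1 values into one integer, LSB first."""
        word = 0
        for i, b in enumerate(bits):
            word |= (b & 1) << i
        return word

    def correlate(sample_word, code_word, n):
        """Correlation of n one-bit samples with n PRN chips: agreements minus disagreements."""
        disagreements = bin((sample_word ^ code_word) & ((1 << n) - 1)).count("1")
        return n - 2 * disagreements

    samples = [1, 0, 1, 1, 0, 0, 1, 0]   # sign bits of the mixed baseband signal
    prn     = [1, 0, 1, 0, 0, 1, 1, 0]   # replica PRN code chips
    print(correlate(pack_bits(samples), pack_bits(prn), len(samples)))  # 4 (6 agree, 2 disagree)
    ```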

  6. A DNA-based semantic fusion model for remote sensing data.

    PubMed

    Sun, Heng; Weng, Jian; Yu, Guangchuang; Massawe, Richard H

    2013-01-01

    Semantic technology plays a key role in various domains, from conversation understanding to algorithm analysis. As the most efficient semantic tool, ontology can represent, process and manage the widespread knowledge. Nowadays, many researchers use ontology to collect and organize data's semantic information in order to maximize research productivity. In this paper, we first describe our work on the development of a remote sensing data ontology, with a primary focus on semantic fusion-driven research for big data. Our ontology is made up of 1,264 concepts and 2,030 semantic relationships. However, the growth of big data is straining the capacities of current semantic fusion and reasoning practices. Considering the massive parallelism of DNA strands, we propose a novel DNA-based semantic fusion model. In this model, a parallel strategy is developed to encode the semantic information in DNA for a large volume of remote sensing data. The semantic information is read in a parallel and bit-wise manner and an individual bit is converted to a base. By doing so, a considerable amount of conversion time can be saved, i.e., the cluster-based multi-process program can reduce the conversion time from 81,536 seconds to 4,937 seconds for 4.34 GB source data files. Moreover, the size of the result file recording DNA sequences is 54.51 GB for the parallel C program compared with 57.89 GB for the sequential Perl program. This shows that our parallel method can also reduce the DNA synthesis cost. In addition, data types are encoded in our model, which is a basis for building a type system in our future DNA computer. Finally, we describe theoretically an algorithm for DNA-based semantic fusion. This algorithm enables the process of integration of the knowledge from disparate remote sensing data sources into a consistent, accurate, and complete representation. This process depends solely on ligation reactions and screening operations instead of the ontology.

  7. A DNA-Based Semantic Fusion Model for Remote Sensing Data

    PubMed Central

    Sun, Heng; Weng, Jian; Yu, Guangchuang; Massawe, Richard H.

    2013-01-01

    Semantic technology plays a key role in various domains, from conversation understanding to algorithm analysis. As the most efficient semantic tool, ontology can represent, process and manage the widespread knowledge. Nowadays, many researchers use ontology to collect and organize data's semantic information in order to maximize research productivity. In this paper, we first describe our work on the development of a remote sensing data ontology, with a primary focus on semantic fusion-driven research for big data. Our ontology is made up of 1,264 concepts and 2,030 semantic relationships. However, the growth of big data is straining the capacities of current semantic fusion and reasoning practices. Considering the massive parallelism of DNA strands, we propose a novel DNA-based semantic fusion model. In this model, a parallel strategy is developed to encode the semantic information in DNA for a large volume of remote sensing data. The semantic information is read in a parallel and bit-wise manner and an individual bit is converted to a base. By doing so, a considerable amount of conversion time can be saved, i.e., the cluster-based multi-process program can reduce the conversion time from 81,536 seconds to 4,937 seconds for 4.34 GB source data files. Moreover, the size of the result file recording DNA sequences is 54.51 GB for the parallel C program compared with 57.89 GB for the sequential Perl program. This shows that our parallel method can also reduce the DNA synthesis cost. In addition, data types are encoded in our model, which is a basis for building a type system in our future DNA computer. Finally, we describe theoretically an algorithm for DNA-based semantic fusion. This algorithm enables the process of integration of the knowledge from disparate remote sensing data sources into a consistent, accurate, and complete representation. This process depends solely on ligation reactions and screening operations instead of the ontology. PMID:24116207
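
    The abstract states that the semantic information is read bit-wise and each individual bit is converted to a base. A minimal sketch of that conversion step follows; the 0→'A'/1→'T' mapping and the multiprocessing layout are illustrative assumptions, not the authors' encoding.

    ```python
    # Hedged sketch of the bit-to-base conversion step described above. The actual
    # base alphabet assignment used by the authors is not given here; the 0->'A',
    # 1->'T' mapping and the multiprocessing layout are illustrative assumptions.

    from multiprocessing import Pool

    BIT_TO_BASE = {"0": "A", "1": "T"}   # assumed mapping: one bit becomes one base

    def encode_chunk(bit_chunk):
        """Convert one chunk of semantic bits to a DNA fragment."""
        return "".join(BIT_TO_BASE[b] for b in bit_chunk)

    def encode_parallel(bit_string, workers=4):
        """Split the bit string into chunks and encode them in parallel."""
        size = max(1, len(bit_string) // workers)
        chunks = [bit_string[i:i + size] for i in range(0, len(bit_string), size)]
        with Pool(workers) as pool:
            return "".join(pool.map(encode_chunk, chunks))

    if __name__ == "__main__":
        print(encode_parallel("0110100111010010"))
    ```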

  8. Bit-parallel arithmetic in a massively-parallel associative processor

    NASA Technical Reports Server (NTRS)

    Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.

    1992-01-01

    A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m exp 2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.

  9. A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.

    PubMed

    Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F

    2018-03-01

    Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms already perform poorly with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute the final max-tree efficiently and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times improve on the fastest sequential algorithm, and speed-up continues to increase up to 64 threads.

  10. Supercomputing on massively parallel bit-serial architectures

    NASA Technical Reports Server (NTRS)

    Iobst, Ken

    1985-01-01

    Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.

  11. Bit-Wise Arithmetic Coding For Compression Of Data

    NASA Technical Reports Server (NTRS)

    Kiely, Aaron

    1996-01-01

    Bit-wise arithmetic coding is data-compression scheme intended especially for use with uniformly quantized data from source with Gaussian, Laplacian, or similar probability distribution function. Code words of fixed length, and bits treated as being independent. Scheme serves as means of progressive transmission or of overcoming buffer-overflow or rate constraint limitations sometimes arising when data compression used.

  12. Low complexity 1D IDCT for 16-bit parallel architectures

    NASA Astrophysics Data System (ADS)

    Bivolarski, Lazar

    2007-09-01

    This paper shows that, using the Loeffler, Ligtenberg, and Moschytz factorization of the 8-point one-dimensional (1-D) IDCT [2] as a fast approximation of the Discrete Cosine Transform (DCT) and using only 16-bit numbers, it is possible to create an IEEE 1180-1990 compliant, multiplierless algorithm with low computational complexity. Owing to its structure, the algorithm can be implemented efficiently on parallel high-performance architectures, and its low complexity also makes it suitable for a wide range of other architectures. An additional constraint on this work was the requirement of compliance with the existing MPEG standards. Low hardware implementation complexity and resource usage were also part of the design criteria for this algorithm. This implementation is also compliant with the precision requirements described in the MPEG IDCT precision specification ISO/IEC 23002-1. The complexity analysis extends the simple measure of shifts and adds for the multiplierless algorithm: additional operations are included in the complexity measure to better describe the actual transform implementation complexity.
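
    A toy illustration of the multiplierless, shift-and-add style of fixed-point arithmetic this record refers to is given below. The constant and its decomposition are illustrative only and are not taken from the Loeffler-Ligtenberg-Moschytz factorization; intermediate word widths are not modeled.

    ```python
    # Toy illustration of multiplierless fixed-point arithmetic of the kind the record
    # refers to: a constant multiplication is replaced by shifts and adds, the core trick
    # behind 16-bit multiplierless transforms (intermediate word widths are not modeled,
    # and the constant is not one of the actual IDCT coefficients).

    def mul_362_shift_add(x):
        """Multiply by 362 (~= 0.7071 * 512) using shifts/adds only: 362 = 256 + 64 + 32 + 8 + 2."""
        return (x << 8) + (x << 6) + (x << 5) + (x << 3) + (x << 1)

    def scaled_by_inv_sqrt2(x):
        """Approximate x / sqrt(2) in fixed point: (x * 362) >> 9."""
        return mul_362_shift_add(x) >> 9

    print(scaled_by_inv_sqrt2(1000))   # ~707
    ```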

  13. TREATMENT INTENSIFICATION WITH INSULIN DEGLUDEC/INSULIN ASPART TWICE DAILY: RANDOMIZED STUDY TO COMPARE SIMPLE AND STEP-WISE TITRATION ALGORITHMS.

    PubMed

    Gerety, Gregg; Bebakar, Wan Mohamad Wan; Chaykin, Louis; Ozkaya, Mesut; Macura, Stanislava; Hersløv, Malene Lundgren; Behnke, Thomas

    2016-05-01

    This 26-week, multicenter, randomized, open-label, parallel-group, treat-to-target trial in adults with type 2 diabetes compared the efficacy and safety of treatment intensification algorithms with twice-daily (BID) insulin degludec/insulin aspart (IDegAsp). Patients randomized 1:1 to IDegAsp BID used either a 'Simple' algorithm (twice-weekly dose adjustments based on a single prebreakfast and pre-evening meal self-monitored plasma glucose [SMPG] measurement; IDegAsp[BIDSimple], n = 136) or a 'Stepwise' algorithm (once-weekly dose adjustments based on the lowest of 3 pre-breakfast and 3 pre-evening meal SMPG values; IDegAsp[BIDStep-wise], n = 136). After 26 weeks, mean change from baseline in glycated hemoglobin (HbA1c) with IDegAsp[BIDSimple] was noninferior to IDegAsp[BIDStep-wise] (-15 mmol/mol versus -14 mmol/mol; 95% confidence interval [CI] upper limit, <4 mmol/mol) (baseline HbA1c: 66.3 mmol/mol IDegAsp[BIDSimple] and 66.6 mmol/mol IDegAsp[BIDStep-wise]). The proportion of patients who achieved HbA1c <7.0% (<53 mmol/mol) at the end of the trial was 66.9% with IDegAsp[BIDSimple] and 62.5% with IDegAsp[BIDStep-wise]. Fasting plasma glucose levels were reduced with each titration algorithm (-1.51 mmol/L IDegAsp[BIDSimple] versus -1.95 mmol/L IDegAsp[BIDStep-wise]). Weight gain was 3.8 kg IDegAsp[BIDSimple] versus 2.6 kg IDegAsp[BIDStep-wise], and rates of overall confirmed hypoglycemia (5.16 episodes per patient-year of exposure [PYE] versus 8.93 PYE) and nocturnal confirmed hypoglycemia (0.78 PYE versus 1.33 PYE) were significantly lower with IDegAsp[BIDStep-wise] versus IDegAsp[BIDSimple]. There were no significant differences in insulin dose increments between groups. Treatment intensification with IDegAsp[BIDSimple] was noninferior to IDegAsp[BIDStep-wise]. Both titration algorithms were well tolerated; however, the more conservative step-wise algorithm led to less weight gain and fewer hypoglycemic episodes. Clinicaltrials.gov: NCT01680341.

  14. A real-time implementation of an advanced sensor failure detection, isolation, and accommodation algorithm

    NASA Technical Reports Server (NTRS)

    Delaat, J. C.; Merrill, W. C.

    1983-01-01

    A sensor failure detection, isolation, and accommodation algorithm was developed which incorporates analytic sensor redundancy through software. This algorithm was implemented in a high level language on a microprocessor based controls computer. Parallel processing and state-of-the-art 16-bit microprocessors are used along with efficient programming practices to achieve real-time operation.

  15. High-Speed Systolic Array Testbed.

    DTIC Science & Technology

    1987-10-01

    applications since the concept was introduced by H.T. Kung in 1978. This highly parallel architecture of nearest neighbor data communication and...must be addressed. For instance, should bit-serial or bit-parallel computation be utilized? Does the dynamic range of the candidate applications or...numerical stability of the algorithms used require computations in fixed point and integer format or the architecturally more complex and slower floating

  16. Research on parallel combinatory spread spectrum communication system with double information matching

    NASA Astrophysics Data System (ADS)

    Xue, Wei; Wang, Qi; Wang, Tianyu

    2018-04-01

    This paper presents an improved parallel combinatory spread spectrum (PC/SS) communication system based on the method of double information matching (DIM). Compared with the conventional PC/SS system, the new model inherits the advantages of high transmission speed, large information capacity and high security. The traditional system, however, suffers from a high bit error rate (BER) caused by its data-sequence mapping algorithm. The presented model achieves lower BER and higher efficiency through its optimized mapping algorithm.

  17. Homology search with binary and trinary scoring matrices.

    PubMed

    Smith, Scott F

    2006-01-01

    Protein homology search can be accelerated with the use of bit-parallel algorithms in conjunction with constraints on the values contained in the scoring matrices. Trinary scoring matrices (containing only the values -1, 0, and 1) allow for significant acceleration without significant reduction in the receiver operating characteristic (ROC) score of a Smith-Waterman search. Binary scoring matrices (containing the values 0 and 1) result in some reduction in ROC score, but result in even more acceleration. Binary scoring matrices and five-bit saturating scores can be used for fast prefilters to the Smith-Waterman algorithm.

  18. Bit-wise arithmetic coding for data compression

    NASA Technical Reports Server (NTRS)

    Kiely, A. B.

    1994-01-01

    This article examines the problem of compressing a uniformly quantized independent and identically distributed (IID) source. We present a new compression technique, bit-wise arithmetic coding, that assigns fixed-length codewords to the quantizer output and uses arithmetic coding to compress the codewords, treating the codeword bits as independent. We examine the performance of this method and evaluate the overhead required when used block-adaptively. Simulation results are presented for Gaussian and Laplacian sources. This new technique could be used as the entropy coder in a transform or subband coding system.
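
    A small sketch of the statistical idea behind the scheme (fixed-length codewords whose bits are modeled independently) is given below: it estimates the per-bit-position probabilities an arithmetic coder would use and the resulting ideal code length. The Gaussian source, quantizer step, and 4-bit codewords are illustrative assumptions, not the paper's configuration.

    ```python
    # Sketch of the statistical idea behind bit-wise arithmetic coding: fixed-length
    # codewords are assigned to quantizer outputs, each bit position is modeled with its
    # own probability, and the ideal code length follows from treating bits as independent.
    # The Gaussian source, quantizer and 4-bit codewords below are illustrative choices.

    import math
    import random

    random.seed(0)
    BITS = 4

    def quantize(x, step=0.5):
        """Uniform quantizer clipped to a 4-bit index range (offset to be non-negative)."""
        q = int(round(x / step))
        return max(-(1 << (BITS - 1)), min((1 << (BITS - 1)) - 1, q)) + (1 << (BITS - 1))

    samples = [quantize(random.gauss(0.0, 1.0)) for _ in range(10000)]

    # Probability of a '1' in each bit position of the fixed-length codeword.
    p_one = [sum((s >> k) & 1 for s in samples) / len(samples) for k in range(BITS)]

    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    ideal_bits = sum(entropy(p) for p in p_one)
    print(f"per-position P(1): {[round(p, 3) for p in p_one]}")
    print(f"ideal bits/sample with independent bits: {ideal_bits:.2f} (vs {BITS} uncoded)")
    ```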

  19. Design and implementation of a hybrid MPI-CUDA model for the Smith-Waterman algorithm.

    PubMed

    Khaled, Heba; Faheem, Hossam El Deen Mostafa; El Gohary, Rania

    2015-01-01

    This paper provides a novel hybrid model for solving the multiple pair-wise sequence alignment problem combining message passing interface and CUDA, the parallel computing platform and programming model invented by NVIDIA. The proposed model targets homogeneous cluster nodes equipped with similar Graphical Processing Unit (GPU) cards. The model consists of the Master Node Dispatcher (MND) and the Worker GPU Nodes (WGN). The MND distributes the workload among the cluster working nodes and then aggregates the results. The WGN performs the multiple pair-wise sequence alignments using the Smith-Waterman algorithm. We also propose a modified implementation to the Smith-Waterman algorithm based on computing the alignment matrices row-wise. The experimental results demonstrate a considerable reduction in the running time by increasing the number of the working GPU nodes. The proposed model achieved a performance of about 12 Giga cell updates per second when we tested against the SWISS-PROT protein knowledge base running on four nodes.
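
    The row-wise organization of the alignment matrix mentioned above can be illustrated with a minimal pure-Python Smith-Waterman scoring routine; a linear gap penalty is assumed, and this is not the MPI-CUDA implementation.

    ```python
    # Minimal row-wise Smith-Waterman scoring sketch (not the MPI-CUDA implementation
    # described above): the alignment matrix is computed one row at a time, which is the
    # access pattern the modified implementation exploits. Linear gap penalty assumed.

    def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
        prev = [0] * (len(b) + 1)   # previous row of the DP matrix
        best = 0
        for i in range(1, len(a) + 1):
            curr = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                curr[j] = max(0,
                              prev[j - 1] + s,    # diagonal: align a[i-1] with b[j-1]
                              prev[j] + gap,      # gap in b
                              curr[j - 1] + gap)  # gap in a
                best = max(best, curr[j])
            prev = curr
        return best

    print(smith_waterman_score("GATTACA", "GCATGCA"))
    ```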

  20. Achievable Information Rates for Coded Modulation With Hard Decision Decoding for Coherent Fiber-Optic Systems

    NASA Astrophysics Data System (ADS)

    Sheikh, Alireza; Amat, Alexandre Graell i.; Liva, Gianluigi

    2017-12-01

    We analyze the achievable information rates (AIRs) for coded modulation schemes with QAM constellations with both bit-wise and symbol-wise decoders, corresponding to the case where a binary code is used in combination with a higher-order modulation using the bit-interleaved coded modulation (BICM) paradigm and to the case where a nonbinary code over a field matched to the constellation size is used, respectively. In particular, we consider hard decision decoding, which is the preferable option for fiber-optic communication systems where decoding complexity is a concern. Recently, Liga et al. analyzed the AIRs for bit-wise and symbol-wise decoders considering what the authors called a hard decision decoder which, however, exploits soft information of the transition probabilities of the discrete-input discrete-output channel resulting from the hard detection. As such, the complexity of the decoder is essentially the same as the complexity of a soft decision decoder. In this paper, we analyze instead the AIRs for the standard hard decision decoder, commonly used in practice, where the decoding is based on the Hamming distance metric. We show that if standard hard decision decoding is used, bit-wise decoders yield significantly higher AIRs than symbol-wise decoders. As a result, contrary to the conclusion by Liga et al., binary decoders together with the BICM paradigm are preferable for spectrally-efficient fiber-optic systems. We also design binary and nonbinary staircase codes and show that, in agreement with the AIRs, binary codes yield better performance.

  1. Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

    NASA Astrophysics Data System (ADS)

    Olson, Richard F.

    2013-05-01

    Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

  2. Scalable load balancing for massively parallel distributed Monte Carlo particle transport

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    O'Brien, M. J.; Brantley, P. S.; Joy, K. I.

    2013-07-01

    In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time O(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrence Livermore National Laboratory. (authors)
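
    The iterated processor-pair-wise balancing can be pictured as hypercube-style partner averaging: in round k each rank exchanges work with the rank that differs in bit k, so log2(N) rounds reach a global balance. The single-process simulation below illustrates this pattern only; it is not the Monte Carlo transport code.

    ```python
    # Illustrative single-process simulation of iterated processor-pair-wise balancing:
    # in round k, each processor averages its workload with the partner whose rank differs
    # in bit k, so after log2(N) rounds the workload is globally balanced. This sketch is
    # not the Monte Carlo transport implementation referenced above.

    def pairwise_balance(work):
        n = len(work)                      # assume n is a power of two
        rounds = n.bit_length() - 1
        for k in range(rounds):
            for rank in range(n):
                partner = rank ^ (1 << k)  # flip bit k to find this round's partner
                if rank < partner:
                    avg = (work[rank] + work[partner]) / 2.0
                    work[rank] = work[partner] = avg
        return work

    print(pairwise_balance([80.0, 0.0, 10.0, 30.0, 0.0, 0.0, 40.0, 0.0]))  # all 20.0
    ```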

  3. Implementation and analysis of a Navier-Stokes algorithm on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1988-01-01

    The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.

  4. Binary Interval Search: a scalable algorithm for counting interval intersections.

    PubMed

    Layer, Ryan M; Skadron, Kevin; Robins, Gabriel; Hall, Ira M; Quinlan, Aaron R

    2013-01-01

    The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. https://github.com/arq5x/bits.
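
    The counting idea behind BITS can be sketched compactly: with all interval starts and ends sorted separately, the number of database intervals intersecting a query equals the total minus those ending before the query start and those starting after the query end, each found by binary search. The sketch below uses closed intervals for simplicity and is not the authors' implementation.

    ```python
    # Minimal sketch of the counting idea behind BITS: sort all interval starts and ends
    # separately; the number of database intervals intersecting a query [s, e] is the total
    # count minus those that end before s and those that start after e, each found by
    # binary search. Closed intervals are assumed here for simplicity.

    from bisect import bisect_left, bisect_right

    def build(intervals):
        starts = sorted(s for s, _ in intervals)
        ends = sorted(e for _, e in intervals)
        return starts, ends

    def count_intersections(starts, ends, s, e):
        total = len(starts)
        end_before = bisect_left(ends, s)              # intervals ending strictly before s
        start_after = total - bisect_right(starts, e)  # intervals starting strictly after e
        return total - end_before - start_after

    db = [(1, 5), (3, 8), (10, 12), (7, 9)]
    starts, ends = build(db)
    print(count_intersections(starts, ends, 4, 7))  # 3: (1,5), (3,8) and (7,9) overlap [4,7]
    ```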

  5. Real-time multi-mode neutron multiplicity counter

    DOEpatents

    Rowland, Mark S; Alvarez, Raymond A

    2013-02-26

    Embodiments are directed to a digital data acquisition method that collects data regarding nuclear fission at high rates and performs real-time preprocessing of large volumes of data into directly useable forms for use in a system that performs non-destructive assaying of nuclear material and assemblies for mass and multiplication of special nuclear material (SNM). Pulses from a multi-detector array are fed in parallel to individual inputs that are tied to individual bits in a digital word. Data is collected by loading a word at the individual bit level in parallel, to reduce the latency associated with current shift-register systems. The word is read at regular intervals, all bits simultaneously, with no manipulation. The word is passed to a number of storage locations for subsequent processing, thereby removing the front-end problem of pulse pileup. The word is used simultaneously in several internal processing schemes that assemble the data in a number of more directly useable forms. The detector includes a multi-mode counter that executes a number of different count algorithms in parallel to determine different attributes of the count data.
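
    The word-level readout described in this patent record ties each detector channel to one bit of a digital word that is read at regular intervals. The sketch below illustrates how per-channel hit counts and per-interval multiplicities fall directly out of those bits; the channel count and sample words are illustrative.

    ```python
    # Sketch of the word-level readout idea (not the patented electronics): each detector
    # channel is tied to one bit of a digital word, the word is read at regular intervals,
    # and per-channel hit counts plus per-interval multiplicities are accumulated directly
    # from the bits.

    N_DETECTORS = 8

    def process_words(words):
        """Accumulate per-channel hits and the multiplicity (hits per read interval)."""
        per_channel = [0] * N_DETECTORS
        multiplicities = []
        for word in words:
            hits = 0
            for ch in range(N_DETECTORS):
                if (word >> ch) & 1:
                    per_channel[ch] += 1
                    hits += 1
            multiplicities.append(hits)
        return per_channel, multiplicities

    # Three read intervals: channels {0,3}, {}, {1,3,7} fire.
    readout = [0b00001001, 0b00000000, 0b10001010]
    print(process_words(readout))  # ([1, 1, 0, 2, 0, 0, 0, 1], [2, 0, 3])
    ```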

  6. A novel bit-wise adaptable entropy coding technique

    NASA Technical Reports Server (NTRS)

    Kiely, A.; Klimesh, M.

    2001-01-01

    We present a novel entropy coding technique which is adaptable in that each bit to be encoded may have an associated probability estimate which depends on previously encoded bits. The technique may have advantages over arithmetic coding. The technique can achieve arbitrarily small redundancy and admits a simple and fast decoder.

  7. Associative architecture for image processing

    NASA Astrophysics Data System (ADS)

    Adar, Rutie; Akerib, Avidan

    1997-09-01

    This article presents a new generation in parallel processing architecture for real-time image processing. The approach is implemented in a real-time image processor chip, called the Xium™-2, based on combining a fully associative array, which provides the parallel engine, with a serial RISC core on the same die. The architecture is fully programmable and can be programmed to implement a wide range of color image processing, computer vision and media processing functions in real time. The associative part of the chip is based on the patent-pending methodology of Associative Computing Ltd. (ACL), which condenses 2048 associative processors, each of 128 'intelligent' bits. Each bit can be a processing bit or a memory bit. At only 33 MHz, in a 0.6 micron manufacturing process, the chip has a computational power of 3 billion ALU operations per second and 66 billion string search operations per second. The fully programmable nature of the Xium™-2 chip enables developers to use ACL tools to write their own proprietary algorithms combined with existing image processing and analysis functions from ACL's extended set of libraries.

  8. Design of Belief Propagation Based on FPGA for the Multistereo CAFADIS Camera

    PubMed Central

    Magdaleno, Eduardo; Lüke, Jonás Philipp; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel

    2010-01-01

    In this paper we describe a fast, specialized hardware implementation of the belief propagation algorithm for the CAFADIS camera, a new plenoptic sensor patented by the University of La Laguna. This camera captures the lightfield of the scene and can be used to find out at which depth each pixel is in focus. The algorithm has been designed for FPGA devices using VHDL. We propose a parallel and pipelined architecture to implement the algorithm without external memory. Although the BRAM resources of the device increase considerably, we can maintain real-time restrictions by using extremely high-performance signal processing capability through parallelism and by accessing several memories simultaneously. Quantitative results with 16-bit precision show that the performance is very close to that of the original Matlab implementation of the algorithm. PMID:22163404

  9. Design of belief propagation based on FPGA for the multistereo CAFADIS camera.

    PubMed

    Magdaleno, Eduardo; Lüke, Jonás Philipp; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel

    2010-01-01

    In this paper we describe a fast, specialized hardware implementation of the belief propagation algorithm for the CAFADIS camera, a new plenoptic sensor patented by the University of La Laguna. This camera captures the lightfield of the scene and can be used to find out at which depth each pixel is in focus. The algorithm has been designed for FPGA devices using VHDL. We propose a parallel and pipelined architecture to implement the algorithm without external memory. Although the BRAM resources of the device increase considerably, we can maintain real-time restrictions by using extremely high-performance signal processing capability through parallelism and by accessing several memories simultaneously. Quantitative results with 16-bit precision show that the performance is very close to that of the original Matlab implementation of the algorithm.

  10. Solving the Cauchy-Riemann equations on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

    Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations, a set of coupled first-order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.

  11. Shor's quantum factoring algorithm on a photonic chip.

    PubMed

    Politi, Alberto; Matthews, Jonathan C F; O'Brien, Jeremy L

    2009-09-04

    Shor's quantum factoring algorithm finds the prime factors of a large number exponentially faster than any other known method, a task that lies at the heart of modern information security, particularly on the Internet. This algorithm requires a quantum computer, a device that harnesses the massive parallelism afforded by quantum superposition and entanglement of quantum bits (or qubits). We report the demonstration of a compiled version of Shor's algorithm on an integrated waveguide silica-on-silicon chip that guides four single-photon qubits through the computation to factor 15.

  12. DFT algorithms for bit-serial GaAs array processor architectures

    NASA Technical Reports Server (NTRS)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

  13. Binary Interval Search: a scalable algorithm for counting interval intersections

    PubMed Central

    Layer, Ryan M.; Skadron, Kevin; Robins, Gabriel; Hall, Ira M.; Quinlan, Aaron R.

    2013-01-01

    Motivation: The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. Results: We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. Availability: https://github.com/arq5x/bits. Contact: arq5x@virginia.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23129298

  14. Design and evaluation of an architecture for a digital signal processor for instrumentation applications

    NASA Astrophysics Data System (ADS)

    Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos

    1990-03-01

    The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.

  15. A novel parallel architecture for local histogram equalization

    NASA Astrophysics Data System (ADS)

    Ohannessian, Mesrob I.; Choueiter, Ghinwa F.; Diab, Hassan

    2005-07-01

    Local histogram equalization is an image enhancement algorithm that has found wide application in the pre-processing stage of areas such as computer vision, pattern recognition and medical imaging. The computationally intensive nature of the procedure, however, is a main limitation when real time interactive applications are in question. This work explores the possibility of performing parallel local histogram equalization, using an array of special purpose elementary processors, through an HDL implementation that targets FPGA or ASIC platforms. A novel parallelization scheme is presented and the corresponding architecture is derived. The algorithm is reduced to pixel-level operations. Processing elements are assigned image blocks, to maintain a reasonable performance-cost ratio. To further simplify both processor and memory organizations, a bit-serial access scheme is used. A brief performance assessment is provided to illustrate and quantify the merit of the approach.
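
    The per-block operation that the proposed architecture assigns to its processing elements is ordinary histogram equalization of a local block. The NumPy sketch below shows that block-level step with non-overlapping blocks and 8-bit pixels as simplifying assumptions; it does not model the bit-serial hardware.

    ```python
    # Sketch of the per-block operation that the parallel architecture assigns to its
    # processing elements: each block of the image is histogram-equalized independently.
    # Non-overlapping blocks and 8-bit pixels are simplifying assumptions of this sketch.

    import numpy as np

    def equalize_block(block):
        """Classic histogram equalization of one 8-bit image block."""
        hist = np.bincount(block.ravel(), minlength=256)
        cdf = np.cumsum(hist).astype(np.float64)
        cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min() + 1e-12) * 255.0
        return cdf[block].astype(np.uint8)

    def local_histogram_equalization(image, block=16):
        out = np.empty_like(image)
        for r in range(0, image.shape[0], block):
            for c in range(0, image.shape[1], block):   # each block could go to its own PE
                out[r:r + block, c:c + block] = equalize_block(image[r:r + block, c:c + block])
        return out

    img = (np.random.default_rng(1).random((64, 64)) * 100).astype(np.uint8)  # low-contrast input
    print(local_histogram_equalization(img).max())   # spread toward the full 0-255 range
    ```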

  16. Neighborhood comparison operator

    NASA Technical Reports Server (NTRS)

    Gennery, Donald B. (Inventor)

    1987-01-01

    Digital values in a moving window are compared by an operator having nine comparators (18) connected to line buffers (16) for receiving a succession of central pixels together with eight neighborhood pixels. A single bit of program control determines whether the neighborhood pixels are to be compared with the central pixel or a threshold value. The central pixel is always compared with the threshold. The comparator output, plus 2 bits indicating odd-even pixel/line information about the central pixel, addresses a lookup table (20) to provide 14 bits of information, including 2 bits which control a selector (22) to pass either the central pixel value, the other 12 bits of table information, or the bit-wise logic OR of all neighboring pixels.
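
    A software sketch of the comparator stage is shown below: eight neighbor comparisons against either the central pixel or a threshold (selected by one mode bit), plus the centre-versus-threshold comparison, form a 9-bit code. The lookup-table and selector stages of the patented circuit are omitted, and the bit ordering is an assumption.

    ```python
    # Sketch of the comparator stage described above: for a 3x3 window, eight neighbor
    # comparisons (against the central pixel or a threshold, selected by one mode bit)
    # plus the centre-versus-threshold comparison form a 9-bit code. The lookup-table and
    # selector stages of the patented circuit are omitted here; bit ordering is assumed.

    def comparator_bits(window, threshold, compare_to_center=True):
        """window is a 3x3 list of lists; returns a 9-bit integer of comparison results."""
        center = window[1][1]
        reference = center if compare_to_center else threshold
        bits = 0
        position = 0
        for r in range(3):
            for c in range(3):
                if r == 1 and c == 1:
                    continue                      # neighbors only in the low 8 bits
                if window[r][c] >= reference:
                    bits |= 1 << position
                position += 1
        if center >= threshold:                   # central pixel always compared to threshold
            bits |= 1 << 8
        return bits

    w = [[10, 200, 30],
         [40,  50, 60],
         [70,  80, 90]]
    print(f"{comparator_bits(w, threshold=128):09b}")
    ```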

  17. A system for routing arbitrary directed graphs on SIMD architectures

    NASA Technical Reports Server (NTRS)

    Tomboulian, Sherryl

    1987-01-01

    There are many problems which can be described in terms of directed graphs that contain a large number of vertices where simple computations occur using data from connecting vertices. A method is given for parallelizing such problems on an SIMD machine model that is bit-serial and uses only nearest neighbor connections for communication. Each vertex of the graph will be assigned to a processor in the machine. Algorithms are given that will be used to implement movement of data along the arcs of the graph. This architecture and these algorithms define a system that is relatively simple to build and can do graph processing. All arcs can be traversed in parallel in time O(T), where T is empirically proportional to the diameter of the interconnection network times the average degree of the graph. Modifying or adding a new arc takes the same time as parallel traversal.

  18. An asymptotic-preserving Lagrangian algorithm for the time-dependent anisotropic heat transport equation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chacon, Luis; del-Castillo-Negrete, Diego; Hauck, Cory D.

    2014-09-01

    We propose a Lagrangian numerical algorithm for a time-dependent, anisotropic temperature transport equation in magnetized plasmas in the large guide field regime. The approach is based on an analytical integral formal solution of the parallel (i.e., along the magnetic field) transport equation with sources, and it is able to accommodate both local and non-local parallel heat flux closures. The numerical implementation is based on an operator-split formulation, with two straightforward steps: a perpendicular transport step (including sources), and a Lagrangian (field-line integral) parallel transport step. Algorithmically, the first step is amenable to the use of modern iterative methods, while the second step has a fixed cost per degree of freedom (and is therefore scalable). Accuracy-wise, the approach is free from the numerical pollution introduced by the discrete parallel transport term when the perpendicular-to-parallel transport coefficient ratio X⊥/X∥ becomes arbitrarily small, and is shown to capture the correct limiting solution when ε = X⊥L∥²/(X∥L⊥²) → 0 (with L∥ and L⊥ the parallel and perpendicular diffusion length scales, respectively). Therefore, the approach is asymptotic-preserving. We demonstrate the capabilities of the scheme with several numerical experiments with varying magnetic field complexity in two dimensions, including the case of transport across a magnetic island.

  19. Implementation of an ADI method on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

    The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
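
    The record mentions Gaussian elimination applied to the tridiagonal systems produced by the ADI sweeps; for a single line this is the Thomas algorithm, sketched below on an illustrative 1D diffusion-like system.

    ```python
    # Sketch of the tridiagonal Gaussian elimination (Thomas algorithm) mentioned above as
    # the per-line solver for the ADI sweeps; the example system is illustrative only.

    def thomas(a, b, c, d):
        """Solve a tridiagonal system: a = sub-diagonal, b = diagonal, c = super-diagonal,
        d = right-hand side. a[0] and c[-1] are unused."""
        n = len(d)
        cp, dp = [0.0] * n, [0.0] * n
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):                       # forward elimination
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m if i < n - 1 else 0.0
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x = [0.0] * n
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):              # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # 1D diffusion-like system: -x[i-1] + 2x[i] - x[i+1] = 1
    print(thomas(a=[0, -1, -1, -1], b=[2, 2, 2, 2], c=[-1, -1, -1, 0], d=[1, 1, 1, 1]))
    ```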

  20. Neighborhood comparison operator

    NASA Technical Reports Server (NTRS)

    Gennery, D. B. (Inventor)

    1985-01-01

    Digital values in a moving window are compared by an operator having nine comparators connected to line buffers for receiving a succession of central pixels together with eight neighborhood pixels. A single bit of program control determines whether the neighborhood pixels are to be compared with the central pixel or a threshold value. The central pixel is always compared with the threshold. The omparator output plus 2 bits indicating odd-even pixel/line information about the central pixel addresses a lookup table to provide 14 bits of information, including 2 bits which control a selector to pass either the central pixel value, the other 12 bits of table information, or the bit-wise logical OR of all nine pixels through circuit that implements a very wide OR gate.

  1. Reducing computational costs in large scale 3D EIT by using a sparse Jacobian matrix with block-wise CGLS reconstruction.

    PubMed

    Yang, C L; Wei, H Y; Adler, A; Soleimani, M

    2013-06-01

    Electrical impedance tomography (EIT) is a fast and cost-effective technique to provide a tomographic conductivity image of a subject from boundary current-voltage data. This paper proposes a time and memory efficient method for solving a large scale 3D EIT inverse problem using a parallel conjugate gradient (CG) algorithm. A 3D EIT system with a large number of measurement data can produce a very large Jacobian matrix; this could cause difficulties in computer storage and the inversion process. One of the challenges in 3D EIT is to decrease the reconstruction time and memory usage while retaining the image quality. Firstly, a sparse matrix reduction technique is proposed using thresholding to set very small values of the Jacobian matrix to zero. By converting the Jacobian matrix into a sparse format, the zero elements are eliminated, which reduces the memory requirement. Secondly, a block-wise CG method for parallel reconstruction has been developed. The proposed method has been tested using simulated data as well as experimental test samples. A sparse Jacobian with block-wise CG enables the large-scale EIT problem to be solved efficiently. Image quality measures are presented to quantify the effect of sparse matrix reduction in reconstruction results.
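
    The sparse-Jacobian step described above can be sketched with NumPy/SciPy: entries below a magnitude threshold are zeroed and the matrix is stored in compressed sparse row format. The matrix sizes and threshold below are illustrative, not the paper's values.

    ```python
    # Illustrative sketch of the sparse-Jacobian step: entries of the (dense) Jacobian with
    # magnitude below a threshold are set to zero and the result is stored in a sparse
    # format, trading a small accuracy loss for memory savings. Threshold value and matrix
    # are illustrative, not from the paper.

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    J = rng.normal(scale=1e-3, size=(2000, 5000))        # stand-in for a 3D EIT Jacobian
    J[rng.random(J.shape) < 0.02] *= 1e3                 # a few percent of entries dominate

    threshold = 1e-2 * np.abs(J).max()
    J_sparse = sparse.csr_matrix(np.where(np.abs(J) >= threshold, J, 0.0))

    dense_mb = J.nbytes / 1e6
    sparse_mb = (J_sparse.data.nbytes + J_sparse.indices.nbytes + J_sparse.indptr.nbytes) / 1e6
    print(f"kept {J_sparse.nnz} of {J.size} entries; {dense_mb:.0f} MB dense -> {sparse_mb:.1f} MB sparse")
    ```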

  2. Robust Segmentation of Overlapping Cells in Histopathology Specimens Using Parallel Seed Detection and Repulsive Level Set

    PubMed Central

    Qi, Xin; Xing, Fuyong; Foran, David J.; Yang, Lin

    2013-01-01

    Automated image analysis of histopathology specimens could potentially provide support for early detection and improved characterization of breast cancer. Automated segmentation of the cells comprising imaged tissue microarrays (TMA) is a prerequisite for any subsequent quantitative analysis. Unfortunately, crowding and overlapping of cells present significant challenges for most traditional segmentation algorithms. In this paper, we propose a novel algorithm which can reliably separate touching cells in hematoxylin stained breast TMA specimens which have been acquired using a standard RGB camera. The algorithm is composed of two steps. It begins with a fast, reliable object center localization approach which utilizes single-path voting followed by mean-shift clustering. Next, the contour of each cell is obtained using a level set algorithm based on an interactive model. We compared the experimental results with those reported in the most current literature. Finally, performance was evaluated by comparing the pixel-wise accuracy provided by human experts with that produced by the new automated segmentation algorithm. The method was systematically tested on 234 image patches exhibiting dense overlap and containing more than 2200 cells. It was also tested on whole slide images including blood smears and tissue microarrays containing thousands of cells. Since the voting step of the seed detection algorithm is well suited for parallelization, a parallel version of the algorithm was implemented using graphic processing units (GPU) which resulted in significant speed-up over the C/C++ implementation. PMID:22167559

  3. Demonstration of an optoelectronic interconnect architecture for a parallel modified signed-digit adder and subtracter

    NASA Astrophysics Data System (ADS)

    Sun, Degui; Wang, Na-Xin; He, Li-Ming; Weng, Zhao-Heng; Wang, Daheng; Chen, Ray T.

    1996-06-01

    A space-position-logic-encoding scheme is proposed and demonstrated. This encoding scheme not only makes the best use of the convenience of binary logic operation, but is also suitable for the trinary property of modified signed- digit (MSD) numbers. Based on the space-position-logic-encoding scheme, a fully parallel modified signed-digit adder and subtractor is built using optoelectronic switch technologies in conjunction with fiber-multistage 3D optoelectronic interconnects. Thus an effective combination of a parallel algorithm and a parallel architecture is implemented. In addition, the performance of the optoelectronic switches used in this system is experimentally studied and verified. Both the 3-bit experimental model and the experimental results of a parallel addition and a parallel subtraction are provided and discussed. Finally, the speed ratio between the MSD adder and binary adders is discussed and the advantage of the MSD in operating speed is demonstrated.

  4. A transient FETI methodology for large-scale parallel implicit computations in structural mechanics

    NASA Technical Reports Server (NTRS)

    Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier

    1992-01-01

    Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.

  5. Parallel exploitation of a spatial-spectral classification approach for hyperspectral images on RVC-CAL

    NASA Astrophysics Data System (ADS)

    Lazcano, R.; Madroñal, D.; Fabelo, H.; Ortega, S.; Salvador, R.; Callicó, G. M.; Juárez, E.; Sanz, C.

    2017-10-01

    Hyperspectral Imaging (HI) assembles high resolution spectral information from hundreds of narrow bands across the electromagnetic spectrum, thus generating 3D data cubes in which each pixel gathers the spectral information of the reflectance of every spatial pixel. As a result, each image is composed of large volumes of data, which turns its processing into a challenge, as performance requirements have been continuously tightened. For instance, new HI applications demand real-time responses. Hence, parallel processing becomes a necessity to achieve this requirement, so the intrinsic parallelism of the algorithms must be exploited. In this paper, a spatial-spectral classification approach has been implemented using a dataflow language known as RVC-CAL. This language represents a system as a set of functional units, and its main advantage is that it simplifies the parallelization process by mapping the different blocks over different processing units. The spatial-spectral classification approach aims at refining the classification results previously obtained by using a K-Nearest Neighbors (KNN) filtering process, in which both the pixel spectral value and the spatial coordinates are considered. To do so, KNN needs two inputs: a one-band representation of the hyperspectral image and the classification results provided by a pixel-wise classifier. Thus, the spatial-spectral classification algorithm is divided into three different stages: a Principal Component Analysis (PCA) algorithm for computing the one-band representation of the image, a Support Vector Machine (SVM) classifier, and the KNN-based filtering algorithm. The parallelization of these algorithms shows promising results in terms of computational time, as mapping them over different cores presents a speedup of 2.69x when using 3 cores. Consequently, experimental results demonstrate that real-time processing of hyperspectral images is achievable.

  6. A Modified Differential Coherent Bit Synchronization Algorithm for BeiDou Weak Signals with Large Frequency Deviation.

    PubMed

    Han, Zhifeng; Liu, Jianye; Li, Rongbing; Zeng, Qinghua; Wang, Yi

    2017-07-04

    BeiDou system navigation messages are modulated with a secondary NH (Neumann-Hoffman) code of 1 kbps, where frequent bit transitions limit the coherent integration time to 1 millisecond. Therefore, a bit synchronization algorithm is necessary to obtain bit edges and NH code phases. In order to realize bit synchronization for BeiDou weak signals with large frequency deviation, a bit synchronization algorithm based on differential coherent and maximum likelihood is proposed. Firstly, a differential coherent approach is used to remove the effect of frequency deviation, and the differential delay time is set to be a multiple of bit cycle to remove the influence of NH code. Secondly, the maximum likelihood function detection is used to improve the detection probability of weak signals. Finally, Monte Carlo simulations are conducted to analyze the detection performance of the proposed algorithm compared with a traditional algorithm under the CN0s of 20~40 dB-Hz and different frequency deviations. The results show that the proposed algorithm outperforms the traditional method with a frequency deviation of 50 Hz. This algorithm can remove the effect of BeiDou NH code effectively and weaken the influence of frequency deviation. To confirm the feasibility of the proposed algorithm, real data tests are conducted. The proposed algorithm is suitable for BeiDou weak signal bit synchronization with large frequency deviation.

  7. Circuit for high resolution decoding of multi-anode microchannel array detectors

    NASA Technical Reports Server (NTRS)

    Kasle, David B. (Inventor)

    1995-01-01

    A circuit for high resolution decoding of multi-anode microchannel array detectors is described. It consists of input registers accepting transient inputs from the anode array; anode encoding logic circuits connected to the input registers; midpoint pipeline registers connected to the anode encoding logic circuits; and pixel decoding logic circuits connected to the midpoint pipeline registers. A high resolution algorithm circuit operates in parallel with the pixel decoding logic circuit and computes a high resolution least significant bit to enhance the multi-anode microchannel array detector's spatial resolution by halving the pixel size and doubling the number of pixels in each axis of the anode array. A multiplexer is connected to the pixel decoding logic circuit and allows a user-selectable pixel address output according to the actual multi-anode microchannel array detector anode array size. An output register concatenates the high resolution least significant bit onto the standard ten-bit pixel address location to provide an eleven-bit pixel address, and also stores the full eleven-bit pixel address. A timing and control state machine is connected to the input registers, the anode encoding logic circuits, and the output register for managing the overall operation of the circuit.

  8. Frequent statistics of link-layer bit stream data based on AC-IM algorithm

    NASA Astrophysics Data System (ADS)

    Cao, Chenghong; Lei, Yingke; Xu, Yiming

    2017-08-01

    At present, there are many relevant studies on data processing using classical pattern matching and its improved algorithms, but few on frequent statistics over link-layer bit stream data. Because classical multi-pattern matching algorithms such as the AC (Aho-Corasick) algorithm have high computational complexity and low efficiency and cannot be applied directly to binary bit stream data, this paper adopts a frequent-statistics method for link-layer bit stream data based on the AC-IM algorithm. The method's maximum jump distance over the pattern tree is the length of the shortest pattern string plus 3, without missing any matches. The paper first gives a theoretical analysis of the principle of the algorithm's construction; the experimental results then show that the algorithm can adapt to the binary bit stream data environment and extract frequent sequences accurately, with an obvious effect. Meanwhile, compared with the classical AC algorithm and other improved algorithms, the AC-IM algorithm has a greater maximum jump distance and is less time-consuming.
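
    For reference, a plain (non-IM) Aho-Corasick multi-pattern search over a binary bit stream can be sketched as below; this baseline is what AC-IM improves on, and the jump-distance optimization itself is not reproduced here. The patterns and stream are illustrative.

        from collections import deque

        def build_ac(patterns):
            children, fail, output = [{}], [0], [[]]    # trie, failure links, matches per state
            for pat in patterns:
                s = 0
                for ch in pat:
                    if ch not in children[s]:
                        children.append({}); fail.append(0); output.append([])
                        children[s][ch] = len(children) - 1
                    s = children[s][ch]
                output[s].append(pat)
            queue = deque(children[0].values())         # breadth-first failure-link construction
            while queue:
                s = queue.popleft()
                for ch, nxt in children[s].items():
                    queue.append(nxt)
                    f = fail[s]
                    while f and ch not in children[f]:
                        f = fail[f]
                    fail[nxt] = children[f].get(ch, 0)
                    output[nxt] = output[nxt] + output[fail[nxt]]
            return children, fail, output

        def search(bit_stream, children, fail, output):
            matches, s = [], 0
            for pos, ch in enumerate(bit_stream):
                while s and ch not in children[s]:
                    s = fail[s]
                s = children[s].get(ch, 0)
                for pat in output[s]:
                    matches.append((pos - len(pat) + 1, pat))
            return matches

        # finds "1011" at 1 and 7, "0110" at 2, "111" at 9 and 10
        print(search("1101100101111", *build_ac(["1011", "0110", "111"])))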

  9. Adaptive image coding based on cubic-spline interpolation

    NASA Astrophysics Data System (ADS)

    Jiang, Jian-Xing; Hong, Shao-Hua; Lin, Tsung-Ching; Wang, Lin; Truong, Trieu-Kien

    2014-09-01

    It has been shown that, at low bit rates, downsampling prior to coding and upsampling after decoding can achieve better compression performance than standard coding algorithms, e.g., JPEG and H.264/AVC. However, at high bit rates, the sampling-based schemes generate more distortion. Additionally, the maximum bit rate at which the sampling-based scheme outperforms the standard algorithm is image-dependent. In this paper, a practical adaptive image coding algorithm based on cubic-spline interpolation (CSI) is proposed. The proposed algorithm adaptively selects the image coding method from the CSI-based modified JPEG and standard JPEG under a given target bit rate, utilizing the so-called ρ-domain analysis. The experimental results indicate that, compared with standard JPEG, the proposed algorithm shows better performance at low bit rates and maintains the same performance at high bit rates.

  10. Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2

    PubMed Central

    Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.

    2014-01-01

    Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64 bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000² pixels in size, which is well beyond the largest previously reported in the literature. PMID:25598745

  11. High Rate Digital Demodulator ASIC

    NASA Technical Reports Server (NTRS)

    Ghuman, Parminder; Sheikh, Salman; Koubek, Steve; Hoy, Scott; Gray, Andrew

    1998-01-01

    The architecture of a High Rate (600 Mega-bits per second) Digital Demodulator (HRDD) ASIC capable of demodulating BPSK and QPSK modulated data is presented in this paper. The advantages of all-digital processing include increased flexibility and reliability with reduced reproduction costs. Conventional serial digital processing would require high processing rates, necessitating a hardware implementation in other than CMOS technology, such as Gallium Arsenide (GaAs), which has high cost and power requirements. It is more desirable to use CMOS technology with its lower power requirements and higher gate density. However, digital demodulation of high data rates in CMOS requires parallel algorithms to process the sampled data at a rate lower than the data rate. The parallel processing algorithms described here were developed jointly by NASA's Goddard Space Flight Center (GSFC) and the Jet Propulsion Laboratory (JPL). The resulting all-digital receiver has the capability to demodulate BPSK, QPSK, OQPSK, and DQPSK at data rates in excess of 300 Mega-bits per second (Mbps) per channel. This paper will provide an overview of the parallel architecture and features of the HRDD ASIC. In addition, this paper will provide an overview of the implementation of the hardware architectures used to create flexibility over conventional high rate analog or hybrid receivers. This flexibility includes a wide range of data rates, modulation schemes, and operating environments. In conclusion, it will be shown how this high rate digital demodulator can be used with an off-the-shelf A/D and a flexible analog front end, both of which are numerically computer controlled, to produce a very flexible, low cost, high rate digital receiver.

  12. Topology-changing shape optimization with the genetic algorithm

    NASA Astrophysics Data System (ADS)

    Lamberson, Steven E., Jr.

    The goal is to take a traditional shape optimization problem statement and modify it slightly to allow for prescribed changes in topology. This modification enables greater flexibility in the choice of parameters for the topology optimization problem, while improving the direct physical relevance of the results. It involves changing the optimization problem statement from a nonlinear programming problem into a form of mixed-discrete nonlinear programming problem. The present work demonstrates one possible way of using the Genetic Algorithm (GA) to solve such a problem, including the use of "masking bits" and a new modification to the bit-string affinity (BSA) termination criterion specifically designed for problems with "masking bits." A simple ten-bar truss problem proves the utility of the modified BSA for this type of problem. A more complicated two-dimensional bracket problem is solved using both the proposed approach and a more traditional topology optimization approach (Solid Isotropic Microstructure with Penalization, or SIMP) to enable comparison. The proposed approach is able to solve problems with both local and global constraints, which is something traditional methods cannot do. The proposed approach has a significantly higher computational burden, on the order of 100 times that of SIMP, although it is able to offset this with parallel computing.
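
    The "masking bit" idea can be sketched with a toy chromosome layout: one mask bit per member switches that member (and its sizing bits) in or out of the topology, so a standard binary GA can change topology and shape together. The encoding, fitness, and names below are hypothetical and only illustrate the mechanism, not the author's formulation.

        import random

        N_MEMBERS = 10          # e.g. a ten-bar truss
        BITS_PER_AREA = 4       # resolution of each sizing variable

        def decode(chromosome):
            masks = chromosome[:N_MEMBERS]
            areas = []
            for i in range(N_MEMBERS):
                start = N_MEMBERS + i * BITS_PER_AREA
                bits = chromosome[start:start + BITS_PER_AREA]
                area = int("".join(map(str, bits)), 2) + 1    # quantised area, 1..16
                areas.append(area if masks[i] else 0)         # masked-out member is removed
            return masks, areas

        def toy_fitness(chromosome):
            masks, areas = decode(chromosome)
            weight = sum(areas)                               # minimise material use
            penalty = 0 if sum(masks) >= 3 else 1000          # crude feasibility surrogate
            return weight + penalty

        population = [[random.randint(0, 1) for _ in range(N_MEMBERS * (1 + BITS_PER_AREA))]
                      for _ in range(20)]
        best = min(population, key=toy_fitness)
        print(decode(best), toy_fitness(best))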

  13. DNABIT Compress - Genome compression algorithm.

    PubMed

    Rajarajeswari, Pothuraju; Apparao, Allam

    2011-01-22

    Data compression is concerned with how information is organized in data. Efficient storage means removal of redundancy from the data being stored in the DNA molecule. Data compression algorithms remove redundancy and are used to understand biologically important molecules. We present a compression algorithm, "DNABIT Compress", for DNA sequences based on a novel scheme of assigning binary bits to smaller segments of DNA bases to compress both repetitive and non-repetitive DNA sequences. Our proposed algorithm achieves the best compression ratio for DNA sequences for larger genomes. Significantly better compression results show that the "DNABIT Compress" algorithm is the best among the compared compression algorithms. While achieving the best compression ratios for DNA sequences (genomes), our new DNABIT Compress algorithm significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique BIT CODE) to fragments of the DNA sequence (exact repeats, reverse repeats) is also a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits per base, where the existing best methods could not achieve a ratio below 1.72 bits per base.
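
    As a point of reference for the bit-assignment idea, the sketch below shows a plain fixed two-bit packing of DNA bases (four bases per byte). DNABIT Compress itself assigns variable-length unique bit codes to exact and reverse repeats, which this baseline does not reproduce.

        CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
        BASE = {v: k for k, v in CODE.items()}

        def pack(seq: str) -> bytes:
            out = bytearray()
            for i in range(0, len(seq), 4):               # four bases per byte = 2 bits/base
                group = seq[i:i + 4]
                byte = 0
                for ch in group:
                    byte = (byte << 2) | CODE[ch]
                byte <<= 2 * (4 - len(group))             # left-align a short final group
                out.append(byte)
            return bytes(out)

        def unpack(data: bytes, n_bases: int) -> str:
            bases = []
            for byte in data:
                for shift in (6, 4, 2, 0):
                    bases.append(BASE[(byte >> shift) & 0b11])
            return "".join(bases[:n_bases])

        assert unpack(pack("ACGTGGA"), 7) == "ACGTGGA"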

  14. Parallel workflow manager for non-parallel bioinformatic applications to solve large-scale biological problems on a supercomputer.

    PubMed

    Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas

    2016-04-01

    Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes through systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station; however, due to multiple invocations for a large number of subtasks, the full task requires significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods and a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new computer software package, mpiWrapper, has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface is used to exchange information between nodes. Two specialized threads, one for task management and communication and another for subtask execution, are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .
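
    The general manager/worker pattern described above can be sketched with mpi4py as follows; this is an illustration under stated assumptions (hypothetical shell-command subtasks, single-threaded workers), not the mpiWrapper source, and the two-thread deadlock-avoidance detail is omitted.

        from mpi4py import MPI
        import subprocess

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        if rank == 0:                                       # manager rank
            tasks = [f"echo processing chunk {i}" for i in range(16)]   # hypothetical subtasks
            active = 0
            for worker in range(1, size):                   # seed every worker with one task
                if tasks:
                    comm.send(tasks.pop(), dest=worker)
                    active += 1
                else:
                    comm.send(None, dest=worker)
            while active:                                   # hand out the remaining subtasks
                status = MPI.Status()
                finished = comm.recv(source=MPI.ANY_SOURCE, status=status)
                nxt = tasks.pop() if tasks else None
                comm.send(nxt, dest=status.Get_source())
                if nxt is None:
                    active -= 1
        else:                                               # worker ranks
            while True:
                cmd = comm.recv(source=0)
                if cmd is None:                             # shutdown signal
                    break
                run = subprocess.run(cmd, shell=True, capture_output=True, text=True)
                comm.send((cmd, run.returncode), dest=0)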

  15. The Complexity of Bit Retrieval

    DOE PAGES

    Elser, Veit

    2018-09-20

    Bit retrieval is the problem of reconstructing a periodic binary sequence from its periodic autocorrelation, with applications in cryptography and x-ray crystallography. After defining the problem, with and without noise, we describe and compare various algorithms for solving it. A geometrical constraint satisfaction algorithm, relaxed-reflect-reflect, is currently the best algorithm for noisy bit retrieval.
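
    The forward map that bit retrieval inverts is easy to state; the sketch below, assuming NumPy and a ±1-valued sequence, computes the periodic autocorrelation from which the sequence must be reconstructed.

        import numpy as np

        def periodic_autocorrelation(s):
            s = np.asarray(s, dtype=float)
            return np.array([np.dot(s, np.roll(s, k)) for k in range(len(s))])

        s = np.array([1, -1, -1, 1, 1, 1, -1, 1])      # example binary (+/-1) sequence
        print(periodic_autocorrelation(s))             # bit retrieval must invert this map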

  16. The Complexity of Bit Retrieval

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Elser, Veit

    Bit retrieval is the problem of reconstructing a periodic binary sequence from its periodic autocorrelation, with applications in cryptography and x-ray crystallography. After defining the problem, with and without noise, we describe and compare various algorithms for solving it. A geometrical constraint satisfaction algorithm, relaxed-reflect-reflect, is currently the best algorithm for noisy bit retrieval.

  17. Compression of multispectral Landsat imagery using the Embedded Zerotree Wavelet (EZW) algorithm

    NASA Technical Reports Server (NTRS)

    Shapiro, Jerome M.; Martucci, Stephen A.; Czigler, Martin

    1994-01-01

    The Embedded Zerotree Wavelet (EZW) algorithm has proven to be an extremely efficient and flexible compression algorithm for low bit rate image coding. The embedding algorithm orders the bits in the bit stream by numerical importance, and thus a given code contains all lower rate encodings of the same algorithm. Therefore, precise bit rate control is achievable and a target rate or distortion metric can be met exactly. Furthermore, the technique is fully image adaptive. An algorithm for multispectral image compression which combines the spectral redundancy removal properties of the image-dependent Karhunen-Loeve Transform (KLT) with the efficiency, controllability, and adaptivity of the embedded zerotree wavelet algorithm is presented. Results are shown which illustrate the advantage of jointly encoding spectral components using the KLT and EZW.

  18. A high data rate universal lattice decoder on FPGA

    NASA Astrophysics Data System (ADS)

    Ma, Jing; Huang, Xinming; Kura, Swapna

    2005-06-01

    This paper presents the architecture design of a high data rate universal lattice decoder for MIMO channels on an FPGA platform. A Pohst-strategy-based lattice decoding algorithm is modified in this paper to reduce the complexity of the closest lattice point search. The data dependency of the improved algorithm is examined and a parallel and pipelined architecture is developed, with the iterative decoding function on the FPGA and the division-intensive channel matrix preprocessing on a DSP. Simulation results demonstrate that the improved lattice decoding algorithm provides a better bit error rate and fewer iterations compared with the original algorithm. The system prototype of the decoder shows that it supports data rates up to 7 Mbit/s on a Virtex2-1000 FPGA, which is about 8 times faster than the original algorithm on the FPGA platform and two orders of magnitude better than its implementation on a DSP platform.

  19. DNABIT Compress – Genome compression algorithm

    PubMed Central

    Rajarajeswari, Pothuraju; Apparao, Allam

    2011-01-01

    Data compression is concerned with how information is organized in data. Efficient storage means removal of redundancy from the data being stored in the DNA molecule. Data compression algorithms remove redundancy and are used to understand biologically important molecules. We present a compression algorithm, “DNABIT Compress”, for DNA sequences based on a novel scheme of assigning binary bits to smaller segments of DNA bases to compress both repetitive and non-repetitive DNA sequences. Our proposed algorithm achieves the best compression ratio for DNA sequences for larger genomes. Significantly better compression results show that the “DNABIT Compress” algorithm is the best among the compared compression algorithms. While achieving the best compression ratios for DNA sequences (genomes), our new DNABIT Compress algorithm significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique BIT CODE) to fragments of the DNA sequence (exact repeats, reverse repeats) is also a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits per base, where the existing best methods could not achieve a ratio below 1.72 bits per base. PMID:21383923

  20. GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing

    NASA Astrophysics Data System (ADS)

    Johl, John T.; Baker, Nick C.

    1988-10-01

    The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.

  1. Design of a massively parallel computer using bit serial processing elements

    NASA Technical Reports Server (NTRS)

    Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing

    1995-01-01

    A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.

  2. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (Inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  3. A novel image encryption algorithm based on synchronized random bit generated in cascade-coupled chaotic semiconductor ring lasers

    NASA Astrophysics Data System (ADS)

    Li, Jiafu; Xiang, Shuiying; Wang, Haoning; Gong, Junkai; Wen, Aijun

    2018-03-01

    In this paper, a novel image encryption algorithm based on synchronization of physical random bits generated in a cascade-coupled semiconductor ring laser (CCSRL) system is proposed, and its security analysis is performed. In both the transmitter and receiver parts, the CCSRL system is a master-slave configuration consisting of a master semiconductor ring laser (M-SRL) with cross-feedback and a solitary SRL (S-SRL). The proposed image encryption algorithm includes image preprocessing based on conventional chaotic maps, pixel confusion based on a control matrix extracted from the physical random bits, and pixel diffusion based on a random bit stream extracted from the physical random bits. Firstly, the preprocessing method is used to eliminate the correlation between adjacent pixels. Secondly, physical random bits with verified randomness are generated based on chaos in the CCSRL system, and are used to simultaneously generate the control matrix and the random bit stream. Finally, the control matrix and random bit stream are used in the encryption algorithm to change the positions and the values of pixels, respectively. Simulation results and security analysis demonstrate that the proposed algorithm is effective and able to resist various typical attacks, and thus is an excellent candidate for secure image communication applications.
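
    A generic confusion-then-diffusion round of this kind can be sketched as below, assuming NumPy; a seeded software generator stands in for the physical random bit source, a permutation plays the role of the control matrix, and a byte keystream drives the XOR diffusion. It illustrates the two stages only, not the authors' chaotic preprocessing or the CCSRL bit generation.

        import numpy as np

        rng = np.random.default_rng(12345)              # stand-in for the physical bit source
        image = np.random.randint(0, 256, (4, 4), dtype=np.uint8)       # toy grayscale image

        perm = rng.permutation(image.size)              # confusion: position scrambling
        keystream = rng.integers(0, 256, image.size, dtype=np.uint8)    # diffusion bit stream

        flat = image.flatten()
        encrypted = (flat[perm] ^ keystream).reshape(image.shape)

        # Decryption reverses the two stages with the same random data.
        inverse = np.empty_like(perm)
        inverse[perm] = np.arange(perm.size)
        decrypted = ((encrypted.flatten() ^ keystream)[inverse]).reshape(image.shape)
        assert np.array_equal(decrypted, image)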

  4. Effect of data truncation in an implementation of pixel clustering on a custom computing machine

    NASA Astrophysics Data System (ADS)

    Leeser, Miriam E.; Theiler, James P.; Estlick, Michael; Kitaryeva, Natalya V.; Szymanski, John J.

    2000-10-01

    We investigate the effect of truncating the precision of hyperspectral image data for the purpose of more efficiently segmenting the image using a variant of k-means clustering. We describe the implementation of the algorithm on field-programmable gate array (FPGA) hardware. Truncating the data to only a few bits per pixel in each spectral channel permits a more compact hardware design, enabling greater parallelism, and ultimately a more rapid execution. It also enables the storage of larger images in the onboard memory. In exchange for faster clustering, however, one trades off the quality of the produced segmentation. We find, however, that the clustering algorithm can tolerate considerable data truncation with little degradation in cluster quality. This robustness to truncated data can be extended by computing the cluster centers to a few more bits of precision than the data. Since there are so many more pixels than centers, the more aggressive data truncation leads to significant gains in the number of pixels that can be stored in memory and processed in hardware concurrently.

  5. FPGA implementation of low complexity LDPC iterative decoder

    NASA Astrophysics Data System (ADS)

    Verma, Shivani; Sharma, Sanjay

    2016-07-01

    Low-density parity-check (LDPC) codes, proposed by Gallager, emerged as a class of codes which can yield very good performance on the additive white Gaussian noise channel as well as on the binary symmetric channel. LDPC codes have gained considerable importance due to their capacity-achieving property and excellent performance in noisy channels. The belief propagation (BP) algorithm and its approximations, most notably min-sum, are popular iterative decoding algorithms used for LDPC and turbo codes. The trade-off between hardware complexity and decoding throughput is a critical factor in the implementation of a practical decoder. This article presents an introduction to LDPC codes and their various decoding algorithms, followed by the realisation of an LDPC decoder using a simplified message passing algorithm and a partially parallel decoder architecture. The simplified message passing algorithm is proposed as a trade-off between low decoding complexity and decoder performance; it greatly reduces the routing and check node complexity of the decoder. The partially parallel decoder architecture possesses high speed and reduced complexity. The improved design of the decoder achieves a maximum symbol throughput of 92.95 Mbps and a maximum of 18 decoding iterations. The article presents the implementation of a 9216-bit, rate-1/2, (3, 6) LDPC decoder on a Xilinx XC3SD3400A device from the Spartan-3A DSP family.
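
    The min-sum message-passing idea referred to above can be illustrated in software as follows; the toy parity-check matrix, the LLR convention (positive means bit 0), and the loop structure are illustrative only, and the article's simplified algorithm and partially parallel hardware mapping are not represented.

        import numpy as np

        H = np.array([[1, 1, 0, 1, 0, 0],               # toy parity-check matrix
                      [0, 1, 1, 0, 1, 0],
                      [1, 0, 0, 0, 1, 1],
                      [0, 0, 1, 1, 0, 1]])

        def min_sum_decode(llr, H, iters=10):
            m, n = H.shape
            msg = np.zeros((m, n))                       # check-to-variable messages
            hard = np.zeros(n, dtype=int)
            for _ in range(iters):
                total = llr + msg.sum(axis=0)            # per-variable LLR totals
                v2c = (total - msg) * H                  # extrinsic variable-to-check messages
                for i in range(m):
                    idx = np.flatnonzero(H[i])
                    vals = v2c[i, idx]
                    signs = np.where(vals >= 0, 1.0, -1.0)
                    mags = np.abs(vals)
                    for k, j in enumerate(idx):
                        msg[i, j] = np.prod(np.delete(signs, k)) * np.delete(mags, k).min()
                hard = ((llr + msg.sum(axis=0)) < 0).astype(int)   # LLR < 0 decodes to bit 1
                if not np.any((H @ hard) % 2):           # stop once all checks are satisfied
                    break
            return hard

        # an all-zero codeword with one unreliable bit is corrected:
        print(min_sum_decode(np.array([3.0, 2.5, -0.5, 3.0, 2.0, 2.5]), H))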

  6. High performance compression of science data

    NASA Technical Reports Server (NTRS)

    Storer, James A.; Cohn, Martin

    1992-01-01

    In the future, NASA expects to gather over a terabyte per day of data requiring space for levels of archival storage. Data compression will be a key component in systems that store this data (e.g., optical disk and tape) as well as in communications systems (both between space and Earth and between scientific locations on Earth). We propose to develop algorithms that can be a basis for software and hardware systems that compress a wide variety of scientific data with different criteria for fidelity/bandwidth tradeoffs. The algorithmic approaches we consider are specially targeted for parallel computation where data rates of over 1 billion bits per second are achievable with current technology.

  7. High performance compression of science data

    NASA Technical Reports Server (NTRS)

    Storer, James A.; Cohn, Martin

    1993-01-01

    In the future, NASA expects to gather over a terabyte per day of data requiring space for levels of archival storage. Data compression will be a key component in systems that store this data (e.g., optical disk and tape) as well as in communications systems (both between space and Earth and between scientific locations on Earth). We propose to develop algorithms that can be a basis for software and hardware systems that compress a wide variety of scientific data with different criteria for fidelity/bandwidth tradeoffs. The algorithmic approaches we consider are specially targeted for parallel computation where data rates of over 1 billion bits per second are achievable with current technology.

  8. Real-time multiplicity counter

    DOEpatents

    Rowland, Mark S [Alamo, CA; Alvarez, Raymond A [Berkeley, CA

    2010-07-13

    A neutron multi-detector array feeds pulses in parallel to individual inputs that are tied to individual bits in a digital word. Data is collected by loading a word at the individual bit level in parallel. The word is read at regular intervals, all bits simultaneously, to minimize latency. The electronics then pass the word to a number of storage locations for subsequent processing, thereby removing the front-end problem of pulse pileup.

  9. Parallelized implicit propagators for the finite-difference Schrödinger equation

    NASA Astrophysics Data System (ADS)

    Parker, Jonathan; Taylor, K. T.

    1995-08-01

    We describe the application of block Gauss-Seidel and block Jacobi iterative methods to the design of implicit propagators for finite-difference models of the time-dependent Schrödinger equation. The block-wise iterative methods discussed here are mixed direct-iterative methods for solving simultaneous equations, in the sense that direct methods (e.g. LU decomposition) are used to invert certain block sub-matrices, and iterative methods are used to complete the solution. We describe parallel variants of the basic algorithm that are well suited to the medium- to coarse-grained parallelism of work-station clusters, and MIMD supercomputers, and we show that under a wide range of conditions, fine-grained parallelism of the computation can be achieved. Numerical tests are conducted on a typical one-electron atom Hamiltonian. The methods converge robustly to machine precision (15 significant figures), in some cases in as few as 6 or 7 iterations. The rate of convergence is nearly independent of the finite-difference grid-point separations.

  10. Bell-Curve Based Evolutionary Optimization Algorithm

    NASA Technical Reports Server (NTRS)

    Sobieszczanski-Sobieski, J.; Laba, K.; Kincaid, R.

    1998-01-01

    The paper presents an optimization algorithm that falls in the category of genetic, or evolutionary, algorithms. While bit exchange is the basis of most genetic algorithms (GAs) in research and applications in America, some alternatives, also in the category of evolutionary algorithms but using a direct, geometrical approach, have gained popularity in Europe and Asia. The Bell-Curve Based Evolutionary Algorithm (BCB) is in this alternative category and is distinguished by the use of a combination of n-dimensional geometry and the normal distribution, the bell curve, in the generation of the offspring. The tool for creating a child is a geometrical construct comprising a line connecting two parents and a weighted point on that line. The point that defines the child deviates from the weighted point in two directions, parallel and orthogonal to the connecting line, the deviation in each direction obeying a probabilistic distribution. Tests showed satisfactory performance of BCB. The principal advantage of BCB is its controllability via the normal distribution parameters and the geometrical construct variables.
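
    The offspring construction described above can be sketched numerically as below, assuming NumPy; the weighting and deviation parameters are illustrative placeholders rather than the paper's tuned settings.

        import numpy as np

        def bcb_child(p1, p2, w=0.5, sigma_par=0.1, sigma_orth=0.1, rng=None):
            rng = rng if rng is not None else np.random.default_rng()
            p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
            direction = p2 - p1
            length = np.linalg.norm(direction)
            unit = direction / length if length > 0 else direction
            base = p1 + w * direction                           # weighted point on the connecting line
            child = base + rng.normal(0.0, sigma_par) * length * unit   # deviation along the line
            orth = rng.normal(0.0, sigma_orth, size=p1.shape)   # random vector...
            orth -= orth.dot(unit) * unit                       # ...made orthogonal to the line
            return child + length * orth

        print(bcb_child([0.0, 0.0], [1.0, 1.0]))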

  11. 32-Bit-Wide Memory Tolerates Failures

    NASA Technical Reports Server (NTRS)

    Buskirk, Glenn A.

    1990-01-01

    An electronic memory system of 32-bit words corrects bit errors caused by some common types of failures, even failure of an entire 4-bit-wide random-access-memory (RAM) chip. It detects the failure of two such chips, so the user is warned that the output of the memory may contain errors. The system includes eight 4-bit-wide DRAMs configured so that each bit of each DRAM is assigned to a different one of four parallel 8-bit words. Each DRAM thus contributes only 1 bit to each 8-bit word.

  12. Using game theory for perceptual tuned rate control algorithm in video coding

    NASA Astrophysics Data System (ADS)

    Luo, Jiancong; Ahmad, Ishfaq

    2005-03-01

    This paper proposes a game-theoretical rate control technique for video compression. Using a cooperative gaming approach, which has been utilized in several branches of the natural and social sciences because of its enormous potential for solving constrained optimization problems, we propose a dual-level scheme to optimize the perceptual quality while guaranteeing "fairness" in bit allocation among macroblocks. At the frame level, the algorithm allocates target bits to frames based on their coding complexity. At the macroblock level, the algorithm distributes bits to macroblocks by defining a bargaining game. Macroblocks play cooperatively to compete for shares of resources (bits) to optimize their quantization scales while considering the Human Visual System's perceptual property. Since the whole frame is an entity perceived by viewers, macroblocks compete cooperatively under a global objective of achieving the best quality with the given bit constraint. The major advantage of the proposed approach is that the cooperative game leads to an optimal and fair bit allocation strategy based on the Nash Bargaining Solution. Another advantage is that it allows multi-objective optimization with multiple decision makers (macroblocks). The simulation results confirm the algorithm's ability to achieve an accurate bit rate with good perceptual quality, and to maintain a stable buffer level.

  13. Note: optical receiver system for 152-channel magnetoencephalography.

    PubMed

    Kim, Jin-Mok; Kwon, Hyukchan; Yu, Kwon-kyu; Lee, Yong-Ho; Kim, Kiwoong

    2014-11-01

    An optical receiver system comprising 13 serial data restore/synchronizer modules and a single module combiner converted optical 32-bit serial data into 32-bit synchronous parallel data for a computer to acquire 152-channel magnetoencephalography (MEG) signals. A serial data restore/synchronizer module identified the 32 channel-voltage bits within the 48-bit streaming serial data, and then consecutively reproduced the 32-bit serial data 13 times, acting on a synchronous clock. After selecting a single one of the 13 reproduced data streams in each module, the module combiner converted it into 32-bit parallel data, which were carried to a 32-port digital input board in a computer. When the receiver system, together with the optical transmitters, was applied to 152-channel superconducting quantum interference device sensors, this MEG system maintained a field noise level of 3 fT/√Hz @ 100 Hz at a sample rate of 1 kSample/s per channel.

  14. Simulated quantum computation of molecular energies.

    PubMed

    Aspuru-Guzik, Alán; Dutoi, Anthony D; Love, Peter J; Head-Gordon, Martin

    2005-09-09

    The calculation time for the energy of atoms and molecules scales exponentially with system size on a classical computer but polynomially using quantum algorithms. We demonstrate that such algorithms can be applied to problems of chemical interest using modest numbers of quantum bits. Calculations of the water and lithium hydride molecular ground-state energies have been carried out on a quantum computer simulator using a recursive phase-estimation algorithm. The recursive algorithm reduces the number of quantum bits required for the readout register from about 20 to 4. Mappings of the molecular wave function to the quantum bits are described. An adiabatic method for the preparation of a good approximate ground-state wave function is described and demonstrated for a stretched hydrogen molecule. The number of quantum bits required scales linearly with the number of basis functions, and the number of gates required grows polynomially with the number of quantum bits.

  15. a Real-Time Computer Music Synthesis System

    NASA Astrophysics Data System (ADS)

    Lent, Keith Henry

    A real-time sound synthesis system has been developed at the Computer Music Center of The University of Texas at Austin. This system consists of several stand-alone processors that were constructed jointly with White Instruments in Austin. These processors can be programmed as general purpose computers, but are provided with a number of specialized interfaces including: MIDI, 8-bit parallel, high speed serial, 2 channels of analog input (18-bit A/Ds, 48 kHz sample rate), and 4 channels of analog output (18-bit D/As). In addition, a basic music synthesis language (Music56000) has been written in assembly code. On top of this, a symbolic compiler (PatchWork) has been developed to enable algorithms which run in these processors to be created graphically. And finally, a number of efficient time-domain numerical models have been developed to enable the construction, simulation, control, and synthesis of many musical acoustics systems in real time on these processors. Specifically, assembly language models for cylindrical and conical horn sections, dissipative losses, tone holes, bells, and a number of linear and nonlinear boundary conditions have been developed.

  16. Parallel design patterns for a low-power, software-defined compressed video encoder

    NASA Astrophysics Data System (ADS)

    Bruns, Michael W.; Hunt, Martin A.; Prasad, Durga; Gunupudi, Nageswara R.; Sonachalam, Sekar

    2011-06-01

    Video compression algorithms such as H.264 offer much potential for parallel processing that is not always exploited by the technology of a particular implementation. Consumer mobile encoding devices often achieve real-time performance and low power consumption through parallel processing in Application Specific Integrated Circuit (ASIC) technology, but many other applications require a software-defined encoder. High quality compression features needed for some applications such as 10-bit sample depth or 4:2:2 chroma format often go beyond the capability of a typical consumer electronics device. An application may also need to efficiently combine compression with other functions such as noise reduction, image stabilization, real time clocks, GPS data, mission/ESD/user data or software-defined radio in a low power, field upgradable implementation. Low power, software-defined encoders may be implemented using a massively parallel memory-network processor array with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. A dataflow programming methodology may be used to express all of the encoding processes including motion compensation, transform and quantization, and entropy coding. This is a declarative programming model in which the parallelism of the compression algorithm is expressed as a hierarchical graph of tasks with message communication. Data parallel and task parallel design patterns are supported without the need for explicit global synchronization control. An example is described of an H.264 encoder developed for a commercially available, massively parallel memory-network processor device.

  17. Lossy compression of weak lensing data

    DOE PAGES

    Vanderveld, R. Ali; Bernstein, Gary M.; Stoughton, Chris; ...

    2011-07-12

    Future orbiting observatories will survey large areas of sky in order to constrain the physics of dark matter and dark energy using weak gravitational lensing and other methods. Lossy compression of the resultant data will improve the cost and feasibility of transmitting the images through the space communication network. We evaluate the consequences of the lossy compression algorithm of Bernstein et al. (2010) for the high-precision measurement of weak-lensing galaxy ellipticities. This square-root algorithm compresses each pixel independently, and the information discarded is by construction less than the Poisson error from photon shot noise. For simulated space-based images (without cosmic rays) digitized to the typical 16 bits per pixel, application of the lossy compression followed by image-wise lossless compression yields images with only 2.4 bits per pixel, a factor of 6.7 compression. We demonstrate that this compression introduces no bias in the sky background. The compression introduces a small amount of additional digitization noise to the images, and we demonstrate a corresponding small increase in ellipticity measurement noise. The ellipticity measurement method is biased by the addition of noise, so the additional digitization noise is expected to induce a multiplicative bias on the galaxies' measured ellipticities. After correcting for this known noise-induced bias, we find a residual multiplicative ellipticity bias of m ≈ -4 × 10^-4. This bias is small when compared to the many other issues that precision weak lensing surveys must confront, and furthermore we expect it to be reduced further with better calibration of ellipticity measurement methods.

  18. A Fast, Automatic Segmentation Algorithm for Locating and Delineating Touching Cell Boundaries in Imaged Histopathology

    PubMed Central

    Qi, Xin; Xing, Fuyong; Foran, David J.; Yang, Lin

    2013-01-01

    Background Automated analysis of imaged histopathology specimens could potentially provide support for improved reliability in detection and classification in a range of investigative and clinical cancer applications. Automated segmentation of cells in the digitized tissue microarray (TMA) is often the prerequisite for quantitative analysis. However, overlapping cells usually bring significant challenges for traditional segmentation algorithms. Objectives In this paper, we propose a novel, automatic algorithm to separate overlapping cells in stained histology specimens acquired using bright-field RGB imaging. Methods It starts by systematically identifying salient regions of interest throughout the image based upon their underlying visual content. The segmentation algorithm subsequently performs a quick, voting-based seed detection. Finally, the contour of each cell is obtained using a repulsive level set deformable model using the seeds generated in the previous step. We compared the experimental results with the most current literature, and measured the pixel-wise accuracy between the human experts' annotations and those generated using the automatic segmentation algorithm. Results The method is tested with 100 image patches which contain more than 1000 overlapping cells. The overall precision and recall of the developed algorithm are 90% and 78%, respectively. We also implement the algorithm on a GPU. The parallel implementation is 22 times faster than its C/C++ sequential implementation. Conclusion The proposed overlapping cell segmentation algorithm can accurately detect the center of each overlapping cell and effectively separate each of the overlapping cells. The GPU is proven to be an efficient parallel platform for overlapping cell segmentation. PMID:22526139

  19. Short Note on Complexity of Multi-Value Byzantine Agreement

    DTIC Science & Technology

    2010-07-27

    which lead to nBl/D bits over the whole algorithm. Broadcasts in extended step: In the extended step, every node broadcasts D bits. Thus nDB bits ... bits, as: (n-1)l + n(n-1)(k + D/k)l/D + nBl/D + nDBt(t+1)   (4)   = (n-1)l + O(n²kl/D + n²l/k + nBl/D + n³BD).   (5)   Notice that broadcast algorithm of

  20. Efficient Bit-to-Symbol Likelihood Mappings

    NASA Technical Reports Server (NTRS)

    Moision, Bruce E.; Nakashima, Michael A.

    2010-01-01

    This innovation is an efficient algorithm designed to perform bit-to-symbol and symbol-to-bit likelihood mappings that represent a significant portion of the complexity of an error-correction code decoder for high-order constellations. Recent implementation of the algorithm in hardware has yielded an 8- percent reduction in overall area relative to the prior design.
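
    A direct (unoptimized) bit-to-symbol likelihood mapping of the kind this innovation accelerates can be written down as follows, assuming NumPy, bit LLRs defined as log P(bit=0)/P(bit=1), and a 2^m-ary constellation labeled by m bits; the low-complexity reformulation reported above is not reproduced here.

        import numpy as np

        def bit_to_symbol_likelihoods(bit_llrs):
            m = len(bit_llrs)
            p1 = 1.0 / (1.0 + np.exp(bit_llrs))          # P(bit = 1) from each LLR
            probs = np.empty(2 ** m)
            for s in range(2 ** m):                      # every symbol label
                bits = [(s >> (m - 1 - i)) & 1 for i in range(m)]
                probs[s] = np.prod([p1[i] if b else 1.0 - p1[i] for i, b in enumerate(bits)])
            return probs

        print(bit_to_symbol_likelihoods(np.array([2.0, -1.0, 0.5])))   # 8-ary example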

  1. Multi-bit operations in vertical spintronic shift registers

    NASA Astrophysics Data System (ADS)

    Lavrijsen, Reinoud; Petit, Dorothée C. M. C.; Fernández-Pacheco, Amalio; Lee, JiHyun; Mansell, Mansell; Cowburn, Russell P.

    2014-03-01

    Spintronic devices have in general demonstrated the feasibility of non-volatile memory storage and simple Boolean logic operations. Modern microprocessors have one further frequently used digital operation: bit-wise operations on multiple bits simultaneously. Such operations are important for binary multiplication and division and in efficient microprocessor architectures such as reduced instruction set computing (RISC). In this paper we show a four-stage vertical serial shift register made from RKKY coupled ultrathin (0.9 nm) perpendicularly magnetised layers into which a 3-bit data word is injected. The entire four stage shift register occupies a total length (thickness) of only 16 nm. We show how under the action of an externally applied magnetic field bits can be shifted together as a word and then manipulated individually, including being brought together to perform logic operations. This is one of the highest level demonstrations of logic operation ever performed on data in the magnetic state and brings closer the possibility of ultrahigh density all-magnetic microprocessors.

  2. Multi-bit operations in vertical spintronic shift registers.

    PubMed

    Lavrijsen, Reinoud; Petit, Dorothée C M C; Fernández-Pacheco, Amalio; Lee, Jihyun; Mansell, Mansell; Cowburn, Russell P

    2014-03-14

    Spintronic devices have in general demonstrated the feasibility of non-volatile memory storage and simple Boolean logic operations. Modern microprocessors have one further frequently used digital operation: bit-wise operations on multiple bits simultaneously. Such operations are important for binary multiplication and division and in efficient microprocessor architectures such as reduced instruction set computing (RISC). In this paper we show a four-stage vertical serial shift register made from RKKY coupled ultrathin (0.9 nm) perpendicularly magnetised layers into which a 3-bit data word is injected. The entire four stage shift register occupies a total length (thickness) of only 16 nm. We show how under the action of an externally applied magnetic field bits can be shifted together as a word and then manipulated individually, including being brought together to perform logic operations. This is one of the highest level demonstrations of logic operation ever performed on data in the magnetic state and brings closer the possibility of ultrahigh density all-magnetic microprocessors.

  3. Hash Bit Selection for Nearest Neighbor Search.

    PubMed

    Xianglong Liu; Junfeng He; Shih-Fu Chang

    2017-11-01

    To overcome the barrier of storage and computation when dealing with gigantic-scale data sets, compact hashing has been studied extensively to approximate the nearest neighbor search. Despite the recent advances, critical design issues remain open in how to select the right features, hashing algorithms, and/or parameter settings. In this paper, we address these by posing an optimal hash bit selection problem, in which an optimal subset of hash bits are selected from a pool of candidate bits generated by different features, algorithms, or parameters. Inspired by the optimization criteria used in existing hashing algorithms, we adopt the bit reliability and their complementarity as the selection criteria that can be carefully tailored for hashing performance in different tasks. Then, the bit selection solution is discovered by finding the best tradeoff between search accuracy and time using a modified dynamic programming method. To further reduce the computational complexity, we employ the pairwise relationship among hash bits to approximate the high-order independence property, and formulate it as an efficient quadratic programming method that is theoretically equivalent to the normalized dominant set problem in a vertex- and edge-weighted graph. Extensive large-scale experiments have been conducted under several important application scenarios of hash techniques, where our bit selection framework can achieve superior performance over both the naive selection methods and the state-of-the-art hashing algorithms, with significant accuracy gains ranging from 10% to 50%, relatively.

  4. Implementation of cryptographic hash function SHA256 in C++

    NASA Astrophysics Data System (ADS)

    Shrivastava, Akash

    2012-02-01

    This abstract explains an implementation of the SHA-256 secure hash algorithm in C++. SHA-2 is a strong hashing algorithm used in almost all kinds of security applications. The algorithm consists of two phases: preprocessing and hash computation. Preprocessing involves padding a message, parsing the padded message into m-bit blocks, and setting initialization values to be used in the hash computation. The computation generates a message schedule from the padded message and uses that schedule, along with functions, constants, and word operations, to iteratively generate a series of hash values. The final hash value generated by the computation is used to determine the message digest. SHA-2 includes a significant number of changes from its predecessor, SHA-1. SHA-2 consists of a set of four hash functions with digests that are 224, 256, 384 or 512 bits. The algorithm outputs a 256-bit message digest, with an internal state of 256 bits and a block size of 512 bits. The maximum message length is 2^64 - 1 bits, and the digest is computed over a series of 64 rounds consisting of several operations such as AND, OR, XOR, SHR, and ROT. The code provides a clear understanding of the hash algorithm and generates hash values to retrieve the message digest.
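
    The preprocessing (padding) phase mentioned above follows a fixed recipe that can be sketched independently of the C++ code: append a single 1 bit, zero-fill to 56 bytes modulo 64, then append the 64-bit big-endian message length. A small sketch:

        def sha256_pad(message: bytes) -> bytes:
            bit_len = len(message) * 8
            padded = message + b"\x80"                        # the lone 1 bit plus seven 0 bits
            padded += b"\x00" * ((56 - len(padded)) % 64)     # zero-fill to 56 bytes mod 64
            padded += bit_len.to_bytes(8, "big")              # 64-bit big-endian bit length
            return padded

        padded = sha256_pad(b"abc")
        blocks = [padded[i:i + 64] for i in range(0, len(padded), 64)]
        assert all(len(block) == 64 for block in blocks)      # parses into 512-bit blocks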

  5. Accelerating String Set Matching in FPGA Hardware for Bioinformatics Research

    PubMed Central

    Dandass, Yoginder S; Burgess, Shane C; Lawrence, Mark; Bridges, Susan M

    2008-01-01

    Background This paper describes techniques for accelerating the performance of the string set matching problem with particular emphasis on applications in computational proteomics. The process of matching peptide sequences against a genome translated in six reading frames is part of a proteogenomic mapping pipeline that is used as a case-study. The Aho-Corasick algorithm is adapted for execution in field programmable gate array (FPGA) devices in a manner that optimizes space and performance. In this approach, the traditional Aho-Corasick finite state machine (FSM) is split into smaller FSMs, operating in parallel, each of which matches up to 20 peptides in the input translated genome. Each of the smaller FSMs is further divided into five simpler FSMs such that each simple FSM operates on a single bit position in the input (five bits are sufficient for representing all amino acids and special symbols in protein sequences). Results This bit-split organization of the Aho-Corasick implementation enables efficient utilization of the limited random access memory (RAM) resources available in typical FPGAs. The use of on-chip RAM as opposed to FPGA logic resources for FSM implementation also enables rapid reconfiguration of the FPGA without the place and routing delays associated with complex digital designs. Conclusion Experimental results show storage efficiencies of over 80% for several data sets. Furthermore, the FPGA implementation executing at 100 MHz is nearly 20 times faster than an implementation of the traditional Aho-Corasick algorithm executing on a 2.67 GHz workstation. PMID:18412963

  6. Self-recovery fragile watermarking algorithm based on SPIHT

    NASA Astrophysics Data System (ADS)

    Xin, Li Ping

    2015-12-01

    A fragile watermarking algorithm based on SPIHT coding is proposed which can recover the primary image itself. The novelty of the algorithm is that it performs both tamper localization and self-restoration, and the recovery achieves a very good effect. First, utilizing the zero-tree structure, the algorithm compresses and encodes the image itself and thereby obtains self-correlated watermark data, greatly reducing the quantity of embedded watermark data. The watermark data are then encoded with an error correcting code, and the check bits and watermark bits are scrambled and embedded to enhance the recovery ability. At the same time, by embedding the watermark into the last two bit-planes of the gray-level image, the watermarked image gains a better visual effect. The experimental results show that the proposed algorithm can not only detect various processing operations such as noise addition, cropping, and filtering, but can also recover the tampered image and realize blind detection. Peak signal-to-noise ratios of the watermarked image were higher than those of other similar algorithms. The attack resistance of the algorithm was enhanced.

  7. Encryption and decryption using FPGA

    NASA Astrophysics Data System (ADS)

    Nayak, Nikhilesh; Chandak, Akshay; Shah, Nisarg; Karthikeyan, B.

    2017-11-01

    In this paper, we perform multiple cryptography methods on a set of data and compare their outputs. Here the AES algorithm and the RSA algorithm are used. Using the AES algorithm, an 8-bit input (plain text) is encrypted using a cipher key and the result is displayed on Tera Term (serially). For simulation, a 128-bit input is used and operated on with a 128-bit cipher key to generate the encrypted text. The reverse operations are then performed to obtain the decrypted text. In the RSA algorithm, file handling is used to input the plain text. This text is then operated on to obtain the encrypted and decrypted data, which are then stored in a file. Finally, the results of both algorithms are compared.
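
    The RSA encrypt/decrypt steps referred to above reduce to modular exponentiation; the following is a textbook toy walk-through with tiny illustrative parameters (no padding, not secure), and it does not model the FPGA datapath or the AES part of the work.

        p, q = 61, 53                      # small primes, for illustration only
        n = p * q                          # modulus: 3233
        phi = (p - 1) * (q - 1)            # 3120
        e = 17                             # public exponent, coprime with phi
        d = pow(e, -1, phi)                # private exponent: modular inverse of e

        plain = 42
        cipher = pow(plain, e, n)          # encryption: m^e mod n
        restored = pow(cipher, d, n)       # decryption: c^d mod n
        assert restored == plain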

  8. Implementation of digital equality comparator circuit on memristive memory crossbar array using material implication logic

    NASA Astrophysics Data System (ADS)

    Haron, Adib; Mahdzair, Fazren; Luqman, Anas; Osman, Nazmie; Junid, Syed Abdul Mutalib Al

    2018-03-01

    One of the most significant constraints of the Von Neumann architecture is the limited bandwidth between memory and processor. The cost of moving data back and forth between memory and processor is considerably higher than that of the computation in the processor itself. This constraint significantly impacts Big Data and data-intensive applications such as DNA analysis and comparison, which spend most of their processing time moving data. Recently, the in-memory processing concept was proposed, which is based on the capability to perform logic operations on the physical memory structure using a crossbar topology and non-volatile resistive-switching memristor technology. This paper proposes a scheme to map a digital equality comparator circuit onto a memristive memory crossbar array. The 2-bit, 4-bit, 8-bit, 16-bit, 32-bit, and 64-bit equality comparator circuits are mapped onto the memristive memory crossbar array using material implication logic in sequential and parallel methods. The simulation results show that, for the 64-bit word size, the parallel mapping exhibits 2.8× better performance in total execution time than the sequential mapping but has a trade-off in terms of energy consumption and area utilization. Meanwhile, the total crossbar area can be reduced by 1.2× for sequential mapping and 1.5× for parallel mapping when the overlapping technique is used.
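
    In Boolean terms, the reduction of an equality comparator to material implication (IMPLY) steps can be sketched as below; this shows only the logic decomposition (XNOR per bit pair, then an AND reduction, both built from IMPLY and the constant 0), not the crossbar sequencing, mapping, or energy model of the paper.

        def imply(p, q):            # material implication: p -> q
            return 1 if (p == 0 or q == 1) else 0

        def xnor(a, b):             # a XNOR b built only from IMPLY and the constant 0
            x, y = imply(a, b), imply(b, a)
            return imply(imply(x, imply(y, 0)), 0)      # NOT(x -> NOT y) == x AND y

        def equal(word_a, word_b):  # n-bit equality: AND of the per-bit XNORs
            result = 1
            for a, b in zip(word_a, word_b):
                bit_eq = xnor(a, b)
                result = imply(imply(result, imply(bit_eq, 0)), 0)   # result AND bit_eq
            return result

        assert equal([1, 0, 1, 1], [1, 0, 1, 1]) == 1
        assert equal([1, 0, 1, 1], [1, 1, 1, 1]) == 0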

  9. Analog front-end design of the STS/MUCH-XYTER2—full size prototype ASIC for the CBM experiment

    NASA Astrophysics Data System (ADS)

    Kleczek, Rafal

    2017-01-01

    The design of the analog front-end of the STS/MUCH-XYTER2 ASIC, a full-size prototype chip for the Silicon Tracking System (STS, based on double-sided silicon strip sensors) and Muon Chamber (MUCH, based on gas sensors) detectors is presented. The ASIC contains 128 charge processing channels, each built of a charge sensitive amplifier, a polarity selection circuit and two pulse shaping amplifiers forming two parallel signal paths. The first path is used for timing measurement with a fast discriminator. The second path allows low-noise amplitude measurement with a 5-bit continuous-time flash ADC. Different operating conditions and constraints posed by two target detectors' applications require front-end electronics flexibility to meet extended system-wise requirements. The presented circuit implements switchable shaper peaking time, gain switching and trimming, input amplifier pulsed reset circuit, fail-safe measures. The power consumption is scalable (for the STS and the MUCH modes), but limited to 10 mW/channel.

  10. Vectorization for Molecular Dynamics on Intel Xeon Phi Coprocessors

    NASA Astrophysics Data System (ADS)

    Yi, Hongsuk

    2014-03-01

    Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512-bit vector registers for high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with Tersoff potentials for covalently bonded solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism, combining tightly coupled thread-level and task-level parallelism with the 512-bit vector registers. The simulation results show that the parallel performance of the SIMD implementation on the Xeon Phi is clearly superior to that of the x86 CPU architecture.

  11. Rate distortion optimal bit allocation methods for volumetric data using JPEG 2000.

    PubMed

    Kosheleva, Olga M; Usevitch, Bryan E; Cabrera, Sergio D; Vidal, Edward

    2006-08-01

    Computer modeling programs that generate three-dimensional (3-D) data on fine grids are capable of generating very large amounts of information. These data sets, as well as 3-D sensor/measured data sets, are prime candidates for the application of data compression algorithms. A very flexible and powerful compression algorithm for imagery data is the newly released JPEG 2000 standard. JPEG 2000 also has the capability to compress volumetric data, as described in Part 2 of the standard, by treating the 3-D data as separate slices. As a decoder standard, JPEG 2000 does not prescribe any specific method to allocate bits among the separate slices. This paper proposes two new bit allocation algorithms for accomplishing this task. The first procedure is rate-distortion optimal (for mean squared error), and is conceptually similar to the postcompression rate-distortion optimization used for coding codeblocks within JPEG 2000. The disadvantage of this approach is its high computational complexity. The second bit allocation algorithm, here called the mixed-model (MM) approach, mathematically models each slice's rate-distortion curve using two distinct regions to get more accurate modeling at low bit rates. These two bit allocation algorithms are applied to a 3-D meteorological data set. Test results show that the MM approach gives distortion results that are nearly identical to those of the optimal approach, while significantly reducing computational complexity.
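
    A generic greedy marginal-analysis allocation across slices, standing in for neither of the two methods above but illustrating the underlying idea, is sketched below with an assumed (purely illustrative) exponential distortion-rate model per slice: each small rate increment goes to the slice whose modeled distortion drops the most.

        import numpy as np

        def allocate(total_bits, n_slices, step=1000):
            d0 = np.full(n_slices, 1.0)                           # hypothetical D(0) per slice
            alpha = np.linspace(1e-4, 3e-4, n_slices)             # hypothetical decay rates
            distortion = lambda r: d0 * np.exp(-alpha * r)        # toy D(R) model per slice
            rates = np.zeros(n_slices)
            for _ in range(int(total_bits // step)):
                gain = distortion(rates) - distortion(rates + step)   # marginal benefit
                rates[np.argmax(gain)] += step                        # greedy assignment
            return rates

        print(allocate(total_bits=50_000, n_slices=5))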

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Reed, D.A.; Grunwald, D.C.

    The spectrum of parallel processor designs can be divided into three sections according to the number and complexity of the processors. At one end there are simple, bit-serial processors. Any one of these processors is of little value, but when it is coupled with many others, the aggregate computing power can be large. This approach to parallel processing can be likened to a colony of termites devouring a log. The most notable examples of this approach are the NASA/Goodyear Massively Parallel Processor, which has 16K one-bit processors, and the Thinking Machines Connection Machine, which has 64K one-bit processors. At the other end of the spectrum, a small number of processors, each built using the fastest available technology and the most sophisticated architecture, are combined. An example of this approach is the Cray X-MP. This type of parallel processing is akin to four woodmen attacking the log with chainsaws.

  13. Steganography on quantum pixel images using Shannon entropy

    NASA Astrophysics Data System (ADS)

    Laurel, Carlos Ortega; Dong, Shi-Hai; Cruz-Irisson, M.

    2016-07-01

    This paper presents a steganographic algorithm based on the least significant bit (LSB) derived from the most significant bit information (MSBI) and on the equivalence of a bit pixel image to a quantum pixel image, which permits information to be communicated secretly through quantum pixel images for secure transmission over insecure channels. This algorithm offers higher security since it exploits the Shannon entropy of an image.
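    The classical-image half of this idea is easy to sketch: the snippet below hides one message bit in the least significant bit of each 8-bit pixel and recovers it again. The quantum pixel image encoding and the MSBI/Shannon-entropy analysis from the paper are not modelled; the pixel values and message are arbitrary examples.

```python
def embed_lsb(pixels, message_bits):
    """Write one message bit into the LSB of each pixel value (classical sketch)."""
    stego = list(pixels)
    for i, bit in enumerate(message_bits):
        stego[i] = (stego[i] & ~1) | bit      # clear the LSB, then store the message bit
    return stego

def extract_lsb(pixels, n_bits):
    """Read the hidden bits back out of the pixel LSBs."""
    return [p & 1 for p in pixels[:n_bits]]

cover = [52, 55, 61, 66, 70, 61, 64, 73]      # example 8-bit intensities
secret = [1, 0, 1, 1, 0, 1, 0, 0]
stego = embed_lsb(cover, secret)
print(extract_lsb(stego, len(secret)) == secret)   # True
```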

  14. Injecting Errors for Testing Built-In Test Software

    NASA Technical Reports Server (NTRS)

    Gender, Thomas K.; Chow, James

    2010-01-01

    Two algorithms have been conceived to enable automated, thorough testing of Built-in test (BIT) software. The first algorithm applies to BIT routines that define pass/fail criteria based on values of data read from such hardware devices as memories, input ports, or registers. This algorithm simulates the effects of errors in a device under test by (1) intercepting data from the device and (2) performing AND operations between the data and the data mask specific to the device. This operation yields values not expected by the BIT routine. This algorithm entails very small, permanent instrumentation of the software under test (SUT) for performing the AND operations. The second algorithm applies to BIT programs that provide services to user application programs via commands or callable interfaces, and requires a capability for test-driver software to read and write the memory used in execution of the SUT. This algorithm identifies all SUT code execution addresses where errors are to be injected, then temporarily replaces the code at those addresses with small test code sequences to inject latent severe errors, and then determines whether, as desired, the SUT detects the errors and recovers.
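    A minimal sketch of the first algorithm's data-masking idea follows; the device address, mask value, and `read_register` helper are hypothetical stand-ins for the real hardware access layer, and the permanent in-code instrumentation described in the abstract is reduced to a single conditional.

```python
def read_register(address):
    """Hypothetical stand-in for a real hardware read."""
    return 0b1111_0000

ERROR_MASKS = {0x4000_1000: 0b1011_0000}   # hypothetical device address -> error mask

def read_register_with_injection(address, inject=False):
    """Intercept the device data and AND it with the device-specific mask so the
    BIT routine sees a value it does not expect (simulated device error)."""
    value = read_register(address)
    if inject and address in ERROR_MASKS:
        value &= ERROR_MASKS[address]
    return value

print(bin(read_register_with_injection(0x4000_1000)))               # 0b11110000 (healthy)
print(bin(read_register_with_injection(0x4000_1000, inject=True)))  # 0b10110000 (injected error)
```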

  15. Negative base encoding in optical linear algebra processors

    NASA Technical Reports Server (NTRS)

    Perlee, C.; Casasent, D.

    1986-01-01

    In the digital multiplication by analog convolution algorithm, the bits of two encoded numbers are convolved to form the product of the two numbers in mixed binary representation; this output can be easily converted to binary. Attention is presently given to negative base encoding, treating base -2 initially, and then showing that the negative base system can be readily extended to any radix. In general, negative base encoding in optical linear algebra processors represents a more efficient technique than either sign magnitude or 2's complement encoding, when the additions of digitally encoded products are performed in parallel.
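    A short sketch of base -2 (negabinary) encoding, the initial case treated above, is given below; one attraction of a negative radix is that positive and negative integers are represented without a separate sign digit. The conversion loop is standard; extending it to other negative radices only changes the divisor.

```python
def to_negabinary(n):
    """Encode an integer (positive, negative, or zero) in base -2,
    most significant digit first."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:            # force the remainder into {0, 1}
            n += 1
            r += 2
        digits.append(str(r))
    return "".join(reversed(digits))

for v in (6, -6, 13):
    print(v, "->", to_negabinary(v))   # 6 -> 11010, -6 -> 1110, 13 -> 11101
```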

  16. A novel image encryption algorithm using chaos and reversible cellular automata

    NASA Astrophysics Data System (ADS)

    Wang, Xingyuan; Luan, Dapeng

    2013-11-01

    In this paper, a novel image encryption scheme is proposed based on reversible cellular automata (RCA) combined with chaos. In this algorithm, an intertwining logistic map with complex behavior and periodic-boundary reversible cellular automata are used. We split each pixel of the image into units of 4 bits, then adopt a pseudorandom key stream generated by the intertwining logistic map to permute these units in the confusion stage. In the diffusion stage, two-dimensional reversible cellular automata, which are discrete dynamical systems, are iterated for many rounds to achieve diffusion at the bit level; only the higher 4 bits of each pixel are considered, because the higher 4 bits carry almost all of the information of an image. Theoretical analysis and experimental results demonstrate that the proposed algorithm achieves a high security level and shows good performance against common attacks such as the differential attack and the statistical attack. This algorithm belongs to the class of symmetric systems.
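    The confusion stage described above can be sketched as follows: each 8-bit pixel is split into two 4-bit units and the units are permuted with a key-dependent ordering derived from a chaotic keystream. For simplicity the plain logistic map stands in for the paper's intertwining logistic map, and the reversible-cellular-automata diffusion stage is omitted; the map parameters and example pixels are arbitrary.

```python
import numpy as np

def logistic_keystream(n, x0=0.3141, r=3.99):
    """Keystream from the plain logistic map x -> r*x*(1-x) (a stand-in for the
    intertwining logistic map used in the paper)."""
    xs, x = np.empty(n), x0
    for i in range(n):
        x = r * x * (1.0 - x)
        xs[i] = x
    return xs

def confuse(pixels, keystream):
    """Split pixels into 4-bit units and permute the units by the keystream order."""
    pixels = np.asarray(pixels, dtype=np.uint8)
    units = np.empty(pixels.size * 2, dtype=np.uint8)
    units[0::2], units[1::2] = pixels >> 4, pixels & 0x0F   # high and low nibbles
    perm = np.argsort(keystream)                            # key-dependent permutation
    return units[perm], perm

def deconfuse(units_permuted, perm):
    """Invert the permutation and reassemble the 8-bit pixels."""
    units = np.empty_like(units_permuted)
    units[perm] = units_permuted
    return (units[0::2] << 4) | units[1::2]

img = [23, 200, 145, 7]
units, perm = confuse(img, logistic_keystream(len(img) * 2))
print([int(v) for v in deconfuse(units, perm)])   # [23, 200, 145, 7]
```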

  17. Dual Super-Systolic Core for Real-Time Reconstructive Algorithms of High-Resolution Radar/SAR Imaging Systems

    PubMed Central

    Atoche, Alejandro Castillo; Castillo, Javier Vázquez

    2012-01-01

    A high-speed dual super-systolic core for reconstructive signal processing (SP) operations consists of a double parallel systolic array (SA) machine in which each processing element of the array is also conceptualized as another SA in a bit-level fashion. In this study, we addressed the design of a high-speed dual super-systolic array (SSA) core for the enhancement/reconstruction of remote sensing (RS) imaging of radar/synthetic aperture radar (SAR) sensor systems. The selected reconstructive SP algorithms are efficiently transformed into their parallel representation and then mapped onto an efficient high performance embedded computing (HPEC) architecture in reconfigurable Xilinx field programmable gate array (FPGA) platforms. As an implementation test case, the proposed approach was integrated into a HW/SW co-design scheme in order to solve the nonlinear ill-posed inverse problem of nonparametric estimation of the power spatial spectrum pattern (SSP) from a remotely sensed scene. We show how such a dual SSA core drastically reduces the computational load of complex RS regularization techniques, achieving the required real-time operational mode. PMID:22736964

  18. A Fast and Accurate Algorithm for l1 Minimization Problems in Compressive Sampling (Preprint)

    DTIC Science & Technology

    2013-01-22

    However, updating u_{k+1} via the formulation of Step 2 in Algorithm 1 can be implemented through the use of the component-wise Gauss-Seidel iteration which ... may accelerate the rate of convergence of the algorithm and therefore reduce the total CPU time consumed. The efficiency of component-wise Gauss-Seidel ... Micchelli, L. Shen, and Y. Xu, A proximity algorithm accelerated by Gauss-Seidel iterations for L1/TV denoising models, Inverse Problems, 28 (2012), p

  19. Microfluidic Pneumatic Logic Circuits and Digital Pneumatic Microprocessors for Integrated Microfluidic Systems

    PubMed Central

    Rhee, Minsoung

    2010-01-01

    We have developed pneumatic logic circuits and microprocessors built with microfluidic channels and valves in polydimethylsiloxane (PDMS). The pneumatic logic circuits perform various combinational and sequential logic calculations with binary pneumatic signals (atmosphere and vacuum), producing cascadable outputs based on Boolean operations. A complex microprocessor is constructed from combinations of various logic circuits and receives pneumatically encoded serial commands at a single input line. The device then decodes the temporal command sequence by spatial parallelization, computes necessary logic calculations between parallelized command bits, stores command information for signal transportation and maintenance, and finally executes the command for the target devices. Thus, such pneumatic microprocessors will function as a universal on-chip control platform to perform complex parallel operations for large-scale integrated microfluidic devices. To demonstrate the working principles, we have built 2-bit, 3-bit, 4-bit, and 8-bit microprocessors to control various target devices for applications such as four-color dye mixing and multiplexed channel fluidic control. By significantly reducing the need for external controllers, the digital pneumatic microprocessor can be used as a universal on-chip platform to autonomously manipulate microfluids in a high throughput manner. PMID:19823730

  20. Adaptive intercolor error prediction coder for lossless color (RGB) picture compression

    NASA Astrophysics Data System (ADS)

    Mann, Y.; Peretz, Y.; Mitchell, Harvey B.

    2001-09-01

    Most of the current lossless compression algorithms, including the new international baseline JPEG-LS algorithm, do not exploit the interspectral correlations that exist between the color planes in an input color picture. To improve the compression performance (i.e., lower the bit rate) it is necessary to exploit these correlations. A major concern is to find efficient methods for exploiting the correlations that, at the same time, are compatible with and can be incorporated into the JPEG-LS algorithm. One such algorithm is the method of intercolor error prediction (IEP), which, when used with the JPEG-LS algorithm, results on average in a reduction of 8% in the overall bit rate. We show how the IEP algorithm can be simply modified so that it nearly doubles the reduction in bit rate, to 15%.

  1. Parallel seed-based approach to multiple protein structure similarities detection

    DOE PAGES

    Chapuis, Guillaume; Le Boudic-Jamin, Mathilde; Andonov, Rumen; ...

    2015-01-01

    Finding similarities between protein structures is a crucial task in molecular biology. Most of the existing tools require proteins to be aligned in an order-preserving way and only find single alignments even when multiple similar regions exist. We propose a new seed-based approach that discovers multiple pairs of similar regions. Its computational complexity is polynomial and it comes with a quality guarantee—the returned alignments have both root mean squared deviations (coordinate-based as well as internal-distances based) lower than a given threshold, if such exist. We do not require the alignments to be order preserving (i.e., we consider nonsequential alignments), which makes our algorithm suitable for detecting similar domains when comparing multidomain proteins as well as to detect structural repetitions within a single protein. Because the search space for nonsequential alignments is much larger than for sequential ones, the computational burden is addressed by extensive use of parallel computing techniques: a coarse-grain level parallelism making use of available CPU cores for computation and a fine-grain level parallelism exploiting bit-level concurrency as well as vector instructions.

  2. Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

    2017-12-01

    As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.

  3. A Compression Algorithm for Field Programmable Gate Arrays in the Space Environment

    DTIC Science & Technology

    2011-12-01

    P = Bit_34 ⊕ Bit_33 ⊕ ⋯ ⊕ Bit_1 ⊕ Bit_0. (V.3) Equation (V.3) is implemented with a string of XOR gates and Bit Basher blocks, as shown in Figure 31. As discussed in ... [5], the string of Bit Basher blocks is used to separate each 35-bit value into 35 one-bit values, and the string of XOR gates is used to
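    In software terms, separating a 35-bit value into its individual bits and folding them through a chain of XOR gates computes the word's parity; a small sketch of that reduction is shown below (the actual FPGA design uses dedicated Bit Basher and XOR blocks, which are not modelled here).

```python
def parity_35(value):
    """XOR of the 35 individual bits of a 35-bit value (even-parity bit)."""
    p = 0
    for i in range(35):
        p ^= (value >> i) & 1
    return p

print(parity_35(0b101), parity_35(0b111))   # 0 (two ones), 1 (three ones)
```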

  4. Two-pass imputation algorithm for missing value estimation in gene expression time series.

    PubMed

    Tsiporkova, Elena; Boeva, Veselka

    2007-10-01

    Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides for a choice between three different initial rough imputation methods.
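    The core DTW distance that the imputation algorithms build on can be sketched with the classic dynamic program below; the paper's DTWimpute variants add missing-value handling, candidate-profile selection, and the position-wise/neighborhood-wise/two-pass estimation steps on top of this, none of which are shown. The two example profiles are arbitrary.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming DTW distance between two
    1-D expression profiles (minimal sketch)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

profile_a = [0.1, 0.4, 0.8, 0.7, 0.3]
profile_b = [0.1, 0.1, 0.5, 0.9, 0.6, 0.3]   # similar shape, shifted in time
print(dtw_distance(profile_a, profile_b))
```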

  5. FIVQ algorithm for interference hyper-spectral image compression

    NASA Astrophysics Data System (ADS)

    Wen, Jia; Ma, Caiwen; Zhao, Junsuo

    2014-07-01

    Based on the improved vector quantization (IVQ) algorithm [1] proposed in 2012, this paper proposes a further improved vector quantization (FIVQ) algorithm for LASIS (Large Aperture Static Imaging Spectrometer) interference hyper-spectral image compression. To get better image quality, the IVQ algorithm takes both the mean values and the VQ indices as the encoding rules. Although the IVQ algorithm can improve both the bit rate and the image quality, it can be improved further to reach a much lower bit rate for the LASIS interference pattern, whose special optical characteristics arise from the pushing and sweeping of the LASIS imaging principle. In the proposed FIVQ algorithm, the neighborhood of the encoding blocks of the interference pattern image that use the mean-value rule is checked to determine whether it has the same mean value as the current processing block. Experiments show the proposed FIVQ algorithm achieves a lower bit rate than the IVQ algorithm for LASIS interference hyper-spectral sequences.

  6. A visual parallel-BCI speller based on the time-frequency coding strategy.

    PubMed

    Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong

    2014-04-01

    Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper aims to develop a visual parallel-BCI speller system based on the time-frequency coding strategy, in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, so that all characters had a specific time-frequency code. To verify its effectiveness, 11 subjects were involved in the offline and online spellings. A classification strategy was designed to recognize the target character through jointly using the canonical correlation analysis and stepwise linear discriminant analysis. Online spellings showed that the proposed parallel-BCI speller had a high performance, reaching the highest information transfer rate of 67.4 bit min^-1, with an average of 54.0 bit min^-1 and 43.0 bit min^-1 in the three rounds and five rounds, respectively. The results indicated that the proposed parallel-BCI could be effectively controlled by users, with attention shifting fluently among the sub-spellers, and highly improved the BCI spelling performance.

  7. An adaptive bit synchronization algorithm under time-varying environment.

    NASA Technical Reports Server (NTRS)

    Chow, L. R.; Owen, H. A., Jr.; Wang, P. P.

    1973-01-01

    This paper presents an adaptive estimation algorithm for bit synchronization, assuming that the parameters of the incoming data process are time-varying. Experimental results have proved that this synchronizer is workable, whether judged by the amount of data required or by the speed of convergence.

  8. Image Steganography In Securing Sound File Using Arithmetic Coding Algorithm, Triple Data Encryption Standard (3DES) and Modified Least Significant Bit (MLSB)

    NASA Astrophysics Data System (ADS)

    Nasution, A. B.; Efendi, S.; Suwilo, S.

    2018-04-01

    The amount of data inserted in the form of audio samples using 8 bits with the LSB algorithm affects the PSNR value, which results in changes to the image quality after insertion (fidelity). In this research, audio samples are therefore inserted using 5 bits with the MLSB algorithm to reduce the amount of inserted data, after the audio samples have first been compressed with the Arithmetic Coding algorithm to reduce file size. In this research, encryption with the Triple DES algorithm is also applied to better secure the audio samples. The result of this research is a PSNR value of more than 50 dB, so it can be concluded that the image quality is still good because the PSNR value has exceeded 40 dB.

  9. Development of an Innovative Algorithm for Aerodynamics-Structure Interaction Using Lattice Boltzmann Method

    NASA Technical Reports Server (NTRS)

    Mei, Ren-Wei; Shyy, Wei; Yu, Da-Zhi; Luo, Li-Shi; Rudy, David (Technical Monitor)

    2001-01-01

    The lattice Boltzmann equation (LBE) is a kinetic formulation which offers an alternative computational method capable of solving fluid dynamics for various systems. Major advantages of the method are owing to the fact that the solution for the particle distribution functions is explicit, easy to implement, and the algorithm is natural to parallelize. In this final report, we summarize the works accomplished in the past three years. Since most works have been published, the technical details can be found in the literature. Brief summary will be provided in this report. In this project, a second-order accurate treatment of boundary condition in the LBE method is developed for a curved boundary and tested successfully in various 2-D and 3-D configurations. To evaluate the aerodynamic force on a body in the context of LBE method, several force evaluation schemes have been investigated. A simple momentum exchange method is shown to give reliable and accurate values for the force on a body in both 2-D and 3-D cases. Various 3-D LBE models have been assessed in terms of efficiency, accuracy, and robustness. In general, accurate 3-D results can be obtained using LBE methods. The 3-D 19-bit model is found to be the best one among the 15-bit, 19-bit, and 27-bit LBE models. To achieve desired grid resolution and to accommodate the far field boundary conditions in aerodynamics computations, a multi-block LBE method is developed by dividing the flow field into various blocks each having constant lattice spacing. Substantial contribution to the LBE method is also made through the development of a new, generalized lattice Boltzmann equation constructed in the moment space in order to improve the computational stability, detailed theoretical analysis on the stability, dispersion, and dissipation characteristics of the LBE method, and computational studies of high Reynolds number flows with singular gradients. Finally, a finite difference-based lattice Boltzmann method is developed for inviscid compressible flows.

  10. Bit error rate tester using fast parallel generation of linear recurring sequences

    DOEpatents

    Pierson, Lyndon G.; Witzke, Edward L.; Maestas, Joseph H.

    2003-05-06

    A fast method for generating linear recurring sequences by parallel linear recurring sequence generators (LRSGs) with a feedback circuit optimized to balance minimum propagation delay against maximal sequence period. Parallel generation of linear recurring sequences requires decimating the sequence (creating small contiguous sections of the sequence in each LRSG). A companion matrix form is selected depending on whether the LFSR is right-shifting or left-shifting. The companion matrix is completed by selecting a primitive irreducible polynomial with 1's most closely grouped in a corner of the companion matrix. A decimation matrix is created by raising the companion matrix to the (n*k)th power, where k is the number of parallel LRSGs and n is the number of bits to be generated at a time by each LRSG. Companion matrices with 1's closely grouped in a corner will yield sparse decimation matrices. A feedback circuit comprised of XOR logic gates implements the decimation matrix in hardware. Sparse decimation matrices can be implemented with a minimum number of XOR gates, and therefore a minimum propagation delay through the feedback circuit. The LRSG of the invention is particularly well suited to use as a bit error rate tester on high speed communication lines because it permits the receiver to synchronize to the transmitted pattern within 2n bits.
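    The decimation algebra behind this scheme can be sketched in software: advancing an LFSR by k steps is the same as multiplying its state by the k-th power of the state-update (companion-type) matrix over GF(2), so k generators seeded at successive offsets and advanced by that matrix each produce every k-th bit of the sequence. The example polynomial (x^4 + x + 1), seed, and chunk sizes are arbitrary, and the hardware concerns of the patent (sparse matrices, XOR-gate feedback, propagation delay) are not modelled.

```python
import numpy as np

def lfsr_step_matrix(taps, degree):
    """Single-step state-update matrix over GF(2) for a Fibonacci LFSR whose
    next bit is the XOR of the state positions listed in `taps`."""
    M = np.zeros((degree, degree), dtype=np.uint8)
    for i in range(degree - 1):
        M[i, i + 1] = 1                  # shift: s'_i = s_{i+1}
    for t in taps:
        M[degree - 1, t] = 1             # feedback: s'_{d-1} = XOR of tapped bits
    return M

def mat_pow_gf2(M, e):
    """Matrix power with all arithmetic reduced modulo 2."""
    R = np.eye(M.shape[0], dtype=np.uint8)
    while e:
        if e & 1:
            R = (R @ M) % 2
        M = (M @ M) % 2
        e >>= 1
    return R

degree, taps = 4, (0, 1)                 # recurrence for x^4 + x + 1 (a primitive polynomial)
M = lfsr_step_matrix(taps, degree)
s0 = np.array([1, 0, 0, 0], dtype=np.uint8)

# Serial reference: output the lowest state bit, then advance one step at a time.
state, serial = s0.copy(), []
for _ in range(30):
    serial.append(int(state[0]))
    state = (M @ state) % 2

# Parallel generation by decimation: generator i is seeded with M^i * s0 and
# advanced by the decimation matrix M^k, so it emits every k-th bit.
k = 3
Dk = mat_pow_gf2(M, k)
streams = []
for i in range(k):
    st = (mat_pow_gf2(M, i) @ s0) % 2
    bits = []
    for _ in range(10):
        bits.append(int(st[0]))
        st = (Dk @ st) % 2
    streams.append(bits)

interleaved = [streams[j % k][j // k] for j in range(30)]
print(serial == interleaved)             # True: the decimated streams reproduce the sequence
```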

  11. A Subsystem Test Bed for Chinese Spectral Radioheliograph

    NASA Astrophysics Data System (ADS)

    Zhao, An; Yan, Yihua; Wang, Wei

    2014-11-01

    The Chinese Spectral Radioheliograph (CSRH) is a solar-dedicated radio interferometric array that will produce high spatial resolution, high temporal resolution, and high spectral resolution images of the Sun simultaneously in the decimetre and centimetre wave range. Digital processing of the intermediate frequency (IF) signal is an important part of a radio telescope. This paper describes a flexible, high-speed digital down conversion (DDC) system for the CSRH that applies complex mixing, parallel filtering, and decimation algorithms to process the IF signal, and incorporates canonic-signed-digit coding and a bit-plane method to improve program efficiency. The DDC system is intended to be a subsystem test bed for simulation and testing of the CSRH. Software algorithms for simulation and FPGA-based hardware-language algorithms are written that use fewer hardware resources while achieving high performance, such as processing a high-speed data flow (1 GHz) with 10 MHz spectral resolution. An experiment with the test bed is illustrated using geostationary satellite data observed on March 20, 2014. Because the algorithms on the FPGA are easy to alter, the data can be recomputed with different digital signal processing algorithms to select the optimum algorithm.
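    The three DDC stages named above (complex mixing, low-pass filtering, decimation) can be sketched in floating point as below; the sample rate, centre frequency, filter length, and test tone are illustrative assumptions, and none of the FPGA-specific optimizations (canonic-signed-digit coding, bit-plane method, polyphase/parallel filtering) are modelled.

```python
import numpy as np

fs = 1.0e9              # assumed input sample rate (1 GHz data flow)
f_center = 250.0e6      # assumed centre frequency of the band of interest
decimation = 100        # 1 GHz / 100 = 10 MHz output rate

t = np.arange(100_000) / fs
signal_in = np.cos(2 * np.pi * (f_center + 3.0e6) * t)   # test tone 3 MHz above centre

# 1) complex mixing: shift the band of interest down to baseband
baseband = signal_in * np.exp(-2j * np.pi * f_center * t)

# 2) low-pass filtering (windowed-sinc FIR) to isolate the baseband channel
num_taps = 255
n_idx = np.arange(num_taps) - (num_taps - 1) / 2
fc = 5.0e6 / fs                                          # normalized cutoff (cycles/sample)
taps = 2 * fc * np.sinc(2 * fc * n_idx) * np.hamming(num_taps)
filtered = np.convolve(baseband, taps, mode="same")

# 3) decimation down to the lower output rate
out = filtered[::decimation]
print(out.shape, "complex samples at", fs / decimation / 1e6, "MHz")
```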

  12. Bit Grooming: statistically accurate precision-preserving quantization with compression, evaluated in the netCDF Operators (NCO, v4.4.8+)

    NASA Astrophysics Data System (ADS)

    Zender, Charles S.

    2016-09-01

    Geoscientific models and measurements generate false precision (scientifically meaningless data bits) that wastes storage space. False precision can mislead (by implying noise is signal) and be scientifically pointless, especially for measurements. By contrast, lossy compression can be both economical (save space) and heuristic (clarify data limitations) without compromising the scientific integrity of data. Data quantization can thus be appropriate regardless of whether space limitations are a concern. We introduce, implement, and characterize a new lossy compression scheme suitable for IEEE floating-point data. Our new Bit Grooming algorithm alternately shaves (to zero) and sets (to one) the least significant bits of consecutive values to preserve a desired precision. This is a symmetric, two-sided variant of an algorithm sometimes called Bit Shaving that quantizes values solely by zeroing bits. Our variation eliminates the artificial low bias produced by always zeroing bits, and makes Bit Grooming more suitable for arrays and multi-dimensional fields whose mean statistics are important. Bit Grooming relies on standard lossless compression to achieve the actual reduction in storage space, so we tested Bit Grooming by applying the DEFLATE compression algorithm to bit-groomed and full-precision climate data stored in netCDF3, netCDF4, HDF4, and HDF5 formats. Bit Grooming reduces the storage space required by initially uncompressed and compressed climate data by 25-80 and 5-65 %, respectively, for single-precision values (the most common case for climate data) quantized to retain 1-5 decimal digits of precision. The potential reduction is greater for double-precision datasets. When used aggressively (i.e., preserving only 1-2 digits), Bit Grooming produces storage reductions comparable to other quantization techniques such as Linear Packing. Unlike Linear Packing, whose guaranteed precision rapidly degrades within the relatively narrow dynamic range of values that it can compress, Bit Grooming guarantees the specified precision throughout the full floating-point range. Data quantization by Bit Grooming is irreversible (i.e., lossy) yet transparent, meaning that no extra processing is required by data users/readers. Hence Bit Grooming can easily reduce data storage volume without sacrificing scientific precision or imposing extra burdens on users.
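    A minimal sketch of the shave/set idea for float32 arrays follows; the mapping from decimal digits to retained mantissa bits uses the usual ceil(digits * log2(10)) rule, which approximates but does not exactly reproduce the NCO implementation, and the subsequent lossless (e.g. DEFLATE) step is omitted.

```python
import numpy as np

def bit_groom(values, decimal_digits):
    """Alternately shave (zero) and set (one) the low-order mantissa bits of
    consecutive float32 values, keeping enough bits for the requested precision."""
    keep = int(np.ceil(decimal_digits * np.log2(10)))   # mantissa bits to preserve
    drop = max(0, 23 - keep)                            # float32 has 23 explicit mantissa bits
    bits = np.asarray(values, dtype=np.float32).view(np.uint32).copy()
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)      # keeps the high-order bits
    tail = np.uint32((1 << drop) - 1)                   # the bits being quantized away
    bits[0::2] &= mask                                  # shave: dropped bits -> 0
    bits[1::2] |= tail                                  # set:   dropped bits -> 1
    return bits.view(np.float32)

data = np.array([1.234567, 2.345678, 3.456789, 4.567891], dtype=np.float32)
print(bit_groom(data, 3))   # agrees with the originals to roughly 3 decimal digits
```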

  13. A generalized memory test algorithm

    NASA Technical Reports Server (NTRS)

    Milner, E. J.

    1982-01-01

    A general algorithm for testing digital computer memory is presented. The test checks that (1) every bit can be cleared and set in each memory word, and (2) bits are not erroneously cleared and/or set elsewhere in memory at the same time. The algorithm can be applied to any size memory block and any size memory word. It is concise and efficient, requiring very few cycles through memory. For example, a test of 16-bit-word-size memory requires only 384 cycles through memory. Approximately 15 seconds were required to test a 32K block of such memory, using a microcomputer having a cycle time of 133 nanoseconds.
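    The two properties listed above can be checked with the simple (not cycle-optimized) simulation below: every bit of every word is set and cleared, and the rest of memory is verified to be undisturbed. The paper's algorithm achieves this in far fewer passes; here `memory` is just a Python list standing in for the physical block, and the word size is a parameter.

```python
def test_memory(memory, word_bits):
    """Check that every bit of every word can be set and cleared, and that no
    other word changes while a word is being exercised (naive O(n^2) sketch)."""
    n = len(memory)
    for addr in range(n):                 # establish a known background
        memory[addr] = 0
    for addr in range(n):
        for bit in range(word_bits):
            memory[addr] = 1 << bit       # set a single bit ...
            if memory[addr] != 1 << bit:
                return False, (addr, bit, "bit stuck clear")
            memory[addr] = 0              # ... then clear it again
            if memory[addr] != 0:
                return False, (addr, bit, "bit stuck set")
        if any(memory[other] != 0 for other in range(n) if other != addr):
            return False, (addr, None, "another word was disturbed")
    return True, None

ok, fault = test_memory([0] * 32, word_bits=16)
print("memory OK" if ok else f"fault: {fault}")
```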

  14. Time-space modal logic for verification of bit-slice circuits

    NASA Astrophysics Data System (ADS)

    Hiraishi, Hiromi

    1996-03-01

    The major goal of this paper is to propose a new modal logic aimed at formal verification of bit-slice circuits. The new logic is called time-space modal logic, and its major feature is that it can handle two transition relations: one for time transitions and the other for space transitions. As for a verification algorithm, a symbolic model checking algorithm for the new logic is shown. This could be applicable to the verification of bit-slice microprocessors of infinite bit width and 1D systolic arrays of infinite length. A simple benchmark result shows the effectiveness of the proposed approach.

  15. A 1-channel 3-band wide dynamic range compression chip for vibration transducer of implantable hearing aids.

    PubMed

    Kim, Dongwook; Seong, Kiwoong; Kim, Myoungnam; Cho, Jinho; Lee, Jyunghyun

    2014-01-01

    In this paper, a digital audio processing chip that uses a wide dynamic range compression (WDRC) algorithm is designed and implemented for an implantable hearing aid system. The designed chip operates at a single voltage of 3.3 V and drives 16-bit parallel input and output at a 32 kHz sample rate. The chip has a 1-channel, 3-band WDRC composed of an FIR filter bank, a level detector, and a compression part. To verify the performance of the designed chip, we measured the frequency separation of the bands and the compression gain control that reflects the hearing threshold level.

  16. Faster Bit-Parallel Algorithms for Unordered Pseudo-tree Matching and Tree Homeomorphism

    NASA Astrophysics Data System (ADS)

    Kaneta, Yusaku; Arimura, Hiroki

    In this paper, we consider the unordered pseudo-tree matching problem, which is a problem of, given two unordered labeled trees P and T, finding all occurrences of P in T via such many-one embeddings that preserve node labels and parent-child relationship. This problem is closely related to tree pattern matching problem for XPath queries with child axis only. If m > w, we present an efficient algorithm that solves the problem in O(nm log(w)/w) time using O(hm/w + m log(w)/w) space and O(m log(w)) preprocessing on a unit-cost arithmetic RAM model with addition, where m is the number of nodes in P, n is the number of nodes in T, h is the height of T, and w is the word length. We also discuss a modification of our algorithm for the unordered tree homeomorphism problem, which corresponds to a tree pattern matching problem for XPath queries with descendant axis only.

  17. Internet Protocol Security (IPSEC): Testing and Implications on IPv4 and IPv6 Networks

    DTIC Science & Technology

    2008-08-27

    Message Authentication Code-Message Digest 5-96). Due to the processing power consumption and slowness of public key authentication methods, RSA ... MODP) group with a 768-bit modulus 2. a MODP group with a 1024-bit modulus 3. an Elliptic Curve Group over GF[2^n] (EC2N) group with a 155-bit ... nonces, digital signatures using the Digital Signature Algorithm, and the Rivest-Shamir-Adleman (RSA) algorithm. For more information about the

  18. Performance of the JPEG Estimated Spectrum Adaptive Postfilter (JPEG-ESAP) for Low Bit Rates

    NASA Technical Reports Server (NTRS)

    Linares, Irving (Inventor)

    2016-01-01

    Frequency-based, pixel-adaptive filtering using the JPEG-ESAP algorithm for low bit rate JPEG formatted color images may allow for more compressed images while maintaining equivalent quality at a smaller file size or bitrate. For RGB, an image is decomposed into three color bands--red, green, and blue. The JPEG-ESAP algorithm is then applied to each band (e.g., once for red, once for green, and once for blue) and the output of each application of the algorithm is rebuilt as a single color image. The ESAP algorithm may be repeatedly applied to MPEG-2 video frames to reduce their bit rate by a factor of 2 or 3, while maintaining equivalent video quality, both perceptually, and objectively, as recorded in the computed PSNR values.

  19. Visual Perception Based Rate Control Algorithm for HEVC

    NASA Astrophysics Data System (ADS)

    Feng, Zeqi; Liu, PengYu; Jia, Kebin

    2018-01-01

    For HEVC, rate control is an indispensable video coding technology that alleviates the conflict between video quality and the limited encoding resources available during video communication. However, the benchmark rate control algorithm of HEVC ignores subjective visual perception: for key focus regions, LCU bit allocation is not ideal and subjective quality is unsatisfactory. In this paper, a visual perception based rate control algorithm for HEVC is proposed. First, the LCU-level bit allocation weight is optimized based on the visual perception of luminance and motion to improve subjective video quality. Then λ and QP are adjusted in combination with the bit allocation weight to improve rate-distortion performance. Experimental results show that the proposed algorithm reduces BD-BR by 0.5% on average and by up to 1.09%, with no loss in bit rate accuracy, compared with HEVC (HM15.0). The proposed algorithm is devoted to improving subjective video quality across various video applications.

  20. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

    PubMed Central

    2013-01-01

    Motivation Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. Results We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. Availability The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana. PMID:24564704

  1. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes.

    PubMed

    Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni

    2013-01-01

    Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana.

  2. A visual parallel-BCI speller based on the time-frequency coding strategy

    NASA Astrophysics Data System (ADS)

    Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong

    2014-04-01

    Objective. Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper aims to develop a visual parallel-BCI speller system based on the time-frequency coding strategy in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. Approach. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, so that all characters had a specific time-frequency code. To verify its effectiveness, 11 subjects were involved in the offline and online spellings. A classification strategy was designed to recognize the target character through jointly using the canonical correlation analysis and stepwise linear discriminant analysis. Main results. Online spellings showed that the proposed parallel-BCI speller had a high performance, reaching the highest information transfer rate of 67.4 bit min^-1, with an average of 54.0 bit min^-1 and 43.0 bit min^-1 in the three rounds and five rounds, respectively. Significance. The results indicated that the proposed parallel-BCI could be effectively controlled by users with attention shifting fluently among the sub-spellers, and highly improved the BCI spelling performance.

  3. Parallelizing quantum circuit synthesis

    NASA Astrophysics Data System (ADS)

    Di Matteo, Olivia; Mosca, Michele

    2016-03-01

    Quantum circuit synthesis is the process in which an arbitrary unitary operation is decomposed into a sequence of gates from a universal set, typically one which a quantum computer can implement both efficiently and fault-tolerantly. As physical implementations of quantum computers improve, the need is growing for tools that can effectively synthesize components of the circuits and algorithms they will run. Existing algorithms for exact, multi-qubit circuit synthesis scale exponentially in the number of qubits and circuit depth, leaving synthesis intractable for circuits on more than a handful of qubits. Even modest improvements in circuit synthesis procedures may lead to significant advances, pushing forward the boundaries of not only the size of solvable circuit synthesis problems, but also in what can be realized physically as a result of having more efficient circuits. We present a method for quantum circuit synthesis using deterministic walks. Also termed pseudorandom walks, these are walks in which once a starting point is chosen, its path is completely determined. We apply our method to construct a parallel framework for circuit synthesis, and implement one such version performing optimal T-count synthesis over the Clifford+T gate set. We use our software to present examples where parallelization offers a significant speedup on the runtime, as well as directly confirm that the 4-qubit 1-bit full adder has optimal T-count 7 and T-depth 3.

  4. A Fast Multiple Sampling Method for Low-Noise CMOS Image Sensors With Column-Parallel 12-bit SAR ADCs.

    PubMed

    Kim, Min-Kyu; Hong, Seong-Kwan; Kwon, Oh-Kyong

    2015-12-26

    This paper presents a fast multiple sampling method for low-noise CMOS image sensor (CIS) applications with column-parallel successive approximation register analog-to-digital converters (SAR ADCs). The 12-bit SAR ADC using the proposed multiple sampling method decreases the A/D conversion time by repeatedly converting a pixel output to 4 bits after the first 12-bit A/D conversion, reducing the noise of the CIS by one over the square root of the number of samplings. The area of the 12-bit SAR ADC is reduced by using a 10-bit capacitor digital-to-analog converter (DAC) with four scaled reference voltages. In addition, a simple up/down counter-based digital processing logic is proposed to perform complex calculations for multiple sampling and digital correlated double sampling. To verify the proposed multiple sampling method, a 256 × 128 pixel array CIS with 12-bit SAR ADCs was fabricated using a 0.18 μm CMOS process. The measurement results show that the proposed multiple sampling method reduces each A/D conversion time from 1.2 μs to 0.45 μs and random noise from 848.3 μV to 270.4 μV, achieving a dynamic range of 68.1 dB and an SNR of 39.2 dB.

  5. The need to approximate the use-case in clinical machine learning

    PubMed Central

    Saeb, Sohrab; Jayaraman, Arun; Mohr, David C.; Kording, Konrad P.

    2017-01-01

    Abstract The availability of smartphone and wearable sensor technology is leading to a rapid accumulation of human subject data, and machine learning is emerging as a technique to map those data into clinical predictions. As machine learning algorithms are increasingly used to support clinical decision making, it is vital to reliably quantify their prediction accuracy. Cross-validation (CV) is the standard approach where the accuracy of such algorithms is evaluated on part of the data the algorithm has not seen during training. However, for this procedure to be meaningful, the relationship between the training and the validation set should mimic the relationship between the training set and the dataset expected for the clinical use. Here we compared two popular CV methods: record-wise and subject-wise. While the subject-wise method mirrors the clinically relevant use-case scenario of diagnosis in newly recruited subjects, the record-wise strategy has no such interpretation. Using both a publicly available dataset and a simulation, we found that record-wise CV often massively overestimates the prediction accuracy of the algorithms. We also conducted a systematic review of the relevant literature, and found that this overly optimistic method was used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes. As we move towards an era of machine learning-based diagnosis and treatment, using proper methods to evaluate their accuracy is crucial, as inaccurate results can mislead both clinicians and data scientists. PMID:28327985
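    The effect described above is easy to reproduce: in the hypothetical dataset below, each subject has a distinctive feature fingerprint but a label assigned per subject at random, so nothing generalizes to unseen subjects. Record-wise K-fold CV still reports near-perfect accuracy because records from the same subject appear in both training and validation folds, while subject-wise (grouped) CV reports the honest chance-level figure. The data, classifier, and fold counts are illustrative choices, not those of the paper.

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, records_per_subject = 20, 30
subjects = np.repeat(np.arange(n_subjects), records_per_subject)

# Each subject has a 5-D "fingerprint" in the features, but the label is random
# per subject, so it cannot be predicted for subjects not seen in training.
fingerprints = rng.normal(size=(n_subjects, 5))
X = fingerprints[subjects] + 0.1 * rng.normal(size=(subjects.size, 5))
y = rng.integers(0, 2, n_subjects)[subjects]

clf = KNeighborsClassifier(n_neighbors=3)
record_wise = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5), groups=subjects)
print("record-wise  CV accuracy:", record_wise.mean())   # near 1.0 (overly optimistic)
print("subject-wise CV accuracy:", subject_wise.mean())  # near 0.5 (chance, the honest answer)
```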

  6. The VLSI design of a Reed-Solomon encoder using Berlekamp's bit-serial multiplier algorithm

    NASA Technical Reports Server (NTRS)

    Truong, T. K.; Deutsch, L. J.; Reed, I. S.; Hsu, I. S.; Wang, K.; Yeh, C. S.

    1982-01-01

    Realization of a bit-serial multiplication algorithm for the encoding of Reed-Solomon (RS) codes on a single VLSI chip using NMOS technology is demonstrated to be feasible. A dual basis (255, 223) RS code over a Galois field is used. The conventional RS encoder for long codes often requires look-up tables to perform the multiplication of two field elements. Berlekamp's algorithm requires only shifting and exclusive-OR operations.
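    The flavour of a bit-serial GF(2^8) multiplier can be sketched in software as a shift-and-XOR loop that processes one bit of the multiplier per step, as below. This uses the common polynomial basis and the reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 purely for illustration; Berlekamp's scheme is specifically a dual-basis multiplier, which this sketch does not reproduce.

```python
def gf256_mul(a, b, poly=0x11D):
    """Bit-serial (shift-and-XOR) multiplication in GF(2^8), polynomial basis,
    reduced modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # conditionally accumulate the shifted multiplicand
        b >>= 1
        a <<= 1                  # shift the multiplicand one bit left
        if a & 0x100:
            a ^= poly            # reduce back into 8 bits with the field polynomial
    return result

print(hex(gf256_mul(0x57, 0x83)))   # 0x31 with this reduction polynomial
```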

  7. Scheme for Entering Binary Data Into a Quantum Computer

    NASA Technical Reports Server (NTRS)

    Williams, Colin

    2005-01-01

    A quantum algorithm provides for the encoding of an exponentially large number of classical data bits by use of a smaller (polynomially large) number of quantum bits (qubits). The development of this algorithm was prompted by the need, heretofore not satisfied, for a means of entering real-world binary data into a quantum computer. The data format provided by this algorithm is suitable for subsequent ultrafast quantum processing of the entered data. Potential applications lie in disciplines (e.g., genomics) in which one needs to search for matches between parts of very long sequences of data. For example, the algorithm could be used to encode the N-bit-long human genome in only log2N qubits. The resulting log2N-qubit state could then be used for subsequent quantum data processing - for example, to perform rapid comparisons of sequences.

  8. Finite element computation on nearest neighbor connected machines

    NASA Technical Reports Server (NTRS)

    Mcaulay, A. D.

    1984-01-01

    Research aimed at faster, more cost effective parallel machines and algorithms for improving designer productivity with finite element computations is discussed. A set of 8 boards, containing 4 nearest neighbor connected arrays of commercially available floating point chips and substantial memory, are inserted into a commercially available machine. One-tenth Mflop (64 bit operation) processors provide an 89% efficiency when solving the equations arising in a finite element problem for a single variable regular grid of size 40 by 40 by 40. This is approximately 15 to 20 times faster than a much more expensive machine such as a VAX 11/780 used in double precision. The efficiency falls off as faster or more processors are envisaged because communication times become dominant. A novel successive overrelaxation algorithm which uses cyclic reduction in order to permit data transfer and computation to overlap in time is proposed.

  9. Experiences modeling ocean circulation problems on a 30 node commodity cluster with 3840 GPU processor cores.

    NASA Astrophysics Data System (ADS)

    Hill, C.

    2008-12-01

    Low cost graphic cards today use many, relatively simple, compute cores to deliver support for memory bandwidth of more than 100GB/s and theoretical floating point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that, (i) can use a hundred or more, 32-bit floating point, concurrently executing cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time dependent shallow-water equations simulation targeting a cluster of 30 computers each hosting one graphics card. The implementation takes into account the considerations (i), (ii) and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented using the NVidia CUDA language but driven from a higher level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottle-neck on current systems). Using the recipe outlined we can boost the performance of our cluster by nearly an order of magnitude, relative to the same algorithm executing only on the cluster CPU's. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel and that the simulations working set of data can fit into the graphics card memory. As we describe, this puts interesting upper and lower bounds on the problem sizes for which this technology is currently most useful. However, many interesting problems fit within this envelope. Looking forward, we extrapolate our experience to estimate full-scale ocean model performance and applicability. Finally we describe preliminary hybrid mixed 32-bit and 64-bit experiments with graphics cards that support 64-bit arithmetic, albeit at a lower performance.

  10. Bit Grooming: Statistically accurate precision-preserving quantization with compression, evaluated in the netCDF operators (NCO, v4.4.8+)

    DOE PAGES

    Zender, Charles S.

    2016-09-19

    Geoscientific models and measurements generate false precision (scientifically meaningless data bits) that wastes storage space. False precision can mislead (by implying noise is signal) and be scientifically pointless, especially for measurements. By contrast, lossy compression can be both economical (save space) and heuristic (clarify data limitations) without compromising the scientific integrity of data. Data quantization can thus be appropriate regardless of whether space limitations are a concern. We introduce, implement, and characterize a new lossy compression scheme suitable for IEEE floating-point data. Our new Bit Grooming algorithm alternately shaves (to zero) and sets (to one) the least significant bits of consecutive values to preserve a desired precision. This is a symmetric, two-sided variant of an algorithm sometimes called Bit Shaving that quantizes values solely by zeroing bits. Our variation eliminates the artificial low bias produced by always zeroing bits, and makes Bit Grooming more suitable for arrays and multi-dimensional fields whose mean statistics are important. Bit Grooming relies on standard lossless compression to achieve the actual reduction in storage space, so we tested Bit Grooming by applying the DEFLATE compression algorithm to bit-groomed and full-precision climate data stored in netCDF3, netCDF4, HDF4, and HDF5 formats. Bit Grooming reduces the storage space required by initially uncompressed and compressed climate data by 25–80 and 5–65 %, respectively, for single-precision values (the most common case for climate data) quantized to retain 1–5 decimal digits of precision. The potential reduction is greater for double-precision datasets. When used aggressively (i.e., preserving only 1–2 digits), Bit Grooming produces storage reductions comparable to other quantization techniques such as Linear Packing. Unlike Linear Packing, whose guaranteed precision rapidly degrades within the relatively narrow dynamic range of values that it can compress, Bit Grooming guarantees the specified precision throughout the full floating-point range. Data quantization by Bit Grooming is irreversible (i.e., lossy) yet transparent, meaning that no extra processing is required by data users/readers. Hence Bit Grooming can easily reduce data storage volume without sacrificing scientific precision or imposing extra burdens on users.

  11. Personal supercomputing by using transputer and Intel 80860 in plasma engineering

    NASA Astrophysics Data System (ADS)

    Ido, S.; Aoki, K.; Ishine, M.; Kubota, M.

    1992-09-01

    A Transputer (T800) or a 64-bit RISC Intel 80860 (i860) added to a personal computer can be used as an accelerator. When 32-bit T800s in a parallel system or 64-bit i860s are used, scientific calculations are carried out several tens of times faster than on commonly used 32-bit personal computers or UNIX workstations. Benchmark tests and examples of physical simulations using T800s and the i860 are reported.

  12. Consistency-based rectification of nonrigid registrations

    PubMed Central

    Gass, Tobias; Székely, Gábor; Goksel, Orcun

    2015-01-01

    Abstract. We present a technique to rectify nonrigid registrations by improving their group-wise consistency, which is a widely used unsupervised measure to assess pair-wise registration quality. While pair-wise registration methods cannot guarantee any group-wise consistency, group-wise approaches typically enforce perfect consistency by registering all images to a common reference. However, errors in individual registrations to the reference then propagate, distorting the mean and accumulating in the pair-wise registrations inferred via the reference. Furthermore, the assumption that perfect correspondences exist is not always true, e.g., for interpatient registration. The proposed consistency-based registration rectification (CBRR) method addresses these issues by minimizing the group-wise inconsistency of all pair-wise registrations using a regularized least-squares algorithm. The regularization controls the adherence to the original registration, which is additionally weighted by the local postregistration similarity. This allows CBRR to adaptively improve consistency while locally preserving accurate pair-wise registrations. We show that the resulting registrations are not only more consistent, but also have lower average transformation error when compared to known transformations in simulated data. On clinical data, we show improvements of up to 50% target registration error in breathing motion estimation from four-dimensional MRI and improvements in atlas-based segmentation quality of up to 65% in terms of mean surface distance in three-dimensional (3-D) CT. Such improvement was observed consistently using different registration algorithms, dimensionality (two-dimensional/3-D), and modalities (MRI/CT). PMID:26158083

  13. Serial-to-parallel color-TV converter

    NASA Technical Reports Server (NTRS)

    Doak, T. W.; Merwin, R. B.; Zuckswert, S. E.; Sepper, W.

    1976-01-01

    Solid-state analog-to-digital converter eliminates flicker and problems with time base stability and gain variation in sequential color TV cameras. Device includes 3-bit delta modulator; two-field memory; timing, switching, and sync network; and three 3-bit delta demodulators.

  14. Use of One Time Pad Algorithm for Bit Plane Security Improvement

    NASA Astrophysics Data System (ADS)

    Suhardi; Suwilo, Saib; Budhiarti Nababan, Erna

    2017-12-01

    BPCS (Bit-Plane Complexity Segmentation) is a steganography technique that exploits the characteristics of human vision, which cannot perceive changes in the binary patterns that occur in an image. This technique inserts a message by replacing bits in high-complexity bit planes, or noise-like regions, with bits of the secret message. Because the message bits were previously stored exactly as given, message extraction can be done easily by rearranging the characters previously stored in the noise-like regions of the image, so the secret message is easily discovered by others. In this research, the process of replacing bit-plane bits with message bits is modified by applying the One Time Pad cryptographic technique, which aims to increase the security of the bit plane. In the tests performed, combining the One Time Pad cryptographic algorithm with the BPCS steganography technique works well for inserting messages into the vessel image, although insertion into low-dimensional images performs poorly. The original image and the stego image look identical, and the result is a good-quality image with a mean PSNR above 30 dB when a large-dimensional image is used as the cover.
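    The modification described above boils down to XOR-ing the message bits with a one-time pad before they are written into a noise-like bit-plane region; a minimal sketch follows. The BPCS complexity measure, region selection, and image handling are not modelled, and the pad here is generated locally purely for illustration (in practice it must be shared secretly and used only once).

```python
import secrets

def otp_xor(bits, pad):
    """XOR a bit sequence with a one-time pad of at least the same length;
    applying it twice with the same pad recovers the original bits."""
    assert len(pad) >= len(bits), "pad must be at least as long as the message"
    return [b ^ k for b, k in zip(bits, pad)]

message = [1, 0, 1, 1, 0, 0, 1, 0]
pad = [secrets.randbits(1) for _ in message]   # truly random, used once
cipher_bits = otp_xor(message, pad)            # these are what get embedded in the bit plane
recovered = otp_xor(cipher_bits, pad)          # the receiver, knowing the pad, decrypts
print(recovered == message)                    # True
```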

  15. A biclustering algorithm for extracting bit-patterns from binary datasets.

    PubMed

    Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S

    2011-10-01

    Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. The source and binary codes, the datasets used in the experiments and the results can be found at: http://www.upo.es/eps/bigs/BiBit.html. Contact: dsrodbae@upo.es. Supplementary data are available at Bioinformatics online.
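
    The "very fast bit-pattern processing technique" can be illustrated with a small Python sketch: each binary row is packed into a machine word, and candidate biclusters are seeded by a bitwise AND of row pairs. This is only a simplified reading of the BiBit idea, not the published implementation; names and thresholds here are hypothetical.

      def rows_to_patterns(matrix):
          # Pack each binary row into an integer so column sets can be intersected with one AND.
          return [int("".join(map(str, row)), 2) for row in matrix]

      def bibit_like_biclusters(matrix, min_rows=2, min_cols=2):
          patterns = rows_to_patterns(matrix)
          n, seen = len(patterns), set()
          for i in range(n):
              for j in range(i + 1, n):
                  seed = patterns[i] & patterns[j]          # columns shared by rows i and j
                  if bin(seed).count("1") < min_cols or seed in seen:
                      continue
                  rows = [r for r in range(n) if patterns[r] & seed == seed]
                  if len(rows) >= min_rows:
                      seen.add(seed)
                      yield rows, seed

      matrix = [[1, 1, 0, 1], [1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1]]
      for rows, pattern in bibit_like_biclusters(matrix):
          print(rows, format(pattern, "04b"))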

  16. Solving Coupled Gross-Pitaevskii Equations on a Cluster of PlayStation 3 Computers

    NASA Astrophysics Data System (ADS)

    Edwards, Mark; Heward, Jeffrey; Clark, C. W.

    2009-05-01

    At Georgia Southern University we have constructed an 8+1-node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra-cold atoms in general, with a particular emphasis on studying bose-bose and bose-fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time-dependent, one-dimensional Gross-Pitaevskii (TDGP) equations. These equations govern the behavior of dual-species bosonic mixtures. We chose the split-operator/FFT method to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 CPU known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64-bit PowerPC Processor Element known as the PPE and eight "Synergistic Processor Elements" identified as the SPEs. We report on this algorithm and compare its performance to a non-parallel solver as applied to modeling evaporative cooling in dual-species bosonic mixtures.
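
    A minimal Python/NumPy sketch of the split-operator/FFT idea follows, reduced to a single-species, dimensionless 1D Gross-Pitaevskii equation (the record describes coupled dual-species equations parallelized on the CellBE); the grid size, trap, and interaction strength are illustrative assumptions.

      import numpy as np

      # i dpsi/dt = (-0.5 d^2/dx^2 + V + g|psi|^2) psi, solved by Strang-split FFT steps.
      n, L, dt, g = 512, 20.0, 1e-3, 1.0
      x = np.linspace(-L / 2, L / 2, n, endpoint=False)
      k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
      V = 0.5 * x**2                                    # harmonic trap
      psi = np.exp(-x**2)
      psi /= np.sqrt(np.sum(np.abs(psi)**2) * (L / n))  # normalize the wave function

      for _ in range(1000):
          psi *= np.exp(-0.5j * dt * (V + g * np.abs(psi)**2))              # half step: potential + nonlinearity
          psi = np.fft.ifft(np.exp(-0.5j * dt * k**2) * np.fft.fft(psi))    # full kinetic step in k-space
          psi *= np.exp(-0.5j * dt * (V + g * np.abs(psi)**2))              # second half step

    The FFT and inverse FFT dominate the cost of each step, which is exactly the part the record reports parallelizing across the SPEs.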

  17. Direct kinematics solution architectures for industrial robot manipulators: Bit-serial versus parallel

    NASA Astrophysics Data System (ADS)

    Lee, J.; Kim, K.

    A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated: bit-serial and parallel. The performance of each scheme is analyzed with respect to the time required to compute one location of the end-effector of a 6-link manipulator and the number of transistors required.

  18. Direct kinematics solution architectures for industrial robot manipulators: Bit-serial versus parallel

    NASA Technical Reports Server (NTRS)

    Lee, J.; Kim, K.

    1991-01-01

    A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated: bit-serial and parallel. The performance of each scheme is analyzed with respect to the time required to compute one location of the end-effector of a 6-link manipulator and the number of transistors required.
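
    The CORDIC processing element mentioned in these two records evaluates the trigonometric terms of the Denavit-Hartenberg transformations with shift-and-add style iterations. A minimal floating-point Python sketch of rotation-mode CORDIC is shown below; it is illustrative only, since real bit-serial or parallel hardware would use fixed-point datapaths.

      import math

      def cordic_sin_cos(angle, iterations=24):
          # Rotation-mode CORDIC: rotate (1, 0) toward `angle` using shift-and-add style updates.
          gain = 1.0
          for i in range(iterations):
              gain *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))        # pre-scale by the CORDIC gain
          x, y, z = gain, 0.0, angle
          for i in range(iterations):
              d = 1.0 if z >= 0 else -1.0
              x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
              z -= d * math.atan(2.0 ** (-i))
          return y, x                                               # (sin, cos)

      print(cordic_sin_cos(math.radians(30)))                       # approx. (0.5, 0.866)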

  19. A review on "A Novel Technique for Image Steganography Based on Block-DCT and Huffman Encoding"

    NASA Astrophysics Data System (ADS)

    Das, Rig; Tuithung, Themrichon

    2013-03-01

    This paper reviews the embedding and extraction algorithm proposed by A. Nag, S. Biswas, D. Sarkar and P. P. Sarkar in "A Novel Technique for Image Steganography Based on Block-DCT and Huffman Encoding", International Journal of Computer Science and Information Technology, Volume 2, Number 3, June 2010 [3], and shows that extraction of the secret image is not possible for the algorithm proposed in [3]. An 8-bit cover image is divided into disjoint blocks, and a two-dimensional Discrete Cosine Transform (2-D DCT) is performed on each block. Huffman encoding is performed on an 8-bit secret image, and each bit of the Huffman-encoded bit stream is embedded in the frequency domain by altering the LSB of the DCT coefficients of the cover image blocks. The Huffman-encoded bit stream and Huffman table
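
    A minimal Python sketch of the reviewed embedding step follows, under the assumption that message bits overwrite the least significant bits of rounded block-DCT coefficients; scipy is used here purely for illustration and the helper name is hypothetical, not the notation of [3].

      import numpy as np
      from scipy.fft import dctn, idctn

      def embed_bits_in_block(block, bits):
          # Overwrite the LSBs of rounded low-frequency AC coefficients of one 8x8 block.
          coeffs = np.round(dctn(block.astype(float), norm="ortho")).astype(int).flatten()
          for pos, bit in zip(range(1, 1 + len(bits)), bits):        # position 0 (DC) is skipped
              coeffs[pos] = (coeffs[pos] & ~1) | bit
          stego = idctn(coeffs.reshape(8, 8).astype(float), norm="ortho")
          return np.clip(np.round(stego), 0, 255).astype(np.uint8)

      block = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
      stego_block = embed_bits_in_block(block, [1, 0, 1, 1])

    Note that rounding the pixel values after the inverse DCT can flip embedded LSBs, which is closely related to the extraction problem this review identifies.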

  20. Adaptive distributed source coding.

    PubMed

    Varodayan, David; Lin, Yao-Chung; Girod, Bernd

    2012-05-01

    We consider distributed source coding in the presence of hidden variables that parameterize the statistical dependence among sources. We derive the Slepian-Wolf bound and devise coding algorithms for a block-candidate model of this problem. The encoder sends, in addition to syndrome bits, a portion of the source to the decoder uncoded as doping bits. The decoder uses the sum-product algorithm to simultaneously recover the source symbols and the hidden statistical dependence variables. We also develop novel techniques based on density evolution (DE) to analyze the coding algorithms. We experimentally confirm that our DE analysis closely approximates practical performance. This result allows us to efficiently optimize parameters of the algorithms. In particular, we show that the system performs close to the Slepian-Wolf bound when an appropriate doping rate is selected. We then apply our coding and analysis techniques to a reduced-reference video quality monitoring system and show a bit rate saving of about 75% compared with fixed-length coding.

  1. Trellises and Trellis-Based Decoding Algorithms for Linear Block Codes. Part 3: The MAP and Related Decoding Algorithms

    NASA Technical Reports Server (NTRS)

    Lin, Shu; Fossorier, Marc

    1998-01-01

    In a coded communication system with equiprobable signaling, MLD minimizes the word error probability and delivers the most likely codeword associated with the corresponding received sequence. This decoding has two drawbacks. First, minimization of the word error probability is not equivalent to minimization of the bit error probability. Therefore, MLD becomes suboptimum with respect to the bit error probability. Second, MLD delivers a hard-decision estimate of the received sequence, so that information is lost between the input and output of the ML decoder. This information is important in coded schemes where the decoded sequence is further processed, such as concatenated coding schemes and multi-stage and iterative decoding schemes. In this chapter, we first present a decoding algorithm which both minimizes the bit error probability and provides the corresponding soft information at the output of the decoder. This algorithm is referred to as the MAP (maximum a posteriori probability) decoding algorithm.

  2. Parallel Processing of Big Point Clouds Using Z-Order Partitioning

    NASA Astrophysics Data System (ADS)

    Alis, C.; Boehm, J.; Liu, K.

    2016-06-01

    As laser scanning technology improves and costs are coming down, the amount of point cloud data being generated can be prohibitively difficult and expensive to process on a single machine. This data explosion is not limited to point cloud data. Voluminous amounts of high-dimensionality and quickly accumulating data, collectively known as Big Data, such as those generated by social media, Internet of Things devices and commercial transactions, are becoming more prevalent as well. New computing paradigms and frameworks are being developed to efficiently handle the processing of Big Data, many of which utilize a compute cluster composed of several commodity grade machines to process chunks of data in parallel. A central concept in many of these frameworks is data locality. By its nature, Big Data is large enough that the entire dataset would not fit on the memory and hard drives of a single node; hence, replicating the entire dataset to each worker node is impractical. The data must then be partitioned across worker nodes in a manner that minimises data transfer across the network. This is a challenge for point cloud data because there exist different ways to partition data and they may require data transfer. We propose a partitioning based on Z-order, which is a form of locality-sensitive hashing. The Z-order or Morton code is computed by dividing each dimension to form a grid and then interleaving the binary representations of the dimensions. For example, the Z-order code for the grid square with coordinates (x = 1 = 01₂, y = 3 = 11₂) is 1011₂ = 11. The number of points in each partition is controlled by the number of bits per dimension: the more bits, the fewer the points. The number of bits per dimension also controls the level of detail, with more bits yielding finer partitioning. We present this partitioning method by implementing it on Apache Spark and investigating how different parameters affect the accuracy and running time of the k nearest neighbour algorithm for a hemispherical and a triangular wave point cloud.
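
    The bit-interleaving step can be written in a few lines of Python; this sketch reproduces the example given above ((x, y) = (1, 3) maps to 11) and is only an illustration, not the Apache Spark implementation.

      def morton_code(x, y, bits_per_dim=16):
          # Z-order (Morton) code: interleave the binary digits of the two grid coordinates.
          code = 0
          for i in range(bits_per_dim):
              code |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
          return code

      print(morton_code(1, 3, bits_per_dim=2))   # grid cell (x=1, y=3) -> 0b1011 = 11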

  3. SOPanG: online text searching over a pan-genome.

    PubMed

    Cislak, Aleksander; Grabowski, Szymon; Holub, Jan

    2018-06-22

    The many thousands of high-quality genomes available nowadays imply a shift from single genome to pan-genomic analyses. A basic algorithmic building brick for such a scenario is online search over a collection of similar texts, a problem with surprisingly few solutions presented so far. We present SOPanG, a simple tool for exact pattern matching over an elastic-degenerate string, a recently proposed simplified model for the pan-genome. Thanks to bit-parallelism, it achieves pattern matching speeds above 400 MB/s, more than an order of magnitude higher than that of other software. SOPanG is available for free from: https://github.com/MrAlexSee/sopang. Supplementary data are available at Bioinformatics online.
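
    SOPanG's speed comes from bit-parallelism: the pattern-matching state is held in machine words updated with a few bitwise operations per text character. The following Python sketch shows the idea on a plain (non-elastic-degenerate) text using the classic Shift-And formulation, which is related to but simpler than the Shift-Or variant SOPanG builds on.

      def shift_and_search(text, pattern):
          # One machine word tracks every prefix of the pattern matched so far.
          m = len(pattern)
          masks = {}
          for i, ch in enumerate(pattern):
              masks[ch] = masks.get(ch, 0) | (1 << i)       # bit i set where pattern[i] == ch
          state, accept, hits = 0, 1 << (m - 1), []
          for pos, ch in enumerate(text):
              state = ((state << 1) | 1) & masks.get(ch, 0)
              if state & accept:
                  hits.append(pos - m + 1)
          return hits

      print(shift_and_search("ACGTACGGACGT", "ACG"))        # [0, 4, 8]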

  4. The need to approximate the use-case in clinical machine learning.

    PubMed

    Saeb, Sohrab; Lonini, Luca; Jayaraman, Arun; Mohr, David C; Kording, Konrad P

    2017-05-01

    The availability of smartphone and wearable sensor technology is leading to a rapid accumulation of human subject data, and machine learning is emerging as a technique to map those data into clinical predictions. As machine learning algorithms are increasingly used to support clinical decision making, it is vital to reliably quantify their prediction accuracy. Cross-validation (CV) is the standard approach where the accuracy of such algorithms is evaluated on part of the data the algorithm has not seen during training. However, for this procedure to be meaningful, the relationship between the training and the validation set should mimic the relationship between the training set and the dataset expected for the clinical use. Here we compared two popular CV methods: record-wise and subject-wise. While the subject-wise method mirrors the clinically relevant use-case scenario of diagnosis in newly recruited subjects, the record-wise strategy has no such interpretation. Using both a publicly available dataset and a simulation, we found that record-wise CV often massively overestimates the prediction accuracy of the algorithms. We also conducted a systematic review of the relevant literature, and found that this overly optimistic method was used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes. As we move towards an era of machine learning-based diagnosis and treatment, using proper methods to evaluate their accuracy is crucial, as inaccurate results can mislead both clinicians and data scientists. © The Author 2017. Published by Oxford University Press.
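
    The difference between the two cross-validation strategies can be made concrete with scikit-learn's splitters; the toy data below are hypothetical and only meant to show how subject-wise splitting keeps all records of a subject in the same fold.

      import numpy as np
      from sklearn.model_selection import KFold, GroupKFold

      rng = np.random.default_rng(0)
      subjects = np.repeat(np.arange(20), 10)            # 20 hypothetical subjects, 10 records each
      X = rng.normal(size=(200, 5)) + subjects[:, None] * 0.1
      y = (subjects % 2).astype(int)                     # the label depends on the subject

      # Record-wise CV: records of one subject may land in both the training and validation folds.
      record_wise_splits = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

      # Subject-wise CV: all records of a subject stay together, mimicking diagnosis of new subjects.
      subject_wise_splits = list(GroupKFold(n_splits=5).split(X, y, groups=subjects))

    A classifier scored on the record-wise splits will tend to look optimistically accurate relative to the subject-wise splits, which is the effect the study reports.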

  5. A Fast Multiple Sampling Method for Low-Noise CMOS Image Sensors With Column-Parallel 12-bit SAR ADCs

    PubMed Central

    Kim, Min-Kyu; Hong, Seong-Kwan; Kwon, Oh-Kyong

    2015-01-01

    This paper presents a fast multiple sampling method for low-noise CMOS image sensor (CIS) applications with column-parallel successive approximation register analog-to-digital converters (SAR ADCs). The 12-bit SAR ADC using the proposed multiple sampling method decreases the A/D conversion time by repeatedly converting a pixel output to 4-bit after the first 12-bit A/D conversion, reducing noise of the CIS by one over the square root of the number of samplings. The area of the 12-bit SAR ADC is reduced by using a 10-bit capacitor digital-to-analog converter (DAC) with four scaled reference voltages. In addition, a simple up/down counter-based digital processing logic is proposed to perform complex calculations for multiple sampling and digital correlated double sampling. To verify the proposed multiple sampling method, a 256 × 128 pixel array CIS with 12-bit SAR ADCs was fabricated using a 0.18 μm CMOS process. The measurement results show that the proposed multiple sampling method reduces each A/D conversion time from 1.2 μs to 0.45 μs and random noise from 848.3 μV to 270.4 μV, achieving a dynamic range of 68.1 dB and an SNR of 39.2 dB. PMID:26712765

  6. A novel frame-level constant-distortion bit allocation for smooth H.264/AVC video quality

    NASA Astrophysics Data System (ADS)

    Liu, Li; Zhuang, Xinhua

    2009-01-01

    It is known that quality fluctuation has a major negative effect on visual perception. In previous work, we introduced a constant-distortion bit allocation method [1] for the H.263+ encoder. However, the method in [1] cannot be adapted to the newest H.264/AVC encoder directly because of the well-known chicken-and-egg dilemma resulting from the rate-distortion optimization (RDO) decision process. To solve this problem, we propose a new two-stage constant-distortion bit allocation (CDBA) algorithm with enhanced rate control for the H.264/AVC encoder. In stage 1, the algorithm performs the RD optimization process with a constant quantization parameter QP. Based on the prediction residual signals from stage 1 and a target distortion chosen for smooth video quality, the frame-level bit target is allocated using a closed-form approximation of the rate-distortion relationship similar to [1], and a fast stage-2 encoding process is performed with enhanced basic unit rate control. Experimental results show that, compared with the original rate control algorithm provided by the H.264/AVC reference software JM12.1, the proposed constant-distortion frame-level bit allocation scheme reduces quality fluctuation and delivers much smoother PSNR on all test sequences.

  7. Universal Decoder for PPM of any Order

    NASA Technical Reports Server (NTRS)

    Moision, Bruce E.

    2010-01-01

    A recently developed algorithm for demodulation and decoding of a pulse-position-modulation (PPM) signal is suitable as a basis for designing a single hardware decoding apparatus to be capable of handling any PPM order. Hence, this algorithm offers advantages of greater flexibility and lower cost, in comparison with prior such algorithms, which necessitate the use of a distinct hardware implementation for each PPM order. In addition, in comparison with the prior algorithms, the present algorithm entails less complexity in decoding at large orders. An unavoidably lengthy presentation of background information, including definitions of terms, is prerequisite to a meaningful summary of this development. As an aid to understanding, the figure illustrates the relevant processes of coding, modulation, propagation, demodulation, and decoding. An M-ary PPM signal has M time slots per symbol period. A pulse (signifying 1) is transmitted during one of the time slots; no pulse (signifying 0) is transmitted during the other time slots. The information intended to be conveyed from the transmitting end to the receiving end of a radio or optical communication channel is a K-bit vector u. This vector is encoded by an (N,K) binary error-correcting code, producing an N-bit vector a. In turn, the vector a is subdivided into blocks of m = log₂(M) bits and each such block is mapped to an M-ary PPM symbol. The resultant coding/modulation scheme can be regarded as equivalent to a nonlinear binary code. The binary vector of PPM symbols, x, is transmitted over a Poisson channel, such that there is obtained, at the receiver, a Poisson-distributed photon count characterized by a mean background count n_b during no-pulse time slots and a mean signal-plus-background count of n_s + n_b during a pulse time slot. In the receiver, demodulation of the signal is effected in an iterative soft decoding process that involves consideration of relationships among photon counts and conditional likelihoods of m-bit vectors of coded bits. Inasmuch as the likelihoods of all the m-bit vectors of coded bits mapping to the same PPM symbol are correlated, the best performance is obtained when the joint m-bit conditional likelihoods are utilized. Unfortunately, the complexity of decoding, measured in the number of operations per bit, grows exponentially with m, and can thus become prohibitively expensive for large PPM orders. For a system required to handle multiple PPM orders, the cost is even higher because it is necessary to have separate decoding hardware for each order. This concludes the prerequisite background information. In the present algorithm, the decoding process as described above is modified by, among other things, introduction of an l-bit marginalizer sub-algorithm. The term "l-bit marginalizer" signifies that instead of m-bit conditional likelihoods, the decoder computes l-bit conditional likelihoods, where l is fixed. Fixing l, regardless of the value of m, makes it possible to use a single hardware implementation for any PPM order. One could minimize the decoding complexity and obtain an especially simple design by fixing l at 1, but this would entail some loss of performance. An intermediate solution is to fix l at some value, greater than 1, that may be less than or greater than m. This solution makes it possible to obtain the desired flexibility to handle any PPM order while compromising between complexity and loss of performance.
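
    Two pieces of the background above lend themselves to a short sketch: mapping m-bit blocks to M-ary PPM slot indices, and turning per-slot Poisson photon counts into log-likelihood ratios for the presence of a pulse. This Python fragment is purely illustrative and is not the decoder described in the record.

      import math

      def bits_to_ppm_symbols(bits, m):
          # Each m-bit block selects which of the M = 2**m slots carries the pulse.
          return [int("".join(map(str, bits[i:i + m])), 2) for i in range(0, len(bits), m)]

      def slot_log_likelihood_ratios(counts, ns, nb):
          # Poisson log-likelihood ratio (pulse vs. no pulse) for each slot's photon count k.
          return [k * math.log((ns + nb) / nb) - ns for k in counts]

      print(bits_to_ppm_symbols([1, 0, 1, 1, 0, 1], m=2))            # -> [2, 3, 1]
      print(slot_log_likelihood_ratios([0, 1, 7, 0], ns=5.0, nb=0.5))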

  8. On adaptive learning rate that guarantees convergence in feedforward networks.

    PubMed

    Behera, Laxmidhar; Kumar, Swagat; Patnaik, Awhan

    2006-09-01

    This paper investigates new learning algorithms (LF I and LF II) based on Lyapunov function for the training of feedforward neural networks. It is observed that such algorithms have an interesting parallel with the popular backpropagation (BP) algorithm, where the fixed learning rate is replaced by an adaptive learning rate computed using a convergence theorem based on Lyapunov stability theory. LF II, a modified version of LF I, has been introduced with an aim to avoid local minima. This modification also helps in improving the convergence speed in some cases. Conditions for achieving a global minimum for these kinds of algorithms have been studied in detail. The performances of the proposed algorithms are compared with the BP algorithm and extended Kalman filtering (EKF) on three benchmark function approximation problems: XOR, 3-bit parity, and 8-3 encoder. The comparisons are made in terms of the number of learning iterations and the computational time required for convergence. It is found that the proposed algorithms (LF I and II) converge much faster than the other two algorithms while attaining the same accuracy. Finally, the comparison is made on a complex two-dimensional (2-D) Gabor function and the effect of the adaptive learning rate on faster convergence is verified. In a nutshell, the investigations made in this paper help us better understand the learning procedure of feedforward neural networks in terms of adaptive learning rate, convergence speed, and local minima.

  9. Hamming and Accumulator Codes Concatenated with MPSK or QAM

    NASA Technical Reports Server (NTRS)

    Divsalar, Dariush; Dolinar, Samuel

    2009-01-01

    In a proposed coding-and-modulation scheme, a high-rate binary data stream would be processed as follows: 1. The input bit stream would be demultiplexed into multiple bit streams. 2. The multiple bit streams would be processed simultaneously into a high-rate outer Hamming code that would comprise multiple short constituent Hamming codes - a distinct constituent Hamming code for each stream. 3. The streams would be interleaved. The interleaver would have a block structure that would facilitate parallelization for high-speed decoding. 4. The interleaved streams would be further processed simultaneously into an inner two-state, rate-1 accumulator code that would comprise multiple constituent accumulator codes - a distinct accumulator code for each stream. 5. The resulting bit streams would be mapped into symbols to be transmitted by use of a higher-order modulation - for example, M-ary phase-shift keying (MPSK) or quadrature amplitude modulation (QAM). The novelty of the scheme lies in the concatenation of the multiple-constituent Hamming and accumulator codes and the corresponding parallel architectures of the encoder and decoder circuitry (see figure) needed to process the multiple bit streams simultaneously. As in the cases of other parallel-processing schemes, one advantage of this scheme is that the overall data rate could be much greater than the data rate of each encoder and decoder stream and, hence, the encoder and decoder could handle data at an overall rate beyond the capability of the individual encoder and decoder circuits.
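
    The inner two-state, rate-1 accumulator code is simply a running XOR of the input bits, which is why it is cheap to replicate once per stream. A minimal Python sketch of one per-stream accumulator (illustrative only, not the proposed hardware) is:

      def accumulate(bits):
          # Rate-1, two-state accumulator: output bit = XOR of all input bits seen so far.
          out, state = [], 0
          for b in bits:
              state ^= b
              out.append(state)
          return out

      def de_accumulate(bits):
          # Inverse: XOR each received bit with the previous received bit.
          prev, out = 0, []
          for b in bits:
              out.append(b ^ prev)
              prev = b
          return out

      coded = accumulate([1, 0, 1, 1, 0])
      assert de_accumulate(coded) == [1, 0, 1, 1, 0]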

  10. Accelerating scientific computations with mixed precision algorithms

    NASA Astrophysics Data System (ADS)

    Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack; Kurzak, Jakub; Langou, Julie; Langou, Julien; Luszczek, Piotr; Tomov, Stanimire

    2009-12-01

    On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. Program summary: Program title: ITER-REF. Catalogue identifier: AECO_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 7211. No. of bytes in distributed program, including test data, etc.: 41 862. Distribution format: tar.gz. Programming language: FORTRAN 77. Computer: desktop, server. Operating system: Unix/Linux. RAM: 512 Mbytes. Classification: 4.8. External routines: BLAS (optional). Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is in general used to improve numerical stability, resulting in a factorization PA = LU, where P is a permutation matrix. The solution for the system is achieved by first solving Ly = Pb (forward substitution) and then solving Ux = y (backward substitution). Due to round-off errors, the computed solution, x, carries a numerical error magnified by the condition number of the coefficient matrix A. In order to improve the computed solution, an iterative process can be applied, which produces a correction to the computed solution at each iteration, which then yields the method that is commonly known as the iterative refinement algorithm. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision. Running time: seconds/minutes.
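
    A minimal NumPy sketch of the iterative refinement idea described under "Solution method" follows; for brevity it re-solves with the single-precision matrix instead of reusing stored LU factors, unlike the Fortran package, so it is only an illustration of the principle.

      import numpy as np

      def mixed_precision_solve(A, b, iters=5):
          # Solve in float32, then refine the solution using float64 residuals.
          A32 = A.astype(np.float32)
          x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
          for _ in range(iters):
              r = b - A @ x                                        # residual in double precision
              dx = np.linalg.solve(A32, r.astype(np.float32))      # cheap single-precision correction
              x += dx.astype(np.float64)
          return x

      rng = np.random.default_rng(1)
      A = rng.standard_normal((200, 200))
      b = rng.standard_normal(200)
      x = mixed_precision_solve(A, b)
      print(np.linalg.norm(A @ x - b))                             # residual close to double-precision level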

  11. ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems

    PubMed Central

    Expósito, Roberto R.

    2018-01-01

    Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search for interesting biclusters in binary datasets, which are very popular in different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/. PMID:29608567

  12. ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems.

    PubMed

    González-Domínguez, Jorge; Expósito, Roberto R

    2018-01-01

    Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search for interesting biclusters in binary datasets, which are very popular in different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/.

  13. Synthesis of blind source separation algorithms on reconfigurable FPGA platforms

    NASA Astrophysics Data System (ADS)

    Du, Hongtao; Qi, Hairong; Szu, Harold H.

    2005-03-01

    Recent advances in intelligence technology have boosted the development of micro-Unmanned Air Vehicles (UAVs), including Silver Fox, Shadow, and Scan Eagle, for various surveillance and reconnaissance applications. These affordable and reusable devices have to fit a series of size, weight, and power constraints. Cameras used on such micro-UAVs are therefore mounted directly at a fixed angle without any motion-compensated gimbals. This mounting scheme has resulted in the so-called jitter effect, in which jitter is defined as sub-pixel or small-amplitude vibrations. The jitter blur caused by the jitter effect needs to be corrected before any other processing algorithms can be practically applied. Jitter restoration has been solved by various optimization techniques, including Wiener approximation, maximum a posteriori probability (MAP), etc. However, these algorithms normally assume a spatially invariant blur model, which is not the case with jitter blur. Szu et al. developed a smart real-time algorithm based on auto-regression (AR) with its natural generalization of unsupervised artificial neural network (ANN) learning to achieve restoration accuracy at the sub-pixel level. This algorithm resembles the capability of the human visual system, in which an agreement between the pair of eyes indicates "signal", otherwise, the jitter noise. Using this non-statistical method, for each single pixel, a deterministic blind source separation (BSS) process can then be carried out independently based on a deterministic minimum of the Helmholtz free energy with a generalization of Shannon's information theory applied to open dynamic systems. From a hardware implementation point of view, the process of jitter restoration of an image using Szu's algorithm can be optimized by pixel-based parallelization. In our previous work, a parallel-structured independent component analysis (ICA) algorithm has been implemented on both Field Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) using standard-height cells. ICA is an algorithm that can solve BSS problems by carrying out all-order statistical, decorrelation-based transforms, under the assumption that neighboring pixels share the same but unknown mixing matrix A. In this paper, we continue our investigation on the design challenges of firmware approaches to smart algorithms. We think two levels of parallelization can be explored, including pixel-based parallelization and the parallelization of the restoration algorithm performed at each pixel. This paper focuses on the latter and we use ICA as an example to explain the design and implementation methods. It is well known that the capacity constraints of a single FPGA have limited the implementation of many complex algorithms, including ICA. Using the reconfigurability of FPGA, we show, in this paper, how to manipulate the FPGA-based system to provide extra computing power for the parallelized ICA algorithm with limited FPGA resources. The synthesis targeting the Pilchard reconfigurable FPGA platform is reported. The Pilchard board is embedded with a single Xilinx VIRTEX 1000E FPGA and transfers data directly to the CPU over the 64-bit memory bus at a maximum frequency of 133 MHz. Both the feasibility performance evaluations and experimental results validate the effectiveness and practicality of this synthesis, which can be extended to spatially variant jitter restoration for micro-UAV deployment.

  14. Optical Neural Classification Of Binary Patterns

    NASA Astrophysics Data System (ADS)

    Gustafson, Steven C.; Little, Gordon R.

    1988-05-01

    Binary pattern classification that may be implemented using optical hardware and neural network algorithms is considered. Pattern classification problems that have no concise description (as in classifying handwritten characters) or no concise computation (as in NP-complete problems) are expected to be particularly amenable to this approach. For example, optical processors that efficiently classify binary patterns in accordance with their Boolean function complexity might be designed. As a candidate for such a design, an optical neural network model is discussed that is designed for binary pattern classification and that consists of an optical resonator with a dynamic multiplex-recorded reflection hologram and a phase conjugate mirror with thresholding and gain. In this model, learning or training examples of binary patterns may be recorded on the hologram such that one bit in each pattern marks the pattern class. Any input pattern, including one with an unknown class or marker bit, will be modified by a large number of parallel interactions with the reflection hologram and nonlinear mirror. After perhaps several seconds and 100 billion interactions, a steady-state pattern may develop with a marker bit that represents a minimum-Boolean-complexity classification of the input pattern. Computer simulations are presented that illustrate progress in understanding the behavior of this model and in developing a processor design that could have commanding and enduring performance advantages compared to current pattern classification techniques.

  15. A source-channel coding approach to digital image protection and self-recovery.

    PubMed

    Sarreshtedari, Saeed; Akhaee, Mohammad Ali

    2015-07-01

    Watermarking algorithms have been widely applied to the field of image forensics recently. One of these forensic applications is the protection of images against tampering. For this purpose, we need to design a watermarking algorithm fulfilling two purposes in case of image tampering: 1) detecting the tampered area of the received image and 2) recovering the lost information in the tampered zones. State-of-the-art techniques accomplish these tasks using watermarks consisting of check bits and reference bits. Check bits are used for tampering detection, whereas reference bits carry information about the whole image. The problem of recovering the lost reference bits still stands. This paper aims to show that, with the tampering location known, image tampering can be modeled and dealt with as an erasure error. Therefore, an appropriate design of the channel code can protect the reference bits against tampering. In the proposed method, the total watermark bit-budget is dedicated to three groups: 1) source encoder output bits; 2) channel code parity bits; and 3) check bits. In the watermark embedding phase, the original image is source coded and the output bit stream is protected using an appropriate channel encoder. For image recovery, erasure locations detected by the check bits help the channel erasure decoder to retrieve the original source-encoded image. Experimental results show that our proposed scheme significantly outperforms recent techniques in terms of image quality for both the watermarked and recovered image. The watermarked image quality gain is achieved by spending less of the bit-budget on the watermark, while image recovery quality is considerably improved as a consequence of the consistent performance of the designed source and channel codes.

  16. The Design of a Single-Bit CMOS Image Sensor for Iris Recognition Applications

    PubMed Central

    Park, Keunyeol; Song, Minkyu

    2018-01-01

    This paper presents a single-bit CMOS image sensor (CIS) that uses a data processing technique with an edge detection block for simple iris segmentation. In order to recognize the iris image, the image sensor conventionally captures high-resolution image data in digital code, extracts the iris data, and then compares it with a reference image through a recognition algorithm. However, in this case, the frame rate decreases by the time required for digital signal conversion of multi-bit digital data through the analog-to-digital converter (ADC) in the CIS. In order to reduce the overall processing time as well as the power consumption, we propose a data processing technique with an exclusive OR (XOR) logic gate to obtain single-bit and edge detection image data instead of multi-bit image data through the ADC. In addition, we propose a logarithmic counter to efficiently measure single-bit image data that can be applied to the iris recognition algorithm. The effective area of the proposed single-bit image sensor (174 × 144 pixels) is 2.84 mm² using a 0.18 μm 1-poly 4-metal CMOS image sensor process. The power consumption of the proposed single-bit CIS is 2.8 mW with a 3.3 V supply voltage at a maximum frame rate of 520 frames/s. The error rate of the ADC is 0.24 least significant bit (LSB) on an 8-bit ADC basis at a 50 MHz sampling frequency. PMID:29495273

  17. The Design of a Single-Bit CMOS Image Sensor for Iris Recognition Applications.

    PubMed

    Park, Keunyeol; Song, Minkyu; Kim, Soo Youn

    2018-02-24

    This paper presents a single-bit CMOS image sensor (CIS) that uses a data processing technique with an edge detection block for simple iris segmentation. In order to recognize the iris image, the image sensor conventionally captures high-resolution image data in digital code, extracts the iris data, and then compares it with a reference image through a recognition algorithm. However, in this case, the frame rate decreases by the time required for digital signal conversion of multi-bit digital data through the analog-to-digital converter (ADC) in the CIS. In order to reduce the overall processing time as well as the power consumption, we propose a data processing technique with an exclusive OR (XOR) logic gate to obtain single-bit and edge detection image data instead of multi-bit image data through the ADC. In addition, we propose a logarithmic counter to efficiently measure single-bit image data that can be applied to the iris recognition algorithm. The effective area of the proposed single-bit image sensor (174 × 144 pixels) is 2.84 mm² using a 0.18 μm 1-poly 4-metal CMOS image sensor process. The power consumption of the proposed single-bit CIS is 2.8 mW with a 3.3 V supply voltage at a maximum frame rate of 520 frames/s. The error rate of the ADC is 0.24 least significant bit (LSB) on an 8-bit ADC basis at a 50 MHz sampling frequency.
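
    The XOR-based edge detection idea can be sketched in software: a pixel is marked as an edge when its single-bit value differs from that of a neighbouring pixel, which is exactly a bitwise XOR. The following NumPy fragment illustrates the concept only and is not the on-chip logic described in these records.

      import numpy as np

      def xor_edges(binary_img):
          # A pixel is an edge if its right or lower neighbour has a different single-bit value.
          img = binary_img.astype(np.uint8)
          edges = np.zeros_like(img)
          edges[:, :-1] |= img[:, :-1] ^ img[:, 1:]
          edges[:-1, :] |= img[:-1, :] ^ img[1:, :]
          return edges

      img = np.zeros((6, 6), dtype=np.uint8)
      img[2:4, 2:5] = 1                     # a small bright rectangle
      print(xor_edges(img))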

  18. Lesion Detection in CT Images Using Deep Learning Semantic Segmentation Technique

    NASA Astrophysics Data System (ADS)

    Kalinovsky, A.; Liauchuk, V.; Tarasau, A.

    2017-05-01

    In this paper, the problem of automatic detection of tuberculosis lesions in 3D lung CT images is considered as a benchmark for testing algorithms based on the modern concept of Deep Learning. For training and testing of the algorithms, a domestic dataset of 338 3D CT scans of tuberculosis patients with manually labelled lesions was used. The algorithms, which are based on Deep Convolutional Networks, were implemented and applied in three different ways: slice-wise lesion detection in 2D images using semantic segmentation, slice-wise lesion detection in 2D images using a sliding-window technique, and straightforward detection of lesions via semantic segmentation in whole 3D CT scans. The algorithms demonstrate superior performance compared to algorithms based on conventional image analysis methods.

  19. Real-time minimal-bit-error probability decoding of convolutional codes

    NASA Technical Reports Server (NTRS)

    Lee, L.-N.

    1974-01-01

    A recursive procedure is derived for decoding of rate R = 1/n binary convolutional codes which minimizes the probability of the individual decoding decisions for each information bit, subject to the constraint that the decoding delay be limited to Delta branches. This new decoding algorithm is similar to, but somewhat more complex than, the Viterbi decoding algorithm. A real-time, i.e., fixed decoding delay, version of the Viterbi algorithm is also developed and used for comparison to the new algorithm on simulated channels. It is shown that the new algorithm offers advantages over Viterbi decoding in soft-decision applications, such as in the inner coding system for concatenated coding.

  20. Real-time minimal bit error probability decoding of convolutional codes

    NASA Technical Reports Server (NTRS)

    Lee, L. N.

    1973-01-01

    A recursive procedure is derived for decoding of rate R=1/n binary convolutional codes which minimizes the probability of the individual decoding decisions for each information bit subject to the constraint that the decoding delay be limited to Delta branches. This new decoding algorithm is similar to, but somewhat more complex than, the Viterbi decoding algorithm. A real-time, i.e. fixed decoding delay, version of the Viterbi algorithm is also developed and used for comparison to the new algorithm on simulated channels. It is shown that the new algorithm offers advantages over Viterbi decoding in soft-decision applications such as in the inner coding system for concatenated coding.

  1. Resolution-Adaptive Hybrid MIMO Architectures for Millimeter Wave Communications

    NASA Astrophysics Data System (ADS)

    Choi, Jinseok; Evans, Brian L.; Gatherer, Alan

    2017-12-01

    In this paper, we propose a hybrid analog-digital beamforming architecture with resolution-adaptive ADCs for millimeter wave (mmWave) receivers with large antenna arrays. We adopt array response vectors for the analog combiners and derive ADC bit-allocation (BA) solutions in closed form. The BA solutions reveal that the optimal number of ADC bits is logarithmically proportional to the RF chain's signal-to-noise ratio raised to the 1/3 power. Using the solutions, two proposed BA algorithms minimize the mean square quantization error of received analog signals under a total ADC power constraint. Contributions of this paper include 1) ADC bit-allocation algorithms to improve communication performance of a hybrid MIMO receiver, 2) approximation of the capacity with the BA algorithm as a function of channels, and 3) a worst-case analysis of the ergodic rate of the proposed MIMO receiver that quantifies system tradeoffs and serves as the lower bound. Simulation results demonstrate that the BA algorithms outperform a fixed-ADC approach in both spectral and energy efficiency, and validate the capacity and ergodic rate formula. For a power constraint equivalent to that of fixed 4-bit ADCs, the revised BA algorithm makes the quantization error negligible while achieving 22% better energy efficiency. Having negligible quantization error allows existing state-of-the-art digital beamformers to be readily applied to the proposed system.

  2. Integrated-Circuit Pseudorandom-Number Generator

    NASA Technical Reports Server (NTRS)

    Steelman, James E.; Beasley, Jeff; Aragon, Michael; Ramirez, Francisco; Summers, Kenneth L.; Knoebel, Arthur

    1992-01-01

    Integrated circuit produces 8-bit pseudorandom numbers from a specified probability distribution at a rate of 10 MHz. Using Boolean logic, the circuit implements a pseudorandom-number-generating algorithm. The circuit includes eight 12-bit pseudorandom-number generators whose outputs are uniformly distributed. 8-bit pseudorandom numbers satisfying the specified nonuniform probability distribution are generated by processing the uniformly distributed outputs of the eight 12-bit pseudorandom-number generators through a "pipeline" of D flip-flops, comparators, and memories implementing conditional probabilities on zeros and ones.

  3. Fuel management optimization using genetic algorithms and expert knowledge

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    DeChaine, M.D.; Feltus, M.A.

    1996-09-01

    The CIGARO fuel management optimization code based on genetic algorithms is described and tested. The test problem optimized the core lifetime for a pressurized water reactor with a penalty function constraint on the peak normalized power. A bit-string genotype encoded the loading patterns, and genotype bias was reduced with additional bits. Expert knowledge about fuel management was incorporated into the genetic algorithm. Regional crossover exchanged physically adjacent fuel assemblies and improved the optimization slightly. Biasing the initial population toward a known priority table significantly improved the optimization.

  4. Transport implementation of the Bernstein-Vazirani algorithm with ion qubits

    NASA Astrophysics Data System (ADS)

    Fallek, S. D.; Herold, C. D.; McMahon, B. J.; Maller, K. M.; Brown, K. R.; Amini, J. M.

    2016-08-01

    Using trapped ion quantum bits in a scalable microfabricated surface trap, we perform the Bernstein-Vazirani algorithm. Our architecture takes advantage of the ion transport capabilities of such a trap. The algorithm is demonstrated using two- and three-ion chains. For three ions, an improvement is achieved compared to a classical system using the same number of oracle queries. For two ions and one query, we correctly determine an unknown bit string with probability 97.6(8)%. For three ions, we succeed with probability 80.9(3)%.
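
    The classical baseline the record compares against can be sketched in a few lines: given an oracle that returns the inner product of a query with the hidden string modulo 2, a classical strategy needs one query per bit (unit-vector queries), whereas the quantum Bernstein-Vazirani algorithm recovers the whole string with a single query. The Python sketch below is illustrative only.

      def oracle(query, secret):
          # Bernstein-Vazirani oracle: inner product of query and hidden string, modulo 2.
          return sum(q & s for q, s in zip(query, secret)) % 2

      def classical_recover(n_bits, ask):
          # Classically, one oracle call per bit: query with each unit vector in turn.
          return [ask([1 if j == i else 0 for j in range(n_bits)]) for i in range(n_bits)]

      secret = [1, 0, 1]
      assert classical_recover(3, lambda q: oracle(q, secret)) == secret   # 3 queries; the quantum algorithm needs 1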

  5. Scene-aware joint global and local homographic video coding

    NASA Astrophysics Data System (ADS)

    Peng, Xiulian; Xu, Jizheng; Sullivan, Gary J.

    2016-09-01

    Perspective motion is commonly represented in video content that is captured and compressed for various applications including cloud gaming, vehicle and aerial monitoring, etc. Existing approaches based on an eight-parameter homography motion model cannot deal with this efficiently, either due to low prediction accuracy or excessive bit rate overhead. In this paper, we consider the camera motion model and scene structure in such video content and propose a joint global and local homography motion coding approach for video with perspective motion. The camera motion is estimated by a computer vision approach, and camera intrinsic and extrinsic parameters are globally coded at the frame level. The scene is modeled as piece-wise planes, and three plane parameters are coded at the block level. Fast gradient-based approaches are employed to search for the plane parameters for each block region. In this way, improved prediction accuracy and low bit costs are achieved. Experimental results based on the HEVC test model show that up to 9.1% bit rate savings can be achieved (with equal PSNR quality) on test video content with perspective motion. Test sequences for the example applications showed a bit rate savings ranging from 3.7 to 9.1%.

  6. True Randomness from Big Data.

    PubMed

    Papakonstantinou, Periklis A; Woodruff, David P; Yang, Guang

    2016-09-26

    Generating random bits is a difficult task, which is important for physical systems simulation, cryptography, and many applications that rely on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as in astronomics, genomics, as well as data produced by individuals, such as internet search logs, sensor networks, and social network feeds. We view the generation of such data as the sampling process from a big source, which is a random variable of size at least a few gigabytes. Our view initiates the study of big sources in the randomness extraction literature. Previous approaches for big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources and extensively validate it empirically on real data sets. The experimental findings indicate that our method is efficient enough to handle large enough sources, while previous extractor constructions are not efficient enough to be practical. Quality-wise, our method at least matches quantum randomness expanders and classical world empirical extractors as measured by standardized tests.

  7. True Randomness from Big Data

    NASA Astrophysics Data System (ADS)

    Papakonstantinou, Periklis A.; Woodruff, David P.; Yang, Guang

    2016-09-01

    Generating random bits is a difficult task, which is important for physical systems simulation, cryptography, and many applications that rely on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as in astronomics, genomics, as well as data produced by individuals, such as internet search logs, sensor networks, and social network feeds. We view the generation of such data as the sampling process from a big source, which is a random variable of size at least a few gigabytes. Our view initiates the study of big sources in the randomness extraction literature. Previous approaches for big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources and extensively validate it empirically on real data sets. The experimental findings indicate that our method is efficient enough to handle large enough sources, while previous extractor constructions are not efficient enough to be practical. Quality-wise, our method at least matches quantum randomness expanders and classical world empirical extractors as measured by standardized tests.

  8. True Randomness from Big Data

    PubMed Central

    Papakonstantinou, Periklis A.; Woodruff, David P.; Yang, Guang

    2016-01-01

    Generating random bits is a difficult task, which is important for physical systems simulation, cryptography, and many applications that rely on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as in astronomics, genomics, as well as data produced by individuals, such as internet search logs, sensor networks, and social network feeds. We view the generation of such data as the sampling process from a big source, which is a random variable of size at least a few gigabytes. Our view initiates the study of big sources in the randomness extraction literature. Previous approaches for big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources and extensively validate it empirically on real data sets. The experimental findings indicate that our method is efficient enough to handle large enough sources, while previous extractor constructions are not efficient enough to be practical. Quality-wise, our method at least matches quantum randomness expanders and classical world empirical extractors as measured by standardized tests. PMID:27666514

  9. PTM Along Track Algorithm to Maintain Spacing During Same Direction Pair-Wise Trajectory Management Operations

    NASA Technical Reports Server (NTRS)

    Carreno, Victor A.

    2015-01-01

    Pair-wise Trajectory Management (PTM) is a cockpit-based delegated responsibility separation standard. When an air traffic service provider gives a PTM clearance to an aircraft and the flight crew accepts the clearance, the flight crew will maintain spacing and separation from a designated aircraft. A PTM along track algorithm will receive state information from the designated aircraft and from the own ship to produce speed guidance for the flight crew to maintain spacing and separation.

  10. Data compression using adaptive transform coding. Appendix 1: Item 1. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Rost, Martin Christopher

    1988-01-01

    Adaptive low-rate source coders are described in this dissertation. These coders adapt by adjusting the complexity of the coder to match the local coding difficulty of the image. This is accomplished by using a threshold driven maximum distortion criterion to select the specific coder used. The different coders are built using variable blocksized transform techniques, and the threshold criterion selects small transform blocks to code the more difficult regions and larger blocks to code the less complex regions. A theoretical framework is constructed from which the study of these coders can be explored. An algorithm for selecting the optimal bit allocation for the quantization of transform coefficients is developed. The bit allocation algorithm is more fully developed, and can be used to achieve more accurate bit assignments than the algorithms currently used in the literature. Some upper and lower bounds for the bit-allocation distortion-rate function are developed. An obtainable distortion-rate function is developed for a particular scalar quantizer mixing method that can be used to code transform coefficients at any rate.
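
    Bit allocation across transform coefficients is often approached greedily: repeatedly give the next bit to the coefficient whose quantization distortion drops the most, under the standard sigma^2 * 2^(-2b) high-rate model. The Python sketch below illustrates that generic greedy allocator; it is not the dissertation's specific algorithm, and the variances are hypothetical.

      import heapq

      def greedy_bit_allocation(variances, total_bits):
          # Repeatedly give the next bit to the coefficient whose distortion sigma^2 * 4**(-b) drops most.
          bits = [0] * len(variances)
          heap = [(-0.75 * v, i) for i, v in enumerate(variances)]   # gain of the first bit is 3/4 sigma^2
          heapq.heapify(heap)
          for _ in range(total_bits):
              _, i = heapq.heappop(heap)
              bits[i] += 1
              distortion = variances[i] * 0.25 ** bits[i]
              heapq.heappush(heap, (-0.75 * distortion, i))          # gain of the next bit for coefficient i
          return bits

      print(greedy_bit_allocation([16.0, 4.0, 1.0, 0.25], total_bits=8))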

  11. Representing and computing regular languages on massively parallel networks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miller, M.I.; O'Sullivan, J.A.; Boysam, B.

    1991-01-01

    This paper proposes a general method for incorporating rule-based constraints corresponding to regular languages into stochastic inference problems, thereby allowing for a unified representation of stochastic and syntactic pattern constraints. The authors' approach first establishes the formal connection of rules to Chomsky grammars, and generalizes the original work of Shannon on the encoding of rule-based channel sequences to Markov chains of maximum entropy. This maximum entropy probabilistic view leads to Gibbs representations with potentials whose number of minima grows at precisely the exponential rate at which the language of deterministically constrained sequences grows. These representations are coupled to stochastic diffusion algorithms, which sample the language-constrained sequences by visiting the energy minima according to the underlying Gibbs probability law. The coupling to stochastic search methods yields the all-important practical result that fully parallel stochastic cellular automata may be derived to generate samples from the rule-based constraint sets. The production rules and neighborhood state structure of the language of sequences directly determine the necessary connection structures of the required parallel computing surface. Representations of this type have been mapped to the DAP-510 massively-parallel processor consisting of 1024 mesh-connected bit-serial processing elements for performing automated segmentation of electron-micrograph images.

  12. Extensive complementarity between gene function prediction methods.

    PubMed

    Vidulin, Vedrana; Šmuc, Tomislav; Supek, Fran

    2016-12-01

    The number of sequenced genomes rises steadily but we still lack knowledge about the biological roles of many genes. Automated function prediction (AFP) is thus a necessity. We hypothesized that AFP approaches that draw on distinct genome features may be useful for predicting different types of gene functions, motivating a systematic analysis of the benefits gained by obtaining and integrating such predictions. Our pipeline amalgamates 5 133 543 genes from 2071 genomes in a single massive analysis that evaluates five established genomic AFP methodologies. While 1227 Gene Ontology (GO) terms yielded reliable predictions, the majority of these functions were accessible to only one or two of the methods. Moreover, different methods tend to assign a GO term to non-overlapping sets of genes. Thus, inferences made by diverse genomic AFP methods display a striking complementarity, both gene-wise and function-wise. Because of this, a viable integration strategy is to rely on a single most-confident prediction per gene/function, rather than enforcing agreement across multiple AFP methods. Using an information-theoretic approach, we estimate that current databases contain 29.2 bits/gene of known Escherichia coli gene functions. This can be increased by up to 5.5 bits/gene using individual AFP methods or by 11 additional bits/gene upon integration, thereby providing a high-ranking predictor on the Critical Assessment of Function Annotation 2 community benchmark. Availability of more sequenced genomes boosts the predictive accuracy of AFP approaches and also the benefit from integrating them. The individual and integrated GO predictions for the complete set of genes are available from http://gorbi.irb.hr/. Contact: fran.supek@irb.hr. Supplementary information: Supplementary materials are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  13. Quantum Adiabatic Algorithms and Large Spin Tunnelling

    NASA Technical Reports Server (NTRS)

    Boulatov, A.; Smelyanskiy, V. N.

    2003-01-01

    We provide a theoretical study of the quantum adiabatic evolution algorithm with different evolution paths proposed in this paper. The algorithm is applied to a random binary optimization problem (a version of the 3-Satisfiability problem) where the n-bit cost function is symmetric with respect to the permutation of individual bits. The evolution paths are produced, using the generic control Hamiltonians H (r) that preserve the bit symmetry of the underlying optimization problem. In the case where the ground state of H(0) coincides with the totally-symmetric state of an n-qubit system the algorithm dynamics is completely described in terms of the motion of a spin-n/2. We show that different control Hamiltonians can be parameterized by a set of independent parameters that are expansion coefficients of H (r) in a certain universal set of operators. Only one of these operators can be responsible for avoiding the tunnelling in the spin-n/2 system during the quantum adiabatic algorithm. We show that it is possible to select a coefficient for this operator that guarantees a polynomial complexity of the algorithm for all problem instances. We show that a successful evolution path of the algorithm always corresponds to the trajectory of a classical spin-n/2 and provide a complete characterization of such paths.

  14. ImageJ: Image processing and analysis in Java

    NASA Astrophysics Data System (ADS)

    Rasband, W. S.

    2012-06-01

    ImageJ is a public domain Java image processing program inspired by NIH Image. It can display, edit, analyze, process, save and print 8-bit, 16-bit and 32-bit images. It can read many image formats including TIFF, GIF, JPEG, BMP, DICOM, FITS and "raw". It supports "stacks", a series of images that share a single window. It is multithreaded, so time-consuming operations such as image file reading can be performed in parallel with other operations.

  15. A 9-Bit 50 MSPS Quadrature Parallel Pipeline ADC for Communication Receiver Application

    NASA Astrophysics Data System (ADS)

    Roy, Sounak; Banerjee, Swapna

    2018-03-01

    This paper presents the design and implementation of a pipeline Analog-to-Digital Converter (ADC) for superheterodyne receiver applications. Several enhancement techniques have been applied in implementing the ADC in order to relax the target specifications of its building blocks. The concepts of time interleaving and double sampling have been used simultaneously to enhance the sampling speed and to reduce the number of amplifiers used in the ADC. Removal of a front-end sample-and-hold amplifier is possible by employing dynamic comparators with switched-capacitor-based comparison of the input signal and reference voltage. Each module of the ADC comprises two 2.5-bit stages followed by two 1.5-bit stages and a 3-bit flash stage. Four such pipeline ADC modules are time interleaved using two pairs of non-overlapping clock signals. These two pairs of clock signals are in phase quadrature with each other, hence the term quadrature parallel pipeline ADC. These configurations ensure that the entire ADC contains only eight operational transconductance amplifiers. The ADC is implemented in a 0.18-μm CMOS process with a supply voltage of 1.8 V. The prototype is tested at sampling frequencies of 50 and 75 MSPS, producing an Effective Number of Bits (ENOB) of 6.86 and 6.11 bits, respectively. At peak sampling speed, the core ADC consumes only 65 mW of power.

  16. A 9-Bit 50 MSPS Quadrature Parallel Pipeline ADC for Communication Receiver Application

    NASA Astrophysics Data System (ADS)

    Roy, Sounak; Banerjee, Swapna

    2018-06-01

    This paper presents the design and implementation of a pipeline Analog-to-Digital Converter (ADC) for superheterodyne receiver applications. Several enhancement techniques have been applied in implementing the ADC in order to relax the target specifications of its building blocks. The concepts of time interleaving and double sampling have been used simultaneously to enhance the sampling speed and to reduce the number of amplifiers used in the ADC. Removal of a front-end sample-and-hold amplifier is possible by employing dynamic comparators with switched-capacitor-based comparison of the input signal and reference voltage. Each module of the ADC comprises two 2.5-bit stages followed by two 1.5-bit stages and a 3-bit flash stage. Four such pipeline ADC modules are time interleaved using two pairs of non-overlapping clock signals. These two pairs of clock signals are in phase quadrature with each other, hence the term quadrature parallel pipeline ADC. These configurations ensure that the entire ADC contains only eight operational transconductance amplifiers. The ADC is implemented in a 0.18-μm CMOS process with a supply voltage of 1.8 V. The prototype is tested at sampling frequencies of 50 and 75 MSPS, producing an Effective Number of Bits (ENOB) of 6.86 and 6.11 bits, respectively. At peak sampling speed, the core ADC consumes only 65 mW of power.

  17. Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array

    NASA Astrophysics Data System (ADS)

    Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

    2008-04-01

    This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
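
    As an informal illustration of the work-farm pattern described above, the sketch below fans one input stream out to a pool of identical software workers and collects one output stream. This is a plain Python approximation under assumed data (the per-frame checksum stands in for real worker code); it does not use the MPPA's actual channel-based programming interface.

      from multiprocessing import Pool

      def worker(frame):
          # Hypothetical per-frame work; a checksum stands in for compression,
          # packet processing, or graphics work done by one worker object.
          return sum(frame) & 0xFFFFFFFF

      if __name__ == "__main__":
          # One input stream of "frames" and one output stream of results,
          # processed by a parallel set of identical workers (a work farm).
          input_stream = [[i % 256 for i in range(1024)] for _ in range(100)]
          with Pool(processes=8) as farm:
              output_stream = farm.map(worker, input_stream)
          print(output_stream[:5])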

  18. Cloud Computing Security Model with Combination of Data Encryption Standard Algorithm (DES) and Least Significant Bit (LSB)

    NASA Astrophysics Data System (ADS)

    Basri, M.; Mawengkang, H.; Zamzami, E. M.

    2018-03-01

    Limited local storage is one reason to switch to cloud storage. The confidentiality and security of data stored in the cloud are very important. One way to maintain the confidentiality and security of such data is to use cryptographic techniques. The Data Encryption Standard (DES) is a block cipher used as a standard symmetric encryption algorithm. DES produces 8 cipher blocks that are combined into one ciphertext, but the ciphertext is weak against brute-force attacks. Therefore, the 8 cipher blocks are hidden in 8 random images using the Least Significant Bit (LSB) algorithm, which embeds the output of the DES algorithm so that it can later be extracted and merged back into one ciphertext.
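
    A minimal sketch of the LSB embedding step, assuming the DES ciphertext blocks are already available as bytes (the DES encryption itself and the splitting across 8 images are omitted); the cover image and block value below are hypothetical.

      import numpy as np

      def embed_lsb(pixels, payload):
          # Hide payload bits in the least significant bit of successive pixels.
          flat = pixels.flatten()                      # copy of the cover image
          bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
          if bits.size > flat.size:
              raise ValueError("payload too large for cover image")
          flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
          return flat.reshape(pixels.shape)

      def extract_lsb(pixels, n_bytes):
          bits = (pixels.flatten()[:n_bytes * 8] & 1).astype(np.uint8)
          return np.packbits(bits).tobytes()

      cover = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
      cipher_block = bytes.fromhex("0123456789abcdef")   # stand-in for one DES output block
      stego = embed_lsb(cover, cipher_block)
      assert extract_lsb(stego, 8) == cipher_block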

  19. Pattern Discovery and Change Detection of Online Music Query Streams

    NASA Astrophysics Data System (ADS)

    Li, Hua-Fu

    In this paper, an efficient stream mining algorithm, called FTP-stream (Frequent Temporal Pattern mining of streams), is proposed to find the frequent temporal patterns over melody sequence streams. In the framework of the proposed algorithm, an effective bit-sequence representation is used to reduce the time and memory needed to slide the windows. The FTP-stream algorithm can calculate pattern supports in only a single pass based on this bit-sequence representation, taking advantage of the "left" and "and" operations of the representation. Experiments show that the proposed algorithm scans the music query stream only once, runs significantly faster, and consumes less memory than existing algorithms such as SWFI-stream and Moment.
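
    The bit-sequence idea can be sketched as follows: each item keeps one bit per position of the current sliding window, support is a population count, and co-occurrence or succession of items reduces to bitwise operations. The window width, the toy melody stream, and the reading of the "left" operation as a shift are assumptions made only for this illustration.

      from collections import defaultdict

      WINDOW = 8   # assumed number of basic windows in the sliding window

      def build_bit_sequences(window_items):
          # Bit i of an item's integer is set if the item occurs at position i.
          bits = defaultdict(int)
          for pos, items in enumerate(window_items):
              for item in items:
                  bits[item] |= 1 << pos
          return bits

      def support(bitseq):
          return bin(bitseq).count("1")

      def pair_support(bits, a, b):
          # Items occurring in the same position: bitwise AND.
          return support(bits[a] & bits[b])

      def follows_support(bits, a, b):
          # a immediately followed by b: shift ("left"), then AND.
          return support((bits[a] << 1) & bits[b] & ((1 << WINDOW) - 1))

      stream = [{"C", "E"}, {"E"}, {"C", "G"}, {"E", "G"}, {"C"}, {"C", "E"}, {"G"}, {"E"}]
      bits = build_bit_sequences(stream)
      print(support(bits["C"]), pair_support(bits, "C", "E"), follows_support(bits, "C", "E"))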

  20. Wavelet compression of multichannel ECG data by enhanced set partitioning in hierarchical trees algorithm.

    PubMed

    Sharifahmadian, Ershad

    2006-01-01

    The set partitioning in hierarchical trees (SPIHT) algorithm is a very effective and computationally simple technique for image and signal compression. Here the author modifies the algorithm to provide even better performance than the original SPIHT algorithm. The enhanced set partitioning in hierarchical trees (ESPIHT) algorithm is faster than the SPIHT algorithm. In addition, the proposed algorithm reduces the number of bits in the bit stream that is stored or transmitted. It is applied to the compression of multichannel ECG data, and a specific procedure based on the modified algorithm is presented for more efficient compression of multichannel ECG data. The method was evaluated on selected records from the MIT-BIH arrhythmia database. According to the experiments, the proposed method attained significant results for compression of multichannel ECG data. Furthermore, the proposed multichannel compression method can also be used efficiently to compress a single signal that is stored over a long period of time.

  1. Introducing difference recurrence relations for faster semi-global alignment of long sequences.

    PubMed

    Suzuki, Hajime; Kasahara, Masahiro

    2018-02-19

    The read length of single-molecule DNA sequencers is reaching 1 Mb. Popular alignment software tools widely used for analyzing such long reads often take advantage of single-instruction multiple-data (SIMD) operations to accelerate calculation of dynamic programming (DP) matrices in the Smith-Waterman-Gotoh (SWG) algorithm with a fixed alignment start position at the origin. Nonetheless, 16-bit or 32-bit integers are necessary for storing the values in a DP matrix when sequences to be aligned are long; this situation hampers the use of the full SIMD width of modern processors. We proposed a faster semi-global alignment algorithm, "difference recurrence relations," that runs more rapidly than the state-of-the-art algorithm by a factor of 2.1. Instead of calculating and storing all the values in a DP matrix directly, our algorithm computes and stores mainly the differences between the values of adjacent cells in the matrix. Although the SWG algorithm and our algorithm can output exactly the same result, our algorithm mainly involves 8-bit integer operations, enabling us to exploit the full width of SIMD operations (e.g., 32) on modern processors. We also developed a library, libgaba, so that developers can easily integrate our algorithm into alignment programs. Our novel algorithm and optimized library implementation will facilitate accelerating nucleotide long-read analysis algorithms that use pairwise alignment stages. The library is implemented in the C programming language and available at https://github.com/ocxtal/libgaba .

  2. Data Acquisition System for Multi-Frequency Radar Flight Operations Preparation

    NASA Technical Reports Server (NTRS)

    Leachman, Jonathan

    2010-01-01

    A three-channel data acquisition system was developed for the NASA Multi-Frequency Radar (MFR) system. The system is based on a commercial-off-the-shelf (COTS) industrial PC (personal computer) and two dual-channel 14-bit digital receiver cards. The decimated complex envelope representations of the three radar signals are passed to the host PC via the PCI bus, and then processed in parallel by multiple cores of the PC CPU (central processing unit). The innovation is this parallelization of the radar data processing using multiple cores of a standard COTS multi-core CPU. The data processing portion of the data acquisition software was built using autonomous program modules or threads, which can run simultaneously on different cores. A master program module calculates the optimal number of processing threads, launches them, and continually supplies each with data. The benefit of this new parallel software architecture is that COTS PCs can be used to implement increasingly complex processing algorithms on an increasing number of radar range gates and data rates. As new PCs become available with higher numbers of CPU cores, the software will automatically utilize the additional computational capacity.

  3. A High-Speed Design of Montgomery Multiplier

    NASA Astrophysics Data System (ADS)

    Fan, Yibo; Ikenaga, Takeshi; Goto, Satoshi

    With the increase of the key lengths used in public-key cryptographic algorithms such as RSA and ECC, the speed of Montgomery multiplication becomes a bottleneck. This paper proposes a high-speed design of a Montgomery multiplier. Firstly, a modified scalable high-radix Montgomery algorithm is proposed to reduce the critical path. Secondly, a high-radix clock-saving dataflow is proposed to support high-radix operation and one clock cycle of delay in the dataflow. Finally, a hardware-reused architecture is proposed to reduce the hardware cost, and a parallel radix-16 design of the data path is proposed to accelerate the speed. Using the HHNEC 0.25μm standard cell library, the implementation results show that the total cost of the Montgomery multiplier is 130 KGates, the clock frequency is 180 MHz, and the throughput of 1024-bit RSA encryption is 352 kbps. This design is suitable for use in high-speed RSA or ECC encryption/decryption. As a scalable design, it supports encryption/decryption of any key length up to the size of the on-chip memory.
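
    The hardware dataflow above is specific to the paper, but the operation it accelerates can be sketched in software. The following radix-2 (one bit per iteration) Montgomery multiplication is a textbook baseline, not the paper's high-radix design; the small modulus is chosen only for the self-check.

      def montgomery_multiply(a, b, n, n_bits):
          # Returns a * b * 2^(-n_bits) mod n, for odd n and a, b < n.
          result = 0
          for i in range(n_bits):
              if (a >> i) & 1:
                  result += b
              if result & 1:        # make result even so the halving is exact
                  result += n
              result >>= 1
          return result - n if result >= n else result

      n, k = 97, 8                  # small odd modulus, 8-bit operands
      a, b = 55, 76
      r_inv = pow(1 << k, -1, n)    # modular inverse of 2^k (Python 3.8+)
      assert montgomery_multiply(a, b, n, k) == (a * b * r_inv) % n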

  4. Method and apparatus for routing data in an inter-nodal communications lattice of a massively parallel computer system by dynamic global mapping of contended links

    DOEpatents

    Archer, Charles Jens [Rochester, MN; Musselman, Roy Glenn [Rochester, MN; Peters, Amanda [Rochester, MN; Pinnow, Kurt Walter [Rochester, MN; Swartz, Brent Allen [Chippewa Falls, WI; Wallenfelt, Brian Paul [Eden Prairie, MN

    2011-10-04

    A massively parallel nodal computer system periodically collects and broadcasts usage data for an internal communications network. A node sending data over the network makes a global routing determination using the network usage data. Preferably, network usage data comprises an N-bit usage value for each output buffer associated with a network link. An optimum routing is determined by summing the N-bit values associated with each link through which a data packet must pass, and comparing the sums associated with different possible routes.
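
    A minimal sketch of the routing decision described above: sum the broadcast N-bit usage values over the links of each candidate route and pick the route with the smallest total. The link names and 4-bit usage values are hypothetical.

      def route_cost(route_links, usage):
          # Total of the N-bit output-buffer usage values along one route.
          return sum(usage[link] for link in route_links)

      def choose_route(candidate_routes, usage):
          return min(candidate_routes, key=lambda route: route_cost(route, usage))

      usage = {"A-B": 3, "B-D": 12, "A-C": 5, "C-D": 2}      # 4-bit values, 0..15
      routes = [("A-B", "B-D"), ("A-C", "C-D")]
      print(choose_route(routes, usage))                      # ('A-C', 'C-D'): cost 7 vs 15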

  5. Modified signed-digit arithmetic based on redundant bit representation.

    PubMed

    Huang, H; Itoh, M; Yatagai, T

    1994-09-10

    Fully parallel modified signed-digit arithmetic operations are realized based on the proposed redundant bit representation of the digits. A new truth-table minimization technique is presented based on redundant-bit-representation coding. It is shown that only 34 minterms are needed for implementing one-step modified signed-digit addition and subtraction with this new representation. Two optical implementation schemes, correlation and matrix multiplication, are described. Experimental demonstrations of the correlation architecture are presented. Both architectures use fixed minterm masks for arbitrary-length operands, taking full advantage of the parallelism of the modified signed-digit number system and of optics.

  6. Towards massively parallelized all-optical magnetic recording

    NASA Astrophysics Data System (ADS)

    Davies, C. S.; Janušonis, J.; Kimel, A. V.; Kirilyuk, A.; Tsukamoto, A.; Rasing, Th.; Tobey, R. I.

    2018-06-01

    We demonstrate an approach to parallel all-optical writing of magnetic domains using spatial and temporal interference of two ultrashort light pulses. We explore how the fluence and grating periodicity of the optical transient grating influence the size and uniformity of the written bits. Using a total incident optical energy of 3.5 μJ, we demonstrate the capability of simultaneously writing 102 spatially separated bits, each featuring a relevant lateral width of ˜1 μm. We discuss viable routes to extend this technique to write individually addressable, sub-diffraction-limited magnetic domains in a wide range of materials.

  7. A new interferential multispectral image compression algorithm based on adaptive classification and curve-fitting

    NASA Astrophysics Data System (ADS)

    Wang, Ke-Yan; Li, Yun-Song; Liu, Kai; Wu, Cheng-Ke

    2008-08-01

    A novel compression algorithm for interferential multispectral images based on adaptive classification and curve-fitting is proposed. The image is first partitioned adaptively into major-interference region and minor-interference region. Different approximating functions are then constructed for two kinds of regions respectively. For the major interference region, some typical interferential curves are selected to predict other curves. These typical curves are then processed by curve-fitting method. For the minor interference region, the data of each interferential curve are independently approximated. Finally the approximating errors of two regions are entropy coded. The experimental results show that, compared with JPEG2000, the proposed algorithm not only decreases the average output bit-rate by about 0.2 bit/pixel for lossless compression, but also improves the reconstructed images and reduces the spectral distortion greatly, especially at high bit-rate for lossy compression.

  8. FBCOT: a fast block coding option for JPEG 2000

    NASA Astrophysics Data System (ADS)

    Taubman, David; Naman, Aous; Mathew, Reji

    2017-09-01

    Based on the EBCOT algorithm, JPEG 2000 finds application in many fields, including high performance scientific, geospatial and video coding applications. Beyond digital cinema, JPEG 2000 is also attractive for low-latency video communications. The main obstacle for some of these applications is the relatively high computational complexity of the block coder, especially at high bit-rates. This paper proposes a drop-in replacement for the JPEG 2000 block coding algorithm, achieving much higher encoding and decoding throughputs, with only modest loss in coding efficiency (typically < 0.5dB). The algorithm provides only limited quality/SNR scalability, but offers truly reversible transcoding to/from any standard JPEG 2000 block bit-stream. The proposed FAST block coder can be used with EBCOT's post-compression RD-optimization methodology, allowing a target compressed bit-rate to be achieved even at low latencies, leading to the name FBCOT (Fast Block Coding with Optimized Truncation).

  9. Intra Frame Coding In Advanced Video Coding Standard (H.264) to Obtain Consistent PSNR and Reduce Bit Rate for Diagonal Down Left Mode Using Gaussian Pulse

    NASA Astrophysics Data System (ADS)

    Manjanaik, N.; Parameshachari, B. D.; Hanumanthappa, S. N.; Banu, Reshma

    2017-08-01

    The intra prediction process of the H.264 video coding standard is used to code the first frame (the intra frame) of a video and achieves good coding efficiency compared to previous video coding standards. A further benefit of intra frame coding is that it reduces spatial pixel redundancy within the current frame, reduces computational complexity, and provides better rate-distortion performance. Intra frames are conventionally coded with the Rate Distortion Optimization (RDO) method. This method increases computational complexity and bit rate and reduces picture quality, so it is difficult to use in real-time applications; many researchers have therefore developed fast mode decision algorithms for intra frame coding. Previous work on intra frame coding in H.264 using fast mode decision intra prediction algorithms based on different techniques resulted in increased bit rate and degraded picture quality (PSNR) for different quantization parameters. Many previous fast mode decision approaches to intra frame coding achieved only a reduction of computational complexity, or saved encoding time, and their limitation was an increase in bit rate with a loss of picture quality. To avoid the increase in bit rate and the loss of picture quality, this paper develops a better approach, using a Gaussian pulse for intra frame coding with the diagonal down-left intra prediction mode to achieve higher coding efficiency in terms of PSNR and bit rate. In the proposed method, a Gaussian pulse is multiplied with each 4x4 block of frequency-domain coefficients of the 4x4 sub-macroblocks of each macroblock of the current frame before the quantization process. Multiplying each 4x4 block of integer-transformed coefficients by the Gaussian pulse at the macroblock level scales the information in the coefficients in a reversible manner. Frequency samples are modified in a known and controllable manner without intermixing of coefficients, which prevents the picture from being badly degraded at higher values of the quantization parameter. The proposed work was implemented using MATLAB and the JM 18.6 reference software. The performance parameters PSNR, bit rate, and compression of the intra frame were measured for YUV video sequences at QCIF resolution under different values of the quantization parameter, with the Gaussian pulse applied to the diagonal down-left intra prediction mode. The simulation results of the proposed algorithm are tabulated and compared with a previous algorithm (the method of Tian et al.). The proposed algorithm achieved an average bit-rate reduction of 30.98% and maintains consistent picture quality for QCIF sequences compared to the method of Tian et al.
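
    The reversible Gaussian scaling step can be sketched as below: each 4x4 block of transform coefficients is multiplied element-wise by a Gaussian pulse before quantization and divided by the same pulse after de-quantization. The pulse shape, the quantization step, and the use of a plain random block in place of the H.264 integer transform are assumptions for illustration only.

      import numpy as np

      def gaussian_pulse_4x4(sigma=2.0):
          # Separable 2-D Gaussian weighting over the 4x4 coefficient positions.
          x = np.arange(4)
          g = np.exp(-(x ** 2) / (2 * sigma ** 2))
          return np.outer(g, g)

      def encode_block(coeffs, qstep, pulse):
          # Reversibly scale by the pulse, then quantize.
          return np.round((coeffs * pulse) / qstep)

      def decode_block(levels, qstep, pulse):
          # De-quantize and undo the Gaussian scaling.
          return (levels * qstep) / pulse

      coeffs = np.random.randn(4, 4) * 50        # stand-in for 4x4 transformed residuals
      pulse = gaussian_pulse_4x4()
      qstep = 8.0
      rec = decode_block(encode_block(coeffs, qstep, pulse), qstep, pulse)
      # Per-coefficient reconstruction error is at most qstep / (2 * pulse).
      print(np.max(np.abs(rec - coeffs)))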

  10. Design and Implementation of High Performance Content-Addressable Memories.

    DTIC Science & Technology

    1985-12-01

    content addressability and two basic implementations of content addressing. The need for and application of hardware CAM is presented to motivate the topic... [Figure 2.15: Maximum Search Using All-Parallel CAM, starting at the left-most position (the most significant bit) with the other bits zero.]

  11. Communication system analysis for manned space flight

    NASA Technical Reports Server (NTRS)

    Schilling, D. L.

    1977-01-01

    One- and two-dimensional adaptive delta modulator (ADM) algorithms are discussed and compared. Results are shown for bit rates of two bits/pixel, one bit/pixel and 0.5 bits/pixel. Pictures showing the difference between the encoded-decoded pictures and the original pictures are presented. The effect of channel errors on the reconstructed picture is illustrated. A two-dimensional ADM using interframe encoding is also presented. This system operates at the rate of two bits/pixel and produces excellent quality pictures when there is little motion. The effect of large amounts of motion on the reconstructed picture is described.

  12. The 10^8-bit solid state spacecraft data recorder. [utilizing bubble domain memory technology]

    NASA Technical Reports Server (NTRS)

    Murray, G. W.; Bohning, O. D.; Kinoshita, R. Y.; Becker, F. J.

    1979-01-01

    The results are summarized of a program to demonstrate the feasibility of bubble domain memory technology as a mass memory medium for spacecraft applications. The design, fabrication, and test of a partially populated 10^8-bit data recorder using 100-Kbit serial bubble memory chips are described. Design tradeoffs, design approach, and performance are discussed. This effort resulted in a 10^8-bit recorder with a volume of 858.6 cu in and a weight of 47.2 pounds. The recorder is plug-reconfigurable, having the capability of operating as one, two, or four independent serial channel recorders or as a single sixteen-bit-byte parallel input recorder. Data rates up to 1.2 Mb/s in serial mode and 2.4 Mb/s in parallel mode may be supported. Fabrication and test of the recorder demonstrated the basic feasibility of bubble domain memory technology for such applications. Test results indicate the need for improvement in memory element operating temperature range and detector performance.

  13. Optical RISC computer

    NASA Astrophysics Data System (ADS)

    Guilfoyle, Peter S.; Stone, Richard V.; Hessenbruch, John M.; Zeise, Frederick F.

    1993-07-01

    A second-generation digital optical computer (DOC II) has been developed which utilizes a RISC-based operating system as its host. This 32-bit, high-performance (12.8 GByte/sec) computing platform demonstrates a number of basic principles that are inherent to parallel free-space optical interconnects, such as speed (up to 10^12 bit operations per second) and low power (1.2 fJ per bit). Although DOC II is a general-purpose machine, special-purpose applications have been developed and are currently being evaluated on the optical platform.

  14. A sparse matrix algorithm on the Boolean vector machine

    NASA Technical Reports Server (NTRS)

    Wagner, Robert A.; Patrick, Merrell L.

    1988-01-01

    VLSI technology is being used to implement a prototype Boolean Vector Machine (BVM), which is a large network of very small processors with equally small memories that operate in SIMD mode; these use bit-serial arithmetic and communicate via a cube-connected cycles network. The BVM's bit-serial arithmetic and the small memories of individual processors are noted to compromise the system's effectiveness in large numerical problem applications. Attention is presently given to the implementation of a basic matrix-vector iteration algorithm for sparse matrices on the BVM, in order to generate over 1 billion useful floating-point operations/sec for this iteration algorithm. The algorithm is expressed in a novel language designated 'BVM'.

  15. Stochastic optimization of GeantV code by use of genetic algorithms

    DOE PAGES

    Amadio, G.; Apostolakis, J.; Bandieramonte, M.; ...

    2017-10-01

    GeantV is a complex system based on the interaction of different modules needed for detector simulation, which include transport of particles in fields, physics models simulating their interactions with matter and a geometrical modeler library for describing the detector and locating the particles and computing the path length to the current volume boundary. The GeantV project is recasting the classical simulation approach to get maximum benefit from SIMD/MIMD computational architectures and highly massive parallel systems. This involves finding the appropriate balance between several aspects influencing computational performance (floating-point performance, usage of off-chip memory bandwidth, specification of cache hierarchy, etc.) and handling a large number of program parameters that have to be optimized to achieve the best simulation throughput. This optimization task can be treated as a black-box optimization problem, which requires searching the optimum set of parameters using only point-wise function evaluations. Here, the goal of this study is to provide a mechanism for optimizing complex systems (high energy physics particle transport simulations) with the help of genetic algorithms and evolution strategies as tuning procedures for massive parallel simulations. One of the described approaches is based on introducing a specific multivariate analysis operator that could be used in case of resource expensive or time consuming evaluations of fitness functions, in order to speed-up the convergence of the black-box optimization problem.

  16. Stochastic optimization of GeantV code by use of genetic algorithms

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Behera, S. P.; Brun, R.; Canal, P.; Carminati, F.; Cosmo, G.; Duhem, L.; Elvira, D.; Folger, G.; Gheata, A.; Gheata, M.; Goulas, I.; Hariri, F.; Jun, S. Y.; Konstantinov, D.; Kumawat, H.; Ivantchenko, V.; Lima, G.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.

    2017-10-01

    GeantV is a complex system based on the interaction of different modules needed for detector simulation, which include transport of particles in fields, physics models simulating their interactions with matter and a geometrical modeler library for describing the detector and locating the particles and computing the path length to the current volume boundary. The GeantV project is recasting the classical simulation approach to get maximum benefit from SIMD/MIMD computational architectures and highly massive parallel systems. This involves finding the appropriate balance between several aspects influencing computational performance (floating-point performance, usage of off-chip memory bandwidth, specification of cache hierarchy, etc.) and handling a large number of program parameters that have to be optimized to achieve the best simulation throughput. This optimization task can be treated as a black-box optimization problem, which requires searching the optimum set of parameters using only point-wise function evaluations. The goal of this study is to provide a mechanism for optimizing complex systems (high energy physics particle transport simulations) with the help of genetic algorithms and evolution strategies as tuning procedures for massive parallel simulations. One of the described approaches is based on introducing a specific multivariate analysis operator that could be used in case of resource expensive or time consuming evaluations of fitness functions, in order to speed-up the convergence of the black-box optimization problem.

  17. Stochastic optimization of GeantV code by use of genetic algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Amadio, G.; Apostolakis, J.; Bandieramonte, M.

    GeantV is a complex system based on the interaction of different modules needed for detector simulation, which include transport of particles in fields, physics models simulating their interactions with matter and a geometrical modeler library for describing the detector and locating the particles and computing the path length to the current volume boundary. The GeantV project is recasting the classical simulation approach to get maximum benefit from SIMD/MIMD computational architectures and highly massive parallel systems. This involves finding the appropriate balance between several aspects influencing computational performance (floating-point performance, usage of off-chip memory bandwidth, specification of cache hierarchy, etc.) and handling a large number of program parameters that have to be optimized to achieve the best simulation throughput. This optimization task can be treated as a black-box optimization problem, which requires searching the optimum set of parameters using only point-wise function evaluations. Here, the goal of this study is to provide a mechanism for optimizing complex systems (high energy physics particle transport simulations) with the help of genetic algorithms and evolution strategies as tuning procedures for massive parallel simulations. One of the described approaches is based on introducing a specific multivariate analysis operator that could be used in case of resource expensive or time consuming evaluations of fitness functions, in order to speed-up the convergence of the black-box optimization problem.

  18. New Bandwidth Efficient Parallel Concatenated Coding Schemes

    NASA Technical Reports Server (NTRS)

    Denedetto, S.; Divsalar, D.; Montorsi, G.; Pollara, F.

    1996-01-01

    We propose a new solution to parallel concatenation of trellis codes with multilevel amplitude/phase modulations and a suitable iterative decoding structure. Examples are given for throughputs of 2 bits/sec/Hz with 8PSK and 16QAM signal constellations.

  19. Broadband continuous wave source localization via pair-wise, cochleagram processing

    NASA Astrophysics Data System (ADS)

    Nosal, Eva-Marie; Frazer, L. Neil

    2005-04-01

    A pair-wise processor has been developed for the passive localization of broadband continuous-wave underwater sources. The algorithm uses sparse hydrophone arrays and does not require previous knowledge of the source signature. It is applicable in multiple source situations. A spectrogram/cochleagram version of the algorithm has been developed in order to utilize higher frequencies at longer ranges where signal incoherence, and limited computational resources, preclude the use of full waveforms. Simulations demonstrating the robustness of the algorithm with respect to noise and environmental mismatch will be presented, together with initial results from the analysis of humpback whale song recorded at the Pacific Missile Range Facility off Kauai. [Work supported by MHPCC and ONR.]

  20. Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications.

    PubMed

    Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J

    2004-09-01

    We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster, consisting of 800 MHz Intel Pentium III processors, shows an almost linear speedup up to 32 processors for simulating 1 x 10^8 or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability), which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors, when increasing the problem size up to 8 x 10^8 histories. For a smaller number of histories (1 x 10^8) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10^8 histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture, Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
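
    The parallelization strategy described (independent histories split across MPI ranks, each with its own random stream, with results combined at the end) can be sketched with mpi4py as below. The depth-dose tally and the per-history sampling are stand-ins, not the DPM physics, and the per-rank seeding only imitates what the SPRNG library provides.

      # Run with, e.g.: mpiexec -n 8 python dpm_parallel_sketch.py
      import numpy as np
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank, size = comm.Get_rank(), comm.Get_size()

      TOTAL_HISTORIES = 1_000_000
      local_n = TOTAL_HISTORIES // size + (1 if rank < TOTAL_HISTORIES % size else 0)

      rng = np.random.default_rng(seed=12345 + rank)    # independent stream per rank

      # Hypothetical per-history "energy deposit" scored into a 1-D depth tally.
      tally = np.zeros(64)
      depths = rng.integers(0, 64, size=local_n)
      deposits = rng.exponential(1.0, size=local_n)
      np.add.at(tally, depths, deposits)

      # Combine the partial tallies on rank 0.
      total = np.zeros_like(tally)
      comm.Reduce(tally, total, op=MPI.SUM, root=0)
      if rank == 0:
          print("histories:", TOTAL_HISTORIES, "peak bin:", int(np.argmax(total)))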

  1. Improved liver R2* mapping by pixel-wise curve fitting with adaptive neighborhood regularization.

    PubMed

    Wang, Changqing; Zhang, Xinyuan; Liu, Xiaoyun; He, Taigang; Chen, Wufan; Feng, Qianjin; Feng, Yanqiu

    2018-08-01

    To improve liver R2* mapping by incorporating adaptive neighborhood regularization into pixel-wise curve fitting. Magnetic resonance imaging R2* mapping remains challenging because of the serial images with low signal-to-noise ratio. In this study, we proposed to exploit the neighboring pixels as regularization terms and adaptively determine the regularization parameters according to the interpixel signal similarity. The proposed algorithm, called the pixel-wise curve fitting with adaptive neighborhood regularization (PCANR), was compared with the conventional nonlinear least squares (NLS) and nonlocal means filter-based NLS algorithms on simulated, phantom, and in vivo data. Visually, the PCANR algorithm generates R2* maps with significantly reduced noise and well-preserved tiny structures. Quantitatively, the PCANR algorithm produces R2* maps with lower root mean square errors at varying R2* values and signal-to-noise-ratio levels compared with the NLS and nonlocal means filter-based NLS algorithms. For the high R2* values under low signal-to-noise-ratio levels, the PCANR algorithm outperforms the NLS and nonlocal means filter-based NLS algorithms in accuracy and precision, in terms of the mean and standard deviation of R2* measurements in selected regions of interest, respectively. The PCANR algorithm can reduce the effect of noise on liver R2* mapping, and the improved measurement precision will benefit the assessment of hepatic iron in clinical practice. Magn Reson Med 80:792-801, 2018. © 2018 International Society for Magnetic Resonance in Medicine.
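
    The PCANR regularization itself is not reproduced here; the sketch below shows only the baseline pixel-wise NLS fit of the mono-exponential decay S(TE) = S0 * exp(-R2* * TE) that such methods build on, using hypothetical echo times and synthetic data.

      import numpy as np
      from scipy.optimize import curve_fit

      def mono_exp(te, s0, r2s):
          return s0 * np.exp(-r2s * te)

      def fit_r2star_map(images, te):
          # images: (n_echoes, H, W) magnitude images; te: echo times in seconds.
          _, h, w = images.shape
          r2s_map = np.zeros((h, w))
          for y in range(h):
              for x in range(w):
                  signal = images[:, y, x]
                  try:
                      (_, r2s), _ = curve_fit(mono_exp, te, signal,
                                              p0=(signal[0], 50.0), maxfev=200)
                      r2s_map[y, x] = max(r2s, 0.0)
                  except RuntimeError:
                      r2s_map[y, x] = 0.0          # fit failed; leave pixel at zero
          return r2s_map

      te = np.array([1.0, 2.0, 3.5, 5.0, 7.0, 9.0, 12.0, 16.0]) * 1e-3   # seconds
      true_r2s = np.full((8, 8), 300.0)                                   # 1/s
      clean = mono_exp(te[:, None, None], 1000.0, true_r2s[None])
      noisy = clean + np.random.default_rng(0).normal(0, 10, clean.shape)
      print(round(fit_r2star_map(noisy, te).mean(), 1))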

  2. Maximum-likelihood soft-decision decoding of block codes using the A* algorithm

    NASA Technical Reports Server (NTRS)

    Ekroot, L.; Dolinar, S.

    1994-01-01

    The A* algorithm finds the path in a finite depth binary tree that optimizes a function. Here, it is applied to maximum-likelihood soft-decision decoding of block codes where the function optimized over the codewords is the likelihood function of the received sequence given each codeword. The algorithm considers codewords one bit at a time, making use of the most reliable received symbols first and pursuing only the partially expanded codewords that might be maximally likely. A version of the A* algorithm for maximum-likelihood decoding of block codes has been implemented for block codes up to 64 bits in length. The efficiency of this algorithm makes simulations of codes up to length 64 feasible. This article details the implementation currently in use, compares the decoding complexity with that of exhaustive search and Viterbi decoding algorithms, and presents performance curves obtained with this implementation of the A* algorithm for several codes.

  3. Improved Iris Recognition through Fusion of Hamming Distance and Fragile Bit Distance.

    PubMed

    Hollingsworth, Karen P; Bowyer, Kevin W; Flynn, Patrick J

    2011-12-01

    The most common iris biometric algorithm represents the texture of an iris using a binary iris code. Not all bits in an iris code are equally consistent. A bit is deemed fragile if its value changes across iris codes created from different images of the same iris. Previous research has shown that iris recognition performance can be improved by masking these fragile bits. Rather than ignoring fragile bits completely, we consider what beneficial information can be obtained from the fragile bits. We find that the locations of fragile bits tend to be consistent across different iris codes of the same eye. We present a metric, called the fragile bit distance, which quantitatively measures the coincidence of the fragile bit patterns in two iris codes. We find that score fusion of fragile bit distance and Hamming distance works better for recognition than Hamming distance alone. To our knowledge, this is the first and only work to use the coincidence of fragile bit locations to improve the accuracy of matches.
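
    A toy sketch of the two distances and a simple score fusion, on synthetic binary codes rather than real iris codes; the fusion weight and the way the fragile-bit flags are generated here are assumptions, not the paper's procedure.

      import numpy as np

      def hamming_distance(code1, code2, mask1, mask2):
          # Fractional Hamming distance over bits unmasked in both codes.
          valid = mask1 & mask2
          return ((code1 ^ code2) & valid).sum() / max(int(valid.sum()), 1)

      def fragile_bit_distance(frag1, frag2, mask1, mask2):
          # Fraction of commonly unmasked bits whose fragile flags do not coincide.
          valid = mask1 & mask2
          return ((frag1 ^ frag2) & valid).sum() / max(int(valid.sum()), 1)

      def fused_score(hd, fbd, alpha=0.8):
          return alpha * hd + (1 - alpha) * fbd      # hypothetical linear fusion

      rng = np.random.default_rng(1)
      code_a = rng.integers(0, 2, 2048, dtype=np.uint8)
      code_b = code_a.copy()
      code_b[rng.choice(2048, 200, replace=False)] ^= 1   # same eye, some bits flipped
      frag_a = rng.integers(0, 2, 2048, dtype=np.uint8)
      frag_b = frag_a.copy()
      mask = np.ones(2048, dtype=np.uint8)
      hd = hamming_distance(code_a, code_b, mask, mask)
      fbd = fragile_bit_distance(frag_a, frag_b, mask, mask)
      print(round(hd, 3), round(fbd, 3), round(fused_score(hd, fbd), 3))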

  4. Chaotic Image Encryption Algorithm Based on Bit Permutation and Dynamic DNA Encoding.

    PubMed

    Zhang, Xuncai; Han, Feng; Niu, Ying

    2017-01-01

    With the help of the fact that chaos is sensitive to initial conditions and pseudorandomness, combined with the spatial configurations in the DNA molecule's inherent and unique information processing ability, a novel image encryption algorithm based on bit permutation and dynamic DNA encoding is proposed here. The algorithm first uses Keccak to calculate the hash value for a given DNA sequence as the initial value of a chaotic map; second, it uses a chaotic sequence to scramble the image pixel locations, and the butterfly network is used to implement the bit permutation. Then, the image is dynamically coded into a DNA matrix, and an algebraic operation is performed with the DNA sequence to realize the substitution of the pixels, which further improves the security of the encryption. Finally, the confusion and diffusion properties of the algorithm are further enhanced by the operation of the DNA sequence and the ciphertext feedback. The results of the experiment and security analysis show that the algorithm not only has a large key space and strong sensitivity to the key but can also effectively resist attack operations such as statistical analysis and exhaustive analysis.
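
    A greatly simplified illustration of the confusion/diffusion idea: a logistic-map keystream seeded from a hash of key material permutes pixel positions and XORs pixel values. The Keccak hash of a DNA sequence, the butterfly-network bit permutation, the DNA coding, and the ciphertext feedback of the actual algorithm are not reproduced; the key string and image below are hypothetical.

      import hashlib
      import numpy as np

      def logistic_keystream(key, n, r=3.99):
          # Chaotic values in (0, 1); the initial value is derived from a hash of the key.
          h = hashlib.sha256(key).digest()
          x = int.from_bytes(h[:8], "big") / 2**64 * 0.998 + 0.001
          out = np.empty(n)
          for i in range(n):
              x = r * x * (1 - x)
              out[i] = x
          return out

      def encrypt(image, key):
          flat = image.flatten()
          ks = logistic_keystream(key, flat.size)
          perm = np.argsort(ks)                      # chaotic scrambling of pixel positions
          stream = (ks * 256).astype(np.uint8)       # chaotic byte stream for diffusion
          return (flat[perm] ^ stream).reshape(image.shape)

      def decrypt(cipher, key):
          flat = cipher.flatten()
          ks = logistic_keystream(key, flat.size)
          perm = np.argsort(ks)
          stream = (ks * 256).astype(np.uint8)
          plain = np.empty_like(flat)
          plain[perm] = flat ^ stream                # undo diffusion, then unscramble
          return plain.reshape(cipher.shape)

      img = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
      key = b"hypothetical DNA-derived key"
      assert np.array_equal(decrypt(encrypt(img, key), key), img)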

  5. Chaotic Image Encryption Algorithm Based on Bit Permutation and Dynamic DNA Encoding

    PubMed Central

    2017-01-01

    With the help of the fact that chaos is sensitive to initial conditions and pseudorandomness, combined with the spatial configurations in the DNA molecule's inherent and unique information processing ability, a novel image encryption algorithm based on bit permutation and dynamic DNA encoding is proposed here. The algorithm first uses Keccak to calculate the hash value for a given DNA sequence as the initial value of a chaotic map; second, it uses a chaotic sequence to scramble the image pixel locations, and the butterfly network is used to implement the bit permutation. Then, the image is dynamically coded into a DNA matrix, and an algebraic operation is performed with the DNA sequence to realize the substitution of the pixels, which further improves the security of the encryption. Finally, the confusion and diffusion properties of the algorithm are further enhanced by the operation of the DNA sequence and the ciphertext feedback. The results of the experiment and security analysis show that the algorithm not only has a large key space and strong sensitivity to the key but can also effectively resist attack operations such as statistical analysis and exhaustive analysis. PMID:28912802

  6. A fast combination calibration of foreground and background for pipelined ADCs

    NASA Astrophysics Data System (ADS)

    Kexu, Sun; Lenian, He

    2012-06-01

    This paper describes a fast digital calibration scheme for pipelined analog-to-digital converters (ADCs). The proposed method corrects the nonlinearity caused by finite opamp gain and capacitor mismatch in multiplying digital-to-analog converters (MDACs). The calibration technique takes advantage of both foreground and background calibration schemes. In this combination calibration algorithm, a novel parallel background calibration with signal-shifted correlation is proposed, and its calibration cycle is very short. The details of this technique are described using the example of a 14-bit 100 Msample/s pipelined ADC. The high convergence speed of this background calibration is achieved by three means. First, a modified 1.5-bit stage is proposed in order to allow the injection of a large pseudo-random dither without missing codes. Second, before correlating the signal, it is shifted according to the input signal so that the correlation error converges quickly. Finally, the front pipeline stages are calibrated simultaneously rather than stage by stage to reduce the calibration tracking constants. Simulation results confirm that the combination calibration has a fast startup process and a short background calibration cycle of 2 × 2^21 conversions.

  7. Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications

    PubMed Central

    2014-01-01

    Background The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it, makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However the memory and the execution time required by the running are of O(n3) and of O(n5) order, respectively, and so, the algorithm is unaffordable for huge data sets. Results We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together to a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing systems MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. Conclusions In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. 
The test data used in this study are IPDATA (Irvine Primate splice- junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors. A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one. PMID:25077818

  8. Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications.

    PubMed

    D'Angelo, Gianni; Rampone, Salvatore

    2014-01-01

    The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it, makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However the memory and the execution time required by the running are of O(n(3)) and of O(n(5)) order, respectively, and so, the algorithm is unaffordable for huge data sets. We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together to a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing systems MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. 
The test data used in this study are IPDATA (Irvine Primate splice- junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors. A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one.

  9. A parallel algorithm for 2D visco-acoustic frequency-domain full-waveform inversion: application to a dense OBS data set

    NASA Astrophysics Data System (ADS)

    Sourbier, F.; Operto, S.; Virieux, J.

    2006-12-01

    We present a distributed-memory parallel algorithm for 2D visco-acoustic full-waveform inversion of wide-angle seismic data. Our code is written in fortran90 and uses MPI for parallelism. The algorithm was applied to a real wide-angle data set recorded by 100 OBSs with a 1-km spacing in the eastern Nankai trough (Japan) to image the deep structure of the subduction zone. Full-waveform inversion is applied sequentially to discrete frequencies by proceeding from the low to the high frequencies. The inverse problem is solved with a classic gradient method. Full-waveform modeling is performed with a frequency-domain finite-difference method. In the frequency domain, solving the wave equation requires resolution of a large unsymmetric system of linear equations. We use the massively parallel direct solver MUMPS (http://www.enseeiht.fr/irit/apo/MUMPS) for distributed-memory computers to solve this system. The MUMPS solver is based on a multifrontal method for the parallel factorization. The MUMPS algorithm is subdivided into 3 main steps. First, a symbolic analysis step performs re-ordering of the matrix coefficients to minimize the fill-in of the matrix during the subsequent factorization and estimates the assembly tree of the matrix. Second, the factorization is performed with dynamic scheduling to accommodate numerical pivoting and provides the LU factors distributed over all the processors. Third, the resolution is performed for multiple sources. To compute the gradient of the cost function, 2 simulations per shot are required (one to compute the forward wavefield and one to back-propagate residuals). The multi-source resolutions can be performed in parallel with MUMPS. In the end, each processor stores in core a sub-domain of all the solutions. These distributed solutions can be exploited to compute the gradient of the cost function in parallel. Since the gradient of the cost function is a weighted stack of the shot and residual solutions of MUMPS, each processor computes the corresponding sub-domain of the gradient. In the end, the gradient is centralized on the master processor using a collective communication. The gradient is scaled by the diagonal elements of the Hessian matrix. This scaling is computed only once per frequency before the first iteration of the inversion. Estimation of the diagonal terms of the Hessian requires performing one simulation per non-redundant shot and receiver position. The same strategy as the one used for the gradient is used to compute the diagonal Hessian in parallel. This algorithm was applied to a dense wide-angle data set recorded by 100 OBSs in the eastern Nankai trough, offshore Japan. Thirteen frequencies ranging from 3 to 15 Hz were inverted. Twenty iterations per frequency were computed, leading to 260 tomographic velocity models of increasing resolution. The velocity model dimensions are 105 km x 25 km, corresponding to a finite-difference grid of 4201 x 1001 nodes with a 25-m grid interval. The number of shots was 1005 and the number of inverted OBS gathers was 93. The inversion requires 20 days on six 32-bit dual-processor nodes with 4 Gbytes of RAM memory per node when only the LU factorization is performed in parallel. Preliminary estimates of the time required to perform the inversion with the fully parallelized code are 6 and 4 days using 20 and 50 processors, respectively.

  10. Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease

    PubMed Central

    Shamonin, Denis P.; Bron, Esther E.; Lelieveldt, Boudewijn P. F.; Smits, Marion; Klein, Stefan; Staring, Marius

    2013-01-01

    Nonrigid image registration is an important, but time-consuming task in medical image analysis. In typical neuroimaging studies, multiple image registrations are performed, i.e., for atlas-based segmentation or template construction. Faster image registration routines would therefore be beneficial. In this paper we explore acceleration of the image registration package elastix by a combination of several techniques: (i) parallelization on the CPU, to speed up the cost function derivative calculation; (ii) parallelization on the GPU building on and extending the OpenCL framework from ITKv4, to speed up the Gaussian pyramid computation and the image resampling step; (iii) exploitation of certain properties of the B-spline transformation model; (iv) further software optimizations. The accelerated registration tool is employed in a study on diagnostic classification of Alzheimer's disease and cognitively normal controls based on T1-weighted MRI. We selected 299 participants from the publicly available Alzheimer's Disease Neuroimaging Initiative database. Classification is performed with a support vector machine based on gray matter volumes as a marker for atrophy. We evaluated two types of strategies (voxel-wise and region-wise) that heavily rely on nonrigid image registration. Parallelization and optimization resulted in an acceleration factor of 4–5x on an 8-core machine. Using OpenCL a speedup factor of 2 was realized for computation of the Gaussian pyramids, and 15–60 for the resampling step, for larger images. The voxel-wise and the region-wise classification methods had an area under the receiver operator characteristic curve of 88 and 90%, respectively, both for standard and accelerated registration. We conclude that the image registration package elastix was substantially accelerated, with nearly identical results to the non-optimized version. The new functionality will become available in the next release of elastix as open source under the BSD license. PMID:24474917

  11. FPGA implementation of bit controller in double-tick architecture

    NASA Astrophysics Data System (ADS)

    Kobylecki, Michał; Kania, Dariusz

    2017-11-01

    This paper presents a comparison of two original architectures of programmable bit controllers built on FPGAs. Programmable Logic Controllers (which include, among other things, programmable bit controllers) built on FPGAs provide an efficient alternative to microprocessor-based controllers, which are expensive and often too slow. The presented and compared methods allow any bit control algorithm written in the Ladder Diagram language to be efficiently implemented in programmable logic in accordance with IEC 61131-3. For both architectures, we compare the effect of the applied architecture on the performance of executing the same bit control program in relation to its size.

  12. A novel color image encryption algorithm based on genetic recombination and the four-dimensional memristive hyperchaotic system

    NASA Astrophysics Data System (ADS)

    Chai, Xiu-Li; Gan, Zhi-Hua; Lu, Yang; Zhang, Miao-Hui; Chen, Yi-Ran

    2016-10-01

    Recently, many image encryption algorithms based on chaos have been proposed. Most of the previous algorithms encrypt the components R, G, and B of color images independently and neglect the high correlation between them. In this paper, a novel color image encryption algorithm is introduced. The 24 bit planes of components R, G, and B of the color plain image are obtained and recombined into 4 compound bit planes, which makes the three components affect each other. A four-dimensional (4D) memristive hyperchaotic system generates the pseudorandom key streams, and its initial values come from the SHA-256 hash value of the color plain image. The compound bit planes and key streams are confused according to the principles of genetic recombination; then confusion and diffusion are applied jointly to the bit planes, and the color cipher image is obtained. Experimental results and security analyses demonstrate that the proposed algorithm is secure and effective, so that it may be adopted for secure communication. Project supported by the National Natural Science Foundation of China (Grant Nos. 61203094 and 61305042), the Natural Science Foundation of the United States (Grant Nos. CNS-1253424 and ECCS-1202225), the Science and Technology Foundation of Henan Province, China (Grant No. 152102210048), the Foundation and Frontier Project of Henan Province, China (Grant No. 162300410196), the Natural Science Foundation of Educational Committee of Henan Province, China (Grant No. 14A413015), and the Research Foundation of Henan University, China (Grant No. xxjc20140006).
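
    To make the bit-plane decomposition concrete, the short sketch below extracts the 24 bit planes of an 8-bit RGB image with NumPy and regroups them into 4 compound planes. The interleaved grouping used here is only a placeholder: in the paper the recombination is driven by the chaotic key streams and genetic-recombination rules.

      # Minimal sketch: 24 bit planes of an RGB image, regrouped into 4 compound planes.
      import numpy as np

      img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)   # toy image

      # planes[c*8 + b] is bit plane b (0 = LSB) of channel c (0=R, 1=G, 2=B).
      planes = np.stack([(img[..., c] >> b) & 1 for c in range(3) for b in range(8)])

      # Placeholder grouping into 4 compound planes of 6 bit planes each.
      compound = [planes[i::4] for i in range(4)]
      print(planes.shape, [c.shape for c in compound])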

  13. Long sequence correlation coprocessor

    NASA Astrophysics Data System (ADS)

    Gage, Douglas W.

    1994-09-01

    A long sequence correlation coprocessor (LSCC) accelerates the bitwise correlation of arbitrarily long digital sequences by calculating in parallel the correlation scores for, for example, 16 adjacent bit alignments between two binary sequences. The LSCC integrated circuit is incorporated into a computer system with memory storage buffers and a separate general purpose computer processor which serves as its controller. Each of the LSCC's set of sequential counters simultaneously tallies a separate correlation coefficient. During each LSCC clock cycle, enable logic associated with each counter compares one bit of a first sequence with one bit of a second sequence and increments the counter if the bits are the same. A shift register assures that the same bit of the first sequence is simultaneously compared to different bits of the second sequence, so that the different counters simultaneously calculate the correlation coefficients for different alignments of the two sequences.
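
    The scoring rule that the LSCC parallelizes in hardware can be stated in a few lines of software: for each of, say, 16 adjacent alignments, count the positions where the two bit sequences agree. The sketch below is such a bit-for-bit emulation; it reproduces only the counting logic, not the shift-register hardware.

      # Minimal sketch of the LSCC scoring rule: one counter per alignment offset,
      # incremented whenever the compared bits of the two sequences match.
      def correlation_scores(seq_a, seq_b, n_alignments=16):
          """seq_a, seq_b: strings of '0'/'1'; returns one match count per offset."""
          scores = []
          for offset in range(n_alignments):
              count = 0
              for i, bit in enumerate(seq_a):
                  j = i + offset
                  if j < len(seq_b) and bit == seq_b[j]:
                      count += 1        # plays the role of one sequential counter
              scores.append(count)
          return scores

      print(correlation_scores("1011001110100101", "0011011100101011"))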

  14. FPGA Implementation of an Efficient Algorithm for the Calculation of Charged Particle Trajectories in Cosmic Ray Detectors

    NASA Astrophysics Data System (ADS)

    Villar, Xabier; Piso, Daniel; Bruguera, Javier D.

    2014-02-01

    This paper presents an FPGA implementation of a previously published algorithm for the reconstruction of cosmic-ray trajectories and the determination of the time of arrival and velocity of the particles. The accuracy and precision issues of the algorithm have been analyzed to propose a suitable implementation. Thus, a 32-bit fixed-point format has been used for the representation of the data values. Moreover, the dependencies among the different operations have been taken into account to obtain a highly parallel and efficient hardware implementation. The final hardware architecture requires 18 cycles to process every particle, and has been exhaustively simulated to validate all the design decisions. The architecture has been mapped onto different commercial FPGAs, with a frequency of operation ranging from 300 MHz to 1.3 GHz, depending on the FPGA being used. Consequently, the number of particle trajectories processed per second is between 16 million and 72 million. The high number of particle trajectories calculated per second shows that the proposed FPGA implementation might also be used in high-rate environments such as those found in particle and nuclear physics experiments.

  15. An algorithm to compute the sequency ordered Walsh transform

    NASA Technical Reports Server (NTRS)

    Larsen, H.

    1976-01-01

    A fast sequency-ordered Walsh transform algorithm is presented; it is complementary to the sequency-ordered fast Walsh transform introduced by Manz (1972), which eliminated Gray-code reordering through a modification of the basic fast Hadamard transform structure. The new algorithm retains the advantages of its complement (it is in place and is its own inverse), while differing in having a decimation-in-time structure, accepting data in normal order, and returning the coefficients in bit-reversed sequency order. Applications include estimation of Walsh power spectra for a random process, sequency filtering, computing logical autocorrelations, and selective bit reversing.
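
    For reference, the basic in-place fast Hadamard (Walsh) transform that such sequency-ordered algorithms modify takes only a few lines. The sketch below shows that butterfly structure in natural (Hadamard) order only; it does not include the decimation-in-time modification or the bit-reversed sequency ordering of the coefficients described above.

      # Minimal sketch of the in-place fast Walsh-Hadamard transform (natural order).
      # Input length must be a power of two; the transform is its own inverse up to 1/N.
      def fwht(x):
          a = list(x)
          h = 1
          while h < len(a):
              for i in range(0, len(a), 2 * h):
                  for j in range(i, i + h):
                      a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
              h *= 2
          return a

      print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))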

  16. Differential Fault Analysis on CLEFIA with 128, 192, and 256-Bit Keys

    NASA Astrophysics Data System (ADS)

    Takahashi, Junko; Fukunaga, Toshinori

    This paper describes a differential fault analysis (DFA) attack against CLEFIA. The proposed attack can be applied to CLEFIA with all supported keys: 128, 192, and 256-bit keys. DFA is a type of side-channel attack. This attack enables the recovery of secret keys by injecting faults into a secure device during its computation of the cryptographic algorithm and comparing the correct ciphertext with the faulty one. CLEFIA is a 128-bit blockcipher with 128, 192, and 256-bit keys developed by the Sony Corporation in 2007. CLEFIA employs a generalized Feistel structure with four data lines. We developed a new attack method that uses this characteristic structure of the CLEFIA algorithm. On the basis of the proposed attack, only 2 pairs of correct and faulty ciphertexts are needed to retrieve the 128-bit key, and 10.78 pairs on average are needed to retrieve the 192 and 256-bit keys. The proposed attack is more efficient than any previously reported. In order to verify the proposed attack and estimate the calculation time to recover the secret key, we conducted an attack simulation using a PC. The simulation results show that we can obtain each secret key within three minutes on average. This result shows that we can obtain the entire key within a feasible computational time.

  17. 60-GHz optical/wireless MIMO system integrated with optical subcarrier multiplexing and 2x2 wireless communication.

    PubMed

    Lin, Chi-Hsiang; Lin, Chun-Ting; Huang, Hou-Tzu; Zeng, Wei-Siang; Chiang, Shou-Chih; Chang, Hsi-Yu

    2015-05-04

    This paper proposes a 2x2 MIMO OFDM Radio-over-Fiber scheme based on optical subcarrier multiplexing and 60-GHz MIMO wireless transmission. We also schematically investigate the principle of optical subcarrier multiplexing, which is based on a dual-parallel Mach-Zehnder modulator (DP-MZM). In our simulation results, combining two MIMO OFDM signals to drive the DP-MZM gives rise to a PAPR increase of less than 0.4 dB, which mitigates nonlinear distortion. Moreover, we applied a Levin-Campello bit-loading algorithm to compensate for the uneven frequency responses in the V-band. The resulting system achieves an OFDM signal rate of 61.5 Gbit/s with a BER of 10^-3 over 25-km SMF transmission followed by 3-m wireless transmission.

  18. Coding for Parallel Links to Maximize the Expected Value of Decodable Messages

    NASA Technical Reports Server (NTRS)

    Klimesh, Matthew A.; Chang, Christopher S.

    2011-01-01

    When multiple parallel communication links are available, it is useful to consider link-utilization strategies that provide tradeoffs between reliability and throughput. Interesting cases arise when there are three or more available links. Under the model considered, the links have known probabilities of being in working order, and each link has a known capacity. The sender has a number of messages to send to the receiver. Each message has a size and a value (i.e., a worth or priority). Messages may be divided into pieces arbitrarily, and the value of each piece is proportional to its size. The goal is to choose combinations of messages to send on the links so that the expected value of the messages decodable by the receiver is maximized. There are three parts to the innovation: (1) Applying coding to parallel links under the model; (2) Linear programming formulation for finding the optimal combinations of messages to send on the links; and (3) Algorithms for assisting in finding feasible combinations of messages, as support for the linear programming formulation. There are similarities between this innovation and methods developed in the field of network coding. However, network coding has generally been concerned with either maximizing throughput in a fixed network, or robust communication of a fixed volume of data. In contrast, under this model, the throughput is expected to vary depending on the state of the network. Examples of error-correcting codes that are useful under this model but which are not needed under previous models have been found. This model can represent either a one-shot communication attempt, or a stream of communications. Under the one-shot model, message sizes and link capacities are quantities of information (e.g., measured in bits), while under the communications stream model, message sizes and link capacities are information rates (e.g., measured in bits/second). This work has the potential to increase the value of data returned from spacecraft under certain conditions.

  19. Bit-Grooming: Shave Your Bits with Razor-sharp Precision

    NASA Astrophysics Data System (ADS)

    Zender, C. S.; Silver, J.

    2017-12-01

    Lossless compression can reduce climate data storage by 30-40%. Further reduction requires lossy compression that also reduces precision. Fortunately, geoscientific models and measurements generate false precision (scientifically meaningless data bits) that can be eliminated without sacrificing scientifically meaningful data. We introduce Bit Grooming, a lossy compression algorithm that removes the bloat due to false precision, those bits and bytes beyond the meaningful precision of the data. Bit Grooming is statistically unbiased, applies to all floating point numbers, and is easy to use. Bit Grooming reduces geoscience data storage requirements by 40-80%. We compared Bit Grooming to competitors Linear Packing, Layer Packing, and GRIB2/JPEG2000. The other compression methods have the edge in terms of compression, but Bit Grooming is the most accurate and certainly the most usable and portable. Bit Grooming provides flexible and well-balanced solutions to the trade-offs among compression, accuracy, and usability required by lossy compression. Geoscientists could reduce their long term storage costs, and show leadership in the elimination of false precision, by adopting Bit Grooming.
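
    The core mechanism, discarding mantissa bits beyond the meaningful precision of IEEE-754 floats, can be sketched with simple bit masking. The snippet below alternately shaves (zeroes) and sets (ones) the trailing mantissa bits of successive values, which is what keeps the quantization statistically unbiased; the mapping from significant decimal digits to the number of retained mantissa bits is deliberately simplified here.

      # Minimal sketch of bit grooming for 32-bit floats: alternate "shave" and "set"
      # of the trailing mantissa bits so that rounding errors roughly cancel.
      import numpy as np

      def bit_groom(values, keep_bits=12):
          x = np.asarray(values, dtype=np.float32)
          bits = x.view(np.uint32).copy()
          drop = 23 - keep_bits                              # mantissa bits to discard
          shave_mask = np.uint32(0xFFFFFFFF ^ ((1 << drop) - 1))
          set_mask = np.uint32((1 << drop) - 1)
          bits[0::2] &= shave_mask                           # zero trailing bits (shave)
          bits[1::2] |= set_mask                             # one trailing bits (set)
          return bits.view(np.float32)

      data = np.array([3.14159265, 2.71828183, 1.41421356, 0.57721566], dtype=np.float32)
      print(bit_groom(data))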

  20. Recognition of the optical packet header for two channels utilizing the parallel reservoir computing based on a semiconductor ring laser

    NASA Astrophysics Data System (ADS)

    Bao, Xiurong; Zhao, Qingchun; Yin, Hongxi; Qin, Jie

    2018-05-01

    In this paper, an all-optical parallel reservoir computing (RC) system with two channels for optical packet header recognition is proposed and simulated, based on a semiconductor ring laser (SRL) with its characteristic bidirectional light paths. The parallel optical loops are built through the cross-feedback of the bidirectional light paths, where each optical loop can independently recognize an injected optical packet header. Two input signals are mapped and recognized simultaneously by training the all-optical parallel reservoir, which is enabled by the nonlinear states in the laser. The recognition of optical packet headers from 4 bits to 32 bits for the two channels is implemented in simulation by optimizing the system parameters, and the optimal recognition error ratio is 0. Since this structure can be combined with a wavelength division multiplexing (WDM) optical packet switching network, the wavelength of each channel of optical packet headers can be different, and a better recognition result can be obtained.

  1. Application of a Noise Adaptive Contrast Sensitivity Function to Image Data Compression

    NASA Astrophysics Data System (ADS)

    Daly, Scott J.

    1989-08-01

    The visual contrast sensitivity function (CSF) has found increasing use in image compression as new algorithms optimize the display-observer interface in order to reduce the bit rate and increase the perceived image quality. In most compression algorithms, increasing the quantization intervals reduces the bit rate at the expense of introducing more quantization error, a potential image quality degradation. The CSF can be used to distribute this error as a function of spatial frequency such that it is undetectable by the human observer. Thus, instead of being mathematically lossless, the compression algorithm can be designed to be visually lossless, with the advantage of a significantly reduced bit rate. However, the CSF is strongly affected by image noise, changing in both shape and peak sensitivity. This work describes a model of the CSF that includes these changes as a function of image noise level by using the concepts of internal visual noise, and tests this model in the context of image compression with an observer study.

  2. Automatic speech recognition research at NASA-Ames Research Center

    NASA Technical Reports Server (NTRS)

    Coler, Clayton R.; Plummer, Robert P.; Huff, Edward M.; Hitchcock, Myron H.

    1977-01-01

    A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented. The voice command system (VCS) encodes speech by sampling 16 bandpass filters with center frequencies in the range from 200 to 5000 Hz. Variations in speaking rate are compensated for by a compression algorithm that subdivides each utterance into eight subintervals in such a way that the amount of spectral change within each subinterval is the same. The recorded filter values within each subinterval are then reduced to a 15-bit representation, giving a 120-bit encoding for each utterance. The VCS incorporates a simple recognition algorithm that utilizes five training samples of each word in a vocabulary of up to 24 words. A recognition rate of approximately 85 percent correct for untrained speakers and 94 percent correct for trained speakers was not considered adequate for flight-system use. Therefore, the built-in recognition algorithm was disabled, and the VCS was modified to transmit the 120-bit encodings to an external computer for recognition.

  3. Sleep stage classification with low complexity and low bit rate.

    PubMed

    Virkkala, Jussi; Värri, Alpo; Hasan, Joel; Himanen, Sari-Leena; Müller, Kiti

    2009-01-01

    Standard sleep stage classification is based on visual analysis of central (usually also frontal and occipital) EEG, two-channel EOG, and submental EMG signals. The process is complex, uses multiple electrodes, and is usually based on relatively high (200-500 Hz) sampling rates. At least 12-bit analog-to-digital conversion is also recommended (with 16-bit storage), resulting in a total bit rate of at least 12.8 kbit/s. This is not a problem for in-house laboratory sleep studies, but in the case of online wireless self-applicable ambulatory sleep studies, lower complexity and lower bit rates are preferred. In this study we further developed an earlier single-channel facial EMG/EOG/EEG-based automatic sleep stage classification method. An algorithm with a simple decision tree separated 30 s epochs into wakefulness, SREM, S1/S2 and SWS using 18-45 Hz beta power and 0.5-6 Hz amplitude. Improvements included low-complexity recursive digital filtering. We also evaluated the effects of a reduced sampling rate, a reduced number of quantization steps and a reduced dynamic range on the sleep data of 132 training and 131 testing subjects. With the studied algorithm, it was possible to reduce the sampling rate to 50 Hz (with a low-pass filter at 90 Hz) and the dynamic range to 244 microV with an 8-bit resolution, resulting in a bit rate of 0.4 kbit/s. Facial electrodes and a low bit rate enable the use of smaller devices for sleep stage classification in home environments.
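
    The bit rates quoted above follow directly from sampling rate x resolution x number of signals; the check below reproduces the 12.8 kbit/s and 0.4 kbit/s figures, with the four-signal reading of the standard montage being an illustrative assumption rather than a number stated in the abstract.

      # Quick check of the quoted bit rates (channel count for the standard case
      # is an illustrative assumption: central EEG, two EOG channels, and EMG).
      standard = 200 * 16 * 4   # 200 Hz sampling, 16-bit storage, ~4 signals
      reduced = 50 * 8 * 1      # 50 Hz sampling, 8-bit resolution, one facial channel
      print(standard / 1000, "kbit/s", reduced / 1000, "kbit/s")   # 12.8 and 0.4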

  4. Bayer image parallel decoding based on GPU

    NASA Astrophysics Data System (ADS)

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In photoelectrical tracking systems, Bayer images are traditionally decoded on the CPU. However, this is too slow when the images become large, for example 2K×2K×16 bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's Graphics Processing Unit (GPU), which supports the CUDA architecture. The decoding procedure can be divided into three parts: a serial part, a task-parallel part, and a data-parallel part, the latter including inverse quantization, the inverse discrete wavelet transform (IDWT), and image post-processing. To reduce the execution time, the task-parallel part is optimized with OpenMP techniques. The data-parallel part improves its efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared-memory access optimization, coalesced memory access optimization, and texture memory optimization. In particular, the IDWT is significantly sped up by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16 bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.

  5. Design and implementation of the modified signed digit multiplication routine on a ternary optical computer.

    PubMed

    Xu, Qun; Wang, Xianchao; Xu, Chao

    2017-06-01

    Multiplication on traditional electronic computers suffers from limited calculation accuracy and long computation delays. To overcome these problems, the modified signed digit (MSD) multiplication routine is established based on the MSD number system and the carry-free adder, and its parallel algorithm and optimization techniques are studied in detail. Exploiting the characteristics of a ternary optical computer, the structured data processor is designed especially for the multiplication routine. Several ternary optical operators are constructed to perform M transformations and summations in parallel, which accelerates the iterative process of the multiplication. In particular, the routine allocates the data bits of the ternary optical processor according to the number of digits in the multiplication inputs, so the accuracy of the calculation results can always satisfy the users. Finally, the routine is verified by simulation experiments, and the results fully comply with expectations. Compared with an electronic computer, the MSD multiplication routine is not only good at dealing with large-value data and high-precision arithmetic, but also maintains lower power consumption and shorter calculation delays.

  6. VizieR Online Data Catalog: OCSVM anomalies (Solarz+, 2017)

    NASA Astrophysics Data System (ADS)

    Solarz, A.; Bilicki, M.; Gromadzki, M.; Pollo, A.; Durkalec, A.; Wypych, M.

    2017-07-01

    One table containing 642,353 sources selected as anomalous with one-class support vector machine algorithm in AllWISE data release. Data have AllWISE photometry in W1, W2 and W3 passband and include W3 flux correction described in Krakowski et al. (2016A&A...596A..39K). (1 data file).

  7. A joint equalization algorithm in high speed communication systems

    NASA Astrophysics Data System (ADS)

    Hao, Xin; Lin, Changxing; Wang, Zhaohui; Cheng, Binbin; Deng, Xianjin

    2018-02-01

    This paper presents a joint equalization algorithm for high-speed communication systems. The algorithm combines the advantages of traditional equalization algorithms by using both pre-equalization and post-equalization. The pre-equalization stage uses the CMA algorithm, which is not sensitive to frequency offset; it is located before the carrier recovery loop so that the loop performs better and most of the frequency offset is overcome. The post-equalization stage uses the MMA algorithm in order to overcome the residual frequency offset. This paper first analyzes the advantages and disadvantages of several equalization algorithms and then simulates the proposed joint equalization algorithm on the Matlab platform. The simulation results include the constellation diagrams and the bit error rate curve, and both show that the proposed joint equalization algorithm is better than the traditional algorithms. The residual frequency offset is shown directly in the constellation diagrams. When the SNR is 14 dB, the bit error rate of the simulated system with the proposed joint equalization algorithm is 103 times better than the CMA algorithm, 77 times better than MMA equalization, and 9 times better than CMA-MMA equalization.
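
    The pre-equalizer's insensitivity to frequency offset comes from the fact that the CMA cost depends only on the modulus of the equalizer output, so a pure rotation of the constellation leaves the error term unchanged. A minimal CMA tap-update sketch is given below; the tap count, step size and constant-modulus radius are illustrative, and the MMA post-equalizer is not shown.

      # Minimal sketch of a constant-modulus algorithm (CMA) equalizer update.
      import numpy as np

      def cma_equalize(x, n_taps=11, mu=1e-3, radius=1.0):
          """x: complex received samples; returns equalized output samples."""
          w = np.zeros(n_taps, dtype=complex)
          w[n_taps // 2] = 1.0                         # center-spike initialization
          out = []
          for n in range(n_taps, len(x)):
              u = x[n - n_taps:n][::-1]                # tapped delay line
              y = np.dot(w, u)
              e = y * (radius**2 - np.abs(y) ** 2)     # error depends only on |y|
              w = w + mu * e * np.conj(u)              # stochastic-gradient tap update
              out.append(y)
          return np.array(out)

      rx = np.exp(1j * 0.01 * np.arange(2000)) * np.sign(np.random.randn(2000))  # toy input
      print(cma_equalize(rx)[:3])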

  8. Pattern recognition of electronic bit-sequences using a semiconductor mode-locked laser and spatial light modulators

    NASA Astrophysics Data System (ADS)

    Bhooplapur, Sharad; Akbulut, Mehmetkan; Quinlan, Franklyn; Delfyett, Peter J.

    2010-04-01

    A novel scheme for recognition of electronic bit-sequences is demonstrated. Two electronic bit-sequences that are to be compared are each mapped to a unique code from a set of Walsh-Hadamard codes. The codes are then encoded in parallel on the spectral phase of the frequency comb lines from a frequency-stabilized mode-locked semiconductor laser. Phase encoding is achieved by using two independent spatial light modulators based on liquid crystal arrays. Encoded pulses are compared using interferometric pulse detection and differential balanced photodetection. Orthogonal codes eight bits long are compared, and matched codes are successfully distinguished from mismatched codes with very low error rates, of around 10^-18. This technique has potential for high-speed, high accuracy recognition of bit-sequences, with applications in keyword searches and internet protocol packet routing.

  9. Capacity-optimized mp2 audio watermarking

    NASA Astrophysics Data System (ADS)

    Steinebach, Martin; Dittmann, Jana

    2003-06-01

    Today a number of audio watermarking algorithms have been proposed, some of them of a quality that makes them suitable for commercial applications. The focus of most of these algorithms is copyright protection; therefore, transparency and robustness are the most discussed and optimized parameters. But other applications for audio watermarking can also be identified, stressing other parameters like complexity or payload. In our paper, we introduce a new mp2 audio watermarking algorithm optimized for high payload. Our algorithm uses the scale factors of an mp2 file for watermark embedding. They are grouped and masked based on a pseudo-random pattern generated from a secret key. In each group, we embed one bit. Depending on the bit to embed, we change the scale factors by adding 1 where necessary until the group contains either more even or more odd scale factors. A group with more odd scale factors has a 1 embedded, a group with more even scale factors a 0. The same rule is later applied to detect the watermark. The group size can be increased or decreased for a transparency/payload trade-off. We embed 160 bits or more per second in an mp2 file without reducing perceived quality. As an application example, we introduce a prototype karaoke system displaying song lyrics embedded as a watermark.
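
    The embedding rule, adjusting the scale factors of a group until their even/odd balance encodes the desired bit, can be illustrated independently of the mp2 bitstream. The sketch below works on plain integer lists and uses a majority-parity rule as described; the keyed pseudo-random grouping and masking are omitted, and odd-sized groups are assumed so that a majority always exists.

      # Minimal sketch of parity-based embedding: add 1 to scale factors until the
      # group holds a majority of odd values (bit 1) or even values (bit 0).
      def embed_bit(scale_factors, bit):
          group = list(scale_factors)
          want_odd = (bit == 1)
          while True:
              odd = sum(v % 2 for v in group)
              if (odd > len(group) - odd) == want_odd and odd != len(group) - odd:
                  return group
              for i, v in enumerate(group):            # flip one "wrong-parity" value
                  if (v % 2 == 1) != want_odd:
                      group[i] = v + 1
                      break

      def extract_bit(group):
          odd = sum(v % 2 for v in group)
          return 1 if odd > len(group) - odd else 0

      g = embed_bit([12, 7, 20, 5, 9], 0)
      print(g, extract_bit(g))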

  10. Fitness Probability Distribution of Bit-Flip Mutation.

    PubMed

    Chicano, Francisco; Sutton, Andrew M; Whitley, L Darrell; Alba, Enrique

    2015-01-01

    Bit-flip mutation is a common mutation operator for evolutionary algorithms applied to optimize functions over binary strings. In this paper, we develop results from the theory of landscapes and Krawtchouk polynomials to exactly compute the probability distribution of fitness values of a binary string undergoing uniform bit-flip mutation. We prove that this probability distribution can be expressed as a polynomial in p, the probability of flipping each bit. We analyze these polynomials and provide closed-form expressions for an easy linear problem (Onemax), and an NP-hard problem, MAX-SAT. We also discuss a connection of the results with runtime analysis.
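
    For the Onemax special case the distribution can also be computed directly: if the parent has k ones among n bits and each bit flips with probability p, the offspring fitness is k minus a Binomial(k, p) plus a Binomial(n-k, p). The sketch below evaluates that distribution by convolving the two binomials; it covers this simple case only, not the general landscape/Krawtchouk-polynomial treatment of the paper.

      # Exact offspring-fitness distribution of uniform bit-flip mutation on Onemax.
      from math import comb

      def binom_pmf(m, p):
          return [comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(m + 1)]

      def onemax_mutation_distribution(n, k, p):
          px, py = binom_pmf(k, p), binom_pmf(n - k, p)    # flipped ones / flipped zeros
          dist = [0.0] * (n + 1)
          for x, qx in enumerate(px):
              for y, qy in enumerate(py):
                  dist[k - x + y] += qx * qy
          return dist                                      # dist[f] = P(offspring fitness = f)

      d = onemax_mutation_distribution(n=10, k=7, p=0.1)
      print(round(sum(d), 6), round(d[7], 4))              # probabilities sum to 1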

  11. A Hybrid Color Space for Skin Detection Using Genetic Algorithm Heuristic Search and Principal Component Analysis Technique

    PubMed Central

    2015-01-01

    Color is one of the most prominent features of an image and used in many skin and face detection applications. Color space transformation is widely used by researchers to improve face and skin detection performance. Despite the substantial research efforts in this area, choosing a proper color space in terms of skin and face classification performance which can address issues like illumination variations, various camera characteristics and diversity in skin color tones has remained an open issue. This research proposes a new three-dimensional hybrid color space termed SKN by employing the Genetic Algorithm heuristic and Principal Component Analysis to find the optimal representation of human skin color in over seventeen existing color spaces. Genetic Algorithm heuristic is used to find the optimal color component combination setup in terms of skin detection accuracy while the Principal Component Analysis projects the optimal Genetic Algorithm solution to a less complex dimension. Pixel wise skin detection was used to evaluate the performance of the proposed color space. We have employed four classifiers including Random Forest, Naïve Bayes, Support Vector Machine and Multilayer Perceptron in order to generate the human skin color predictive model. The proposed color space was compared to some existing color spaces and shows superior results in terms of pixel-wise skin detection accuracy. Experimental results show that by using Random Forest classifier, the proposed SKN color space obtained an average F-score and True Positive Rate of 0.953 and False Positive Rate of 0.0482 which outperformed the existing color spaces in terms of pixel wise skin detection accuracy. The results also indicate that among the classifiers used in this study, Random Forest is the most suitable classifier for pixel wise skin detection applications. PMID:26267377

  12. Control mechanism of double-rotator-structure ternary optical computer

    NASA Astrophysics Data System (ADS)

    Kai, SONG; Liping, YAN

    2017-03-01

    The double-rotator-structure ternary optical processor (DRSTOP) has two key characteristics, namely massively parallel data-bit computing and a reconfigurable processor; it can handle thousands of data bits in parallel and can run much faster than electronic computers and other optical computing systems reported so far. In order to put DRSTOP into practical application, this paper establishes a series of methods, namely a task classification method, a data-bit allocation method, a control information generation method, a control information formatting and sending method, and a method for obtaining decoded results. These methods form the control mechanism of DRSTOP, which makes it an automated computing platform. Compared with traditional computing tools, the DRSTOP computing platform can ease the contradiction between high energy consumption and big-data computing by greatly reducing the cost of communications and I/O. Finally, the paper designs a set of experiments for the DRSTOP control mechanism to verify its feasibility and correctness. Experimental results showed that the control mechanism is correct, feasible and efficient.

  13. Explanatory Supplement to the WISE All-Sky Release Products

    NASA Technical Reports Server (NTRS)

    2012-01-01

    The Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010) surveyed the entire sky at 3.4, 4.6, 12 and 22 microns in 2010, achieving 5-sigma point source sensitivities per band better than 0.08, 0.11, 1 and 6 mJy in unconfused regions on the ecliptic. The WISE All-Sky Data Release, conducted on March 14, 2012, incorporates all data taken during the full cryogenic mission phase, 7 January 2010 to 6 August 2010, that were processed with improved calibrations and reduction algorithms. Release data products include: (1) an Atlas of 18,240 match-filtered, calibrated and coadded image sets; (2) a Source Catalog containing positions and four-band photometry for over 563 million objects, and (3) an Explanatory Supplement. Ancillary products include a Reject Table that contains 284 million detections that were not selected for the Source Catalog because they are low signal-to-noise ratio or spurious detections of image artifacts, an archive of over 1.5 million sets of calibrated WISE Single-exposure images, and a database of 9.4 billion source extractions from those single images, and moving object tracklets identified by the NEOWISE program (Mainzer et al. 2011). The WISE All-Sky Data Release products supersede those from the WISE Preliminary Data Release (Cutri et al. 2011). The Explanatory Supplement to the WISE All-Sky Data Release Products is a general guide for users of the WISE data. The Supplement contains an overview of the WISE mission, facilities, and operations, a detailed description of WISE data processing algorithms, a guide to the content and formats of the image and tabular data products, and cautionary notes that describe known limitations of the All-Sky Release products. Instructions for accessing the WISE data products via the services of the NASA/IPAC Infrared Science Archive are provided. The Supplement also provides analyses of the achieved sky coverage, photometric and astrometric characteristics and completeness and reliability of the All-Sky Release data products. The WISE All-Sky Release Explanatory Supplement is an on-line document that is updated frequently to provide the most current information for users of the WISE data products. The Explanatory Supplement is maintained at: http://wise2.ipac.caltech.edu/docs/release/allsky/expsup/index.html WISE is a joint project of the University of California, Los Angeles and the Jet Propulsion Laboratory/California Institute of Technology, funded by the National Aeronautics and Space Administration. NEOWISE is a project of the Jet Propulsion Laboratory/California Institute of Technology, funded by the Planetary Science Division of the National Aeronautics and Space Administration.

  14. JPEG 2000 Encoding with Perceptual Distortion Control

    NASA Technical Reports Server (NTRS)

    Watson, Andrew B.; Liu, Zhen; Karam, Lina J.

    2008-01-01

    An alternative approach has been devised for encoding image data in compliance with JPEG 2000, the most recent still-image data-compression standard of the Joint Photographic Experts Group. Heretofore, JPEG 2000 encoding has been implemented by several related schemes classified as rate-based distortion-minimization encoding. In each of these schemes, the end user specifies a desired bit rate and the encoding algorithm strives to attain that rate while minimizing a mean squared error (MSE). While rate-based distortion minimization is appropriate for transmitting data over a limited-bandwidth channel, it is not the best approach for applications in which the perceptual quality of reconstructed images is a major consideration. A better approach for such applications is the present alternative one, denoted perceptual distortion control, in which the encoding algorithm strives to compress data to the lowest bit rate that yields at least a specified level of perceptual image quality. Some additional background information on JPEG 2000 is prerequisite to a meaningful summary of JPEG 2000 encoding with perceptual distortion control. The JPEG 2000 encoding process includes two subprocesses known as tier-1 and tier-2 coding. In order to minimize the MSE for the desired bit rate, a rate-distortion-optimization subprocess is introduced between the tier-1 and tier-2 subprocesses. In tier-1 coding, each coding block is independently bit-plane coded from the most-significant-bit (MSB) plane to the least-significant-bit (LSB) plane, using three coding passes (except for the MSB plane, which is coded using only one "clean up" coding pass). For M bit planes, this subprocess involves a total number of (3M - 2) coding passes. An embedded bit stream is then generated for each coding block. Information on the reduction in distortion and the increase in the bit rate associated with each coding pass is collected. This information is then used in a rate-control procedure to determine the contribution of each coding block to the output compressed bit stream.
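
    As a quick check of the pass count mentioned above: with three passes per bit plane except the MSB plane, which needs only the single clean-up pass, M bit planes give 3(M - 1) + 1 = 3M - 2 coding passes.

      # Tier-1 coding passes: three per bit plane, one for the MSB plane.
      def tier1_passes(m_bit_planes):
          return 3 * (m_bit_planes - 1) + 1     # = 3M - 2

      print([tier1_passes(m) for m in (1, 8, 12)])   # -> [1, 22, 34]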

  15. Bandwidth reduction for video-on-demand broadcasting using secondary content insertion

    NASA Astrophysics Data System (ADS)

    Golynski, Alexander; Lopez-Ortiz, Alejandro; Poirier, Guillaume; Quimper, Claude-Guy

    2005-01-01

    An optimal broadcasting scheme in the presence of secondary content (i.e. advertisements) is proposed. The proposed scheme works for movies encoded in either a Constant Bit Rate (CBR) or a Variable Bit Rate (VBR) format. It is shown experimentally that secondary content in movies can make Video-on-Demand (VoD) broadcasting systems more efficient. An efficient algorithm is given to compute the optimal broadcasting schedule with secondary content, which in particular significantly improves over the best previously known algorithm for computing the optimal broadcasting schedule without secondary content.

  16. A unified framework of unsupervised subjective optimized bit allocation for multiple video object coding

    NASA Astrophysics Data System (ADS)

    Chen, Zhenzhong; Han, Junwei; Ngan, King Ngi

    2005-10-01

    MPEG-4 treats a scene as a composition of several objects or so-called video object planes (VOPs) that are separately encoded and decoded. Such a flexible video coding framework makes it possible to code different video objects with different distortion scales. It is necessary to analyze the priority of the video objects according to their semantic importance, intrinsic properties and psycho-visual characteristics, so that the bit budget can be distributed properly among the video objects to improve the perceptual quality of the compressed video. This paper aims to provide an automatic video object priority definition method based on an object-level visual attention model, and further to propose an optimization framework for video object bit allocation. One significant contribution of this work is that the human visual system characteristics are incorporated into the video coding optimization process. Another advantage is that the priority of each video object can be obtained automatically instead of fixing weighting factors before encoding or relying on user interactivity. To evaluate the performance of the proposed approach, we compare it with the traditional verification model bit allocation and the optimal multiple video object bit allocation algorithms. Compared with traditional bit allocation algorithms, the objective quality of the object with higher priority is significantly improved under this framework. These results demonstrate the usefulness of this unsupervised subjective quality lifting framework.

  17. Robust group-wise rigid registration of point sets using t-mixture model

    NASA Astrophysics Data System (ADS)

    Ravikumar, Nishant; Gooya, Ali; Frangi, Alejandro F.; Taylor, Zeike A.

    2016-03-01

    A probabilistic framework is presented for robust, group-wise rigid alignment of point sets using a mixture of Student's t-distributions, especially when the point sets are of varying lengths, are corrupted by an unknown degree of outliers, or contain missing data. Medical images (in particular magnetic resonance (MR) images), their segmentations, and consequently the point sets generated from them are highly susceptible to corruption by outliers. This poses a problem for robust correspondence estimation and accurate alignment of shapes, which are necessary for training statistical shape models (SSMs). To address these issues, this study proposes to use a t-mixture model (TMM) to approximate the underlying joint probability density of a group of similar shapes and align them to a common reference frame. The heavy-tailed nature of t-distributions provides a more robust registration framework in comparison to state-of-the-art algorithms. A significant reduction in alignment errors is achieved in the presence of outliers using the proposed TMM-based group-wise rigid registration method, in comparison to its Gaussian mixture model (GMM) counterparts. The proposed TMM framework is compared with a group-wise variant of the well-known Coherent Point Drift (CPD) algorithm and two other group-wise methods using GMMs, using both synthetic and real data sets. Rigid alignment errors for groups of shapes are quantified using the Hausdorff distance (HD) and quadratic surface distance (QSD) metrics.

  18. Error correcting circuit design with carbon nanotube field effect transistors

    NASA Astrophysics Data System (ADS)

    Liu, Xiaoqiang; Cai, Li; Yang, Xiaokuo; Liu, Baojun; Liu, Zhongyong

    2018-03-01

    In this work, a parallel error correcting circuit based on the (7, 4) Hamming code is designed and implemented with carbon nanotube field effect transistors (CNTFETs), and its function is validated by simulation in HSpice with the Stanford model. A grouping method that is able to correct multiple bit errors in 16-bit and 32-bit applications is proposed, and its error correction capability is analyzed. The performance of circuits implemented with CNTFETs and with traditional MOSFETs is also compared; the former shows a 34.4% reduction in layout area and a 56.9% reduction in power consumption.
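
    For reference, the single-error-correcting behaviour of the (7, 4) Hamming code that the circuit implements can be sketched in a few lines. The systematic generator and parity-check matrices below follow one common convention and may differ from the bit ordering chosen in the hardware design.

      # Minimal sketch of (7,4) Hamming encode/decode in systematic form (mod-2).
      import numpy as np

      G = np.array([[1,0,0,0, 1,1,0],
                    [0,1,0,0, 1,0,1],
                    [0,0,1,0, 0,1,1],
                    [0,0,0,1, 1,1,1]])
      H = np.array([[1,1,0,1, 1,0,0],
                    [1,0,1,1, 0,1,0],
                    [0,1,1,1, 0,0,1]])

      def encode(data4):
          return (np.array(data4) @ G) % 2

      def decode(code7):
          r = np.array(code7).copy()
          syndrome = (H @ r) % 2
          if syndrome.any():                               # locate the single flipped bit
              col = [tuple(H[:, j]) for j in range(7)].index(tuple(syndrome))
              r[col] ^= 1
          return r[:4]                                     # systematic: data = first 4 bits

      c = encode([1, 0, 1, 1])
      c[2] ^= 1                                            # inject a single-bit error
      print(decode(c))                                     # -> [1 0 1 1]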

  19. Reduction from cost-sensitive ordinal ranking to weighted binary classification.

    PubMed

    Lin, Hsuan-Tien; Li, Ling

    2012-05-01

    We present a reduction framework from ordinal ranking to binary classification. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranker from the binary classifier. Based on the framework, we show that a weighted 0/1 loss of the binary classifier upper-bounds the mislabeling cost of the ranker, both error-wise and regret-wise. Our framework allows not only the design of good ordinal ranking algorithms based on well-tuned binary classification approaches, but also the derivation of new generalization bounds for ordinal ranking from known bounds for binary classification. In addition, our framework unifies many existing ordinal ranking algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms. In addition, the newly designed algorithms lead to better cost-sensitive ordinal ranking performance, as well as improved listwise ranking performance.
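
    The first step of the framework, turning each ordinal example into weighted binary examples that ask "is the rank greater than k?", can be sketched as follows. The thresholded feature encoding and the uniform (absolute-cost) weighting used here are one common instantiation and may differ in detail from the paper's construction.

      # Minimal sketch of the "extended examples" step of a reduction from ordinal
      # ranking to weighted binary classification: one binary example per threshold k.
      def extend_example(x, rank, num_ranks):
          """x: feature tuple, rank: label in 1..num_ranks; yields (features, label, weight)."""
          for k in range(1, num_ranks):
              threshold_code = tuple(1 if t == k else 0 for t in range(1, num_ranks))
              label = 1 if rank > k else -1          # "is the rank greater than k?"
              weight = 1.0                           # absolute cost: all thresholds equal
              yield tuple(x) + threshold_code, label, weight

      for ex in extend_example((0.3, 1.2), rank=3, num_ranks=4):
          print(ex)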

  20. Fast parallel algorithm for slicing STL based on pipeline

    NASA Astrophysics Data System (ADS)

    Ma, Xulong; Lin, Feng; Yao, Bo

    2016-05-01

    In the field of Additive Manufacturing, current research on data processing mainly focuses on the slicing of large STL files or complicated CAD models. To improve efficiency and reduce slicing time, a parallel algorithm has great advantages. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that the number of threads and the number of layers are two significant factors for the speedup ratio. The trend of speedup versus the number of threads shows a positive relationship that agrees well with Amdahl's law, and the trend of speedup versus the number of layers also shows a positive relationship, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel speedup method. Another parallel algorithm based on data parallelism is used in the experiments to show that the pipeline parallel mode is more efficient. A concluding case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of multi-core CPU hardware and accelerate the slicing process; compared with the data-parallel slicing algorithm, the new algorithm adopts a pipeline parallel model and achieves a much higher speedup ratio and efficiency.

  1. Direct bit detection receiver noise performance analysis for 32-PSK and 64-PSK modulated signals

    NASA Astrophysics Data System (ADS)

    Ahmed, Iftikhar

    1987-12-01

    Simple two-channel receivers for 32-PSK and 64-PSK modulated signals have been proposed which allow digital data (namely bits) to be recovered directly, instead of the traditional approach of symbol detection followed by symbol-to-bit mapping. This allows for binary rather than M-ary receiver decisions, reduces the amount of signal processing, and permits parallel recovery of the bits. The noise performance of these receivers, quantified by the Bit Error Rate (BER) under an Additive White Gaussian Noise interference model, is evaluated as a function of Eb/No, the signal-to-noise ratio, and the transmitted phase angles of the signals. The performance results of the direct bit detection receivers (DBDRs), when compared to those of conventional phase measurement receivers, demonstrate that DBDRs are optimum in the BER sense. The simplicity of the receiver implementations and the BER of the delivered data make DBDRs attractive for high-speed, spectrally efficient digital communication systems.

  2. Demonstration of an optical directed half-subtracter using integrated silicon photonic circuits.

    PubMed

    Liu, Zilong; Zhao, Yongpeng; Xiao, Huifu; Deng, Lin; Meng, Yinghao; Guo, Xiaonan; Liu, Guipeng; Tian, Yonghui; Yang, Jianhong

    2018-04-01

    An integrated silicon photonic circuit consisting of two silicon microring resonators (MRRs) is proposed and experimentally demonstrated for half-subtraction operation. The thermo-optic modulation scheme is employed to modulate the MRRs due to its relatively simple fabrication process. The high and low levels of the electrical pulse signal define logic 1 and 0 in the electrical domain, respectively, and the high and low levels of the optical power represent logic 1 and 0 in the optical domain, respectively. Two electrical pulse sequences, regarded as the operands, are applied to the corresponding micro-heaters fabricated on top of the MRRs to achieve their dynamic modulation. The final operation results of bit-wise borrow and difference are obtained at their corresponding output ports in the form of light. Finally, the subtraction of two bits at an operation speed of 10 kbps is demonstrated successfully.
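
    Logically, the two outputs the device produces are just the half-subtractor truth table for A - B: difference = A XOR B and borrow = (NOT A) AND B. A two-line check of that table:

      # Half-subtractor logic (difference, borrow) for single bits a - b.
      def half_subtract(a, b):
          return a ^ b, (1 - a) & b

      print([(a, b, *half_subtract(a, b)) for a in (0, 1) for b in (0, 1)])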

  3. Facial emotion recognition system for autistic children: a feasible study based on FPGA implementation.

    PubMed

    Smitha, K G; Vinod, A P

    2015-11-01

    Children with autism spectrum disorder have difficulty understanding the emotional and mental states from the facial expressions of the people they interact with. The inability to understand other people's emotions hinders their interpersonal communication. Though many facial emotion recognition algorithms have been proposed in the literature, they are mainly intended for processing on a personal computer, which limits their usability in on-the-move applications where portability is desired. Portability ensures ease of use and real-time emotion recognition, and aids immediate feedback while communicating with caretakers. Principal component analysis (PCA) has been identified as the least complex feature extraction algorithm to implement in hardware. In this paper, we present a detailed study of serial and parallel implementations of PCA in order to identify the most feasible method for realizing a portable emotion detector for autistic children. The proposed emotion recognizer architectures are implemented on a Virtex 7 XC7VX330T FFG1761-3 FPGA. We achieved 82.3% detection accuracy for a word length of 8 bits.

  4. Malleable architecture generator for FPGA computing

    NASA Astrophysics Data System (ADS)

    Gokhale, Maya; Kaba, James; Marks, Aaron; Kim, Jang

    1996-10-01

    The malleable architecture generator (MARGE) is a tool set that translates high-level parallel C to configuration bit streams for field-programmable logic based computing systems. MARGE creates an application-specific instruction set and generates the custom hardware components required to perform exactly those computations specified by the C program. In contrast to traditional fixed-instruction processors, MARGE's dynamic instruction set creation provides for efficient use of hardware resources. MARGE processes intermediate code in which each operation is annotated by the bit lengths of the operands. Each basic block (sequence of straight line code) is mapped into a single custom instruction which contains all the operations and logic inherent in the block. A synthesis phase maps the operations comprising the instructions into register transfer level structural components and control logic which have been optimized to exploit functional parallelism and function unit reuse. As a final stage, commercial technology-specific tools are used to generate configuration bit streams for the desired target hardware. Technology- specific pre-placed, pre-routed macro blocks are utilized to implement as much of the hardware as possible. MARGE currently supports the Xilinx-based Splash-2 reconfigurable accelerator and National Semiconductor's CLAy-based parallel accelerator, MAPA. The MARGE approach has been demonstrated on systolic applications such as DNA sequence comparison.

  5. A parallel algorithm for the eigenvalues and eigenvectors for a general complex matrix

    NASA Technical Reports Server (NTRS)

    Shroff, Gautam

    1989-01-01

    A new parallel Jacobi-like algorithm is developed for computing the eigenvalues of a general complex matrix. Most parallel methods for this problem typically display only linear convergence. Sequential norm-reducing algorithms also exist, and they display quadratic convergence in most cases. The new algorithm is a parallel form of the norm-reducing algorithm due to Eberlein. It is proven that the asymptotic convergence rate of this algorithm is quadratic. Numerical experiments are presented which demonstrate the quadratic convergence of the algorithm; certain situations where the convergence is slow are also identified. The algorithm promises to be very competitive on a variety of parallel architectures.

  6. Arikan and Alamouti matrices based on fast block-wise inverse Jacket transform

    NASA Astrophysics Data System (ADS)

    Lee, Moon Ho; Khan, Md Hashem Ali; Kim, Kyeong Jin

    2013-12-01

    Recently, Lee and Hou (IEEE Signal Process Lett 13: 461-464, 2006) proposed one-dimensional and two-dimensional fast algorithms for block-wise inverse Jacket transforms (BIJTs). Their BIJTs are not true inverse Jacket transforms from a mathematical point of view because their inverses do not satisfy the usual condition, i.e., the multiplication of a matrix with its inverse matrix is not equal to the identity matrix. Therefore, we mathematically propose a fast block-wise inverse Jacket transform of orders N = 2^k, 3^k, 5^k, and 6^k, where k is a positive integer. Based on the Kronecker product of the successive lower-order Jacket matrices and the basis matrix, fast algorithms for realizing these transforms are obtained. Due to the simple inverses and fast algorithms of the Arikan polar binary and Alamouti multiple-input multiple-output (MIMO) non-binary matrices obtained from BIJTs, they can be applied in areas such as permutation matrix design for the 3GPP physical layer for ultra mobile broadband, first-order q-ary Reed-Muller code design, diagonal channel design, diagonal subchannel decomposition for interference alignment, and Alamouti precoding design for 4G MIMO long-term evolution.

  7. New adaptive color quantization method based on self-organizing maps.

    PubMed

    Chang, Chip-Hong; Xu, Pengfei; Xiao, Rui; Srikanthan, Thambipillai

    2005-01-01

    Color quantization (CQ) is an image processing task popularly used to convert true color images to palettized images for limited color display devices. To minimize the contouring artifacts introduced by the reduction of colors, a new competitive learning (CL) based scheme called the frequency sensitive self-organizing maps (FS-SOMs) is proposed to optimize the color palette design for CQ. FS-SOM harmonically blends the neighborhood adaptation of the well-known self-organizing maps (SOMs) with the neuron-dependent frequency sensitive learning model, the global butterfly permutation sequence for input randomization, and the reinitialization of dead neurons to harness effective utilization of neurons. The net effect is an improvement in adaptation, a well-ordered color palette, and the alleviation of the underutilization problem, which is the main cause of visually perceivable artifacts of CQ. Extensive simulations have been performed to analyze and compare the learning behavior and performance of FS-SOM against other vector quantization (VQ) algorithms. The results show that the proposed FS-SOM outperforms classical CL, Linde, Buzo, and Gray (LBG), and SOM algorithms. More importantly, FS-SOM achieves its superiority in reconstruction quality and topological ordering with a much greater robustness against variations in network parameters than the current state-of-the-art SOM algorithm for CQ. A most significant bit (MSB) biased encoding scheme is also introduced to reduce the number of parallel processing units. By mapping the pixel values as sign-magnitude numbers and biasing the magnitudes according to their sign bits, eight lattice points in the color space are condensed into one common point density function. Consequently, the same processing element can be used to map several color clusters and the entire FS-SOM network can be substantially scaled down without severely sacrificing the quality of the displayed image. The drawback of this encoding scheme is the additional storage overhead, which can be cut down by leveraging an existing encoder in an overall lossy compression scheme.

  8. Efficient Parallel Levenberg-Marquardt Model Fitting towards Real-Time Automated Parametric Imaging Microscopy

    PubMed Central

    Zhu, Xiang; Zhang, Dianwen

    2013-01-01

    We present a fast, accurate and robust parallel Levenberg-Marquardt minimization optimizer, GPU-LMFit, which is implemented on graphics processing unit for high performance scalable parallel model fitting processing. GPU-LMFit can provide a dramatic speed-up in massive model fitting analyses to enable real-time automated pixel-wise parametric imaging microscopy. We demonstrate the performance of GPU-LMFit for the applications in superresolution localization microscopy and fluorescence lifetime imaging microscopy. PMID:24130785

  9. Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Brouwer, Randall Jay

    1991-01-01

    The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.

  10. Region-of-interest determination and bit-rate conversion for H.264 video transcoding

    NASA Astrophysics Data System (ADS)

    Huang, Shu-Fen; Chen, Mei-Juan; Tai, Kuang-Han; Li, Mian-Shiuan

    2013-12-01

    This paper presents a video bit-rate transcoder for the baseline profile of the H.264/AVC standard to fit the available channel bandwidth for the client when transmitting video bit-streams via communication channels. To maintain visual quality for low bit-rate video efficiently, this study analyzes the decoded information in the transcoder and proposes a Bayesian theorem-based region-of-interest (ROI) determination algorithm. In addition, a curve fitting scheme is employed to find the models of video bit-rate conversion. The transcoded video conforms to the target bit-rate by re-quantization according to our proposed models. After integrating the ROI detection method and the bit-rate transcoding models, the ROI-based transcoder allocates more coding bits to ROI regions and reduces the complexity of the re-encoding procedure for non-ROI regions. Hence, it not only preserves coding quality but also improves the efficiency of video transcoding at low target bit-rates, making real-time transcoding more practical. Experimental results show that the proposed framework achieves significantly better visual quality.

  11. Optical LDPC decoders for beyond 100 Gbits/s optical transmission.

    PubMed

    Djordjevic, Ivan B; Xu, Lei; Wang, Ting

    2009-05-01

    We present an optical low-density parity-check (LDPC) decoder suitable for implementation above 100 Gbits/s, which provides large coding gains when based on large-girth LDPC codes. We show that a basic building block, the probabilities multiplier circuit, can be implemented using a Mach-Zehnder interferometer, and we propose a corresponding probabilistic-domain sum-product algorithm (SPA). We perform simulations of a fully parallel implementation employing girth-10 LDPC codes and the proposed SPA. The girth-10 LDPC(24015,19212) code with a rate of 0.8 outperforms the BCH(128,113)xBCH(256,239) turbo-product code with a rate of 0.82 by 0.91 dB (for binary phase-shift keying at 100 Gbits/s and a bit error rate of 10^-9), and provides a net effective coding gain of 10.09 dB.

  12. On the improvement of neural cryptography using erroneous transmitted information with error prediction.

    PubMed

    Allam, Ahmed M; Abbas, Hazem M

    2010-12-01

    Neural cryptography deals with the problem of "key exchange" between two neural networks using the mutual learning concept. The two networks exchange their outputs (in bits) and the key between the two communicating parties is eventually represented in the final learned weights, when the two networks are said to be synchronized. Security of neural synchronization is put at risk if an attacker is capable of synchronizing with either of the two parties during the training process. Therefore, diminishing the probability of such a threat improves the reliability of exchanging the output bits through a public channel. The synchronization with feedback algorithm is one of the existing algorithms that enhances the security of neural cryptography. This paper proposes three new algorithms to enhance the mutual learning process. They mainly depend on disrupting the attacker's confidence in the exchanged outputs and input patterns during training. The first algorithm is called "Do not Trust My Partner" (DTMP), which relies on one party sending erroneous output bits, with the other party being capable of predicting and correcting this error. The second algorithm is called "Synchronization with Common Secret Feedback" (SCSFB), where inputs are kept partially secret and the attacker has to train its network on input patterns that are different from the training sets used by the communicating parties. The third algorithm is a hybrid technique combining the features of the DTMP and SCSFB. The proposed approaches are shown to outperform the synchronization with feedback algorithm in the time needed for the parties to synchronize.
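
    A hedged sketch of the tree parity machine mutual learning that underlies such neural key exchange; the DTMP error injection is indicated only by a comment, and all parameter values are illustrative rather than taken from the paper.

```python
# Hedged sketch of tree parity machine (TPM) mutual learning as used in neural key
# exchange. K, N, L and the number of rounds are illustrative; the DTMP idea of
# occasionally transmitting a flipped output is indicated by a comment only.
import numpy as np

K, N, L = 3, 10, 4                                   # hidden units, inputs each, weight bound
rng = np.random.default_rng(0)

def tpm_output(w, x):
    sigma = np.sign(np.sum(w * x, axis=1))
    sigma[sigma == 0] = -1                           # break ties deterministically
    return sigma, int(np.prod(sigma))

def hebbian_update(w, x, sigma, tau_self, tau_partner):
    if tau_self == tau_partner:                      # learn only when outputs agree
        for k in range(K):
            if sigma[k] == tau_self:
                w[k] = np.clip(w[k] + sigma[k] * x[k], -L, L)
    return w

w_a = rng.integers(-L, L + 1, size=(K, N)).astype(float)
w_b = rng.integers(-L, L + 1, size=(K, N)).astype(float)
for _ in range(2000):
    x = rng.choice([-1.0, 1.0], size=(K, N))
    s_a, tau_a = tpm_output(w_a, x)
    s_b, tau_b = tpm_output(w_b, x)
    # DTMP would sometimes send a deliberately flipped tau_a here, with the partner
    # predicting and correcting the flip before updating.
    w_a = hebbian_update(w_a, x, s_a, tau_a, tau_b)
    w_b = hebbian_update(w_b, x, s_b, tau_b, tau_a)
print("weights identical after 2000 rounds:", np.array_equal(w_a, w_b))
```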

  13. High-performance floating-point image computing workstation for medical applications

    NASA Astrophysics Data System (ADS)

    Mills, Karl S.; Wong, Gilman K.; Kim, Yongmin

    1990-07-01

    The medical imaging field relies increasingly on imaging and graphics techniques in diverse applications with needs similar to (or more stringent than) those of the military, industrial and scientific communities. However, most image processing and graphics systems available for use in medical imaging today are either expensive, specialized, or in most cases both. High performance imaging and graphics workstations which can provide real-time results for a number of applications, while maintaining affordability and flexibility, can facilitate the application of digital image computing techniques in many different areas. This paper describes the hardware and software architecture of a medium-cost floating-point image processing and display subsystem for the NeXT computer, and its applications as a medical imaging workstation. Medical imaging applications of the workstation include use in a Picture Archiving and Communications System (PACS), use as a multimodal image processing and 3-D graphics workstation for a broad range of imaging modalities, and use as an electronic alternator utilizing its multiple monitor display capability and large and fast frame buffer. The subsystem provides a 2048 x 2048 x 32-bit frame buffer (16 Mbytes of image storage) and supports both 8-bit gray scale and 32-bit true color images. When used to display 8-bit gray scale images, up to four different 256-color palettes may be used for each of four 2K x 2K x 8-bit image frames. Three of these image frames can be used simultaneously to provide pixel-selectable region-of-interest display. A 1280 x 1024 pixel screen with 1:1 aspect ratio can be windowed into the frame buffer for display of any portion of the processed image or images. In addition, the system provides hardware support for integer zoom and an 82-color cursor. This subsystem is implemented on an add-in board occupying a single slot in the NeXT computer. Up to three boards may be added to the NeXT for multiple display capability (e.g., three 1280 x 1024 monitors, each with a 16-Mbyte frame buffer). Each add-in board provides an expansion connector to which an optional image computing coprocessor board may be added. Each coprocessor board supports up to four processors for a peak performance of 160 MFLOPS. The coprocessors can execute programs from external high-speed microcode memory as well as built-in internal microcode routines. The internal microcode routines provide support for 2-D and 3-D graphics operations, matrix and vector arithmetic, and image processing in integer, IEEE single-precision floating point, or IEEE double-precision floating point. In addition to providing a library of C functions which links the NeXT computer to the add-in board and supports its various operational modes, algorithms and medical imaging application programs are being developed and implemented for image display and enhancement. As an extension to the built-in algorithms of the coprocessors, 2-D Fast Fourier Transform (FFT), 2-D inverse FFT, convolution, warping and other algorithms (e.g., Discrete Cosine Transform) which exploit the parallel architecture of the coprocessor board are being implemented.

  14. An ablative pulsed plasma thruster with a segmented anode

    NASA Astrophysics Data System (ADS)

    Zhang, Zhe; Ren, Junxue; Tang, Haibin; Ling, William Yeong Liang; York, Thomas M.

    2018-01-01

    An ablative pulsed plasma thruster (APPT) design with a ‘segmented anode’ is proposed in this paper. We aim to examine the effect that this asymmetric electrode configuration (a normal cathode and a segmented anode) has on the performance of an APPT. The magnetic field of the discharge arc, plasma density in the exit plume, impulse bit, and thrust efficiency were studied using a magnetic probe, Langmuir probe, thrust stand, and mass bit measurements, respectively. When compared with conventional symmetric parallel electrodes, the segmented anode APPT shows an improvement in the impulse bit of up to 28%. The thrust efficiency is also improved by 49% (from 5.3% to 7.9% for conventional and segmented designs, respectively). Long-exposure broadband emission images of the discharge morphology show that compared with a normal anode, a segmented anode results in clear differences in the luminous discharge morphology and better collimation of the plasma. The magnetic probe data indicate that the segmented anode APPT exhibits a higher current density in the discharge arc. Furthermore, Langmuir probe data collected from the central exit plane show that the peak electron density is 75% higher than with conventional parallel electrodes. These results are believed to be fundamental to the physical mechanisms behind the increased impulse bit of an APPT with a segmented electrode.

  15. Burst-mode optical label processor with ultralow power consumption.

    PubMed

    Ibrahim, Salah; Nakahara, Tatsushi; Ishikawa, Hiroshi; Takahashi, Ryo

    2016-04-04

    A novel label processor subsystem for 100-Gbps (25-Gbps × 4λs) burst-mode optical packets is developed, in which a highly energy-efficient method is pursued for extracting and interfacing the ultrafast packet-label to a CMOS-based processor where label recognition takes place. The method involves performing serial-to-parallel conversion for the label bits on a bit-by-bit basis by using an optoelectronic converter that is operated with a set of optical triggers generated in a burst-mode manner upon packet arrival. Here we present three key achievements that enabled a significant reduction in the total power consumption and latency of the whole subsystem: (1) based on a novel operation mechanism for providing amplification with bit-level selectivity, an optical trigger pulse generator, that consumes power for a very short duration upon packet arrival, is proposed and experimentally demonstrated; (2) the energy of optical triggers needed by the optoelectronic serial-to-parallel converter is reduced by utilizing a negative-polarity signal while employing an enhanced conversion scheme entitled the discharge-or-hold scheme; (3) the necessary optical trigger energy is further cut down by half by coupling the triggers through the chip's backside, whereas a novel lens-free packaging method is developed to enable a low-cost alignment process that works with simple visual observation.

  16. Back-end and interface implementation of the STS-XYTER2 prototype ASIC for the CBM experiment

    NASA Astrophysics Data System (ADS)

    Kasinski, K.; Szczygiel, R.; Zabolotny, W.

    2016-11-01

    Each front-end readout ASIC for the High-Energy Physics experiments requires robust and effective hit data streaming and control mechanism. A new STS-XYTER2 full-size prototype chip for the Silicon Tracking System and Muon Chamber detectors in the Compressed Baryonic Matter experiment at Facility for Antiproton and Ion Research (FAIR, Germany) is a 128-channel time and amplitude measuring solution for silicon microstrip and gas detectors. It operates at 250 kHit/s/channel hit rate, each hit producing 27 bits of information (5-bit amplitude, 14-bit timestamp, position and diagnostics data). The chip back-end implements fast front-end channel read-out, timestamp-wise hit sorting, and data streaming via a scalable interface implementing the dedicated protocol (STS-HCTSP) for chip control and hit transfer with data bandwidth from 9.7 MHit/s up to 47 MHit/s. It also includes multiple options for link diagnostics, failure detection, and throttling features. The back-end is designed to operate with the data acquisition architecture based on the CERN GBTx transceivers. This paper presents the details of the back-end and interface design and its implementation in the UMC 180 nm CMOS process.

  17. LSB Based Quantum Image Steganography Algorithm

    NASA Astrophysics Data System (ADS)

    Jiang, Nan; Zhao, Na; Wang, Luo

    2016-01-01

    Quantum steganography is the technique which hides a secret message into quantum covers such as quantum images. In this paper, two blind LSB steganography algorithms in the form of quantum circuits are proposed based on the novel enhanced quantum representation (NEQR) for quantum images. One algorithm is plain LSB, which uses the message bits to substitute for the pixels' LSBs directly. The other is block LSB, which embeds a message bit into a number of pixels that belong to one image block. The extracting circuits can recover the secret message using only the stego cover. Analysis and simulation-based experimental results demonstrate that the invisibility is good, and that the balance between the capacity and the robustness can be adjusted according to the needs of applications.

  18. Iterative algorithms for large sparse linear systems on parallel computers

    NASA Technical Reports Server (NTRS)

    Adams, L. M.

    1982-01-01

    Algorithms are developed for assembling in parallel the sparse systems of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed, and results of this model for the algorithms are given.
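
    As an illustration of why linear stationary iterations parallelize well, the Jacobi sketch below updates every component from the previous iterate only, so all components can be computed concurrently; it is a textbook example, not one of the report's algorithms.

```python
# Hedged sketch of a Jacobi (linear stationary) iteration, the kind of scheme that
# parallelizes naturally because each component update depends only on the previous
# iterate. Generic textbook Jacobi, not the specific algorithms of the report.
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=10_000):
    D = np.diag(A)                      # diagonal part
    R = A - np.diagflat(D)              # off-diagonal remainder
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_new = (b - R @ x) / D         # every component computable in parallel
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            return x_new
        x = x_new
    return x

if __name__ == "__main__":
    A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(jacobi(A, b))
```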

  19. VLSI design of an RSA encryption/decryption chip using systolic array based architecture

    NASA Astrophysics Data System (ADS)

    Sun, Chi-Chia; Lin, Bor-Shing; Jan, Gene Eu; Lin, Jheng-Yi

    2016-09-01

    This article presents the VLSI design of a configurable RSA public key cryptosystem supporting 512-bit, 1024-bit and 2048-bit keys based on the Montgomery algorithm, achieving clock-cycle counts comparable to current relevant works but with a smaller die size. We use the binary method for the modular exponentiation and adopt the Montgomery algorithm for the modular multiplication to simplify computational complexity, which, together with the systolic array concept for the electric circuit design, effectively lowers the die size. The main architecture of the chip consists of four functional blocks, namely input/output modules, registers module, arithmetic module and control module. We applied the systolic array concept to design the RSA encryption/decryption chip using the VHDL hardware description language and verified it using the TSMC/CIC 0.35 μm 1P4M technology. The die area of the 2048-bit RSA chip without the DFT is 3.9 × 3.9 mm2 (4.58 × 4.58 mm2 with DFT). Its average baud rate can reach 10.84 kbps under a 100 MHz clock.
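
    A hedged software sketch of the two ingredients named above, Montgomery reduction and binary (square-and-multiply) modular exponentiation; it illustrates the arithmetic only and says nothing about the systolic-array hardware mapping.

```python
# Hedged software sketch of Montgomery reduction plus binary (square-and-multiply)
# modular exponentiation; illustrative math only, not the chip's architecture.
def montgomery_exp(base: int, exp: int, n: int) -> int:
    assert n % 2 == 1, "Montgomery reduction requires an odd modulus"
    k = n.bit_length()
    R = 1 << k
    n_prime = (-pow(n, -1, R)) % R          # n' with n*n' == -1 (mod R)

    def redc(t: int) -> int:                # Montgomery reduction of t < n*R
        m = (t * n_prime) & (R - 1)
        u = (t + m * n) >> k
        return u - n if u >= n else u

    x_bar = (base % n) * R % n              # enter the Montgomery domain
    acc = R % n                             # Montgomery form of 1
    for bit in bin(exp)[2:]:                # left-to-right binary method
        acc = redc(acc * acc)
        if bit == "1":
            acc = redc(acc * x_bar)
    return redc(acc)                        # leave the Montgomery domain

if __name__ == "__main__":
    assert montgomery_exp(7, 65537, 2**61 - 1) == pow(7, 65537, 2**61 - 1)
    print("ok")
```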

  20. A software reconfigurable optical multiband UWB system utilizing a bit-loading combined with adaptive LDPC code rate scheme

    NASA Astrophysics Data System (ADS)

    He, Jing; Dai, Min; Chen, Qinghui; Deng, Rui; Xiang, Changqing; Chen, Lin

    2017-07-01

    In this paper, an effective bit-loading combined with adaptive LDPC code rate (ALCR) algorithm is proposed and investigated in a software reconfigurable multiband UWB over fiber system. To compensate for the power fading and chromatic dispersion of the high-frequency multiband OFDM UWB signal transmitted over standard single mode fiber (SSMF), a Mach-Zehnder modulator (MZM) with a negative chirp parameter is utilized. In addition, a negative power penalty of -1 dB for the 128 QAM multiband OFDM UWB signal is measured at the hard-decision forward error correction (HD-FEC) limit of 3.8 × 10^-3 after 50 km SSMF transmission. The experimental results show that, compared to the fixed coding scheme with a code rate of 75%, the signal-to-noise ratio (SNR) is improved by 2.79 dB for the 128 QAM multiband OFDM UWB system after 100 km SSMF transmission using the ALCR algorithm. Moreover, by employing bit-loading combined with the ALCR algorithm, the bit error rate (BER) performance of the system can be further improved. The simulation results show that, at the HD-FEC limit, the Q factor is improved by 3.93 dB at an SNR of 19.5 dB over 100 km SSMF transmission, compared to fixed modulation with an uncoded scheme at the same spectral efficiency (SE).

  1. Learning Short Binary Codes for Large-scale Image Retrieval.

    PubMed

    Liu, Li; Yu, Mengyang; Shao, Ling

    2017-03-01

    Large-scale visual information retrieval has become an active research area in this big data era. Recently, hashing/binary coding algorithms have proved to be effective for scalable retrieval applications. Most existing hashing methods require relatively long binary codes (i.e., over hundreds of bits, sometimes even thousands of bits) to achieve reasonable retrieval accuracies. However, for some realistic and unique applications, such as on wearable or mobile devices, only short binary codes can be used for efficient image retrieval due to the limitation of computational resources or bandwidth on these devices. In this paper, we propose a novel unsupervised hashing approach called min-cost ranking (MCR) specifically for learning powerful short binary codes (i.e., usually with code length shorter than 100 bits) for scalable image retrieval tasks. By exploring the discriminative ability of each dimension of the data, MCR can generate a one-bit binary code for each dimension and simultaneously rank the discriminative separability of each bit according to the proposed cost function. Only the top-ranked bits with minimum cost values are then selected and grouped together to compose the final salient binary codes. Extensive experimental results on large-scale retrieval demonstrate that MCR can achieve performance comparable to state-of-the-art hashing algorithms but with significantly shorter codes, leading to much faster large-scale retrieval.
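
    The overall pattern, binarize each dimension, score each resulting bit, and keep only the top-ranked bits, can be sketched as below; the median threshold and variance-based score are illustrative stand-ins, not MCR's cost function.

```python
# Hedged sketch of the overall pattern only: binarize each feature dimension at its
# median to get one candidate bit per dimension, score every bit with a simple proxy,
# and keep the top-ranked bits. The variance-based score is an illustrative stand-in
# for MCR's cost function, not the paper's formulation.
import numpy as np

def learn_short_codes(X: np.ndarray, n_bits: int):
    thresholds = np.median(X, axis=0)                # one threshold per dimension
    scores = X.var(axis=0)                           # proxy for discriminative ability
    selected = np.argsort(-scores)[:n_bits]          # keep the top-ranked dimensions
    return selected, thresholds[selected]

def encode(X: np.ndarray, selected, thresholds) -> np.ndarray:
    return (X[:, selected] > thresholds).astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 64)) * rng.uniform(0.1, 3.0, size=64)
    sel, thr = learn_short_codes(X, n_bits=16)
    print("chosen dims:", sel[:8], "code shape:", encode(X, sel, thr).shape)
```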

  2. Global navigation satellite system receiver for weak signals under all dynamic conditions

    NASA Astrophysics Data System (ADS)

    Ziedan, Nesreen Ibrahim

    The ability of a Global Navigation Satellite System (GNSS) receiver to work under weak signal and various dynamic conditions is required in some applications, for example, to provide positioning capability in wireless devices or orbit determination for geostationary and high Earth orbit satellites. This dissertation develops Global Positioning System (GPS) receiver algorithms for such applications. Fifteen algorithms are developed for the GPS C/A signal. They cover all the receiver's main functions, which include acquisition, fine acquisition, bit synchronization, code and carrier tracking, and navigation message decoding. They are integrated together, and they can be used in any software GPS receiver. They can also be modified to fit any other GPS or GNSS signals. The algorithms have new capabilities. The processing and memory requirements are considered in the design to allow the algorithms to fit the limited resources of some applications; they do not require any assisting information. Weak signals can be acquired in the presence of strong interfering signals and under high dynamic conditions. The fine acquisition, bit synchronization, and tracking algorithms are based on the Viterbi algorithm and extended Kalman filter approaches. The tracking algorithms' capabilities increase the time to loss of lock. They have the ability to adaptively change the integration length and the code delay separation, and more than one code delay separation can be used at the same time. Large tracking errors can be detected and then corrected by re-initialization and acquisition-like algorithms. Detecting the navigation message is needed to increase the coherent integration; decoding it is needed to calculate the navigation solution. The decoding algorithm utilizes the message structure to enable decoding of signals with a high Bit Error Rate. The algorithms are demonstrated using simulated GPS C/A code signals and TCXO clocks. The results have shown the algorithms' ability to work reliably with 15 dB-Hz signals and acceleration over 6 g.

  3. A class of parallel algorithms for computation of the manipulator inertia matrix

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1989-01-01

    Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on the composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, achieving significantly higher efficiency.

  4. An adaptive DPCM encoder for NTSC composite video signals

    NASA Astrophysics Data System (ADS)

    Cox, N. R.

    An adaptive DPCM algorithm is proposed for encoding digitized National Television Systems Committee (NTSC) color video signals. This algorithm essentially predicts picture contours in the composite signal without resorting to component separation. Preliminary subjective and objective tests performed on an experimental encoder/simulator indicate that high quality color pictures can be encoded at 4.0 bits/pel or 42.95 Mbit/s. This requires the use of a 4/8 bit dual-word-length coder and buffer memory. Such a system might be useful in certain short hop applications if both large-signal and small-signal responses can be preserved.

  5. Numerical method for high accuracy index of refraction estimation for spectro-angular surface plasmon resonance systems.

    PubMed

    Alleyne, Colin J; Kirk, Andrew G; Chien, Wei-Yin; Charette, Paul G

    2008-11-24

    An eigenvector analysis based algorithm is presented for estimating refractive index changes from 2-D reflectance/dispersion images obtained with spectro-angular surface plasmon resonance systems. High resolution over a large dynamic range can be achieved simultaneously. The method performs well in simulations with noisy data, maintaining an error of less than 10^-8 refractive index units with up to six bits of noise on 16 bit quantized image data. Experimental measurements show that the method results in a much higher signal to noise ratio than the standard 1-D weighted centroid dip finding algorithm.

  6. Digital CODEC for real-time processing of broadcast quality video signals at 1.8 bits/pixel

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary JO; Whyte, Wayne A., Jr.

    1989-01-01

    Advances in very large-scale integration and recent work in the field of bandwidth efficient digital modulation techniques have combined to make digital video processing technically feasible and potentially cost competitive for broadcast quality television transmission. A hardware implementation was developed for a DPCM-based digital television bandwidth compression algorithm which processes standard NTSC composite color television signals and produces broadcast quality video in real time at an average of 1.8 bits/pixel. The data compression algorithm and the hardware implementation of the CODEC are described, and performance results are provided.

  7. Digital CODEC for real-time processing of broadcast quality video signals at 1.8 bits/pixel

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary JO; Whyte, Wayne A.

    1991-01-01

    Advances in very large scale integration and recent work in the field of bandwidth efficient digital modulation techniques have combined to make digital video processing technically feasible and potentially cost competitive for broadcast quality television transmission. A hardware implementation was developed for a DPCM (differential pulse code modulation)-based digital television bandwidth compression algorithm which processes standard NTSC composite color television signals and produces broadcast quality video in real time at an average of 1.8 bits/pixel. The data compression algorithm and the hardware implementation of the codec are described, and performance results are provided.

  8. Improved Adaptive LSB Steganography Based on Chaos and Genetic Algorithm

    NASA Astrophysics Data System (ADS)

    Yu, Lifang; Zhao, Yao; Ni, Rongrong; Li, Ting

    2010-12-01

    We propose a novel steganographic method in JPEG images with high performance. Firstly, we propose an improved adaptive LSB steganography, which can achieve high capacity while preserving the first-order statistics. Secondly, in order to minimize visual degradation of the stego image, we shuffle the bit order of the message based on chaos whose parameters are selected by the genetic algorithm. Shuffling the message's bit order provides us with a new way to improve the performance of steganography. Experimental results show that our method outperforms classical steganographic methods in image quality, while preserving characteristics of the histogram and providing high capacity.
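
    A hedged sketch of chaos-driven bit shuffling followed by plain LSB substitution; the logistic map with fixed parameters and the spatial-domain embedding are simplifications (the paper embeds in JPEG and selects the chaos parameters with a genetic algorithm).

```python
# Hedged sketch: shuffle message bits with a logistic-map-driven permutation, then do
# plain LSB substitution. Spatial-domain embedding and the fixed map parameters are
# illustrative simplifications of the scheme described above.
import numpy as np

def chaotic_permutation(n: int, x0: float = 0.7, mu: float = 3.99) -> np.ndarray:
    x, seq = x0, []
    for _ in range(n):
        x = mu * x * (1.0 - x)                  # logistic map iteration
        seq.append(x)
    return np.argsort(seq)                      # permutation induced by the chaotic sequence

def embed_lsb(cover: np.ndarray, message_bits: np.ndarray) -> np.ndarray:
    perm = chaotic_permutation(len(message_bits))
    shuffled = message_bits[perm]
    stego = cover.copy().ravel()
    stego[:len(shuffled)] = (stego[:len(shuffled)] & ~np.uint8(1)) | shuffled
    return stego.reshape(cover.shape)

def extract_lsb(stego: np.ndarray, n_bits: int) -> np.ndarray:
    perm = chaotic_permutation(n_bits)
    shuffled = stego.ravel()[:n_bits] & np.uint8(1)
    message = np.empty(n_bits, dtype=np.uint8)
    message[perm] = shuffled                    # undo the permutation
    return message

if __name__ == "__main__":
    cover = np.random.default_rng(0).integers(0, 256, size=(8, 8), dtype=np.uint8)
    msg = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
    assert np.array_equal(extract_lsb(embed_lsb(cover, msg), len(msg)), msg)
    print("round trip ok")
```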

  9. Standard random number generation for MBASIC

    NASA Technical Reports Server (NTRS)

    Tausworthe, R. C.

    1976-01-01

    A machine-independent algorithm is presented and analyzed for generating pseudorandom numbers suitable for the standard MBASIC system. The algorithm used is the polynomial congruential or linear recurrence modulo 2 method. Numbers, formed as nonoverlapping adjacent 28-bit words taken from the bit stream produced by the formula a_(m+532) = a_(m+37) + a_m (mod 2), do not repeat within the projected age of the solar system, show no ensemble correlation, exhibit uniform distribution of adjacent numbers up to 19 dimensions, and do not deviate from random runs-up and runs-down behavior.
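
    A hedged sketch of the stated recurrence packed into nonoverlapping 28-bit words; the seed and bit-packing order are illustrative choices, not those of the MBASIC implementation.

```python
# Hedged sketch of the stated linear recurrence modulo 2, a_(m+532) = a_(m+37) + a_m,
# packed into nonoverlapping 28-bit words. Seed and bit-packing order are illustrative.
from collections import deque

def recurrence_bits(seed_bits):
    """Yield the bit stream of a_(m+532) = a_(m+37) XOR a_m."""
    state = deque(seed_bits, maxlen=532)          # window a_m ... a_(m+531)
    assert len(state) == 532
    while True:
        new_bit = state[37] ^ state[0]            # a_(m+37) + a_m (mod 2)
        yield state[0]
        state.append(new_bit)                     # shifts the window by one

def random_28bit_words(seed_bits, count):
    bits = recurrence_bits(seed_bits)
    words = []
    for _ in range(count):
        w = 0
        for _ in range(28):                       # nonoverlapping adjacent 28-bit words
            w = (w << 1) | next(bits)
        words.append(w)
    return words

if __name__ == "__main__":
    seed = [(i * 2654435761 >> 7) & 1 for i in range(532)]   # arbitrary nonzero seed
    print(random_28bit_words(seed, 3))
```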

  10. A novel digital image sensor with row wise gain compensation for Hyper Spectral Imager (HySI) application

    NASA Astrophysics Data System (ADS)

    Lin, Shengmin; Lin, Chi-Pin; Wang, Weng-Lyang; Hsiao, Feng-Ke; Sikora, Robert

    2009-08-01

    A 256x512 element digital image sensor has been developed which has a large pixel size, slow scan and low power consumption for Hyper Spectral Imager (HySI) applications. The device is a mixed mode, system on chip (SOC) IC. It combines analog circuitry, digital circuitry and optical sensor circuitry into a single chip. This chip integrates a 256x512 active pixel sensor array, a programmable gain amplifier (PGA) for row-wise gain setting, an I2C interface, SRAM, a 12-bit analog-to-digital converter (ADC), a voltage regulator, low voltage differential signaling (LVDS) and a timing generator. The device can be used for 256 pixels of spatial resolution and 512 bands of spectral resolution ranging from 400 nm to 950 nm in wavelength. In row-wise gain readout mode, one can set a different gain on each row of the photo detector by storing the gain setting data in the SRAM through the I2C interface. This unique row-wise gain setting can be used to compensate for the silicon spectral response non-uniformity. Owing to this unique function, the device is well suited to hyper-spectral imager applications. The HySI camera located on-board the Chandrayaan-1 satellite was successfully launched to the moon on Oct. 22, 2008. The device is currently mapping the moon and sending back excellent images of the moon surface. The device design and the moon image data are presented in this paper.
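
    The row-wise gain idea can be illustrated as below: each row (spectral band) of a frame is scaled by its own gain and clipped to the 12-bit ADC range; this is purely illustrative and not the sensor's actual signal chain.

```python
# Hedged sketch of row-wise gain compensation: each row (spectral band) of a frame is
# multiplied by its own gain and clipped to the 12-bit range. Illustrative only.
import numpy as np

ROWS, COLS, ADC_MAX = 256, 512, 4095        # 12-bit full scale

def compensate_rows(frame: np.ndarray, row_gain: np.ndarray) -> np.ndarray:
    """frame: (ROWS, COLS) raw counts; row_gain: (ROWS,) gain per spectral band."""
    corrected = frame.astype(np.float64) * row_gain[:, None]
    return np.clip(np.rint(corrected), 0, ADC_MAX).astype(np.uint16)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.integers(0, ADC_MAX + 1, size=(ROWS, COLS), dtype=np.uint16)
    # e.g. boost bands where the silicon response is weak (hypothetical gain curve)
    gains = 1.0 + 0.5 * np.linspace(0, 1, ROWS) ** 2
    print(compensate_rows(raw, gains).dtype, compensate_rows(raw, gains).shape)
```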

  11. Extraction of drainage networks from large terrain datasets using high throughput computing

    NASA Astrophysics Data System (ADS)

    Gong, Jianya; Xie, Jibo

    2009-02-01

    Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). How to process and use these LTD has become a big challenge for GIS users. Extracting drainage networks, which are basic for hydrological applications, from LTD is one of the typical applications of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage networks extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of using regular 1-dimensional (strip-wise) and 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. A HTC environment is employed to test the proposed methods with real datasets.

  12. Analog Signal Correlating Using an Analog-Based Signal Conditioning Front End

    NASA Technical Reports Server (NTRS)

    Prokop, Norman; Krasowski, Michael

    2013-01-01

    This innovation is capable of correlating two analog signals by using an analog-based signal conditioning front end to hard-limit the analog signals through adaptive thresholding into a binary bit stream, then performing the correlation using a Hamming "similarity" calculator function embedded in a one-bit digital correlator (OBDC). By converting the analog signal into a bit stream, the calculation of the correlation function is simplified, and less hardware resources are needed. This binary representation allows the hardware to move from a DSP where instructions are performed serially, into digital logic where calculations can be performed in parallel, greatly speeding up calculations.
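
    A hedged sketch of the one-bit correlation idea: hard-limit two analog records into bit streams with an adaptive threshold, then score their agreement with a Hamming similarity (XOR plus popcount); the moving-average threshold is an illustrative choice.

```python
# Hedged sketch of one-bit correlation: hard-limit two analog records into bit streams
# with an adaptive (moving-average) threshold, then score alignment by Hamming
# similarity using XOR and a popcount. The threshold choice is illustrative.
import numpy as np

def hard_limit(signal: np.ndarray, window: int = 32) -> np.ndarray:
    kernel = np.ones(window) / window
    threshold = np.convolve(signal, kernel, mode="same")    # adaptive local mean
    return (signal > threshold).astype(np.uint8)

def hamming_similarity(bits_a: np.ndarray, bits_b: np.ndarray) -> int:
    """Number of matching bit positions (higher = better correlation)."""
    return int(len(bits_a) - np.count_nonzero(bits_a ^ bits_b))

if __name__ == "__main__":
    t = np.linspace(0, 1, 1024)
    ref = np.sin(2 * np.pi * 10 * t)
    noisy = ref + 0.3 * np.random.default_rng(0).normal(size=t.size)
    print("similarity:", hamming_similarity(hard_limit(ref), hard_limit(noisy)))
```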

  13. Introducing parallelism to histogramming functions for GEM systems

    NASA Astrophysics Data System (ADS)

    Krawczyk, Rafał D.; Czarski, Tomasz; Kolasinski, Piotr; Pozniak, Krzysztof T.; Linczuk, Maciej; Byszuk, Adrian; Chernyshova, Maryna; Juszczyk, Bartlomiej; Kasprowicz, Grzegorz; Wojenski, Andrzej; Zabolotny, Wojciech

    2015-09-01

    This article is an assessment of the potential parallelization of histogramming algorithms in a GEM detector system. Histogramming and preprocessing algorithms in MATLAB were analyzed with regard to adding parallelism. A preliminary implementation of parallel strip histogramming resulted in a speedup. An analysis of the algorithms' parallelizability is presented. An overview of potential hardware and software support to implement the parallel algorithm is discussed.

  14. A novel speech watermarking algorithm by line spectrum pair modification

    NASA Astrophysics Data System (ADS)

    Zhang, Qian; Yang, Senbin; Chen, Guang; Zhou, Jun

    2011-10-01

    To explore digital watermarking specifically suitable for the speech domain, this paper first experimentally investigates the properties of line spectrum pair (LSP) parameters. The results show that the differences between contiguous LSPs are robust against common signal processing operations and that small modifications of LSPs are imperceptible to the human auditory system (HAS). According to these conclusions, three contiguous LSPs of a speech frame are selected to embed a watermark bit. The middle LSP is slightly altered to modify the differences between these LSPs when embedding the watermark. Correspondingly, the watermark is extracted by comparing these differences. The proposed algorithm's transparency is adjustable to meet the needs of different applications. The algorithm has good robustness against additive noise, quantization, amplitude scaling and MP3 compression attacks, as the bit error rate (BER) is less than 5%. In addition, the algorithm provides a relatively low capacity of approximately 50 bps.

  15. Reed Solomon codes for error control in byte organized computer memory systems

    NASA Technical Reports Server (NTRS)

    Lin, S.; Costello, D. J., Jr.

    1984-01-01

    A problem in designing semiconductor memories is to provide some measure of error control without requiring excessive coding overhead or decoding time. In LSI and VLSI technology, memories are often organized on a multiple bit (or byte) per chip basis. For example, some 256K-bit DRAM's are organized in 32Kx8 bit-bytes. Byte oriented codes such as Reed Solomon (RS) codes can provide efficient low overhead error control for such memories. However, the standard iterative algorithm for decoding RS codes is too slow for these applications. Some special decoding techniques for extended single-and-double-error-correcting RS codes which are capable of high speed operation are presented. These techniques are designed to find the error locations and the error values directly from the syndrome without having to use the iterative algorithm to find the error locator polynomial.

  16. Bitshuffle: Filter for improving compression of typed binary data

    NASA Astrophysics Data System (ADS)

    Masui, Kiyoshi

    2017-12-01

    Bitshuffle rearranges typed, binary data for improving compression; the algorithm is implemented in a python/C package within the Numpy framework. The library can be used alongside HDF5 to compress and decompress datasets and is integrated through the dynamically loaded filters framework. Algorithmically, Bitshuffle is closely related to HDF5's Shuffle filter except it operates at the bit level instead of the byte level. Arranging a typed data array into a matrix with the elements as the rows and the bits within the elements as the columns, Bitshuffle "transposes" the matrix, such that all the least-significant bits are in a row, etc. This transposition is performed within blocks of data roughly 8kB long; this does not in itself compress data, but rearranges it for more efficient compression. A compression library is necessary to perform the actual compression. This scheme has been used for compression of radio data in high performance computing.
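
    A hedged NumPy sketch of the described rearrangement: view a block of fixed-width elements as an elements-by-bits matrix and transpose it so bits of equal significance become contiguous. It illustrates the data movement only; the actual library does this with SIMD on raw buffers and pairs it with a separate compressor.

```python
# Hedged NumPy sketch of the bit-level "transpose" described above; not the library's
# SIMD implementation, and no compression step is included.
import numpy as np

def bitshuffle_block(block: np.ndarray) -> np.ndarray:
    """Return the bit-transposed bytes of a 1-D array of fixed-width integers."""
    elem_bytes = block.dtype.itemsize
    as_bits = np.unpackbits(block.view(np.uint8).reshape(-1, elem_bytes), axis=1)
    return np.packbits(as_bits.T)            # rows now hold same-significance bits

def unbitshuffle_block(shuffled: np.ndarray, dtype, count: int) -> np.ndarray:
    elem_bits = np.dtype(dtype).itemsize * 8
    as_bits = np.unpackbits(shuffled).reshape(elem_bits, count).T
    return np.packbits(as_bits, axis=1).ravel().view(dtype)

if __name__ == "__main__":
    data = np.arange(1024, dtype=np.uint32)   # smooth data -> long runs after shuffling
    restored = unbitshuffle_block(bitshuffle_block(data), np.uint32, data.size)
    assert np.array_equal(data, restored)
    print("round trip ok")
```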

  17. Decoding of DBEC-TBED Reed-Solomon codes. [Double-Byte-Error-Correcting, Triple-Byte-Error-Detecting

    NASA Technical Reports Server (NTRS)

    Deng, Robert H.; Costello, Daniel J., Jr.

    1987-01-01

    A problem in designing semiconductor memories is to provide some measure of error control without requiring excessive coding overhead or decoding time. In LSI and VLSI technology, memories are often organized on a multiple bit (or byte) per chip basis. For example, some 256 K bit DRAM's are organized in 32 K x 8 bit-bytes. Byte-oriented codes such as Reed-Solomon (RS) codes can provide efficient low overhead error control for such memories. However, the standard iterative algorithm for decoding RS codes is too slow for these applications. The paper presents a special decoding technique for double-byte-error-correcting, triple-byte-error-detecting RS codes which is capable of high-speed operation. This technique is designed to find the error locations and the error values directly from the syndrome without having to use the iterative algorithm to find the error locator polynomial.

  18. Parallelizing flow-accumulation calculations on graphics processing units—From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

    NASA Astrophysics Data System (ADS)

    Qin, Cheng-Zhi; Zhan, Lijun

    2012-06-01

    As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.

  19. The development of a scalable parallel 3-D CFD algorithm for turbomachinery. M.S. Thesis Final Report

    NASA Technical Reports Server (NTRS)

    Luke, Edward Allen

    1993-01-01

    Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.

  20. Automated novelty detection in the WISE survey with one-class support vector machines

    NASA Astrophysics Data System (ADS)

    Solarz, A.; Bilicki, M.; Gromadzki, M.; Pollo, A.; Durkalec, A.; Wypych, M.

    2017-10-01

    Wide-angle photometric surveys of previously uncharted sky areas or wavelength regimes will always bring in unexpected sources - novelties or even anomalies - whose existence and properties cannot be easily predicted from earlier observations. Such objects can be efficiently located with novelty detection algorithms. Here we present an application of such a method, called one-class support vector machines (OCSVM), to search for anomalous patterns among sources preselected from the mid-infrared AllWISE catalogue covering the whole sky. To create a model of expected data we train the algorithm on a set of objects with spectroscopic identifications from the SDSS DR13 database, present also in AllWISE. The OCSVM method detects as anomalous those sources whose patterns - WISE photometric measurements in this case - are inconsistent with the model. Among the detected anomalies we find artefacts, such as objects with spurious photometry due to blending, but more importantly also real sources of genuine astrophysical interest. Among the latter, OCSVM has identified a sample of heavily reddened AGN/quasar candidates distributed uniformly over the sky and in a large part absent from other WISE-based AGN catalogues. It also allowed us to find a specific group of sources of mixed types, mostly stars and compact galaxies. By combining the semi-supervised OCSVM algorithm with standard classification methods it will be possible to improve the latter by accounting for sources which are not present in the training sample, but are otherwise well-represented in the target set. Anomaly detection adds flexibility to automated source separation procedures and helps verify the reliability and representativeness of the training samples. It should be thus considered as an essential step in supervised classification schemes to ensure completeness and purity of produced catalogues. The catalogues of outlier data are only available at the CDS via anonymous ftp to http://cdsarc.u-strasbg.fr (http://130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/606/A39
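
    A hedged scikit-learn sketch of the general one-class SVM workflow: train on "known" feature vectors, then flag target sources the model finds inconsistent with that training set. The feature columns, kernel, and nu/gamma values are placeholders, not the paper's tuned configuration.

```python
# Hedged sketch of a one-class SVM novelty-detection workflow with scikit-learn.
# Synthetic stand-in features only; columns, kernel, and nu/gamma are placeholders.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# hypothetical photometric feature vectors (e.g. magnitudes and colours)
train = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))      # "known" training sources
target = np.vstack([rng.normal(size=(1000, 4)),
                    rng.normal(loc=6.0, size=(20, 4))])      # a few injected outliers

scaler = StandardScaler().fit(train)
model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(scaler.transform(train))

labels = model.predict(scaler.transform(target))             # +1 = consistent, -1 = anomaly
print("flagged anomalies:", int(np.sum(labels == -1)))
```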

  1. Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

    NASA Astrophysics Data System (ADS)

    Backes, Werner; Wetzel, Susanne

    In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schnorr-Euchner algorithm for today's multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread and about 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.

  2. Parallel O(log n) algorithms for open- and closed-chain rigid multibody systems based on a new mass matrix factorization technique

    NASA Technical Reports Server (NTRS)

    Fijany, Amir

    1993-01-01

    In this paper, parallel O(log n) algorithms for computation of rigid multibody dynamics are developed. These parallel algorithms are derived by parallelization of new O(n) algorithms for the problem. The underlying feature of these O(n) algorithms is a drastically different strategy for decomposition of interbody force which leads to a new factorization of the mass matrix (M). Specifically, it is shown that a factorization of the inverse of the mass matrix in the form of the Schur complement is derived as M^(-1) = C - B^* A^(-1) B, wherein matrices C, A, and B are block tridiagonal matrices. The new O(n) algorithm is then derived as a recursive implementation of this factorization of M^(-1). For the closed-chain systems, similar factorizations and O(n) algorithms for computation of the Operational Space Mass Matrix Lambda and its inverse Lambda^(-1) are also derived. It is shown that these O(n) algorithms are strictly parallel, that is, they are less efficient than other algorithms for serial computation of the problem. But, to our knowledge, they are the only known algorithms that can be parallelized and that lead to both time- and processor-optimal parallel algorithms for the problem, i.e., parallel O(log n) algorithms with O(n) processors. The developed parallel algorithms, in addition to their theoretical significance, are also practical from an implementation point of view due to their simple architectural requirements.

  3. Preliminary design for a standard 10^7 bit Solid State Memory (SSM)

    NASA Technical Reports Server (NTRS)

    Hayes, P. J.; Howle, W. M., Jr.; Stermer, R. L., Jr.

    1978-01-01

    A modular concept with three separate modules roughly separating bubble domain technology, control logic technology, and power supply technology was employed. These modules were respectively the standard memory module (SMM), the data control unit (DCU), and power supply module (PSM). The storage medium was provided by bubble domain chips organized into memory cells. These cells and the circuitry for parallel data access to the cells make up the SMM. The DCU provides a flexible serial data interface to the SMM. The PSM provides adequate power to enable one DCU and one SMM to operate simultaneously at the maximum data rate. The SSM was designed to handle asynchronous data rates from dc to 1.024 Mb/s with a bit error rate less than 1 error in 10^8 bits. Two versions of the SSM, a serial data memory and a dual parallel data memory were specified using the standard modules. The SSM specification includes requirements for radiation hardness, temperature and mechanical environments, dc magnetic field emission and susceptibility, electromagnetic compatibility, and reliability.

  4. A 32-bit Ultrafast Parallel Correlator using Resonant Tunneling Devices

    NASA Technical Reports Server (NTRS)

    Kulkarni, Shriram; Mazumder, Pinaki; Haddad, George I.

    1995-01-01

    An ultrafast 32-bit pipeline correlator has been implemented using resonant tunneling diodes (RTD) and hetero-junction bipolar transistors (HBT). The negative differential resistance (NDR) characteristics of RTDs are the basis of logic gates with the self-latching property that eliminates the pipeline area and delay overheads which limit throughput in conventional technologies. The circuit topology also allows threshold logic functions such as minority/majority to be implemented in a compact manner, resulting in a reduction of the overall complexity and delay of arbitrary logic circuits. The parallel correlator is an essential component in code division multiple access (CDMA) transceivers used for the continuous calculation of correlation between an incoming data stream and a PN sequence. Simulation results show that a nano-pipelined correlator can provide an effective throughput of one 32-bit correlation every 100 picoseconds, using minimal hardware, with a power dissipation of 1.5 watts. RTD plus HBT based logic gates have been fabricated, and the RTD plus HBT based correlator is compared with state of the art complementary metal oxide semiconductor (CMOS) implementations.

  5. Turbo Trellis Coded Modulation With Iterative Decoding for Mobile Satellite Communications

    NASA Technical Reports Server (NTRS)

    Divsalar, D.; Pollara, F.

    1997-01-01

    In this paper, analytical bounds on the performance of parallel concatenation of two codes, known as turbo codes, and serial concatenation of two codes over fading channels are obtained. Based on this analysis, design criteria for the selection of component trellis codes for MPSK modulation, and a suitable bit-by-bit iterative decoding structure, are proposed. Examples are given for a throughput of 2 bits/sec/Hz with 8PSK modulation. The parallel concatenation example uses two rate 4/5 8-state convolutional codes with two interleavers. The convolutional codes' outputs are then mapped to two 8PSK modulations. The serial concatenated code example uses an 8-state outer code with rate 4/5 and a 4-state inner trellis code with 5 inputs and 2 x 8PSK outputs per trellis branch. Based on the above mentioned design criteria for fading channels, a method to obtain the structure of the trellis code with maximum diversity is proposed. Simulation results are given for AWGN and an independent Rayleigh fading channel with perfect Channel State Information (CSI).

  6. High-Performance Psychometrics: The Parallel-E Parallel-M Algorithm for Generalized Latent Variable Models. Research Report. ETS RR-16-34

    ERIC Educational Resources Information Center

    von Davier, Matthias

    2016-01-01

    This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
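
    The parallel-E idea can be sketched on a toy two-component Gaussian mixture: each process computes the E-step sufficient statistics for its data chunk, and the M step reduces them; this toy model stands in for, and is not, the report's latent variable models.

```python
# Hedged sketch of a "parallel E" step on a toy two-component 1-D Gaussian mixture:
# sufficient statistics are computed per data chunk in parallel processes and reduced
# in the M step. A toy model, not the latent variable models of the report.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def e_step_chunk(args):
    x, pi, mu, var = args
    # responsibilities of component 1 for this chunk (constants cancel in the ratio)
    p0 = (1 - pi) * np.exp(-0.5 * (x - mu[0]) ** 2 / var[0]) / np.sqrt(var[0])
    p1 = pi * np.exp(-0.5 * (x - mu[1]) ** 2 / var[1]) / np.sqrt(var[1])
    r = p1 / (p0 + p1)
    # sufficient statistics: weighted counts, sums, and sums of squares per component
    return np.array([r.sum(), (r * x).sum(), (r * x**2).sum(),
                     (1 - r).sum(), ((1 - r) * x).sum(), ((1 - r) * x**2).sum()])

def em_fit(x, n_iter=50, n_chunks=4):
    pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
    chunks = np.array_split(x, n_chunks)
    with ProcessPoolExecutor(max_workers=n_chunks) as pool:
        for _ in range(n_iter):
            stats = sum(pool.map(e_step_chunk, [(c, pi, mu, var) for c in chunks]))
            n1, s1, q1, n0, s0, q0 = stats               # M step: closed-form updates
            mu = np.array([s0 / n0, s1 / n1])
            var = np.array([q0 / n0 - mu[0] ** 2, q1 / n1 - mu[1] ** 2])
            pi = n1 / len(x)
    return pi, mu, var

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-2, 1, 3000), rng.normal(3, 0.5, 2000)])
    print(em_fit(data))
```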

  7. Hardware multiplier processor

    DOEpatents

    Pierce, Paul E.

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.

  8. Hardware multiplier processor

    DOEpatents

    Pierce, P.E.

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.

  9. ProperCAD: A portable object-oriented parallel environment for VLSI CAD

    NASA Technical Reports Server (NTRS)

    Ramkumar, Balkrishna; Banerjee, Prithviraj

    1993-01-01

    Most parallel algorithms for VLSI CAD proposed to date have one important drawback: they work efficiently only on the machines they were designed for. As a result, algorithms designed to date are dependent on the architecture for which they are developed and do not port easily to other parallel architectures. A new project under way to address this problem is described. A portable object-oriented parallel environment for CAD algorithms (ProperCAD) is being developed. The objectives of this research are (1) to develop new parallel algorithms that run in a portable object-oriented environment (CAD algorithms are being developed on a general purpose platform for portable parallel programming called CARM, together with a C++ environment that is truly object-oriented and specialized for CAD applications); and (2) to design the parallel algorithms around a good sequential algorithm with a well-defined parallel-sequential interface (permitting the parallel algorithm to benefit from future developments in sequential algorithms). One CAD application that has been implemented as part of the ProperCAD project, flat VLSI circuit extraction, is described. The algorithm, its implementation, and its performance on a range of parallel machines are discussed in detail. It currently runs on an Encore Multimax, a Sequent Symmetry, Intel iPSC/2 and i860 hypercubes, an NCUBE 2 hypercube, and a network of Sun Sparc workstations. Performance data for other applications that were developed are also provided, namely test pattern generation for sequential circuits, parallel logic synthesis, and standard cell placement.

  10. Cascaded VLSI Chips Help Neural Network To Learn

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Daud, Taher; Thakoor, Anilkumar P.

    1993-01-01

    Cascading provides 12-bit resolution needed for learning. Using conventional silicon chip fabrication technology of VLSI, fully connected architecture consisting of 32 wide-range, variable gain, sigmoidal neurons along one diagonal and 7-bit resolution, electrically programmable, synaptic 32 x 31 weight matrix implemented on neuron-synapse chip. To increase weight resolution nominally from 7 to 13 bits, synapses on chip individually cascaded with respective synapses on another 32 x 32 matrix chip with 7-bit resolution synapses only (without neurons). Cascade correlation algorithm varies number of layers effectively connected into network; adds hidden layers one at a time during learning process in such a way as to optimize overall number of neurons and complexity and configuration of network.

  11. Cascade Error Projection: A Learning Algorithm for Hardware Implementation

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Daud, Taher

    1996-01-01

    In this paper, we work out a detailed mathematical analysis for a new learning algorithm termed Cascade Error Projection (CEP) and a general learning framework. This framework can be used to obtain the cascade correlation learning algorithm by choosing a particular set of parameters. Furthermore, the CEP learning algorithm operates on only one layer, whereas the other set of weights can be calculated deterministically. In association with the dynamical stepsize change concept to convert the weight update from an infinite space into a finite space, the relation between the current stepsize and the previous energy level is also given, and the estimation procedure for the optimal stepsize is used for validation of our proposed technique. Weight values of zero are used to start the learning for every layer, and a single hidden unit is applied instead of using a pool of candidate hidden units as in the cascade correlation scheme. Therefore, simplicity in hardware implementation is also obtained. Furthermore, this analysis allows us to select from other methods (such as conjugate gradient descent or Newton's second-order method) one which will be a good candidate for the learning technique. The choice of learning technique depends on the constraints of the problem (e.g., speed, performance, and hardware implementation); one technique may be more suitable than others. Moreover, for a discrete weight space, the theoretical analysis presents the capability of learning with limited weight quantization. Finally, 5- to 8-bit parity and chaotic time series prediction problems are investigated; the simulation results demonstrate that 4-bit or more weight quantization is sufficient for learning a neural network using CEP. In addition, it is demonstrated that this technique is able to compensate for lower bit weight resolution by incorporating additional hidden units. However, generalization results may suffer somewhat with lower bit weight quantization.

  12. Node synchronization schemes for the Big Viterbi Decoder

    NASA Technical Reports Server (NTRS)

    Cheung, K.-M.; Swanson, L.; Arnold, S.

    1992-01-01

    The Big Viterbi Decoder (BVD), currently under development for the DSN, includes three separate algorithms to acquire and maintain node and frame synchronization. The first measures the number of decoded bits between two consecutive renormalization operations (renorm rate), the second detects the presence of the frame marker in the decoded bit stream (bit correlation), while the third searches for an encoded version of the frame marker in the encoded input stream (symbol correlation). A detailed account of the operation of the three methods is given, along with a performance comparison.

  13. Random Bits Forest: a Strong Classifier/Regressor for Big Data

    NASA Astrophysics Data System (ADS)

    Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li

    2016-07-01

    Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing with an independent data set, a real psoriasis genome-wide association study (GWAS).

  14. Fundamental physics issues of multilevel logic in developing a parallel processor.

    NASA Astrophysics Data System (ADS)

    Bandyopadhyay, Anirban; Miki, Kazushi

    2007-06-01

    In the last century, on/off physical switches were equated with the two decisions 0 and 1, so that all information could be expressed in binary digits and physically realized by switches connected in a circuit. Beyond a significant increase in memory density, allowing more possible choices in a given space makes pattern logic a reality, and manipulating such patterns would allow logic itself to be controlled, generating a new kind of processor. Von Neumann's computer is based on sequential logic, processing bits one by one; but since pattern logic is generated on a surface, viewing the whole pattern at a time is truly parallel processing. Following von Neumann's and Shannon's fundamental thermodynamic approaches, we have built a compatible model based on a series of single-molecule-based multibit logic systems of 4-12 bits in a UHV-STM. Multilevel communication and pattern formation are experimentally verified on their monolayer. Furthermore, the developed intelligent monolayer is trained by an artificial neural network. The fundamental weak interactions needed to build a truly parallel processor are therefore explored here both physically and theoretically.

  15. Fault-tolerant corrector/detector chip for high-speed data processing

    DOEpatents

    Andaleon, David D.; Napolitano, Jr., Leonard M.; Redinbo, G. Robert; Shreeve, William O.

    1994-01-01

    An internally fault-tolerant data error detection and correction integrated circuit device (10) and a method of operating it are disclosed. The device functions as a bidirectional data buffer between a 32-bit data processor and the remainder of a data processing system and provides a 32-bit datum with a relatively short eight bits of data-protecting parity. The 32 bits of data and eight bits of parity are partitioned into eight 4-bit nibbles and two 4-bit nibbles, respectively. For data flowing towards the processor, the data and parity nibbles are checked in parallel and in a single operation employing a dual orthogonal basis technique. The dual orthogonal basis increases the efficiency of the implementation. Any one of the ten (eight data, two parity) nibbles is correctable if erroneous, or two different erroneous nibbles are detectable. For data flowing away from the processor, the appropriate parity nibble values are calculated and transmitted to the system along with the data. The device regenerates parity values for data flowing in either direction and compares regenerated to generated parity with a totally self-checking equality checker. As such, the device is self-validating and able to both detect and indicate an occurrence of an internal failure. A generalization of the device to protect 64-bit data with 16-bit parity against byte-wide errors is also presented.

  16. Fault-tolerant corrector/detector chip for high-speed data processing

    DOEpatents

    Andaleon, D.D.; Napolitano, L.M. Jr.; Redinbo, G.R.; Shreeve, W.O.

    1994-03-01

    An internally fault-tolerant data error detection and correction integrated circuit device and a method of operating same are described. The device functions as a bidirectional data buffer between a 32-bit data processor and the remainder of a data processing system and provides each 32-bit datum with a relatively short eight bits of data-protecting parity. The 32 bits of data and eight bits of parity are partitioned into eight 4-bit nibbles and two 4-bit nibbles, respectively. For data flowing towards the processor the data and parity nibbles are checked in parallel and in a single operation employing a dual orthogonal basis technique. The dual orthogonal basis increases the efficiency of the implementation. Any one of the ten (eight data, two parity) nibbles is correctable if erroneous, or two different erroneous nibbles are detectable. For data flowing away from the processor the appropriate parity nibble values are calculated and transmitted to the system along with the data. The device regenerates parity values for data flowing in either direction and compares regenerated to generated parity with a totally self-checking equality checker. As such, the device is self-validating and enabled to both detect and indicate an occurrence of an internal failure. A generalization of the device to protect 64-bit data with 16-bit parity against byte-wide errors is also presented. 8 figures.

  17. List-mode reconstruction for the Biograph mCT with physics modeling and event-by-event motion correction

    NASA Astrophysics Data System (ADS)

    Jin, Xiao; Chan, Chung; Mulnix, Tim; Panin, Vladimir; Casey, Michael E.; Liu, Chi; Carson, Richard E.

    2013-08-01

    Whole-body PET/CT scanners are important clinical and research tools to study tracer distribution throughout the body. In whole-body studies, respiratory motion results in image artifacts. We have previously demonstrated for brain imaging that, when provided with accurate motion data, event-by-event correction has better accuracy than frame-based methods. Therefore, the goal of this work was to develop a list-mode reconstruction with novel physics modeling for the Siemens Biograph mCT with event-by-event motion correction, based on the MOLAR platform (Motion-compensation OSEM List-mode Algorithm for Resolution-Recovery Reconstruction). Application of MOLAR for the mCT required two algorithmic developments. First, in routine studies, the mCT collects list-mode data in 32 bit packets, where averaging of lines-of-response (LORs) by axial span and angular mashing reduced the number of LORs so that 32 bits are sufficient to address all sinogram bins. This degrades spatial resolution. In this work, we proposed a probabilistic LOR (pLOR) position technique that addresses axial and transaxial LOR grouping in 32 bit data. Second, two simplified approaches for 3D time-of-flight (TOF) scatter estimation were developed to accelerate the computationally intensive calculation without compromising accuracy. The proposed list-mode reconstruction algorithm was compared to the manufacturer's point spread function + TOF (PSF+TOF) algorithm. Phantom, animal, and human studies demonstrated that MOLAR with pLOR gives slightly faster contrast recovery than the PSF+TOF algorithm that uses the average 32 bit LOR sinogram positioning. Moving phantom and a whole-body human study suggested that event-by-event motion correction reduces image blurring caused by respiratory motion. We conclude that list-mode reconstruction with pLOR positioning provides a platform to generate high quality images for the mCT, and to recover fine structures in whole-body PET scans through event-by-event motion correction.

  18. List-mode Reconstruction for the Biograph mCT with Physics Modeling and Event-by-Event Motion Correction

    PubMed Central

    Jin, Xiao; Chan, Chung; Mulnix, Tim; Panin, Vladimir; Casey, Michael E.; Liu, Chi; Carson, Richard E.

    2013-01-01

    Whole-body PET/CT scanners are important clinical and research tools to study tracer distribution throughout the body. In whole-body studies, respiratory motion results in image artifacts. We have previously demonstrated for brain imaging that, when provided accurate motion data, event-by-event correction has better accuracy than frame-based methods. Therefore, the goal of this work was to develop a list-mode reconstruction with novel physics modeling for the Siemens Biograph mCT with event-by-event motion correction, based on the MOLAR platform (Motion-compensation OSEM List-mode Algorithm for Resolution-Recovery Reconstruction). Application of MOLAR for the mCT required two algorithmic developments. First, in routine studies, the mCT collects list-mode data in 32-bit packets, where averaging of lines of response (LORs) by axial span and angular mashing reduced the number of LORs so that 32 bits are sufficient to address all sinogram bins. This degrades spatial resolution. In this work, we proposed a probabilistic assignment of LOR positions (pLOR) that addresses axial and transaxial LOR grouping in 32-bit data. Second, two simplified approaches for 3D TOF scatter estimation were developed to accelerate the computationally intensive calculation without compromising accuracy. The proposed list-mode reconstruction algorithm was compared to the manufacturer's point spread function + time-of-flight (PSF+TOF) algorithm. Phantom, animal, and human studies demonstrated that MOLAR with pLOR gives slightly faster contrast recovery than the PSF+TOF algorithm that uses the average 32-bit LOR sinogram positioning. Moving phantom and a whole-body human study suggested that event-by-event motion correction reduces image blurring caused by respiratory motion. We conclude that list-mode reconstruction with pLOR positioning provides a platform to generate high quality images for the mCT, and to recover fine structures in whole-body PET scans through event-by-event motion correction. PMID:23892635

  19. DIALIGN P: fast pair-wise and multiple sequence alignment using parallel processors.

    PubMed

    Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard

    2004-09-09

    Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic that splits up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.

  20. Photonic content-addressable memory system that uses a parallel-readout optical disk

    NASA Astrophysics Data System (ADS)

    Krishnamoorthy, Ashok V.; Marchand, Philippe J.; Yayla, Gökçe; Esener, Sadik C.

    1995-11-01

    We describe a high-performance associative-memory system that can be implemented by means of an optical disk modified for parallel readout and a custom-designed silicon integrated circuit with parallel optical input. The system can achieve associative recall on 128 × 128 bit images and also on variable-size subimages. The system's behavior and performance are evaluated on the basis of experimental results on a motionless-head parallel-readout optical-disk system, logic simulations of the very-large-scale integrated chip, and a software emulation of the overall system.

  1. Development of gallium arsenide high-speed, low-power serial parallel interface modules: Executive summary

    NASA Technical Reports Server (NTRS)

    1988-01-01

    Final report to NASA LeRC on the development of gallium arsenide (GaAs) high-speed, low-power serial/parallel interface modules. The report discusses the development and test of a family of 16-, 32- and 64-bit parallel-to-serial and serial-to-parallel integrated circuits using a self-aligned-gate MESFET technology developed at the Honeywell Sensors and Signal Processing Laboratory. Lab testing demonstrated 1.3 GHz clock rates at a power of 300 mW. This work was accomplished under contract number NAS3-24676.

  2. Multitasking TORT under UNICOS: Parallel performance models and measurements

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnett, A.; Azmy, Y.Y.

    1999-09-27

    The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The parallel performance models were compared with measurements from applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.

  3. Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azmy, Y.Y.; Barnett, D.A.

    1999-09-27

    The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The parallel performance models were compared with measurements from applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.

  4. Performance Analysis for Channel Estimation With 1-Bit ADC and Unknown Quantization Threshold

    NASA Astrophysics Data System (ADS)

    Stein, Manuel S.; Bar, Shahar; Nossek, Josef A.; Tabrikian, Joseph

    2018-05-01

    In this work, the problem of signal parameter estimation from measurements acquired by a low-complexity analog-to-digital converter (ADC) with 1-bit output resolution and an unknown quantization threshold is considered. Single-comparator ADCs are energy-efficient and can be operated at ultra-high sampling rates. For analysis of such systems, a fixed and known quantization threshold is usually assumed. In the symmetric case, i.e., zero hard-limiting offset, it is known that in the low signal-to-noise ratio (SNR) regime the signal processing performance degrades moderately by 2/π (-1.96 dB) when comparing to an ideal ∞-bit converter. Due to hardware imperfections, low-complexity 1-bit ADCs will in practice exhibit an unknown threshold different from zero. Therefore, we study the accuracy which can be obtained with receive data processed by a hard-limiter with unknown quantization level by using asymptotically optimal channel estimation algorithms. To characterize the estimation performance of these nonlinear algorithms, we employ analytic error expressions for different setups while modeling the offset as a nuisance parameter. In the low SNR regime, we establish the necessary condition for a vanishing loss due to missing offset knowledge at the receiver. As an application, we consider the estimation of single-input single-output wireless channels with inter-symbol interference and validate our analysis by comparing the analytic and experimental performance of the studied estimation algorithms. Finally, we comment on the extension to multiple-input multiple-output channel models.
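
    A hard-limiting ADC and the quoted low-SNR loss factor are simple to reproduce. The short Python sketch below is illustrative only (the paper treats the threshold as a parameter to be estimated, not a fixed setting): it shows a 1-bit quantizer with an offset threshold and evaluates the 2/π factor in decibels.

      import numpy as np

      def hard_limit(x, tau=0.0):
          # 1-bit ADC: +1/-1 depending on whether the input exceeds the
          # (possibly unknown) quantization threshold tau.
          return np.where(x >= tau, 1.0, -1.0)

      # Low-SNR performance factor for the symmetric (tau = 0) case quoted above:
      print("2/pi loss = %.2f dB" % (10 * np.log10(2 / np.pi)))   # about -1.96 dB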

  5. Entropic Lattice Boltzmann Simulations of Turbulence

    NASA Astrophysics Data System (ADS)

    Keating, Brian; Vahala, George; Vahala, Linda; Soe, Min; Yepez, Jeffrey

    2006-10-01

    Because of its simplicity, nearly perfect parallelization and vectorization on supercomputer platforms, lattice Boltzmann (LB) methods hold great promise for simulations of nonlinear physics. Indeed, our MHD-LB code has the best sustained performance/PE of any code on the Earth Simulator. By projecting into the higher dimensional kinetic phase space, the solution trajectory is simpler and much easier to compute than in the standard CFD approach. However, simple LB -- with its simple advection and local BGK collisional relaxation -- does not impose positive definiteness of the distribution functions in the time evolution. This leads to numerical instabilities for very low transport coefficients. In Entropic LB (ELB) one determines a discrete H-theorem and the equilibrium distribution functions subject to the collisional invariants. The ELB algorithm is unconditionally stable for arbitrarily small transport coefficients. Various choices of velocity discretization are examined: 15-, 19- and 27-bit ELB models. The connection between the Tsallis and Boltzmann entropies is clarified.

  6. SAR processing on the MPP

    NASA Technical Reports Server (NTRS)

    Batcher, K. E.; Eddey, E. E.; Faiss, R. O.; Gilmore, P. A.

    1981-01-01

    The processing of synthetic aperture radar (SAR) signals using the massively parallel processor (MPP) is discussed. The fast Fourier transform convolution procedures employed in the algorithms are described. The MPP architecture comprises an array unit (ARU) which processes arrays of data; an array control unit which controls the operation of the ARU and performs scalar arithmetic; a program and data management unit which controls the flow of data; and a unique staging memory (SM) which buffers and permutes data. The ARU contains a 128 by 128 array of bit-serial processing elements (PE). Two-by-four subarrays of PE's are packaged in a custom VLSI HCMOS chip. The staging memory is a large multidimensional-access memory which buffers and permutes data flowing within the system. Efficient SAR processing is achieved via ARU communication paths and SM data manipulation. Real time processing capability can be realized via a multiple ARU, multiple SM configuration.

  7. The Level 0 Pixel Trigger system for the ALICE experiment

    NASA Astrophysics Data System (ADS)

    Aglieri Rinella, G.; Kluge, A.; Krivda, M.; ALICE Silicon Pixel Detector project

    2007-01-01

    The ALICE Silicon Pixel Detector contains 1200 readout chips. Fast-OR signals indicate the presence of at least one hit in the 8192 pixel matrix of each chip. The 1200 bits are transmitted every 100 ns on 120 data readout optical links using the G-Link protocol. The Pixel Trigger System extracts and processes them to deliver an input signal to the Level 0 trigger processor targeting a latency of 800 ns. The system is compact, modular and based on FPGA devices. The architecture allows the user to define and implement various trigger algorithms. The system uses advanced 12-channel parallel optical fiber modules operating at 1310 nm as optical receivers and 12 deserializer chips closely packed in small area receiver boards. Alternative solutions with multi-channel G-Link deserializers implemented directly in programmable hardware devices were investigated. The design of the system and the progress of the ALICE Pixel Trigger project are described in this paper.

  8. Method and apparatus for high speed data acquisition and processing

    DOEpatents

    Ferron, J.R.

    1997-02-11

    A method and apparatus are disclosed for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register. 15 figs.
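
    The mask-bit trick described above is a classic way to turn raw integer samples into valid IEEE-754 words without a conversion instruction. The Python sketch below is a hedged illustration (the mask value 0x4B000000 and the final subtraction are our choices for demonstration, not values taken from the patent): prestoring an exponent of 150 makes the 32-bit word equal to 8388608.0 + d once the 14-bit value d occupies the low mantissa bits, so the processor can recover d with a single floating-point subtraction.

      import numpy as np

      MASK = np.uint32(0x4B000000)    # sign 0, exponent 150, mantissa 0 -> 2**23

      def mask_to_float(sample14):
          # Concatenate the prestored mask bits with a 14-bit sample and
          # reinterpret the resulting 32-bit word as an IEEE-754 single.
          word = np.array([MASK | np.uint32(sample14 & 0x3FFF)], dtype=np.uint32)
          return float(word.view(np.float32)[0])

      d = 9731
      f = mask_to_float(d)
      assert f - 8388608.0 == d       # value recovered by one subtraction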

  9. Method and apparatus for high speed data acquisition and processing

    DOEpatents

    Ferron, John R.

    1997-01-01

    A method and apparatus for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register.

  10. Fast computational scheme of image compression for 32-bit microprocessors

    NASA Technical Reports Server (NTRS)

    Kasperovich, Leonid

    1994-01-01

    This paper presents a new computational scheme of image compression based on the discrete cosine transform (DCT), underlying the JPEG and MPEG International Standards. The algorithm for the 2-D DCT computation uses integer operations (register shifts and additions/subtractions only); its computational complexity is about 8 additions per image pixel. As a meaningful example of an on-board image compression application we consider the software implementation of the algorithm for the Mars Rover (Marsokhod, in Russian) imaging system being developed as a part of the Mars-96 International Space Project. It is shown that a fast software solution for 32-bit microprocessors may compete with DCT-based image compression hardware.

  11. New scene change control scheme based on pseudoskipped picture

    NASA Astrophysics Data System (ADS)

    Lee, Youngsun; Lee, Jinwhan; Chang, Hyunsik; Nam, Jae Y.

    1997-01-01

    A new scene change control scheme which improves the video coding performance for sequences that have many scene changed pictures is proposed in this paper. The scene changed pictures, except intra-coded pictures, usually need more bits than normal pictures in order to maintain constant picture quality. The major idea of this paper is how to obtain the extra bits which are needed to encode scene changed pictures. We encode the B picture which is located before a scene changed picture like a skipped picture; we call such a B picture a pseudo-skipped picture. By generating the pseudo-skipped picture, we can save some bits, and they are added to the originally allocated target bits to encode the scene changed picture. The simulation results show that the proposed algorithm improves encoding performance by about 0.5 to 2.0 dB of PSNR compared to the MPEG-2 TM5 rate control scheme. In addition, the suggested algorithm is compatible with the MPEG-2 video syntax and the picture repetition is not recognizable.

  12. Expeditious reconciliation for practical quantum key distribution

    NASA Astrophysics Data System (ADS)

    Nakassis, Anastase; Bienfang, Joshua C.; Williams, Carl J.

    2004-08-01

    The paper proposes algorithmic and environmental modifications to the extant reconciliation algorithms within the BB84 protocol so as to speed up reconciliation and privacy amplification. These algorithms have been known to be a performance bottleneck [1] and can process data at rates that are six times slower than the quantum channel they serve [2]. As improvements in single-photon sources and detectors are expected to improve the quantum channel throughput by two or three orders of magnitude, it becomes imperative to improve the performance of the classical software. We developed a Cascade-like algorithm that relies on a symmetric formulation of the problem, error estimation through the segmentation process, outright elimination of segments with many errors, Forward Error Correction, recognition of the distinct data subpopulations that emerge as the algorithm runs, ability to operate on massive amounts of data (of the order of 1 Mbit), and a few other minor improvements. The data from the experimental algorithm we developed show that by operating on massive arrays of data we can improve software performance by better than three orders of magnitude while retaining nearly as many bits (typically more than 90%) as the algorithms that were designed for optimal bit retention.
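
    The core of a Cascade-style pass is a parity comparison per block followed by a binary search inside any block whose parities disagree. The Python sketch below shows that textbook step only; it does not reproduce the paper's symmetric formulation, segment elimination, or forward error correction, and `alice` simply plays the role of the reference key whose parities are queried over the classical channel.

      def parity(bits, lo, hi):
          return sum(bits[lo:hi]) & 1

      def locate_error(alice, bob, lo, hi):
          # Binary search inside a block with mismatching parity: each step
          # queries the parity of half the block and keeps the mismatching half.
          while hi - lo > 1:
              mid = (lo + hi) // 2
              if parity(alice, lo, mid) != parity(bob, lo, mid):
                  hi = mid
              else:
                  lo = mid
          return lo

      def cascade_pass(alice, bob, block):
          corrected = 0
          for lo in range(0, len(bob), block):
              hi = min(lo + block, len(bob))
              if parity(alice, lo, hi) != parity(bob, lo, hi):
                  bob[locate_error(alice, bob, lo, hi)] ^= 1   # flip one error
                  corrected += 1
          return corrected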

  13. Scalable Parallel Density-based Clustering and Applications

    NASA Astrophysics Data System (ADS)

    Patwary, Mostofa Ali

    2014-04-01

    Recently, density-based clustering algorithms (DBSCAN and OPTICS) have received significant attention from the scientific community due to their unique capability of discovering arbitrarily shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms is extremely challenging, as they exhibit an inherently sequential data access order and unbalanced workloads, resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on a shared memory architecture and speedups up to 5,765 using 8,192 cores on a distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
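
    The connected-components view mentioned above can be sketched with a union-find structure: core points within eps of one another are merged into one component. The Python sketch below is a serial, quadratic-cost illustration of that idea only; border-point assignment and the paper's parallel merging strategy are omitted, and in the parallel algorithms the unions over disjoint point subsets proceed concurrently before the trees are merged.

      import numpy as np

      def find(parent, i):
          while parent[i] != i:
              parent[i] = parent[parent[i]]      # path halving
              i = parent[i]
          return i

      def dbscan_core_components(points, eps, min_pts):
          n = len(points)
          dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
          nbrs = dist <= eps
          core = nbrs.sum(axis=1) >= min_pts     # neighbourhood includes the point itself
          parent = list(range(n))
          for i in range(n):
              for j in range(i + 1, n):
                  if core[i] and core[j] and nbrs[i, j]:
                      ri, rj = find(parent, i), find(parent, j)
                      if ri != rj:
                          parent[rj] = ri        # union of the two components
          return [find(parent, i) if core[i] else -1 for i in range(n)]   # -1 = noise/border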

  14. Parallel consistent labeling algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samal, A.; Henderson, T.

    Mackworth and Freuder have analyzed the time complexity of several constraint satisfaction algorithms. Mohr and Henderson have given new algorithms, AC-4 and PC-3, for arc and path consistency, respectively, and have shown that the arc consistency algorithm is optimal in time complexity and of the same order space complexity as the earlier algorithms. In this paper, they give parallel algorithms for solving node and arc consistency. They show that any parallel algorithm for enforcing arc consistency in the worst case must have O(na) sequential steps, where n is the number of nodes, and a is the number of labels per node. They give several parallel algorithms to do arc consistency. It is also shown that they all have optimal time complexity. The results of running the parallel algorithms on a BBN Butterfly multiprocessor are also presented.
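
    A single arc-consistency step is the natural unit of work that such parallel algorithms distribute across processors. The Python sketch below shows the standard serial revise operation for one arc (xi, xj); it is a textbook illustration, not the AC-4 bookkeeping or the Butterfly implementation discussed above.

      def revise(domains, constraint, xi, xj):
          # Remove every value of xi that has no supporting value in xj.
          removed = False
          for a in list(domains[xi]):
              if not any(constraint(a, b) for b in domains[xj]):
                  domains[xi].discard(a)
                  removed = True
          return removed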

  15. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAP) have been in day to day use for ten years and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.

  16. Stochastic Formal Correctness of Numerical Algorithms

    NASA Technical Reports Server (NTRS)

    Daumas, Marc; Lester, David; Martin-Dorel, Erik; Truffert, Annick

    2009-01-01

    We provide a framework to bound the probability that accumulated errors were never above a given threshold on numerical algorithms. Such algorithms are used for example in aircraft and nuclear power plants. This report contains simple formulas based on Levy's and Markov's inequalities and it presents a formal theory of random variables with a special focus on producing concrete results. We selected four very common applications that fit in our framework and cover the common practices of systems that evolve for a long time. We compute the number of bits that remain continuously significant in the first two applications with a probability of failure around one out of a billion, where worst case analysis considers that no significant bit remains. We are using PVS as such formal tools force explicit statement of all hypotheses and prevent incorrect uses of theorems.

  17. Parallel CE/SE Computations via Domain Decomposition

    NASA Technical Reports Server (NTRS)

    Himansu, Ananda; Jorgenson, Philip C. E.; Wang, Xiao-Yen; Chang, Sin-Chung

    2000-01-01

    This paper describes the parallelization strategy and achieved parallel efficiency of an explicit time-marching algorithm for solving conservation laws. The Space-Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 2D and 3D Euler equations is parallelized with the aid of domain decomposition. The parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel computer is checked.

  18. Comparison of multihardware parallel implementations for a phase unwrapping algorithm

    NASA Astrophysics Data System (ADS)

    Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

    2018-04-01

    Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and more accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel-type algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is of the parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel in the same iteration since they are independent. Similarly, black pixels can be updated in parallel in the alternating iteration. We present parallel implementations of our algorithm for different parallel architectures, such as multicore CPUs, the Xeon Phi coprocessor, and Nvidia graphics processing units. In all cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance analysis of the developed parallel versions.
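
    The chessboard (red-black) ordering can be illustrated on a simple Poisson-type update: all sites of one colour depend only on sites of the other colour, so each colour can be updated fully in parallel. The NumPy sketch below is a generic illustration of that update pattern under this assumption, not the accumulation-of-residual-maps cost function used in the paper.

      import numpy as np

      def redblack_sweep(u, f, h=1.0):
          # One chessboard sweep for the 5-point Laplacian: update red sites,
          # then black sites, each colour vectorised (i.e. "in parallel").
          u = u.copy()
          ii, jj = np.indices(u.shape)
          interior = (ii > 0) & (ii < u.shape[0] - 1) & (jj > 0) & (jj < u.shape[1] - 1)
          for colour in (0, 1):
              avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                            np.roll(u, 1, 1) + np.roll(u, -1, 1) - h * h * f)
              mask = interior & (((ii + jj) % 2) == colour)
              u[mask] = avg[mask]
          return u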

  19. A wide bandwidth CCD buffer memory system

    NASA Technical Reports Server (NTRS)

    Siemens, K.; Wallace, R. W.; Robinson, C. R.

    1978-01-01

    A prototype system was implemented to demonstrate that CCD's can be applied advantageously to the problem of low power digital storage and particularly to the problem of interfacing widely varying data rates. CCD shift register memories (8K bit) were used to construct a feasibility model 128 K-bit buffer memory system. Serial data that can have rates between 150 kHz and 4.0 MHz can be stored in 4K-bit, randomly-accessible memory blocks. Peak power dissipation during a data transfer is less than 7 W, while idle power is approximately 5.4 W. The system features automatic data input synchronization with the recirculating CCD memory block start address. System expansion to accommodate parallel inputs or a greater number of memory blocks can be performed in a modular fashion. Since the control logic does not increase proportionally to increase in memory capacity, the power requirements per bit of storage can be reduced significantly in a larger system.

  20. A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations

    NASA Technical Reports Server (NTRS)

    Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw

    2005-01-01

    A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel efficiency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
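
    The difference between a synchronous and an asynchronous parallel PSO lies in when the swarm state is updated: asynchronously, a particle's velocity and position are refreshed and it is resubmitted as soon as its own evaluation returns, instead of waiting for the whole iteration. The sketch below illustrates that scheduling pattern with a thread pool and a placeholder objective; the objective, parameter values, and pool size are illustrative assumptions, not details from the paper.

      import numpy as np
      from concurrent.futures import ThreadPoolExecutor, as_completed

      def sphere(x):                         # placeholder objective (assumption)
          return float(np.sum(x * x))

      def async_pso(f, dim=4, particles=8, budget=800, w=0.7, c1=1.5, c2=1.5, seed=0):
          rng = np.random.default_rng(seed)
          x = rng.uniform(-5, 5, (particles, dim))
          v = np.zeros_like(x)
          pbest, pbest_val = x.copy(), np.full(particles, np.inf)
          gbest, gbest_val = x[0].copy(), np.inf
          with ThreadPoolExecutor(max_workers=4) as pool:
              pending = {pool.submit(f, x[i].copy()): i for i in range(particles)}
              done_evals = 0
              while pending and done_evals < budget:
                  fut = next(as_completed(pending))   # take whichever finishes first
                  i = pending.pop(fut)
                  val = fut.result()
                  done_evals += 1
                  if val < pbest_val[i]:
                      pbest_val[i], pbest[i] = val, x[i].copy()
                  if val < gbest_val:
                      gbest_val, gbest = val, x[i].copy()
                  # Asynchronous step: update and resubmit this particle immediately,
                  # without waiting for the rest of the swarm.
                  r1, r2 = rng.random(dim), rng.random(dim)
                  v[i] = w * v[i] + c1 * r1 * (pbest[i] - x[i]) + c2 * r2 * (gbest - x[i])
                  x[i] = x[i] + v[i]
                  pending[pool.submit(f, x[i].copy())] = i
          return gbest, gbest_val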

  1. Missile Manufacturing Technology Conference Held at Hilton Head Island, South Carolina on 22-26 September 1975. Panel Presentations. Test Equipment

    DTIC Science & Technology

    1975-01-01

    in the computer in 16-bit parallel computer DIO transfers at the maximum computer I/O speed. It then transmits this data in a bit-serial echo...maximum DIO rate under computer interrupt control. The LCI also provides station interrupt information for transfer to the computer under computer...been in daily operation since 1973. The SAM-D Missile system is currently in the Engineering Development phase which precedes the Production and

  2. Bit-serial neuroprocessor architecture

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul (Inventor)

    2001-01-01

    A neuroprocessor architecture employs a combination of bit-serial and serial-parallel techniques for implementing the neurons of the neuroprocessor. The neuroprocessor architecture includes a neural module containing a pool of neurons, a global controller, a sigmoid activation ROM look-up-table, a plurality of neuron state registers, and a synaptic weight RAM. The neuroprocessor reduces the number of neurons required to perform the task by time multiplexing groups of neurons from a fixed pool of neurons to achieve the successive hidden layers of a recurrent network topology.

  3. A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.

    PubMed

    Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie

    2014-01-01

    It is very time consuming to solve fractional differential equations. The computational complexity of the two-dimensional fractional differential equation (2D-TFDE) with an iterative implicit finite difference method is O(M_x M_y N^2). In this paper, we present a parallel algorithm for the 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm's results compare well with the exact solution. The parallel algorithm on a single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on a single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that parallel computing technology will become a very basic method for computationally intensive fractional applications in the near future.
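
    The N^2 factor in the complexity quoted above comes from the memory of the fractional time derivative: every new time level is a weighted sum over all previous levels. A common choice of weights is the Grünwald-Letnikov binomial coefficients, which the short sketch below generates by their standard recurrence; this is an illustration of where the history cost arises, not a reproduction of the paper's implicit scheme or parallel data layout.

      def gl_weights(alpha, n):
          # Grünwald-Letnikov coefficients w_k = (-1)**k * C(alpha, k),
          # via the recurrence w_0 = 1, w_k = w_{k-1} * (1 - (alpha + 1) / k).
          w = [1.0]
          for k in range(1, n):
              w.append(w[-1] * (1.0 - (alpha + 1.0) / k))
          return w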

  4. Synthesis and evaluation of phase detectors for active bit synchronizers

    NASA Technical Reports Server (NTRS)

    Mcbride, A. L.

    1974-01-01

    Self-synchronizing digital data communication systems usually use active or phase-locked loop (PLL) bit synchronizers. The three main elements of PLL synchronizers are the phase detector, loop filter, and the voltage controlled oscillator. Of these three elements, phase detector synthesis is the main source of difficulty, particularly when the received signals are demodulated square-wave signals. A phase detector synthesis technique is reviewed that provides a physically realizable design for bit synchronizer phase detectors. The development is based upon nonlinear recursive estimation methods. The phase detector portion of the algorithm is isolated and analyzed.

  5. Pseudo-color coding method for high-dynamic single-polarization SAR images

    NASA Astrophysics Data System (ADS)

    Feng, Zicheng; Liu, Xiaolin; Pei, Bingzhi

    2018-04-01

    A raw synthetic aperture radar (SAR) image usually has a 16-bit or higher bit depth, which cannot be directly visualized on 8-bit displays. In this study, we propose a pseudo-color coding method for high-dynamic single-polarization SAR images. The method considers the characteristics of both SAR images and human perception. In HSI (hue, saturation and intensity) color space, the method carries out high-dynamic-range tone mapping and pseudo-color processing simultaneously in order to avoid loss of details and to improve object identifiability. It is a highly efficient global algorithm.

  6. Nonlinear Algorithms for Channel Equalization and Map Symbol Detection.

    NASA Astrophysics Data System (ADS)

    Giridhar, K.

    The transfer of information through a communication medium invariably results in various kinds of distortion to the transmitted signal. In this dissertation, a feed -forward neural network-based equalizer, and a family of maximum a posteriori (MAP) symbol detectors are proposed for signal recovery in the presence of intersymbol interference (ISI) and additive white Gaussian noise. The proposed neural network-based equalizer employs a novel bit-mapping strategy to handle multilevel data signals in an equivalent bipolar representation. It uses a training procedure to learn the channel characteristics, and at the end of training, the multilevel symbols are recovered from the corresponding inverse bit-mapping. When the channel characteristics are unknown and no training sequences are available, blind estimation of the channel (or its inverse) and simultaneous data recovery is required. Convergence properties of several existing Bussgang-type blind equalization algorithms are studied through computer simulations, and a unique gain independent approach is used to obtain a fair comparison of their rates of convergence. Although simple to implement, the slow convergence of these Bussgang-type blind equalizers make them unsuitable for many high data-rate applications. Rapidly converging blind algorithms based on the principle of MAP symbol-by -symbol detection are proposed, which adaptively estimate the channel impulse response (CIR) and simultaneously decode the received data sequence. Assuming a linear and Gaussian measurement model, the near-optimal blind MAP symbol detector (MAPSD) consists of a parallel bank of conditional Kalman channel estimators, where the conditioning is done on each possible data subsequence that can convolve with the CIR. This algorithm is also extended to the recovery of convolutionally encoded waveforms in the presence of ISI. Since the complexity of the MAPSD algorithm increases exponentially with the length of the assumed CIR, a suboptimal decision-feedback mechanism is introduced to truncate the channel memory "seen" by the MAPSD section. Also, simpler gradient-based updates for the channel estimates, and a metric pruning technique are used to further reduce the MAPSD complexity. Spatial diversity MAP combiners are developed to enhance the error rate performance and combat channel fading. As a first application of the MAPSD algorithm, dual-mode recovery techniques for TDMA (time-division multiple access) mobile radio signals are presented. Combined estimation of the symbol timing and the multipath parameters is proposed, using an auxiliary extended Kalman filter during the training cycle, and then tracking of the fading parameters is performed during the data cycle using the blind MAPSD algorithm. For the second application, a single-input receiver is employed to jointly recover cochannel narrowband signals. Assuming known channels, this two-stage joint MAPSD (JMAPSD) algorithm is compared to the optimal joint maximum likelihood sequence estimator, and to the joint decision-feedback detector. A blind MAPSD algorithm for the joint recovery of cochannel signals is also presented. Computer simulation results are provided to quantify the performance of the various algorithms proposed in this dissertation.

  7. Implementation of DFT application on ternary optical computer

    NASA Astrophysics Data System (ADS)

    Junjie, Peng; Youyi, Fu; Xiaofeng, Zhang; Shuai, Kong; Xinyu, Wei

    2018-03-01

    Because of its characteristics of a huge number of data bits and low energy consumption, optical computing may be used in applications such as the DFT, which need a lot of computation and can be implemented in parallel. Accordingly, DFT implementation methods in full parallel as well as in partial parallel are presented. Based on the resources of a ternary optical computer (TOC), extensive experiments were carried out. Experimental results show that the proposed schemes are correct and feasible. They provide a foundation for further exploration of applications on the TOC that need a large amount of calculation and can be processed in parallel.
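
    The DFT is naturally data-parallel because each output frequency is an independent sum, X_k = Σ_n x_n·exp(-2πi·k·n/N). The Python sketch below splits the output index range over a small thread pool purely to illustrate the full-parallel decomposition idea; it is not a model of the ternary optical computer implementation, and a process pool would be used for true multicore scaling.

      import numpy as np
      from concurrent.futures import ThreadPoolExecutor

      def dft_chunk(x, k_start, k_end):
          n = np.arange(len(x))
          k = np.arange(k_start, k_end)
          # Each worker evaluates X_k for its own slice of output frequencies.
          return np.exp(-2j * np.pi * np.outer(k, n) / len(x)) @ x

      def parallel_dft(x, workers=4):
          edges = np.linspace(0, len(x), workers + 1, dtype=int)
          with ThreadPoolExecutor(max_workers=workers) as pool:
              parts = pool.map(dft_chunk, [x] * workers, edges[:-1], edges[1:])
          return np.concatenate(list(parts))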

  8. Displaying Data As Movies

    NASA Technical Reports Server (NTRS)

    Moore, Judith G.

    1992-01-01

    NMSB Movie computer program displays large sets of data (more than a million individual values). Presentation dynamic, rapidly displaying sequential image "frames" in main "movie" window. Any sequence of two-dimensional sets of data scaled between 0 and 255 (1-byte resolution) displayed as movie. Time- or slice-wise progression of data illustrated. Originally written to present data from three-dimensional ultrasonic scans of damaged aerospace composite materials, also illustrates data acquired by thermal-analysis systems measuring rates of heating and cooling of various materials. Developed on Macintosh IIx computer with 8-bit color display adapter and 8 megabytes of memory using Symantec Corporation's Think C, version 4.0.

  9. Design specification of an acousto-optic spectrum analyzer that could be used as an auxiliary receiver for CANEWS

    NASA Astrophysics Data System (ADS)

    Studenny, John; Johnstone, Eric

    1991-01-01

    The acousto-optic spectrum analyzer has undergone a theoretical design review and a basic parameter tradeoff analysis has been performed. The main conclusion is that for the given scenario of a 55 dB dynamic range and for a one-second temporal resolution, a 3.9 MHz resolution is a reasonable compromise with respect to current technology. Additional configurations are suggested. Noise testing of the signal detection processor algorithm was conducted. Additive white Gaussian noise was introduced to pure data. As expected, the tradeoff was between algorithm sensitivity and false alarms. No additional algorithm improvements could be made. The algorithm was observed to be robust, provided that the noise floor was set at a proper level. The digitization scheme was mainly driven by hardware constraints. To implement an analog to digital conversion scheme that linearly covers a 55 dB dynamic range would require a minimum of 17 bits. The general consensus was that 17 bits would be untenable for very large scale integration.

  10. Algorithm for Lossless Compression of Calibrated Hyperspectral Imagery

    NASA Technical Reports Server (NTRS)

    Kiely, Aaron B.; Klimesh, Matthew A.

    2010-01-01

    A two-stage predictive method was developed for lossless compression of calibrated hyperspectral imagery. The first prediction stage uses a conventional linear predictor intended to exploit spatial and/or spectral dependencies in the data. The compressor tabulates counts of the past values of the difference between this initial prediction and the actual sample value. To form the ultimate predicted value, in the second stage, these counts are combined with an adaptively updated weight function intended to capture information about data regularities introduced by the calibration process. Finally, prediction residuals are losslessly encoded using adaptive arithmetic coding. Algorithms of this type are commonly tested on a readily available collection of images from the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) hyperspectral imager. On the standard calibrated AVIRIS hyperspectral images that are most widely used for compression benchmarking, the new compressor provides more than 0.5 bits/sample improvement over the previous best compression results. The algorithm has been implemented in Mathematica. The compression algorithm was demonstrated as beneficial on 12-bit calibrated AVIRIS images.

  11. Real-time motion-based H.263+ frame rate control

    NASA Astrophysics Data System (ADS)

    Song, Hwangjun; Kim, JongWon; Kuo, C.-C. Jay

    1998-12-01

    Most existing H.263+ rate control algorithms, e.g. the one adopted in the test model of the near-term (TMN8), focus on the macroblock layer rate control and low latency under the assumptions of a constant frame rate and a constant bit rate (CBR) channel. These algorithms do not accommodate transmission bandwidth fluctuation efficiently, and the resulting video quality can be degraded. In this work, we propose a new H.263+ rate control scheme which supports the variable bit rate (VBR) channel through the adjustment of the encoding frame rate and quantization parameter. A fast algorithm for encoding frame rate control, based on the inherent motion information within a sliding window in the underlying video, is developed to efficiently pursue a good tradeoff between spatial and temporal quality. The proposed rate control algorithm also takes the time-varying bandwidth characteristic of the Internet into account and is able to accommodate the change accordingly. Experimental results are provided to demonstrate the superior performance of the proposed scheme.

  12. Design of a Low-Light-Level Image Sensor with On-Chip Sigma-Delta Analog-to- Digital Conversion

    NASA Technical Reports Server (NTRS)

    Mendis, Sunetra K.; Pain, Bedabrata; Nixon, Robert H.; Fossum, Eric R.

    1993-01-01

    The design and projected performance of a low-light-level active-pixel-sensor (APS) chip with semi-parallel analog-to-digital (A/D) conversion is presented. The individual elements have been fabricated and tested using MOSIS* 2 micrometer CMOS technology, although the integrated system has not yet been fabricated. The imager consists of a 128 x 128 array of active pixels at a 50 micrometer pitch. Each column of pixels shares a 10-bit A/D converter based on first-order oversampled sigma-delta (Sigma-Delta) modulation. The 10-bit outputs of each converter are multiplexed and read out through a single set of outputs. A semi-parallel architecture is chosen to achieve 30 frames/second operation even at low light levels. The sensor is designed for less than 12 e^- rms noise performance.

  13. Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Choudhary, Alok Nidhi

    1989-01-01

    Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.

  14. Validation of Regression-Based Myogenic Correction Techniques for Scalp and Source-Localized EEG

    PubMed Central

    McMenamin, Brenton W.; Shackman, Alexander J.; Maxwell, Jeffrey S.; Greischar, Lawrence L.; Davidson, Richard J.

    2008-01-01

    EEG and EEG source-estimation are susceptible to electromyographic artifacts (EMG) generated by the cranial muscles. EMG can mask genuine effects or masquerade as a legitimate effect - even in low frequencies, such as alpha (8–13 Hz). Although regression-based correction has been used previously, only cursory attempts at validation exist and the utility for source-localized data is unknown. To address this, EEG was recorded from 17 participants while neurogenic and myogenic activity were factorially varied. We assessed the sensitivity and specificity of four regression-based techniques: between-subjects, between-subjects using difference-scores, within-subjects condition-wise, and within-subject epoch-wise on the scalp and in data modeled using the LORETA algorithm. Although within-subject epoch-wise showed superior performance on the scalp, no technique succeeded in the source-space. Aside from validating the novel epoch-wise methods on the scalp, we highlight methods requiring further development. PMID:19298626

  15. Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh

    NASA Astrophysics Data System (ADS)

    Misaka, Takashi; Sasaki, Daisuke; Obayashi, Shigeru

    2017-11-01

    We developed a framework for a distributed-memory parallel computer that enables dynamic data management for adaptive mesh refinement and load balancing. We employed simple data structure of the building cube method (BCM) where a computational domain is divided into multi-level cubic domains and each cube has the same number of grid points inside, realising a multi-level block-structured Cartesian mesh. Solution adaptive mesh refinement, which works efficiently with the help of the dynamic load balancing, was implemented by dividing cubes based on mesh refinement criteria. The framework was investigated with the Laplace equation in terms of adaptive mesh refinement, load balancing and the parallel efficiency. It was then applied to the incompressible Navier-Stokes equations to simulate a turbulent flow around a sphere. We considered wall-adaptive cube refinement where a non-dimensional wall distance y+ near the sphere is used for a criterion of mesh refinement. The result showed the load imbalance due to y+ adaptive mesh refinement was corrected by the present approach. To utilise the BCM framework more effectively, we also tested a cube-wise algorithm switching where an explicit and implicit time integration schemes are switched depending on the local Courant-Friedrichs-Lewy (CFL) condition in each cube.
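
    The cube-wise switching described above reduces to a per-cube test of the local CFL number against a threshold. The tiny Python sketch below captures that decision rule; the function name, arguments and threshold value are illustrative assumptions, not taken from the paper.

      def choose_time_scheme(u_max, dx, dt, cfl_limit=1.0):
          # Per-cube switch: explicit integration where the local CFL number is
          # small, implicit integration where it would violate the stability limit.
          cfl = u_max * dt / dx
          return "explicit" if cfl <= cfl_limit else "implicit"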

  16. Infectious encephalitis: utility of a rational approach to aetiological diagnosis in daily clinical practice.

    PubMed

    López-Sánchez, C; Sulleiro, E; Bocanegra, C; Romero, S; Codina, G; Sanz, I; Esperalba, J; Serra, J; Pigrau, C; Burgos, J; Almirante, B; Falcó, V

    2017-04-01

    In this study we attempt to assess the utility of a simplified step-wise diagnostic algorithm to determine the aetiology of encephalitis in daily clinical practice and to describe the main causes in our setting. This was a prospective cohort study of all consecutive cases of encephalitis in adult patients diagnosed between January 2010 and March 2015 at the University Hospital Vall d'Hebron in Barcelona, Spain. The aetiological study was carried out following the proposed step-wise algorithm. The proportion of aetiological diagnoses achieved in each step was analysed. Data from 97 patients with encephalitis were assessed. Following a simplified step-wise algorithm, a definite diagnosis was made in the first step in 53 patients (55 %) and in 12 additional cases (12 %) in the second step. Overall, a definite or probable aetiological diagnosis was achieved in 78 % of the cases. Herpes virus, L. monocytogenes and M. tuberculosis were the leading causative agents demonstrated, whereas less frequent aetiologies were observed, mainly in immunosuppressed patients. The overall related mortality was 13.4 %. According to our experience, the leading and treatable causes of encephalitis can be identified in a first diagnostic step with limited microbiological studies. L. monocytogenes treatment should be considered on arrival in some patients. Additional diagnostic effort should be made in immunosuppressed patients.

  17. Evaluation of a metal artifact reduction algorithm applied to post-interventional flat detector CT in comparison to pre-treatment CT in patients with acute subarachnoid haemorrhage.

    PubMed

    Mennecke, Angelika; Svergun, Stanislav; Scholz, Bernhard; Royalty, Kevin; Dörfler, Arnd; Struffert, Tobias

    2017-01-01

    Metal artefacts can impair accurate diagnosis of haemorrhage using flat detector CT (FD-CT), especially after aneurysm coiling. Within this work we evaluate a prototype metal artefact reduction algorithm by comparison of the artefact-reduced and the non-artefact-reduced FD-CT images to pre-treatment FD-CT and multi-slice CT images. Twenty-five patients with acute aneurysmal subarachnoid haemorrhage (SAH) were selected retrospectively. FD-CT and multi-slice CT before endovascular treatment as well as FD-CT data sets after treatment were available for all patients. The algorithm was applied to post-treatment FD-CT. The effect of the algorithm was evaluated utilizing the pre-post concordance of a modified Fisher score, a subjective image quality assessment, the range of the Hounsfield units within three ROIs, and the pre-post slice-wise Pearson correlation. The pre-post concordance of the modified Fisher score, the subjective image quality, and the pre-post correlation of the ranges of the Hounsfield units were significantly higher for artefact-reduced than for non-artefact-reduced images. Within the metal-affected slices, the pre-post slice-wise Pearson correlation coefficient was higher for artefact-reduced than for non-artefact-reduced images. The overall diagnostic quality of the artefact-reduced images was improved and reached the level of the pre-interventional FD-CT images. The metal-unaffected parts of the image were not modified. • After coiling subarachnoid haemorrhage, metal artefacts seriously reduce FD-CT image quality. • This new metal artefact reduction algorithm is feasible for flat-detector CT. • After coiling, MAR is necessary for diagnostic quality of affected slices. • Slice-wise Pearson correlation is introduced to evaluate improvement of MAR in future studies. • Metal-unaffected parts of image are not modified by this MAR algorithm.

  18. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Intel Paragon, IBM SP2, and Cray Origin2000, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) to develop highly accurate parallel numerical algorithms, 2) to conduct preliminary testing to verify the effectiveness and potential of these algorithms, and 3) to incorporate newly developed algorithms into actual simulation packages. This work plan has been achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) adopting a mathematical geometry which has a better capacity to describe the fluid, and (2) using a compact scheme to gain high-order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.

  19. An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

    NASA Astrophysics Data System (ADS)

    Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

    2015-07-01

    This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for the current class of multicore architectures. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism; however, when it comes to 3D data migration, the resource requirements of the algorithm grow as the data size increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore, a smart parallelization approach is essential to handle 3D data for migration. The most compute-intensive part of the Kirchhoff depth migration algorithm is the calculation of traveltime tables, due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for poststack and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce the concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of the data to be migrated at runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM-series supercomputer. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
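
    The flexi-depth idea can be illustrated with a small scheduling sketch. This is an assumption-laden illustration (the memory model and names are invented here), not the authors' implementation: the depth axis is simply split into as many iterations as the available node memory allows for the data handled at once.

    ```python
    import math

    def plan_depth_iterations(n_depth_slices, bytes_per_slice, available_memory_bytes):
        """Split the imaging depth axis into chunks that fit in node memory."""
        slices_per_iteration = max(1, available_memory_bytes // bytes_per_slice)
        n_iterations = math.ceil(n_depth_slices / slices_per_iteration)
        chunks = []
        for i in range(n_iterations):
            start = i * slices_per_iteration
            stop = min(n_depth_slices, start + slices_per_iteration)
            chunks.append((start, stop))
        return chunks

    # Example: 2000 depth slices, 3 GB of image/traveltime data per slice, 64 GB node memory
    chunks = plan_depth_iterations(2000, 3 * 2**30, 64 * 2**30)
    print(len(chunks), chunks[0], chunks[-1])
    ```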

  20. Choosing wisely: prevalence and correlates of low-value health care services in the United States.

    PubMed

    Colla, Carrie H; Morden, Nancy E; Sequist, Thomas D; Schpero, William L; Rosenthal, Meredith B

    2015-02-01

    Specialty societies in the United States identified low-value tests and procedures that contribute to waste and poor health care quality via implementation of the American Board of Internal Medicine Foundation's Choosing Wisely initiative. To develop claims-based algorithms, to use them to estimate the prevalence of select Choosing Wisely services and to examine the demographic, health and health care system correlates of low-value care at a regional level. Using Medicare data from 2006 to 2011, we created claims-based algorithms to measure the prevalence of 11 Choosing Wisely-identified low-value services and examined geographic variation across hospital referral regions (HRRs). We created a composite low-value care score for each HRR and used linear regression to identify regional characteristics associated with more intense use of low-value services. Fee-for-service Medicare beneficiaries over age 65. Prevalence of selected Choosing Wisely low-value services. The national average annual prevalence of the selected Choosing Wisely low-value services ranged from 1.2% (upper urinary tract imaging in men with benign prostatic hyperplasia) to 46.5% (preoperative cardiac testing for low-risk, non-cardiac procedures). Prevalence across HRRs varied significantly. Regional characteristics associated with higher use of low-value services included greater overall per capita spending, a higher specialist to primary care ratio and higher proportion of minority beneficiaries. Identifying and measuring low-value health services is a prerequisite for improving quality and eliminating waste. Our findings suggest that the delivery of wasteful and potentially harmful services may be a fruitful area for further research and policy intervention for HRRs with higher per-capita spending. These findings should inform action by physicians, health systems, policymakers, payers and consumer educators to improve the value of health care by targeting services and areas with greater use of potentially inappropriate care.

  1. Parallel Algorithms for Least Squares and Related Computations.

    DTIC Science & Technology

    1991-03-22

    for dense computations in linear algebra. The work has recently been published in a general reference book on parallel algorithms by SIAM. AFOSR ... written his Ph.D. dissertation with the principal investigator. (See publication 6.) • Parallel Algorithms for Dense Linear Algebra Computations. Our ... and describe and to put into perspective a selection of the more important parallel algorithms for numerical linear algebra. We give a major new

  2. Genetic algorithms using SISAL parallel programming language

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tejada, S.

    1994-05-06

    Genetic algorithms are a mathematical optimization technique developed by John Holland at the University of Michigan [1]. The SISAL programming language possesses many of the characteristics desired to implement genetic algorithms. SISAL is a deterministic, functional programming language which is inherently parallel. Because SISAL is functional and based on mathematical concepts, genetic algorithms can be efficiently translated into the language. Several of the steps involved in genetic algorithms, such as mutation, crossover, and fitness evaluation, can be parallelized using SISAL. In this paper I will discuss the implementation and performance of parallel genetic algorithms in SISAL.

  3. S-EMG signal compression based on domain transformation and spectral shape dynamic bit allocation

    PubMed Central

    2014-01-01

    Background Surface electromyographic (S-EMG) signal processing has been emerging in the past few years due to its non-invasive assessment of muscle function and structure and because of the fast growing rate of digital technology which brings about new solutions and applications. Factors such as sampling rate, quantization word length, number of channels and experiment duration can lead to a potentially large volume of data. Efficient transmission and/or storage of S-EMG signals is currently a research issue, and that is the aim of this work. Methods This paper presents an algorithm for the data compression of surface electromyographic (S-EMG) signals recorded during an isometric contraction protocol and during dynamic experimental protocols such as the cycling activity. The proposed algorithm is based on the discrete wavelet transform to perform spectral decomposition and de-correlation, on a dynamic bit allocation procedure to code the wavelet-transformed coefficients, and on an entropy coding to minimize the remaining redundancy and to pack all data. The bit allocation scheme is based on mathematical decreasing spectral shape models, which indicate a shorter digital word length for coding high-frequency wavelet-transformed coefficients. Four bit allocation spectral shape methods were implemented and compared: decreasing exponential spectral shape, decreasing linear spectral shape, decreasing square-root spectral shape and rotated hyperbolic tangent spectral shape. Results The proposed method is demonstrated and evaluated for an isometric protocol and for a dynamic protocol using a real S-EMG signal data bank. Objective performance evaluation metrics are presented. In addition, comparisons with other encoders proposed in scientific literature are shown. Conclusions The decreasing bit allocation shape applied to the quantized wavelet coefficients, combined with arithmetic coding, results in an efficient procedure. The performance comparisons of the proposed S-EMG data compression algorithm with the established techniques found in scientific literature have shown promising results. PMID:24571620
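
    As an illustrative sketch only, a decreasing-exponential spectral shape can be turned into a per-subband word-length assignment as below; the shape parameters and the uniform quantizer are assumptions standing in for the models compared in the paper.

    ```python
    import numpy as np

    def exponential_bit_allocation(n_subbands, max_bits=16, min_bits=2, decay=0.5):
        """Assign shorter word lengths to higher-frequency wavelet subbands."""
        k = np.arange(n_subbands)
        bits = min_bits + (max_bits - min_bits) * np.exp(-decay * k)
        return np.round(bits).astype(int)

    def quantize(coeffs, n_bits):
        """Uniform quantization of wavelet coefficients with n_bits word length."""
        levels = 2 ** n_bits
        peak = float(np.max(np.abs(coeffs)))
        if peak == 0.0:
            peak = 1.0
        step = 2 * peak / levels
        return np.round(coeffs / step) * step

    # Example: 8 subbands, the lowest-frequency subband gets the most bits
    print(exponential_bit_allocation(8))
    ```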

  4. Micro-PIV/LIF measurements on electrokinetically-driven flow in surface modified microchannels

    NASA Astrophysics Data System (ADS)

    Ichiyanagi, Mitsuhisa; Sasaki, Seiichi; Sato, Yohei; Hishida, Koichi

    2009-04-01

    Effects of surface modification patterning on flow characteristics were investigated experimentally by measuring electroosmotic flow velocities, which were obtained by micron-resolution particle image velocimetry using a confocal microscope. The depth-wise velocity was evaluated by using the continuity equation and the velocity data. The microchannel was composed of a poly(dimethylsiloxane) chip and a borosilicate cover-glass plate. Surface modification patterns were fabricated by modifying the glass surface with octadecyltrichlorosilane (OTS). OTS can decrease the electroosmotic flow velocity compared to the velocity in the glass microchannel. For the surface charge varying parallel to the electric field, the depth-wise velocity was generated at the boundary area between the OTS and glass surfaces. For the surface charge varying perpendicular to the electric field, the depth-wise velocity did not form because the surface charge did not vary in the stream-wise direction. The surface charge pattern with the oblique stripes yielded a three-dimensional flow in a microchannel. Furthermore, the oblique patterning was applied to a mixing flow field in a T-shaped microchannel, and mixing efficiencies were evaluated from the degree of heterogeneity of the fluorescent dye intensity, which was obtained by laser-induced fluorescence. It was found that the angle of the oblique stripes is an important factor in promoting span-wise and depth-wise momentum transport and contributes to mixing in a microchannel.

  5. On the suitability of the connection machine for direct particle simulation

    NASA Technical Reports Server (NTRS)

    Dagum, Leonard

    1990-01-01

    The algorithmic structure of the vectorizable Stanford particle simulation (SPS) method was examined and reformulated in data-parallel form. Some of the SPS algorithms can be directly translated to data-parallel form, but several of the vectorizable algorithms have no direct data-parallel equivalent. This requires the development of new, strictly data-parallel algorithms. In particular, a new sorting algorithm is developed to identify collision candidates in the simulation and a master/slave algorithm is developed to minimize communication cost in large table look-ups. Validation of the method is undertaken through test calculations for thermal relaxation of a gas, shock wave profiles, and shock reflection from a stationary wall. A qualitative measure is provided of the performance of the Connection Machine for direct particle simulation. The massively parallel architecture of the Connection Machine is found quite suitable for this type of calculation. However, there are difficulties in taking full advantage of this architecture because of the lack of a broad-based tradition of data-parallel programming. An important outcome of this work has been new data-parallel algorithms specifically of use for direct particle simulation but which also expand the data-parallel diction.

  6. Least reliable bits coding (LRBC) for high data rate satellite communications

    NASA Technical Reports Server (NTRS)

    Vanderaar, Mark; Budinger, James; Wagner, Paul

    1992-01-01

    LRBC, a bandwidth efficient multilevel/multistage block-coded modulation technique, is analyzed. LRBC uses simple multilevel component codes that provide increased error protection on increasingly unreliable modulated bits in order to maintain an overall high code rate that increases spectral efficiency. Soft-decision multistage decoding is used to make decisions on unprotected bits through corrections made on more protected bits. Analytical expressions and tight performance bounds are used to show that LRBC can achieve increased spectral efficiency and maintain equivalent or better power efficiency compared to that of BPSK. The relative simplicity of Galois field algebra vs the Viterbi algorithm and the availability of high-speed commercial VLSI for block codes indicates that LRBC using block codes is a desirable method for high data rate implementations.

  7. Designing an efficient LT-code with unequal error protection for image transmission

    NASA Astrophysics Data System (ADS)

    S. Marques, F.; Schwartz, C.; Pinho, M. S.; Finamore, W. A.

    2015-10-01

    The use of images from earth observation satellites is spread over different applications, such as car navigation systems and disaster monitoring. In general, those images are captured by on-board imaging devices and must be transmitted to the Earth using a communication system. Even though a high resolution image can produce a better Quality of Service, it leads to transmitters with a high bit rate, which require a large bandwidth and expend a large amount of energy. Therefore, it is very important to design efficient communication systems. From communication theory, it is well known that a source encoder is crucial in an efficient system. In a remote sensing satellite image transmission, this efficiency is achieved by using an image compressor, to reduce the amount of data which must be transmitted. The Consultative Committee for Space Data Systems (CCSDS), a multinational forum for the development of communications and data system standards for space flight, establishes a recommended standard for a data compression algorithm for images from space systems. Unfortunately, in the satellite communication channel, the transmitted signal is corrupted by the presence of noise, interference signals, etc. Therefore, the receiver of a digital communication system may fail to recover the transmitted bit. In practice, a channel code can be used to reduce the effect of this failure. In 2002, the Luby Transform code (LT-code) was introduced and it was shown that it was very efficient when the binary erasure channel model was used. Since the effect of the bit recovery failure depends on the position of the bit in the compressed image stream, in the last decade many efforts have been made to develop LT-codes with unequal error protection. In 2012, Arslan et al. showed improvements when LT-codes with unequal error protection were used in images compressed by the SPIHT algorithm. The techniques presented by Arslan et al. can be adapted to work with the algorithm for image compression recommended by CCSDS. In fact, to design an LT-code with unequal error protection, the bit stream produced by the algorithm recommended by CCSDS must be partitioned into M disjoint sets of bits. Using the weighted approach, the LT-code produces a different failure probability for each set of bits, p1, ..., pM, leading to a total probability of failure, p, which is an average of p1, ..., pM. In general, the parameters of the LT-code with unequal error protection are chosen using a heuristic procedure. In this work, we analyze the problem of choosing the LT-code parameters to optimize two figures of merit: (a) the probability of achieving a minimum acceptable PSNR, and (b) the mean of PSNR, given that the minimum acceptable PSNR has been achieved. Given the rate-distortion curve achieved by the CCSDS recommended algorithm, this work establishes a closed form of the mean of PSNR (given that the minimum acceptable PSNR has been achieved) as a function of p1, ..., pM. The main contribution of this work is the study of a criterion for selecting the parameters p1, ..., pM to optimize the performance of image transmission.
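
    For illustration, the overall failure probability described above can be computed as an average over the M protection classes; weighting each class by its number of bits is an assumption of this sketch, not a detail taken from the paper.

    ```python
    def overall_failure_probability(set_sizes, failure_probs):
        """Average bit-recovery failure probability over M protection classes.

        set_sizes[i]  -- number of compressed-stream bits in class i
        failure_probs -- p_1, ..., p_M for the unequal-error-protection LT-code
        """
        total = sum(set_sizes)
        return sum(n * p for n, p in zip(set_sizes, failure_probs)) / total

    # Example: headers heavily protected, refinement bits lightly protected
    print(overall_failure_probability([1000, 9000], [1e-4, 1e-2]))
    ```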

  8. Demodulation Algorithms for the Ofdm Signals in the Time- and Frequency-Scattering Channels

    NASA Astrophysics Data System (ADS)

    Bochkov, G. N.; Gorokhov, K. V.; Kolobkov, A. V.

    2016-06-01

    We consider a method based on the generalized maximum-likelihood rule for solving the problem of reception of the signals with orthogonal frequency division multiplexing of their harmonic components (OFDM signals) in the time- and frequency-scattering channels. The coherent and incoherent demodulators effectively using the time scattering due to the fast fading of the signal are developed. Using computer simulation, we performed comparative analysis of the proposed algorithms and well-known signal-reception algorithms with equalizers. The proposed symbol-by-symbol detector with decision feedback and restriction of the number of searched variants is shown to have the best bit-error-rate performance. It is shown that under conditions of the limited accuracy of estimating the communication-channel parameters, the incoherent OFDM-signal detectors with differential phase-shift keying can ensure a better bit-error-rate performance compared with the coherent OFDM-signal detectors with absolute phase-shift keying.

  9. Performance Enhancement of MC-CDMA System through Novel Sensitive Bit Algorithm Aided Turbo Multi User Detection

    PubMed Central

    Kumaravel, Rasadurai; Narayanaswamy, Kumaratharan

    2015-01-01

    Multi carrier code division multiple access (MC-CDMA) system is a promising multi carrier modulation (MCM) technique for high data rate wireless communication over frequency selective fading channels. MC-CDMA system is a combination of code division multiple access (CDMA) and orthogonal frequency division multiplexing (OFDM). The OFDM parts reduce multipath fading and inter symbol interference (ISI) and the CDMA part increases spectrum utilization. Advantages of this technique are its robustness to multipath propagation and improved security with minimized ISI. Nevertheless, due to the loss of orthogonality at the receiver in a mobile environment, multiple access interference (MAI) appears. The MAI is one of the factors that degrade the bit error rate (BER) performance of the MC-CDMA system. Multiuser detection (MUD) and turbo coding are the two dominant techniques for enhancing the performance of MC-CDMA systems in terms of BER, as solutions to overcome MAI effects. In this paper, a low-complexity iterative soft sensitive bits algorithm (SBA) aided logarithmic maximum a posteriori (Log-MAP) based turbo MUD is proposed. Simulation results show that the proposed method provides better BER performance with low complexity decoding, by mitigating the detrimental effects of MAI. PMID:25714917

  10. Error control techniques for satellite and space communications

    NASA Technical Reports Server (NTRS)

    Costello, D. J., Jr.

    1986-01-01

    High rate concatenated coding systems with trellis inner codes and Reed-Solomon (RS) outer codes for application in satellite communication systems are considered. Two types of inner codes are studied: high rate punctured binary convolutional codes which result in overall effective information rates between 1/2 and 1 bit per channel use; and bandwidth efficient signal space trellis codes which can achieve overall effective information rates greater than 1 bit per channel use. Channel capacity calculations with and without side information were performed for the concatenated coding system. Concatenated coding schemes are investigated. In Scheme 1, the inner code is decoded with the Viterbi algorithm and the outer RS code performs error-correction only (decoding without side information). In Scheme 2, the inner code is decoded with a modified Viterbi algorithm which produces reliability information along with the decoded output. In this algorithm, path metrics are used to estimate the entire information sequence, while branch metrics are used to provide the reliability information on the decoded sequence. This information is used to erase unreliable bits in the decoded output. An errors-and-erasures RS decoder is then used for the outer code. These two schemes are proposed for use on NASA satellite channels. Results indicate that high system reliability can be achieved with little or no bandwidth expansion.

  11. Sum of the Magnitude for Hard Decision Decoding Algorithm Based on Loop Update Detection.

    PubMed

    Meng, Jiahui; Zhao, Danfeng; Tian, Hai; Zhang, Liang

    2018-01-15

    In order to improve the performance of the hard decision decoding algorithm for non-binary low-density parity check (LDPC) codes and to reduce the decoding complexity, a sum-of-the-magnitude hard decision decoding algorithm based on loop update detection is proposed. This will also ensure the reliability, stability and high transmission rate of 5G mobile communication. The algorithm is based on the hard decision decoding algorithm (HDA) and uses the soft information from the channel to calculate the reliability, while the sum of the variable nodes' (VN) magnitude is excluded for computing the reliability of the parity checks. At the same time, the reliability information of the variable node is considered and the loop update detection algorithm is introduced. The bits corresponding to the erroneous code word are flipped multiple times, searched in order of decreasing error probability, until the correct code word is finally found. Simulation results show that the performance of one of the improved schemes is better than the weighted symbol flipping (WSF) algorithm under different hexadecimal numbers by about 2.2 dB and 2.35 dB, respectively, at a bit error rate (BER) of 10^-5 over an additive white Gaussian noise (AWGN) channel. Furthermore, the average number of decoding iterations is significantly reduced.

  12. Runtime support for parallelizing data mining algorithms

    NASA Astrophysics Data System (ADS)

    Jin, Ruoming; Agrawal, Gagan

    2002-03-01

    With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed, starting from a common specification of the algorithm.
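
    A toy version of such a reduction-object interface might look like the sketch below; the names and structure are illustrative assumptions, not the paper's API. Each worker accumulates into a local object and the partial objects are merged associatively.

    ```python
    from functools import reduce
    from multiprocessing import Pool

    class CountReduction:
        """Toy reduction object: local accumulation plus an associative merge."""

        def __init__(self):
            self.counts = {}

        def accumulate(self, item):
            self.counts[item] = self.counts.get(item, 0) + 1
            return self

        def merge(self, other):
            for key, value in other.counts.items():
                self.counts[key] = self.counts.get(key, 0) + value
            return self

    def count_chunk(chunk):
        obj = CountReduction()
        for item in chunk:
            obj.accumulate(item)
        return obj

    if __name__ == "__main__":
        data = ["a", "b", "a", "c", "b", "a"] * 1000
        chunks = [data[i::4] for i in range(4)]
        with Pool(4) as pool:
            partials = pool.map(count_chunk, chunks)
        result = reduce(lambda x, y: x.merge(y), partials)
        print(result.counts)
    ```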

  13. Parallel and Preemptable Dynamically Dimensioned Search Algorithms for Single and Multi-objective Optimization in Water Resources

    NASA Astrophysics Data System (ADS)

    Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.

    2015-12-01

    We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms are unique from most existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption terminates simulation model runs early (for example, prior to completely simulating the model calibration period) when intermediate results indicate that the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model pre-emption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations of these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems from multi-core desktops to a supercomputer system) and package these for future modellers within a model-independent calibration software package called Ostrich, as well as in MATLAB versions. Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance of parallel, pre-emptable DDS algorithms. Case studies include single- and multiple-objective optimization problems in water resources model calibration and in many cases linear or near linear speedups are observed.
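
    A minimal sketch of the deterministic model pre-emption idea follows, assuming a cost function that only accumulates (e.g., a sum of squared errors), so that a partial cost already exceeding the incumbent can never recover; the callback name and structure are assumptions, not Ostrich code.

    ```python
    def run_model_with_preemption(candidate, simulate_step, n_steps, best_cost_so_far):
        """Evaluate a candidate, aborting early once the partial cost exceeds the incumbent."""
        partial_cost = 0.0
        for step in range(n_steps):
            partial_cost += simulate_step(candidate, step)   # non-negative increment assumed
            if partial_cost > best_cost_so_far:
                return float("inf")   # pre-empted: cannot improve on the best solution so far
        return partial_cost
    ```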

  14. Parallel transformation of K-SVD solar image denoising algorithm

    NASA Astrophysics Data System (ADS)

    Liang, Youwen; Tian, Yu; Li, Mei

    2017-02-01

    The images obtained by observing the sun through a large telescope always suffer from noise due to the low SNR. The K-SVD denoising algorithm can effectively remove Gaussian white noise. Training dictionaries for sparse representations is a time-consuming task, due to the large size of the data involved and to the complexity of the training algorithms. In this paper, OpenMP parallel programming is used to transform the serial algorithm into a parallel version. A data parallelism model is used to transform the algorithm; the biggest change is that multiple atoms, rather than one, are updated simultaneously. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. The speedup of the program is 13.563 when using 16 cores. This parallel version can fully utilize multi-core CPU hardware resources, greatly reduce running time, and is easy to port to multi-core platforms.

  15. A parallel simulated annealing algorithm for standard cell placement on a hypercube computer

    NASA Technical Reports Server (NTRS)

    Jones, Mark Howard

    1987-01-01

    A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.

  16. Massively parallel algorithms for real-time wavefront control of a dense adaptive optics system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fijany, A.; Milman, M.; Redding, D.

    1994-12-31

    In this paper, massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optics system (SELENE) are presented. The authors have already shown that the computation of a near-optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although this represents an optimal computation, due to the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem, since it demands a sustained computational throughput of the order of 10 GFlops. They develop a novel algorithm, designated the Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. Due to these features, this algorithm is significantly more efficient than other fast Poisson solvers for implementation on massively parallel architectures. The authors also discuss two massively parallel, algorithmically specialized architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.

  17. Least Reliable Bits Coding (LRBC) for high data rate satellite communications

    NASA Technical Reports Server (NTRS)

    Vanderaar, Mark; Wagner, Paul; Budinger, James

    1992-01-01

    An analysis and discussion of a bandwidth efficient multi-level/multi-stage block coded modulation technique called Least Reliable Bits Coding (LRBC) is presented. LRBC uses simple multi-level component codes that provide increased error protection on increasingly unreliable modulated bits in order to maintain an overall high code rate that increases spectral efficiency. Further, soft-decision multi-stage decoding is used to make decisions on unprotected bits through corrections made on more protected bits. Using analytical expressions and tight performance bounds it is shown that LRBC can achieve increased spectral efficiency and maintain equivalent or better power efficiency compared to that of Binary Phase Shift Keying (BPSK). Bit error rates (BER) vs. channel bit energy with Additive White Gaussian Noise (AWGN) are given for a set of LRB Reed-Solomon (RS) encoded 8PSK modulation formats with an ensemble rate of 8/9. All formats exhibit a spectral efficiency of 2.67 = (log2(8))(8/9) information bps/Hz. Bit by bit coded and uncoded error probabilities with soft-decision information are determined. These are traded off with code rate to determine parameters that achieve good performance. The relative simplicity of Galois field algebra vs. the Viterbi algorithm and the availability of high speed commercial Very Large Scale Integration (VLSI) for block codes indicates that LRBC using block codes is a desirable method for high data rate implementations.

  18. Scalable non-negative matrix tri-factorization.

    PubMed

    Čopar, Andrej; Žitnik, Marinka; Zupan, Blaž

    2017-01-01

    Matrix factorization is a well established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space enabling generalization, noise removal and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization about data residing in one latent space. Matrix tri-factorization solves this by inferring a separate latent space for each dimension in a data matrix, and a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining. We developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system we demonstrate that our approach can be more than 100-times faster than its single-processor counterpart. A general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful for parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.
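
    A sketch of the block-wise data layout only is given below; the factor-update and merge rules that make the parallel scheme mathematically equivalent to serial tri-factorization are omitted, and all names are assumptions.

    ```python
    import numpy as np

    def partition_blocks(X, row_blocks, col_blocks):
        """Partition a data matrix into disjoint submatrices for parallel processing."""
        row_parts = np.array_split(np.arange(X.shape[0]), row_blocks)
        col_parts = np.array_split(np.arange(X.shape[1]), col_blocks)
        return {(i, j): X[np.ix_(rows, cols)]
                for i, rows in enumerate(row_parts)
                for j, cols in enumerate(col_parts)}

    blocks = partition_blocks(np.random.rand(1000, 800), row_blocks=4, col_blocks=2)
    print(sorted(blocks), blocks[(0, 0)].shape)
    ```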

  19. Speeding up parallel processing

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.

    1988-01-01

    In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.
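
    For reference, the fixed-size-problem speedup bound usually attributed to Amdahl can be written as a one-line function; the parameter values below are purely illustrative.

    ```python
    def amdahl_speedup(parallel_fraction, processors):
        """Classical Amdahl's-law speedup for a fixed-size problem."""
        serial_fraction = 1.0 - parallel_fraction
        return 1.0 / (serial_fraction + parallel_fraction / processors)

    # Even with 1024 processors, a 1% serial fraction caps speedup near 91x,
    # which is the intuition behind the scepticism described above.
    print(amdahl_speedup(0.99, 1024))
    ```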

  20. Blind One-Bit Compressive Sampling

    DTIC Science & Technology

    2013-01-17

    [14] Q. Li, C. A. Micchelli, L. Shen, and Y. Xu, A proximity algorithm accelerated by Gauss-Seidel iterations for L1/TV denoising models, Inverse... methods for nonconvex optimization on the unit sphere and has provable convergence guarantees. Binary iterative hard thresholding (BIHT) algorithms were... Convergence analysis of the algorithm is presented. Our approach is to obtain a sequence of optimization problems by successively approximating the ℓ0

  1. The PASM Parallel Processing System: Hardware Design and Intelligent Operating System Concepts

    DTIC Science & Technology

    1986-07-01


  2. Parallel Algorithms for Groebner-Basis Reduction

    DTIC Science & Technology

    1987-09-25

    Technical report: Parallel Algorithms for Groebner-Basis Reduction, from the project Productivity Engineering in the UNIX Environment.

  3. Parallel and fault-tolerant algorithms for hypercube multiprocessors

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aykanat, C.

    1988-01-01

    Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that the hypercube topology is scalable for this class of FE problems. The SCG algorithm is also shown to be suitable for vectorization, and near-supercomputer performance is achieved on a vector hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.

  4. Hybrid massively parallel fast sweeping method for static Hamilton-Jacobi equations

    NASA Astrophysics Data System (ADS)

    Detrixhe, Miles; Gibou, Frédéric

    2016-10-01

    The fast sweeping method is a popular algorithm for solving a variety of static Hamilton-Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
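
    As a point of reference, a serial fast sweeping solver for the simplest eikonal case |∇T| = s on a uniform 2-D grid can be sketched as below; this is the baseline method that the hybrid parallel algorithm accelerates, not the authors' code.

    ```python
    import numpy as np

    def fast_sweep_eikonal(source_mask, h=1.0, slowness=1.0, n_sweeps=4):
        """Gauss-Seidel fast sweeping for |grad T| = slowness with T = 0 on the source."""
        big = 1e10
        T = np.where(source_mask, 0.0, big)
        ny, nx = T.shape
        orderings = [(range(ny), range(nx)),
                     (range(ny - 1, -1, -1), range(nx)),
                     (range(ny), range(nx - 1, -1, -1)),
                     (range(ny - 1, -1, -1), range(nx - 1, -1, -1))]
        for _ in range(n_sweeps):
            for rows, cols in orderings:       # alternate sweep directions
                for i in rows:
                    for j in cols:
                        if source_mask[i, j]:
                            continue
                        a = min(T[i - 1, j] if i > 0 else big,
                                T[i + 1, j] if i < ny - 1 else big)
                        b = min(T[i, j - 1] if j > 0 else big,
                                T[i, j + 1] if j < nx - 1 else big)
                        if abs(a - b) >= slowness * h:
                            t_new = min(a, b) + slowness * h
                        else:
                            t_new = 0.5 * (a + b + np.sqrt(2 * (slowness * h) ** 2 - (a - b) ** 2))
                        T[i, j] = min(T[i, j], t_new)
        return T

    src = np.zeros((50, 50), dtype=bool)
    src[25, 25] = True
    print(fast_sweep_eikonal(src)[0, 0])   # roughly the distance from the corner to the source
    ```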

  5. Data recording and playback on video tape--a multi-channel analog interface for a digital audio processor system.

    PubMed

    Blaettler, M; Bruegger, A; Forster, I C; Lehareinger, Y

    1988-03-01

    The design of an analog interface to a digital audio signal processor (DASP)-video cassette recorder (VCR) system is described. The complete system represents a low-cost alternative to both FM instrumentation tape recorders and multi-channel chart recorders. The interface or DASP input-output unit described in this paper enables the recording and playback of up to 12 analog channels with a maximum of 12 bit resolution and a bandwidth of 2 kHz per channel. Internal control and timing in the recording component of the interface is performed using ROMs which can be reprogrammed to suit different analog-to-digital converter hardware. Improvement in the bandwidth specifications is possible by connecting channels in parallel. A parallel 16 bit data output port is provided for direct transfer of the digitized data to a computer.

  6. Integrated test system of infrared and laser data based on USB 3.0

    NASA Astrophysics Data System (ADS)

    Fu, Hui Quan; Tang, Lin Bo; Zhang, Chao; Zhao, Bao Jun; Li, Mao Wen

    2017-07-01

    Based on USB 3.0, this paper presents the design of an integrated test system for an infrared image and laser signal data processing module. The core of the design is FPGA logic control; the design uses dual-chip DDR3 SDRAM to achieve high-speed caching of laser data, receives parallel LVDS image data through a serial-to-parallel conversion chip, and achieves high-speed data communication between the system and the host computer through the USB 3.0 bus. The experimental results show that the developed PC software realizes the real-time display of the original 14-bit LVDS image after 14-to-8 bit conversion and of the JPEG2000-compressed image after decompression in software, and can display the acquired laser signal data in real time. The correctness of the test system design is verified, indicating that the interface link works properly.

  7. The SMART MIL-STD-1553 bus adapter hardware manual

    NASA Technical Reports Server (NTRS)

    Ton, T. T.

    1981-01-01

    The SMART Multiplexer Interface Adapter (SMIA), a complete system interface for the MIL-STD-1553 message structure, is described. It provides buffering and storage for transmitted and received data and handles all the necessary handshaking to interface between a parallel 8-bit data bus and a MIL-STD serial bit stream. The bus adapter is configured as either a bus controller or a remote terminal interface. It is coupled directly to the multiplex bus, or stub-coupled through an additional isolation transformer located at the connection point. Fault isolation resistors provide short-circuit protection.

  8. Large-Constraint-Length, Fast Viterbi Decoder

    NASA Technical Reports Server (NTRS)

    Collins, O.; Dolinar, S.; Hsu, In-Shek; Pollara, F.; Olson, E.; Statman, J.; Zimmerman, G.

    1990-01-01

    Scheme for efficient interconnection makes VLSI design feasible. Concept for fast Viterbi decoder provides for processing of convolutional codes of constraint length K up to 15 and rates of 1/2 to 1/6. Fully parallel (but bit-serial) architecture developed for decoder of K = 7 implemented in single dedicated VLSI circuit chip. Contains six major functional blocks. VLSI circuits perform branch metric computations, add-compare-select operations, and then store decisions in traceback memory. Traceback processor reads appropriate memory locations and puts out decoded bits. Used as building block for decoders of larger K.
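
    To illustrate the branch-metric, add-compare-select and traceback structure mentioned above, here is a minimal hard-decision Viterbi decoder for a toy K = 3, rate-1/2 code; the flight decoder described handles constraint lengths up to 15 with soft decisions and a bit-serial parallel architecture, so this sketch is only a conceptual reference.

    ```python
    G = (0b111, 0b101)   # generator polynomials for a toy K = 3, rate-1/2 code
    N_STATES = 4         # 2 ** (K - 1)

    def encode(bits):
        """Convolutional encoder: emits two coded bits per input bit."""
        state, out = 0, []
        for b in bits:
            reg = (b << 2) | state                            # 3-bit register, newest bit on top
            out += [bin(reg & g).count("1") & 1 for g in G]
            state = reg >> 1
        return out

    def viterbi_decode(received, n_bits):
        """Hard-decision Viterbi decoding with Hamming branch metrics."""
        INF = float("inf")
        metrics = [0.0] + [INF] * (N_STATES - 1)              # start in the all-zero state
        history = []
        for t in range(n_bits):
            r = received[2 * t: 2 * t + 2]
            new_metrics = [INF] * N_STATES
            back = [None] * N_STATES
            for s in range(N_STATES):
                if metrics[s] == INF:
                    continue
                for b in (0, 1):
                    reg = (b << 2) | s
                    expected = [bin(reg & g).count("1") & 1 for g in G]
                    branch = sum(e != x for e, x in zip(expected, r))
                    ns = reg >> 1
                    if metrics[s] + branch < new_metrics[ns]:  # add-compare-select
                        new_metrics[ns] = metrics[s] + branch
                        back[ns] = (s, b)
            metrics = new_metrics
            history.append(back)
        state = min(range(N_STATES), key=lambda s: metrics[s])
        decoded = []
        for back in reversed(history):                         # traceback
            state, b = back[state]
            decoded.append(b)
        return decoded[::-1]

    message = [1, 0, 1, 1, 0, 0, 1, 0]
    assert viterbi_decode(encode(message), len(message)) == message
    ```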

  9. Biological sequence compression algorithms.

    PubMed

    Matsumoto, T; Sadakane, K; Imai, H

    2000-01-01

    Today, more and more DNA sequences are becoming available. The information about DNA sequences is stored in molecular biology databases. The size and importance of these databases will keep growing in the future; therefore, this information must be stored or communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences, but only expand them in size. On the other hand, CTW (the Context Tree Weighting method) can compress DNA sequences to less than two bits per symbol. These algorithms do not use the special structures of biological sequences. Two characteristic structures of DNA sequences are known. One is palindromes, or reverse complements, and the other is approximate repeats. Several specific algorithms for DNA sequences that use these structures can compress them to less than two bits per symbol. In this paper, we improve CTW so that the characteristic structures of DNA sequences can be exploited. Before encoding the next symbol, the algorithm searches for an approximate repeat and a palindrome using hashing and dynamic programming. If there is a palindrome or an approximate repeat of sufficient length, then our algorithm represents it by its length and distance. By using this preprocessing, the new program achieves a slightly higher compression ratio than existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
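
    Two of the ingredients mentioned above are easy to illustrate: the two-bit-per-symbol baseline and the reverse-complement ("palindrome") structure. The snippet below is a didactic sketch, not the paper's CTW-based coder.

    ```python
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack_two_bits(seq):
        """Baseline 2-bit-per-symbol packing of a DNA sequence into bytes."""
        value = 0
        for base in seq:
            value = (value << 2) | CODE[base]
        n_bytes = (2 * len(seq) + 7) // 8
        return value.to_bytes(n_bytes, "big")

    def reverse_complement(seq):
        """Reverse complement (the 'palindrome' structure in the biological sense)."""
        return "".join(COMPLEMENT[b] for b in reversed(seq))

    def is_biological_palindrome(seq):
        return seq == reverse_complement(seq)

    print(pack_two_bits("ACGT").hex())         # '1b' = 00 01 10 11
    print(is_biological_palindrome("GAATTC"))  # True: the EcoRI recognition site
    ```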

  10. Parallel Algorithms for Switching Edges in Heterogeneous Graphs.

    PubMed

    Bhuiyan, Hasanuzzaman; Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

    2017-06-01

    An edge switch is an operation on a graph (or network) where two edges are selected randomly and one end vertex of each edge is swapped with that of the other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors, leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors.
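
    A sketch of the serial edge-switch operation with the simplicity check (no self-loops, no parallel edges) is given below; the data structures are assumptions, and the paper's contribution, the distributed-memory parallelization of many such operations, is not shown here.

    ```python
    import random

    def switch_edge(edges, adjacency):
        """One edge switch on a simple undirected graph; returns False if rejected.

        edges     -- list of (u, v) tuples
        adjacency -- dict mapping each vertex to a set of neighbours
        """
        i, j = random.sample(range(len(edges)), 2)
        (u, v), (x, y) = edges[i], edges[j]
        # swap one endpoint of each edge: (u, v), (x, y) -> (u, y), (x, v)
        if u == y or x == v or y in adjacency[u] or v in adjacency[x]:
            return False                     # would create a self-loop or parallel edge
        adjacency[u].remove(v); adjacency[v].remove(u)
        adjacency[x].remove(y); adjacency[y].remove(x)
        adjacency[u].add(y); adjacency[y].add(u)
        adjacency[x].add(v); adjacency[v].add(x)
        edges[i], edges[j] = (u, y), (x, v)
        return True
    ```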

  11. Parallel Algorithms for Switching Edges in Heterogeneous Graphs

    PubMed Central

    Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

    2017-01-01

    An edge switch is an operation on a graph (or network) where two edges are selected randomly and one end vertex of each edge is swapped with that of the other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors, leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors. PMID:28757680

  12. Scaling Up Coordinate Descent Algorithms for Large ℓ1 Regularization Problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Scherrer, Chad; Halappanavar, Mahantesh; Tewari, Ambuj

    2012-07-03

    We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases---Thread-Greedy CD and Coloring-Based CD---and give performance measurements for an OpenMP implementation of these.
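
    For context, the per-coordinate update that all of these variants build on can be sketched for the lasso problem; this is the plain sequential cyclic version, not the Thread-Greedy or Coloring-Based schedulers introduced in the report.

    ```python
    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def cyclic_cd_lasso(X, y, lam, n_epochs=100):
        """Cyclic coordinate descent for: minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1."""
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        residual = y - X @ w
        col_norms = (X ** 2).sum(axis=0)
        for _ in range(n_epochs):
            for j in range(n_features):
                if col_norms[j] == 0.0:
                    continue
                # partial correlation of feature j with the residual excluding w[j]
                rho = X[:, j] @ residual + col_norms[j] * w[j]
                w_new = soft_threshold(rho, lam) / col_norms[j]
                residual += X[:, j] * (w[j] - w_new)
                w[j] = w_new
        return w
    ```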

  13. Parallel language constructs for tensor product computations on loosely coupled architectures

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Van Rosendale, John

    1989-01-01

    A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. The authors focus on tensor product array computations, a simple but important class of numerical algorithms. They consider first the problem of programming one-dimensional kernel routines, such as parallel tridiagonal solvers, and then look at how such parallel kernels can be combined to form parallel tensor product algorithms.
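
    The kind of structure such primitives target can be seen in the classic trick for applying a tensor (Kronecker) product operator without forming it; the sketch below is purely illustrative and not tied to the proposed language.

    ```python
    import numpy as np

    def kron_matvec(A, B, x):
        """Apply (A ⊗ B) to a vector via reshaping: y = reshape(A @ X @ B.T).

        The two dense multiplies are independent kernels that are easy to
        parallelize, which is the structure tensor product algorithms exploit.
        """
        m, n = A.shape
        p, q = B.shape
        X = x.reshape(n, q)
        return (A @ X @ B.T).reshape(m * p)

    A = np.random.rand(3, 4)
    B = np.random.rand(5, 6)
    x = np.random.rand(4 * 6)
    assert np.allclose(kron_matvec(A, B, x), np.kron(A, B) @ x)
    ```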

  14. Investigation of interference in multiple-input multiple-output wireless transmission at W band for an optical wireless integration system.

    PubMed

    Li, Xinying; Yu, Jianjun; Dong, Ze; Zhang, Junwen; Chi, Nan; Yu, Jianguo

    2013-03-01

    We experimentally investigate the interference in multiple-input multiple-output (MIMO) wireless transmission by adjusting the relative locations of horn antennas (HAs) in a 100 GHz optical wireless integration system, which can deliver a 50 Gb/s polarization-division-multiplexing quadrature-phase-shift-keying signal over 80 km of single-mode fiber-28 and a 2×2 MIMO wireless link. For the parallel 2×2 MIMO wireless link, each receiver HA can only get wireless power from the corresponding transmitter HA, while for the crossover ones, the receiver HA can get wireless power from two transmitter HAs. At the wireless receiver, polarization demultiplexing is realized by the constant modulus algorithm (CMA) in the digital-signal-processing part. Compared to the parallel case, wireless interference causes about a 2 dB optical signal-to-noise ratio penalty at a bit-error ratio (BER) of 3.8×10^-3 for the crossover cases if similar CMA taps are employed. An increase in CMA tap length can reduce wireless interference and improve BER performance. Furthermore, more CMA taps should be adopted to overcome the severe wireless interference when two pairs of transmitter and receiver HAs have different wireless distances.

  15. Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce.

    PubMed

    Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan

    2016-01-01

    A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network's initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data.

  16. From 16-bit to high-accuracy IDCT approximation: fruits of single architecture affliation

    NASA Astrophysics Data System (ADS)

    Liu, Lijie; Tran, Trac D.; Topiwala, Pankaj

    2007-09-01

    In this paper, we demonstrate an effective unified framework for high-accuracy approximation of the irrational-coefficient floating-point IDCT by a single integer-coefficient fixed-point architecture. Our framework is based on a modified version of Loeffler's sparse DCT factorization, and the IDCT architecture is constructed via a cascade of dyadic lifting steps and butterflies. We illustrate that simply varying the accuracy of the approximating parameters yields a large family of standard-compliant IDCTs, from rare 16-bit approximations catering to portable computing to ultra-high-accuracy 32-bit versions that virtually eliminate any drifting effect when paired with the 64-bit floating-point IDCT at the encoder. The drifting performance of the proposed IDCTs, along with that of existing popular IDCT algorithms in H.263+, MPEG-2 and MPEG-4, is also demonstrated.

  17. Converting point-wise nuclear cross sections to pole representation using regularized vector fitting

    NASA Astrophysics Data System (ADS)

    Peng, Xingjie; Ducru, Pablo; Liu, Shichang; Forget, Benoit; Liang, Jingang; Smith, Kord

    2018-03-01

    Direct Doppler broadening of nuclear cross sections in Monte Carlo codes has been widely sought for coupled reactor simulations. One recent approach proposed analytical broadening using a pole representation of the commonly used resonance models and the introduction of a local windowing scheme to improve performance (Hwang, 1987; Forget et al., 2014; Josey et al., 2015, 2016). This pole representation has been achieved in the past by converting resonance parameters in the evaluated nuclear data library into poles and residues. However, cross sections of some isotopes are only provided as point-wise data in the ENDF/B-VII.1 library. To convert these isotopes to pole representation, a recent approach has been proposed using the relaxed vector fitting (RVF) algorithm (Gustavsen and Semlyen, 1999; Gustavsen, 2006; Liu et al., 2018). This approach, however, requires the number of poles to be specified ahead of time. This article addresses this issue by adding a poles and residues filtering step to the RVF procedure. This regularized VF (ReV-Fit) algorithm is shown to efficiently converge the poles close to the physical ones, eliminating most of the superfluous poles, and thus enabling the conversion of point-wise nuclear cross sections.

  18. Multirate-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform.

    PubMed

    Tao, Liang; Kwan, Hon Keung

    2012-07-01

    Novel algorithms for the multirate and fast parallel implementation of the 2-D discrete Hartley transform (DHT)-based real-valued discrete Gabor transform (RDGT) and its inverse transform are presented in this paper. A 2-D multirate-based analysis convolver bank is designed for the 2-D RDGT, and a 2-D multirate-based synthesis convolver bank is designed for the 2-D inverse RDGT. The parallel channels in each of the two convolver banks have a unified structure and can apply the 2-D fast DHT algorithm to speed up their computations. The computational complexity of each parallel channel is low and is independent of the Gabor oversampling rate. All the 2-D RDGT coefficients of an image are computed in parallel during the analysis process and can be reconstructed in parallel during the synthesis process. The computational complexity and time of the proposed parallel algorithms are analyzed and compared with those of the existing fastest algorithms for 2-D discrete Gabor transforms. The results indicate that the proposed algorithms are the fastest, which make them attractive for real-time image processing.

  19. A reconstruction algorithm for helical CT imaging on PI-planes.

    PubMed

    Liang, Hongzhu; Zhang, Cishen; Yan, Ming

    2006-01-01

    In this paper, a Feldkamp type approximate reconstruction algorithm is presented for helical cone-beam Computed Tomography. To effectively suppress artifacts due to large cone angle scanning, it is proposed to reconstruct the object point-wise on unique customized tilted PI-planes which are close to the data-collecting helices of the corresponding points. Such a reconstruction scheme can considerably suppress the artifacts in cone-angle scanning. Computer simulations show that the proposed algorithm can provide improved imaging performance compared with the existing approximate cone-beam reconstruction algorithms.

  20. An efficient parallel algorithm for matrix-vector multiplication

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hendrickson, B.; Leland, R.; Plimpton, S.

    The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
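
    As a much simpler point of reference, a one-dimensional row-block decomposition (rather than the hypercube-oriented two-dimensional scheme of the paper) can be sketched as follows; the process-pool layout is an assumption of this illustration.

    ```python
    import numpy as np
    from multiprocessing import Pool

    def matvec_block(args):
        rows, x = args
        return rows @ x

    def parallel_matvec(A, x, n_workers=4):
        """Row-block parallel matrix-vector product."""
        blocks = np.array_split(A, n_workers, axis=0)
        with Pool(n_workers) as pool:
            parts = pool.map(matvec_block, [(block, x) for block in blocks])
        return np.concatenate(parts)

    if __name__ == "__main__":
        A = np.random.rand(1000, 1000)
        x = np.random.rand(1000)
        assert np.allclose(parallel_matvec(A, x), A @ x)
    ```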

  1. Experimental realization of Shor's quantum factoring algorithm using nuclear magnetic resonance.

    PubMed

    Vandersypen, L M; Steffen, M; Breyta, G; Yannoni, C S; Sherwood, M H; Chuang, I L

    The number of steps any classical computer requires in order to find the prime factors of an l-digit integer N increases exponentially with l, at least using algorithms known at present. Factoring large integers is therefore conjectured to be intractable classically, an observation underlying the security of widely used cryptographic codes. Quantum computers, however, could factor integers in only polynomial time, using Shor's quantum factoring algorithm. Although important for the study of quantum computers, experimental demonstration of this algorithm has proved elusive. Here we report an implementation of the simplest instance of Shor's algorithm: factorization of N = 15 (whose prime factors are 3 and 5). We use seven spin-1/2 nuclei in a molecule as quantum bits, which can be manipulated with room temperature liquid-state nuclear magnetic resonance techniques. This method of using nuclei to store quantum information is in principle scalable to systems containing many quantum bits, but such scalability is not implied by the present work. The significance of our work lies in the demonstration of experimental and theoretical techniques for precise control and modelling of complex quantum computers. In particular, we present a simple, parameter-free but predictive model of decoherence effects in our system.
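
    The classical post-processing step of Shor's algorithm is easy to state in code: once the quantum part returns the order r of a modulo N, candidate factors follow from greatest common divisors. The sketch below reproduces the N = 15, a = 7 instance, for which r = 4.

    ```python
    from math import gcd

    def factors_from_order(a, r, n):
        """Classical post-processing: recover factors of n from the order r of a mod n,
        valid when r is even and a**(r//2) is not congruent to -1 mod n."""
        if r % 2 != 0 or pow(a, r // 2, n) == n - 1:
            return None
        x = pow(a, r // 2, n)
        return gcd(x - 1, n), gcd(x + 1, n)

    print(factors_from_order(7, 4, 15))   # (3, 5)
    ```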

  2. How Crossover Speeds up Building Block Assembly in Genetic Algorithms.

    PubMed

    Sudholt, Dirk

    2017-01-01

    We reinvestigate a fundamental question: How effective is crossover in genetic algorithms in combining building blocks of good solutions? Although this has been discussed controversially for decades, we are still lacking a rigorous and intuitive answer. We provide such answers for royal road functions and OneMax, where every bit is a building block. For the latter, we show that using crossover makes every (μ+λ) genetic algorithm at least twice as fast as the fastest evolutionary algorithm using only standard bit mutation, up to small-order terms and for moderate μ and λ. Crossover is beneficial because it can capitalize on mutations that have both beneficial and disruptive effects on building blocks: crossover is able to repair the disruptive effects of mutation in later generations. Compared to mutation-based evolutionary algorithms, this makes multibit mutations more useful. Introducing crossover changes the optimal mutation rate on OneMax from [Formula: see text] to [Formula: see text]. This holds both for uniform crossover and k-point crossover. Experiments and statistical tests confirm that our findings apply to a broad class of building block functions.
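
    To make the setting concrete, the sketch below shows a toy (μ+λ) genetic algorithm on OneMax in Python, with uniform crossover and standard bit mutation at rate 1/n. It is only an illustration of the algorithmic setup discussed above; the parameter choices are arbitrary and nothing here reproduces the paper's runtime analysis.

        import random

        def onemax(x):                       # fitness: number of ones; every bit is a building block
            return sum(x)

        def uniform_crossover(a, b):
            return [random.choice(pair) for pair in zip(a, b)]

        def mutate(x, rate):
            return [bit ^ (random.random() < rate) for bit in x]

        def mu_plus_lambda_ga(n=100, mu=10, lam=10, pc=0.9, budget=50000):
            rate = 1.0 / n                   # standard bit mutation rate
            pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(mu)]
            evals = mu
            while evals < budget and max(map(onemax, pop)) < n:
                offspring = []
                for _ in range(lam):
                    if random.random() < pc:
                        child = uniform_crossover(*random.sample(pop, 2))
                    else:
                        child = random.choice(pop)[:]
                    offspring.append(mutate(child, rate))
                evals += lam
                pop = sorted(pop + offspring, key=onemax, reverse=True)[:mu]   # (mu+lambda) selection
            return evals, max(map(onemax, pop))

        print(mu_plus_lambda_ga())           # (evaluations used, best fitness reached)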

  3. Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu; University of California Santa Barbara, Santa Barbara, CA, 93106; Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu

    The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.

  4. Comparison of statistical sampling methods with ScannerBit, the GAMBIT scanning module

    NASA Astrophysics Data System (ADS)

    Martinez, Gregory D.; McKay, James; Farmer, Ben; Scott, Pat; Roebber, Elinore; Putze, Antje; Conrad, Jan

    2017-11-01

    We introduce ScannerBit, the statistics and sampling module of the public, open-source global fitting framework GAMBIT. ScannerBit provides a standardised interface to different sampling algorithms, enabling the use and comparison of multiple computational methods for inferring profile likelihoods, Bayesian posteriors, and other statistical quantities. The current version offers random, grid, raster, nested sampling, differential evolution, Markov Chain Monte Carlo (MCMC) and ensemble Monte Carlo samplers. We also announce the release of a new standalone differential evolution sampler, Diver, and describe its design, usage and interface to ScannerBit. We subject Diver and three other samplers (the nested sampler MultiNest, the MCMC GreAT, and the native ScannerBit implementation of the ensemble Monte Carlo algorithm T-Walk) to a battery of statistical tests. For this we use a realistic physical likelihood function, based on the scalar singlet model of dark matter. We examine the performance of each sampler as a function of its adjustable settings, and the dimensionality of the sampling problem. We evaluate performance on four metrics: optimality of the best fit found, completeness in exploring the best-fit region, number of likelihood evaluations, and total runtime. For Bayesian posterior estimation at high resolution, T-Walk provides the most accurate and timely mapping of the full parameter space. For profile likelihood analysis in less than about ten dimensions, we find that Diver and MultiNest score similarly in terms of best fit and speed, outperforming GreAT and T-Walk; in ten or more dimensions, Diver substantially outperforms the other three samplers on all metrics.
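
    Since Diver is a differential evolution sampler, a generic, self-contained DE/rand/1/bin loop is sketched below in Python to show the kind of algorithm being compared; it maximises a toy log-likelihood and is not Diver's actual interface, nor the GAMBIT/ScannerBit API, and the control parameters are illustrative.

        import numpy as np

        def differential_evolution(loglike, bounds, pop_size=30, F=0.7, CR=0.9, gens=200, seed=0):
            """Simplified DE/rand/1/bin maximiser of a log-likelihood over box bounds."""
            rng = np.random.default_rng(seed)
            lo, hi = np.array(bounds, dtype=float).T
            dim = len(bounds)
            pop = rng.uniform(lo, hi, size=(pop_size, dim))
            fit = np.array([loglike(x) for x in pop])
            for _ in range(gens):
                for i in range(pop_size):
                    others = [j for j in range(pop_size) if j != i]
                    a, b, c = pop[rng.choice(others, 3, replace=False)]
                    mutant = a + F * (b - c)                                 # rand/1 mutation
                    trial = np.where(rng.random(dim) < CR, mutant, pop[i])   # binomial crossover
                    trial = np.clip(trial, lo, hi)
                    f = loglike(trial)
                    if f > fit[i]:                                           # greedy one-to-one selection
                        pop[i], fit[i] = trial, f
            best = np.argmax(fit)
            return pop[best], fit[best]

        # toy likelihood: an isotropic Gaussian centred at (1, -2)
        best_x, best_f = differential_evolution(
            lambda x: -np.sum((x - np.array([1.0, -2.0])) ** 2),
            bounds=[(-5, 5), (-5, 5)])
        print(best_x, best_f)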

  5. Optimal Design of Passive Power Filters Based on Pseudo-parallel Genetic Algorithm

    NASA Astrophysics Data System (ADS)

    Li, Pei; Li, Hongbo; Gao, Nannan; Niu, Lin; Guo, Liangfeng; Pei, Ying; Zhang, Yanyan; Xu, Minmin; Chen, Kerui

    2017-05-01

    The economic cost together with filter efficiency is taken as the target for optimizing the parameters of the passive filter. Furthermore, a method combining a pseudo-parallel genetic algorithm with an adaptive genetic algorithm is adopted in this paper. In the early stages, the pseudo-parallel genetic algorithm is used to increase population diversity, and in the late stages the adaptive genetic algorithm is used to reduce the workload. At the same time, the migration rate of the pseudo-parallel genetic algorithm is improved so that it adapts to the population diversity. Simulation results show that the filter designed by the proposed method has a better filtering effect at a lower economic cost and can be used in engineering practice.

  6. Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

    NASA Technical Reports Server (NTRS)

    Povitsky, A.

    1998-01-01

    In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines. This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty about two times over the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
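
    The kernel being pipelined is the classical Thomas algorithm for tridiagonal systems; a plain serial NumPy version is sketched below for reference. The reformulated pipelined scheduling and the static processor schedule of the paper are not reproduced here, and the storage convention (a[0] and c[-1] unused) is an assumption of the sketch.

        import numpy as np

        def thomas(a, b, c, d):
            """Serial Thomas solve: a, b, c are the sub-, main- and super-diagonals
            (all length n, with a[0] and c[-1] unused), d is the right-hand side."""
            n = len(b)
            cp, dp = np.zeros(n), np.zeros(n)
            cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
            for i in range(1, n):                          # forward elimination sweep
                m = b[i] - a[i] * cp[i - 1]
                cp[i] = c[i] / m if i < n - 1 else 0.0
                dp[i] = (d[i] - a[i] * dp[i - 1]) / m
            x = np.zeros(n)
            x[-1] = dp[-1]
            for i in range(n - 2, -1, -1):                 # back substitution sweep
                x[i] = dp[i] - cp[i] * x[i + 1]
            return x

        # quick check against a dense solve
        n = 6
        a = np.r_[0.0, np.random.rand(n - 1)]
        c = np.r_[np.random.rand(n - 1), 0.0]
        b = 2.0 + np.random.rand(n)                        # diagonally dominant main diagonal
        d = np.random.rand(n)
        A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
        assert np.allclose(thomas(a, b, c, d), np.linalg.solve(A, d))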

  7. DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique

    PubMed Central

    Li, Pinghao; Wang, Shuang; Kim, Jihoon; Xiong, Hongkai; Ohno-Machado, Lucila; Jiang, Xiaoqian

    2013-01-01

    Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA sequencing outstrips the rate of increase in disk storage capacity, the storage and transfer of large genome data are becoming important concerns for biomedical researchers. We propose a two-pass lossless genome compression algorithm, which highlights the synthesis of complementary contextual models, to improve the compression performance. The proposed framework can handle genome compression with and without reference sequences, and demonstrated performance advantages over the best existing algorithms. The method for reference-free compression led to bit rates of 1.720 and 1.838 bits per base for bacteria and yeast, which were approximately 3.7% and 2.6% better than the state-of-the-art algorithms. Regarding performance with reference, we tested on the first Korean personal genome sequence data set, and our proposed method demonstrated a 189-fold compression rate, reducing the raw file size from 2986.8 MB to 15.8 MB at a decompression cost comparable with existing algorithms. DNAcompact is freely available at https://sourceforge.net/projects/dnacompact/ for research purposes. PMID:24282536

  8. A joint encryption/watermarking system for verifying the reliability of medical images.

    PubMed

    Bouslimi, Dalel; Coatrieux, Gouenou; Cozic, Michel; Roux, Christian

    2012-09-01

    In this paper, we propose a joint encryption/watermarking system for the purpose of protecting medical images. This system is based on an approach which combines a substitutive watermarking algorithm, the quantization index modulation, with an encryption algorithm: a stream cipher algorithm (e.g., the RC4) or a block cipher algorithm (e.g., the AES in cipher block chaining (CBC) mode of operation). Our objective is to give access to the outcomes of image integrity and origin verification even though the image is stored encrypted. If watermarking and encryption are conducted jointly at the protection stage, watermark extraction and decryption can be applied independently. The security analysis of our scheme and experimental results achieved on 8-bit depth ultrasound images as well as on 16-bit encoded positron emission tomography images demonstrate the capability of our system to securely make available security attributes in both spatial and encrypted domains while minimizing image distortion. Furthermore, by making use of the AES block cipher in CBC mode, the proposed system is compliant with or transparent to the DICOM standard.

  9. A Degree Distribution Optimization Algorithm for Image Transmission

    NASA Astrophysics Data System (ADS)

    Jiang, Wei; Yang, Junjie

    2016-09-01

    Luby Transform (LT) code is the first practical implementation of digital fountain codes. The coding behavior of LT code is mainly decided by the degree distribution, which determines the relationship between source data and codewords. Two degree distributions were suggested by Luby. They work well in typical situations but not optimally in the case of a finite number of encoding symbols. In this work, a degree distribution optimization algorithm is proposed to explore the potential of LT code. First, a selection scheme of sparse degrees for LT codes is introduced; then, the probability distribution is optimized over the selected degrees. In image transmission, the bit stream is sensitive to channel noise, and even a single bit error may cause the loss of synchronization between the encoder and the decoder. Therefore the proposed algorithm is designed for the image transmission situation. Moreover, optimal class partition is studied for image transmission with unequal error protection. The experimental results are quite promising. Compared with LT code with the robust soliton distribution, the proposed algorithm noticeably improves the final quality of recovered images at the same overhead.
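
    For context, the robust soliton distribution used as the baseline above can be computed directly; a small Python sketch follows. The parameters c and delta are illustrative defaults, and the paper's optimized sparse-degree selection and class partition are not reproduced here.

        import math, random

        def robust_soliton(k, c=0.1, delta=0.5):
            """Return the robust soliton degree distribution over degrees 1..k (index 0 unused)."""
            R = c * math.log(k / delta) * math.sqrt(k)
            ideal = [0.0, 1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]
            tau = [0.0] * (k + 1)
            for d in range(1, k + 1):
                if d < int(k / R):
                    tau[d] = R / (d * k)
                elif d == int(k / R):
                    tau[d] = R * math.log(R / delta) / k
            Z = sum(ideal) + sum(tau)                      # normalisation constant
            return [(ideal[d] + tau[d]) / Z for d in range(k + 1)]

        def sample_degree(dist):
            r, acc = random.random(), 0.0
            for d, p in enumerate(dist):
                acc += p
                if r <= acc:
                    return d
            return len(dist) - 1

        dist = robust_soliton(1000)
        print(sum(dist), sample_degree(dist))              # ~1.0 and one sampled degree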

  10. Singular value decomposition utilizing parallel algorithms on graphical processors

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kotas, Charlotte W; Barhen, Jacob

    2011-01-01

    One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, Cx = (1/K) Σ_k X(k)X^H(k), where X(k) denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements for a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix A involves determining U, Σ, and V such that A = U Σ V^H, where U and V are orthonormal and Σ is a positive, real, diagonal matrix containing the singular values of A. U and V are the eigenvectors of AA^H and A^H A, respectively, while the singular values are the square roots of the eigenvalues of AA^H.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphical processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock speed). The first algorithm is a two-step algorithm which bidiagonalizes the matrix using Householder transformations, and then diagonalizes the intermediate bidiagonal matrix through implicit QR shifts. This is similar to that implemented for real matrices by Lahabar and Narayanan ("Singular Value Decomposition on GPU using CUDA", IEEE International Parallel Distributed Processing Symposium 2009). The implementation is done in a hybrid manner, with the bidiagonalization stage done using the GPU while the diagonalization stage is done using the CPU, with the GPU used to update the U and V matrices. The second algorithm is based on a one-sided Jacobi scheme utilizing a sequence of pair-wise column orthogonalizations such that A is replaced by AV until the resulting matrix is sufficiently orthogonal (that is, equal to U Σ). V is obtained from the sequence of orthogonalizations, while Σ can be found from the square roots of the diagonal elements of A^H A and, once Σ is known, U can be found by column scaling the resulting matrix. These implementations utilize CUDA Fortran and NVIDIA's CUBLAS library. The primary goal of this study is to quantify the comparative performance of these two techniques against themselves and other standard implementations (for example, MATLAB). Considering that there is significant overhead associated with transferring data to the GPU and with synchronization between the GPU and the host CPU, it is also important to understand when it is worthwhile to use the GPU in terms of the matrix size and number of concurrent SVDs to be calculated.
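
    As a plain CPU reference for the second approach, the Hestenes one-sided Jacobi SVD can be sketched in a few lines of Python: columns are repeatedly orthogonalised by plane rotations, V accumulates the rotations, the singular values are the final column norms, and U follows by column scaling. This is only an illustration of the scheme described above, not the CUDA Fortran/CUBLAS implementation evaluated in the study.

        import numpy as np

        def one_sided_jacobi_svd(A, tol=1e-10, max_sweeps=30):
            """One-sided Jacobi SVD of a real matrix: returns U, s, V with A ~= (U * s) @ V.T."""
            U = A.astype(float).copy()
            n = U.shape[1]
            V = np.eye(n)
            for _ in range(max_sweeps):
                off = 0.0
                for p in range(n - 1):
                    for q in range(p + 1, n):
                        alpha = U[:, p] @ U[:, p]
                        beta = U[:, q] @ U[:, q]
                        gamma = U[:, p] @ U[:, q]
                        off = max(off, abs(gamma) / np.sqrt(alpha * beta))
                        if gamma == 0.0:
                            continue
                        zeta = (beta - alpha) / (2.0 * gamma)
                        if zeta >= 0.0:                    # smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0
                            t = 1.0 / (zeta + np.sqrt(1.0 + zeta * zeta))
                        else:
                            t = 1.0 / (zeta - np.sqrt(1.0 + zeta * zeta))
                        cos = 1.0 / np.sqrt(1.0 + t * t)
                        sin = cos * t
                        Up, Uq = U[:, p].copy(), U[:, q].copy()
                        U[:, p], U[:, q] = cos * Up - sin * Uq, sin * Up + cos * Uq
                        Vp, Vq = V[:, p].copy(), V[:, q].copy()
                        V[:, p], V[:, q] = cos * Vp - sin * Vq, sin * Vp + cos * Vq
                if off < tol:                              # all column pairs sufficiently orthogonal
                    break
            s = np.linalg.norm(U, axis=0)                  # singular values (unsorted)
            return U / s, s, V

        A = np.random.rand(6, 4)
        U, s, V = one_sided_jacobi_svd(A)
        print(np.allclose((U * s) @ V.T, A))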

  11. A multiresolution halftoning algorithm for progressive display

    NASA Astrophysics Data System (ADS)

    Mukherjee, Mithun; Sharma, Gaurav

    2005-01-01

    We describe and implement an algorithmic framework for memory efficient, 'on-the-fly' halftoning in a progressive transmission environment. Instead of a conventional approach which repeatedly recalls the continuous tone image from memory and subsequently halftones it for display, the proposed method achieves significant memory efficiency by storing only the halftoned image and updating it in response to additional information received through progressive transmission. Thus the method requires only a single frame-buffer of bits for storage of the displayed binary image and no additional storage is required for the contone data. The additional image data received through progressive transmission is accommodated through in-place updates of the buffer. The method is thus particularly advantageous for high resolution bi-level displays where it can result in significant savings in memory. The proposed framework is implemented using a suitable multi-resolution, multi-level modification of error diffusion that is motivated by the presence of a single binary frame-buffer. Aggregates of individual display bits constitute the multiple output levels at a given resolution. This creates a natural progression of increasing resolution with decreasing bit-depth.
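
    For orientation, the classical single-resolution error-diffusion step that the framework modifies is sketched below in Python (standard Floyd-Steinberg weights). The multiresolution, multi-level, in-place frame-buffer update scheme of the paper is not reproduced here.

        import numpy as np

        def floyd_steinberg(img):
            """Binary halftone of a grayscale image in [0, 1] by error diffusion."""
            f = img.astype(float).copy()
            h, w = f.shape
            out = np.zeros_like(f)
            for y in range(h):
                for x in range(w):
                    old = f[y, x]
                    new = 1.0 if old >= 0.5 else 0.0
                    out[y, x] = new
                    err = old - new                        # push quantisation error to neighbours
                    if x + 1 < w:
                        f[y, x + 1] += err * 7 / 16
                    if y + 1 < h and x > 0:
                        f[y + 1, x - 1] += err * 3 / 16
                    if y + 1 < h:
                        f[y + 1, x] += err * 5 / 16
                    if y + 1 < h and x + 1 < w:
                        f[y + 1, x + 1] += err * 1 / 16
            return out

        img = np.random.rand(64, 64)
        half = floyd_steinberg(img)
        print(abs(half.mean() - img.mean()))               # mean tone is approximately preserved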

  12. An evaluation of the effect of JPEG, JPEG2000, and H.264/AVC on CQR codes decoding process

    NASA Astrophysics Data System (ADS)

    Vizcarra Melgar, Max E.; Farias, Mylène C. Q.; Zaghetto, Alexandre

    2015-02-01

    This paper presents a binary matrix code based on the QR Code (Quick Response Code), denoted as CQR Code (Colored Quick Response Code), and evaluates the effect of JPEG, JPEG2000 and H.264/AVC compression on the decoding process. The proposed CQR Code has three additional colors (red, green and blue), which enables twice as much storage capacity when compared to the traditional black-and-white QR Code. Using the Reed-Solomon error-correcting code, the CQR Code model has a theoretical correction capability of 38.41%. The goal of this paper is to evaluate the effect that degradations inserted by common image compression algorithms have on the decoding process. Results show that a successful decoding process can be achieved for compression rates up to 0.3877 bits/pixel, 0.1093 bits/pixel and 0.3808 bits/pixel for the JPEG, JPEG2000 and H.264/AVC formats, respectively. The algorithm that presents the best performance is H.264/AVC, followed by JPEG2000 and JPEG.

  13. Utilizing a language model to improve online dynamic data collection in P300 spellers.

    PubMed

    Mainsah, Boyla O; Colwell, Kenneth A; Collins, Leslie M; Throckmorton, Chandra S

    2014-07-01

    P300 spellers provide a means of communication for individuals with severe physical limitations, especially those with locked-in syndrome, such as amyotrophic lateral sclerosis. However, P300 speller use is still limited by relatively low communication rates due to the multiple data measurements that are required to improve the signal-to-noise ratio of event-related potentials for increased accuracy. Therefore, the amount of data collection has competing effects on accuracy and spelling speed. Adaptively varying the amount of data collection prior to character selection has been shown to improve spelling accuracy and speed. The goal of this study was to optimize a previously developed dynamic stopping algorithm that uses a Bayesian approach to control data collection by incorporating a priori knowledge via a language model. Participants (n = 17) completed online spelling tasks using the dynamic stopping algorithm, with and without a language model. The addition of the language model improved participant performance from a mean theoretical bit rate of 46.12 bits/min at 88.89% accuracy to 54.42 bits/min at 90.36% accuracy.
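
    Theoretical bit rates of this kind are conventionally computed with the Wolpaw formula; a short Python helper is shown below under that assumption (the study may use a different definition, and the 36-choice matrix size is illustrative).

        import math

        def wolpaw_bits_per_selection(n, p):
            """Information transferred per selection for an n-choice speller with accuracy p."""
            if p >= 1.0:
                return math.log2(n)
            return math.log2(n) + p * math.log2(p) + (1 - p) * math.log2((1 - p) / (n - 1))

        bits = wolpaw_bits_per_selection(36, 0.9036)       # e.g. a 6 x 6 character matrix
        print(bits)                                        # multiply by selections/min for bits/min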

  14. Reforming Earth science education in developing countries

    NASA Astrophysics Data System (ADS)

    Aswathanarayana, U.

    Improving the employability of Earth science graduates by reforming Earth science instruction is a matter of concern to universities worldwide. It should, however, be self-evident that the developing countries cannot follow the same blueprint for change as the industrialized countries due to constraints of affordability and relevance. Peanuts are every bit as nutritious as almonds; if one with limited means has to choose between a fistful of peanuts and just one almond, it is wise to choose the peanuts. A paradigm proposed here would allow institutions in developing countries to impart good quality relevant Earth science instruction that would be affordable and lead to employment.

  15. A software framework for pipelined arithmetic algorithms in field programmable gate arrays

    NASA Astrophysics Data System (ADS)

    Kim, J. B.; Won, E.

    2018-03-01

    Pipelined algorithms implemented in field programmable gate arrays are extensively used for hardware triggers in the modern experimental high energy physics field and the complexity of such algorithms increases rapidly. For development of such hardware triggers, algorithms are developed in C++, ported to hardware description language for synthesizing firmware, and then ported back to C++ for simulating the firmware response down to the single bit level. We present a C++ software framework which automatically simulates and generates hardware description language code for pipelined arithmetic algorithms.

  16. On Multiple AER Handshaking Channels Over High-Speed Bit-Serial Bidirectional LVDS Links With Flow-Control and Clock-Correction on Commercial FPGAs for Scalable Neuromorphic Systems.

    PubMed

    Yousefzadeh, Amirreza; Jablonski, Miroslaw; Iakymchuk, Taras; Linares-Barranco, Alejandro; Rosado, Alfredo; Plana, Luis A; Temple, Steve; Serrano-Gotarredona, Teresa; Furber, Steve B; Linares-Barranco, Bernabe

    2017-10-01

    Address event representation (AER) is a widely employed asynchronous technique for interchanging "neural spikes" between different hardware elements in neuromorphic systems. Each neuron or cell in a chip or a system is assigned an address (or ID), which is typically communicated through a high-speed digital bus, thus time-multiplexing a high number of neural connections. Conventional AER links use parallel physical wires together with a pair of handshaking signals (request and acknowledge). In this paper, we present a fully serial implementation using bidirectional SATA connectors with a pair of low-voltage differential signaling (LVDS) wires for each direction. The proposed implementation can multiplex a number of conventional parallel AER links for each physical LVDS connection. It uses flow control, clock correction, and byte alignment techniques to transmit 32-bit address events reliably over multiplexed serial connections. The setup has been tested using commercial Spartan6 FPGAs attaining a maximum event transmission speed of 75 Meps (Mega events per second) for 32-bit events at a line rate of 3.0 Gbps. Full HDL codes (vhdl/verilog) and example demonstration codes for the SpiNNaker platform will be made available.

  17. UWGSP4: an imaging and graphics superworkstation and its medical applications

    NASA Astrophysics Data System (ADS)

    Jong, Jing-Ming; Park, Hyun Wook; Eo, Kilsu; Kim, Min-Hwan; Zhang, Peng; Kim, Yongmin

    1992-05-01

    UWGSP4 is configured with a parallel architecture for image processing and a pipelined architecture for computer graphics. The system's peak performance is 1,280 MFLOPS for image processing and over 200,000 Gouraud shaded 3-D polygons per second for graphics. The simulated sustained performance is about 50% of the peak performance in general image processing. Most of the 2-D image processing functions are efficiently vectorized and parallelized in UWGSP4. A performance of 770 MFLOPS in convolution and 440 MFLOPS in FFT is achieved. The real-time cine display, up to 32 frames of 1280 X 1024 pixels per second, is supported. In 3-D imaging, the update rate for the surface rendering is 10 frames of 20,000 polygons per second; the update rate for the volume rendering is 6 frames of 128 X 128 X 128 voxels per second. The system provides 1280 X 1024 X 32-bit double frame buffers and one 1280 X 1024 X 8-bit overlay buffer for supporting realistic animation, 24-bit true color, and text annotation. A 1280 X 1024-pixel, 66-Hz noninterlaced display screen with 1:1 aspect ratio can be windowed into the frame buffer for the display of any portion of the processed image or graphics.

  18. Parallel optimization algorithms and their implementation in VLSI design

    NASA Technical Reports Server (NTRS)

    Lee, G.; Feeley, J. J.

    1991-01-01

    Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.

  19. A parallel time integrator for noisy nonlinear oscillatory systems

    NASA Astrophysics Data System (ADS)

    Subber, Waad; Sarkar, Abhijit

    2018-06-01

    In this paper, we adapt a parallel time integration scheme to track the trajectories of noisy non-linear dynamical systems. Specifically, we formulate a parallel algorithm to generate the sample path of a nonlinear oscillator defined by stochastic differential equations (SDEs) using the so-called parareal method for ordinary differential equations (ODEs). The presence of the Wiener process in SDEs causes difficulties in the direct application of any numerical integration technique for ODEs, including the parareal algorithm. The parallel implementation of the algorithm involves two SDE solvers, namely a fine-level scheme to integrate the system in parallel and a coarse-level scheme to generate and correct the required initial conditions to start the fine-level integrators. For the numerical illustration, a randomly excited Duffing oscillator is investigated in order to study the performance of the stochastic parallel algorithm with respect to a range of system parameters. The distributed implementation of the algorithm exploits the Message Passing Interface (MPI).
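
    A minimal deterministic parareal sketch in Python is given below to make the two-level structure concrete: a cheap coarse propagator sweeps sequentially, while the fine propagations over each time window (which could run in parallel) feed the iterative correction. The stochastic (Wiener-process) treatment and the MPI implementation of the paper are not reproduced; the propagators and the test ODE are illustrative.

        import numpy as np

        def coarse(f, y, t0, t1):
            """Coarse propagator: one explicit Euler step over [t0, t1]."""
            return y + (t1 - t0) * f(t0, y)

        def fine(f, y, t0, t1, m=50):
            """Fine propagator: m explicit Euler sub-steps over [t0, t1]."""
            h = (t1 - t0) / m
            for j in range(m):
                y = y + h * f(t0 + j * h, y)
            return y

        def parareal(f, y0, t_grid, iters=5):
            n = len(t_grid) - 1
            y = np.full(n + 1, float(y0))
            for i in range(n):                             # initial sequential coarse sweep
                y[i + 1] = coarse(f, y[i], t_grid[i], t_grid[i + 1])
            for _ in range(iters):
                F = [fine(f, y[i], t_grid[i], t_grid[i + 1]) for i in range(n)]    # parallelisable
                G = [coarse(f, y[i], t_grid[i], t_grid[i + 1]) for i in range(n)]
                y_new = y.copy()
                for i in range(n):                         # sequential correction sweep
                    y_new[i + 1] = coarse(f, y_new[i], t_grid[i], t_grid[i + 1]) + F[i] - G[i]
                y = y_new
            return y

        f = lambda t, y: -y                                # toy linear decay ODE
        t = np.linspace(0.0, 2.0, 9)
        print(parareal(f, 1.0, t))                         # approaches exp(-t) at the grid points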

  20. A parallel variable metric optimization algorithm

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.

    1973-01-01

    An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.

  1. Reconfigurable data path processor

    NASA Technical Reports Server (NTRS)

    Donohoe, Gregory (Inventor)

    2005-01-01

    A reconfigurable data path processor comprises a plurality of independent processing elements, each advantageously having an identical architecture. Each processing element comprises a plurality of data processing means for generating a potential output. Each processor is also capable of passing an input through as a potential output with little or no processing. Each processing element comprises a conditional multiplexer having a first conditional multiplexer input, a second conditional multiplexer input and a conditional multiplexer output. A first potential output value is transmitted to the first conditional multiplexer input, and a second potential output value is transmitted to the second conditional multiplexer input. The conditional multiplexer couples either the first conditional multiplexer input or the second conditional multiplexer input to the conditional multiplexer output, according to an output control command. The output control command is generated by processing a set of arithmetic status bits through a logical mask. The conditional multiplexer output is coupled to a first processing element output. A first set of arithmetic status bits is generated according to the processing of the first processable value; a second set of arithmetic status bits may be generated from a second processing operation. An arithmetic-status-bit multiplexer selects the desired set of arithmetic status bits from among the first and second sets. The conditional multiplexer evaluates the selected arithmetic status bits according to the logical mask, which defines an algorithm for evaluating them.
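
    A behavioural Python sketch of that selection step follows. It is not the patented hardware, and the "assert the control command when any masked status bit is set" rule is an assumption; the patent leaves the mask-defined evaluation algorithm open.

        def conditional_mux(result_a, result_b, status_bits, mask):
            """Select between two candidate outputs by evaluating status bits through a mask.

            status_bits and mask are small integers treated as bit fields; the output
            control command is taken to be 'any masked bit set' (an assumed rule)."""
            control = (status_bits & mask) != 0
            return result_b if control else result_a

        # e.g. select the saturated result when the overflow flag (bit 2) is raised
        OVERFLOW = 0b0100
        print(conditional_mux(42, 127, status_bits=0b0100, mask=OVERFLOW))   # -> 127
        print(conditional_mux(42, 127, status_bits=0b0001, mask=OVERFLOW))   # -> 42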

  2. Development of a Tool Condition Monitoring System for Impregnated Diamond Bits in Rock Drilling Applications

    NASA Astrophysics Data System (ADS)

    Perez, Santiago; Karakus, Murat; Pellet, Frederic

    2017-05-01

    The great success and widespread use of impregnated diamond (ID) bits are due to their self-sharpening mechanism, which consists of a constant renewal of diamonds acting at the cutting face as the bit wears out. It is therefore important to keep this mechanism acting throughout the lifespan of the bit. Nonetheless, such a mechanism can be altered by the blunting of the bit, which ultimately leads to less than optimal drilling performance. For this reason, this paper aims at investigating the applicability of artificial intelligence-based techniques to monitor the tool condition of ID bits, i.e. sharp or blunt, under laboratory conditions. Accordingly, topologically invariant tests are carried out with sharp and blunt bit conditions while recording acoustic emissions (AE) and measuring-while-drilling variables. The combined output of the acoustic emission root-mean-square value (AErms), depth of cut (d), torque (tob) and weight-on-bit (wob) is then utilized to create two approaches to predict the wear state of the bits. One approach is based on the combination of the aforementioned variables and another on the specific energy of drilling. The two approaches are assessed for classification performance with various pattern recognition algorithms, such as simple trees, support vector machines, k-nearest neighbour, boosted trees and artificial neural networks. In general, acceptable pattern recognition rates were obtained, although the subset composed of AErms and tob excels due to its high classification rates and fewer input variables.

  3. Study on a low complexity adaptive modulation algorithm in OFDM-ROF system with sub-carrier grouping technology

    NASA Astrophysics Data System (ADS)

    Liu, Chong-xin; Liu, Bo; Zhang, Li-jia; Xin, Xiang-jun; Tian, Qing-hua; Tian, Feng; Wang, Yong-jun; Rao, Lan; Mao, Yaya; Li, Deng-ao

    2018-01-01

    During the last decade, the orthogonal frequency division multiplexing radio-over-fiber (OFDM-ROF) system with adaptive modulation technology has been of great interest due to its capability of raising the spectral efficiency dramatically, reducing the effects of the fiber link or wireless channel, and improving the communication quality. In this study, based on a theoretical analysis of nonlinear distortion and frequency-selective fading on the transmitted signal, a low-complexity adaptive modulation algorithm is proposed in combination with sub-carrier grouping technology. This algorithm achieves optimal system performance by calculating the average combined signal-to-noise ratio of each group and dynamically adjusting the modulation format according to the preset threshold and the user's requirements. At the same time, this algorithm takes the sub-carrier group as the smallest unit in the initial bit allocation and the subsequent bit adjustment, so its complexity is only 1/M (where M is the number of sub-carriers in each group) of the Fischer algorithm, which is much smaller than that of many classic adaptive modulation algorithms, such as the Hughes-Hartogs and Chow algorithms, and is in line with the development direction of green, high-speed communication. Simulation results show that the performance of the OFDM-ROF system with the improved algorithm is much better than that without adaptive modulation, and the BER of the former is one to two orders of magnitude lower than that of the latter as the SNR increases. This low-complexity adaptive modulation algorithm is thus extremely useful for the OFDM-ROF system.
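
    A minimal Python sketch of the per-group decision is shown below: the SNR is averaged over each sub-carrier group and compared against preset thresholds to pick a modulation order for the whole group. The threshold values, the group size and the threshold-to-bits mapping are illustrative, not those of the paper.

        import numpy as np

        def allocate_bits(snr_db, group_size=8, thresholds=((22, 6), (16, 4), (10, 2))):
            """Assign a modulation order (bits/symbol) per sub-carrier group.

            snr_db: per-sub-carrier SNR estimates (length must be a multiple of group_size);
            thresholds: (minimum average SNR in dB, bits/symbol) pairs, highest order first."""
            snr = np.asarray(snr_db, dtype=float)
            groups = snr.reshape(-1, group_size)           # sub-carrier grouping
            avg = groups.mean(axis=1)                      # one decision per group
            bits = np.zeros(len(avg), dtype=int)           # 0 means the group is not loaded
            for thr, b in thresholds:
                bits = np.where((bits == 0) & (avg >= thr), b, bits)
            return np.repeat(bits, group_size)             # same order for every carrier in a group

        print(allocate_bits(np.random.uniform(5, 30, 64)))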

  4. Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment.

    PubMed

    Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che

    2014-01-16

    To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks.

  5. Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment

    PubMed Central

    2014-01-01

    Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks. PMID:24428926

  6. Iterative current mode per pixel ADC for 3D SoftChip implementation in CMOS

    NASA Astrophysics Data System (ADS)

    Lachowicz, Stefan W.; Rassau, Alexander; Lee, Seung-Minh; Eshraghian, Kamran; Lee, Mike M.

    2003-04-01

    Mobile multimedia communication has rapidly become a significant area of research and development constantly challenging boundaries on a variety of technological fronts. The processing requirements for the capture, conversion, compression, decompression, enhancement, display, etc. of increasingly higher quality multimedia content place heavy demands even on current ULSI (ultra large scale integration) systems, particularly for mobile applications where area and power are primary considerations. The ADC presented in this paper is designed for a vertically integrated (3D) system comprising two distinct layers bonded together using Indium bump technology. The top layer is a CMOS imaging array containing analogue-to-digital converters, and a buffer memory. The bottom layer takes the form of a configurable array processor (CAP), a highly parallel array of soft programmable processors capable of carrying out complex processing tasks directly on data stored in the top plane. This paper presents an ADC scheme for the image capture plane. The analogue photocurrent or sampled voltage is transferred to the ADC via a column or a column/row bus. In the proposed system, an array of analogue-to-digital converters is distributed, so that a one-bit cell is associated with one sensor. The analogue-to-digital converters are algorithmic current-mode converters. Eight such cells are cascaded to form an 8-bit converter. Additionally, each photo-sensor is equipped with a current memory cell, and multiple conversions are performed with scaled values of the photocurrent for colour processing.

  7. A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.; Markos, A. T.

    1975-01-01

    A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.

  8. Rapid code acquisition algorithms employing PN matched filters

    NASA Technical Reports Server (NTRS)

    Su, Yu T.

    1988-01-01

    The performance of four algorithms using pseudonoise matched filters (PNMFs), for direct-sequence spread-spectrum systems, is analyzed. They are: parallel search with fixed-dwell detector (PL-FDD), parallel search with sequential detector (PL-SD), parallel-serial search with fixed-dwell detector (PS-FDD), and parallel-serial search with sequential detector (PS-SD). The operating characteristic for each detector and the mean acquisition time for each algorithm are derived. All the algorithms are studied in conjunction with the noncoherent integration technique, which enables the system to operate in the presence of data modulation. Several previous proposals using PNMF are seen as special cases of the present algorithms.

  9. Algorithms and programming tools for image processing on the MPP

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.

    1985-01-01

    Topics addressed include: data mapping and rotational algorithms for the Massively Parallel Processor (MPP); Parallel Pascal language; documentation for the Parallel Pascal Development system; and a description of the Parallel Pascal language used on the MPP.

  10. Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce

    PubMed Central

    Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan

    2016-01-01

    A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network’s initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data. PMID:27304987

  11. Applications and accuracy of the parallel diagonal dominant algorithm

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1993-01-01

    The Parallel Diagonal Dominant (PDD) algorithm is a highly efficient, ideally scalable tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is introduced. Then the algorithm is extended to solve periodic tridiagonal systems. A variant, the reduced PDD algorithm, is also proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric, and anti-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the algorithm is a good candidate for the emerging massively parallel machines.

  12. Sum of the Magnitude for Hard Decision Decoding Algorithm Based on Loop Update Detection

    PubMed Central

    Meng, Jiahui; Zhao, Danfeng; Tian, Hai; Zhang, Liang

    2018-01-01

    In order to improve the performance of non-binary low-density parity check codes (LDPC) hard decision decoding algorithm and to reduce the complexity of decoding, a sum of the magnitude for hard decision decoding algorithm based on loop update detection is proposed. This will also ensure the reliability, stability and high transmission rate of 5G mobile communication. The algorithm is based on the hard decision decoding algorithm (HDA) and uses the soft information from the channel to calculate the reliability, while the sum of the variable nodes' (VN) magnitude is excluded for computing the reliability of the parity checks. At the same time, the reliability information of the variable node is considered and the loop update detection algorithm is introduced. The bit corresponding to the error code word is flipped multiple times, before this is searched in the order of most likely error probability to finally find the correct code word. Simulation results show that the performance of one of the improved schemes is better than the weighted symbol flipping (WSF) algorithm under different hexadecimal numbers by about 2.2 dB and 2.35 dB at the bit error rate (BER) of 10^-5 over an additive white Gaussian noise (AWGN) channel, respectively. Furthermore, the average number of decoding iterations is significantly reduced. PMID:29342963

  13. Memory-efficient decoding of LDPC codes

    NASA Technical Reports Server (NTRS)

    Kwok-San Lee, Jason; Thorpe, Jeremy; Hawkins, Jon

    2005-01-01

    We present a low-complexity quantization scheme for the implementation of regular (3,6) LDPC codes. The quantization parameters are optimized to maximize the mutual information between the source and the quantized messages. Using this non-uniform quantized belief propagation algorithm, we have shown in simulation that an optimized 3-bit quantizer operates with 0.2 dB implementation loss relative to a floating-point decoder, and an optimized 4-bit quantizer operates with less than 0.1 dB quantization loss.

  14. A Multi-Week Behavioral Sampling Tag for Sound Effects Studies: Design Trade-Offs and Prototype Evaluation

    DTIC Science & Technology

    2014-09-30

    to establish the performance of algorithms detecting dives, strokes, clicks, respiration and gait changes. We have also found that a combination of...whale click count, total click count, vocal duration, SOC depth, EOC depth); Descent 40 bits (duration, vertical speed, stroke count 0-100 m, stroke count 100-400 m, OBDA, sum sr3); Bottom 26 bits (movement index, OBDA, jerk events, median jerk depth); Ascent

  15. Fast and Flexible Successive-Cancellation List Decoders for Polar Codes

    NASA Astrophysics Data System (ADS)

    Hashemi, Seyyed Ali; Condo, Carlo; Gross, Warren J.

    2017-11-01

    Polar codes have gained a significant amount of attention during the past few years and have been selected as a coding scheme for the next-generation mobile broadband standard. Among decoding schemes, successive-cancellation list (SCL) decoding provides a reasonable trade-off between the error-correction performance and hardware implementation complexity when used to decode polar codes, at the cost of limited throughput. The simplified SCL (SSCL) and its extension SSCL-SPC increase the speed of decoding by removing redundant calculations when encountering particular information and frozen bit patterns (rate one and single parity check codes), while keeping the error-correction performance unaltered. In this paper, we improve SSCL and SSCL-SPC by proving that the list size imposes a specific number of bit estimations required to decode rate one and single parity check codes. Thus, the number of estimations can be limited while guaranteeing exactly the same error-correction performance as if all bits of the code were estimated. We call the new decoding algorithms Fast-SSCL and Fast-SSCL-SPC. Moreover, we show that the number of bit estimations in a practical application can be tuned to achieve desirable speed, while keeping the error-correction performance almost unchanged. Hardware architectures implementing both algorithms are then described and implemented: it is shown that our design can achieve 1.86 Gb/s throughput, higher than the best state-of-the-art decoders.

  16. Parallel Computing Strategies for Irregular Algorithms

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.

  17. Bit-level plane image encryption based on coupled map lattice with time-varying delay

    NASA Astrophysics Data System (ADS)

    Lv, Xiupin; Liao, Xiaofeng; Yang, Bo

    2018-04-01

    Most existing image encryption algorithms have two basic properties, confusion and diffusion, applied in the pixel-level plane and based on various chaotic systems. However, permutation in the pixel-level plane cannot change the statistical characteristics of an image, and many existing color image encryption schemes use the same method to encrypt the R, G and B components, which means that the three color components are processed three times independently. Additionally, the dynamical performance of a single chaotic system degrades greatly under finite precision in computer simulations. In this paper, a novel coupled map lattice with time-varying delay is therefore applied to bit-level plane encryption of color images to address these issues. A spatiotemporal chaotic system, with a much longer period under digitization and excellent cryptographic properties, is recommended. The time-varying delay embedded in the coupled map lattice enhances the dynamical behavior of the system. The bit-level plane image encryption algorithm greatly reduces the statistical characteristics of an image through scrambling. The R, G and B components cross and mix with one another, which reduces the correlation among the three components. Finally, simulations are carried out, and the experimental results illustrate that the proposed image encryption algorithm is highly secure while also demonstrating superior performance.
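
    The bit-level plane representation used by such schemes is simply the per-bit decomposition of each 8-bit channel; a small Python sketch of the split and reassembly step follows. The coupled-map-lattice keystream and the cross-component scrambling themselves are not reproduced here.

        import numpy as np

        def to_bit_planes(img):
            """Split an 8-bit image (or one colour channel) into 8 binary bit planes (index 7 = MSB)."""
            img = np.asarray(img, dtype=np.uint8)
            return [(img >> b) & 1 for b in range(8)]

        def from_bit_planes(planes):
            """Reassemble an 8-bit image from its bit planes."""
            return sum(p.astype(np.uint8) << b for b, p in enumerate(planes)).astype(np.uint8)

        img = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
        assert np.array_equal(from_bit_planes(to_bit_planes(img)), img)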

  18. BitPredator: A Discovery Algorithm for BitTorrent Initial Seeders and Peers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Borges, Raymond; Patton, Robert M; Kettani, Houssain

    2011-01-01

    There is a large amount of illegal content being replicated through peer-to-peer (P2P) networks where BitTorrent is dominant; therefore, a framework to profile and police it is needed. The goal of this work is to explore the behavior of initial seeds and highly active peers to develop techniques to correctly identify them. We intend to establish a new methodology and software framework for profiling BitTorrent peers. This involves three steps: crawling torrent indexers for keywords in recently added torrents using the Really Simple Syndication protocol (RSS), querying torrent trackers for peer list data and verifying Internet Protocol (IP) addresses from peer lists. We verify IPs using active monitoring methods. Peer behavior is evaluated and modeled using bitfield message responses. We also design a tool to profile worldwide file distribution by mapping IP-to-geolocation and linking to WHOIS server information in Google Earth.

  19. Adaptive bit plane quadtree-based block truncation coding for image compression

    NASA Astrophysics Data System (ADS)

    Li, Shenda; Wang, Jin; Zhu, Qing

    2018-04-01

    Block truncation coding (BTC) is a fast image compression technique applied in the spatial domain. Traditional BTC and its variants mainly focus on reducing computational complexity for low-bit-rate compression, at the cost of lower decoded image quality, especially for images with rich texture. To solve this problem, in this paper, a quadtree-based block truncation coding algorithm combined with adaptive bit-plane transmission is proposed. First, the direction of the edge in each block is detected using the Sobel operator. For blocks of minimal size, an adaptive bit plane is utilized to optimize the BTC, which depends on the MSE loss when encoded by absolute moment block truncation coding (AMBTC). Extensive experimental results show that our method gains 0.85 dB PSNR on average compared to other state-of-the-art BTC variants, so it is desirable for real-time image compression applications.
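
    For reference, the AMBTC step applied to a single block works as sketched below in Python: the block is reduced to a bit plane plus two reconstruction levels. The quadtree splitting, edge detection and adaptive bit-plane transmission of the proposed method are not shown.

        import numpy as np

        def ambtc_encode(block):
            """AMBTC of one image block: bitmap plus low/high reconstruction levels."""
            block = block.astype(float)
            mean = block.mean()
            bitmap = block >= mean
            hi = block[bitmap].mean() if bitmap.any() else mean       # level for '1' pixels
            lo = block[~bitmap].mean() if (~bitmap).any() else mean   # level for '0' pixels
            return bitmap, lo, hi

        def ambtc_decode(bitmap, lo, hi):
            return np.where(bitmap, hi, lo)

        block = np.random.randint(0, 256, (4, 4))
        bm, lo, hi = ambtc_encode(block)
        print(np.abs(ambtc_decode(bm, lo, hi) - block).mean())        # per-block reconstruction error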

  20. An overview of Space Communication Artificial Intelligence for Link Evaluation Terminal (SCAILET) Project

    NASA Technical Reports Server (NTRS)

    Shahidi, Anoosh K.; Schlegelmilch, Richard F.; Petrik, Edward J.; Walters, Jerry L.

    1991-01-01

    A software application to assist end-users of the link evaluation terminal (LET) for satellite communications is being developed. This software application incorporates artificial intelligence (AI) techniques and will be deployed as an interface to LET. The high burst rate (HBR) LET provides 30 GHz transmitting/20 GHz receiving (220/110 Mbps) capability for wideband communications technology experiments with the Advanced Communications Technology Satellite (ACTS). The HBR LET can monitor and evaluate the integrity of the HBR communications uplink and downlink to the ACTS satellite. The uplink HBR transmission is performed by bursting the bit-pattern as a modulated signal to the satellite. The HBR LET can determine the bit error rate (BER) under various atmospheric conditions by comparing the transmitted bit pattern with the received bit pattern. An algorithm for power augmentation will be applied to enhance the system's BER performance at reduced signal strength caused by adverse conditions.

  1. Parallel integrated frame synchronizer chip

    NASA Technical Reports Server (NTRS)

    Solomon, Jeffrey Michael (Inventor); Ghuman, Parminder Singh (Inventor); Bennett, Toby Dennis (Inventor)

    2000-01-01

    A parallel integrated frame synchronizer which implements a sequential pipeline process wherein serial data in the form of telemetry data or weather satellite data enters the synchronizer by means of a front-end subsystem and passes to a parallel correlator subsystem or a weather satellite data processing subsystem. When in a CCSDS mode, data from the parallel correlator subsystem passes through a window subsystem, then to a data alignment subsystem and then to a bit transition density (BTD)/cyclical redundancy check (CRC) decoding subsystem. Data from the BTD/CRC decoding subsystem or data from the weather satellite data processing subsystem is then fed to an output subsystem where it is output from a data output port.

  2. National Information Systems Security Conference (19th) held in Baltimore, Maryland on October 22-25, 1996. Volume 1

    DTIC Science & Technology

    1996-10-25

    been demonstrated that steganography is ineffective when images are stored using this compression algorithm [2]. Difficulty in designing a general... Despite the relative ease of employing steganography to covertly transport data in an uncompressed 24-bit image, lossy compression algorithms based on... image, the security threat that steganography poses cannot be completely eliminated by application of a transform-based lossy compression algorithm

  3. Sublattice parallel replica dynamics.

    PubMed

    Martínez, Enrique; Uberuaga, Blas P; Voter, Arthur F

    2014-06-01

    Exascale computing presents a challenge for the scientific community as new algorithms must be developed to take full advantage of the new computing paradigm. Atomistic simulation methods that offer full fidelity to the underlying potential, i.e., molecular dynamics (MD) and parallel replica dynamics, fail to use the whole machine speedup, leaving a region in time and sample size space that is unattainable with current algorithms. In this paper, we present an extension of the parallel replica dynamics algorithm [A. F. Voter, Phys. Rev. B 57, R13985 (1998)] by combining it with the synchronous sublattice approach of Shim and Amar [Phys. Rev. B 71, 125432 (2005)], thereby exploiting event locality to improve the algorithm scalability. This algorithm is based on a domain decomposition in which events happen independently in different regions in the sample. We develop an analytical expression for the speedup given by this sublattice parallel replica dynamics algorithm and compare it with parallel MD and traditional parallel replica dynamics. We demonstrate how this algorithm, which introduces a slight additional approximation of event locality, enables the study of physical systems unreachable with traditional methodologies and promises to better utilize the resources of current high performance and future exascale computers.

  4. High capacity reversible watermarking for audio by histogram shifting and predicted error expansion.

    PubMed

    Wang, Fei; Xie, Zhaoxin; Chen, Zuo

    2014-01-01

    Being reversible, the watermarking information embedded in audio signals can be extracted while the original audio data achieve lossless recovery. Currently, the few existing reversible audio watermarking algorithms are confronted with the following problems: relatively low SNR (signal-to-noise ratio) of the embedded audio, a large amount of auxiliary embedded location information, and the absence of accurate capacity control. In this paper, we present a novel reversible audio watermarking scheme based on improved prediction error expansion and histogram shifting. First, we use a differential evolution algorithm to optimize the prediction coefficients and then apply prediction error expansion to output the stego data. Second, in order to reduce the length of the location map, we introduce a histogram shifting scheme. Meanwhile, the prediction error modification threshold for a given embedding capacity can be computed by the proposed scheme. Experiments show that this algorithm improves the SNR of the embedded audio signals and the embedding capacity, drastically reduces the length of the location map, and enhances the capacity control capability.
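
    As a concrete illustration of the prediction-error-expansion step mentioned above, the sketch below embeds one bit into an integer audio sample by expanding its prediction error; the fixed two-tap predictor is a placeholder assumption, and the paper's differential-evolution-optimized coefficients and histogram shifting are not reproduced.

      # Minimal prediction-error-expansion (PEE) sketch on integer samples.
      # The two-tap average predictor is an assumption standing in for the
      # optimized prediction coefficients described in the abstract.

      def embed_bit(prev2, prev1, x, bit):
          """Embed one bit into sample x by expanding its prediction error."""
          pred = (prev1 + prev2) // 2        # placeholder predictor
          err = x - pred
          return pred + 2 * err + bit        # expanded error carries the bit

      def extract_bit(prev2, prev1, y):
          """Recover the bit and restore the original sample losslessly."""
          pred = (prev1 + prev2) // 2
          err_marked = y - pred
          bit = err_marked & 1
          return bit, pred + (err_marked >> 1)

      x_prev2, x_prev1, x = 1000, 1004, 1010
      y = embed_bit(x_prev2, x_prev1, x, 1)
      assert extract_bit(x_prev2, x_prev1, y) == (1, x)   # reversible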

  5. Uncertain decision tree inductive inference

    NASA Astrophysics Data System (ADS)

    Zarban, L.; Jafari, S.; Fakhrahmad, S. M.

    2011-10-01

    Induction is the process of reasoning in which general rules are formulated based on limited observations of recurring phenomenal patterns. Decision tree learning is one of the most widely used and practical inductive methods, which represents the results in a tree scheme. Various decision tree algorithms have already been proposed, such as CLS, ID3, Assistant, C4.5, REPTree and Random Tree. These algorithms suffer from some major shortcomings. In this article, after discussing the main limitations of the existing methods, we introduce a new decision tree induction algorithm, which overcomes all the problems existing in its counterparts. The new method uses bit strings and maintains the important information on them. Because of this bit-string representation and the logical operations performed on it, the induction process is very fast. The method also has several important features: it deals with inconsistencies in data, avoids overfitting and handles uncertainty. We also illustrate further advantages and new features of the proposed method. The experimental results show the effectiveness of the method in comparison with other methods existing in the literature.
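
    The general bit-string idea described above can be pictured with ordinary integers used as bit sets: one bit per training example, so the class distribution of any attribute test is obtained with a bitwise AND and a popcount. The toy data and helper below are purely illustrative, not the paper's data structure.

      # One bit per training example: AND + popcount replaces scanning the data.

      examples = [                      # (outlook, class) toy data (assumption)
          ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
          ("rain", "yes"), ("rain", "no"),
      ]

      def bitmask(predicate):
          """Pack 'predicate holds for example i' into bit i of an integer."""
          mask = 0
          for i, ex in enumerate(examples):
              if predicate(ex):
                  mask |= 1 << i
          return mask

      all_bits = (1 << len(examples)) - 1
      sunny = bitmask(lambda ex: ex[0] == "sunny")
      yes = bitmask(lambda ex: ex[1] == "yes")

      # class counts in the 'outlook = sunny' branch via bitwise operations
      n_yes = bin(sunny & yes).count("1")
      n_no = bin(sunny & ~yes & all_bits).count("1")
      print(n_yes, n_no)                # -> 0 2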

  6. Fast converging minimum probability of error neural network receivers for DS-CDMA communications.

    PubMed

    Matyjas, John D; Psaromiligkos, Ioannis N; Batalama, Stella N; Medley, Michael J

    2004-03-01

    We consider a multilayer perceptron neural network (NN) receiver architecture for the recovery of the information bits of a direct-sequence code-division-multiple-access (DS-CDMA) user. We develop a fast converging adaptive training algorithm that minimizes the bit-error rate (BER) at the output of the receiver. The adaptive algorithm has three key features: i) it incorporates the BER, i.e., the ultimate performance evaluation measure, directly into the learning process, ii) it utilizes constraints that are derived from the properties of the optimum single-user decision boundary for additive white Gaussian noise (AWGN) multiple-access channels, and iii) it embeds importance sampling (IS) principles directly into the receiver optimization process. Simulation studies illustrate the BER performance of the proposed scheme.

  7. Noise removing in encrypted color images by statistical analysis

    NASA Astrophysics Data System (ADS)

    Islam, N.; Puech, W.

    2012-03-01

    Cryptographic techniques are used to secure confidential data from unauthorized access, but these techniques are very sensitive to noise: a single bit change in encrypted data can have a catastrophic impact on the decrypted data. This paper addresses the problem of removing bit errors in visual data encrypted with the AES algorithm in CBC mode. In order to remove the noise, a method is proposed that is based on a statistical analysis of each block during decryption. The proposed method exploits local statistics of the visual data and the confusion/diffusion properties of the encryption algorithm to remove the errors. Experimental results show that the proposed method can be used at the receiving end as a possible solution for noise removal from visual data in the encrypted domain.
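
    The block-wise sensitivity that the method exploits is a property of CBC itself: flipping one ciphertext bit garbles the corresponding plaintext block completely and flips exactly one bit in the next block. The sketch below demonstrates this with the PyCryptodome package (an assumption; any AES implementation would do) and is not the proposed correction method itself.

      # CBC error propagation: one flipped ciphertext bit -> one garbled block
      # plus a single-bit error in the following block.
      # Requires the PyCryptodome package (assumption).
      from Crypto.Cipher import AES
      import os

      key, iv = os.urandom(16), os.urandom(16)
      plain = bytes(range(48))                               # three 16-byte blocks

      cipher = AES.new(key, AES.MODE_CBC, iv).encrypt(plain)
      noisy = bytearray(cipher)
      noisy[20] ^= 0x01                                      # bit error in block 1

      decoded = AES.new(key, AES.MODE_CBC, iv).decrypt(bytes(noisy))

      for b in range(3):
          errs = sum(bin(p ^ q).count("1")
                     for p, q in zip(plain[16*b:16*(b+1)], decoded[16*b:16*(b+1)]))
          print("block", b, "bit errors:", errs)   # 0, ~64 (garbled), exactly 1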

  8. A High Performance Image Data Compression Technique for Space Applications

    NASA Technical Reports Server (NTRS)

    Yeh, Pen-Shu; Venbrux, Jack

    2003-01-01

    A high-performance image data compression technique is currently being developed for space science applications under the requirements of high speed and pushbroom scanning. The technique is also applicable to frame-based imaging data. The algorithm combines a two-dimensional transform with bitplane encoding; this results in an embedded bit string with the exact compression rate specified by the user. The compression scheme performs well on a suite of test images acquired from spacecraft instruments. It can also be applied to three-dimensional data cubes resulting from hyperspectral imaging instruments. Flight-qualifiable hardware implementations are in development. The implementation is being designed to compress data in excess of 20 Msamples/sec and support quantization from 2 to 16 bits. This paper presents the algorithm, its applications and the status of development.
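
    The embedded-bit-string property comes from emitting the most significant bit planes first, so the stream can be cut at the exact rate the user requests. The following NumPy sketch shows plain bitplane decomposition of a block of transform coefficients; it is an illustration of the idea, not the flight algorithm.

      import numpy as np

      def bitplane_stream(coeffs, nplanes=8):
          """Sign plane first, then magnitude bit planes from MSB to LSB.
          Truncating the returned list anywhere yields a coarser reconstruction,
          which is what allows an exact user-specified compression rate."""
          mags = np.abs(coeffs).ravel().astype(np.uint32)
          stream = [(coeffs.ravel() < 0).astype(np.uint8)]        # sign plane
          for p in range(nplanes - 1, -1, -1):                    # MSB first
              stream.append(((mags >> p) & 1).astype(np.uint8))
          return stream

      coeffs = np.array([[35, -7], [0, 12]])                      # toy coefficients
      for plane in bitplane_stream(coeffs, nplanes=6):
          print("".join(map(str, plane)))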

  9. Amélioration des performances d'une implantation parallélovectorielle du gradient conjugué par extension du schéma de stockage matriciel

    NASA Astrophysics Data System (ADS)

    Magnin, H.; Coulomb, J. L.

    1993-03-01

    Electromagnetic field computation with the Finite Element (FE) method implies solving large linear systems of equations. The performance and memory capacity of today's computers make three-dimensional FE discretizations of electromagnetic problems feasible, but the number of unknowns becomes very large. Thus, to reduce the time to the numerical solution of the linear systems that arise, the use of parallel and/or vector computers has to be envisaged. In this paper, the main constitutive steps of the Preconditioned Conjugate Gradient algorithm (PCG) are analysed. After a short recall of our previous work concerning their improvement by use of vector and parallel computations, we show some speedup limitations due to the sparse row-wise matrix storage scheme employed. Then, an extension of this matrix representation is proposed, leading to the introduction of redundant storage of non-zero coefficients. In spite of the "memory waste" thus implied, it is shown how this extension can be successfully employed to increase the speedup due to parallelism and vectorization of the whole algorithm, and in particular to derive a parallel preconditioner.
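
    For readers unfamiliar with the baseline algorithm, the sketch below is a minimal Jacobi-preconditioned conjugate gradient over a row-wise (CSR) sparse matrix, using SciPy for the storage; the redundant-storage extension and the parallel preconditioner proposed in the paper are not reproduced.

      import numpy as np
      from scipy.sparse import csr_matrix

      def pcg(A, b, tol=1e-10, maxit=200):
          """Jacobi-preconditioned conjugate gradient for an SPD matrix in CSR form."""
          Minv = 1.0 / A.diagonal()                # diagonal preconditioner
          x = np.zeros_like(b)
          r = b - A @ x
          z = Minv * r
          p = z.copy()
          rz = r @ z
          for _ in range(maxit):
              Ap = A @ p                           # sparse row-wise mat-vec
              alpha = rz / (p @ Ap)
              x += alpha * p
              r -= alpha * Ap
              if np.linalg.norm(r) < tol:
                  break
              z = Minv * r
              rz_new = r @ z
              p = z + (rz_new / rz) * p
              rz = rz_new
          return x

      A = csr_matrix(np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]]))
      b = np.array([1., 2., 3.])
      print(np.allclose(A @ pcg(A, b), b))         # True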

  10. Parallel Algorithms and Patterns

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Robey, Robert W.

    2016-06-16

    This is a PowerPoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of such problems include sorting, searching, optimization and matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are reductions, prefix scans and ghost cell updates. We only touch on parallel patterns in this presentation; they really deserve their own detailed discussion, which Gabe Rockefeller would like to develop.
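
    As a small illustration of one of the patterns named above, the sketch below is an inclusive prefix scan written in the classic log-step (Hillis-Steele) form; NumPy vectorization stands in for the per-step parallelism a real parallel implementation would use.

      import numpy as np

      def inclusive_scan(a):
          """Hillis-Steele inclusive prefix sum: O(log n) data-parallel steps."""
          a = a.copy()
          shift = 1
          while shift < len(a):
              a[shift:] = a[shift:] + a[:-shift]   # one "parallel" step
              shift *= 2
          return a

      x = np.array([3, 1, 7, 0, 4, 1, 6, 3])
      print(inclusive_scan(x))                     # [ 3  4 11 11 15 16 22 25]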

  11. Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.

    PubMed

    Bhandarkar, S M; Chirravuri, S; Arnold, J

    1996-01-01

    Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048-processor MasPar MP-2 system, an SIMD 2-D toroidal mesh architecture, whereas the MIMD algorithms are implemented on an 8-processor Intel iPSC/860, an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.

  12. Exact parallel algorithms for some members of the traveling salesman problem family

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pekny, J.F.

    1989-01-01

    The traveling salesman problem and its many generalizations comprise one of the best known combinatorial optimization problem families. Most members of the family are NP-complete problems so that exact algorithms require an unpredictable and sometimes large computational effort. Parallel computers offer hope for providing the power required to meet these demands. A major barrier to applying parallel computers is the lack of parallel algorithms. The contributions presented in this thesis center around new exact parallel algorithms for the asymmetric traveling salesman problem (ATSP), prize collecting traveling salesman problem (PCTSP), and resource constrained traveling salesman problem (RCTSP). The RCTSP is a particularly difficult member of the family since finding a feasible solution is an NP-complete problem. An exact sequential algorithm is also presented for the directed hamiltonian cycle problem (DHCP). The DHCP algorithm is superior to current heuristic approaches and represents the first exact method applicable to large graphs. Computational results presented for each of the algorithms demonstrate the effectiveness of combining efficient algorithms with parallel computing methods. Performance statistics are reported for randomly generated ATSPs with 7,500 cities, PCTSPs with 200 cities, RCTSPs with 200 cities, DHCPs with 3,500 vertices, and assignment problems of size 10,000. Sequential results were collected on a Sun 4/260 engineering workstation, while parallel results were collected using a 14 and 100 processor BBN Butterfly Plus computer. The computational results represent the largest instances ever solved to optimality on any type of computer.

  13. Efficient implementation of parallel three-dimensional FFT on clusters of PCs

    NASA Astrophysics Data System (ADS)

    Takahashi, Daisuke

    2003-05-01

    In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.

  14. DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function.

    PubMed

    Cheng, Liang; Hu, Yang; Sun, Jie; Zhou, Meng; Jiang, Qinghua

    2018-06-01

    DincRNA aims to provide a comprehensive web-based bioinformatics toolkit to elucidate the entangled relationships among diseases and non-coding RNAs (ncRNAs) from the perspective of disease similarity. Quantifying the relationship between a pair of diseases usually depends on their molecular mechanisms and on the structure of the directed acyclic graph of the Disease Ontology (DO). Corresponding methods for calculating the similarity of pair-wise diseases involve Resnik's, Lin's, Wang's, PSB and SemFunSim methods. Recently, disease similarity was shown to be suitable for calculating the functional similarities of ncRNAs and for prioritizing ncRNA-disease pairs, and it has been widely applied for predicting ncRNA function due to the limited biological knowledge about these RNAs from wet-lab experiments. For this purpose, a large number of algorithms and prior knowledge need to be integrated, e.g. the 'pair-wise best, pairs-average' (PBPA) and 'pair-wise all, pairs-maximum' (PAPM) methods for calculating the functional similarities of ncRNAs, and the random walk with restart (RWR) method for prioritizing ncRNA-disease pairs. To facilitate the exploration of disease associations and ncRNA function, DincRNA implemented all of the above eight algorithms based on DO and disease-related genes. Currently, it provides functions to query disease similarity scores, miRNA and lncRNA functional similarity scores, and the prioritization scores of lncRNA-disease and miRNA-disease pairs. Availability: http://bio-annotation.cn:18080/DincRNAClient/. Contact: biofomeng@hotmail.com or qhjiang@hit.edu.cn. Supplementary data are available at Bioinformatics online.

  15. A New Paradigm Hidden in Steganography

    DTIC Science & Technology

    2000-01-01

    In steganography, we do not make the "strong" assumption that Eve has knowledge of the steganographic algorithm. This is why there may, or may not be... the n least significant bits (LSB) of each pixel in the cover-image, with the n most significant bits (MSB) from the corresponding pixel of the image to... e.g., 2 LSB are (0,0)) to 3 (e.g., 2 LSB are (1,1)), it is visually impossible for Eve to detect the steganography. Of course, if Eve has knowl
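
    The n-LSB substitution the snippet alludes to can be written in a few lines; the sketch below is the conventional textbook embedding (replace the n least significant bits of each cover pixel with the n most significant bits of the hidden image), not any particular scheme from the conference volume.

      import numpy as np

      def embed_lsb(cover, secret, n=2):
          """Replace the n LSBs of the cover pixels with the n MSBs of the secret."""
          cover = cover.astype(np.uint8)
          secret_msb = secret.astype(np.uint8) >> (8 - n)
          return (cover >> n << n) | secret_msb    # clear n LSBs, insert hidden bits

      def extract_lsb(stego, n=2):
          """Recover an n-bit approximation of the hidden image."""
          return (stego & ((1 << n) - 1)) << (8 - n)

      cover = np.array([[200, 13], [77, 255]], dtype=np.uint8)
      secret = np.array([[64, 192], [0, 128]], dtype=np.uint8)
      print(extract_lsb(embed_lsb(cover, secret)))  # top 2 bits of 'secret' survive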

  16. VLSI for High-Speed Digital Signal Processing

    DTIC Science & Technology

    1994-09-30

    particular, the design, layout and fabrication of integrated circuits. The primary project for this grant has been the design and implementation of a... targeted at 33.36 dB, and the FRSBC algorithm, targeted at 0.5 bits/pixel, respectively. The filter... to mean square error, as shown in Fig. 6, is used, yielding a total of 16 subbands. The rates, in bits per pixel (bpp), and the peak signal

  17. Double-tick realization of binary control program

    NASA Astrophysics Data System (ADS)

    Kobylecki, Michał; Kania, Dariusz

    2016-12-01

    This paper presents a procedure for the hardware implementation of bit control algorithms compliant with the IEC 61131-3 standard. The described transformation, based on set calculus and graphs, allows translation of the original control program into a form fully consistent with the original, yielding a two-tick architecture. The proposed method enables the efficient implementation of bit control in an FPGA using the standardized Ladder Diagram (LD) programming language.

  18. Parallel 3D Multi-Stage Simulation of a Turbofan Engine

    NASA Technical Reports Server (NTRS)

    Turner, Mark G.; Topp, David A.

    1998-01-01

    A 3D multistage simulation of each component of a modern GE Turbofan engine has been made. An axisymmetric view of this engine is presented in the document. This includes a fan, booster rig, high pressure compressor rig, high pressure turbine rig and a low pressure turbine rig. In the near future, all components will be run in a single calculation for a solution of 49 blade rows. The simulation exploits the use of parallel computations by using two levels of parallelism. Each blade row is run in parallel and each blade row grid is decomposed into several domains and run in parallel. 20 processors are used for the 4 blade row analysis. The average passage approach developed by John Adamczyk at NASA Lewis Research Center has been further developed and parallelized. This is APNASA Version A. It is a Navier-Stokes solver using a 4-stage explicit Runge-Kutta time marching scheme with variable time steps and residual smoothing for convergence acceleration. It has an implicit K-E turbulence model which uses an ADI solver to factor the matrix. Between 50 and 100 explicit time steps are solved before a blade row body force is calculated and exchanged with the other blade rows. This outer iteration has been coined a "flip." Efforts have been made to make the solver linearly scalable with the number of blade rows. Enough flips are run (between 50 and 200) so that the solution in the entire machine is no longer changing. The K-E equations are generally solved every other explicit time step. One of the key requirements in the development of the parallel code was to make the parallel solution exactly (bit for bit) match the serial solution. This has helped isolate many small parallel bugs and guarantee the parallelization was done correctly. The domain decomposition is done only in the axial direction since the number of points axially is much larger than in the other two directions. This code uses MPI for message passing. Parallel speedup of the solver portion (no I/O or body force calculation) is reported for a grid which has 227 points axially.

  19. Computational mechanics analysis tools for parallel-vector supercomputers

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.; Qin, J.

    1993-01-01

    Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigen-solution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization algorithm and domain decomposition. The source code for many of these algorithms is available from NASA Langley.

  20. Concurrent computation of attribute filters on shared memory parallel machines.

    PubMed

    Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold

    2008-10-01

    Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.

  1. Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography

    NASA Astrophysics Data System (ADS)

    Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo

    2017-08-01

    We present an application of massively parallel processing of quantitative flow measurements data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150 fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements. Catalogue identifier: AFBT_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3 No. of lines in distributed program, including test data, etc.: 913552 No. of bytes in distributed program, including test data, etc.: 270876249 Distribution format: tar.gz Programming language: CUDA/C, MATLAB. Computer: Intel x64 CPU, GPU supporting CUDA technology. Operating system: 64-bit Windows 7 Professional. Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized. RAM: Dependent on user's parameters, typically between several gigabytes and several tens of gigabytes Classification: 6.5, 18. Nature of problem: Speed-up of data processing in optical coherence microscopy Solution method: Utilization of GPU for massively parallel data processing Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data) Running time: 1.8 s for one B-scan (150 × faster in comparison to the CPU data processing time)

  2. Faster Double-Size Bipartite Multiplication out of Montgomery Multipliers

    NASA Astrophysics Data System (ADS)

    Yoshino, Masayuki; Okeya, Katsuyuki; Vuillaume, Camille

    This paper proposes novel algorithms for computing double-size modular multiplications with few modulus-dependent precomputations. Low-end devices such as smartcards are usually equipped with hardware Montgomery multipliers. However, due to progress in mathematical attacks, security institutions such as NIST have steadily demanded longer bit-lengths for public-key cryptography, making the multipliers quickly obsolete. In an attempt to extend the lifespan of such multipliers, double-size techniques compute modular multiplications with twice the bit-length of the multipliers. Techniques are known for extending the bit-length of classical Euclidean multipliers, of Montgomery multipliers and the combination thereof, namely bipartite multipliers. However, unlike classical and bipartite multiplications, Montgomery multiplications involve modulus-dependent precomputations, which amount to a large part of an RSA encryption or signature verification. The proposed double-size technique simulates double-size multiplications based on single-size Montgomery multipliers, and yet precomputations are essentially free: in a 2048-bit RSA encryption or signature verification with public exponent e = 2^16 + 1, the proposal with a 1024-bit Montgomery multiplier is at least 1.5 times faster than previous double-size Montgomery multiplications.
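
    The modulus-dependent precomputations the proposal tries to avoid are visible already in ordinary single-size Montgomery arithmetic, sketched below (R^2 mod N and N' = -N^-1 mod R); this is a plain REDC illustration, not the double-size or bipartite construction, and it assumes Python 3.8+ for the modular inverse via pow(N, -1, R).

      def montgomery_setup(N, rbits):
          """Modulus-dependent precomputation: R, N' = -N^-1 mod R, R^2 mod N."""
          R = 1 << rbits
          Nprime = (-pow(N, -1, R)) % R            # needs Python 3.8+
          return R, Nprime, (R * R) % N

      def mont_mul(a, b, N, R, Nprime):
          """Montgomery product a * b * R^-1 mod N (REDC), for a, b < N < R."""
          t = a * b
          m = (t * Nprime) % R
          u = (t + m * N) // R                     # exact division by construction
          return u - N if u >= N else u

      N, rbits = 101, 8
      R, Nprime, R2 = montgomery_setup(N, rbits)
      x, y = 42, 77
      xm = mont_mul(x, R2, N, R, Nprime)           # x -> Montgomery form x*R mod N
      ym = mont_mul(y, R2, N, R, Nprime)
      zm = mont_mul(xm, ym, N, R, Nprime)          # x*y*R mod N
      print(mont_mul(zm, 1, N, R, Nprime), (x * y) % N)   # both 2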

  3. Regional-scale calculation of the LS factor using parallel processing

    NASA Astrophysics Data System (ADS)

    Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong

    2015-05-01

    With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on the message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategies are designed according to the characteristics of the algorithms, including a decomposition method that maintains the integrity of the results, an optimized workflow that reduces the time spent exporting unnecessary intermediate data, and a buffer-communication-computation strategy that improves communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.

  4. A new scheduling algorithm for parallel sparse LU factorization with static pivoting

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grigori, Laura; Li, Xiaoye S.

    2002-08-20

    In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of L' and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU{_}DIST are reported after applying this algorithm on real world application matrices on an IBM SP RS/6000 distributed memory machine.

  5. Digital pulse processing for planar TlBr detectors

    NASA Astrophysics Data System (ADS)

    Nakhostin, M.; Hitomi, K.; Ishii, K.; Kikuchi, Y.

    2010-04-01

    We report on a digital pulse processing algorithm for correction of charge trapping in the planar TlBr detectors. The algorithm is performed on the signals digitized at the preamplifier stage. The algorithm is very simple and is implemented with little computational effort. By using a digitizer with a sampling rate of 250 MSample/s and 8 bit resolution, an energy resolution of 6.5% is achieved at 511 keV with a 0.7 mm thick detector.

  6. Parallel conjugate gradient algorithms for manipulator dynamic simulation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Scheld, Robert E.

    1989-01-01

    Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n^2) on a serial processor. A conjugate gradient algorithm is presented that provides greater efficiency by using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log_2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves a computational time of O(log_2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).

  7. Finger Vein Recognition Based on a Personalized Best Bit Map

    PubMed Central

    Yang, Gongping; Xi, Xiaoming; Yin, Yilong

    2012-01-01

    Finger vein patterns have recently been recognized as an effective biometric identifier. In this paper, we propose a finger vein recognition method based on a personalized best bit map (PBBM). Our method is rooted in a local binary pattern based method and uses only the best bits for matching. We first present the concept of the PBBM and the algorithm for generating it. Then we propose the finger vein recognition framework, which consists of preprocessing, feature extraction, and matching. Finally, we design extensive experiments to evaluate the effectiveness of our proposal. Experimental results show that PBBM achieves not only better performance, but also high robustness and reliability. In addition, PBBM can be used as a general framework for binary pattern based recognition. PMID:22438735
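
    The "best bit" idea can be pictured as keeping only the bit positions that stay stable across one user's enrollment codes and comparing with a masked Hamming distance; the sketch below is that simplified picture on assumed toy codes, not the authors' exact construction.

      import numpy as np

      def personalized_best_bits(enroll_codes, stability=1.0):
          """Reference code plus a mask of positions identical across enrollments."""
          codes = np.asarray(enroll_codes, dtype=np.uint8)
          ones = codes.mean(axis=0)
          mask = (ones >= stability) | (ones <= 1.0 - stability)   # stable bits
          reference = (ones >= 0.5).astype(np.uint8)
          return reference, mask

      def masked_hamming(code, reference, mask):
          """Fraction of best-bit positions where a probe code disagrees."""
          return np.count_nonzero((code != reference) & mask) / np.count_nonzero(mask)

      enroll = [[1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]]
      ref, mask = personalized_best_bits(enroll)
      probe = np.array([1, 1, 0, 1, 0, 1])
      print(mask.astype(int), masked_hamming(probe, ref, mask))   # [1 1 0 1 1 0] 0.25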

  8. Molecular computation: RNA solutions to chess problems.

    PubMed

    Faulhammer, D; Cukras, A R; Lipton, R J; Landweber, L F

    2000-02-15

    We have expanded the field of "DNA computers" to RNA and present a general approach for the solution of satisfiability problems. As an example, we consider a variant of the "Knight problem," which asks generally what configurations of knights can one place on an n x n chess board such that no knight is attacking any other knight on the board. Using specific ribonuclease digestion to manipulate strands of a 10-bit binary RNA library, we developed a molecular algorithm and applied it to a 3 x 3 chessboard as a 9-bit instance of this problem. Here, the nine spaces on the board correspond to nine "bits" or placeholders in a combinatorial RNA library. We recovered a set of "winning" molecules that describe solutions to this problem.
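
    The 9-bit search space that the RNA library encodes is small enough to enumerate conventionally; the bit-wise sketch below treats each 9-bit word as a board and keeps the configurations with no mutual attacks (94 of the 512 candidates, counting the empty board). It only illustrates the instance, not the molecular algorithm.

      # Brute-force the 9-bit "Knight problem" instance: bit i of a word says
      # whether a knight sits on square i of the 3 x 3 board.

      def attacks(a, b):
          """True if squares a and b (0..8) are a knight's move apart."""
          ra, ca, rb, cb = a // 3, a % 3, b // 3, b % 3
          return {abs(ra - rb), abs(ca - cb)} == {1, 2}

      valid = [w for w in range(1 << 9)
               if not any(w >> i & 1 and w >> j & 1 and attacks(i, j)
                          for i in range(9) for j in range(i + 1, 9))]
      print(len(valid))        # 94 legal configurations, including the empty board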

  9. Finger vein recognition based on a personalized best bit map.

    PubMed

    Yang, Gongping; Xi, Xiaoming; Yin, Yilong

    2012-01-01

    Finger vein patterns have recently been recognized as an effective biometric identifier. In this paper, we propose a finger vein recognition method based on a personalized best bit map (PBBM). Our method is rooted in a local binary pattern based method and uses only the best bits for matching. We first present the concept of the PBBM and the algorithm for generating it. Then we propose the finger vein recognition framework, which consists of preprocessing, feature extraction, and matching. Finally, we design extensive experiments to evaluate the effectiveness of our proposal. Experimental results show that PBBM achieves not only better performance, but also high robustness and reliability. In addition, PBBM can be used as a general framework for binary pattern based recognition.

  10. Bit-level quantum color image encryption scheme with quantum cross-exchange operation and hyper-chaotic system

    NASA Astrophysics Data System (ADS)

    Zhou, Nanrun; Chen, Weiwei; Yan, Xinyu; Wang, Yunqian

    2018-06-01

    In order to obtain higher encryption efficiency, a bit-level quantum color image encryption scheme by exploiting quantum cross-exchange operation and a 5D hyper-chaotic system is designed. Additionally, to enhance the scrambling effect, the quantum channel swapping operation is employed to swap the gray values of corresponding pixels. The proposed color image encryption algorithm has larger key space and higher security since the 5D hyper-chaotic system has more complex dynamic behavior, better randomness and unpredictability than those based on low-dimensional hyper-chaotic systems. Simulations and theoretical analyses demonstrate that the presented bit-level quantum color image encryption scheme outperforms its classical counterparts in efficiency and security.

  11. Robust High-Capacity Audio Watermarking Based on FFT Amplitude Modification

    NASA Astrophysics Data System (ADS)

    Fallahpour, Mehdi; Megías, David

    This paper proposes a novel robust audio watermarking algorithm to embed data and extract it in a bit-exact manner by changing the magnitudes of the FFT spectrum. The key point is selecting a frequency band for embedding based on the comparison between the original and the MP3 compressed/decompressed signal and on a suitable scaling factor. The experimental results show that the method has a very high capacity (about 5 kbps), without significant perceptual distortion (ODG about -0.25), and provides robustness against common audio signal processing such as added noise, filtering and MPEG compression (MP3). Furthermore, the proposed method has a larger capacity (ratio of the number of embedded bits to the number of host bits) than recent image data hiding methods.

  12. A 640-MHz 32-megachannel real-time polyphase-FFT spectrum analyzer

    NASA Technical Reports Server (NTRS)

    Zimmerman, G. A.; Garyantes, M. F.; Grimm, M. J.; Charny, B.

    1991-01-01

    A polyphase fast Fourier transform (FFT) spectrum analyzer being designed for NASA's Search for Extraterrestrial Intelligence (SETI) Sky Survey at the Jet Propulsion Laboratory is described. By replacing the time domain multiplicative window preprocessing with polyphase filter processing, much of the processing loss of windowed FFTs can be eliminated. Polyphase coefficient memory costs are minimized by effective use of run length compression. Finite word length effects are analyzed, producing a balanced system with 8 bit inputs, 16 bit fixed point polyphase arithmetic, and 24 bit fixed point FFT arithmetic. Fixed point renormalization midway through the computation is seen to be naturally accommodated by the matrix FFT algorithm proposed. Simulation results validate the finite word length arithmetic analysis and the renormalization technique.
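
    The structural difference between a windowed FFT and a polyphase FFT is that the input frame is weighted by a long prototype filter and folded (summed) down to the channel count before the transform. The floating-point NumPy sketch below shows that front end with an assumed sinc-times-Hann prototype; the fixed-point word lengths and run-length-compressed coefficient storage of the hardware design are not modeled.

      import numpy as np

      def polyphase_fft(x, n_channels=32, n_taps=4):
          """Weighted-overlap-add polyphase front end followed by an FFT."""
          n = n_taps * n_channels
          proto = np.sinc(np.linspace(-n_taps / 2, n_taps / 2, n))   # assumed prototype
          proto = proto * np.hanning(n)
          frame = x[:n] * proto                        # weight one input frame
          folded = frame.reshape(n_taps, n_channels).sum(axis=0)     # fold the taps
          return np.fft.fft(folded)

      fs = 32.0                                        # channel spacing of 1 Hz
      t = np.arange(128) / fs
      x = np.cos(2 * np.pi * 5.0 * t)                  # tone centred on channel 5
      spec = polyphase_fft(x, 32, 4)
      print(np.argmax(np.abs(spec[:16])))              # expected to peak at channel 5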

  13. Optimized atom position and coefficient coding for matching pursuit-based image compression.

    PubMed

    Shoa, Alireza; Shirani, Shahram

    2009-12-01

    In this paper, we propose a new encoding algorithm for matching pursuit image coding. We show that coding performance is improved when correlations between atom positions and atom coefficients are both used in encoding. We find the optimum tradeoff between efficient atom position coding and efficient atom coefficient coding and optimize the encoder parameters. Our proposed algorithm outperforms the existing coding algorithms designed for matching pursuit image coding. Additionally, we show that our algorithm results in better rate distortion performance than JPEG 2000 at low bit rates.

  14. GPU-completeness: theory and implications

    NASA Astrophysics Data System (ADS)

    Lin, I.-Jong

    2011-01-01

    This paper formalizes a major insight into a class of algorithms that relates parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a method for algorithmic classification based on NP-completeness techniques, applied toward parallel acceleration. We define this class of algorithms as "GPU-Complete" and postulate the necessary properties of an algorithm for admission into this class. We also formally relate this algorithmic space to the space of imaging algorithms. This concept is based upon our experience in the print production area, where GPUs (Graphics Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for the CPU, language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for the GPU, video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second computing architecture distinct from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a well-defined subset of algorithms, GPU-Completeness is intended to connect parallelism, algorithms and efficient architectures into a unified framework, showing that multiple layers of parallel implementation are guided by the same underlying trade-off.

  15. Crashworthiness simulations with DYNA3D

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schauer, D.A.; Hoover, C.G.; Kay, G.J.

    1996-04-01

    Current progress in parallel algorithm research and applications in vehicle crash simulation is described for the explicit, finite element algorithms in DYNA3D. Problem partitioning methods and parallel algorithms for contact at material interfaces are the two challenging algorithm research problems that are addressed. Two prototype parallel contact algorithms have been developed for treating the cases of local and arbitrary contact. Demonstration problems for local contact are crashworthiness simulations with 222 locally defined contact surfaces and a vehicle/barrier collision modeled with arbitrary contact. A simulation of crash tests conducted for a vehicle impacting a U-channel small sign post embedded in soil has been run on both the serial and parallel versions of DYNA3D. A significant reduction in computational time has been observed when running these problems on the parallel version. However, to achieve maximum efficiency, complex problems must be appropriately partitioned, especially when contact dominates the computation.

  16. Adaptive filtering of GOCE-derived gravity gradients of the disturbing potential in the context of the space-wise approach

    NASA Astrophysics Data System (ADS)

    Piretzidis, Dimitrios; Sideris, Michael G.

    2017-09-01

    Filtering and signal processing techniques have been widely used in the processing of satellite gravity observations to reduce measurement noise and correlation errors. The parameters and types of filters used depend on the statistical and spectral properties of the signal under investigation. Filtering is usually applied in a non-real-time environment. The present work focuses on the implementation of an adaptive filtering technique to process satellite gravity gradiometry data for gravity field modeling. Adaptive filtering algorithms are commonly used in communication systems, noise and echo cancellation, and biomedical applications. Two independent studies have been performed to introduce adaptive signal processing techniques and test the performance of the least mean-squared (LMS) adaptive algorithm for filtering satellite measurements obtained by the gravity field and steady-state ocean circulation explorer (GOCE) mission. In the first study, a Monte Carlo simulation is performed in order to gain insights about the implementation of the LMS algorithm on data with spectral behavior close to that of real GOCE data. In the second study, the LMS algorithm is implemented on real GOCE data. Experiments are also performed to determine suitable filtering parameters. Only the four accurate components of the full GOCE gravity gradient tensor of the disturbing potential are used. The characteristics of the filtered gravity gradients are examined in the time and spectral domain. The obtained filtered GOCE gravity gradients show an agreement of 63-84 mEötvös (depending on the gravity gradient component), in terms of RMS error, when compared to the gravity gradients derived from the EGM2008 geopotential model. Spectral-domain analysis of the filtered gradients shows that the adaptive filters slightly suppress frequencies in the bandwidth of approximately 10-30 mHz. The limitations of the adaptive LMS algorithm are also discussed. The tested filtering algorithm can be connected to and employed in the first computational steps of the space-wise approach, where a time-wise Wiener filter is applied at the first stage of GOCE gravity gradient filtering. The results of this work can be extended to using other adaptive filtering algorithms, such as the recursive least-squares and recursive least-squares lattice filters.
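
    The LMS update itself is only a few lines; the sketch below is a generic adaptive noise canceller (reference input correlated with the noise, error output retaining the signal) rather than the GOCE processing chain, with the filter length and step size chosen arbitrarily for the demonstration.

      import numpy as np

      def lms_filter(d, x, n_taps=8, mu=0.1):
          """LMS adaptive filter: predict d[n] from the last n_taps reference samples."""
          w = np.zeros(n_taps)
          y, e = np.zeros_like(d), np.zeros_like(d)
          for n in range(n_taps, len(d)):
              u = x[n - n_taps + 1:n + 1][::-1]       # most recent sample first
              y[n] = w @ u
              e[n] = d[n] - y[n]
              w += mu * e[n] * u                      # gradient-descent weight update
          return y, e

      rng = np.random.default_rng(0)
      noise = rng.standard_normal(4000)
      signal = np.sin(2 * np.pi * 0.01 * np.arange(4000))
      d = signal + np.convolve(noise, [0.6, -0.3, 0.1])[:4000]   # signal + coloured noise
      y, e = lms_filter(d, noise)                     # reference input: the raw noise
      print(np.std(d[2000:]), np.std(e[2000:]))       # error output is mostly the signal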

  17. Progress of the Swedish-Australian research collaboration on uncooled smart IR sensors

    NASA Astrophysics Data System (ADS)

    Liddiard, Kevin C.; Ringh, Ulf; Jansson, Christer; Reinhold, Olaf

    1998-10-01

    Progress is reported on the development of uncooled microbolometer IR focal plane detector arrays (IRFPDA) under a research collaboration between the Swedish Defence Research Establishment (FOA), and the Defence Science and Technology Organization (DSTO), Australia. The paper describes current focal plane detector arrays designed by Electro-optic Sensor Design (EOSD) for readout circuits developed by FOA. The readouts are fabricated in 0.8 micrometer CMOS, and have a novel signal conditioning and 16 bit parallel ADC design. The arrays are post-processed at DSTO on wafers supplied by FOA. During the past year array processing has been carried out at a new microengineering facility at DSTO, Salisbury, South Australia. A number of small format 16 X 16 arrays have been delivered to FOA for evaluation, and imaging has been demonstrated with these arrays. A 320 X 240 readout with 320 parallel 16 bit ADCs has been developed and IRFPDAs for this readout have been fabricated and are currently being evaluated.

  18. Multiplexed Oversampling Digitizer in 65 nm CMOS for Column-Parallel CCD Readout

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grace, Carl; Walder, Jean-Pierre; von der Lippe, Henrik

    2012-04-10

    A digitizer designed to read out column-parallel charge-coupled devices (CCDs) used for high-speed X-ray imaging is presented. The digitizer is included as part of the High-Speed Image Preprocessor with Oversampling (HIPPO) integrated circuit. The digitizer module comprises a multiplexed, oversampling, 12-bit, 80 MS/s pipelined Analog-to-Digital Converter (ADC) and a bank of four fast-settling sample-and-hold amplifiers to instrument four analog channels. The ADC multiplexes and oversamples to reduce its area to allow integration that is pitch-matched to the columns of the CCD. Novel design techniques are used to enable oversampling and multiplexing with a reduced power penalty. The ADC exhibits 188 μV-rms noise, which is less than 1 LSB at the 12-bit level. The prototype is implemented in a commercially available 65 nm CMOS process. The digitizer will lead to a proof-of-principle 2D 10 Gigapixel/s X-ray detector.

  19. Line-drawing algorithms for parallel machines

    NASA Technical Reports Server (NTRS)

    Pang, Alex T.

    1990-01-01

    The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives: distance to a line (a point is on the line if it is sufficiently close to it) and intersection with a line (a point is on the line if it is an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.
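
    The "distance to a line" alternative is easy to express in data-parallel form: every pixel evaluates the same closeness test independently, so the work per pixel does not depend on line length or orientation. The NumPy sketch below uses array broadcasting to stand in for the pixel-per-processor array of a SIMD machine.

      import numpy as np

      def draw_line(shape, p0, p1, thickness=0.5):
          """Mark each pixel that lies within 'thickness' of the segment p0-p1."""
          ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
          (x0, y0), (x1, y1) = p0, p1
          dx, dy = x1 - x0, y1 - y0
          # perpendicular distance from every pixel centre to the infinite line
          dist = np.abs(dy * (xs - x0) - dx * (ys - y0)) / np.hypot(dx, dy)
          # restrict to the segment by projecting each pixel onto it
          t = ((xs - x0) * dx + (ys - y0) * dy) / (dx * dx + dy * dy)
          return (dist <= thickness) & (t >= 0) & (t <= 1)

      print(draw_line((8, 8), (1, 1), (6, 4)).astype(int))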

  20. Multiprocessing the Sieve of Eratosthenes

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1986-01-01

    The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
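
    A common way to expose the parallelism is to sieve disjoint segments independently with a shared set of base primes; the sketch below uses a process pool for that purpose and is only a generic segmented-sieve illustration, not the Flex/32 implementation or its dynamic load balancing.

      from math import isqrt
      from multiprocessing import Pool

      def base_primes(limit):
          """Serial sieve for the small primes up to sqrt(N)."""
          flags = bytearray([1]) * (limit + 1)
          flags[:2] = b"\x00\x00"
          for p in range(2, isqrt(limit) + 1):
              if flags[p]:
                  flags[p * p::p] = bytearray(len(flags[p * p::p]))
          return [i for i, f in enumerate(flags) if f]

      def sieve_segment(args):
          """Strike multiples of the base primes inside one segment [lo, hi)."""
          lo, hi, primes = args
          flags = bytearray([1]) * (hi - lo)
          for p in primes:
              start = max(p * p, (lo + p - 1) // p * p)
              flags[start - lo::p] = bytearray(len(flags[start - lo::p]))
          return [lo + i for i, f in enumerate(flags) if f]

      if __name__ == "__main__":
          N, seg = 1000, 250
          primes = base_primes(isqrt(N))
          jobs = [(lo, min(lo + seg, N + 1), primes) for lo in range(2, N + 1, seg)]
          with Pool(4) as pool:
              found = [p for chunk in pool.map(sieve_segment, jobs) for p in chunk]
          print(len(found))                 # 168 primes up to 1000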

  1. A Parallel Rendering Algorithm for MIMD Architectures

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.; Orloff, Tobias

    1991-01-01

    Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.

  2. A highly efficient multi-core algorithm for clustering extremely large datasets

    PubMed Central

    2010-01-01

    Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared memory parallel algorithms are shown to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
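
    The step that parallelizes most naturally is the assignment of points to their nearest centroid, since each data slice can be processed independently. The sketch below splits that step across a process pool; it is a generic shared-nothing illustration in Python, not the Java transactional-memory design described in the article.

      import numpy as np
      from multiprocessing import Pool

      def assign_chunk(args):
          """Nearest-centroid assignment for one slice of the data (one worker)."""
          chunk, centroids = args
          d = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
          return d.argmin(axis=1)

      def kmeans_step(data, centroids, n_workers=4):
          """One k-means iteration: parallel assignment, then centroid update."""
          chunks = np.array_split(data, n_workers)
          with Pool(n_workers) as pool:
              labels = np.concatenate(
                  pool.map(assign_chunk, [(c, centroids) for c in chunks]))
          new_centroids = np.array(
              [data[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
               for k in range(len(centroids))])
          return labels, new_centroids

      if __name__ == "__main__":
          rng = np.random.default_rng(1)
          data = np.vstack([rng.normal(m, 0.3, size=(200, 2)) for m in (0.0, 3.0, 6.0)])
          centroids = data[rng.choice(len(data), 3, replace=False)]
          for _ in range(10):
              labels, centroids = kmeans_step(data, centroids)
          print(np.round(centroids, 1))     # learned cluster centres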

  3. A soft decoding algorithm and hardware implementation for the visual prosthesis based on high order soft demodulation.

    PubMed

    Yang, Yuan; Quan, Nannan; Bu, Jingjing; Li, Xueping; Yu, Ningmei

    2016-09-26

    High order modulation and demodulation technology can reconcile the conflicting frequency requirements of wireless energy transmission and data communication. In order to achieve reliable wireless data communication based on high order modulation technology for a visual prosthesis, this work proposes a Reed-Solomon (RS) error correcting code (ECC) circuit built on differential amplitude and phase shift keying (DAPSK) soft demodulation. First, recognizing that the traditional division-based DAPSK soft demodulation algorithm is complex to implement in hardware, an improved phase soft demodulation algorithm for the visual prosthesis is put forward to reduce the hardware complexity. Based on this new algorithm, an improved RS soft decoding method is then proposed, in which the combination of the Chase algorithm and hard decoding algorithms is used to achieve soft decoding. In order to meet the requirements of an implantable visual prosthesis, a method to calculate symbol-level reliability as the product of bit reliabilities is derived, which reduces the number of test vectors of the Chase algorithm. The proposed algorithms are verified by MATLAB simulation and FPGA experiments. During MATLAB simulation, a biological channel attenuation model is added to the ECC circuit. The data rate is 8 Mbps in both the MATLAB simulation and the FPGA experiments. MATLAB simulation results show that the improved phase soft demodulation algorithm proposed in this paper saves hardware resources without losing bit error rate (BER) performance. Compared with the traditional demodulation circuit, the coding gain of the ECC circuit is improved by about 3 dB at the same BER of [Formula: see text]. The FPGA experimental results show that when a data demodulation error occurs with the wireless coils 3 cm apart, the system can correct it; the greater the distance, the higher the BER. A bit error rate analyzer was then used to measure the BER of the demodulation circuit and of the RS ECC circuit at different coil distances, and the results show that the RS ECC circuit has about an order of magnitude lower BER than the demodulation circuit at the same coil distance. Therefore, the RS ECC circuit provides more reliable communication in the system. The improved phase soft demodulation algorithm and soft decoding algorithm proposed in this paper enable data communication that is more reliable than other demodulation systems, and they also provide a significant reference for further study of visual prosthesis systems.

  4. NAS Parallel Benchmarks. 2.4

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    We describe a new problem size, called Class D, for the NAS Parallel Benchmarks (NPB), whose MPI source code implementation is being released as NPB 2.4. A brief rationale is given for how the new class is derived. We also describe the modifications made to the MPI (Message Passing Interface) implementation to allow the new class to be run on systems with 32-bit integers, and with moderate amounts of memory. Finally, we give the verification values for the new problem size.

  5. a Predator-Prey Model Based on the Fully Parallel Cellular Automata

    NASA Astrophysics Data System (ADS)

    He, Mingfeng; Ruan, Hongbo; Yu, Changliang

    We present a predator-prey lattice model containing movable wolves and sheep, which are characterized by Penna double bit strings. Sexual reproduction and child-care strategies are considered. To implement this model in an efficient way, we build a fully parallel cellular automaton based on a new definition of the neighborhood. We show the roles played by the initial densities of the populations, the mutation rate and the linear size of the lattice in the evolution of this model.

  6. A sweep algorithm for massively parallel simulation of circuit-switched networks

    NASA Technical Reports Server (NTRS)

    Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.

    1992-01-01

    A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel IPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.

  7. Robust hashing for 3D models

    NASA Astrophysics Data System (ADS)

    Berchtold, Waldemar; Schäfer, Marcel; Rettig, Michael; Steinebach, Martin

    2014-02-01

    3D models and applications are of utmost interest in both science and industry. As their usage increases, so do their number and thereby the challenge of correctly identifying them. Content identification is commonly done by cryptographic hashes. However, they fail as a solution in application scenarios such as computer aided design (CAD), scientific visualization or video games, because even the smallest alteration of the 3D model, e.g. conversion or compression operations, massively changes the cryptographic hash as well. Therefore, this work presents a robust hashing algorithm for 3D mesh data. The algorithm applies several different bit extraction methods. They are built to resist desired alterations of the model as well as malicious attacks intended to prevent correct allocation. The different bit extraction methods are tested against each other and, as far as possible, the hashing algorithm is compared to the state of the art. The parameters tested are robustness, security and runtime performance as well as the False Acceptance Rate (FAR) and False Rejection Rate (FRR); the probability of hash collision is also calculated. The introduced hashing algorithm is kept adaptive, e.g. in hash length, to serve as a proper tool for all applications in practice.

  8. Constraint treatment techniques and parallel algorithms for multibody dynamic analysis. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Chiou, Jin-Chern

    1990-01-01

    Computational procedures for kinematic and dynamic analysis of three-dimensional multibody dynamic (MBD) systems are developed from the differential-algebraic equations (DAE's) viewpoint. To minimize constraint violations during time integration, penalty constraint stabilization techniques and partitioning schemes are developed. The governing equations of motion are treated with a two-stage staggered explicit-implicit numerical algorithm that takes advantage of a partitioned solution procedure. A robust and parallelizable integration algorithm is developed. This algorithm uses a two-stage staggered central difference algorithm to integrate the translational coordinates and the angular velocities. The angular orientations of bodies in MBD systems are then obtained with an implicit algorithm via the kinematic relationship between Euler parameters and angular velocities. It is shown that the combination of the present solution procedures yields a computationally more accurate solution. To speed up the computations, the present constraint treatment techniques and the two-stage staggered explicit-implicit numerical algorithm were implemented in parallel. The DAE's and the constraint treatment techniques were transformed into arrowhead matrices, from which a Schur complement form was derived. By fully exploiting sparse matrix structural analysis techniques, a parallel preconditioned conjugate gradient algorithm is used to solve the system equations written in Schur complement form. A software testbed was designed and implemented on both sequential and parallel computers. This testbed was used to demonstrate the robustness and efficiency of the constraint treatment techniques, the accuracy of the two-stage staggered explicit-implicit numerical algorithm, and the speedup of the Schur-complement-based parallel preconditioned conjugate gradient algorithm on a parallel computer.

  9. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azad, Ariful; Buluç, Aydın; Pothen, Alex

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.

  10. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE PAGES

    Azad, Ariful; Buluç, Aydın; Pothen, Alex

    2016-03-24

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
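
    For readers unfamiliar with augmenting-path matching, a minimal serial Python sketch of Kuhn's algorithm is given below; the paper's contribution lies in parallelizing multi-source searches and grafting search trees, which this simplified single-source version does not attempt:

        def max_bipartite_matching(adj, n_left, n_right):
            """adj[u] lists right-vertices adjacent to left-vertex u."""
            match_right = [-1] * n_right           # right vertex -> matched left vertex

            def try_augment(u, visited):
                for v in adj[u]:
                    if not visited[v]:
                        visited[v] = True
                        # v is free, or its current partner can be re-matched elsewhere
                        if match_right[v] == -1 or try_augment(match_right[v], visited):
                            match_right[v] = u
                            return True
                return False

            matching = 0
            for u in range(n_left):
                if try_augment(u, [False] * n_right):
                    matching += 1
            return matching, match_right

        # Example: a 3x3 bipartite graph with a perfect matching.
        print(max_bipartite_matching({0: [0, 1], 1: [0], 2: [1, 2]}, 3, 3))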

  11. Fast parallel approach for 2-D DHT-based real-valued discrete Gabor transform.

    PubMed

    Tao, Liang; Kwan, Hon Keung

    2009-12-01

    Two-dimensional fast Gabor transform algorithms are useful for real-time applications due to the high computational complexity of the traditional 2-D complex-valued discrete Gabor transform (CDGT). This paper presents two block time-recursive algorithms for the 2-D DHT-based real-valued discrete Gabor transform (RDGT) and its inverse transform and develops a fast parallel approach for the implementation of the two algorithms. The computational complexity of the proposed parallel approach is analyzed and compared with that of the existing 2-D CDGT algorithms. The results indicate that the proposed parallel approach is attractive for real-time image processing.

  12. Communications oriented programming of parallel iterative solutions of sparse linear systems

    NASA Technical Reports Server (NTRS)

    Patrick, M. L.; Pratt, T. W.

    1986-01-01

    Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.

  13. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    PubMed

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models with textures of different dimensionality. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  14. Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    PubMed Central

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models with textures of different dimensionality. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812

  15. A hybrid quantum-inspired genetic algorithm for multiobjective flow shop scheduling.

    PubMed

    Li, Bin-Bin; Wang, Ling

    2007-06-01

    This paper proposes a hybrid quantum-inspired genetic algorithm (HQGA) for the multiobjective flow shop scheduling problem (FSSP), a typical NP-hard combinatorial optimization problem with a strong engineering background. On the one hand, a quantum-inspired GA (QGA) based on the Q-bit representation is applied for exploration in the discrete 0-1 hyperspace by using the updating operator of the quantum gate and genetic operators of the Q-bit. Moreover, a random-key representation is used to convert the Q-bit representation to a job permutation for evaluating the objective values of the schedule solution. On the other hand, a permutation-based GA (PGA) is applied both to perform exploration in the permutation-based scheduling space and to stress exploitation of good schedule solutions. To evaluate solutions in the multiobjective sense, a randomly weighted linear-sum function is used in the QGA, and a nondominated sorting technique including classification of Pareto fronts and fitness assignment is applied in the PGA with regard to both proximity and diversity of solutions. To maintain the diversity of the population, two population-trimming techniques are proposed. The proposed HQGA is tested on a set of multiobjective FSSPs. Simulation results and comparisons based on several performance metrics demonstrate the effectiveness of the proposed HQGA.
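
    A minimal Python sketch of two ingredients mentioned above, under assumed simplifications (function names are illustrative, not from the paper): observing a Q-bit string into a 0-1 vector, and decoding a random-key vector into a job permutation by sorting:

        import numpy as np

        rng = np.random.default_rng(0)

        def observe_qbits(alpha):
            """alpha[i] is the amplitude of |0> for Q-bit i; |alpha|^2 + |beta|^2 = 1."""
            p_zero = alpha ** 2
            return (rng.random(alpha.shape) >= p_zero).astype(int)   # 1 with prob beta^2

        def random_key_to_permutation(keys):
            """Decode a real-valued random-key vector into a job order by argsort."""
            return np.argsort(keys)

        n_jobs = 5
        alpha = np.full(n_jobs, 1 / np.sqrt(2))     # uniform superposition per Q-bit
        bits = observe_qbits(alpha)
        keys = bits + rng.random(n_jobs)            # one option: observed bits bias the keys
        print(bits, random_key_to_permutation(keys))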

  16. Research on parallel algorithm for sequential pattern mining

    NASA Astrophysics Data System (ADS)

    Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao

    2008-03-01

    Sequential pattern mining is the mining of frequent sequences related to time or other orders from a sequence database. Its initial motivation was to discover the laws of customer purchasing over a time period by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field is no longer confined to business databases; it has extended to new data sources such as the Web and advanced scientific fields such as DNA analysis. The data in sequential pattern mining are characterized by massive volume and distributed storage, and most existing sequential pattern mining algorithms do not consider these characteristics together. Motivated by these traits and combining them with parallel theory, this paper puts forward a new distributed parallel algorithm, SPP (Sequential Pattern Parallel). The algorithm follows the principle of pattern reduction and uses the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets by applying the frequent concept and search-space partition theory, and the second task is to build frequent sequences using a depth-first search at each processor. The algorithm needs to access the database only twice and does not generate candidate sequences, which reduces access time and improves mining efficiency. Based on a random data generation procedure and different information structures, this paper simulates the SPP algorithm in a concrete parallel environment and implements the AprioriAll algorithm for comparison. The experiments demonstrate that, compared with AprioriAll, the SPP algorithm achieves excellent speedup and efficiency.
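
    A minimal Python sketch of the partition-and-merge idea behind such distributed mining (only the item-level counting step, not the full SPP algorithm): each worker counts support locally on its share of the sequence database, and local counts are merged to find globally frequent items.

        from collections import Counter
        from multiprocessing import Pool

        def local_counts(sequences):
            """Count, per partition, in how many sequences each item occurs."""
            c = Counter()
            for seq in sequences:
                c.update(set(seq))          # support = number of sequences, not events
            return c

        if __name__ == "__main__":
            db = [["a", "b"], ["a", "c"], ["b", "c", "a"], ["c"]]
            partitions = [db[:2], db[2:]]
            with Pool(2) as pool:
                merged = sum(pool.map(local_counts, partitions), Counter())
            min_support = 2
            print({item: n for item, n in merged.items() if n >= min_support})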

  17. Parallel/distributed direct method for solving linear systems

    NASA Technical Reports Server (NTRS)

    Lin, Avi

    1990-01-01

    A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit near-optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate parallel algorithm is insensitive to the number of processors, as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmic changes.

  18. A good performance watermarking LDPC code used in high-speed optical fiber communication system

    NASA Astrophysics Data System (ADS)

    Zhang, Wenbo; Li, Chao; Zhang, Xiaoguang; Xi, Lixia; Tang, Xianfeng; He, Wenxue

    2015-07-01

    A watermarking LDPC code, a strategy designed to improve the performance of the traditional LDPC code, is introduced. By inserting pre-defined watermarking bits into the original LDPC code, we can obtain a more accurate estimate of the noise level in the fiber channel. We then use this estimate to modify the probability distribution function (PDF) used in the initialization of the belief propagation (BP) decoding algorithm. This scheme was tested in a 128 Gb/s PDM-DQPSK optical communication system, and the results show that the watermarking LDPC code has better tolerance to polarization mode dispersion (PMD) and nonlinearity than the traditional LDPC code. In addition, at the cost of about 2.4% of redundancy for the watermarking bits, the decoding efficiency of the watermarking LDPC code is about twice that of the traditional one.
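
    A minimal Python sketch of the watermarking idea under an assumed BPSK/AWGN model (not the paper's system): known pilot bits let the receiver estimate the noise level, and that estimate initializes the LLRs fed to BP decoding.

        import numpy as np

        def estimate_sigma(received_pilots, pilot_bits):
            """received_pilots: noisy samples; pilot_bits: the known 0/1 values."""
            expected = 1.0 - 2.0 * np.asarray(pilot_bits)     # bit 0 -> +1, bit 1 -> -1
            return np.std(np.asarray(received_pilots) - expected)

        def initial_llrs(received, sigma):
            """Channel LLRs for BPSK over AWGN: L = 2*y / sigma^2."""
            return 2.0 * np.asarray(received) / (sigma ** 2)

        rng = np.random.default_rng(1)
        pilot_bits = rng.integers(0, 2, 64)
        true_sigma = 0.4
        rx_pilots = (1 - 2 * pilot_bits) + true_sigma * rng.normal(size=64)
        sigma_hat = estimate_sigma(rx_pilots, pilot_bits)
        print(sigma_hat, initial_llrs(rx_pilots[:4], sigma_hat))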

  19. Quantum Associative Neural Network with Nonlinear Search Algorithm

    NASA Astrophysics Data System (ADS)

    Zhou, Rigui; Wang, Huian; Wu, Qian; Shi, Yang

    2012-03-01

    Based on an analysis of the properties of quantum linear superposition, and to overcome the complexity of the existing quantum associative memory proposed by Ventura, a new storage method for multiple patterns is proposed in this paper by constructing the quantum array with binary decision diagrams. The adoption of a nonlinear search algorithm also increases the pattern-recall speed of this multiple-pattern model to O(log2 2^(n-t)) = O(n - t) time complexity, where n is the number of quantum bits and t is the quantum information of the t quantum bits. Results of a case analysis show that the associative neural network model proposed in this paper, based on quantum learning, is much better and more optimized than other researchers' counterparts, both in avoiding additional qubits or extraordinary initial operators and in storing patterns and improving recall speed.

  20. SIAM Conference on Parallel Processing for Scientific Computing, 4th, Chicago, IL, Dec. 11-13, 1989, Proceedings

    NASA Technical Reports Server (NTRS)

    Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)

    1990-01-01

    Attention is given to such topics as an evaluation of block algorithm variants in LAPACK, a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM-2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.

  1. On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.

    PubMed

    Chen, Chunlei; He, Li; Zhang, Huixiang; Zheng, Hao; Wang, Lei

    2017-01-01

    Incremental clustering algorithms play a vital role in applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering place high demands on the computing power of the hardware platform. Parallel computing is a common solution to meet this demand, and the General-Purpose Graphics Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, incremental clustering algorithms face a dilemma between clustering accuracy and parallelism when they are powered by GPGPUs. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering, such as evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity, and analyzes the upper and lower bounds of different-to-same mis-affiliation; fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity; smaller work-depth means superior parallelism. Through the proofs, we conclude that the accuracy of an incremental clustering algorithm is negatively related to evolving granularity, while parallelism is positively related to the granularity. These contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm, and experimental results verified the theoretical conclusions.

  2. Cascade Error Projection with Low Bit Weight Quantization for High Order Correlation Data

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Daud, Taher

    1998-01-01

    In this paper, we reinvestigate the chaotic time series prediction problem using a neural network approach. The nature of this problem is such that the data sequences never repeat but rather lie in a chaotic region; however, the past, present, and future data are correlated in high order. We use the Cascade Error Projection (CEP) learning algorithm to capture the high-order correlation between past and present data in order to predict future data under limited weight quantization constraints, which helps provide better and more timely estimates for intelligent control systems. In earlier work, it was shown that CEP can learn the 5-8 bit parity problem with 4 or more bits of weight quantization, and a color segmentation problem with 7 or more bits. In this paper, we demonstrate that chaotic time series can be learned and generalized well with as little as 4-bit weight quantization using round-off and truncation techniques. The results show that generalization suffers less as more bits of weight quantization become available, and that error surfaces obtained with the round-off technique are more symmetric around zero than those obtained with truncation. This study suggests that CEP is a learning technique suitable for hardware implementation.
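
    A minimal Python sketch of the two weight-quantization schemes compared above, round-off versus truncation at a given bit width (the fixed-point scaling used here is an assumption for illustration):

        import numpy as np

        def quantize(weights, n_bits, mode="round"):
            """Quantize weights in [-1, 1) to n_bits using round-off or truncation."""
            levels = 2 ** (n_bits - 1)                 # signed fixed-point scale
            scaled = np.asarray(weights) * levels
            q = np.round(scaled) if mode == "round" else np.trunc(scaled)
            return np.clip(q, -levels, levels - 1) / levels

        w = np.array([0.37, -0.62, 0.05, -0.99])
        print(quantize(w, 4, "round"))      # 4-bit weights, as in the chaotic-series test
        print(quantize(w, 4, "trunc"))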

  3. Application of integration algorithms in a parallel processing environment for the simulation of jet engines

    NASA Technical Reports Server (NTRS)

    Krosel, S. M.; Milner, E. J.

    1982-01-01

    The application of predictor-corrector integration algorithms developed for the digital parallel processing environment is investigated. The algorithms are implemented and evaluated through the use of a software simulator that provides an approximate representation of the parallel processing hardware. Test cases which focus on the use of the algorithms are presented, and a specific application using a linear model of a turbofan engine is considered. Results are presented showing the effects of integration step size and the number of processors on simulation accuracy. Real-time performance, interprocessor communication, and algorithm startup are also discussed.
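
    A minimal Python sketch of a one-step predictor-corrector integrator (explicit Euler predictor, trapezoidal corrector); the specific algorithms evaluated in the paper may differ, and the parallel distribution of predictor and corrector evaluations across processors is omitted here:

        def predictor_corrector_step(f, t, y, h):
            y_pred = y + h * f(t, y)                            # predictor (Euler)
            return y + 0.5 * h * (f(t, y) + f(t + h, y_pred))   # corrector (trapezoid)

        # Example: dy/dt = -2y, y(0) = 1, integrated to t = 1.
        f = lambda t, y: -2.0 * y
        t, y, h = 0.0, 1.0, 0.01
        while t < 1.0 - 1e-12:
            y = predictor_corrector_step(f, t, y, h)
            t += h
        print(y)   # close to exp(-2) ~= 0.1353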

  4. Efficient Parallel Algorithm For Direct Numerical Simulation of Turbulent Flows

    NASA Technical Reports Server (NTRS)

    Moitra, Stuti; Gatski, Thomas B.

    1997-01-01

    A distributed algorithm for a high-order-accurate finite-difference approach to the direct numerical simulation (DNS) of transition and turbulence in compressible flows is described. This work has two major objectives. The first objective is to demonstrate that parallel and distributed-memory machines can be successfully and efficiently used to solve computationally intensive and input/output intensive algorithms of the DNS class. The second objective is to show that the computational complexity involved in solving the tridiagonal systems inherent in the DNS algorithm can be reduced by algorithm innovations that obviate the need to use a parallelized tridiagonal solver.

  5. Efficiency Analysis of the Parallel Implementation of the SIMPLE Algorithm on Multiprocessor Computers

    NASA Astrophysics Data System (ADS)

    Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.

    2017-12-01

    This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
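
    A minimal serial Python sketch of the fictitious-cell (ghost-cell) exchange idea described above, with the interprocessor MPI calls replaced by plain array copies: each subdomain keeps an extra layer of cells mirroring its neighbour's boundary values, refreshed before every iteration.

        import numpy as np

        def exchange_ghost_cells(left, right):
            """left/right: 1-D subdomain arrays whose first/last entries are ghosts."""
            left[-1] = right[1]     # left's right ghost <- right's first interior cell
            right[0] = left[-2]     # right's left ghost <- left's last interior cell

        # Two subdomains of a 1-D field, each with one ghost cell at both ends.
        left = np.array([0.0, 1.0, 2.0, 3.0, 0.0])
        right = np.array([0.0, 4.0, 5.0, 6.0, 0.0])
        exchange_ghost_cells(left, right)
        print(left, right)   # left[-1] == 4.0, right[0] == 3.0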

  6. MULTIOBJECTIVE PARALLEL GENETIC ALGORITHM FOR WASTE MINIMIZATION

    EPA Science Inventory

    In this research we have developed an efficient multiobjective parallel genetic algorithm (MOPGA) for waste minimization problems. This MOPGA integrates PGAPack (Levine, 1996) and NSGA-II (Deb, 2000) with novel modifications. PGAPack is a master-slave parallel implementation of a...

  7. Design Considerations for a Computationally-Lightweight Authentication Mechanism for Passive RFID Tags

    DTIC Science & Technology

    2009-09-01

    suffer the power and complexity requirements of a public key system. In [18], a simulation of the SHA-1 algorithm is performed on a Xilinx FPGA ... 256 bits. Thus, the construction of a hash table would need 2^512 independent comparisons. It is known that hash collisions of the SHA-1 algorithm... SHA-1 algorithm for small-core FPGA design. Small-core FPGA design is the process by which a circuit is adapted to use the minimal amount of logic

  8. A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.

    PubMed

    Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang

    2018-01-01

    The strategy of Divide-and-Conquer (D&C) is one of the most frequently used programming patterns for designing efficient algorithms in computer science, and it has been parallelized on both shared-memory and distributed-memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the well-known D&C algorithm QuickHull, as a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers develop their own efficient GPU implementations with less effort.
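
    A minimal serial Python sketch of the QuickHull divide-and-conquer recursion whose split/recurse structure the generic D&C paradigm parallelizes (this is not the authors' GPU code):

        def cross(o, a, b):
            return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

        def hull_side(pts, a, b):
            """Hull points strictly left of segment a->b, in order."""
            left = [p for p in pts if cross(a, b, p) > 0]
            if not left:
                return []
            far = max(left, key=lambda p: cross(a, b, p))      # farthest point = pivot
            # Conquer the two sub-problems on either side of the pivot.
            return hull_side(left, a, far) + [far] + hull_side(left, far, b)

        def quickhull(points):
            pts = sorted(set(points))
            if len(pts) < 3:
                return pts
            a, b = pts[0], pts[-1]                              # leftmost, rightmost
            return [a] + hull_side(pts, a, b) + [b] + hull_side(pts, b, a)

        print(quickhull([(0, 0), (2, 0), (1, 1), (1, 3), (2, 2), (0, 2)]))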

  9. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Davis, Kristan D.; Faraj, Daniel A.

    2014-07-22

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

  10. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Davis, Kristan D; Faraj, Daniel A

    2013-07-09

    Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
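
    A minimal Python sketch of the selection rule described in this abstract: each data communications algorithm is associated with a disjoint range of message sizes, and the instruction's message size picks the algorithm (the names and ranges below are illustrative, not PAMI API calls):

        ALGORITHM_RANGES = [
            ((0, 512), "eager_short"),
            ((513, 65536), "eager_long"),
            ((65537, float("inf")), "rendezvous"),
        ]

        def select_algorithm(message_size):
            for (low, high), algorithm in ALGORITHM_RANGES:
                if low <= message_size <= high:
                    return algorithm
            raise ValueError("no algorithm registered for this message size")

        print(select_algorithm(100), select_algorithm(1 << 20))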

  11. Graphical Representation of Parallel Algorithmic Processes

    DTIC Science & Technology

    1990-12-01

    interface with the AAARF main process. The source code for the AAARF class-common library is in the common subdirectory and consists of the following files... for public release; distribution unlimited AFIT/GCE/ENG/90D-07 Graphical Representation of Parallel Algorithmic Processes THESIS Presented to the... goal of this study is to develop an algorithm animation facility for parallel processes executing on different architectures, from multiprocessor

  12. The openGL visualization of the 2D parallel FDTD algorithm

    NASA Astrophysics Data System (ADS)

    Walendziuk, Wojciech

    2005-02-01

    This paper presents a way of visualizing a two-dimensional version of a parallel FDTD algorithm. The visualization module was created on the basis of the OpenGL graphics standard with the use of the GLUT interface. In addition, the work includes results on the efficiency of the parallel algorithm in the form of speedup charts.

  13. Investigation of system integration methods for bubble domain flight recorders

    NASA Technical Reports Server (NTRS)

    Chen, T. T.; Bohning, O. D.

    1975-01-01

    System integration methods for bubble domain flight recorders are investigated. Bubble memory module packaging and assembly, the control electronics design and construction, field coils, and the permanent magnet bias structure design are studied. A small 60-kbit engineering model was built and tested to demonstrate the feasibility of the bubble recorder. Based on the various studies performed, a projection is made for a 50,000,000-bit prototype recorder. It is estimated that the recorder will occupy 190 cubic in., weigh 12 lb, and consume 12 W of power when all four of its tracks are operated in parallel at a 150 kHz data rate.

  14. NbN A/D Conversion of IR Focal Plane Sensor Signal at 10 K

    NASA Technical Reports Server (NTRS)

    Eaton, L.; Durand, D.; Sandell, R.; Spargo, J.; Krabach, T.

    1994-01-01

    We are implementing a 12-bit SFQ counting ADC with parallel-to-serial readout using our established 10 K NbN capability. This circuit provides a key element of the analog signal processor (ASP) used in large infrared focal plane arrays. The circuit processes the signal data stream from a Si:As BIB detector array. A 10 mega-samples-per-second (MSPS) pixel data stream flows from the chip at a 120 Mbit/s rate in a format that is compatible with other superconductive time dependent processor (TDP) circuits being developed. We will discuss our planned ASP demonstration, the circuit design, and test results.

  15. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  16. Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform.

    PubMed

    Cao, Jianfang; Chen, Lichao; Wang, Min; Tian, Yun

    2018-01-01

    The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper we propose a parallel design and implementation of an Otsu-optimized Canny operator using the MapReduce parallel programming model running on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to address the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach speeds up processing by approximately 3.4 times on large-scale datasets, which demonstrates its clear superiority. The proposed algorithm demonstrates both better edge detection performance and improved time performance.
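
    A minimal Python sketch of the Otsu step used to pick the Canny thresholds automatically, choosing the gray level that maximizes the between-class variance of the histogram:

        import numpy as np

        def otsu_threshold(gray):
            """gray: array of 8-bit intensities; returns the optimal threshold."""
            hist = np.bincount(gray.ravel(), minlength=256).astype(float)
            prob = hist / hist.sum()
            best_t, best_var = 0, -1.0
            for t in range(1, 256):
                w0, w1 = prob[:t].sum(), prob[t:].sum()
                if w0 == 0 or w1 == 0:
                    continue
                mu0 = (np.arange(t) * prob[:t]).sum() / w0
                mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
                var_between = w0 * w1 * (mu0 - mu1) ** 2
                if var_between > best_var:
                    best_t, best_var = t, var_between
            return best_t

        # Synthetic bimodal image: the threshold should land between the two modes.
        img = np.concatenate([np.random.normal(60, 10, 5000),
                              np.random.normal(180, 10, 5000)]).clip(0, 255).astype(np.uint8)
        print(otsu_threshold(img))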

  17. Reducing weight precision of convolutional neural networks towards large-scale on-chip image recognition

    NASA Astrophysics Data System (ADS)

    Ji, Zhengping; Ovsiannikov, Ilia; Wang, Yibing; Shi, Lilong; Zhang, Qiang

    2015-05-01

    In this paper, we develop a server-client quantization scheme to reduce the bit resolution of a deep learning architecture, i.e., Convolutional Neural Networks, for image recognition tasks. Low bit resolution is an important factor in bringing deep learning neural networks into hardware implementation, as it directly determines cost and power consumption. We aim to reduce the bit resolution of the network without sacrificing its performance. To this end, we design a new quantization algorithm called supervised iterative quantization to reduce the bit resolution of learned network weights. In the training stage, supervised iterative quantization is conducted via two steps on the server: apply k-means based adaptive quantization to the learned network weights, and retrain the network based on the quantized weights. These two steps are alternated until the convergence criterion is met. In the testing stage, the network configuration and low-bit weights are loaded onto the client hardware device to recognize incoming input in real time, where optimized but expensive quantization becomes infeasible. Considering this, we adopt uniform quantization for the inputs and internal network responses (called feature maps) to keep on-chip costs low. The Convolutional Neural Network with reduced weight and input/response precision is demonstrated on two types of images: hand-written digit images and real-life images of office scenarios. Both results show that the new network is able to achieve the performance of the neural network with full bit resolution, even though the bit resolution of both weights and inputs is significantly reduced, e.g., from 64 bits to 4-5 bits.
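
    A minimal Python sketch of the k-means half of the supervised iterative quantization loop (the retraining half is omitted); with 2^n centroids this yields an n-bit representation of the weights. The initialization and iteration count are assumptions for illustration.

        import numpy as np

        def kmeans_quantize(weights, n_bits, n_iter=20):
            w = np.asarray(weights, dtype=float).ravel()
            k = 2 ** n_bits
            centroids = np.linspace(w.min(), w.max(), k)          # simple initialization
            for _ in range(n_iter):
                labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
                for j in range(k):
                    if np.any(labels == j):
                        centroids[j] = w[labels == j].mean()       # move centroid to cluster mean
            return centroids[labels].reshape(np.shape(weights)), centroids

        rng = np.random.default_rng(0)
        w = rng.normal(0.0, 0.05, size=(8, 8))                    # a small weight matrix
        w_q, codebook = kmeans_quantize(w, n_bits=4)
        print(np.abs(w - w_q).max(), codebook.size)               # small error, 16 levels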

  18. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. The third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125 MHz. At each clock cycle, 128K multiply-and-add operations are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units.
The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.

  19. A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor

    NASA Technical Reports Server (NTRS)

    Rao, Hariprasad Nannapaneni

    1989-01-01

    The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event-driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors in parallel, for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.

  20. Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank

    NASA Astrophysics Data System (ADS)

    Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

    2014-05-01

    Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency-independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth and fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration. Such an increase in bandwidth is achieved without use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same Fclk clock frequency without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.
