NASA Astrophysics Data System (ADS)
Åkerman, Björn
1997-04-01
DNA orientation measurements by linear dichroism (LD) spectroscopy and single molecule imaging by fluorescence microscopy are used to investigate the effect of DNA size (71-740 kilobase pairs) and field strength E (1-5.9 V/cm) on the conformation dynamics during the field-driven threading of DNA molecules through a set of parallel pores in agarose gels, with average pore radii between 380 Å and 1400 Å. Locally relaxed but globally oriented DNA molecules are subjected to a perpendicular field, and the observed LD time profile is compared with a recent theory for the threading [D. Long and J.-L. Viovy, Phys. Rev. E 53, 803 (1996)] which assumes the same initial state. As predicted the DNA is driven by the ends into a U-form, leading to an overshoot in the LD. The overshoot time scales as E^-(1.2-1.4) as predicted, but grows more slowly with DNA size than the predicted linear dependence. For long molecules loops form initially in the threading process but are finally consumed by the ends, and the process of transfer of DNA segments, from the loops to the arms of the U, leads to a shoulder in the LD as predicted. The critical size below which loops do not form (as indicated by the LD shoulder being absent) is between 71 and 105 kbp (0.5% agarose, 5.9 V/cm), and considerably larger than predicted because in the initial state the DNA molecules are housed in gel cavities with effective pore sizes about four times larger than the average pore size. From the data, the separation of DNA by exploiting the threading dynamics in pulsed fields [D. Long et al., CR Acad. Sci. Paris, Ser. IIb 321, 239 (1995)] is shown to be feasible in principle in an agarose-based system.
NASA Astrophysics Data System (ADS)
Roche-Lima, Abiel; Thulasiram, Ruppa K.
2012-02-01
Finite automata in which each transition is augmented with an output label in addition to the familiar input label are called finite-state transducers. Transducers have been used to analyze some fundamental issues in bioinformatics. Weighted finite-state transducers have been proposed for pairwise alignment of DNA and protein sequences, as well as for developing kernels for computational biology. Machine learning algorithms for conditional transducers have been implemented and used for DNA sequence analysis. Transducer learning algorithms are based on computing conditional probabilities, using techniques such as pair-database creation, normalization (with Maximum-Likelihood normalization) and parameter optimization (with Expectation-Maximization, EM). These techniques are intrinsically costly to compute, and even more so when applied to bioinformatics, because the databases are large. In this work, we describe a parallel implementation of an algorithm to learn conditional transducers using these techniques. The algorithm is oriented to bioinformatics applications, such as alignments, phylogenetic trees, and other genome evolution studies. Several experiments were run using the parallel and sequential algorithms on Westgrid (specifically, on the Breezy cluster). The results show that our parallel algorithm is scalable, because execution times are reduced considerably when the data size parameter is increased. In another experiment the precision parameter was changed; in this case too, the parallel algorithm gave smaller execution times. Finally, the number of threads used to execute the parallel algorithm on the Breezy cluster was varied. In this last experiment, speedup increased considerably as more threads were used; however, it levels off for 16 or more threads.
Constant time worker thread allocation via configuration caching
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eichenberger, Alexandre E; O'Brien, John K. P.
Mechanisms are provided for allocating threads for execution of a parallel region of code. A request for allocation of worker threads to execute the parallel region of code is received from a master thread. Cached thread allocation information identifying prior thread allocations that have been performed for the master thread is accessed. Worker threads are allocated to the master thread based on the cached thread allocation information. The parallel region of code is executed using the allocated worker threads.
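The caching idea in this abstract can be illustrated with a minimal Python sketch. It is not the mechanism from the record itself: the class name `WorkerAllocationCache`, the cache key (master id plus requested worker count), and the use of `ThreadPoolExecutor` as the "allocation" are all assumptions made for illustration. A repeated request from the same master thread is served by a dictionary lookup (average constant time) instead of a fresh allocation.

```python
from concurrent.futures import ThreadPoolExecutor


class WorkerAllocationCache:
    """Hypothetical cache of prior worker allocations, keyed per master thread."""

    def __init__(self):
        self._pools = {}   # (master_id, num_workers) -> ThreadPoolExecutor
        self.hits = 0      # requests served from the cache
        self.misses = 0    # requests that needed a new allocation

    def allocate(self, master_id, num_workers):
        """Return a worker pool, reusing a cached allocation when one exists."""
        key = (master_id, num_workers)
        pool = self._pools.get(key)
        if pool is None:
            self.misses += 1
            pool = ThreadPoolExecutor(max_workers=num_workers)
            self._pools[key] = pool
        else:
            self.hits += 1
        return pool

    def shutdown(self):
        """Release every cached pool."""
        for pool in self._pools.values():
            pool.shutdown(wait=True)
        self._pools.clear()


def run_parallel_region(cache, master_id, num_workers, work, items):
    """Execute `work` over `items` using the (possibly cached) worker pool."""
    pool = cache.allocate(master_id, num_workers)
    return list(pool.map(work, items))
```

Calling `run_parallel_region` twice with the same `master_id` and `num_workers` records one miss followed by one hit, which is the reuse pattern the abstract describes.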
Multi-threading: A new dimension to massively parallel scientific computation
NASA Astrophysics Data System (ADS)
Nielsen, Ida M. B.; Janssen, Curtis L.
2000-06-01
Multi-threading is becoming widely available for Unix-like operating systems, and the application of multi-threading opens new ways for performing parallel computations with greater efficiency. We here briefly discuss the principles of multi-threading and illustrate the application of multi-threading for a massively parallel direct four-index transformation of electron repulsion integrals. Finally, other potential applications of multi-threading in scientific computing are outlined.
Thread concept for automatic task parallelization in image analysis
NASA Astrophysics Data System (ADS)
Lueckenhaus, Maximilian; Eckstein, Wolfgang
1998-09-01
Parallel processing of image analysis tasks is an essential method to speed up image processing and helps to exploit the full capacity of distributed systems. However, writing parallel code is a difficult and time-consuming process and often leads to an architecture-dependent program that has to be re-implemented when changing the hardware. Therefore it is highly desirable to do the parallelization automatically. For this we have developed a special kind of thread concept for image analysis tasks. Threads derived from one subtask may share objects and run in the same context but may process different threads of execution and work on different data in parallel. In this paper we describe the basics of our thread concept and show how it can be used as the basis of an automatic task parallelization to speed up image processing. We further illustrate the design and implementation of an agent-based system that uses image analysis threads for generating and processing parallel programs by taking into account the available hardware. The tests made with our system prototype show that the thread concept combined with the agent paradigm is suitable to speed up image processing by an automatic parallelization of image analysis tasks.
Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce.
Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan
2017-01-01
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm
NASA Astrophysics Data System (ADS)
Backes, Werner; Wetzel, Susanne
In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schnorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread version and about 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
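To put the reported speed-up factors in context, the short Python sketch below converts them to parallel efficiency (speed-up divided by thread count) and to the Karp-Flatt experimentally determined serial fraction. The abstract reports neither derived quantity; the function names are illustrative, and the inputs are only the approximate factors quoted above.

```python
def parallel_efficiency(speedup, threads):
    """Efficiency = speed-up / number of threads (1.0 would be ideal scaling)."""
    return speedup / threads


def karp_flatt_serial_fraction(speedup, threads):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), defined for p > 1."""
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


# Approximate speed-up factors quoted in the abstract: (threads, speed-up).
reported = [(2, 1.8), (4, 3.2)]

for threads, speedup in reported:
    eff = parallel_efficiency(speedup, threads)
    frac = karp_flatt_serial_fraction(speedup, threads)
    print(f"{threads} threads: efficiency {eff:.2f}, serial fraction {frac:.3f}")
```

Under these figures the efficiency is about 0.9 with two threads and about 0.8 with four.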
IOPA: I/O-aware parallelism adaption for parallel programs
Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei
2017-01-01
With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads. PMID:28278236
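The trade-off described here (too many I/O threads saturate the I/O sub-system, too few leave resources idle) can be sketched as a simple search over thread counts. The code below is not IOPA's mechanism or programming interface, which the abstract does not detail; it is a generic doubling search, and `tune_io_threads`, `min_gain`, and the synthetic throughput curve are assumptions made for illustration.

```python
def tune_io_threads(measure_throughput, max_threads, min_gain=0.05):
    """Double the thread count while throughput improves by at least `min_gain`.

    `measure_throughput(n)` is a caller-supplied function (hypothetical here)
    that returns the observed I/O throughput when running with n threads.
    """
    best_n = 1
    best_tp = measure_throughput(1)
    n = 2
    while n <= max_threads:
        tp = measure_throughput(n)
        if tp < best_tp * (1.0 + min_gain):
            break  # no meaningful gain: the I/O sub-system is saturated
        best_n, best_tp = n, tp
        n *= 2
    return best_n


def synthetic_throughput(n, saturation=8):
    """Toy model: linear gains up to `saturation` threads, then contention."""
    if n <= saturation:
        return float(n)
    return saturation - 0.5 * (n - saturation)


chosen = tune_io_threads(synthetic_throughput, max_threads=64)
print("chosen I/O thread count:", chosen)
```

In a real setting `measure_throughput` would time actual reads or writes on the target storage device, so the chosen count would differ between solid state and hard disk drives.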
On the utility of threads for data parallel programming
NASA Technical Reports Server (NTRS)
Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush
1995-01-01
Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily through the use of overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.
Ropes: Support for collective operations among distributed threads
NASA Technical Reports Server (NTRS)
Haines, Matthew; Mehrotra, Piyush; Cronk, David
1995-01-01
Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous control structures in applications and language implementations. Recently, systems have been designed and implemented to support interprocessor communication between lightweight threads so that threads can be exploited in a distributed memory system. Their use, in this setting, has been largely restricted to supporting latency hiding techniques and functional parallelism within a single application. However, to execute data parallel codes independent of other threads in the system, collective operations and relative indexing among threads are required. This paper describes the design of ropes: a scoping mechanism for collective operations and relative indexing among threads. We present the design of ropes in the context of the Chant system, and provide performance results evaluating our initial design decisions.
Clark, Andrew G; Naufer, M Nabuan; Westerlund, Fredrik; Lincoln, Per; Rouzina, Ioulia; Paramanathan, Thayaparan; Williams, Mark C
2018-02-06
Molecules that bind DNA via threading intercalation show high binding affinity as well as slow dissociation kinetics, properties ideal for the development of anticancer drugs. To this end, it is critical to identify the specific molecular characteristics of threading intercalators that result in optimal DNA interactions. Using single-molecule techniques, we quantify the binding of a small metal-organic ruthenium threading intercalator (Δ,Δ-B) and compare its binding characteristics to a similar molecule with significantly larger threading moieties (Δ,Δ-P). The binding affinities of the two molecules are the same, while comparison of the binding kinetics reveals significantly faster kinetics for Δ,Δ-B. However, the kinetics is still much slower than that observed for conventional intercalators. Comparison of the two threading intercalators shows that the binding affinity is modulated independently by the intercalating section and the binding kinetics is modulated by the threading moiety. In order to thread DNA, Δ,Δ-P requires a "lock mechanism", in which a large length increase of the DNA duplex is required for both association and dissociation. In contrast, measurements of the force-dependent binding kinetics show that Δ,Δ-B requires a large DNA length increase for association but no length increase for dissociation from DNA. This contrasts strongly with conventional intercalators, for which almost no DNA length change is required for association but a large DNA length change must occur for dissociation. This result illustrates the fundamentally different mechanism of threading intercalation compared with conventional intercalation and will pave the way for the rational design of therapeutic drugs based on DNA threading intercalation.
NASA Astrophysics Data System (ADS)
Lohn, Stefan B.; Dong, Xin; Carminati, Federico
2012-12-01
Chip-Multiprocessors are going to support massive parallelism by many additional physical and logical cores. Improving performance can no longer be obtained by increasing clock-frequency because the technical limits are almost reached. Instead, parallel execution must be used to gain performance. Resources like main memory, the cache hierarchy, bandwidth of the memory bus or links between cores and sockets are not going to be improved as fast. Hence, parallelism can only result in performance gains if the memory usage is optimized and the communication between threads is minimized. Besides, concurrent programming has become a domain for experts. Implementing multi-threading is error-prone and labor-intensive. A full reimplementation of the whole AliRoot source code is unaffordable. This paper describes the effort to evaluate the adaption of AliRoot to the needs of multi-threading and to provide the capability of parallel processing by using a semi-automatic source-to-source transformation to address the problems described above and to provide a straightforward way of parallelization with almost no interference between threads. This makes the approach simple and reduces the required manual changes in the code. In a first step, unconditional thread-safety will be introduced to bring the original sequential and thread-unaware source code into the position of utilizing multi-threading. Afterwards further investigations have to be performed to point out candidates of classes that are useful to share amongst threads. Then in a second step, the transformation has to change the code to share these classes and finally to verify whether there are any more invalid interferences between threads.
A C++ Thread Package for Concurrent and Parallel Programming
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jie Chen; William Watson
1999-11-01
Recently thread libraries have become a common entity on various operating systems such as Unix, Windows NT and VxWorks. Those thread libraries offer significant performance enhancement by allowing applications to use multiple threads running either concurrently or in parallel on multiprocessors. However, the incompatibilities between native libraries introduce challenges for those who wish to develop portable applications.
Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sarje, Abhinav; Jacobsen, Douglas W.; Williams, Samuel W.
The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading and our experiences with it with respect to a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers.
Advanced Numerical Techniques of Performance Evaluation. Volume 1
1990-06-01
system scheduling thread. The scheduling thread then runs any other ready thread that can be found. A thread can only sleep or switch out on itself...Polychronopoulos and D.J. Kuck. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers C...Kuck 1987] C.D. Polychronopoulos and D.J. Kuck. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Trans. on Comp
Parallel Implementation of 3-D Iterative Reconstruction With Intra-Thread Update for the jPET-D4
NASA Astrophysics Data System (ADS)
Lam, Chih Fung; Yamaya, Taiga; Obi, Takashi; Yoshida, Eiji; Inadama, Naoko; Shibuya, Kengo; Nishikido, Fumihiko; Murayama, Hideo
2009-02-01
One way to speed-up iterative image reconstruction is by parallel computing with a computer cluster. However, as the number of computing threads increases, parallel efficiency decreases due to network transfer delay. In this paper, we proposed a method to reduce data transfer between computing threads by introducing an intra-thread update. The update factor is collected from each slave thread and a global image is updated as usual in the first K sub-iteration. In the rest of the sub-iterations, the global image is only updated at an interval which is controlled by a parameter L. In between that interval, the intra-thread update is carried out whereby an image update is performed in each slave thread locally. We investigated combinations of K and L parameters based on parallel implementation of RAMLA for the jPET-D4 scanner. Our evaluation used four workstations with a total of 16 slave threads. Each slave thread calculated a different set of LORs which are divided according to ring difference numbers. We assessed image quality of the proposed method with a hotspot simulation phantom. The figure of merit was the full-width-half-maximum of hotspots and the background normalized standard deviation. At an optimum K and L setting, we did not find significant change in the output images. We also applied the proposed method to a Hoffman phantom experiment and found the difference due to intra-thread update was negligible. With the intra-thread update, computation time could be reduced by about 23%.
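The role of the K and L parameters can be made concrete with a small scheduling sketch in Python. It only labels each sub-iteration as a global update (update factors collected from all slave threads) or an intra-thread update (local to each slave thread). Placing the global update at the end of every L-length interval is an assumption; the abstract states only that the interval is controlled by L, and the function name is illustrative.

```python
def update_schedule(num_subiterations, K, L):
    """Label each sub-iteration as a 'global' or 'local' (intra-thread) update.

    The first K sub-iterations update the global image as usual. After that,
    a global update happens once every L sub-iterations (assumed here to be
    the last one of each interval); the others are intra-thread updates.
    """
    if L < 1:
        raise ValueError("L must be at least 1")
    schedule = []
    for i in range(num_subiterations):
        if i < K or (i - K) % L == L - 1:
            schedule.append("global")
        else:
            schedule.append("local")
    return schedule


# Example: 8 sub-iterations, K = 2 initial global updates, interval L = 3.
plan = update_schedule(8, K=2, L=3)
print(plan)
print("global synchronizations:", plan.count("global"))
```

Fewer `global` entries mean fewer transfers between computing threads, which is the source of the time saving the abstract reports; the image-quality trade-off has to be checked separately, as the authors did with phantom data.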
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
DOE Office of Scientific and Technical Information (OSTI.GOV)
Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neighboring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.
Efficient Thread Labeling for Monitoring Programs with Nested Parallelism
NASA Astrophysics Data System (ADS)
Ha, Ok-Kyoon; Kim, Sun-Sook; Jun, Yong-Kee
It is difficult and cumbersome to detect data races that occur in an execution of a parallel program. Any on-the-fly race detection technique using Lamport's happened-before relation needs a thread labeling scheme for generating unique identifiers which maintain logical concurrency information for the parallel threads. NR labeling is an efficient thread labeling scheme for the fork-join program model with nested parallelism, because its efficiency depends only on the nesting depth for every fork and join operation. This paper presents an improved NR labeling, called e-NR labeling, in which every thread generates its label by inheriting the pointer to its ancestor list from the parent threads or by updating the pointer in a constant amount of time and space. This labeling is more efficient than the NR labeling, because its efficiency does not depend on the nesting depth for every fork and join operation. Some experiments were performed with OpenMP programs having nesting depths of three or four and maximum parallelisms varying from 10,000 to 1,000,000. The results show that e-NR is 5 times faster than NR labeling and 4.3 times faster than OS labeling in the average time for creating and maintaining the thread labels. In average space required for labeling, it is 3.5 times smaller than NR labeling and 3 times smaller than OS labeling.
A Review of Lightweight Thread Approaches for High Performance Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castello, Adrian; Pena, Antonio J.; Seo, Sangmin
High-level, directive-based solutions are becoming the programming models (PMs) of the multi/many-core architectures. Several solutions relying on operating system (OS) threads work perfectly well with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massive parallel architectures and thus conventional OS threads are too heavy for that purpose. Several lightweight thread (LWT) libraries have recently appeared offering lighter mechanisms to tackle massive concurrency. In order to examine the suitability of LWTs in high-level runtimes, we develop a set of microbenchmarks consisting of commonly found patterns in current parallel codes. Moreover, we study the semantics offered by some LWT libraries in order to expose the similarities between different LWT application programming interfaces. This study reveals that a reduced set of LWT functions can be sufficient to cover the common parallel code patterns and that those LWT libraries perform better than OS threads-based solutions in cases where task and nested parallelism are becoming more popular with new architectures.
Topical perspective on massive threading and parallelism.
Farber, Robert M
2011-09-01
Unquestionably computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three orders of magnitude increased computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive threading and massive parallelism, such as the 1.3 million threads Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world--is how to capitalize on these new massively threaded computational architectures, especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelism with some insight into the future. Published by Elsevier Inc.
NASA Astrophysics Data System (ADS)
Baregheh, Mandana; Mezentsev, Vladimir; Schmitz, Holger
2011-06-01
We describe a parallel multi-threaded approach for high performance modelling of a wide class of phenomena in ultrafast nonlinear optics. A specific implementation has been performed using the highly parallel capabilities of a programmable graphics processor.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Earl, Christopher; Might, Matthew; Bagusetty, Abhishek; ...
2016-01-26
This study presents Nebo, a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena on multiple architectures. Application programmers use Nebo to write code that appears sequential but can be run in parallel, without editing the code. Currently Nebo supports single-thread execution, multi-thread execution, and many-core (GPU-based) execution. With single-thread execution, Nebo performs on par with code written by domain experts. With multi-thread execution, Nebo can linearly scale (with roughly 90% efficiency) up to 12 cores, compared to its single-thread execution. Moreover, Nebo’s many-core execution can be over 140x faster than its single-thread execution.
P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.
Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang
2017-03-14
A growing number of studies use whole genome DNA methylation detection, one of the most important parts of epigenetics research, to find significant relationships between DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping the bisulfite treated sequence to the whole genome has been the main method to study DNA cytosine methylation. However, most of today's relevant tools suffer from inaccuracy and long run times. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve the problem. Using an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt can analyze and predict the DNA methylation status. But when Hint-Hunt predicts DNA methylation status on a large-scale dataset, it still suffers from slow speed and low temporal-spatial efficiency. To address the cost of Smith-Waterman dynamic programming and the low temporal-spatial efficiency, we further designed a deeply parallelized whole genome DNA methylation detection tool ("P-Hint-Hunt") on the Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up for processing large-scale datasets, and it can run on both CPUs and Intel Xeon Phi coprocessors. Moreover, we deploy and evaluate Hint-Hunt and P-Hint-Hunt on the TH-2 supercomputer at different scales. The experimental results show that our tools eliminate the deviation caused by bisulfite treatment in the mapping procedure and that the multi-level parallel program yields a 48 times speed-up with 64 threads. P-Hint-Hunt achieves deep acceleration on the CPU and Intel Xeon Phi heterogeneous platform, making full use of the advantages of multi-core (CPU) and many-core (Phi) processors.
Does Simultaneous Liposuction Adversely Affect the Outcome of Thread Lifts? A Preliminary Result.
Lee, Yong Woo; Park, Tae Hwan
2018-04-11
Along with advances in thread lift techniques and materials, ancillary procedures such as fat grafting, liposuction, or filler injections have been performed simultaneously. Some surgeons think that these ancillary procedures might affect the aesthetic outcomes of thread lifting possibly due to inadvertent injury to threads or loosening of soft tissue via passing the cannula in the surgical plane of the thread lifts. The purpose of the current study is to determine the effect of such ancillary procedures on the outcome of thread lifts in the human and cadaveric setting. We used human abdominal tissue after abdominoplasty and cadaveric faces. In the abdominal tissue, liposuction parallel to the parallel axis was performed in one area for 5 min. We counted 30 passes when liposuction was performed in one direction. This was repeated as we changed the direction of passages. The plane of thread lifts (dermal vs subcutaneous) and angle between liposuction and thread lifts (parallel vs perpendicular) were differentiated in this abdominal tissue study group. Then, we performed parallel or perpendicular thread lifts using a small slit incision. Using a tensiometer, the maximum holding strength was measured when pulling the thread out of the skin as much as possible. We also used faces of cadavers to prove whether the finding in human abdominal tissue is really valid with corresponding techniques. Our pilot study using abdominal tissue showed that liposuction after thread lifts adversely affects it regardless of the vector of thread lifts. In the cadaveric study, however, liposuction prior to thread lifting does not significantly affect the holding strength of thread lifts. Liposuction or fat grafting in the appropriate layer would not be a hurdle to safely performing simultaneous thread lifts if the target lift tissue is intra-SMAS or just above the SMAS layer. This journal requires that authors assign a level of evidence to each article. 
Final report on EURAMET.L-S21: `Supplementary comparison of parallel thread gauges'
NASA Astrophysics Data System (ADS)
Mudronja, Vedran; Šimunovic, Vedran; Acko, Bojan; Matus, Michael; Bánréti, Edit; István, Dicso; Thalmann, Rudolf; Lassila, Antti; Lillepea, Lauri; Bartolo Picotto, Gian; Bellotti, Roberto; Pometto, Marco; Ganioglu, Okhan; Meral, Ilker; Salgado, José Antonio; Georges, Vailleau
2015-01-01
The results of a comparison of parallel thread gauges among ten European countries are presented. Three thread plugs and three thread rings were calibrated in one loop. The Croatian National Laboratory for Length (HMI/FSB-LPMD) acted as the coordinator and pilot laboratory of the comparison. Thread angle, thread pitch, simple pitch diameter and pitch diameter were measured. Pitch diameters were calibrated within the 1a, 2a, 1b and 2b calibration categories in accordance with the EURAMET cg-10 calibration guide. The good agreement between the measurement results, and the differences due to the different calibration categories, are analysed in this paper. This was the first EURAMET comparison of parallel thread gauges based on the EURAMET cg-10 calibration guide, and it is a step towards harmonizing future comparisons with the registration of CMC values for thread gauges. The text of the final report is that which appears in Appendix B of the BIPM key comparison database kcdb.bipm.org/. The final report has been peer-reviewed and approved for publication by the CCL, according to the provisions of the CIPM Mutual Recognition Arrangement (CIPM MRA).
Ke, Yuwen; Huh, Jae-Wan; Warrington, Ross; Li, Bing; Wu, Nan; Leng, Mei; Zhang, Junmei; Ball, Haydn L; Li, Bing; Yu, Hongtao
2011-01-01
Centromeres nucleate the formation of kinetochores and are vital for chromosome segregation during mitosis. The SNF2 family helicase PICH (Plk1-interacting checkpoint helicase) and the BLM (the Bloom's syndrome protein) helicase decorate ultrafine histone-negative DNA threads that link the segregating sister centromeres during anaphase. The functions of PICH and BLM at these threads are not understood, however. Here, we show that PICH binds to BLM and enables BLM localization to anaphase centromeric threads. PICH- or BLM-RNAi cells fail to resolve these threads in anaphase. The fragmented threads form centromeric-chromatin-containing micronuclei in daughter cells. Anaphase threads in PICH- and BLM-RNAi cells contain histones and centromere markers. Recombinant purified PICH has nucleosome remodelling activities in vitro. We propose that PICH and BLM unravel centromeric chromatin and keep anaphase DNA threads mostly free of nucleosomes, thus allowing these threads to span long distances between rapidly segregating centromeres without breakage and providing a spatiotemporal window for their resolution. PMID:21743438
Banerjee, T; Banerjee, S; Sett, S; Ghosh, S; Rakshit, T; Mukhopadhyay, R
2016-01-01
DNA threading intercalators are a unique class of intercalating agents, albeit little biophysical information is available on their intercalative actions. Herein, the intercalative effects of nogalamycin, which is a naturally-occurring DNA threading intercalator, have been investigated by high-resolution atomic force microscopy (AFM) and spectroscopy (AFS). The results have been compared with those of the well-known chemotherapeutic drug daunomycin, which is a non-threading classical intercalator bearing structural similarity to nogalamycin. A comparative AFM assessment revealed a greater increase in DNA contour length over the entire incubation period of 48 h for nogalamycin treatment, whereas the contour length increase manifested faster in case of daunomycin. The elastic response of single DNA molecules to an externally applied force was investigated by the single molecule AFS approach. Characteristic mechanical fingerprints in the overstretching behaviour clearly distinguished the nogalamycin/daunomycin-treated dsDNA from untreated dsDNA-the former appearing less elastic than the latter, and the nogalamycin-treated DNA distinguished from the daunomycin-treated DNA-the classically intercalated dsDNA appearing the least elastic. A single molecule AFS-based discrimination of threading intercalation from the classical type is being reported for the first time.
NASA Astrophysics Data System (ADS)
Liao, S.; Chen, L.; Li, J.; Xiong, W.; Wu, Q.
2015-07-01
Existing spatiotemporal databases support spatiotemporal aggregation queries over massive moving-object datasets. Because of the large data volumes and single-threaded processing, query speed cannot meet application requirements. In addition, query efficiency is more sensitive to spatial variation than to temporal variation. In this paper, we propose a spatiotemporal aggregation query method that uses multi-thread parallel processing based on regional division, and we implement it on the server. Concretely, we divide the spatiotemporal domain into several spatiotemporal cubes, compute the spatiotemporal aggregation on all cubes using multi-thread parallel processing, and then integrate the query results. Tests and analysis on real datasets show that this method improves query speed significantly.
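The divide, aggregate and integrate steps described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the function names, the cube sizes `cell` and `span`, and the choice of a simple count as the aggregate are all hypothetical and are not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def cube_key(point, cell, span):
    """Map an (x, y, t) sample to the index of its spatiotemporal cube."""
    x, y, t = point
    return (int(x // cell), int(y // cell), int(t // span))


def partition(points, cell, span):
    """Group samples by cube so that each cube can be processed independently."""
    cubes = {}
    for p in points:
        cubes.setdefault(cube_key(p, cell, span), []).append(p)
    return cubes


def count_in_cube(samples, query):
    """Count the samples of one cube that fall inside the query box."""
    (x0, x1), (y0, y1), (t0, t1) = query
    return sum(1 for x, y, t in samples
               if x0 <= x < x1 and y0 <= y < y1 and t0 <= t < t1)


def aggregate(points, query, cell=10.0, span=60.0, workers=4):
    """Aggregate per cube on a thread pool, then integrate the partial results."""
    cubes = partition(points, cell, span)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda samples: count_in_cube(samples, query),
                            cubes.values())
        return sum(partials)
```

In CPython the global interpreter lock limits the speedup of CPU-bound threads, so this sketch shows the structure of the method rather than its performance.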
Payne, Andrew C; Andregg, Michael; Kemmish, Kent; Hamalainen, Mark; Bowell, Charlotte; Bleloch, Andrew; Klejwa, Nathan; Lehrach, Wolfgang; Schatz, Ken; Stark, Heather; Marblestone, Adam; Church, George; Own, Christopher S; Andregg, William
2013-01-01
We present "molecular threading", a surface independent tip-based method for stretching and depositing single and double-stranded DNA molecules. DNA is stretched into air at a liquid-air interface, and can be subsequently deposited onto a dry substrate isolated from solution. The design of an apparatus used for molecular threading is presented, and fluorescence and electron microscopies are used to characterize the angular distribution, straightness, and reproducibility of stretched DNA deposited in arrays onto elastomeric surfaces and thin membranes. Molecular threading demonstrates high straightness and uniformity over length scales from nanometers to micrometers, and represents an alternative to existing DNA deposition and linearization methods. These results point towards scalable and high-throughput precision manipulation of single-molecule polymers.
NASA Astrophysics Data System (ADS)
Handhika, T.; Bustamam, A.; Ernastuti, Kerami, D.
2017-07-01
Multi-thread programming using OpenMP on a shared-memory architecture with hyperthreading technology allows a resource to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, the speedup depends on the processor's ability to execute only a limited number of threads, especially for a sequential algorithm that contains a nested loop in which the number of outer loop iterations is greater than the maximum number of threads that a processor can execute. The thread distribution technique found previously can only be applied by high-level programmers. This paper presents a parallelization procedure for low-level programmers dealing with 2-level nested loop problems in which the maximum number of threads that a processor can execute is smaller than the number of outer loop iterations. Data preprocessing, which concerns the numbers of outer and inner loop iterations, the computational time required to execute each iteration, and the maximum number of threads that a processor can execute, is used as a strategy to determine which parallel region will produce the optimal speedup.
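To make the idea of choosing a parallel region concrete, the sketch below compares two options for a 2-level nested loop with a deliberately simple cost model. The model is an assumption made for illustration (uniform iteration cost, no fork/join or scheduling overhead) and is not the procedure from the paper; the function names are hypothetical.

```python
import math


def estimated_time(outer, inner, t_iter, threads, level):
    """Estimated run time when only the given loop level is parallelized."""
    if level == "outer":
        # Each thread runs whole outer iterations, ceil(outer / threads) of them.
        return math.ceil(outer / threads) * inner * t_iter
    if level == "inner":
        # The outer loop stays sequential; each inner loop is split over threads.
        return outer * math.ceil(inner / threads) * t_iter
    raise ValueError("level must be 'outer' or 'inner'")


def best_level(outer, inner, t_iter, threads):
    """Return the loop level with the lowest estimated time and its speedup."""
    serial = outer * inner * t_iter
    options = {level: estimated_time(outer, inner, t_iter, threads, level)
               for level in ("outer", "inner")}
    level = min(options, key=options.get)
    return level, serial / options[level]
```

With 9 outer iterations, 8 inner iterations and 8 threads, the model prefers the inner loop: the outer loop needs two rounds of threads, while the inner loop divides evenly.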
Ho, ThienLuan; Oh, Seung-Rohk
2017-01-01
Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results for real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over the sequential algorithm on a CPU and the previous parallel approximate string matching algorithm on GPUs, respectively. PMID:29016700
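For readers unfamiliar with the k-differences problem, a sequential reference may help. The sketch below is the classic column-wise dynamic program (often attributed to Sellers), not the GPU algorithm of the paper; the function name is hypothetical.

```python
def k_difference_matches(pattern, text, k):
    """Return 1-based end positions in text where the pattern matches
    some substring with at most k insertions, deletions or substitutions."""
    m = len(pattern)
    # column[i] = best edit distance between pattern[:i] and any substring
    # of the text that ends at the current position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = column[0]   # value of row i-1 in the previous column
        column[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,   # substitution or match
                      column[i] + 1,      # extra character in the text
                      column[i - 1] + 1)  # missing pattern character
            prev_diag = column[i]
            column[i] = new
        if column[m] <= k:
            ends.append(j)
    return ends
```

The running time is proportional to the product of the pattern and text lengths, which is the cost that parallel GPU versions aim to reduce.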
Sett, S.; Ghosh, S.; Rakshit, T.; Mukhopadhyay, R.
2016-01-01
DNA threading intercalators are a unique class of intercalating agents, albeit little biophysical information is available on their intercalative actions. Herein, the intercalative effects of nogalamycin, which is a naturally-occurring DNA threading intercalator, have been investigated by high-resolution atomic force microscopy (AFM) and spectroscopy (AFS). The results have been compared with those of the well-known chemotherapeutic drug daunomycin, which is a non-threading classical intercalator bearing structural similarity to nogalamycin. A comparative AFM assessment revealed a greater increase in DNA contour length over the entire incubation period of 48 h for nogalamycin treatment, whereas the contour length increase manifested faster in case of daunomycin. The elastic response of single DNA molecules to an externally applied force was investigated by the single molecule AFS approach. Characteristic mechanical fingerprints in the overstretching behaviour clearly distinguished the nogalamycin/daunomycin-treated dsDNA from untreated dsDNA—the former appearing less elastic than the latter, and the nogalamycin-treated DNA distinguished from the daunomycin-treated DNA—the classically intercalated dsDNA appearing the least elastic. A single molecule AFS-based discrimination of threading intercalation from the classical type is being reported for the first time. PMID:27183010
DOE Office of Scientific and Technical Information (OSTI.GOV)
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.
Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
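The wait-and-wake pattern described here can be illustrated with a condition variable. The `Context` class and its `post` and `advance` methods below are hypothetical names chosen for this sketch; they are not the PAMI API and only show the general idea of an advance function that sleeps when no event is pending.

```python
import threading
from collections import deque


class Context:
    """Toy communication context with a queue of pending events."""

    def __init__(self):
        self._events = deque()
        self._cond = threading.Condition()

    def post(self, event):
        """Record a new event and wake a thread waiting in advance()."""
        with self._cond:
            self._events.append(event)
            self._cond.notify()

    def advance(self, handler, timeout=None):
        """Process one event; if none is pending, wait until one arrives.

        Returns False if the timeout expires with no event to process.
        """
        with self._cond:
            if not self._cond.wait_for(lambda: self._events, timeout=timeout):
                return False
            event = self._events.popleft()
        handler(event)  # run the handler outside the lock
        return True
```

A worker thread calling `advance` consumes no CPU time while it waits, which is the point of placing the thread into a wait state instead of polling.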
Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou
2012-01-01
Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy is found by the searching procedure, the correct protein structure is not guaranteed to be obtained. A general parallel metaheuristic approach is presented to tackle these two problems. Multiple energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed while the searching threads are running in parallel, with each thread searching for the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. Sixteen classical instances were tested to show that the parallel approach is competitive for solving the PSP problem. This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligences embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions, which are usually derived from domain expertise.
Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou
2012-01-01
Background Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy is found by the searching procedure, the correct protein structure is not guaranteed to be obtained. Results A general parallel metaheuristic approach is presented to tackle these two problems. Multiple energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed while the searching threads are running in parallel, with each thread searching for the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. Sixteen classical instances were tested to show that the parallel approach is competitive for solving the PSP problem. Conclusions This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligences embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions, which are usually derived from domain expertise. PMID:23028708
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-10-22
Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
Multi-Threaded DNA Tag/Anti-Tag Library Generator for Multi-Core Platforms
2009-05-01
... base pair) Watson-Crick strand pairs that bind perfectly within pairs, but poorly across pairs. A variety of DNA strand hybridization metrics ... Report AFRL-RI-RS-TR-2009-131, Final Technical Report, May 2009. Dates covered: Jun 08 – Feb 09.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak
1999-01-01
The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
GPU COMPUTING FOR PARTICLE TRACKING
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna
2011-03-25
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performances, issues, and challenges from introducing GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) brings massive parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to optimize existing applications well. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time consuming part of the accelerator modeling and simulation, can benefit from GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multi-processors (MP) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using GPU to speed up the dynamic aperture calculation by having each thread track a particle.
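The grid, block and thread hierarchy, with one particle per thread, can be emulated on a CPU to show how each thread finds its particle. The sketch below is a plain Python emulation with a toy one-dimensional map; the names `track_particle` and `launch`, the `kick` factor and the `aperture` value are assumptions for illustration and are unrelated to TracyGPU or Tracy++.

```python
def track_particle(x, turns, aperture=1.0, kick=1.05):
    """Toy tracking map: return the number of turns the particle survives."""
    for turn in range(turns):
        x *= kick
        if abs(x) > aperture:
            return turn       # lost on this turn
    return turns              # survived every turn


def launch(starts, turns, block_dim=2):
    """Emulate a kernel launch in which each thread tracks one particle."""
    n = len(starts)
    grid_dim = (n + block_dim - 1) // block_dim   # blocks needed to cover n
    survived = [None] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # global thread index
            if i < n:                               # guard the last block
                survived[i] = track_particle(starts[i], turns)
    return survived
```

On a real GPU the two loops disappear: every (block, thread) pair runs concurrently and computes the same global index, which is why independent particles map so naturally onto threads.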
Multi-thread parallel algorithm for reconstructing 3D large-scale porous structures
NASA Astrophysics Data System (ADS)
Ju, Yang; Huang, Yaohui; Zheng, Jiangtao; Qian, Xu; Xie, Heping; Zhao, Xi
2017-04-01
Geomaterials inherently contain many discontinuous, multi-scale, geometrically irregular pores, forming a complex porous structure that governs their mechanical and transport properties. The development of an efficient reconstruction method for representing porous structures can significantly contribute toward providing a better understanding of the governing effects of porous structures on the properties of porous materials. In order to improve the efficiency of reconstructing large-scale porous structures, a multi-thread parallel scheme was incorporated into the simulated annealing reconstruction method. In the method, four correlation functions, which include the two-point probability function, the linear-path functions for the pore phase and the solid phase, and the fractal system function for the solid phase, were employed for better reproduction of the complex well-connected porous structures. In addition, a random sphere packing method and a self-developed pre-conditioning method were incorporated to cast the initial reconstructed model and select independent interchanging pairs for parallel multi-thread calculation, respectively. The accuracy of the proposed algorithm was evaluated by examining the similarity between the reconstructed structure and a prototype in terms of their geometrical, topological, and mechanical properties. Comparisons of the reconstruction efficiency of porous models with various scales indicated that the parallel multi-thread scheme significantly shortened the execution time for reconstruction of a large-scale well-connected porous model compared to a sequential single-thread procedure.
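Of the correlation functions named in this abstract, the two-point probability function is the simplest to illustrate. The sketch below estimates it along the row direction of a small binary image, where 1 marks a pore pixel. The function name, the single sampling direction and the non-periodic boundaries are assumptions made for this example, not details from the paper.

```python
def two_point_probability(image, r):
    """Estimate the probability that two pixels a distance r apart
    along a row both belong to the pore phase (value 1)."""
    hits = 0
    total = 0
    for row in image:
        for x in range(len(row) - r):
            total += 1
            if row[x] == 1 and row[x + r] == 1:
                hits += 1
    return hits / total if total else 0.0
```

At r = 0 the function reduces to the porosity. A simulated annealing reconstruction swaps pixels to drive such statistics of the model towards those of the prototype.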
Vectorization for Molecular Dynamics on Intel Xeon Phi Coprocessors
NASA Astrophysics Data System (ADS)
Yi, Hongsuk
2014-03-01
Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512-bit vector registers for high performance computing. In this paper, we develop a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Tersoff potential for covalently bonded solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism, combining tightly coupled thread-level and task-level parallelism with 512-bit vector registers. The simulation results show that the parallel performance of the SIMD implementations on Xeon Phi is apparently superior to that on the x86 CPU architecture.
SMT-Aware Instantaneous Footprint Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Roy, Probir; Liu, Xu; Song, Shuaiwen
Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the whole memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging, because they usually spawn threads within Single Program Multiple Data (SPMD) models. To address this important issue, we introduce a simple scheme for SMT-aware code optimization, which aims to reduce the memory contention across SMT threads.
Anti-parallel EUV Flows Observed along Active Region Filament Threads with Hi-C
NASA Astrophysics Data System (ADS)
Alexander, Caroline E.; Walsh, Robert W.; Régnier, Stéphane; Cirtain, Jonathan; Winebarger, Amy R.; Golub, Leon; Kobayashi, Ken; Platt, Simon; Mitchell, Nick; Korreck, Kelly; DePontieu, Bart; DeForest, Craig; Weber, Mark; Title, Alan; Kuzin, Sergey
2013-09-01
Plasma flows within prominences/filaments have been observed for many years and hold valuable clues concerning the mass and energy balance within these structures. Previous observations of these flows primarily come from Hα and cool extreme-ultraviolet (EUV) lines (e.g., 304 Å) where estimates of the size of the prominence threads have been limited by the resolution of the available instrumentation. Evidence of "counter-streaming" flows has previously been inferred from these cool plasma observations, but now, for the first time, these flows have been directly imaged along fundamental filament threads within the million degree corona (at 193 Å). In this work, we present observations of an AR filament observed with the High-resolution Coronal Imager (Hi-C) that exhibits anti-parallel flows along adjacent filament threads. Complementary data from the Solar Dynamics Observatory (SDO)/Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager are presented. The ultra-high spatial and temporal resolution of Hi-C allows the anti-parallel flow velocities to be measured (70-80 km s⁻¹) and gives an indication of the resolvable thickness of the individual strands (0.''8 ± 0.''1). The temperature of the plasma flows was estimated to be log T (K) = 5.45 ± 0.10 using Emission Measure loci analysis. We find that SDO/AIA cannot clearly observe these anti-parallel flows or measure their velocity or thread width due to its larger pixel size. We suggest that anti-parallel/counter-streaming flows are likely commonplace within all filaments and are currently not observed in EUV due to current instrument spatial resolution.
Organizing Compression of Hyperspectral Imagery to Allow Efficient Parallel Decompression
NASA Technical Reports Server (NTRS)
Klimesh, Matthew A.; Kiely, Aaron B.
2014-01-01
A family of schemes has been devised for organizing the output of an algorithm for predictive data compression of hyperspectral imagery so as to allow efficient parallelization in both the compressor and decompressor. In these schemes, the compressor performs a number of iterations, during each of which a portion of the data is compressed via parallel threads operating on independent portions of the data. The general idea is that for each iteration it is predetermined how much compressed data will be produced from each thread.
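The abstract does not give the actual output layout, so the sketch below substitutes a simpler, generic idea that serves the same goal: independent chunks are compressed in parallel, and the size of each compressed chunk is recorded so that a decompressor can locate every chunk without scanning. It uses the standard `zlib` module rather than a predictive hyperspectral compressor, and all function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_parallel(data, n_chunks=4):
    """Compress independent chunks in parallel; return (lengths, blob)."""
    size = max(1, -(-len(data) // n_chunks))   # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        packed = list(pool.map(zlib.compress, chunks))
    return [len(p) for p in packed], b"".join(packed)


def decompress_parallel(lengths, blob):
    """Use the recorded lengths to slice the blob and decompress in parallel."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    pieces = [blob[a:b] for a, b in zip(offsets, offsets[1:])]
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, pieces))
```

Knowing where each thread's output starts is what lets the decompressor assign chunks to threads immediately. In the schemes of the abstract this is achieved by fixing each thread's output size in advance, whereas this sketch stores the sizes after the fact.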
Dissecting the Dynamic Pathways of Stereoselective DNA Threading Intercalation
Almaqwashi, Ali A.; Andersson, Johanna; Lincoln, Per; Rouzina, Ioulia; Westerlund, Fredrik; Williams, Mark C.
2016-01-01
DNA intercalators that have high affinity and slow kinetics are developed for potential DNA-targeted therapeutics. Although many natural intercalators contain multiple chiral subunits, only intercalators with a single chiral unit have been quantitatively probed. Dumbbell-shaped DNA threading intercalators represent the next order of structural complexity relative to simple intercalators, and can provide significant insights into the stereoselectivity of DNA-ligand intercalation. We investigated DNA threading intercalation by binuclear ruthenium complex [μ-dppzip(phen)4Ru2]4+ (Piz). Four Piz stereoisomers are defined by the chirality of the intercalating subunit (Ru(phen)2dppz) and the distal subunit (Ru(phen)2ip), respectively, each of which can be either right-handed (Δ) or left-handed (Λ). We used optical tweezers to measure single DNA molecule elongation due to threading intercalation, revealing force-dependent DNA intercalation rates and equilibrium dissociation constants. The force spectroscopy analysis provided the zero-force DNA binding affinity, the equilibrium DNA-ligand elongation Δxeq, and the dynamic DNA structural deformations during ligand association xon and dissociation xoff. We found that Piz stereoisomers exhibit over 20-fold differences in DNA binding affinity, from a Kd of 27 ± 3 nM for (Δ,Λ)-Piz to a Kd of 622 ± 55 nM for (Λ,Δ)-Piz. The striking affinity decrease is correlated with increasing Δxeq from 0.30 ± 0.02 to 0.48 ± 0.02 nm and xon from 0.25 ± 0.01 to 0.46 ± 0.02 nm, but limited xoff changes. Notably, the affinity and threading kinetics is 10-fold enhanced for right-handed intercalating subunits, and 2- to 5-fold enhanced for left-handed distal subunits. These findings demonstrate sterically dispersed transition pathways and robust DNA structural recognition of chiral intercalators, which are critical for optimizing DNA binding affinity and kinetics. PMID:27028636
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gosink, Luke; Wu, Kesheng; Bethel, E. Wes
2009-06-02
The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture; for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).
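The bin-then-examine idea behind a bin-based index can be sketched sequentially as follows. This is a generic illustration, not the DP-BIS data layout or encoding: the function names and the `edges` parameter are assumptions, and the per-record checks that DP-BIS assigns to separate threads are written here as ordinary loops.

```python
import bisect


def build_index(values, edges):
    """Place each record in a bin; bin b holds edges[b-1] <= v < edges[b]."""
    bins = [[] for _ in range(len(edges) + 1)]
    for rid, v in enumerate(values):
        bins[bisect.bisect_right(edges, v)].append((rid, v))
    return bins


def range_query(bins, edges, lo, hi):
    """Return the ids of records with lo <= value < hi."""
    first = bisect.bisect_right(edges, lo)   # bin that contains lo
    last = bisect.bisect_right(edges, hi)    # bin that contains hi
    hits = []
    for b in range(first, last + 1):
        if first < b < last:
            # Interior bin: every record qualifies, no value check needed.
            hits.extend(rid for rid, _ in bins[b])
        else:
            # Boundary bin: records are candidates and must be checked.
            hits.extend(rid for rid, v in bins[b] if lo <= v < hi)
    return sorted(hits)
```

Only the two boundary bins require reading the stored values, which is why a binned index can touch far less data than a scan of the base column.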
ANTI-PARALLEL EUV FLOWS OBSERVED ALONG ACTIVE REGION FILAMENT THREADS WITH HI-C
DOE Office of Scientific and Technical Information (OSTI.GOV)
Alexander, Caroline E.; Walsh, Robert W.; Régnier, Stéphane
Plasma flows within prominences/filaments have been observed for many years and hold valuable clues concerning the mass and energy balance within these structures. Previous observations of these flows primarily come from Hα and cool extreme-ultraviolet (EUV) lines (e.g., 304 Å) where estimates of the size of the prominence threads have been limited by the resolution of the available instrumentation. Evidence of 'counter-streaming' flows has previously been inferred from these cool plasma observations, but now, for the first time, these flows have been directly imaged along fundamental filament threads within the million degree corona (at 193 Å). In this work, we present observations of an AR filament observed with the High-resolution Coronal Imager (Hi-C) that exhibits anti-parallel flows along adjacent filament threads. Complementary data from the Solar Dynamics Observatory (SDO)/Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager are presented. The ultra-high spatial and temporal resolution of Hi-C allows the anti-parallel flow velocities to be measured (70-80 km s⁻¹) and gives an indication of the resolvable thickness of the individual strands (0.''8 ± 0.''1). The temperature of the plasma flows was estimated to be log T (K) = 5.45 ± 0.10 using Emission Measure loci analysis. We find that SDO/AIA cannot clearly observe these anti-parallel flows or measure their velocity or thread width due to its larger pixel size. We suggest that anti-parallel/counter-streaming flows are likely commonplace within all filaments and are currently not observed in EUV due to current instrument spatial resolution.
Final report for the Tera Computer TTI CRADA
DOE Office of Scientific and Technical Information (OSTI.GOV)
Davidson, G.S.; Pavlakos, C.; Silva, C.
1997-01-01
Tera Computer and Sandia National Laboratories have completed a CRADA, which examined the Tera Multi-Threaded Architecture (MTA) for use with large codes of importance to industry and DOE. The MTA is an innovative architecture that uses parallelism to mask latency between memories and processors. The physical implementation is a parallel computer with high cross-section bandwidth and GaAs processors designed by Tera, which support many small computation threads and fast, lightweight context switches between them. When any thread blocks while waiting for memory accesses to complete, another thread immediately begins execution so that high CPU utilization is maintained. The Tera MTA parallel computer has a single, global address space, which is appealing when porting existing applications to a parallel computer. This ease of porting is further enabled by compiler technology that helps break computations into parallel threads. DOE and Sandia National Laboratories were interested in working with Tera to further develop this computing concept. While Tera Computer would continue the hardware development and compiler research, Sandia National Laboratories would work with Tera to ensure that their compilers worked well with important Sandia codes, most particularly CTH, a shock physics code used for weapon safety computations. In addition to that important code, Sandia National Laboratories would complete research on a robotic path planning code, SANDROS, which is important in manufacturing applications, and would evaluate the MTA performance on this code. Finally, Sandia would work directly with Tera to develop 3D visualization codes, which would be appropriate for use with the MTA. Each of these tasks has been completed to the extent possible, given that Tera has just completed the MTA hardware. All of the CRADA work had to be done on simulators.
Expressing Parallelism with ROOT
NASA Astrophysics Data System (ADS)
Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.
2017-10-01
The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.
Parallel fast multipole boundary element method applied to computational homogenization
NASA Astrophysics Data System (ADS)
Ptaszny, Jacek
2018-01-01
In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for 3D elasticity problems are developed and applied to the computational homogenization of a solid containing spherical voids. The system of equations is solved by using the GMRES iterative solver. The boundary of the body is discretized by using quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of the number of threads is examined.
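The partitioning assumption in this abstract (disjoint sets of equal or approximately equal size, one per thread) can be illustrated with a minimal Python sketch. This is not the paper's OpenMP code; the function name `partition_evenly` and the integer node ids are hypothetical.

```python
def partition_evenly(nodes, n_threads):
    """Split nodes into n_threads disjoint, contiguous chunks.

    Chunk sizes differ by at most one element, so every thread gets
    an equal or approximately equal share of the work.
    """
    base, extra = divmod(len(nodes), n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        # The first `extra` chunks take one additional node each.
        size = base + (1 if t < extra else 0)
        chunks.append(nodes[start:start + size])
        start += size
    return chunks


tree_nodes = list(range(10))  # hypothetical tree-node ids
chunks = partition_evenly(tree_nodes, 4)
```

With 10 nodes and 4 threads the chunk sizes come out as 3, 3, 2 and 2, and concatenating the chunks reproduces the original node list.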
Block-Parallel Data Analysis with DIY2
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morozov, Dmitriy; Peterka, Tom
DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial, parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.
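The block-structured idea (decompose, assign blocks to threads, iterate, then combine) can be sketched in a few lines of Python. The sketch below does not use the DIY2 API; `decompose`, `local_sum` and `block_parallel_sum` are hypothetical names, and a plain sum stands in for a reduction-style communication pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, n_blocks):
    """Cut data into contiguous blocks (the last one may be shorter)."""
    size = -(-len(data) // n_blocks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def local_sum(block):
    """Per-block computation; it only touches its own block."""
    return sum(block)


def block_parallel_sum(data, n_blocks, n_threads):
    blocks = decompose(data, n_blocks)
    # The same block-level code runs with one thread or with many.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(local_sum, blocks))
    return sum(partials)  # stand-in for a "reduce" pattern


data = list(range(100))
total_single = block_parallel_sum(data, n_blocks=8, n_threads=1)
total_multi = block_parallel_sum(data, n_blocks=8, n_threads=4)
```

Because the computation is written per block, changing `n_threads` changes only how the blocks are scheduled, not the program or its result.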
Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin A.
In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphic Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different levels of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp level (one matrix, one Warp) parallelism and shared memory. The second works at Thread-block level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread level parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and only relies on the high memory bandwidth of the GPU. The first and second solutions only support partial pivoting; the third one easily supports partial and full pivoting, making it attractive to problems that require greater numerical stability. We analyze the trade-offs in terms of performance and power consumption as a function of the size of the linear systems that are simultaneously solved. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).
NASA Technical Reports Server (NTRS)
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.
2013-01-01
The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTM architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
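Two ideas from this abstract, drawing jobs from a global queue and translating data identifiers to home processor IDs, can be sketched in Python. This is not the MTTM interface: `home_processor` uses a modulo rule purely as an assumed example of a user-defined homing rule, `run_jobs` is a hypothetical name, and Python threads are not actually migrated or pinned to processors here.

```python
import queue
import threading


def home_processor(data_id, n_processors):
    """Hypothetical homing rule: map a data identifier to a processor ID."""
    return data_id % n_processors


def run_jobs(jobs, n_threads):
    """Load-balance jobs by letting idle workers pull from one global queue."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            value = job * job  # stand-in for the real task
            with lock:
                results[job] = value

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_jobs(range(20), n_threads=4)
```

A faster worker simply takes more jobs from the queue, which is the load-balancing effect; the number of threads is chosen independently of the number of processors passed to the homing rule.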
CMS event processing multi-core efficiency status
NASA Astrophysics Data System (ADS)
Jones, C. D.; CMS Collaboration
2017-10-01
In 2015, CMS was the first LHC experiment to begin using a multi-threaded framework for doing event processing. This new framework utilizes Intel’s Threading Building Blocks library to manage concurrency via a task based processing model. During the 2015 LHC run period, CMS only ran reconstruction jobs using multiple threads because only those jobs were sufficiently thread efficient. Recent work now allows simulation and digitization to be thread efficient. In addition, during 2015 the multi-threaded framework could run events in parallel but could only use one thread per event. Work done in 2016 now allows multiple threads to be used while processing one event. In this presentation we will show how these recent changes have improved CMS’s overall threading and memory efficiency and we will discuss work to be done to further increase those efficiencies.
Real-time SHVC software decoding with multi-threaded parallel processing
NASA Astrophysics Data System (ADS)
Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu
2014-09-01
This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are designed as a multi-threaded pipeline based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7-2600 processor running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads is compared in terms of decoding speed and resource usage, including processor and memory.
Fast parallel algorithm for slicing STL based on pipeline
NASA Astrophysics Data System (ADS)
Ma, Xulong; Lin, Feng; Yao, Bo
2016-05-01
In the additive manufacturing field, current research on data processing focuses mainly on slicing large STL files or complicated CAD models. A parallel algorithm offers clear advantages for improving efficiency and reducing slicing time, but traditional algorithms cannot make full use of multi-core CPU hardware. In this paper, a fast parallel algorithm based on a pipeline mode is presented to speed up data processing, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that both are significant factors in the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. A second parallel algorithm based on data parallelism is used in the experiments to show that the pipeline parallel mode is more efficient. Finally, a case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of the multi-core CPU hardware and accelerates the slicing process; compared with the data parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
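For reference, the two scaling laws named in this abstract can be written as short Python functions. The parallel fraction of 0.9 used below is an assumed illustrative value, not a figure from the paper.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: fixed problem size, only the parallel part speeds up."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


def gustafson_speedup(parallel_fraction, n_threads):
    """Gustafson's law: the parallel part of the work grows with the thread count."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * n_threads


p = 0.9  # assumed parallel fraction, for illustration only
amdahl_4 = amdahl_speedup(p, 4)        # about 3.08
gustafson_4 = gustafson_speedup(p, 4)  # about 3.7
```

Under Amdahl's law the speedup can never exceed 1 / (1 - p), which is 10 for p = 0.9, whereas the Gustafson estimate keeps growing linearly with the number of threads because the problem size (here, the number of layers) is assumed to grow as well.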
Split off-specular reflection and surface scattering from woven materials
NASA Astrophysics Data System (ADS)
Pont, Sylvia C.; Koenderink, Jan J.
2003-03-01
We measured radiance distributions for black lining cloth and copper gauze using the convenient technique of wrapping the materials around a circular cylinder, irradiating it with a parallel light source and collecting the scattered radiance by a digital camera. One family of parallel threads (weave or weft) was parallel to the cylinder generator. The most salient features for such glossy plane weaves are a splitting up of the reflection peak due to the wavy variations in local slopes of the threads around the cylinders and a surface scattering lobe due to the threads that run along the cylinder. These scattering characteristics are quite different from the (off-)specular peaks and lobes that were found before for random rough specular surfaces. The split off-specular reflection is due to the regular structures in our samples of man-made materials. We derived simple approximations for these reflectance characteristics using geometrical optics.
Gara, Alan; Ohmacht, Martin
2014-09-16
In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write-through, the corresponding line is deleted from the first level cache and/or prefetch unit, so that any further accesses to the same location in main memory have to be retrieved from the second level cache. The second level cache keeps track of multiple versions of data, where more than one speculative thread is running in parallel, while the first level cache does not have any of the versions during speculation. A switch allows choosing between modes of operation of a speculation blind first level cache.
Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores
NASA Astrophysics Data System (ADS)
Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei
We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta, J.; Gimenez, J.; Caubet, J.
2003-01-01
Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.
Constructing Neuronal Network Models in Massively Parallel Environments
Ippen, Tammo; Eppler, Jochen M.; Plesser, Hans E.; Diesmann, Markus
2017-01-01
Recent advances in the development of data structures to represent spiking neuron network models enable us to exploit the complete memory of petascale computers for a single brain-scale network simulation. In this work, we investigate how well we can exploit the computing power of such supercomputers for the creation of neuronal networks. Using an established benchmark, we divide the runtime of simulation code into the phase of network construction and the phase during which the dynamical state is advanced in time. We find that on multi-core compute nodes network creation scales well with process-parallel code but exhibits a prohibitively large memory consumption. Thread-parallel network creation, in contrast, exhibits speedup only up to a small number of threads but has little overhead in terms of memory. We further observe that the algorithms creating instances of model neurons and their connections scale well for networks of ten thousand neurons, but do not show the same speedup for networks of millions of neurons. Our work uncovers that the lack of scaling of thread-parallel network creation is due to inadequate memory allocation strategies and demonstrates that thread-optimized memory allocators recover excellent scaling. An analysis of the loop order used for network construction reveals that more complex tests on the locality of operations significantly improve scaling and reduce runtime by allowing construction algorithms to step through large networks more efficiently than in existing code. The combination of these techniques increases performance by an order of magnitude and harnesses the increasingly parallel compute power of the compute nodes in high-performance clusters and supercomputers. PMID:28559808
A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moreland, Kenneth; Geveci, Berk
2014-11-01
The evolution of the computing world from teraflop to petaflop has been relatively effortless, with several of the existing programming models scaling effectively to the petascale. The migration to exascale, however, poses considerable challenges. All industry trends indicate that the exascale machine will be built using processors containing hundreds to thousands of cores per chip. It can be inferred that efficient concurrency on exascale machines requires a massive number of concurrent threads, each performing many operations on a localized piece of data. Currently, visualization libraries and applications are based on what is known as the visualization pipeline. In the pipeline model, algorithms are encapsulated as filters with inputs and outputs. These filters are connected by setting the output of one component to the input of another. Parallelism in the visualization pipeline is achieved by replicating the pipeline for each processing thread. This works well for today’s distributed memory parallel computers but cannot be sustained when operating on processors with thousands of cores. Our project investigates a new visualization framework designed to exhibit the pervasive parallelism necessary for extreme scale machines. Our framework achieves this by defining algorithms in terms of worklets, which are localized stateless operations. Worklets are atomic operations that execute when invoked, unlike filters, which execute when a pipeline request occurs. The worklet design allows execution on a massive number of lightweight threads with minimal overhead. Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale machine.
Characterizing and Mitigating Work Time Inflation in Task Parallel Programs
Olivier, Stephen L.; de Supinski, Bronis R.; Schulz, Martin; ...
2013-01-01
Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.
OpenGeoSys-GEMS: Hybrid parallelization of a reactive transport code with MPI and threads
NASA Astrophysics Data System (ADS)
Kosakowski, G.; Kulik, D. A.; Shao, H.
2012-04-01
OpenGeoSys-GEMS is a general-purpose reactive transport code based on the operator splitting approach. The code couples the Finite-Element groundwater flow and multi-species transport modules of the OpenGeoSys (OGS) project (http://www.ufz.de/index.php?en=18345) with the GEM-Selektor research package to model thermodynamic equilibrium of aquatic (geo)chemical systems utilizing the Gibbs Energy Minimization approach (http://gems.web.psi.ch/). The combination of OGS and the GEM-Selektor kernel (GEMS3K) is highly flexible due to the object-oriented modular code structures and the well defined (memory based) data exchange modules. As with other reactive transport codes, the practical applicability of OGS-GEMS is often hampered by long calculation times and large memory requirements.
• For realistic geochemical systems, which might include dozens of mineral phases and several (non-ideal) solid solutions, the time needed to solve the chemical system with GEMS3K may increase substantially.
• The codes are coupled in a sequential non-iterative loop. In order to maintain accuracy, the time step size is restricted. In combination with a fine spatial discretization, the time step size may become very small, which increases calculation times drastically even for small 1D problems.
• The current version of OGS is not optimized for memory use, and the MPI version of OGS does not distribute data between nodes. Even for moderately small 2D problems, the number of MPI processes that fit into the memory of up-to-date workstations or HPC hardware is limited.
One strategy to overcome the above-mentioned restrictions of OGS-GEMS is to parallelize the coupled code. For OGS a parallelized version already exists. It is based on a domain decomposition method implemented with MPI and provides a parallel solver for fluid and mass transport processes.
In the coupled code, after solving fluid flow and solute transport, geochemical calculations are done in the form of a central loop over all finite element nodes with calls to GEMS3K and consecutive calculations of changed material parameters. In a first step the existing MPI implementation was utilized to parallelize this loop. Calculations were split between the MPI processes and afterwards data was synchronized by using MPI communication routines. Furthermore, multi-threaded calculation of the loop was implemented with the help of the boost thread library (http://www.boost.org). This implementation provides a flexible environment to distribute calculations between several threads. For each MPI process at least one and up to several dozens of worker threads are spawned. These threads do not replicate the complete OGS-GEMS data structure and use only a limited amount of memory. Calculation of the central geochemical loop is shared between all threads. Synchronization between the threads is done by barrier commands. The overall number of local threads times MPI processes should match the number of available computing nodes. The combination of multi-threading and MPI provides an effective and flexible environment to speed up OGS-GEMS calculations while limiting the required memory use. Test calculations on different hardware show that for certain types of applications tremendous speedups are possible.
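The threading scheme described here, a central per-node loop shared between worker threads that synchronize at a barrier, can be sketched in Python. The original uses C++ with the boost thread library and GEMS3K; in the sketch below `equilibrate` is a hypothetical stand-in for the per-node chemistry solve, and `run_chemistry_loop` is an assumed name.

```python
import threading


def equilibrate(concentration):
    """Hypothetical stand-in for a per-node chemistry solve (not GEMS3K)."""
    return concentration * 0.5


def run_chemistry_loop(nodes, n_threads, n_steps):
    values = list(nodes)
    barrier = threading.Barrier(n_threads)

    def worker(thread_id):
        for _ in range(n_steps):
            # Each thread owns a disjoint, strided subset of the nodes.
            for i in range(thread_id, len(values), n_threads):
                values[i] = equilibrate(values[i])
            # No thread starts the next time step until all nodes are updated.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return values


result = run_chemistry_loop([8.0, 4.0, 2.0, 1.0, 16.0], n_threads=2, n_steps=3)
```

The threads share one `values` list rather than each holding a copy, which mirrors the point in the abstract that worker threads do not replicate the full data structure.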
GrigoraSNPs: Optimized Analysis of SNPs for DNA Forensics.
Ricke, Darrell O; Shcherbina, Anna; Michaleas, Adam; Fremont-Smith, Philip
2018-04-16
High-throughput sequencing (HTS) of single nucleotide polymorphisms (SNPs) enables additional DNA forensic capabilities not attainable using traditional STR panels. However, the inclusion of sets of loci selected for mixture analysis, extended kinship, phenotype, biogeographic ancestry prediction, etc., can result in large panel sizes that are difficult to analyze in a rapid fashion. GrigoraSNPs was developed to address the allele-calling bottleneck that was encountered when analyzing SNP panels with more than 5000 loci using HTS. GrigoraSNPs uses MapReduce parallel data processing on multiple computational threads plus a novel locus-identification hashing strategy leveraging target sequence tags. This tool optimizes the SNP calling module of the DNA analysis pipeline with runtimes that scale linearly with the number of HTS reads. Results are compared with SNP analysis pipelines implemented with SAMtools and GATK. GrigoraSNPs removes a computational bottleneck for processing forensic samples with large HTS SNP panels. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.
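The combination of hashing on sequence tags with a MapReduce pass over reads can be sketched as follows. This is not the GrigoraSNPs implementation: the tag length, the two-locus panel, the assumption that the SNP base directly follows the tag, and the function names are all hypothetical and chosen only to keep the example small.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical panel: a short sequence tag identifies each locus.
TAG_LENGTH = 4
LOCUS_BY_TAG = {"ACGT": "locus1", "TTGA": "locus2"}


def map_reads(reads):
    """Map step: count (locus, allele) pairs in one chunk of reads."""
    counts = Counter()
    for read in reads:
        # Constant-time dictionary lookup instead of aligning the read.
        locus = LOCUS_BY_TAG.get(read[:TAG_LENGTH])
        if locus is not None and len(read) > TAG_LENGTH:
            allele = read[TAG_LENGTH]  # assumed: SNP base follows the tag
            counts[(locus, allele)] += 1
    return counts


def call_alleles(reads, n_threads):
    chunks = [reads[i::n_threads] for i in range(n_threads)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Reduce step: merge the per-chunk counters.
        for partial in pool.map(map_reads, chunks):
            total.update(partial)
    return total


reads = ["ACGTA", "ACGTA", "ACGTG", "TTGAC", "GGGGG", "TTGAC"]
counts = call_alleles(reads, n_threads=2)
```

Each read costs one hash lookup regardless of panel size, which is consistent with the abstract's statement that runtime scales linearly with the number of reads.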
Molecular threading and tunable molecular recognition on DNA origami nanostructures.
Wu, Na; Czajkowsky, Daniel M; Zhang, Jinjin; Qu, Jianxun; Ye, Ming; Zeng, Dongdong; Zhou, Xingfei; Hu, Jun; Shao, Zhifeng; Li, Bin; Fan, Chunhai
2013-08-21
The DNA origami technology holds great promise for the assembly of nanoscopic technological devices and studies of biochemical reactions at the single-molecule level. For these, it is essential to establish well controlled attachment of functional materials to predefined sites on the DNA origami nanostructures for reliable measurements and versatile applications. However, the two-sided nature of the origami scaffold has shown limitations in this regard. We hypothesized that holes of the commonly used two-dimensional DNA origami designs are large enough for the passage of single-stranded (ss)-DNA. Sufficiently long ssDNA initially located on one side of the origami should thus be able to "thread" to the other side through the holes in the origami sheet. By using an origami sheet attached with patterned biotinylated ssDNA spacers and monitoring streptavidin binding with atomic force microscopic (AFM) imaging, we provide unambiguous evidence that the biotin ligands positioned on one side have indeed threaded through to the other side. Our finding reveals a previously overlooked critical design feature that should provide new interpretations to previous experiments and new opportunities for the construction of origami structures with new functional capabilities.
Automatic Thread-Level Parallelization in the Chombo AMR Library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Christen, Matthias; Keen, Noel; Ligocki, Terry
2011-05-26
The increasing on-chip parallelism has some substantial implications for HPC applications. Currently, hybrid programming models (typically MPI+OpenMP) are employed for mapping software to the hardware in order to leverage the hardware's architectural features. In this paper, we present an approach that automatically introduces thread level parallelism into Chombo, a parallel adaptive mesh refinement framework for finite difference type PDE solvers. In Chombo, core algorithms are specified in ChomboFortran, a macro language extension to F77 that is part of the Chombo framework. This domain-specific language forms an already used target language for an automatic migration of the large number of existing algorithms into a hybrid MPI+OpenMP implementation. It also provides access to the auto-tuning methodology that enables tuning certain aspects of an algorithm to hardware characteristics. Performance measurements are presented for a few of the most relevant kernels with respect to a specific application benchmark using this technique as well as benchmark results for the entire application. The kernel benchmarks show that, using auto-tuning, up to a factor of 11 in performance was gained with 4 threads with respect to the serial reference implementation.
Ceresini, Paulo C; Costa-Souza, Elaine; Zala, Marcello; Furtado, Edson L; Souza, Nilton L
2012-04-01
The white-thread blight and black rot (WTBR) caused by basidiomycetous fungi of the genus Ceratobasidium is emerging as an important plant disease in Brazil, particularly for crop species in the Ericales such as persimmon (Diospyros kaki) and tea (Camellia sinensis). However, the species identity of the fungal pathogen associated with either of these hosts is still unclear. In this work, we used sequence variation in the internal transcribed spacer regions, including the 5.8S coding region of rDNA (ITS-5.8S rDNA), to determine the phylogenetic placement of the local white-thread-blight-associated populations of Ceratobasidium sp. from persimmon and tea, in relation to Ceratobasidium species already described world-wide. The two sister populations of Ceratobasidium sp. from persimmon and tea in the Brazilian Atlantic Forest agroecosystem most likely represent distinct species within Ceratobasidium and are also distinct from C. noxium, the etiological agent of the first description of white-thread blight disease that was reported on coffee in India. The intraspecific variation for the two Ceratobasidium sp. populations was also analyzed using three mitochondrial genes (ATP6, nad1 and nad2). As reported for other fungi, variation in nuclear and mitochondrial DNA was incongruent. Despite distinct variability in the ITS-rDNA region these two populations shared similar mitochondrial DNA haplotypes.
A software bus for thread objects
NASA Technical Reports Server (NTRS)
Callahan, John R.; Li, Dehuai
1995-01-01
The authors have implemented a software bus for lightweight threads in an object-oriented programming environment that allows for rapid reconfiguration and reuse of thread objects in discrete-event simulation experiments. While previous research in object-oriented, parallel programming environments has focused on direct communication between threads, our lightweight software bus, called the MiniBus, provides a means to isolate threads from their contexts of execution by restricting communications between threads to message-passing via their local ports only. The software bus maintains a topology of connections between these ports. It routes, queues, and delivers messages according to this topology. This approach allows for rapid reconfiguration and reuse of thread objects in other systems without making changes to the specifications or source code. A layered approach that provides the needed transparency to developers is presented. Examples of using the MiniBus are given, and the value of bus architectures in building and conducting simulations of discrete-event systems is discussed.
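The port-and-topology idea described above can be illustrated with a minimal Python sketch. This is not the MiniBus implementation; the class name `MiniBusSketch`, the port names, and the method names are hypothetical and chosen only for illustration. It shows threads that never reference each other directly: each one sends to or receives from its own local port, and the bus routes and queues messages according to a connection table.

```python
import queue
import threading


class MiniBusSketch:
    """Hypothetical bus: holds a topology of port-to-port connections."""

    def __init__(self):
        self._routes = {}   # source port name -> list of destination port names
        self._inboxes = {}  # port name -> queue of delivered messages

    def add_port(self, name):
        self._inboxes[name] = queue.Queue()

    def connect(self, src, dst):
        self._routes.setdefault(src, []).append(dst)

    def send(self, src, message):
        # Route the message to every port connected to the source port.
        for dst in self._routes.get(src, []):
            self._inboxes[dst].put(message)

    def receive(self, port, timeout=1.0):
        return self._inboxes[port].get(timeout=timeout)


def producer(bus):
    # Knows only its own output port, not who is listening.
    for i in range(3):
        bus.send("producer.out", i)


def consumer(bus, results):
    # Knows only its own input port, not who is sending.
    for _ in range(3):
        results.append(bus.receive("consumer.in") * 10)


bus = MiniBusSketch()
bus.add_port("producer.out")
bus.add_port("consumer.in")
bus.connect("producer.out", "consumer.in")

results = []
threads = [
    threading.Thread(target=producer, args=(bus,)),
    threading.Thread(target=consumer, args=(bus, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Reconfiguring the simulation in this sketch means changing only the `connect` calls; the thread functions stay untouched, which is the reuse property the abstract emphasizes.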
Dna2 nuclease-helicase structure, mechanism and regulation by Rpa.
Zhou, Chun; Pourmal, Sergei; Pavletich, Nikola P
2015-11-02
The Dna2 nuclease-helicase maintains genomic integrity by processing DNA double-strand breaks, Okazaki fragments and stalled replication forks. Dna2 requires ssDNA ends, and is dependent on the ssDNA-binding protein Rpa, which controls cleavage polarity. Here we present the 2.3 Å structure of intact mouse Dna2 bound to a 15-nucleotide ssDNA. The nuclease active site is embedded in a long, narrow tunnel through which the DNA has to thread. The helicase domain is required for DNA binding but not threading. We also present the structure of a flexibly-tethered Dna2-Rpa interaction that recruits Dna2 to Rpa-coated DNA. We establish that a second Dna2-Rpa interaction is mutually exclusive with Rpa-DNA interactions and mediates the displacement of Rpa from ssDNA. This interaction occurs at the nuclease tunnel entrance and the 5' end of the Rpa-DNA complex. Hence, it only displaces Rpa from the 5' but not 3' end, explaining how Rpa regulates cleavage polarity.
SKIRT: Hybrid parallelization of radiative transfer simulations
NASA Astrophysics Data System (ADS)
Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.
2017-07-01
We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
NASA Astrophysics Data System (ADS)
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction, multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Almaqwashi, Ali A.; Paramanathan, Thayaparan; Lincoln, Per; Rouzina, Ioulia; Westerlund, Fredrik; Williams, Mark C.
2014-01-01
DNA intercalation by threading is expected to yield high affinity and slow dissociation, properties desirable for DNA-targeted therapeutics. To measure these properties, we utilize single molecule DNA stretching to quantify both the binding affinity and the force-dependent threading intercalation kinetics of the binuclear ruthenium complex Δ,Δ-[μ‐bidppz‐(phen)4Ru2]4+ (Δ,Δ-P). We measure the DNA elongation at a range of constant stretching forces using optical tweezers, allowing direct characterization of the intercalation kinetics as well as the amount intercalated at equilibrium. Higher forces exponentially facilitate the intercalative binding, leading to a profound decrease in the binding site size that results in one ligand intercalated at almost every DNA base stack. The zero force Δ,Δ-P intercalation Kd is 44 nM, 25-fold stronger than the analogous mono-nuclear ligand (Δ-P). The force-dependent kinetics analysis reveals a mechanism that requires DNA elongation of 0.33 nm for association, relaxation to an equilibrium elongation of 0.19 nm, and an additional elongation of 0.14 nm from the equilibrium state for dissociation. In cells, a molecule with binding properties similar to Δ,Δ-P may rapidly bind DNA destabilized by enzymes during replication or transcription, but upon enzyme dissociation it is predicted to remain intercalated for several hours, thereby interfering with essential biological processes. PMID:25245944
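The elongation values quoted above can be checked for internal consistency with a few lines of Python. The sketch below uses only numbers from the abstract, except for two assumptions that are not stated in it: the thermal energy `KBT_PN_NM` (about 4.11 pN·nm near room temperature) and a simple exponential force dependence of the form exp(F·x / kBT), which is a common way to model force-facilitated binding and is used here only for illustration.

```python
import math

KBT_PN_NM = 4.11         # assumed thermal energy near room temperature, pN*nm
X_ON_NM = 0.33           # elongation required for association (from the abstract)
X_EQ_NM = 0.19           # equilibrium elongation per bound ligand (from the abstract)
X_OFF_NM = 0.14          # additional elongation required for dissociation (from the abstract)
KD_ZERO_FORCE_NM = 44.0  # zero-force dissociation constant in nM (from the abstract)

# Consistency check: association elongation = equilibrium + dissociation elongation.
elongation_gap_nm = X_ON_NM - X_EQ_NM

# "25-fold stronger" binding implies a 25 times larger Kd for the mono-nuclear ligand.
mono_kd_nm = KD_ZERO_FORCE_NM * 25


def kd_at_force(force_pn):
    """Assumed model: Kd falls exponentially with force via the equilibrium elongation."""
    return KD_ZERO_FORCE_NM * math.exp(-force_pn * X_EQ_NM / KBT_PN_NM)


def association_enhancement(force_pn):
    """Assumed model: association rate grows as exp(F * x_on / kBT)."""
    return math.exp(force_pn * X_ON_NM / KBT_PN_NM)
```

Under these assumptions the implied mono-nuclear Kd is about 1.1 µM, and the three elongations satisfy 0.33 nm − 0.19 nm = 0.14 nm, as the reported mechanism requires.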
ASC-ATDM Performance Portability Requirements for 2015-2019
DOE Office of Scientific and Technical Information (OSTI.GOV)
Edwards, Harold C.; Trott, Christian Robert
This report outlines the research, development, and support requirements for the Advanced Simulation and Computing (ASC) Advanced Technology, Development, and Mitigation (ATDM) Performance Portability (a.k.a., Kokkos) project for 2015-2019. The research and development (R&D) goal for Kokkos (v2) has been to create and demonstrate a thread-parallel programming model and standard C++ library-based implementation that enables performance portability across diverse manycore architectures such as multicore CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. This R&D goal has been achieved for algorithms that use data parallel patterns including parallel-for, parallel-reduce, and parallel-scan. Current R&D is focusing on hierarchical parallel patterns such as a directed acyclic graph (DAG) of asynchronous tasks where each task contains nested data parallel algorithms. This five-year plan includes R&D required to fully and performance portably exploit thread parallelism across current and anticipated next generation platforms (NGP). The Kokkos library is being evaluated by many projects exploring algorithms and code design for NGP. Some production libraries and applications such as Trilinos and LAMMPS have already committed to Kokkos as their foundation for manycore parallelism and performance portability. These five-year requirements include support required for current and anticipated ASC projects to be effective and productive in their use of Kokkos on NGP. The greatest risk to the success of Kokkos and ASC projects relying upon Kokkos is a lack of staffing resources to support Kokkos to the degree needed by these ASC projects. This support includes up-to-date tutorials, documentation, multi-platform (hardware and software stack) testing, minor feature enhancements, thread-scalable algorithm consulting, and managing collaborative R&D.
Simulation of LHC events on a million threads
NASA Astrophysics Data System (ADS)
Childers, J. T.; Uram, T. D.; LeCompte, T. J.; Papka, M. E.; Benjamin, D. P.
2015-12-01
Demand for Grid resources is expected to double during LHC Run II as compared to Run I; the capacity of the Grid, however, will not double. The HEP community must consider how to bridge this computing gap by targeting larger compute resources and using the available compute resources as efficiently as possible. Argonne's Mira, the fifth fastest supercomputer in the world, can run roughly five times the number of parallel processes that the ATLAS experiment typically uses on the Grid. We ported Alpgen, a serial x86 code, to run as a parallel application under MPI on the Blue Gene/Q architecture. By analysis of the Alpgen code, we reduced the memory footprint to allow running 64 threads per node, utilizing the four hardware threads available per core on the PowerPC A2 processor. Event generation and unweighting, typically run as independent serial phases, are coupled together in a single job in this scenario, reducing intermediate writes to the filesystem. By these optimizations, we have successfully run LHC proton-proton physics event generation at the scale of a million threads, filling two-thirds of Mira.
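The thread counts in this abstract imply some simple arithmetic, sketched below using only the figures quoted above (64 threads per node, four hardware threads per core, and a target of one million threads). The variable names are illustrative.

```python
THREADS_PER_NODE = 64      # from the abstract
HW_THREADS_PER_CORE = 4    # from the abstract
TARGET_THREADS = 1_000_000

# Running 64 threads per node with 4 hardware threads per core
# means every hardware thread of 16 cores is occupied.
cores_used_per_node = THREADS_PER_NODE // HW_THREADS_PER_CORE

# Number of nodes needed to reach one million concurrent threads.
nodes_for_target = TARGET_THREADS / THREADS_PER_NODE
```

So a million-thread run at this density needs on the order of 15,600 nodes, which is why reducing the per-thread memory footprint was a precondition for the run.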
Massively Parallel Dantzig-Wolfe Decomposition Applied to Traffic Flow Scheduling
NASA Technical Reports Server (NTRS)
Rios, Joseph Lucio; Ross, Kevin
2009-01-01
Optimal scheduling of air traffic over the entire National Airspace System is a computationally difficult task. To speed computation, Dantzig-Wolfe decomposition is applied to a known linear integer programming approach for assigning delays to flights. The optimization model is proven to have the block-angular structure necessary for Dantzig-Wolfe decomposition. The subproblems for this decomposition are solved in parallel via independent computation threads. Experimental evidence suggests that as the number of subproblems/threads increases (and their respective sizes decrease), the solution quality, convergence, and runtime improve. A demonstration of this is provided by using one flight per subproblem, which is the finest possible decomposition. This results in thousands of subproblems and associated computation threads. This massively parallel approach is compared to one with few threads and to standard (non-decomposed) approaches in terms of solution quality and runtime. Since this method generally provides a non-integral (relaxed) solution to the original optimization problem, two heuristics are developed to generate an integral solution. Dantzig-Wolfe followed by these heuristics can provide a near-optimal (sometimes optimal) solution to the original problem hundreds of times faster than standard (non-decomposed) approaches. In addition, when massive decomposition is employed, the solution is shown to be more likely integral, which obviates the need for an integerization step. These results indicate that nationwide, real-time, high fidelity, optimal traffic flow scheduling is achievable for (at least) 3 hour planning horizons.
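The reason the decomposition parallelizes well is that, once prices are fixed for the shared (coupling) resources, each subproblem can be solved without looking at any other. The Python sketch below illustrates only that pricing step with one flight per subproblem; it is not the authors' model. The flight data, the slot names, and `slot_prices` (standing in for penalties derived from the master problem's dual values) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def price_flight(flight, slot_prices):
    """Solve one subproblem: pick the cheapest (delay, slot) option for one flight."""
    def priced_cost(option):
        delay_minutes, slot = option
        # Own delay cost plus the current price of the shared slot it would occupy.
        return flight["cost_per_min"] * delay_minutes + slot_prices.get(slot, 0.0)

    return min(flight["options"], key=priced_cost)


# Hypothetical data: two flights competing for slot "s1".
flights = [
    {"name": "F1", "cost_per_min": 1.0, "options": [(0, "s1"), (15, "s2")]},
    {"name": "F2", "cost_per_min": 2.0, "options": [(0, "s1"), (15, "s2")]},
]
slot_prices = {"s1": 20.0, "s2": 0.0}  # hypothetical prices from a master problem

# One independent computation thread per subproblem.
with ThreadPoolExecutor(max_workers=len(flights)) as pool:
    choices = list(pool.map(lambda f: price_flight(f, slot_prices), flights))
```

With these made-up numbers the cheaper-to-delay flight F1 moves to the later slot while F2 keeps the congested one. A full Dantzig-Wolfe loop would feed such proposals back to a master problem and update the prices until no improving proposal remains.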
Ceresini, Paulo C.; Costa-Souza, Elaine; Zala, Marcello; Furtado, Edson L.; Souza, Nilton L.
2012-01-01
The white-thread blight and black rot (WTBR) caused by basidiomycetous fungi of the genus Ceratobasidium is emerging as an important plant disease in Brazil, particularly for crop species in the Ericales such as persimmon (Diospyros kaki) and tea (Camellia sinensis). However, the species identity of the fungal pathogen associated with either of these hosts is still unclear. In this work, we used sequence variation in the internal transcribed spacer regions, including the 5.8S coding region of rDNA (ITS-5.8S rDNA), to determine the phylogenetic placement of the local white-thread-blight-associated populations of Ceratobasidium sp. from persimmon and tea, in relation to Ceratobasidium species already described world-wide. The two sister populations of Ceratobasidium sp. from persimmon and tea in the Brazilian Atlantic Forest agroecosystem most likely represent distinct species within Ceratobasidium and are also distinct from C. noxium, the etiological agent of the first description of white-thread blight disease that was reported on coffee in India. The intraspecific variation for the two Ceratobasidium sp. populations was also analyzed using three mitochondrial genes (ATP6, nad1 and nad2). As reported for other fungi, variation in nuclear and mitochondrial DNA was incongruent. Despite distinct variability in the ITS-rDNA region these two populations shared similar mitochondrial DNA haplotypes. PMID:22888299
Parallelization strategies for continuum-generalized method of moments on the multi-thread systems
NASA Astrophysics Data System (ADS)
Bustamam, A.; Handhika, T.; Ernastuti; Kerami, D.
2017-07-01
Continuum-Generalized Method of Moments (C-GMM) addresses a shortfall of the Generalized Method of Moments (GMM), which is not as efficient as the Maximum Likelihood estimator, by using a continuum set of moment conditions in a GMM framework. However, this computation takes a very long time because the regularization parameter must be optimized. These calculations are normally processed sequentially, even though modern computers support hierarchical memory systems and hyperthreading technology, which allow for parallel computing. This paper aims to speed up the calculation process of C-GMM by designing a parallel algorithm for C-GMM on multi-thread systems. First, parallel regions are detected in the original C-GMM algorithm. There are two parallel regions in the original C-GMM algorithm that contribute significantly to the reduction of computational time: the outer loop and the inner loop. This parallel algorithm is then implemented with a standard shared-memory application programming interface, namely Open Multi-Processing (OpenMP). The experiment shows that the outer-loop parallelization is the best strategy for any number of observations.
Dual-thread parallel control strategy for ophthalmic adaptive optics.
Yu, Yongxin; Zhang, Yuhua
To improve ophthalmic adaptive optics speed and compensate for ocular wavefront aberration of high temporal frequency, the adaptive optics wavefront correction has been implemented with a control scheme including 2 parallel threads; one is dedicated to wavefront detection and the other conducts wavefront reconstruction and compensation. With a custom Shack-Hartmann wavefront sensor that measures the ocular wave aberration with 193 subapertures across the pupil, adaptive optics has achieved a closed loop updating frequency up to 110 Hz, and demonstrated robust compensation for ocular wave aberration up to 50 Hz in an adaptive optics scanning laser ophthalmoscope.
Dual-thread parallel control strategy for ophthalmic adaptive optics
Yu, Yongxin; Zhang, Yuhua
2015-01-01
To improve ophthalmic adaptive optics speed and compensate for ocular wavefront aberration of high temporal frequency, the adaptive optics wavefront correction has been implemented with a control scheme including 2 parallel threads; one is dedicated to wavefront detection and the other conducts wavefront reconstruction and compensation. With a custom Shack-Hartmann wavefront sensor that measures the ocular wave aberration with 193 subapertures across the pupil, adaptive optics has achieved a closed loop updating frequency up to 110 Hz, and demonstrated robust compensation for ocular wave aberration up to 50 Hz in an adaptive optics scanning laser ophthalmoscope. PMID:25866498
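The two-thread control scheme can be sketched as a producer-consumer pipeline. The Python example below is only an illustration of that structure, not the authors' controller: the simulated slope data, the `gain` value, and the averaging used in place of real wavefront reconstruction are all hypothetical.

```python
import queue
import threading

N_FRAMES = 5
frames = queue.Queue(maxsize=1)  # hands the latest measurement to the second thread
corrections = []


def detect_wavefront():
    """Thread 1: dedicated to wavefront detection (simulated sensor slopes)."""
    for i in range(N_FRAMES):
        frames.put([float(i)] * 4)
    frames.put(None)  # sentinel: no more frames


def reconstruct_and_compensate(gain=0.5):
    """Thread 2: reconstructs a correction and applies it (here, records it)."""
    while True:
        slopes = frames.get()
        if slopes is None:
            break
        mean_slope = sum(slopes) / len(slopes)  # stand-in for reconstruction
        corrections.append(-gain * mean_slope)  # stand-in for mirror command


workers = [
    threading.Thread(target=detect_wavefront),
    threading.Thread(target=reconstruct_and_compensate),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because detection of the next frame overlaps with reconstruction of the current one, the loop period is set by the slower stage rather than by the sum of both. At the reported 110 Hz update rate, that period is roughly 9 ms.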
Parallel Implicit Runge-Kutta Methods Applied to Coupled Orbit/Attitude Propagation
NASA Astrophysics Data System (ADS)
Hatten, Noble; Russell, Ryan P.
2017-12-01
A variable-step Gauss-Legendre implicit Runge-Kutta (GLIRK) propagator is applied to coupled orbit/attitude propagation. Concepts previously shown to improve efficiency in 3DOF propagation are modified and extended to the 6DOF problem, including the use of variable-fidelity dynamics models. The impact of computing the stage dynamics of a single step in parallel is examined using up to 23 threads and 22 associated GLIRK stages; one thread is reserved for an extra dynamics function evaluation used in the estimation of the local truncation error. Efficiency is found to peak for typical examples when using approximately 8 to 12 stages for both serial and parallel implementations. Accuracy and efficiency compare favorably to explicit Runge-Kutta and linear-multistep solvers for representative scenarios. However, linear-multistep methods are found to be more efficient for some applications, particularly in a serial computing environment, or when parallelism can be applied across multiple trajectories.
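Parallelism within a single step is possible because, inside each iteration of the implicit solve, the dynamics function is evaluated at every stage independently. The sketch below shows this for a fixed-step, 2-stage Gauss-Legendre method solved by plain fixed-point iteration on a scalar test equation. It is not the authors' variable-step propagator; the function names, the test problem, and the iteration count are illustrative assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor

SQRT3 = math.sqrt(3.0)
# Butcher tableau of the 2-stage (4th-order) Gauss-Legendre method.
A = [[0.25, 0.25 - SQRT3 / 6.0], [0.25 + SQRT3 / 6.0, 0.25]]
B = [0.5, 0.5]
C = [0.5 - SQRT3 / 6.0, 0.5 + SQRT3 / 6.0]


def glirk_step(f, t, y, h, pool, iterations=30):
    """One implicit Runge-Kutta step; stage derivatives are evaluated in parallel."""
    k = [f(t, y)] * len(B)  # initial guess for the stage derivatives
    stage_times = [t + c * h for c in C]
    for _ in range(iterations):
        stage_states = [
            y + h * sum(a * kj for a, kj in zip(row, k)) for row in A
        ]
        # Each stage evaluation is independent of the others: one task per stage.
        k = list(pool.map(f, stage_times, stage_states))
    return y + h * sum(b * ki for b, ki in zip(B, k))


def decay(t, y):
    """Test dynamics y' = -y with exact solution exp(-t)."""
    return -y


with ThreadPoolExecutor(max_workers=len(B)) as pool:
    y1 = glirk_step(decay, 0.0, 1.0, 0.1, pool)
```

For a dynamics function this cheap the threading overhead dominates. The approach pays off when each stage evaluation is expensive, as with the high-fidelity force and torque models in 6DOF propagation.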
Joel, Anna-Christin; Kappel, Peter; Adamova, Hana; Baumgartner, Werner; Scholz, Ingo
2015-11-01
Spider silk production has been studied intensively in recent years. However, capture threads of cribellate spiders employ an alternative mode of thread production that has often gone unnoticed until now. This thread in general is highly interesting, as it not only involves a controlled arrangement of three types of threads with one being nano-scale fibres (cribellate fibres), but also a special comb-like structure on the metatarsus of the fourth leg (calamistrum) for its production. We found the cribellate fibres organized as a mat, enclosing two parallel larger fibres (axial fibres) and forming the typical puffy structure of cribellate threads. Mat and axial fibres are connected to each other at discrete points between two puffs, presumably by the action of the median spinnerets. However, this connection alone does not lead to the typical puffy shape of a cribellate thread. After removing the calamistrum, we found that a functional capture thread was still produced, but the puffy shape of the thread was lost. Therefore, the calamistrum is not necessary for the extraction or combination of fibres, but for further processing of the nano-scale cribellate fibres. Using data from Uloborus plumipes we were able to develop a model of the cribellate thread production, probably universally valid for cribellate spiders. Copyright © 2015 Elsevier Ltd. All rights reserved.
Dna2 nuclease-helicase structure, mechanism and regulation by Rpa
Zhou, Chun; Pourmal, Sergei; Pavletich, Nikola P
2015-01-01
The Dna2 nuclease-helicase maintains genomic integrity by processing DNA double-strand breaks, Okazaki fragments and stalled replication forks. Dna2 requires ssDNA ends, and is dependent on the ssDNA-binding protein Rpa, which controls cleavage polarity. Here we present the 2.3 Å structure of intact mouse Dna2 bound to a 15-nucleotide ssDNA. The nuclease active site is embedded in a long, narrow tunnel through which the DNA has to thread. The helicase domain is required for DNA binding but not threading. We also present the structure of a flexibly-tethered Dna2-Rpa interaction that recruits Dna2 to Rpa-coated DNA. We establish that a second Dna2-Rpa interaction is mutually exclusive with Rpa-DNA interactions and mediates the displacement of Rpa from ssDNA. This interaction occurs at the nuclease tunnel entrance and the 5’ end of the Rpa-DNA complex. Hence, it only displaces Rpa from the 5’ but not 3’ end, explaining how Rpa regulates cleavage polarity. DOI: http://dx.doi.org/10.7554/eLife.09832.001 PMID:26491943
Threaded average temperature thermocouple
NASA Technical Reports Server (NTRS)
Ward, Stanley W. (Inventor)
1990-01-01
A threaded average temperature thermocouple 11 is provided to measure the average temperature of a test situs of a test material 30. A ceramic insulator rod 15 with two parallel holes 17 and 18 through the length thereof is securely fitted in a cylinder 16, which is bored along the longitudinal axis of symmetry of threaded bolt 12. Threaded bolt 12 is composed of material having thermal properties similar to those of test material 30. Leads of a thermocouple wire 20 leading from a remotely situated temperature sensing device 35 are each fed through one of the holes 17 or 18, secured at head end 13 of ceramic insulator rod 15, and exit at tip end 14. Each lead of thermocouple wire 20 is bent into and secured in an opposite radial groove 25 in tip end 14 of threaded bolt 12. Resulting threaded average temperature thermocouple 11 is ready to be inserted into cylindrical receptacle 32. The tip end 14 of the threaded average temperature thermocouple 11 is in intimate contact with receptacle 32. A jam nut 36 secures the threaded average temperature thermocouple 11 to test material 30.
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.
Li, Guang-Tai; Li, Xiao-Fan; Wu, Baoping; Li, Guangrui
2016-04-01
To assess the efficacy and safety of longitudinal parallel compression suture to control heavy postpartum hemorrhage (PPH) in patients with placenta previa/accreta. Fifteen women received a longitudinal parallel compression suture to stop life-threatening PPH due to placenta previa with or without accreta during cesarean section. The suture apposed the anterior and posterior walls of the lower uterine segment together using an absorbable thread. A 70-mm round needle with a Number-1 absorbable thread was used. The point of needle entry was 1 cm above the upper margin of the cervix and 1 cm from the right lateral border of the lower segment of the anterior wall. The suture was threaded through the uterine cavity to the serosa of the posterior wall. Then, it was directed upward and threaded from the posterior to the anterior wall at ∼1-2 cm above the upper boundary of the lower uterine segment and 3 cm medial to the right margin of the uterus. Both ends of the suture were tied on the anterior aspect of the uterus. The left side was sutured in the same way. The success rate of the procedure was 86.7% (13/15). Two of 15 cases were concurrently administered gauze packing and achieved satisfactory hemostasis. All patients resumed a normal menstrual flow, and no postoperative anatomical or physiological abnormalities related to the suture were observed. Three women achieved further pregnancies after the procedure. Longitudinal parallel compression suture is a safe, easy, effective, practical, and conservative surgical technique to stop intractable PPH from the lower uterine segment, particularly in women who have a cesarean scar and placenta previa/accreta. Copyright © 2016. Published by Elsevier B.V.
Guo, Peixuan; Zhao, Zhengyi; Haak, Jeannie; Wang, Shaoying; Weitao, Tao
2014-01-01
Biomotors were once classified into two categories: linear motor and rotation motor. For decades, the viral DNA-packaging motor has been popularly believed to be a five-fold rotation motor. Recently, a third type of biomotor with revolution mechanism without rotation has been discovered. By analogy, rotation resembles the Earth rotating on its axis in a complete cycle every 24 hours, while revolution resembles the Earth revolving around the Sun one circle per 365 days (see animations http://nanobio.uky.edu/movie.html). The action of revolution that enables a motor free of coiling and torque has solved many puzzles and debates that have occurred throughout the history of viral DNA packaging motor studies. It also settles the discrepancies concerning the structure, stoichiometry, and functioning of DNA translocation motors. This review uses bacteriophages Phi29, HK97, SPP1, P22, T4, T7 as well as bacterial DNA translocase FtsK and SpoIIIE as examples to elucidate the puzzles. These motors use an ATPase, some of which have been confirmed to be a hexamer, to revolve around the dsDNA sequentially. ATP binding induces conformational change and possibly an entropy alteration in ATPase to a high affinity toward dsDNA; but ATP hydrolysis triggers another entropic and conformational change in ATPase to a low affinity for DNA, by which dsDNA is pushed toward an adjacent ATPase subunit. The rotation and revolution mechanisms can be distinguished by the size of channel: the channels of rotation motors are equal to or smaller than 2 nm, whereas channels of revolution motors are larger than 3 nm. Rotation motors use parallel threads to operate with a right-handed channel, while revolution motors use a left-handed channel to drive the right-handed DNA in an anti-parallel arrangement. Coordination of several vector factors in the same direction makes viral DNA-packaging motors unusually powerful and effective. The revolution mechanism, which avoids DNA coiling while translocating the lengthy genomic dsDNA helix, could be an advantage for cell replication processes such as bacterial binary fission and cell mitosis, without the need for topoisomerase or helicase to consume additional energy. PMID:24913057
Parallel approach for bioinspired algorithms
NASA Astrophysics Data System (ADS)
Zaporozhets, Dmitry; Zaruba, Daria; Kulieva, Nina
2018-05-01
In the paper, a probabilistic parallel approach based on a population heuristic, such as a genetic algorithm, is suggested. The authors propose using multithreading at the micro level, at which new alternative solutions are generated. On each iteration, several threads can be started that independently use the same population to generate new solutions. After all threads have finished, a selection operator combines the obtained results into the new population. To confirm the effectiveness of the suggested approach, the authors have developed software on the basis of which experimental computations can be carried out. The authors have considered a classic optimization problem: finding a Hamiltonian cycle in a graph. Experiments show that, due to the parallel approach at the micro level, a speedup can be obtained on graphs with 250 or more vertices.
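One iteration of such a scheme might look like the Python sketch below. It is an illustration under stated assumptions rather than the authors' software: the swap mutation, the elitist selection rule, the per-thread seeds, and the use of tour length on a small complete graph as the objective are all hypothetical choices made to keep the example short.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def tour_length(tour, dist):
    """Length of the closed cycle that visits the vertices in the given order."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def generate(population, seed):
    """One thread's work: read the shared population, return mutated copies."""
    rng = random.Random(seed)  # private generator, so threads do not interfere
    children = []
    for tour in population:
        child = list(tour)
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]  # swap mutation
        children.append(child)
    return children


def iterate(population, dist, n_threads, base_seed):
    """Start several generator threads, then let selection merge their results."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        seeds = range(base_seed, base_seed + n_threads)
        batches = pool.map(lambda s: generate(population, s), seeds)
        candidates = [child for batch in batches for child in batch]
    merged = population + candidates
    merged.sort(key=lambda t: tour_length(t, dist))
    return merged[: len(population)]  # elitist selection keeps the best tours


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(8)]
dist = [[math.dist(p, q) for q in points] for p in points]
population = [rng.sample(range(8), 8) for _ in range(6)]

best_before = min(tour_length(t, dist) for t in population)
new_population = iterate(population, dist, n_threads=4, base_seed=100)
best_after = min(tour_length(t, dist) for t in new_population)
```

The threads only read the shared population and write to their own result lists, so no locking is needed until the selection step, which runs after all threads have finished.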
Solving Large Problems Quickly: Progress in 2001-2003
NASA Technical Reports Server (NTRS)
Mowry, Todd C.; Colohan, Christopher B.; Brown, Angela Demke; Steffan, J. Gregory; Zhai, Antonia
2004-01-01
This document describes the progress we have made and the lessons we have learned in 2001 through 2003 under the NASA grant entitled "Solving Important Problems Faster". The long-term goal of this research is to accelerate large, irregular scientific applications which have enormous data sets and which are difficult to parallelize. To accomplish this goal, we are exploring two complementary techniques: (i) using compiler-inserted prefetching to automatically hide the I/O latency of accessing these large data sets from disk; and (ii) using thread-level data speculation to enable the optimistic parallelization of applications despite uncertainty as to whether data dependences exist between the resulting threads which would normally make them unsafe to execute in parallel. Overall, we made significant progress in 2001 through 2003, and the project has gone well.
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
NASA Astrophysics Data System (ADS)
Hadade, Ioan; di Mare, Luca
2016-08-01
Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
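The Roofline model mentioned above bounds a kernel's attainable performance by the lower of the machine's peak compute rate and the product of memory bandwidth and the kernel's arithmetic intensity. A minimal Python sketch follows; the machine numbers are made up for illustration and are not taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, intensity_flops_per_byte):
    """Attainable performance: min(compute roof, bandwidth roof)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)


# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
memory_bound = roofline_gflops(100.0, 50.0, 1.0)   # low-intensity kernel
compute_bound = roofline_gflops(100.0, 50.0, 4.0)  # high-intensity kernel
ridge_point = 100.0 / 50.0                         # intensity where the roofs meet
```

Comparing a kernel's measured rate with this bound shows whether further SIMD or threading work can help, or whether the kernel is already limited by memory traffic.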
Large Scale Document Inversion using a Multi-threaded Computing System
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2018-01-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massively parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. A vast amount of information now flows into the digital domain around the world. Huge volumes of data, such as digital libraries, social networking services, e-commerce product data, and reviews, are produced or collected every moment, with dramatic growth in size. Although the inverted index is a useful data structure for full text searches or document retrieval, creating the index for a large number of documents requires a tremendous amount of time. The performance of document inversion can be improved by using a multi-threaded or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD) document inversion algorithm on the NVIDIA GPU/CUDA programming platform, utilizing the huge computational power of the GPU to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstracts and e-commerce product reviews. PMID:29861701
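As a rough illustration of what "document inversion" means in the abstract above, the following is a minimal sequential sketch in Python that uses a hash map (a `dict`) from terms to document ids. It is not the authors' CUDA implementation; the function name `invert`, the whitespace tokenizer, and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def invert(docs):
    """Build an inverted index: term -> sorted list of document ids.

    `docs` maps a document id to its text. Tokenization is a plain
    lowercase whitespace split, which is an assumption for this sketch.
    """
    index = defaultdict(set)  # hash map keyed by term
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort the postings so lookups return a stable order.
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample collection.
docs = {
    0: "GPU parallel computing",
    1: "parallel document inversion",
    2: "GPU document index",
}
index = invert(docs)
```

Each document can be tokenized independently of the others, which is the property an SPMD implementation would exploit; merging the per-document postings into one shared table is the step that needs coordination between threads.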
Multithreaded Stochastic PDES for Reactions and Diffusions in Neurons.
Lin, Zhongwei; Tropper, Carl; Mcdougal, Robert A; Patoary, Mohammand Nazrul Ishlam; Lytton, William W; Yao, Yiping; Hines, Michael L
2017-07-01
Cells exhibit stochastic behavior when the number of molecules is small. Hence a stochastic reaction-diffusion simulator capable of working at scale can provide a more accurate view of molecular dynamics within the cell. This paper describes a parallel discrete event simulator, Neuron Time Warp-Multi Thread (NTW-MT), developed for the simulation of reaction diffusion models of neurons. To the best of our knowledge, this is the first parallel discrete event simulator oriented towards stochastic simulation of chemical reactions in a neuron. The simulator was developed as part of the NEURON project. NTW-MT is optimistic and thread-based, which attempts to capitalize on multi-core architectures used in high performance machines. It makes use of a multi-level queue for the pending event set and a single roll-back message in place of individual anti-messages to disperse contention and decrease the overhead of processing rollbacks. Global Virtual Time is computed asynchronously both within and among processes to get rid of the overhead for synchronizing threads. Memory usage is managed in order to avoid locking and unlocking when allocating and de-allocating memory and to maximize cache locality. We verified our simulator on a calcium buffer model. We examined its performance on a calcium wave model, comparing it to the performance of a process based optimistic simulator and a threaded simulator which uses a single priority queue for each thread. Our multi-threaded simulator is shown to achieve superior performance to these simulators. Finally, we demonstrated the scalability of our simulator on a larger CICR model and a more detailed CICR model.
Thread scheduling for GPU-based OPC simulation on multi-thread
NASA Astrophysics Data System (ADS)
Lee, Heejun; Kim, Sangwook; Hong, Jisuk; Lee, Sooryong; Han, Hwansoo
2018-03-01
As semiconductor product development based on shrinkage continues, the accuracy and difficulty required for model-based optical proximity correction (MBOPC) are increasing. OPC simulation time, which is the most time-consuming part of MBOPC, is rapidly increasing due to high pattern density in a layout and complex OPC models. To reduce OPC simulation time, we attempt to apply the graphics processing unit (GPU) to MBOPC because the OPC process is well suited to parallel programming. We address some issues that may typically happen during GPU-based OPC simulation in a multi-threaded system, such as "out of memory" and "GPU idle time". To overcome these problems, we propose a thread scheduling method, which manages OPC jobs in multiple threads in such a way that simulation jobs from multiple threads are executed alternately on the GPU while correction jobs are executed at the same time on each CPU core. It was observed that GPU peak memory usage decreases by up to 35%, and MBOPC runtime also decreases by 4%. In cases where out-of-memory issues occur in a multi-threaded environment, the thread scheduler improved MBOPC runtime by up to 23%.
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize the CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. A multicore implementation using OpenMP on an 8-core processor provides up to 7.7× speed-up. The GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. The proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. The developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses.
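The sliding-window idea mentioned in the abstract above can be illustrated with a short sequential sketch in Python. It assumes that connectivity between two regions is measured by the Pearson correlation of their time-courses within each window; the abstract does not state the exact metric, and the function names and sample data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sliding_window_correlation(x, y, window, step=1):
    """Correlate x and y inside each window of the given length."""
    starts = range(0, len(x) - window + 1, step)
    return [pearson(x[s:s + window], y[s:s + window]) for s in starts]


# Hypothetical time-courses for two regions.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 5, 4, 3]
dfc = sliding_window_correlation(x, y, window=3)
```

Every window, and every pair of regions, is computed independently of the others, so the work maps naturally onto one thread (or one thread block) per window.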
NASA Astrophysics Data System (ADS)
Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav
2017-10-01
In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.
Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F
2018-03-01
Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.
Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization
NASA Astrophysics Data System (ADS)
Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo
2011-03-01
We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data translation rate between the cores of the CPU and RAM stored image data. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT-size x 500 A-scans) is achieved in our system using a wavelength swept source with 52-kHz swept frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec.
Multicore Challenges and Benefits for High Performance Scientific Computing
Nielsen, Ida M. B.; Janssen, Curtis L.
2008-01-01
Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
Potential Application of a Graphical Processing Unit to Parallel Computations in the NUBEAM Code
NASA Astrophysics Data System (ADS)
Payne, J.; McCune, D.; Prater, R.
2010-11-01
NUBEAM is a comprehensive computational Monte Carlo based model for neutral beam injection (NBI) in tokamaks. NUBEAM computes NBI-relevant profiles in tokamak plasmas by tracking the deposition and the slowing of fast ions. At the core of NUBEAM are vector calculations used to track fast ions. These calculations have recently been parallelized to run on MPI clusters. However, cost and interlink bandwidth limit the ability to fully parallelize NUBEAM on an MPI cluster. Recent implementation of double precision capabilities for Graphical Processing Units (GPUs) presents a cost effective and high performance alternative or complement to MPI computation. Commercially available graphics cards can achieve up to 672 GFLOPS double precision and can handle hundreds of thousands of threads. The ability to execute at least one thread per particle simultaneously could significantly reduce the execution time and the statistical noise of NUBEAM. Progress on implementation on a GPU will be presented.
Three dimensional simulations of viscous folding in diverging microchannels
NASA Astrophysics Data System (ADS)
Xu, Bingrui; Chergui, Jalel; Shin, Seungwon; Juric, Damir
2016-11-01
Three dimensional simulations on the viscous folding in diverging microchannels reported by Cubaud and Mason are performed using the parallel code BLUE for multi-phase flows. The more viscous liquid L1 is injected into the channel from the center inlet, and the less viscous liquid L2 from two side inlets. Liquid L1 takes the form of a thin filament due to hydrodynamic focusing in the long channel that leads to the diverging region. The thread then becomes unstable to a folding instability, due to the longitudinal compressive stress applied to it by the diverging flow of liquid L2. We performed a parameter study in which the flow rate ratio, the viscosity ratio, the Reynolds number, and the shape of the channel were varied relative to a reference model. In our simulations, the cross section of the thread produced by focusing is elliptical rather than circular. The initial folding axis can be either parallel or perpendicular to the narrow dimension of the chamber. In the former case, the folding slowly transforms via twisting to perpendicular folding, or it may remain parallel. The direction of folding onset is determined by the velocity profile and the elliptical shape of the thread cross section in the channel that feeds the diverging part of the cell.
Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR
NASA Astrophysics Data System (ADS)
Carroll-Nellenback, Jonathan J.; Shroyer, Brandon; Frank, Adam; Ding, Chen
2013-03-01
Current adaptive mesh refinement (AMR) simulations require algorithms that are highly parallelized and manage memory efficiently. As compute engines grow larger, AMR simulations will require algorithms that achieve new levels of efficient parallelization and memory management. We have attempted to employ new techniques to achieve both of these goals. Patch or grid based AMR often employs ghost cells to decouple the hyperbolic advances of each grid on a given refinement level. This decoupling allows each grid to be advanced independently. In AstroBEAR we utilize this independence by threading the grid advances on each level with preference going to the finer level grids. This allows for global load balancing instead of level by level load balancing and allows for greater parallelization across both physical space and AMR level. Threading of level advances can also improve performance by interleaving communication with computation, especially in deep simulations with many levels of refinement. While we see improvements of up to 30% on deep simulations run on a few cores, the speedup is typically more modest (5-20%) for larger scale simulations. To improve memory management we have employed a distributed tree algorithm that requires processors to only store and communicate local sections of the AMR tree structure with neighboring processors. Using this distributed approach we are able to get reasonable scaling efficiency (>80%) out to 12288 cores and up to 8 levels of AMR - independent of the use of threading.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.
2011-01-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, register usage is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and by maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved by reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
NASA Technical Reports Server (NTRS)
Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele
2004-01-01
In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms and discuss OpenMP implementation issues which affect the performance of multi-level parallel applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shin, J; Coss, D; McMurry, J
Purpose: To evaluate the efficiency of multithreaded Geant4 (Geant4-MT, version 10.0) for proton Monte Carlo dose calculations using a high performance computing facility. Methods: Geant4-MT was used to calculate 3D dose distributions in 1×1×1 mm³ voxels in a water phantom and a patient's head with a 150 MeV proton beam covering approximately 5×5 cm² in the water phantom. Three timestamps were measured on the fly to separately analyze the required time for initialization (which cannot be parallelized), processing time of individual threads, and completion time. Scalability of averaged processing time per thread was calculated as a function of thread number (1, 100, 150, and 200) for both 1 M and 50 M histories. The total memory usage was recorded. Results: Simulations with 50 M histories were fastest with 100 threads, taking approximately 1.3 hours and 6 hours for the water phantom and the CT data, respectively, with better than 1.0% statistical uncertainty. The calculations show 1/N scalability in the event loops for both cases. The gains from parallel calculations started to decrease with 150 threads. The memory usage increases linearly with the number of threads. No critical failures were observed during the simulations. Conclusion: Multithreading in Geant4-MT decreased simulation time in proton dose distribution calculations by a factor of 64 and 54 at a near optimal 100 threads for the water phantom and the patient's data, respectively. Further simulations will be done to determine the efficiency at the optimal thread number. Considering the trend of computer architecture development, utilizing Geant4-MT for radiotherapy simulations is an excellent cost-effective alternative to a distributed batch queuing system. However, because the scalability depends highly on simulation details, i.e., the ratio of the processing time of one event to the waiting time for access to the shared event queue, a performance evaluation as described is recommended.
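The speedup figures quoted in the abstract above (64 and 54 at 100 threads) can be put in context with two standard measures: parallel efficiency (speedup divided by thread count) and the Karp-Flatt experimentally determined serial fraction. The short Python sketch below applies both. This is an illustrative calculation, not part of the authors' analysis, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / threads


def karp_flatt(speedup, threads):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


# Speedups reported in the abstract at 100 threads.
eff_water = parallel_efficiency(64, 100)    # water phantom
eff_patient = parallel_efficiency(54, 100)  # patient CT data
serial_water = karp_flatt(64, 100)
serial_patient = karp_flatt(54, 100)
```

Under these definitions the efficiencies are 0.64 and 0.54, and the implied serial fractions are below one percent in both cases, which is consistent with the abstract's remark that only initialization cannot be parallelized.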
Community Detection on the GPU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh
We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem that parallelizes the access to individual edges. In this way we can fine tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive experiments show that we obtain speedups up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared memory implementations and is only one order of magnitude slower than the current fastest parallel Louvain method running on a Blue Gene/Q supercomputer using more than 500K threads.
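The abstract above does not give the rule used to scale threads with node degree. One plausible scheme, shown here purely as an assumption, rounds the degree up to the next power of two and caps it at a maximum group size. The function name `threads_for_degree` and the cap of 128 are hypothetical.

```python
def threads_for_degree(degree, max_threads=128):
    """Pick a thread-group size for a node from its degree.

    Returns the smallest power of two >= degree, capped at max_threads.
    Assumes max_threads is itself a power of two.
    """
    threads = 1
    while threads < degree and threads < max_threads:
        threads *= 2
    return threads


# Hypothetical degrees for a few nodes of a skewed network.
degrees = [0, 1, 3, 32, 1000]
assignment = [threads_for_degree(d) for d in degrees]
```

With a rule of this kind, low-degree nodes do not waste a full group of threads, while high-degree nodes have their edges processed by many threads at once.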
Support of Multidimensional Parallelism in the OpenMP Programming Model
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele
2003-01-01
OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test with benchmark codes and a cloud modeling code.
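To make "distributing the work in multiple dimensions" concrete, the following Python sketch maps each thread of a two-dimensional thread grid to a rectangular tile of a 2D iteration space. It does not reproduce the proposed OpenMP constructs, which the abstract does not name; the helpers `block_range` and `tile_for_thread` are hypothetical.

```python
def block_range(n, parts, index):
    """Half-open range [start, stop) of block `index` when n items are
    split into `parts` nearly equal contiguous blocks."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def tile_for_thread(thread_id, grid, shape):
    """Row and column ranges owned by a thread in a rows x cols grid."""
    rows, cols = grid
    r, c = divmod(thread_id, cols)
    return block_range(shape[0], rows, r), block_range(shape[1], cols, c)


# Hypothetical case: 6 threads arranged 2 x 3 over a 10 x 7 loop nest.
grid, shape = (2, 3), (10, 7)
tiles = [tile_for_thread(t, grid, shape) for t in range(grid[0] * grid[1])]
```

Splitting both loop dimensions keeps every thread busy even when the outer dimension alone has fewer iterations than there are threads, which is one of the scalability limits the abstract alludes to.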
Komarov, Ivan; D'Souza, Roshan M
2012-01-01
The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
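For readers unfamiliar with the algorithm being accelerated, the sketch below is a plain sequential version of Gillespie's direct method in Python. It is not the GPU variant described in the abstract above. The reaction representation (a propensity function paired with a state change) and the example reaction A → B with rate constant 0.5 are assumptions made for illustration.

```python
import math
import random


def gillespie_direct(initial, reactions, t_end, rng):
    """Simulate one trajectory with Gillespie's direct method.

    `reactions` is a list of (propensity_fn, change) pairs, where
    propensity_fn(state) returns a rate and change maps species to deltas.
    """
    state = dict(initial)
    t = 0.0
    trajectory = [(t, dict(state))]
    while True:
        props = [fn(state) for fn, _ in reactions]
        total = sum(props)
        if total <= 0.0:
            break  # no reaction can fire any more
        # Exponentially distributed waiting time; 1 - u lies in (0, 1].
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Choose a reaction with probability proportional to its propensity.
        threshold = rng.random() * total
        chosen = len(reactions) - 1  # fallback guards against rounding
        acc = 0.0
        for i, p in enumerate(props):
            acc += p
            if threshold < acc:
                chosen = i
                break
        for species, delta in reactions[chosen][1].items():
            state[species] += delta
        trajectory.append((t, dict(state)))
    return trajectory


# Hypothetical model: A -> B with propensity 0.5 * A.
reactions = [(lambda s: 0.5 * s["A"], {"A": -1, "B": +1})]
traj = gillespie_direct({"A": 20, "B": 0}, reactions, float("inf"), random.Random(0))
```

Each call produces a single realization. The need to repeat such runs many times, and to recompute propensities at every step, is the cost the abstract refers to.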
XMOS XC-2 Development Board for Mechanical Control and Data Collection
NASA Technical Reports Server (NTRS)
Jarnot, Robert F.; Bowden, William J.
2011-01-01
The scanning microwave limb sounder (SMLS) will use technological improvements in low-noise mixers to provide precise data on the Earth's atmospheric composition with high spatial resolution. This project focuses on the design and implementation of a real-time control system needed for airborne engineering tests of the SMLS. The system must coordinate the actuation of optical components using four motors with encoder readback, while collecting synchronized telemetric data from a GPS receiver and 3-axis gyrometric system. A graphical user interface for testing the control system was also designed using Python. Although the system could have been implemented with an FPGA (field-programmable gate array) based setup, a processor development kit manufactured by XMOS was chosen. The XMOS architecture allows parallel execution of multiple tasks on separate threads, making it ideal for this application. It is easily programmed using XC (a subset of C). The necessary communication interfaces were implemented in software, including Ethernet, with significant cost and time reduction compared to an FPGA-based approach. A simple approach to control the chopper, calibration mirror, and gimbal for the airborne SMLS was needed. The XMOS board allows for multiple threads and real-time data acquisition. The XC-2 development kit is an attractive choice for synchronized, real-time, event-driven applications. The XMOS is based on the transputer microprocessor architecture developed for parallel computing, which is being revamped in this new platform. The XMOS device has multiple cores capable of running parallel applications on separate threads. The threads communicate with each other via user-defined channels capable of transmitting data within the device. XMOS provides a C-based development environment using XC, which eliminates the need for custom tool kits associated with FPGA programming. The XC-2 has four cores and the necessary hardware for Ethernet I/O.
Scaling Up Coordinate Descent Algorithms for Large ℓ1 Regularization Problems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Scherrer, Chad; Halappanavar, Mahantesh; Tewari, Ambuj
2012-07-03
We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases---Thread-Greedy CD and Coloring-Based CD---and give performance measurements for an OpenMP implementation of these.
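As background for the sequential special case named in the abstract above, here is a minimal Python sketch of Cyclic CD for an ℓ1-regularized problem. It assumes a squared-error loss, that is, the objective (1/2)‖y − Xw‖² + λ‖w‖₁, which the abstract does not specify. The function names and the tiny data set are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam (proximal step for the l1 penalty)."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def cyclic_cd_lasso(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw, since w starts at zero
    for _ in range(sweeps):
        for j in range(d):
            z = sum(X[i][j] ** 2 for i in range(n))
            if z == 0.0:
                continue  # an all-zero column cannot change the fit
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n))
            new_wj = soft_threshold(rho, lam) / z
            delta = new_wj - w[j]
            for i in range(n):
                residual[i] -= X[i][j] * delta
            w[j] = new_wj
    return w


# Hypothetical orthogonal design, where the solution is a soft threshold of y.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
w = cyclic_cd_lasso(X, y, lam=1.0)
```

The parallel variants in the abstract differ in which coordinates are updated together; this sketch updates one coordinate at a time and keeps the residual current after each update.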
Roofline model toolkit: A practical tool for architectural and program analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian
We present preliminary results of the Roofline Toolkit for multicore, manycore, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented microbenchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory management mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.
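The Roofline model referred to in the abstract above bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. A minimal Python sketch follows; the machine numbers are made up for illustration and are not taken from the paper.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s for a kernel with the given arithmetic
    intensity (FLOPs per byte moved to or from memory)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
PEAK, BANDWIDTH = 100.0, 50.0
memory_bound = roofline(PEAK, BANDWIDTH, 1.0)
compute_bound = roofline(PEAK, BANDWIDTH, 4.0)
ridge = ridge_point(PEAK, BANDWIDTH)
```

Kernels whose intensity falls below the ridge point are limited by bandwidth, so measured values of the two machine parameters, such as those the toolkit's microbenchmarks aim to provide, determine where that boundary lies.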
Flexbar 3.0 - SIMD and multicore parallelization.
Roehr, Johannes T; Dieterich, Christoph; Reinert, Knut
2017-09-15
High-throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next-generation sequencing data. Flexbar performs demultiplexing based on barcodes and adapter trimming for such data. The massive amounts of data generated on modern sequencing machines demand that this preprocessing is done as efficiently as possible. We present Flexbar 3.0, the successor of the popular program Flexbar. It now employs twofold parallelism: multi-threading and, additionally, SIMD vectorization. Both types of parallelism are used to speed up the computation of pair-wise sequence alignments, which are used for the detection of barcodes and adapters. Furthermore, new features were included to cover a wide range of applications. We evaluated the performance of Flexbar based on a simulated sequencing dataset. Our program outcompetes other tools in terms of speed and is among the best tools in the presented quality benchmark. Availability: https://github.com/seqan/flexbar. Contact: johannes.roehr@fu-berlin.de or knut.reinert@fu-berlin.de.
NASA Astrophysics Data System (ADS)
Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
2016-06-01
CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on a heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove many redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern, MPI-OpenMP-CUDA, that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on a heterogeneous platform.
Adapter plate assembly for adjustable mounting of objects
Blackburn, R.S.
1986-05-02
An adapter plate and two locking discs are together affixed to an optic table with machine screws or bolts threaded into a fixed array of internally threaded holes provided in the table surface. The adapter plate preferably has two, and preferably parallel, elongated locating slots each freely receiving a portion of one of the locking discs for secure affixation of the adapter plate to the optic table. A plurality of threaded apertures provided in the adapter plate are available to attach optical mounts or other devices onto the adapter plate in an orientation not limited by the disposition of the array of threaded holes in the table surface. An axially aligned but radially offset hole through each locking disc receives a screw that tightens onto the table, such that prior to tightening of the screw the locking disc may rotate and translate within each locating slot of the adapter plate for maximum flexibility of the orientation thereof.
Adapter plate assembly for adjustable mounting of objects
Blackburn, Robert S.
1987-01-01
An adapter plate and two locking discs are together affixed to an optic table with machine screws or bolts threaded into a fixed array of internally threaded holes provided in the table surface. The adapter plate preferably has two, and preferably parallel, elongated locating slots each freely receiving a portion of one of the locking discs for secure affixation of the adapter plate to the optic table. A plurality of threaded apertures provided in the adapter plate are available to attach optical mounts or other devices onto the adapter plate in an orientation not limited by the disposition of the array of threaded holes in the table surface. An axially aligned but radially offset hole through each locking disc receives a screw that tightens onto the table, such that prior to tightening of the screw the locking disc may rotate and translate within each locating slot of the adapter plate for maximum flexibility of the orientation thereof.
A Massively Parallel Computational Method of Reading Index Files for SOAPsnv.
Zhu, Xiaoqian; Peng, Shaoliang; Liu, Shaojie; Cui, Yingbo; Gu, Xiang; Gao, Ming; Fang, Lin; Fang, Xiaodong
2015-12-01
SOAPsnv is software used for identifying single nucleotide variation in cancer genes. However, its performance has yet to match the massive amount of data to be processed. Experiments reveal that the main performance bottleneck of SOAPsnv is the pileup algorithm. The original pileup algorithm's I/O process is time-consuming and reads input files inefficiently. Moreover, the scalability of the pileup algorithm is poor. Therefore, we designed a new algorithm, named BamPileup, which aims to improve the performance of sequential reads and implements a parallel read mode based on an index. Using this method, each thread can directly read data starting from a specific position. Experiments on the Tianhe-2 supercomputer show that, when reading data with multi-threaded parallel I/O, the processing time of the algorithm is reduced to 3.9 s and the application program can achieve a speedup of up to 100×. Moreover, the scalability of the new algorithm is satisfactory.
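The idea of an index-based parallel read, in which each thread seeks directly to its own starting position, can be sketched as follows. This is only an illustration on a plain newline-delimited file: the functions build_index, count_records and parallel_count are hypothetical names, and the sketch does not reproduce BamPileup or the BAM index format.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def build_index(path, n_parts):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_parts):
            f.seek(size * i // n_parts)
            f.readline()  # advance to the start of the next record
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def count_records(path, start, end):
    """Each worker opens its own handle and seeks straight to its range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")


def parallel_count(path, n_threads=4):
    """Count records by letting every thread read a disjoint indexed range."""
    ranges = build_index(path, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(lambda r: count_records(path, *r), ranges))
```

Because every worker owns a separate file handle and a disjoint byte range, no thread has to scan the file from the beginning or coordinate its reads with the others.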
CCC7-119 Reactive Molecular Dynamics Simulations of Hot Spot Growth in Shocked Energetic Materials
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thompson, Aidan P.
2015-03-01
The purpose of this work is to understand how defects control initiation in energetic materials used in stockpile components. Sequoia gives us the core count to run very large-scale simulations of up to 10 million atoms. Using an OpenMP-threaded implementation of the ReaxFF package in LAMMPS, we have been able to get good parallel efficiency running on 16k nodes of Sequoia, with 1 hardware thread per core.
A Tutorial on Parallel and Concurrent Programming in Haskell
NASA Astrophysics Data System (ADS)
Peyton Jones, Simon; Singh, Satnam
This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
Software Defined Radio with Parallelized Software Architecture
NASA Technical Reports Server (NTRS)
Heckler, Greg
2013-01-01
This software implements software-defined radio processing over multicore, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to approx. 50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
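The architecture described here, with one thread per processing block and pipes as the buffers between blocks, can be sketched in a few lines. The sketch below is an assumption-laden illustration in Python rather than the C/C++ of the described software; the names block and run_flow_graph are hypothetical, and only a linear chain of blocks is shown.

```python
import os
import threading


def block(func, read_fd, write_fd):
    """Run one processing step in its own thread, streaming pipe to pipe."""
    def run():
        with os.fdopen(read_fd, "rb") as src, os.fdopen(write_fd, "wb") as dst:
            for line in src:
                dst.write(func(line))
        # Leaving the with-block closes the write end, signalling EOF downstream.
    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_flow_graph(stages, samples):
    """Chain the stages with pipes and push newline-terminated samples through."""
    first_r, first_w = os.pipe()
    read_fd, threads = first_r, []
    for func in stages:
        next_r, next_w = os.pipe()
        threads.append(block(func, read_fd, next_w))
        read_fd = next_r

    def feed():
        with os.fdopen(first_w, "wb") as source:
            source.writelines(samples)

    # Feed from a separate thread so a full pipe cannot stall the reader below.
    feeder = threading.Thread(target=feed)
    feeder.start()
    with os.fdopen(read_fd, "rb") as sink:
        out = sink.readlines()
    for thread in threads + [feeder]:
        thread.join()
    return out
```

For example, run_flow_graph([bytes.upper], [b"abc\n"]) passes one sample through a single block. Adding a stage only adds one thread and one pipe, which is the property that lets such a design spread across however many cores are available.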
Using all of your CPU's in HIPE
NASA Astrophysics Data System (ADS)
Jacobson, J. D.; Fadda, D.
2012-09-01
Modern computer architectures increasingly feature multi-core CPUs. For example, the MacBook Pro features Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of multiple-processor architectures. Up to now, the software for Herschel data reduction (HIPE), written in Jython and Java, has been single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We show how a task to correct transients in the PACS Spectroscopy Pipeline for the unchopped line scan mode has been threaded. This computation-intensive task uses either a one-parameter or a three-parameter exponential function to characterize the transient. The task uses a Java implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) versions by the authors, to optimize the correction parameters. We also explain how to determine whether a task can benefit from threading (Amdahl's Law) and whether it is safe to thread. The design and implementation, using the Java concurrency package's completion service, are described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
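Amdahl's Law, mentioned in this abstract as the test of whether a task can benefit from threading, can be written as a short function. The sketch below is a generic statement of the law; the parallel fractions used in the example are assumed values, not figures from the HIPE task.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Ideal speedup when only `parallel_fraction` of the run time can be threaded."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


if __name__ == "__main__":
    # Assumed fractions, for illustration: eight threads, as on a
    # hyper-threaded quad-core processor.
    for p in (0.5, 0.9, 0.99):
        print(p, round(amdahl_speedup(p, 8), 2))
```

Even with many threads, the speedup can never exceed 1 / (1 - parallel_fraction), so a task with a large serial portion gains little from threading.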
Liu, Nishuang; Ma, Wenzhen; Tao, Jiayou; Zhang, Xianghui; Su, Jun; Li, Luying; Yang, Congxing; Gao, Yihua; Golberg, Dmitri; Bando, Yoshio
2013-09-20
A novel cable-type flexible supercapacitor with excellent performance is fabricated using 3D polypyrrole (PPy)-MnO2-CNT-cotton thread multi-grade nanostructure-based electrodes. Multiple supercapacitors with a high areal capacitance of 1.49 F cm⁻² at a scan rate of 1 mV s⁻¹, connected in series and in parallel, can successfully drive an LED segment display. Such excellent performance is attributed to the cumulative effect of conducting single-walled carbon nanotubes on cotton thread, active mesoporous flower-like MnO2 nanoplates, and a PPy conductive wrapping layer that improves the conductivity and acts as a pseudocapacitance material simultaneously.
Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics.
Ragothaman, Anjani; Boddu, Sairam Chowdary; Kim, Nayong; Feinstein, Wei; Brylinski, Michal; Jha, Shantenu; Kim, Joohyun
2014-01-01
While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences from even small genomes such as prokaryotes is large, typically many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, and is particularly amenable to small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.
NASA Astrophysics Data System (ADS)
Mills, R. T.
2014-12-01
As the high performance computing (HPC) community pushes towards the exascale horizon, the importance and prevalence of fine-grained parallelism in new computer architectures is increasing. This is perhaps most apparent in the proliferation of so-called "accelerators" such as the Intel Xeon Phi or NVIDIA GPGPUs, but the trend also holds for CPUs, where serial performance has grown slowly and effective use of hardware threads and vector units is becoming increasingly important to realizing high performance. This has significant implications for weather, climate, and Earth system modeling codes, many of which display impressive scalability across MPI ranks but take relatively little advantage of threading and vector processing. In addition to increasing parallelism, next generation codes will also need to address increasingly deep hierarchies for data movement: NUMA/cache levels, on node vs. off node, local vs. wide neighborhoods on the interconnect, and even in the I/O system. We will discuss some approaches (grounded in experiences with the Intel Xeon Phi architecture) for restructuring Earth science codes to maximize concurrency across multiple levels (vectors, threads, MPI ranks), and also discuss some novel approaches for minimizing expensive data movement/communication.
OpenMP performance for benchmark 2D shallow water equations using LBM
NASA Astrophysics Data System (ADS)
Sabri, Khairul; Rabbani, Hasbi; Gunawan, Putu Harry
2018-03-01
Shallow water equations, commonly referred to as Saint-Venant equations, are used to model fluid phenomena. These equations can be solved numerically using several methods, such as the lattice Boltzmann method (LBM), SIMPLE-like methods, the finite difference method, Godunov-type methods, and the finite volume method. In this paper, the shallow water equations are approximated using LBM, known as LABSWE, and the simulation is evaluated in terms of parallel programming performance using OpenMP. To evaluate the performance of the parallel algorithm with 2 and 4 threads, ten different grid sizes Lx and Ly are examined. The results show that, using the OpenMP platform, the computational time for solving LABSWE can be decreased. For instance, using a grid size of 1000 × 500, the speedup of 2 and 4 threads is observed 93.54 s and 333.243 s respectively.
High-Performance, Multi-Node File Copies and Checksums for Clustered File Systems
NASA Technical Reports Server (NTRS)
Kolano, Paul Z.; Ciotti, Robert B.
2012-01-01
Modern parallel file systems achieve high performance using a variety of techniques, such as striping files across multiple disks to increase aggregate I/O bandwidth and spreading disks across multiple servers to increase aggregate interconnect bandwidth. To achieve peak performance from such systems, it is typically necessary to utilize multiple concurrent readers/writers from multiple systems to overcome various single-system limitations, such as number of processors and network bandwidth. The standard cp and md5sum tools of GNU coreutils found on every modern Unix/Linux system, however, utilize a single execution thread on a single CPU core of a single system, and hence cannot take full advantage of the increased performance of clustered file systems. Mcp and msum are drop-in replacements for the standard cp and md5sum programs that utilize multiple types of parallelism and other optimizations to achieve maximum copy and checksum performance on clustered file systems. Multi-threading is used to ensure that nodes are kept as busy as possible. Read/write parallelism allows individual operations of a single copy to be overlapped using asynchronous I/O. Multinode cooperation allows different nodes to take part in the same copy/checksum. Split-file processing allows multiple threads to operate concurrently on the same file. Finally, hash trees allow inherently serial checksums to be performed in parallel. Mcp and msum provide significant performance improvements over standard cp and md5sum using multiple types of parallelism and other optimizations. The total speed-ups from all improvements are significant. Mcp improves cp performance over 27x, msum improves md5sum performance almost 19x, and the combination of mcp and msum improves verified copies via cp and md5sum by almost 22x. These improvements come in the form of drop-in replacements for cp and md5sum, so are easily used and are available for download as open source software at http://mutil.sourceforge.net.
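The remark that hash trees allow inherently serial checksums to be computed in parallel can be illustrated with a minimal sketch: the data are split into chunks, the chunks are hashed independently by a thread pool, and the digests are then combined pairwise up to a single root. The chunk size, the pairing scheme and the function names are assumptions made for this illustration and do not describe msum's actual tree layout.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def leaf_digests(data, chunk_size, workers=4):
    """Hash fixed-size chunks independently; this is the parallel step."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: hashlib.md5(c).digest(), chunks))


def tree_checksum(data, chunk_size=1 << 20, workers=4):
    """Combine leaf digests pairwise until a single root digest remains."""
    level = leaf_digests(data, chunk_size, workers)
    while len(level) > 1:
        pairs = [b"".join(level[i:i + 2]) for i in range(0, len(level), 2)]
        level = [hashlib.md5(p).digest() for p in pairs]
    return level[0].hex()
```

Note that, for inputs longer than one chunk, the root digest differs from the plain MD5 of the whole input, so both sides of a verified copy would have to use the same tree scheme and chunk size.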
NASA Astrophysics Data System (ADS)
Riekel, C.; Craig, C. L.; Burghammer, M.; Müller, M.
2001-01-01
Scanning X-ray microdiffraction (SXD) permits the 'imaging' in-situ of crystalline phases, crystallinity and texture in whole biopolymer samples on the micrometre scale. SXD complements transmission electron microscopy (TEM) techniques, which reach sub-nanometre lateral resolution but require thin sections and a vacuum environment. This is demonstrated using a support thread from a web spun by the orb-weaving spider Eriophora fuliginea (C.L. Koch). Scanning electron microscopy (SEM) shows a central thread composed of two fibres to which thinner fibres are loosely attached. SXD of a piece of support thread approximately 60 µm long shows in addition the presence of nanometre-sized crystallites with the β-poly(L-alanine) structure in all fibres. The crystallinity of the thin fibres appears to be higher than that of the central thread, which probably reflects a higher polyalanine content of the fibroins. The molecular axis of the polymer chains in the central thread is orientated parallel to the macroscopic fibre axis, but in the thin fibres the molecular axis is tilted by about 71° to the macroscopic fibre axis. A helical model is tentatively proposed to describe this morphology. The central thread has a homogeneous distribution of crystallinity along the macroscopic fibre axis.
THE THERMAL INSTABILITY OF SOLAR PROMINENCE THREADS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Soler, R.; Goossens, M.; Ballester, J. L., E-mail: roberto.soler@wis.kuleuven.be
The fine structure of solar prominences and filaments appears as thin and long threads in high-resolution images. In Hα observations of filaments, some threads can be observed for only 5-20 minutes before they seem to fade and eventually disappear, suggesting that these threads may have very short lifetimes. The presence of an instability might be the cause of this quick disappearance. Here, we study the thermal instability of prominence threads as an explanation of their sudden disappearance from Hα observations. We model a prominence thread as a magnetic tube with prominence conditions embedded in a coronal environment. We assume a variation of the physical properties in the transverse direction so that the temperature and density continuously change from internal to external values in an inhomogeneous transitional layer representing the particular prominence-corona transition region (PCTR) of the thread. We use the nonadiabatic and resistive magnetohydrodynamic equations, which include terms due to thermal conduction parallel and perpendicular to the magnetic field, radiative losses, heating, and magnetic diffusion. We combine both analytical and numerical methods to study linear perturbations from the equilibrium state, focusing on unstable thermal solutions. We find that thermal modes are unstable in the PCTR for temperatures higher than 80,000 K, approximately. These modes are related to temperature disturbances that can lead to changes in the equilibrium due to rapid plasma heating or cooling. For typical prominence parameters, the instability timescale is of the order of a few minutes and is independent of the form of the temperature profile within the PCTR of the thread. This result indicates that thermal instability may play an important role for the short lifetimes of threads in the observations.
Dynamics of threading dislocations in porous heteroepitaxial GaN films
NASA Astrophysics Data System (ADS)
Gutkin, M. Yu.; Rzhavtsev, E. A.
2017-12-01
Behavior of threading dislocations in porous heteroepitaxial gallium nitride (GaN) films has been studied using computer simulation by the two-dimensional discrete dislocation dynamics approach. A computational scheme, where pores are modeled as cross sections of cylindrical cavities, elastically interacting with unidirectional parallel edge dislocations, which imitate threading dislocations, is used. Time dependences of coordinates and velocities of each dislocation from dislocation ensembles under investigation are obtained. Visualization of current structure of dislocation ensemble is performed in the form of a location map of dislocations at any time. It has been shown that the density of appearing dislocation structures significantly depends on the ratio of area of a pore cross section to area of the simulation region. In particular, increasing the portion of pores surface on the layer surface up to 2% should lead to about a 1.5-times decrease of the final density of threading dislocations, and increase of this portion up to 15% should lead to approximately a 4.5-times decrease of it.
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Data communications in a parallel active messaging interface ('PAMI') of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution of a compute node, including specification of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications instruction, the instruction characterized by instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.
Reader set encoding for directory of shared cache memory in multiprocessor system
Ahn, Daniel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Xiaotong, Zhuang
2014-06-10
In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
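A bitset encoding of a reader set, as mentioned in this abstract, can be sketched with one bit per speculative thread ID. The class and method names below are hypothetical, and the conflict rule shown (a write conflicts with every other thread that has already read the line) is a simplifying assumption for illustration; the abstract does not spell out the directory's actual conflict policy.

```python
class ReaderSet:
    """Bitset of speculative thread IDs that have read one cache line."""

    def __init__(self):
        self.bits = 0

    def record_read(self, thread_id):
        """Set the bit belonging to the reading thread."""
        self.bits |= 1 << thread_id

    def has_read(self, thread_id):
        """Check whether the given thread's bit is set."""
        return bool((self.bits >> thread_id) & 1)

    def conflicting_readers(self, writer_id):
        """Threads other than the writer that have read the line (assumed rule)."""
        others = self.bits & ~(1 << writer_id)
        return [t for t in range(others.bit_length()) if (others >> t) & 1]
```

With this encoding, recording a read and checking for readers are single bitwise operations on the directory entry, regardless of how many threads are being tracked within the width of the bitset.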
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel
2014-07-22
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design
NASA Astrophysics Data System (ADS)
Debes, Eric; Kaine, Greg
2002-11-01
In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed, highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 times faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using the Intel® VTune™ performance analyzer. The experimental measurements are compared to theoretical results.
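The reported figure of a 3.2-fold speedup on a 4-way SMP machine can be turned into two common scaling metrics: the parallel efficiency and the experimentally determined serial fraction (the Karp-Flatt metric). The sketch below applies these standard formulas to the numbers quoted in the abstract; the derived values are simple arithmetic and are not results reported by the authors.

```python
def parallel_efficiency(speedup, n_procs):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / n_procs


def serial_fraction(speedup, n_procs):
    """Karp-Flatt metric: the serial fraction implied by a measured speedup."""
    if n_procs < 2:
        raise ValueError("the metric needs at least two processors")
    return (1.0 / speedup - 1.0 / n_procs) / (1.0 - 1.0 / n_procs)


if __name__ == "__main__":
    # Speedup and processor count as quoted in the abstract.
    print(parallel_efficiency(3.2, 4))
    print(serial_fraction(3.2, 4))
```

Under these formulas the quoted measurement corresponds to an efficiency of 80% and an implied serial fraction of about 8%, which lumps together truly serial code and parallel overheads such as synchronization and memory traffic.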
Implementation of the NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan (Technical Monitor)
2002-01-01
Several features make Java an attractive choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for CFD applications.
Performance and Scalability of the NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan A. (Technical Monitor)
2002-01-01
Several features make Java an attractive choice for scientific applications. In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for scientific applications.
Robust Parallel Motion Estimation and Mapping with Stereo Cameras in Underground Infrastructure
NASA Astrophysics Data System (ADS)
Liu, Chun; Li, Zhengning; Zhou, Yuan
2016-06-01
We developed a novel robust motion estimation method for localization and mapping in underground infrastructure using a pre-calibrated rigid stereo camera rig. Localization and mapping in underground infrastructure is important to safety. Yet it is also nontrivial, since most underground infrastructures have poor lighting conditions and featureless structure. In overcoming these difficulties, we found that a parallel system is more efficient than the EKF-based SLAM approach, since the parallel system divides the motion estimation and 3D mapping tasks into separate threads, eliminating the data-association problem, which is a significant issue in SLAM. Moreover, the motion estimation thread takes advantage of a state-of-the-art robust visual odometry algorithm, which is highly functional under low illumination and provides accurate pose information. We designed and built an unmanned vehicle and used the vehicle to collect a dataset in an underground garage. The parallel system was evaluated on this real dataset. Motion estimation results indicated a relative position error of 0.3%, and 3D mapping results showed a mean position error of 13 cm. Off-line processing reduced the position error to 2 cm. Performance evaluation on the real dataset showed that our system is capable of robust motion estimation and accurate 3D mapping in poorly illuminated and featureless underground environments.
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks.
Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-Hoon
2017-05-22
In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. Like its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, the separation between the regular pheromone and the virtual pheromone in the diffusion process, and the exploration of routes taking into consideration the number of hops in the best routes. A family of ACO routing protocols based on AntOR is also specified; these protocols are successive refinements of the protocol. We also present a parallelized version of AntOR that we call PAntOR. Targeting multiprocessor architectures based on the shared memory protocol, PAntOR allows running tasks in parallel using threads. This parallelization is applicable in the route setup phase, the route local repair process and link failure notification. In addition, a variant of PAntOR that has more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach uses threads to parallelize the sending of broadcast messages by interface.
Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shan, Hongzhang; Williams, Samuel; Jong, Wibe de
In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65x better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.
Thread-level parallelization and optimization of NWChem for the Intel MIC architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shan, Hongzhang; Williams, Samuel; de Jong, Wibe
In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.
Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics
Ragothaman, Anjani; Feinstein, Wei; Jha, Shantenu; Kim, Joohyun
2014-01-01
While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences from even small genomes such as prokaryotes is large, typically many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, and is particularly amenable to small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure. PMID:24995285
NASA Astrophysics Data System (ADS)
Sorokin, V. A.; Volkov, Yu V.; Sherstneva, A. I.; Botygin, I. A.
2016-11-01
This paper overviews a method of generating climate regions based on an analytic signal theory. When applied to atmospheric surface layer temperature data sets, the method allows forming climatic structures with the corresponding changes in the temperature to make conclusions on the uniformity of climate in an area and to trace the climate changes in time by analyzing the type group shifts. The algorithm is based on the fact that the frequency spectrum of the thermal oscillation process is narrow-banded and has only one mode for most weather stations. This allows using the analytic signal theory, causality conditions and introducing an oscillation phase. The annual component of the phase, being a linear function, was removed by the least squares method. The remaining phase fluctuations allow consistent studying of their coordinated behavior and timing, using the Pearson correlation coefficient for dependence evaluation. This study includes program experiments to evaluate the calculation efficiency in the phase grouping task. The paper also overviews some single-threaded and multi-threaded computing models. It is shown that the phase grouping algorithm for meteorological data can be parallelized and that a multi-threaded implementation leads to a 25-30% increase in the performance.
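The processing chain described in this abstract (analytic signal, oscillation phase, removal of the linear annual component by least squares, Pearson correlation of the remaining fluctuations) can be sketched in a few lines. The following is only an illustrative sketch using the Python standard library, not the authors' implementation: the function names, the naive O(n^2) DFT and the removal of the series mean before forming the analytic signal are all assumptions made for the example.

```python
import cmath
import math


def analytic_signal(series):
    """Analytic signal via a naive O(n^2) DFT (assumed approach, for illustration).

    The mean is removed first so that the phase winds around the origin;
    negative frequencies are zeroed and positive ones doubled.
    """
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2      # positive frequencies
        elif k > n / 2:
            spectrum[k] = 0       # negative frequencies
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def remove_linear_trend(y):
    """Subtract the least-squares straight line from y."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    return [v - (y_mean + slope * (t - t_mean)) for t, v in enumerate(y)]


def phase_fluctuations(series):
    """Unwrapped phase of the analytic signal with its linear part removed."""
    z = analytic_signal(series)
    phase = [cmath.phase(z[0])]
    for prev, cur in zip(z, z[1:]):
        # Phase increment between consecutive samples, in (-pi, pi].
        phase.append(phase[-1] + cmath.phase(cur * prev.conjugate()))
    return remove_linear_trend(phase)


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(
        sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)
    )
```

Two stations whose phase fluctuations give a Pearson coefficient close to 1 would, under this reading, be candidates for the same group. Each station pair can be evaluated independently, which is the property that makes the grouping step a natural fit for multi-threading.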
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shi, Yuqian; Hellinga, Homme W.; Beese, Lorena S.
Human exonuclease 1 (hExo1) is a member of the RAD2/XPG structure-specific 5'-nuclease superfamily. Its dominant, processive 5'–3' exonuclease and secondary 5'-flap endonuclease activities participate in various DNA repair, recombination, and replication processes. A single active site processes both recessed ends and 5'-flap substrates. By initiating enzyme reactions in crystals, we have trapped hExo1 reaction intermediates that reveal structures of these substrates before and after their exo- and endonucleolytic cleavage, as well as structures of uncleaved, unthreaded, and partially threaded 5' flaps. Their distinctive 5' ends are accommodated by a small, mobile arch in the active site that binds recessed ends at its base and threads 5' flaps through a narrow aperture within its interior. A sequence of successive, interlocking conformational changes guides the two substrate types into a shared reaction mechanism that catalyzes their cleavage by an elaborated variant of the two-metal, in-line hydrolysis mechanism. Coupling of substrate-dependent arch motions to transition-state stabilization suppresses inappropriate or premature cleavage, enhancing processing fidelity. The striking reduction in flap conformational entropy is catalyzed, in part, by arch motions and transient binding interactions between the flap and unprocessed DNA strand. At the end of the observed reaction sequence, hExo1 resets without relinquishing DNA binding, suggesting a structural basis for its processivity.
Shi, Yuqian; Hellinga, Homme W; Beese, Lorena S
2017-06-06
Human exonuclease 1 (hExo1) is a member of the RAD2/XPG structure-specific 5'-nuclease superfamily. Its dominant, processive 5'-3' exonuclease and secondary 5'-flap endonuclease activities participate in various DNA repair, recombination, and replication processes. A single active site processes both recessed ends and 5'-flap substrates. By initiating enzyme reactions in crystals, we have trapped hExo1 reaction intermediates that reveal structures of these substrates before and after their exo- and endonucleolytic cleavage, as well as structures of uncleaved, unthreaded, and partially threaded 5' flaps. Their distinctive 5' ends are accommodated by a small, mobile arch in the active site that binds recessed ends at its base and threads 5' flaps through a narrow aperture within its interior. A sequence of successive, interlocking conformational changes guides the two substrate types into a shared reaction mechanism that catalyzes their cleavage by an elaborated variant of the two-metal, in-line hydrolysis mechanism. Coupling of substrate-dependent arch motions to transition-state stabilization suppresses inappropriate or premature cleavage, enhancing processing fidelity. The striking reduction in flap conformational entropy is catalyzed, in part, by arch motions and transient binding interactions between the flap and unprocessed DNA strand. At the end of the observed reaction sequence, hExo1 resets without relinquishing DNA binding, suggesting a structural basis for its processivity.
Hine, P M; Wakefield, St J; Mackereth, G; Morrison, R
2016-09-26
The morphogenesis of large icosahedral viruses associated with lymphocystis-like lesions in the skin of parore Girella tricuspidata is described. The electron-lucent perinuclear viromatrix comprised putative DNA with open capsids at the periphery, very large arrays of smooth endoplasmic reticulum (sER), much of it with a reticulated appearance (rsER) or occurring as rows of vesicles. Lysosomes, degenerating mitochondria and virions in various stages of assembly, and paracrystalline arrays were also present. Long electron-dense inclusions (EDIs) with 15 nm repeating units split terminally and curled to form tubular structures internalising the 15 nm repeating structures. These tubular structures appeared to form the virion capsids. Large parallel arrays of sER sometimes alternated with aligned arrays of crinkled cisternae along which passed a uniformly wide (20 nm) thread-like structure. Strings of small vesicles near open capsids may also have been involved in formation of an inner lipid layer. Granules with a fine fibrillar appearance also occurred in the viromatrix, and from the presence of a halo around mature virions it appeared that the fibrils may form a layer around the capsid. The general features of virogenesis of large icosahedral dsDNA viruses, the large amount of ER, particularly rsER and the EDIs, are features of nucleo-cytoplasmic large DNA viruses, rather than features of 1 genus or family.
Toward Enhancing OpenMP's Work-Sharing Directives
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chapman, B M; Huang, L; Jin, H
2006-05-17
OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.
Implementation of NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Schultz, Matthew; Jin, Hao-Qiang; Yan, Jerry
2000-01-01
A number of features make Java an attractive but debatable choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would move Java closer to Fortran in the competition for CFD applications.
Hierarchical Parallelization of Gene Differential Association Analysis
2011-01-01
Background Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Results Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. Conclusions The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels. PMID:21936916
Hierarchical parallelization of gene differential association analysis.
Needham, Mark; Hu, Rui; Dwarkadas, Sandhya; Qiu, Xing
2011-09-21
Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels.
Performance of the Heavy Flavor Tracker (HFT) detector in star experiment at RHIC
NASA Astrophysics Data System (ADS)
Alruwaili, Manal
With advancing technology, the number of processors is becoming massive. Current supercomputer processing will be available on desktops in the next decade. For mass scale application software development on massive parallel computing available on desktops, existing popular languages with large libraries have to be augmented with new constructs and paradigms that exploit massive parallel computing and distributed memory models while retaining user-friendliness. Currently available object oriented languages for massive parallel computing such as Chapel, X10 and UPC++ exploit distributed computing, data parallel computing and thread-parallelism at the process level in the PGAS (Partitioned Global Address Space) memory model. However, they lack: 1) any extension for object distribution to exploit the PGAS model; 2) the flexibility of migrating or cloning an object between places to exploit load balancing; and 3) the programming paradigms that will result from the integration of data and thread-level parallelism and object distribution. In the proposed thesis, I compare different languages in the PGAS model; propose new constructs that extend C++ with object distribution, object cloning and object migration; and integrate PGAS based process constructs with these extensions on distributed objects. Also, a new paradigm, MIDD (Multiple Invocation Distributed Data), is presented in which different copies of the same class can be invoked and work on different elements of distributed data concurrently using remote method invocations. I present new constructs, their grammar and their behavior. The new constructs are explained using simple programs utilizing these constructs.
NASA Astrophysics Data System (ADS)
Gupta, Bhupender S.
The first conversion of naturally occurring fibers into threads strong enough to be looped into snares, knit to form nets, or woven into fabrics is lost in prehistory. Unlike stone weapons, such threads, cords, and fabrics, being organic in nature, have for the most part disappeared, although in some dry caves traces remain. There is ample evidence to indicate that spindles used to assist in the twisting of fibers together had been developed long before the dawn of recorded history. In that spinning process, fibers such as wool were drawn out of a loose mass, perhaps held in a distaff, and made parallel by human fingers. (A maidservant so spins in Giotto's The Annunciation to Anne, ca. A.D. 1306, Arena Chapel, Padua, Italy.) A rod (spindle), hooked to the lengthening thread, was rotated so that the fibers while so held were twisted together to form additional thread. The finished length then was wound by hand around the spindle, which, in becoming the core on which the finished product was accumulated, served the dual role of twisting and storing, and, in so doing, established a principle still in use today.
Dynamics of flexible molecules in thinning fluid filaments
NASA Astrophysics Data System (ADS)
Arratia, Paulo E.; Juarez, Gabriel
2011-11-01
Newtonian liquids that contain small amounts (~ppm) of flexible polymers can exhibit viscoelastic behavior in extensional flows. In this talk, we report the results of experiments on the thinning and breakup of polymeric fluids in a simple microfluidic device. We aim to understand the stretching dynamics of flexible polymers by direct visualization of fluorescent DNA molecules, a model polymer. A Boger fluid, composed of 100 ppm polyacrylamide and 85% w/w glycerol, is seeded with stained lambda-DNA molecules (<10% v/v) imaged by high speed epifluorescence microscopy. We observe that the strong flow in the thinning fluid threads provides sufficient forces to stretch the DNA molecules away from their equilibrium coiled state. The distribution of stretch lengths, however, is very heterogeneous due to molecular individualism and initial conditions. Once the molecules are stretched to their full length and aligned with the flow, they translate along the fluid thread as rigid rods until the point of pinch off. After pinch off, both the fluid and molecules return to a relaxed state.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Buluç, Aydın; Pothen, Alex
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
Azad, Ariful; Buluç, Aydın; Pothen, Alex
2016-03-24
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
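For readers unfamiliar with augmenting-path matching, the following serial sketch shows the general idea of searching from all unmatched vertices at once. It is a plain Hopcroft-Karp style baseline written for illustration only; it does not implement the tree-grafting method, direction-optimizing BFS or any parallelism from the paper, and the function name and input layout are assumptions.

```python
from collections import deque


def maximum_matching(adj, n_right):
    """Maximum cardinality matching in a bipartite graph.

    adj[u] lists the right-side neighbours of left vertex u.
    Returns (matching size, match_left), where match_left[u] is the
    right vertex matched to u or -1.
    """
    n_left = len(adj)
    match_left = [-1] * n_left
    match_right = [-1] * n_right
    inf = float("inf")

    def bfs():
        # Multi-source BFS: every unmatched left vertex is a source.
        dist = [inf] * n_left
        frontier = deque()
        for u in range(n_left):
            if match_left[u] == -1:
                dist[u] = 0
                frontier.append(u)
        found = False
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                w = match_right[v]
                if w == -1:
                    found = True          # an augmenting path exists
                elif dist[w] == inf:
                    dist[w] = dist[u] + 1
                    frontier.append(w)
        return dist, found

    def dfs(u, dist):
        # Follow the BFS layers to a free right vertex and flip the path.
        for v in adj[u]:
            w = match_right[v]
            if w == -1 or (dist[w] == dist[u] + 1 and dfs(w, dist)):
                match_left[u] = v
                match_right[v] = u
                return True
        dist[u] = inf                     # dead end in this phase
        return False

    size = 0
    while True:
        dist, found = bfs()
        if not found:
            break
        for u in range(n_left):
            if match_left[u] == -1 and dfs(u, dist):
                size += 1
    return size, match_left
```

Note that this baseline rebuilds its search layers from scratch in every phase; avoiding that kind of repeated traversal is the redundancy the abstract says tree-grafting is designed to reduce.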
Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery
We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.
Data Acquisition System for Multi-Frequency Radar Flight Operations Preparation
NASA Technical Reports Server (NTRS)
Leachman, Jonathan
2010-01-01
A three-channel data acquisition system was developed for the NASA Multi-Frequency Radar (MFR) system. The system is based on a commercial-off-the-shelf (COTS) industrial PC (personal computer) and two dual-channel 14-bit digital receiver cards. The decimated complex envelope representations of the three radar signals are passed to the host PC via the PCI bus, and then processed in parallel by multiple cores of the PC CPU (central processing unit). The innovation is this parallelization of the radar data processing using multiple cores of a standard COTS multi-core CPU. The data processing portion of the data acquisition software was built using autonomous program modules or threads, which can run simultaneously on different cores. A master program module calculates the optimal number of processing threads, launches them, and continually supplies each with data. The benefit of this new parallel software architecture is that COTS PCs can be used to implement increasingly complex processing algorithms on an increasing number of radar range gates and data rates. As new PCs become available with higher numbers of CPU cores, the software will automatically utilize the additional computational capacity.
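The master/worker arrangement described above (a master module that picks a thread count, launches the processing threads and keeps feeding them data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the NASA software: `process_block`, `worker` and `run_master` are hypothetical names, the mean-power calculation is a stand-in for the real radar processing, and the thread count is simply capped by `os.cpu_count()`. In CPython, threads running pure-Python numeric code do not execute in parallel because of the global interpreter lock, so the sketch shows the structure rather than the speedup.

```python
import os
import queue
import threading


def process_block(samples):
    """Stand-in for per-block processing: mean power of complex samples."""
    return sum(abs(s) ** 2 for s in samples) / len(samples)


def worker(tasks, results):
    """Processing thread: consume blocks until the sentinel None arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        index, samples = item
        results.put((index, process_block(samples)))


def run_master(blocks, max_threads=None):
    """Master module: choose a thread count, launch workers, supply data."""
    limit = max_threads or os.cpu_count() or 1
    n_threads = max(1, min(len(blocks), limit))
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for index, samples in enumerate(blocks):
        tasks.put((index, samples))
    for _ in threads:
        tasks.put(None)                   # one sentinel per worker
    for thread in threads:
        thread.join()
    output = [None] * len(blocks)
    while not results.empty():
        index, value = results.get()
        output[index] = value             # restore the original block order
    return output
```

Because the thread count is derived from the machine at run time, the same code would use more workers on a machine with more cores, which mirrors the scaling behaviour the abstract describes.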
2011-12-18
Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, pages 48–59, 1998. [8] A. Dinning and E. Schonberg. Detecting access...multi-threaded programs. ACM Trans. Comput. Syst., 15(4):391–411, 1997. [38] E. Schonberg. On-the-fly detection of access anomalies. In Proceedings
Scaling Irregular Applications through Data Aggregation and Software Multithreading
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morari, Alessandro; Tumeo, Antonino; Chavarría-Miranda, Daniel
Bioinformatics, data analytics, semantic databases, knowledge discovery are emerging high performance application areas that exploit dynamic, linked data structures such as graphs, unbalanced trees or unstructured grids. These data structures usually are very large, requiring significantly more memory than available on single shared memory systems. Additionally, these data structures are difficult to partition on distributed memory systems. They also present poor spatial and temporal locality, thus generating unpredictable memory and network accesses. The Partitioned Global Address Space (PGAS) programming model seems suitable for these applications, because it allows using a shared memory abstraction across distributed-memory clusters. However, current PGAS languages and libraries are built to target regular remote data accesses and block transfers. Furthermore, they usually rely on the Single Program Multiple Data (SPMD) parallel control model, which is not well suited to the fine grained, dynamic and unbalanced parallelism of irregular applications. In this paper we present GMT (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT integrates a PGAS data substrate with simple fork/join parallelism and provides automatic load balancing on a per node basis. It implements multi-level aggregation and lightweight multithreading to maximize memory and network bandwidth with fine-grained data accesses and tolerate long data access latencies. A key innovation in the GMT runtime is its thread specialization (workers, helpers and communication threads) that realize the overall functionality. We compare our approach with other PGAS models, such as UPC running using GASNet, and hand-optimized MPI code on a set of typical large-scale irregular applications, demonstrating speedups of an order of magnitude.
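The idea of aggregation mentioned in this abstract, collecting many small remote requests per destination and sending them as one larger message, can be illustrated with a toy buffer. The sketch below is not GMT's API or design, which the abstract does not detail: the `Aggregator` class, its `capacity` parameter and the `send` callback are hypothetical names chosen for the example, and only a single level of aggregation is shown.

```python
from collections import defaultdict


class Aggregator:
    """Toy per-destination request buffer (hypothetical, single level)."""

    def __init__(self, send, capacity=4):
        self.send = send                  # callable(dest, list_of_requests)
        self.capacity = capacity          # requests per aggregated message
        self.buffers = defaultdict(list)

    def put(self, dest, request):
        """Queue a fine-grained request; flush when the buffer is full."""
        buffer = self.buffers[dest]
        buffer.append(request)
        if len(buffer) >= self.capacity:
            self.flush(dest)

    def flush(self, dest):
        """Send everything buffered for one destination as one message."""
        buffer = self.buffers.pop(dest, [])
        if buffer:
            self.send(dest, buffer)

    def flush_all(self):
        """Send all partially filled buffers, e.g. at a synchronization point."""
        for dest in list(self.buffers):
            self.flush(dest)
```

With a capacity of 3, nine single-word requests to two nodes would leave the process as four messages instead of nine, which is the bandwidth trade the abstract refers to; the cost is added latency for requests waiting in a partially filled buffer.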
Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.
Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut
2018-05-03
Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. Contact: rene.rahn@fu-berlin.de.
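Layer (a) of the thread-level parallelization, distributing many independent alignments over threads, is straightforward to illustrate. The sketch below is not SeqAn code and uses none of its API: it is a plain Needleman-Wunsch global alignment score with assumed scoring parameters (match +1, mismatch -1, linear gap -1) and hypothetical function names, with the independent pairs handed to a standard-library thread pool. In CPython the pool illustrates the work distribution rather than a real speedup, since pure-Python code is serialized by the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score with a linear gap penalty (assumed scoring)."""
    # Only the previous row of the dynamic programming matrix is kept.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def align_many(pairs, workers=4):
    """Score many independent sequence pairs, one task per pair."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: global_alignment_score(*pair), pairs))
```

Each cell of the matrix depends only on its left, upper and upper-left neighbours, so all cells on the same minor diagonal are independent of one another; that dependency pattern is what makes the wavefront scheme of layer (b) possible within a single alignment.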
Implementing Shared Memory Parallelism in MCBEND
NASA Astrophysics Data System (ADS)
Bird, Adam; Long, David; Dobson, Geoff
2017-09-01
MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheeler's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware, OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks
Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-hoon
2017-01-01
In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. Like its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, the separation between the regular pheromone and the virtual pheromone in the diffusion process, and the exploration of routes, taking into consideration the number of hops in the best routes. A family of ACO routing protocols based on AntOR, obtained by successive refinements of the protocol, is also specified. We also present a parallelized version of AntOR that we call PAntOR. Targeting multiprocessor architectures programmed with the shared memory model, PAntOR runs tasks in parallel using threads. This parallelization is applicable in the route setup phase, the route local repair process and link failure notification. In addition, a variant of PAntOR that uses more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach parallelizes the sending of broadcast messages by interface through threads. PMID:28531159
NDL-v2.0: A new version of the numerical differentiation library for parallel architectures
NASA Astrophysics Data System (ADS)
Hadjidoukas, P. E.; Angelikopoulos, P.; Voglis, C.; Papageorgiou, D. G.; Lagaris, I. E.
2014-07-01
We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. Catalog identifier: AEDG_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEDG_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 63036 No. of bytes in distributed program, including test data, etc.: 801872 Distribution format: tar.gz Programming language: ANSI Fortran-77, ANSI C, Python. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Unix. Has the code been vectorized or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N^2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user. Classification: 4.9, 4.14, 6.5. Catalog identifier of previous version: AEDG_v1_0 Journal reference of previous version: Comput. Phys. Comm.
180(2009)1404 Does the new version supersede the previous version?: Yes Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, and sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Reasons for new version: The updated version was motivated by our endeavors to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemannian Manifold Hamiltonian MC [3]. The function evaluations are simulations with significant time-to-solution, which also varies with the input parameters such as in [1, 4]. The runtime of the N-body-type of problem changes considerably with the introduction of a longer cut-off between the bodies. In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads. 
In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: (a) higher execution times if the number of available processors is lower than the total number of tasks, and (b) significant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronizations, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it relies exclusively on the POSIX-Threads and MPI programming interfaces. 
It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems. Summary of revisions: The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support the asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added. Restrictions: The library uses only double precision arithmetic. The MPI implementation assumes the homogeneity of the execution environment provided by the operating system. Specifically, the processes of a single MPI application must have identical address space and a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP with 2 threads, 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively and yield-time for idle workers equal to 10 ms. References: [1] P. Angelikopoulos, C. Paradimitriou, P. 
Koumoutsakos, Bayesian uncertainty quantification and propagation in molecular dynamics simulations: a high performance computing framework, J. Chem. Phys 137 (14). [2] H.P. Flath, L.C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432. [3] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214. [4] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816. [5] P.E. Hadjidoukas, E. Lappas, V.V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236. [6] C. Voglis, P.E. Hadjidoukas, D.G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492. [7] P.E. Hadjidoukas, C. Voglis, V.V. Dimakopoulos, I. Lagaris, D.G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.
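The solution method summarized above (finite differencing with a step that balances truncation and round-off error, with the function evaluations run as independent tasks) can be illustrated with a minimal Python sketch. It does not use the NDL interface, which the summary does not show; the names `central_step`, `gradient` and `rosenbrock` are hypothetical, and the step rule h ≈ eps^(1/3)·max(|x|, 1) is a common textbook choice for central differences, assumed here rather than taken from the library.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def central_step(x_i):
    """Step for a central difference (assumed rule, not NDL's own).

    Truncation error grows like h**2 and round-off like eps/h, so their
    sum is smallest near h ~ eps**(1/3), scaled by the size of x_i.
    """
    eps = sys.float_info.epsilon
    return eps ** (1.0 / 3.0) * max(abs(x_i), 1.0)


def gradient(f, x, max_workers=4):
    """Estimate the gradient of f at x with central differences.

    Each of the 2*N function evaluations is submitted as its own task,
    loosely mirroring a task-based scheme; a thread pool stands in for
    the OpenMP or MPI tasking layers described in the summary.
    """
    steps = [central_step(xi) for xi in x]

    def shifted(i, sign):
        y = list(x)
        y[i] += sign * steps[i]
        return f(y)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        plus = [pool.submit(shifted, i, +1.0) for i in range(len(x))]
        minus = [pool.submit(shifted, i, -1.0) for i in range(len(x))]
        return [
            (p.result() - m.result()) / (2.0 * steps[i])
            for i, (p, m) in enumerate(zip(plus, minus))
        ]


def rosenbrock(v):
    """Hypothetical test function; its exact gradient at (1, 1) is (0, 0)."""
    return (1.0 - v[0]) ** 2 + 100.0 * (v[1] - v[0] ** 2) ** 2


grad_at_minimum = gradient(rosenbrock, [1.0, 1.0])
grad_elsewhere = gradient(rosenbrock, [0.5, 2.0])  # exact value: (-351, 350)
```

In a real setting each call to `f` would be an expensive simulation, which is why the number of function evaluations, rather than the arithmetic around them, dominates the cost.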
Degidi, Marco; Perrotti, Vittoria; Shibli, Jamil A; Mortellaro, Carmen; Piattelli, Adriano; Iezzi, Giovanna
2014-05-01
The high long-term survival and success rates of dental implants reported in the literature are related mainly to new, innovative implant and thread designs, and to new implant surfaces that make it possible to obtain very good primary and secondary stability in most anatomical and clinical situations, even in bone of low quality and quantity, promoting more rapid osseointegration. The aim of this retrospective study was a histological and histomorphometrical evaluation of the bone response around implants with a parallel-wall configuration, condensing thread macrodesign, and self-tapping apex, retrieved from humans for various reasons. A total of 10 implants were reported in the present study, and these implants had been retrieved after a loading period ranging from a few weeks to about 8 years. Mineralized newly formed bone was found at the interface of all the implants, in direct contact with the implant surface, with no gaps or connective fibrous tissue. This bone adapted very well to the microirregularities of the implant surface. Areas of bone remodeling were present in some regions of the interface, with many reversal lines. High bone-implant contact percentages were found. In conclusion, both the macrostructure and the microstructure of this specific type of implant could contribute to high long-term implant survival and success rates.
Shi, Yuqian; Hellinga, Homme W.; Beese, Lorena S.
2017-01-01
Human exonuclease 1 (hExo1) is a member of the RAD2/XPG structure-specific 5′-nuclease superfamily. Its dominant, processive 5′–3′ exonuclease and secondary 5′-flap endonuclease activities participate in various DNA repair, recombination, and replication processes. A single active site processes both recessed ends and 5′-flap substrates. By initiating enzyme reactions in crystals, we have trapped hExo1 reaction intermediates that reveal structures of these substrates before and after their exo- and endonucleolytic cleavage, as well as structures of uncleaved, unthreaded, and partially threaded 5′ flaps. Their distinctive 5′ ends are accommodated by a small, mobile arch in the active site that binds recessed ends at its base and threads 5′ flaps through a narrow aperture within its interior. A sequence of successive, interlocking conformational changes guides the two substrate types into a shared reaction mechanism that catalyzes their cleavage by an elaborated variant of the two-metal, in-line hydrolysis mechanism. Coupling of substrate-dependent arch motions to transition-state stabilization suppresses inappropriate or premature cleavage, enhancing processing fidelity. The striking reduction in flap conformational entropy is catalyzed, in part, by arch motions and transient binding interactions between the flap and unprocessed DNA strand. At the end of the observed reaction sequence, hExo1 resets without relinquishing DNA binding, suggesting a structural basis for its processivity. PMID:28533382
A hybrid algorithm for parallel molecular dynamics simulations
NASA Astrophysics Data System (ADS)
Mangiardi, Chris M.; Meyer, R.
2017-10-01
This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
Integrating end-to-end threads of control into object-oriented analysis and design
NASA Technical Reports Server (NTRS)
Mccandlish, Janet E.; Macdonald, James R.; Graves, Sara J.
1993-01-01
Current object-oriented analysis and design methodologies fall short in their use of mechanisms for identifying threads of control for the system being developed. The scenarios which typically describe a system are more global than looking at the individual objects and representing their behavior. Unlike conventional methodologies that use data flow and process-dependency diagrams, object-oriented methodologies do not provide a model for representing these global threads end-to-end. Tracing through threads of control is key to ensuring that a system is complete and timing constraints are addressed. The existence of multiple threads of control in a system necessitates a partitioning of the system into processes. This paper describes the application and representation of end-to-end threads of control to the object-oriented analysis and design process using object-oriented constructs. The issue of representation is viewed as a grouping problem, that is, how to group classes/objects at a higher level of abstraction so that the system may be viewed as a whole with both classes/objects and their associated dynamic behavior. Existing object-oriented development methodology techniques are extended by adding design-level constructs termed logical composite classes and process composite classes. Logical composite classes are design-level classes which group classes/objects both logically and by thread of control information. Process composite classes further refine the logical composite class groupings by using process partitioning criteria to produce optimum concurrent execution results. The goal of these design-level constructs is to ultimately provide the basis for a mechanism that can support the creation of process composite classes in an automated way. Using an automated mechanism makes it easier to partition a system into concurrently executing elements that can be run in parallel on multiple processors.
NASA Astrophysics Data System (ADS)
Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.
2018-05-01
Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. 
The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
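The link between vector width and divergence noted above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the paper's analysis: ODEs are assumed to be integrated in lockstep groups of `width` lanes, each group runs until its slowest member finishes, and the step counts are hypothetical.

```python
def lockstep_utilization(steps_per_ode, width):
    """Fraction of lane-steps doing useful work when ODEs are integrated
    in lockstep groups of `width` lanes.

    A group keeps stepping until its slowest ODE finishes, so every lane
    in the group is occupied for max(steps) iterations even if its own
    ODE needed fewer. Unfilled lanes in the last group count as idle.
    """
    useful = sum(steps_per_ode)
    occupied = 0
    for start in range(0, len(steps_per_ode), width):
        group = steps_per_ode[start:start + width]
        occupied += width * max(group)
    return useful / occupied


# Hypothetical step counts for 8 independent ODEs with adaptive step sizes;
# one stiff outlier needs far more steps than the rest.
steps = [10, 12, 11, 40, 10, 10, 13, 10]

narrow = lockstep_utilization(steps, width=4)  # 116 / 212
wide = lockstep_utilization(steps, width=8)    # 116 / 320
```

In this toy case the single slow ODE stalls four lanes at width 4 but all eight at width 8, so utilization drops as the lockstep group widens.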
Yokohama, Noriya
2013-07-01
This report describes the design of architectures for, and the performance measurement of, a parallel computing environment that runs a Monte Carlo simulation for particle therapy on a high performance computing (HPC) instance within a public cloud-computing infrastructure. Performance measurements showed a speed approximately 28 times faster than that of a single-thread architecture, combined with improved stability. A study of methods of optimizing the system operations also indicated lower cost.
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-10-29
Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.
Durrani, Owais Khalid; Shaheed, Sohrab; Khan, Arsalan; Bashir, Ulfat
2017-10-01
The purpose of this study was to compare the in-vivo failure rates of single-thread and dual-thread temporary anchorage device (TAD) designs over 18 months. Thirty patients with skeletal Class II Division 1 malocclusion requiring anchorage from TADs for retraction of maxillary incisors into the extracted premolar space were recruited in this parallel group, split-mouth, randomized controlled trial. A block randomization sequence was generated with Random Allocation Software (version 2.0; Isfahan, Iran) with the allocations concealed in sequentially numbered, opaque, sealed envelopes. A total of 60 TADs (diameter, 2 mm; length, 10 mm) were placed in the maxillary arches of these patients with random allocation of the 2 types to the left and the right sides in a 1:1 ratio. All TADs were placed between the roots of the second premolar and the first molar and were immediately loaded. Patients were followed for a minimum of 12 months and a maximum of 18 months for the failure of the TADs. Data were analyzed blindly on an intention-to-treat basis. Four TADs (13.3%) failed in the single-thread group, and 6 TADs (20%) failed in the dual-thread group. The McNemar test showed an insignificant difference (P = 0.72) between the 2 groups. An odds ratio of 1.6 (95% confidence interval, 0.39-6.97) showed no significant associations among the variables. Most TADs failed in the first month after insertion (50%). The failure rate of dual-thread TADs compared with single-thread TADs is statistically insignificant when placed in the maxilla for retraction of the anterior segment. Registration: The trial was not registered before commencement. The protocol was not published before the trial. Copyright © 2016 American Association of Orthodontists. Published by Elsevier Inc. All rights reserved.
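The reported failure rates and the point estimate of the odds ratio can be checked with simple arithmetic. The sketch below computes a crude, unpaired odds ratio from the counts in the abstract (4 of 30 and 6 of 30). The trial used a paired split-mouth design and the abstract does not state how its odds ratio or confidence interval were derived, so this is an assumption-laden check of the point estimate only; `failure_summary` is a hypothetical helper.

```python
def failure_summary(failed_a, failed_b, n_per_group):
    """Return (rate_a, rate_b, crude odds ratio of b relative to a)."""
    rate_a = failed_a / n_per_group
    rate_b = failed_b / n_per_group
    odds_a = failed_a / (n_per_group - failed_a)  # failures : survivors
    odds_b = failed_b / (n_per_group - failed_b)
    return rate_a, rate_b, odds_b / odds_a


# Counts from the abstract: single-thread 4/30 failed, dual-thread 6/30 failed.
single_rate, dual_rate, odds_ratio = failure_summary(4, 6, 30)
```

The crude ratio, (6/24)/(4/26) = 1.625, is consistent with the reported value of 1.6.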
NASA Astrophysics Data System (ADS)
Jang, W.; Engda, T. A.; Neff, J. C.; Herrick, J.
2017-12-01
Crop models are increasingly used to evaluate crop yields at regional and global scales. However, implementation of these models across large areas using fine-scale grids is limited by computational time requirements. In order to facilitate global gridded crop modeling with various scenarios (i.e., different crop, management schedule, fertilizer, and irrigation) using the Environmental Policy Integrated Climate (EPIC) model, we developed a distributed parallel computing framework in Python. Our local desktop with 14 cores (28 threads) was used to test the distributed parallel computing framework in Iringa, Tanzania, which has 406,839 grid cells. High-resolution soil data, SoilGrids (250 x 250 m), and climate data, AgMERRA (0.25 x 0.25 deg), were also used as input data for the gridded EPIC model. The framework includes a master file for parallel computing, input database, input data formatters, EPIC model execution, and output analyzers. Through the master file for parallel computing, the EPIC simulation is divided into jobs according to the user-defined number of CPU threads. Then, using EPIC input data formatters, the raw database is formatted into EPIC input data and the formatted data are moved into the EPIC simulation jobs. Then, 28 EPIC jobs run simultaneously and only the result files of interest are parsed and moved into the output analyzers. We applied various scenarios with seven different slopes and twenty-four fertilizer ranges. Parallelized input generators create the different scenarios as a list for distributed parallel computing. After all simulations are completed, parallelized output analyzers are used to analyze all outputs according to the different scenarios. This saves significant computing time and resources, making it possible to conduct gridded modeling at regional to global scales with high-resolution data. 
For example, serial processing for the Iringa test case would require 113 hours, while using the framework developed in this study requires only approximately 6 hours, a nearly 95% reduction in computing time.
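The job division and the quoted timing figures can be sketched as follows. This is a minimal illustration, not the framework's code: `split_into_jobs` is a hypothetical helper that assumes the grid cells are divided into contiguous, nearly equal ranges, and the timing arithmetic uses the rounded values from the abstract (113 hours serial, approximately 6 hours parallel, 28 threads).

```python
def split_into_jobs(n_cells, n_jobs):
    """Return (start, stop) index ranges covering n_cells as evenly as possible."""
    base, extra = divmod(n_cells, n_jobs)
    ranges = []
    start = 0
    for job in range(n_jobs):
        # The first `extra` jobs take one additional cell each.
        stop = start + base + (1 if job < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# Figures quoted in the abstract.
jobs = split_into_jobs(406_839, 28)

speedup = 113 / 6          # serial hours / parallel hours
reduction = 1 - 6 / 113    # fraction of computing time saved
efficiency = speedup / 28  # speedup per hardware thread
```

With these rounded inputs the speedup is about 18.8, the reduction about 94.7% (matching the "nearly 95%" quoted), and the per-thread efficiency about 0.67. Because the 28 threads run on 14 physical cores, a per-thread efficiency below 1 is not surprising.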
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-11-12
Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call site statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.
Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2014-08-12
Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Performance evaluation of canny edge detection on a tiled multicore architecture
NASA Astrophysics Data System (ADS)
Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald
2011-01-01
In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP, and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that, for the same number of threads, programmer-implemented domain decomposition exhibits higher speed-ups than the compiler-managed loop-level parallelism implemented with OpenMP.
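Domain decomposition as described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's Tile64 implementation: it covers only a simple gradient-magnitude stage rather than the full Canny detector, splits the image into horizontal strips, and lets each strip read one neighbouring row above and below (a one-pixel halo). The function names and the test image are hypothetical. In CPython the thread pool shows the structure only; it is not expected to give a real speed-up for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def gradient_rows(image, row_start, row_stop):
    """Gradient magnitude (|gx| + |gy|) for rows [row_start, row_stop).

    Border pixels are left at 0. The function reads one row above and
    below the requested range, i.e. a one-pixel halo around the subimage.
    """
    height, width = len(image), len(image[0])
    out = []
    for r in range(row_start, row_stop):
        row = [0] * width
        if 0 < r < height - 1:
            for c in range(1, width - 1):
                gx = image[r][c + 1] - image[r][c - 1]
                gy = image[r + 1][c] - image[r - 1][c]
                row[c] = abs(gx) + abs(gy)
        out.append(row)
    return out


def gradient_decomposed(image, n_strips):
    """Split the image into horizontal strips and process them concurrently."""
    height = len(image)
    bounds = [(i * height // n_strips, (i + 1) * height // n_strips)
              for i in range(n_strips)]
    with ThreadPoolExecutor(max_workers=n_strips) as pool:
        parts = pool.map(lambda b: gradient_rows(image, *b), bounds)
        # pool.map preserves strip order, so the rows reassemble correctly.
        return [row for part in parts for row in part]


# Hypothetical 6x6 test image with a vertical step edge.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
serial = gradient_rows(image, 0, len(image))
parallel = gradient_decomposed(image, n_strips=3)
```

Each strip writes only its own output rows and only reads the shared input, so the strips need no synchronization beyond the final reassembly.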
Meandering instability of a viscous thread
NASA Astrophysics Data System (ADS)
Morris, Stephen W.; Dawes, Jonathan H. P.; Ribe, Neil M.; Lister, John R.
2008-06-01
A viscous thread falling from a nozzle onto a surface exhibits the famous rope-coiling effect, in which the thread buckles to form loops. If the surface is replaced by a belt moving with speed U, the rotational symmetry of the buckling instability is broken and a wealth of interesting states are observed [see S. Chiu-Webster and J. R. Lister, J. Fluid Mech. 569, 89 (2006)]. We experimentally studied this “fluid-mechanical sewing machine” in a more precise apparatus. As U is reduced, the steady catenary thread bifurcates into a meandering state in which the thread displacements are only transverse to the motion of the belt. We measured the amplitude and frequency ω of the meandering close to the bifurcation. For smaller U, single-frequency meandering bifurcates to a two-frequency “figure-8” state, which contains a significant 2ω component and parallel as well as transverse displacements. This eventually reverts to single-frequency coiling at still smaller U. More complex, highly hysteretic states with additional frequencies are observed for larger nozzle heights. We propose to understand this zoology in terms of the generic amplitude equations appropriate for resonant interactions between two oscillatory modes with frequencies ω and 2ω. The form of the amplitude equations captures both the axisymmetry of the U=0 coiling state and the symmetry-breaking effects induced by the moving belt.
Thread amplitudes and frequencies in a fluid mechanical `sewing machine'
NASA Astrophysics Data System (ADS)
Morris, Stephen W.; Dawes, J. H. P.; Lister, John; Dalziel, Stuart
2006-11-01
A viscous thread falling on a surface exhibits the famous rope-coiling effect, in which the thread buckles to form loops. If the surface is replaced by a belt moving at speed U, the rotational symmetry of the buckling instability is broken and a wealth of interesting states are observed (1). We experimentally studied this fluid mechanical `sewing machine' in a new, more precise apparatus. As U is reduced, the stretched thread bifurcates into a meandering state in which the thread displacements are only transverse to the motion of the belt. We measured the amplitudes A and frequency φ of the meandering close to the bifurcation. For small U, single-frequency meandering bifurcates to a two-frequency `figure 8' state, which contains a significant 2φ component and parallel as well as transverse displacements. This eventually reverts to single-frequency coiling at smaller U. More complex, highly hysteretic states with additional harmonics are observed for larger nozzle heights. We propose to understand this zoology in terms of the generic amplitude equations appropriate for resonant interactions between three oscillatory modes with frequencies φ, 2φ and 3φ. The form of the amplitude equations captures both the axisymmetry of the U=0 coiling state and the symmetry-breaking effects induced by the moving belt. (1) Chiu-Webster and Lister, J. Fluid Mech., in press.
Meandering instability of a viscous thread.
Morris, Stephen W; Dawes, Jonathan H P; Ribe, Neil M; Lister, John R
2008-06-01
A viscous thread falling from a nozzle onto a surface exhibits the famous rope-coiling effect, in which the thread buckles to form loops. If the surface is replaced by a belt moving with speed U, the rotational symmetry of the buckling instability is broken and a wealth of interesting states are observed [see S. Chiu-Webster and J. R. Lister, J. Fluid Mech. 569, 89 (2006)]. We experimentally studied this "fluid-mechanical sewing machine" in a more precise apparatus. As U is reduced, the steady catenary thread bifurcates into a meandering state in which the thread displacements are only transverse to the motion of the belt. We measured the amplitude and frequency ω of the meandering close to the bifurcation. For smaller U, single-frequency meandering bifurcates to a two-frequency "figure-8" state, which contains a significant 2ω component and parallel as well as transverse displacements. This eventually reverts to single-frequency coiling at still smaller U. More complex, highly hysteretic states with additional frequencies are observed for larger nozzle heights. We propose to understand this zoology in terms of the generic amplitude equations appropriate for resonant interactions between two oscillatory modes with frequencies ω and 2ω. The form of the amplitude equations captures both the axisymmetry of the U=0 coiling state and the symmetry-breaking effects induced by the moving belt.
Implementing and analyzing the multi-threaded LP-inference
NASA Astrophysics Data System (ADS)
Bolotova, S. Yu; Trofimenko, E. V.; Leschinskaya, M. V.
2018-03-01
The logical production equations provide new possibilities for optimizing backward inference in intelligent production-type systems. The strategy of relevant backward inference aims to minimize the number of queries to an external information source (either a database or an interactive user). The method is based on computing the set of initial preimages and searching for the true preimage. Each stage can be executed independently and in parallel, and the actual work at a given stage can also be distributed between parallel computers. This paper is devoted to parallel algorithms for relevant inference based on an advanced “pipeline” scheme of parallel computation, which makes it possible to increase the degree of parallelism. The authors also provide some details of the LP-structures implementation.
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Guo, Jianqiu; Yang, Yu; Wu, Fangzhen
Synchrotron X-ray Topography is a powerful technique to study defect structures, particularly dislocation configurations, in single crystals. Complementing this technique with geometrical and contrast analysis can enhance the efficiency of quantitatively characterizing defects. In this study, the use of Synchrotron White Beam X-ray Topography (SWBXT) to determine the line directions of threading dislocations in 4H–SiC axial slices (sample cut parallel to the growth axis from the boule) is demonstrated. This technique is based on the fact that the projected line directions of dislocations on different reflections are different. Another technique also discussed is the determination of the absolute Burgers vectors of threading mixed dislocations (TMDs) using Synchrotron Monochromatic Beam X-ray Topography (SMBXT). This technique utilizes the fact that the contrast from TMDs varies on SMBXT images as their Burgers vectors change. By comparing observed contrast with the contrast from threading dislocations provided by Ray Tracing Simulations, the Burgers vectors can be determined. Thereafter the distribution of TMDs with different Burgers vectors across the wafer is mapped and investigated.
FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks
Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
2015-01-01
Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758
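The two statistical steps named in the abstract above (pairwise Pearson correlation, then False Discovery Rate control) can be sketched on the CPU as follows. This is only an illustrative sketch, not the FastGCN code: the abstract does not say which FDR procedure is used, so the Benjamini-Hochberg step-up procedure is an assumption here, and the function names `pearson` and `benjamini_hochberg` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Assumed FDR procedure: True where a test is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k whose p-value satisfies p_(k) <= k/m * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    gene_a = [1.0, 2.0, 3.0, 4.0]
    gene_b = [2.1, 3.9, 6.2, 8.0]
    print(pearson(gene_a, gene_b))
    print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))
```

Each gene pair is independent of every other pair, which is what makes the correlation step a natural fit for one-thread-per-pair GPU execution.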
Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments
2013-05-01
consolidated at the sender side. At the receiver side, the messages are deconsolidated and delivered to the appropriate thread. This approach bears some...Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. Panda . Performance comparison of mpi implementations over infiniband, myrinet and quadrics
NASA Astrophysics Data System (ADS)
Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian
2018-01-01
We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs.
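As a rough illustration of device-level load balancing, the sketch below splits a photon budget across devices in proportion to a measured throughput. The abstract does not describe the actual strategies used, so this static proportional split is an assumption, and `split_photons` and its arguments are hypothetical names.

```python
def split_photons(total_photons, throughputs):
    """Split a photon budget across devices in proportion to their throughput.

    throughputs: one positive number per device (e.g. photons/ms from a pilot run).
    Returns integer photon counts that sum exactly to total_photons.
    """
    total_rate = sum(throughputs)
    shares = [total_photons * rate / total_rate for rate in throughputs]
    counts = [int(share) for share in shares]
    # Hand out the photons lost to truncation, largest fractional part first.
    leftover = total_photons - sum(counts)
    by_fraction = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical throughputs for one CPU and two GPUs.
    print(split_photons(1_000_000, [1.0, 8.0, 12.5]))
```

A static split like this only works well when the pilot measurement is representative; a dynamic scheme that hands out photons in small batches would adapt better to devices whose speed varies during the run.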
Experiments and Analyses of Data Transfers Over Wide-Area Dedicated Connections
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rao, Nageswara S.; Liu, Qiang; Sen, Satyabrata
Dedicated wide-area network connections are increasingly employed in high-performance computing and big data scenarios. One might expect the performance and dynamics of data transfers over such connections to be easy to analyze due to the lack of competing traffic. However, non-linear transport dynamics and end-system complexities (e.g., multi-core hosts and distributed filesystems) can in fact make analysis surprisingly challenging. We present extensive measurements of memory-to-memory and disk-to-disk file transfers over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory-to-memory transfers, profiles of both TCP and UDT throughput as a function of RTT show concave and convex regions; large buffer sizes and more parallel flows lead to wider concave regions, which are highly desirable. TCP and UDT both also display complex throughput dynamics, as indicated by their Poincare maps and Lyapunov exponents. For disk-to-disk transfers, we determine that high throughput can be achieved via a combination of parallel I/O threads, parallel network threads, and direct I/O mode. Our measurements also show that Lustre filesystems can be mounted over long-haul connections using LNet routers, although challenges remain in jointly optimizing file I/O and transport method parameters to achieve peak throughput.
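To see why buffer size and the number of parallel flows matter at these round trip times, a standard rule of thumb is the bandwidth-delay product (BDP): the amount of data that must be in flight to keep the link full. The sketch below applies it to the 10 Gbps, 366 ms extreme quoted above. It is not the paper's analysis, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: data in flight needed to fill the link."""
    return bandwidth_bps * rtt_s / 8


def per_flow_buffer_bytes(bandwidth_bps, rtt_s, n_flows):
    """Buffer each flow needs if n_flows parallel flows share the link equally."""
    return bdp_bytes(bandwidth_bps, rtt_s) / n_flows


if __name__ == "__main__":
    link_bps = 10e9   # 10 Gbps connection, from the abstract
    rtt_s = 0.366     # longest RTT in the measurements, from the abstract
    print(f"BDP: {bdp_bytes(link_bps, rtt_s) / 1e6:.1f} MB")
    for flows in (1, 4, 16):  # flow counts chosen only for illustration
        mb = per_flow_buffer_bytes(link_bps, rtt_s, flows) / 1e6
        print(f"{flows:2d} flows -> about {mb:.1f} MB of buffer per flow")
```

At 366 ms a single flow would need roughly 457.5 MB of buffer to fill the link, which is consistent with the observation that larger buffers and more parallel flows help at long RTTs.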
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.
Zhang, Jing; Wang, Hao; Feng, Wu-Chun
2017-01-01
BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
Parallel-wire grid assembly with method and apparatus for construction thereof
Lewandowski, E.F.; Vrabec, J.
1981-10-26
Disclosed is a parallel wire grid and an apparatus and method for making the same. The grid consists of a generally coplanar array of parallel spaced-apart wires secured between metallic frame members by an electrically conductive epoxy. The method consists of continuously winding a wire about a novel winding apparatus comprising a plurality of spaced-apart generally parallel spindles. Each spindle is threaded with a number of predeterminedly spaced-apart grooves which receive and accurately position the wire at predetermined positions along the spindle. Overlying frame members coated with electrically conductive epoxy are then placed on either side of the wire array and are drawn together. After the epoxy hardens, portions of the wire array lying outside the frame members are trimmed away.
Parallel-wire grid assembly with method and apparatus for construction thereof
Lewandowski, Edward F.; Vrabec, John
1984-01-01
Disclosed is a parallel wire grid and an apparatus and method for making the same. The grid consists of a generally coplanar array of parallel spaced-apart wires secured between metallic frame members by an electrically conductive epoxy. The method consists of continuously winding a wire about a novel winding apparatus comprising a plurality of spaced-apart generally parallel spindles. Each spindle is threaded with a number of predeterminedly spaced-apart grooves which receive and accurately position the wire at predetermined positions along the spindle. Overlying frame members coated with electrically conductive epoxy are then placed on either side of the wire array and are drawn together. After the epoxy hardens, portions of the wire array lying outside the frame members are trimmed away.
60Ma of legume nodulation. What's new? What's changing?
Sprent, Janet I
2008-01-01
Current evidence suggests that legumes evolved about 60 million years ago. Genetic material for nodulation was recruited from existing DNA, often following gene duplication. The initial process of infection probably did not involve either root hairs or infection threads. From this initial event, two branched pathways of nodule developmental processes evolved, one involving and one not involving the development of infection threads to 'escort' bacteria to young nodule cells. Extant legumes have a wide range of nodule structures and at least 25% of them do not have infection threads. The latter have uniform infected tissue whereas those that have infection threads have infected cells interspersed with uninfected (interstitial) cells. Each type of nodule may develop indeterminately, with an apical meristem, or show determinate growth. These nodule structures are host determined and are largely congruent with taxonomic position. In addition to variation on the plant side, the last 10 years have seen the recognition of many new types of 'rhizobia', bacteria that can induce nodulation and fix nitrogen. It is not yet possible to fit these into the emerging pattern of nodule evolution.
Fixture for holding testing transducer
Wagner, T.A.; Engel, H.P.
A fixture for mounting an ultrasonic transducer against the end of a threaded bolt or stud to test the same for flaws. A base means threadedly secured to the side of the bolt has a rotating ring thereon. A post rising up from the ring (parallel to the axis of the workpiece) pivotally mounts a variable length cross arm, on the inner end of which is mounted the transducer. A spring means acts between the cross arm and the base to apply the testing transducer against the workpiece at a constant pressure. The device maintains constant for successive tests the radial and circumferential positions of the testing transducer and its contact pressure against the end of the workpiece.
Fixture for holding testing transducer
Wagner, Thomas A.; Engel, Herbert P.
1984-01-01
A fixture for mounting an ultrasonic transducer against the end of a threaded bolt or stud to test the same for flaws. A base means threadedly secured to the side of the bolt has a rotating ring thereon. A post rising up from the ring (parallel to the axis of the workpiece) pivotally mounts a variable length cross arm, on the inner end of which is mounted the transducer. A spring means acts between the cross arm and the base to apply the testing transducer against the workpiece at a constant pressure. The device maintains constant for successive tests the radial and circumferential positions of the testing transducer and its contact pressure against the end of the workpiece.
NASA Astrophysics Data System (ADS)
O'Reilly, Andrew J.; Quitoriano, Nathaniel J.
2018-02-01
Si0.973Ge0.027 epilayers were grown on a Si (0 0 1) substrate by a lateral liquid-phase epitaxy (LLPE) technique. The lateral growth mechanism favoured the glide of misfit dislocations and inhibited the nucleation of new dislocations by maintaining the thickness less than the critical thickness for dislocation nucleation and greater than the critical thickness for glide. This promoted the formation of an array of long misfit dislocations parallel to the [1 1 0] growth direction and reduced the threading dislocation density to 10³ cm⁻², two orders of magnitude lower than the seed area with an isotropic misfit dislocation network.
PHoToNs–A parallel heterogeneous and threads oriented code for cosmological N-body simulation
NASA Astrophysics Data System (ADS)
Wang, Qiao; Cao, Zong-Yan; Gao, Liang; Chi, Xue-Bin; Meng, Chen; Wang, Jie; Wang, Long
2018-06-01
We introduce a new code for cosmological simulations, PHoToNs, which incorporates features for performing massive cosmological simulations on heterogeneous high performance computer (HPC) systems and thread-oriented programming. PHoToNs adopts a hybrid scheme to compute gravitational force, with the conventional Particle-Mesh (PM) algorithm to compute the long-range force, the Tree algorithm to compute the short-range force and the direct summation Particle-Particle (PP) algorithm to compute gravity from very close particles. A self-similar space-filling Peano-Hilbert curve is used to decompose the computing domain. Thread programming is used to manage the domain communication, PM calculation and synchronization more flexibly, as well as Dual Tree Traversal on the CPU+MIC platform. PHoToNs scales well, and the efficiency of the PP kernel reaches 68.6% of peak performance on MIC and 74.4% on CPU platforms. We also test the accuracy of the code against the widely used Gadget-2 code and find excellent agreement.
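The idea behind decomposing a domain along a Peano-Hilbert curve can be sketched in two dimensions: map each grid cell to its position along the curve, sort by that key, and cut the sorted list into contiguous pieces so that each piece is spatially compact. PHoToNs works in three dimensions and its implementation is not described in the abstract, so the 2D reduction and the names `hilbert_index` and `decompose` are assumptions made for brevity. The index routine is the widely used iterative mapping for a square grid whose side is a power of two.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid.

    n must be a power of two; x and y lie in range(n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def decompose(cells, n, n_domains):
    """Order cells along the curve and cut them into contiguous chunks."""
    ordered = sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
    size = -(-len(ordered) // n_domains)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


if __name__ == "__main__":
    grid = [(x, y) for x in range(4) for y in range(4)]
    for rank, domain in enumerate(decompose(grid, 4, 4)):
        print(rank, domain)
```

Because consecutive positions on the curve are always neighbouring cells, each contiguous chunk forms a connected region, which tends to keep the boundary between domains, and hence communication, small.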
DOE Office of Scientific and Technical Information (OSTI.GOV)
Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R
Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.
Parallel Computer System for 3D Visualization Stereo on GPU
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on Graphics Processing Units (GPUs) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search for intersections of traced rays with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Architecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.
Parallel sort with a ranged, partitioned key-value store in a high performance computing environment
Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron; Poole, Stephen W.
2016-01-26
Improved sorting techniques are provided that perform a parallel sort using a ranged, partitioned key-value store in a high performance computing (HPC) environment. A plurality of input data files comprising unsorted key-value data in a partitioned key-value store are sorted. The partitioned key-value store comprises a range server for each of a plurality of ranges. Each input data file has an associated reader thread. Each reader thread reads the unsorted key-value data in the corresponding input data file and performs a local sort of the unsorted key-value data to generate sorted key-value data. A plurality of sorted, ranged subsets of each of the sorted key-value data are generated based on the plurality of ranges. Each sorted, ranged subset corresponds to a given one of the ranges and is provided to one of the range servers corresponding to the range of the sorted, ranged subset. Each range server sorts the received sorted, ranged subsets and provides a sorted range. A plurality of the sorted ranges are concatenated to obtain a globally sorted result.
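The pipeline described above (one reader thread per input file doing a local sort, a split into ranged subsets, a per-range sort on each range server, then concatenation) can be sketched in a single process as follows. This is an illustrative sketch only: the range servers are modelled as in-memory merges rather than separate services, and `reader`, `parallel_sort` and `boundaries` are hypothetical names.

```python
import heapq
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def reader(pairs, boundaries):
    """Locally sort one input file, then cut it into one sorted subset per range.

    boundaries is a sorted list of split keys; range r holds the keys k with
    boundaries[r - 1] <= k < boundaries[r].
    """
    local = sorted(pairs)
    subsets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in local:
        subsets[bisect_right(boundaries, key)].append((key, value))
    return subsets


def parallel_sort(files, boundaries):
    """Globally sort key-value pairs spread over several unsorted input files."""
    # One reader task per input file.
    with ThreadPoolExecutor() as pool:
        per_file = list(pool.map(lambda f: reader(f, boundaries), files))
    # Each "range server" merges the already-sorted subsets sent to its range.
    n_ranges = len(boundaries) + 1
    sorted_ranges = [
        list(heapq.merge(*(subsets[r] for subsets in per_file)))
        for r in range(n_ranges)
    ]
    # Ranges are disjoint and ordered, so concatenation gives the global order.
    return [pair for rng in sorted_ranges for pair in rng]


if __name__ == "__main__":
    files = [
        [(5, "e"), (1, "a"), (9, "i")],
        [(3, "c"), (7, "g"), (2, "b")],
    ]
    print(parallel_sort(files, boundaries=[4, 8]))
```

Because every subset arriving at a range server is already sorted, the server only needs a k-way merge rather than a full sort. In CPython the reader threads will not speed up a CPU-bound sort because of the global interpreter lock; the thread pool here only mirrors the structure of the described method.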
DNA Knots: Theory and Experiments
NASA Astrophysics Data System (ADS)
Sumners, D. W.
Cellular DNA is a long, thread-like molecule with remarkably complex topology. Enzymes that manipulate the geometry and topology of cellular DNA perform many vital cellular processes (including segregation of daughter chromosomes, gene regulation, DNA repair, and generation of antibody diversity). Some enzymes pass DNA through itself via enzyme-bridged transient breaks in the DNA; other enzymes break the DNA apart and reconnect it to different ends. In the topological approach to enzymology, circular DNA is incubated with an enzyme, producing an enzyme signature in the form of DNA knots and links. By observing the changes in DNA geometry (supercoiling) and topology (knotting and linking) due to enzyme action, the enzyme binding and mechanism can often be characterized. This paper will discuss some personal research history, and the tangle model for the analysis of site-specific recombination experiments on circular DNA.
Rolling-circle amplification under topological constraints
Kuhn, Heiko; Demidov, Vadim V.; Frank-Kamenetskii, Maxim D.
2002-01-01
We have performed rolling-circle amplification (RCA) reactions on three DNA templates that differ distinctly in their topology: an unlinked DNA circle, a linked DNA circle within a pseudorotaxane-type structure and a linked DNA circle within a catenane. In the linked templates, the single-stranded circle (dubbed earring probe) is threaded, with the aid of two peptide nucleic acid openers, between the two strands of double-stranded DNA (dsDNA). We have found that the RCA efficiency of amplification was essentially unaffected when the linked templates were employed. By showing that the DNA catenane remains intact after RCA reactions, we prove that certain DNA polymerases can carry out the replicative synthesis under topological constraints allowing detection of several hundred copies of a dsDNA marker without DNA denaturation. Our finding may have practical implications in the area of DNA diagnostics. PMID:11788721
The Wang Landau parallel algorithm for the simple grids. Optimizing OpenMPI parallel implementation
NASA Astrophysics Data System (ADS)
Kussainov, A. S.
2017-12-01
The Wang-Landau Monte Carlo algorithm was implemented to calculate the density of states for different simple spin lattices. The energy space was split between the individual threads and balanced according to the expected runtime of the individual processes. A custom spin clustering mechanism, necessary for overcoming the critical slowdown in certain energy subspaces, was devised. Stable reconstruction of the density of states was of primary importance. Some data post-processing techniques were used to produce the expected smooth density of states.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Caubet, Jordi; Biegel, Bryan A. (Technical Monitor)
2002-01-01
In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: The OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads and the Paraver graphical user interface for inspection and analyses of the generated traces. We describe how to use the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable for shared memory
Guo, Peixuan; Zhao, Zhengyi; Haak, Jeannie; Wang, Shaoying; Wu, Dong; Meng, Bing; Weitao, Tao
2014-01-01
Biomotors were once described into two categories: linear motor and rotation motor. Recently, a third type of biomotor with revolution mechanism without rotation has been discovered. By analogy, rotation resembles the Earth rotating on its axis in a complete cycle every 24h, while revolution resembles the Earth revolving around the Sun one circle per 365 days (see animations http://nanobio.uky.edu/movie.html). The action of revolution that enables a motor free of coiling and torque has solved many puzzles and debates that have occurred throughout the history of viral DNA packaging motor studies. It also settles the discrepancies concerning the structure, stoichiometry, and functioning of DNA translocation motors. This review uses bacteriophages Phi29, HK97, SPP1, P22, T4, and T7 as well as bacterial DNA translocase FtsK and SpoIIIE or the large eukaryotic dsDNA viruses such as mimivirus and vaccinia virus as examples to elucidate the puzzles. These motors use ATPase, some of which have been confirmed to be a hexamer, to revolve around the dsDNA sequentially. ATP binding induces conformational change and possibly an entropy alteration in ATPase to a high affinity toward dsDNA; but ATP hydrolysis triggers another entropic and conformational change in ATPase to a low affinity for DNA, by which dsDNA is pushed toward an adjacent ATPase subunit. The rotation and revolution mechanisms can be distinguished by the size of channel: the channels of rotation motors are equal to or smaller than 2 nm, that is the size of dsDNA, whereas channels of revolution motors are larger than 3 nm. Rotation motors use parallel threads to operate with a right-handed channel, while revolution motors use a left-handed channel to drive the right-handed DNA in an anti-chiral arrangement. Coordination of several vector factors in the same direction makes viral DNA-packaging motors unusually powerful and effective. 
Revolution mechanism that avoids DNA coiling in translocating the lengthy genomic dsDNA helix could be advantageous for cell replication such as bacterial binary fission and cell mitosis without the need for topoisomerase or helicase to consume additional energy. Copyright © 2014 Elsevier Inc. All rights reserved.
NASA Technical Reports Server (NTRS)
Schunk, Richard Gregory; Chung, T. J.
2001-01-01
A parallelized version of the Flowfield Dependent Variation (FDV) Method is developed to analyze a problem of current research interest, the flowfield resulting from a triple shock/boundary layer interaction. Such flowfields are often encountered in the inlets of high speed air-breathing vehicles including the NASA Hyper-X research vehicle. In order to resolve the complex shock structure and to provide adequate resolution for boundary layer computations of the convective heat transfer from surfaces inside the inlet, models containing over 500,000 nodes are needed. Efficient parallelization of the computation is essential to achieving results in a timely manner. Results from a parallelization scheme, based upon multi-threading, as implemented on multiple processor supercomputers and workstations are presented.
Multi-petascale highly efficient parallel supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five-dimensional torus network that maximizes the throughput of packet communications between nodes and minimizes latency. The network implements a collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design is a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and a multiversioning cache that improves the soft error rate at the same time and supports DMA functionality allowing for parallel processing message-passing.
NASA Astrophysics Data System (ADS)
Zimovets, Artem; Matviychuk, Alexander; Ushakov, Vladimir
2016-12-01
The paper presents two different approaches to reducing the computation time of reachability sets. The first approach uses different data structures for storing the reachability sets in computer memory for calculation in single-threaded mode. The second approach is based on parallel algorithms that use the data structures from the first approach. Within the framework of this paper, a parallel algorithm for approximate reachability set calculation on a computer with SMP architecture is proposed. The results of numerical modelling are presented in the form of tables which demonstrate the high efficiency of parallel computing technology and also show how computing time depends on the data structure used.
Electromagnetic Physics Models for Parallel Computing Architectures
NASA Astrophysics Data System (ADS)
Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
2016-10-01
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
Wolff, Jonas O; van der Meijden, Arie; Herberstein, Marie E
2017-07-26
Building behaviour in animals extends biological functions beyond bodies. Many studies have emphasized the role of behavioural programmes, physiology and extrinsic factors for the structure and function of buildings. Structure attachments associated with animal constructions offer yet unrealized research opportunities. Spiders build a variety of one- to three-dimensional structures from silk fibres. The evolution of economic web shapes as a key for ecological success in spiders has been related to the emergence of high performance silks and thread coating glues. However, the role of thread anchorages has been widely neglected in those models. Here, we show that orb-web (Araneidae) and hunting spiders (Sparassidae) use different silk application patterns that determine the structure and robustness of the joint in silk thread anchorages. Silk anchorages of orb-web spiders show a greater robustness against different loading situations, whereas the silk anchorages of hunting spiders have their highest pull-off resistance when loaded parallel to the substrate along the direction of dragline spinning. This suggests that the behavioural 'printing' of silk into attachment discs along with spinneret morphology was a prerequisite for the evolution of extended silk use in a three-dimensional space. This highlights the ecological role of attachments in the evolution of animal architectures. © 2017 The Author(s).
Illustrating Thermodynamic Concepts Using a Hero's Engine
NASA Astrophysics Data System (ADS)
Muiño, Pedro L.; Hodgson, James R.
2000-05-01
A modified Hero's engine is used to illustrate concepts of thermodynamics and engineering design suitable for introductory chemistry courses and more advanced physical chemistry courses. The engine is a boiler made of Pyrex with two off-center nozzles. Upon boiling, the vapor exits the nozzles, creating two opposite, off-center forces that result in a circular motion by the engine around the vertical axis. The engine is suspended from a horizontal bar by means of two parallel threads. The rotation of the engine results in the twisting of the threads, with two important effects: the engine is raised vertically, and potential energy is stored in the coiling of the threads. When the engine is raised, it is removed from the heating source. This stops the boiling. The stored potential energy is then released into kinetic energy; that is, the threads uncoil, and the engine rotates in the opposite direction. This lowers the engine into the flame, so the water resumes boiling and the engine can be raised again. This cycle continues until all the liquid water is vaporized. This demonstration is suitable to illustrate concepts like gas expansion, gas cooling through expansion (Joule-Thomson experiment), conversion of heat to work, interconversion between kinetic energy and potential energy, and feedback mechanisms.
An overview of the Opus language and runtime system
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Haines, Matthew
1994-01-01
We have recently introduced a new language, called Opus, which provides a set of Fortran language extensions that allow for integrated support of task and data parallelism. It also provides shared data abstractions (SDAs) as a method for communication and synchronization among these tasks. In this paper, we first provide a brief description of the language features and then focus on both the language-dependent and language-independent parts of the runtime system that support the language. The language-independent portion of the runtime system supports lightweight threads across multiple address spaces, and is built upon existing lightweight thread and communication systems. The language-dependent portion of the runtime system supports conditional invocation of SDA methods and distributed SDA argument handling.
Chen, Qingkui; Zhao, Deyu; Wang, Jingjuan
2017-01-01
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes’ diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services. PMID:28777325
Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan
2017-08-04
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
NASA Astrophysics Data System (ADS)
Srinivasa, K. G.; Shree Devi, B. N.
2017-10-01
String searching in documents has become a tedious task with the evolution of Big Data. The generation of large data sets demands a high-performance search algorithm in areas such as text mining, information retrieval and many others. The popularity of GPUs for general-purpose computing has been increasing for various applications. It is therefore of great interest to exploit the threading capability of a GPU to provide a high-performance search algorithm. This paper proposes an optimized new approach to the N-gram model for string search in a number of lengthy documents, together with its GPU implementation. The algorithm exploits GPGPUs for searching strings in many documents, employing character-level N-gram matching with a parallel Score Table approach and search using the CUDA API. The new Score Table approach, used for frequency storage of N-grams in a document, makes the search independent of the document's length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive threading capability of a GPU has been exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search in a huge number of documents, thus speeding up the whole search process even for a large pattern size. Experiments were carried out for many documents of varied length and search strings from the standard Lorem Ipsum text on NVIDIA's GeForce GT 540M GPU with 96 cores. Results show that the parallel approach for Score Table creation and searching gives a good speedup over the same approach executed serially.
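As a rough illustration of the frequency-table idea in the abstract above, the following serial Python sketch builds a per-document table of character trigram counts and scores a pattern against it. It is not the authors' CUDA implementation: the function names and the scoring rule (the minimum trigram count, which is an upper bound on the number of occurrences of the pattern) are assumptions made for this example only.

```python
from collections import Counter

def trigrams(text):
    """Return every overlapping character trigram of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def build_score_table(document):
    """Hypothetical 'score table': trigram -> frequency in the document."""
    return Counter(trigrams(document))

def score(pattern, table):
    """Assumed scoring rule: the smallest frequency among the pattern's
    trigrams. A score of 0 means the pattern cannot occur in the document;
    a positive score is only an upper bound on the number of occurrences."""
    grams = trigrams(pattern)
    if not grams:
        return 0
    return min(table[g] for g in grams)

table = build_score_table("lorem ipsum dolor sit amet")
print(score("ipsum", table), score("ipsxm", table))
```

Once the table exists, the cost of a lookup depends on the pattern length rather than the document length, which is the property the abstract attributes to the Score Table.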
Iran's Implicit Philosophy of Education
ERIC Educational Resources Information Center
Bagheri Noaparast, Khosrow
2018-01-01
This paper aims to extract Iran's philosophy of education from two sources of the constitution and the course of practice in educational institutions. Regarding the first source, it is argued that parallel to the two main threads of the constitution, Iran's main elements of philosophy of education are expected to be derived from; (1) Islam and (2)…
ERIC Educational Resources Information Center
Science Teacher, 1989
1989-01-01
Describes classroom activities and models for migration, mutation, and isolation; a diffusion model; Bernoulli's principle; sound in a vacuum; time regression mystery of DNA; seating chart lesson plan; algae mystery laboratory; water as mass; science fair; flipped book; making a cloud; wet mount slide; timer adaptation; thread slide model; and…
2014-01-01
Background Double-stranded DNA translocation is ubiquitous in living systems. Cell mitosis, bacterial binary fission, DNA replication or repair, homologous recombination, Holliday junction resolution, viral genome packaging and cell entry all involve biomotor-driven dsDNA translocation. Previously, biomotors have been primarily classified into linear and rotational motors. We recently discovered a third class of dsDNA translocation motors in Phi29 utilizing revolution mechanism without rotation. Analogically, the Earth rotates around its own axis every 24 hours, but revolves around the Sun every 365 days. Results Single-channel DNA translocation conductance assay combined with structure inspections of motor channels on bacteriophages P22, SPP1, HK97, T7, T4, Phi29, and other dsDNA translocation motors such as bacterial FtsK and eukaryotic mimiviruses or vaccinia viruses showed that revolution motor is widespread. The force generation mechanism for revolution motors is elucidated. Revolution motors can be differentiated from rotation motors by their channel size and chirality. Crystal structure inspection revealed that revolution motors commonly exhibit channel diameters larger than 3 nm, while rotation motors that rotate around one of the two separated DNA strands feature a diameter smaller than 2 nm. Phi29 revolution motor translocated double- and tetra-stranded DNA that occupied 32% and 64% of the narrowest channel cross-section, respectively, evidencing that revolution motors exhibit channel diameters significantly wider than the dsDNA. Left-handed oriented channels found in revolution motors drive the right-handed dsDNA via anti-chiral interaction, while right-handed channels observed in rotation motors drive the right-handed dsDNA via parallel threads. 
Tethering both the motor and the dsDNA distal-end of the revolution motor does not block DNA packaging, indicating that no rotation is required for motors of dsDNA phages, while a small-angle left-handed twist of dsDNA that is aligned with the channel could occur due to the conformational change of the phage motor channels from a left-handed configuration for DNA entry to a right-handed configuration for DNA ejection for host cell infection. Conclusions The revolution motor is widespread among biological systems, and can be distinguished from rotation motors by channel size and chirality. The revolution mechanism renders dsDNA void of coiling and torque during translocation of the lengthy helical chromosome, thus resulting in more efficient motor energy conversion. PMID:24940480
Thread mapping using system-level model for shared memory multicores
NASA Astrophysics Data System (ADS)
Mitra, Reshmi
Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that users can explore the design space within reasonable machine hours, without a thorough understanding of how the code interacts with the platform. Response time is related to the cycles per instruction retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on a Markov Chain Model (MCM) and a Model Tree (MT), for system-level steady-state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the performance at smaller AS. This aspect of the framework, along with the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady-state CPI results with 12 different MSs is less than 1%. The total run time of the model is of the order of minutes, whereas the actual application execution time is of the order of days.
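The steady-state idea in the abstract above can be sketched with a two-state Markov chain (active, suspended). The Python below is only an illustration under stated assumptions: the transition probabilities are made up, and the mapping from the stationary distribution to an effective CPI (base CPI divided by the fraction of time spent active) is a hypothetical simplification, not the model described in the thesis.

```python
def stationary(P, iters=1000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic
    matrix P by repeatedly applying pi <- pi * P (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi

def effective_cpi(base_cpi, pi_active):
    """Assumed relation: cycles spent suspended retire no instructions,
    so the CPI seen by the system is inflated by 1 / pi_active."""
    return base_cpi / pi_active

# Illustrative probabilities only: state 0 = active, state 1 = suspended.
P = [[0.9, 0.1],
     [0.3, 0.7]]
pi = stationary(P)
print(pi, effective_cpi(1.2, pi[0]))
```

For a two-state chain the result can be checked by hand: with P(active to suspended) = 0.1 and P(suspended to active) = 0.3, the stationary probability of the active state is 0.3 / (0.1 + 0.3) = 0.75.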
Multithreaded Model for Dynamic Load Balancing Parallel Adaptive PDE Computations
NASA Technical Reports Server (NTRS)
Chrisochoides, Nikos
1995-01-01
We present a multithreaded model for the dynamic load-balancing of numerical, adaptive computations required for the solution of Partial Differential Equations (PDEs) on multiprocessors. Multithreading is used as a means of exploring concurrency at the processor level in order to tolerate synchronization costs inherent to traditional (non-threaded) parallel adaptive PDE solvers. Our preliminary analysis for parallel, adaptive PDE solvers indicates that multithreading can be used as a mechanism to mask overheads required for the dynamic balancing of processor workloads with computations required for the actual numerical solution of the PDEs. Also, multithreading can simplify the implementation of dynamic load-balancing algorithms, a task that is very difficult for traditional data parallel adaptive PDE computations. Unfortunately, multithreading does not always simplify program complexity, often makes code re-usability not an easy task, and increases software complexity.
Parallel algorithm of VLBI software correlator under multiprocessor environment
NASA Astrophysics Data System (ADS)
Zheng, Weimin; Zhang, Dong
2007-11-01
The correlator is the key signal processing equipment of a Very Long Baseline Interferometry (VLBI) synthetic aperture telescope. It receives the mass data collected by the VLBI observatories and produces the visibility function of the target, which can be used for spacecraft positioning, baseline length measurement, synthesis imaging, and other scientific applications. VLBI data correlation is both data-intensive and computation-intensive. This paper presents the algorithms of two parallel software correlators for multiprocessor environments. A near real-time correlator for spacecraft tracking adopts pipelining and thread-parallel technology, and runs on SMP (Symmetric Multiprocessor) servers. Another high-speed prototype correlator, using a mixed Pthreads and MPI (Message Passing Interface) parallel algorithm, is realized on a small Beowulf cluster platform. Both correlators have a flexible structure, are scalable, and can correlate data from 10 stations.
Electromagnetic physics models for parallel computing architectures
Amadio, G.; Ananya, A.; Apostolakis, J.; ...
2016-11-21
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.
Life's origin: the cosmic, planetary and biological processes
NASA Technical Reports Server (NTRS)
Scattergood, T.; Des Marais, D.; Jahnke, L.
1987-01-01
From elements formed in interstellar furnaces to humans peering back at the stars, the evolution of life has been a long, intricate and perhaps inevitable process. Life as we know it requires a planet orbiting a star at just the right distance so that water can exist in liquid form. It needs a rich supply of chemicals and energy sources. On Earth, the combination of chemistry and energy generated molecules that evolved ways of replicating themselves and of passing information from one generation to the next. Thus, the thread of life began. This chart traces the thread, maintained by DNA molecules for much of its history, as it weaves its way through the primitive oceans, gaining strength and diversity along the way. Organisms eventually moved onto the land, where advanced forms, including humans, ultimately arose. Finally, assisted by a technology of its own making, life has reached back out into space to understand its own origins, to expand into new realms, and to seek other living threads in the cosmos.
National Centers for Environmental Prediction
the number of threads used. HWRF group cannot access Zeus and Jet for real-time data transfers from nodes used.). All single jobs will be run on one rack and will not share with parallel jobs. No official change the group when using tag_rstprod (-g option). autotag_rstprod is a script that tags all files. It
Web 2.0, Pedagogical Support for Reflexive and Emotional Social Interaction among Swedish Students
ERIC Educational Resources Information Center
Augustsson, Gunnar
2010-01-01
Collaborative social interaction when using Web 2.0 in terms of VoiceThread is investigated in a case study of a Swedish university course in social psychology. The case study method was chosen because of the desire not to manipulate the students' behaviour, and data was collected in parallel with course implementation. Two particular…
Playback system designed for X-Band SAR
NASA Astrophysics Data System (ADS)
Yuquan, Liu; Changyong, Dou
2014-03-01
SAR (Synthetic Aperture Radar) has extensive applications because it is daylight and weather independent. In particular, the X-Band SAR strip map system designed by the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, provides high ground resolution images together with large spatial coverage and a short acquisition time, so it is promising for multiple applications. When a sudden disaster strikes, emergency response requires radar signal data and images as soon as possible, so that action can be taken promptly to reduce losses and save lives. This paper summarizes an X-Band SAR playback processing system designed for disaster response and scientific needs. It describes the SAR data workflow, including the payload data transmission and reception process. The playback processing system performs signal analysis on the original data, providing SAR level 0 products and a quick-look image. A gigabit network ensures efficient radar signal transmission from the recorder to the calculation unit. Multi-thread parallel computing and ping-pong operation ensure computation speed. Through the gigabit network, multi-thread parallel computing and ping-pong operation, high-speed data transmission and processing meet the real-time requirement of SAR radar data playback.
Reducing False Positives in Runtime Analysis of Deadlocks
NASA Technical Reports Server (NTRS)
Bensalem, Saddek; Havelund, Klaus; Clancy, Daniel (Technical Monitor)
2002-01-01
This paper presents an improvement of a standard algorithm for detecting dead-lock potentials in multi-threaded programs, in that it reduces the number of false positives. The standard algorithm works as follows. The multi-threaded program under observation is executed, while lock and unlock events are observed. A graph of locks is built, with edges between locks symbolizing locking orders. Any cycle in the graph signifies a potential for a deadlock. The typical standard example is the group of dining philosophers sharing forks. The algorithm is interesting because it can catch deadlock potentials even though no deadlocks occur in the examined trace, and at the same time it scales very well in contrast to more formal approaches to deadlock detection. The algorithm, however, can yield false positives (as well as false negatives). The extension of the algorithm described in this paper reduces the amount of false positives for three particular cases: when a gate lock protects a cycle, when a single thread introduces a cycle, and when the code segments in different threads that cause the cycle can actually not execute in parallel. The paper formalizes a theory for dynamic deadlock detection and compares it to model checking and static analysis techniques. It furthermore describes an implementation for analyzing Java programs and its application to two case studies: a planetary rover and a space craft altitude control system.
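The standard algorithm summarized in the abstract above (observe lock and unlock events, build a graph of locking orders, report any cycle) can be sketched in a few lines of Python. This is a minimal sketch of the basic lock-graph idea only, not the paper's Java implementation and not its false-positive reduction; the trace format, function names, and the assumption of non-reentrant locks are all hypothetical choices for this example.

```python
from collections import defaultdict

def build_lock_graph(trace):
    """trace: iterable of (thread, op, lock), op being "lock" or "unlock".
    Adds an edge outer -> inner whenever a thread acquires `inner`
    while it already holds `outer`."""
    held = defaultdict(list)   # thread -> locks currently held
    edges = defaultdict(set)   # lock -> locks acquired while it was held
    for thread, op, lock in trace:
        if op == "lock":
            for outer in held[thread]:
                if outer != lock:          # ignore re-acquisition of the same lock
                    edges[outer].add(lock)
            held[thread].append(lock)
        elif op == "unlock":
            held[thread].remove(lock)
    return edges

def has_cycle(edges):
    """Depth-first search with three colours; a grey-to-grey edge is a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)
    def visit(node):
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False
    return any(colour[n] == WHITE and visit(n) for n in list(edges))

# Two threads take locks A and B in opposite orders, one after the other,
# so no deadlock happens in this trace, yet the graph contains a cycle.
trace = [("T1", "lock", "A"), ("T1", "lock", "B"),
         ("T1", "unlock", "B"), ("T1", "unlock", "A"),
         ("T2", "lock", "B"), ("T2", "lock", "A"),
         ("T2", "unlock", "A"), ("T2", "unlock", "B")]
print(has_cycle(build_lock_graph(trace)))
```

Because this sketch ignores which thread contributed each edge, it would also flag a cycle created by a single thread or one protected by a common gate lock, which are among the false-positive cases the paper sets out to eliminate.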
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness
NASA Astrophysics Data System (ADS)
Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.
2014-09-01
Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). 
Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
Parallelization and checkpointing of GPU applications through program transformation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Solano-Quinde, Lizandro Damian
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with a large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. 
The goal of this work is to exploit higher levels of parallelism and to develop support for application-level fault tolerance in applications using multiple GPUs. Our techniques reduce the burden of enhancing single-GPU applications to support these features. To achieve our goal, this work designs and implements a framework for enhancing a single-GPU OpenCL application through application transformation.
NASA Astrophysics Data System (ADS)
Takenaka, Shigeori
2017-07-01
It is known that naphthalene diimide carrying two substituents binds to the DNA duplex by threading intercalation. Naphthalene diimide carrying ferrocene moieties, ferrocenylnaphthalene diimide (FND), formed a stable complex with the DNA duplex, and electrochemical gene detection was achieved with the current signal generated from FND bound to the DNA duplex formed between the target DNA and a DNA probe-immobilized electrode. FND could not bind to a mismatched site and its surrounding region in the DNA duplex, and thus FND was applied to the precise detection of single nucleotide polymorphisms (SNPs), using the improved discrimination between fully matched and mismatched DNA hybrids and a multi-electrode chip. Some FND derivatives bound to telomere DNA tetraplex more strongly than to DNA duplex and were applied to cancer diagnosis as a measure of telomere DNA elongated by telomerase, a suitable marker of cancer. Furthermore, cyclic naphthalene diimides realized an extremely high preference for DNA tetraplex over DNA duplex. Such molecules will open the way to an effective anti-cancer drug based on a telomerase-specific inhibitor.
Algasaier, Sana I.; Exell, Jack C.; Bennet, Ian A.; Thompson, Mark J.; Gotham, Victoria J. B.; Shaw, Steven J.; Craggs, Timothy D.; Finger, L. David; Grasby, Jane A.
2016-01-01
Human flap endonuclease-1 (hFEN1) catalyzes the essential removal of single-stranded flaps arising at DNA junctions during replication and repair processes. hFEN1 biological function must be precisely controlled, and consequently, the protein relies on a combination of protein and substrate conformational changes as a prerequisite for reaction. These include substrate bending at the duplex-duplex junction and transfer of unpaired reacting duplex end into the active site. When present, 5′-flaps are thought to thread under the helical cap, limiting reaction to flaps with free 5′-termini in vivo. Here we monitored DNA bending by FRET and DNA unpairing using 2-aminopurine exciton pair CD to determine the DNA and protein requirements for these substrate conformational changes. Binding of DNA to hFEN1 in a bent conformation occurred independently of 5′-flap accommodation and did not require active site metal ions or the presence of conserved active site residues. More stringent requirements exist for transfer of the substrate to the active site. Placement of the scissile phosphate diester in the active site required the presence of divalent metal ions, a free 5′-flap (if present), a Watson-Crick base pair at the terminus of the reacting duplex, and the intact secondary structure of the enzyme helical cap. Optimal positioning of the scissile phosphate additionally required active site conserved residues Tyr40, Asp181, and Arg100 and a reacting duplex 5′-phosphate. These studies suggest a FEN1 reaction mechanism where junctions are bound and 5′-flaps are threaded (when present), and finally the substrate is transferred onto active site metals initiating cleavage. PMID:26884332
Locality Aware Concurrent Start for Stencil Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shrestha, Sunil; Gao, Guang R.; Manzano Franco, Joseph B.
Stencil computations are at the heart of many physical simulations used in scientific codes. Thus, there exists a plethora of optimization efforts for this family of computations. Among these techniques, tiling techniques that allow concurrent start have proven to be very efficient in providing better performance for these critical kernels. Nevertheless, with many core designs being the norm, these optimization techniques might not be able to fully exploit locality (both spatial and temporal) on multiple levels of the memory hierarchy without compromising parallelism. It is no longer true that the machine can be seen as a homogeneous collection of nodes with caches, main memory and an interconnect network. New architectural designs exhibit complex grouping of nodes, cores, threads, caches and memory connected by an ever evolving network-on-chip design. These new designs may benefit greatly from carefully crafted schedules and groupings that encourage parallel actors (i.e. threads, cores or nodes) to be aware of the computational history of other actors in close proximity. In this paper, we provide an efficient tiling technique that allows hierarchical concurrent start for memory hierarchy aware tile groups. Each execution schedule and tile shape exploit the available parallelism, load balance and locality present in the given applications. We demonstrate our technique on the Intel Xeon Phi architecture with selected and representative stencil kernels. We show improvement ranging from 5.58% to 31.17% over existing state-of-the-art techniques.
Massively parallel multicanonical simulations
NASA Astrophysics Data System (ADS)
Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard
2018-03-01
Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 10^4 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.
NASA Astrophysics Data System (ADS)
Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel
2018-01-01
This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Blocksome, Michael A.; Mamidala, Amith R.
2015-06-02
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Blocksome, Michael A.; Mamidala, Amith R.
2015-06-09
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Blocksome, Michael A.; Mamidala, Amith R.
2015-08-11
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Blocksome, Michael A.; Mamidala, Amith R.
2015-06-30
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2015-02-03
Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2014-11-18
Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.
GPU-accelerated adjoint algorithmic differentiation
NASA Astrophysics Data System (ADS)
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
GPU-Accelerated Adjoint Algorithmic Differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2015-01-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the “tape”. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography. 
PMID:26941443
Memory Benchmarks for SMP-Based High Performance Parallel Computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoo, A B; de Supinski, B; Mueller, F
2001-11-20
As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominate the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even more complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.
An intercalation-locked parallel-stranded DNA tetraplex
Tripathi, S.; Zhang, D.; Paukstelis, P. J.
2015-01-27
DNA has proved to be an excellent material for nanoscale construction because complementary DNA duplexes are programmable and structurally predictable. However, in the absence of Watson–Crick pairings, DNA can be structurally more diverse. Here, we describe the crystal structures of d(ACTCGGATGAT) and the brominated derivative, d(AC BrUCGGA BrUGAT). These oligonucleotides form parallel-stranded duplexes with a crystallographically equivalent strand, resulting in the first examples of DNA crystal structures that contain four different symmetric homo base pairs. Two of the parallel-stranded duplexes are coaxially stacked in opposite directions and locked together to form a tetraplex through intercalation of the 5'-most A–A base pairs between adjacent G–G pairs in the partner duplex. The intercalation region is a new type of DNA tertiary structural motif with similarities to the i-motif. 1H–1H nuclear magnetic resonance and native gel electrophoresis confirmed the formation of a parallel-stranded duplex in solution. Finally, we modified specific nucleotide positions and added d(GAY) motifs to oligonucleotides and were readily able to obtain similar crystals. This suggests that this parallel-stranded DNA structure may be useful in the rational design of DNA crystals and nanostructures.
A path-level exact parallelization strategy for sequential simulation
NASA Astrophysics Data System (ADS)
Peredo, Oscar F.; Baeza, Daniel; Ortiz, Julián M.; Herrero, José R.
2018-01-01
Sequential Simulation is a well-known method in geostatistical modelling. Following the Bayesian approach for simulation of conditionally dependent random events, the Sequential Indicator Simulation (SIS) method draws simulated values for K categories (categorical case) or classes defined by K different thresholds (continuous case). Similarly, the Sequential Gaussian Simulation (SGS) method draws simulated values from a multivariate Gaussian field. In this work, a path-level approach to parallelize the SIS and SGS methods is presented. A first stage of re-arrangement of the simulation path is performed, followed by a second stage of parallel simulation for non-conflicting nodes. A key advantage of the proposed parallelization method is that it generates realizations identical to those of the original non-parallelized methods. Case studies are presented using two sequential simulation codes from GSLIB: SISIM and SGSIM. Execution time and speedup results are shown for large-scale domains, with many categories and maximum kriging neighbours in each case, achieving high speedup results in the best scenarios using 16 threads of execution in a single machine.
Enabling the High Level Synthesis of Data Analytics Accelerators
DOE Office of Scientific and Technical Information (OSTI.GOV)
Minutoli, Marco; Castellana, Vito G.; Tumeo, Antonino
Conventional High Level Synthesis (HLS) tools mainly target compute intensive kernels typical of digital signal processing applications. We are developing techniques and architectural templates to enable HLS of data analytics applications. These applications are memory intensive, present fine-grained, unpredictable data accesses, and irregular, dynamic task parallelism. We discuss an architectural template based around a distributed controller to efficiently exploit thread level parallelism. We present a memory interface that supports parallel memory subsystems and enables implementing atomic memory operations. We introduce a dynamic task scheduling approach to efficiently execute heavily unbalanced workload. The templates are validated by synthesizing queries from the Lehigh University Benchmark (LUBM), a well-known SPARQL benchmark.
Parallelization of elliptic solver for solving 1D Boussinesq model
NASA Astrophysics Data System (ADS)
Tarwidi, D.; Adytia, D.
2018-03-01
In this paper, a parallel implementation of an elliptic solver for the 1D Boussinesq model is presented. The numerical solution of the Boussinesq model is obtained by applying a staggered grid scheme to the continuity, momentum, and elliptic equations of the model. The tridiagonal system emerging from the numerical scheme of the elliptic equation is solved by the cyclic reduction algorithm. The parallel implementation of cyclic reduction is executed on multicore processors with shared memory architectures using OpenMP. To measure the performance of the parallel program, the number of grid points is varied from 2^8 to 2^14. Two test cases, i.e. propagation of solitary and standing waves, are proposed to evaluate the parallel program. The numerical results are verified against the analytical solutions of the solitary and standing waves. The best speedups for the solitary and standing wave test cases are about 2.07 with 2^14 grid points and 1.86 with 2^13 grid points, respectively, both obtained using 8 threads. Moreover, the best efficiency of the parallel program is 76.2% and 73.5% for the solitary and standing wave test cases, respectively.
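Cyclic reduction for a tridiagonal system can be sketched as follows. This is a sequential, recursive Python illustration rather than the authors' OpenMP code; the function name and the array layout (sub-diagonal `a`, diagonal `b`, super-diagonal `c`, right-hand side `d`) are assumptions, and the matrix is assumed to be diagonally dominant so that no pivot is zero.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    a[0] and c[-1] are ignored (treated as zero).
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    a = [0.0] + list(a[1:])
    c = list(c[:-1]) + [0.0]

    # Forward reduction: eliminate the even-indexed unknowns from every
    # odd-indexed equation. The iterations are independent of each other,
    # which is the part a threaded version would distribute.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        new_a = -alpha * a[i - 1]
        new_b = b[i] - alpha * c[i - 1]
        new_c = 0.0
        new_d = d[i] - alpha * d[i - 1]
        if i + 1 < n:  # a right neighbour exists
            gamma = c[i] / b[i + 1]
            new_b -= gamma * a[i + 1]
            new_c = -gamma * c[i + 1]
            new_d -= gamma * d[i + 1]
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rd.append(new_d)

    # Solve the half-size tridiagonal system for the odd-indexed unknowns.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution: each even-indexed unknown follows from its
    # already known neighbours; these iterations are independent as well.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x
```

Each level halves the number of unknowns, so about log2(n) levels are needed; the work per level shrinks quickly, which limits how much of the run time additional threads can reduce.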
Symbolic Analysis of Concurrent Programs with Polymorphism
NASA Technical Reports Server (NTRS)
Rungta, Neha Shyam
2010-01-01
The current trend of multi-core and multi-processor computing is causing a paradigm shift from inherently sequential to highly concurrent and parallel applications. Certain thread interleavings, data input values, or combinations of both often cause errors in the system. Systematic verification techniques such as explicit state model checking and symbolic execution are extensively used to detect errors in such systems [7, 9]. Explicit state model checking enumerates possible thread schedules and input data values of a program in order to check for errors [3, 9]. To partially mitigate the state space explosion from data input values, symbolic execution techniques substitute data input values with symbolic values [5, 7, 6]. Explicit state model checking and symbolic execution techniques used in conjunction with exhaustive search techniques such as depth-first search are unable to detect errors in medium to large-sized concurrent programs because the number of behaviors caused by data and thread non-determinism is extremely large. We present an overview of abstraction-guided symbolic execution for concurrent programs that detects errors manifested by a combination of thread schedules and data values [8]. The technique generates a set of key program locations relevant in testing the reachability of the target locations. The symbolic execution is then guided along these locations in an attempt to generate a feasible execution path to the error state. This allows the execution to focus on the parts of the behavior space more likely to contain an error.
The influence of ionic strength on DNA diffusion in gel networks
NASA Astrophysics Data System (ADS)
Fu, Yuanxi; Jee, Ah-Young; Kim, Hyeong-Ju; Granick, Steve
Cations are known to reduce the rigidity of DNA molecules by screening the negative charge along the sugar phosphate backbone. This was established by optical tweezer pulling experiments on immobilized DNA strands. However, little is known regarding the influence of ions on the motion of DNA molecules as they thread through network meshes. We imaged in real time the Brownian diffusion of fluorescently labeled lambda-DNA in an agarose gel network in the presence of salt with monovalent or multivalent cations. Each movie was analyzed using a home-written program to yield a trajectory of the center of mass and the accompanying history of the shape fluctuations. One preliminary finding is that ionic strength has a profound influence on the slope of the trace of mean square displacement (MSD) versus time.
Random Number Generation for High Performance Computing
2015-01-01
number streams, a quality metric for the parallel random number streams. * * * * * Atty. Dkt . No.: 5660-14400 Customer No. 35690 Eric B. Meyertons...responsibility to ensure timely payment of maintenance fees when due. Pagel of3 PTOL-85 (Rev. 02/11) Atty. Dkt . No.: 5660-14400 Page 1 Meyertons...with each subtask executed by a separate thread or process (henceforth, process). Each process has Atty. Dkt . No.: 5660-14400 Page 2 Meyertons
Implementation of BT, SP, LU, and FT of NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Schultz, Matthew; Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry
2000-01-01
A number of Java features make it an attractive but debatable choice for High Performance Computing. We have implemented the benchmarks that work on a single structured grid (BT, SP, LU and FT) in Java. The performance and scalability of the Java code show that significant improvements in Java compiler technology and in the Java thread implementation are necessary for Java to compete with Fortran in HPC applications.
NASA Astrophysics Data System (ADS)
Huang, Melin; Huang, Bormin; Huang, Allen H.
2014-10-01
The Weather Research and Forecasting (WRF) model provides operational services worldwide in many areas and is linked to our daily activity, in particular during severe weather events. The scheme of Yonsei University (YSU) is one of the planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provides atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation process of the YSU scheme, we employ the Intel Many Integrated Core (MIC) Architecture, a multiprocessor computer structure that offers efficient parallelization and vectorization. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
Optimizing Approximate Weighted Matching on Nvidia Kepler K40
DOE Office of Scientific and Technical Information (OSTI.GOV)
Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh
Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms on the other hand generally compute high quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on the Nvidia Kepler K40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set of 269 inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by 10 to 100× for over 100 instances, and from 100 to 1000× for 15 instances. We also demonstrate up to 20× speedup relative to 2 threads, and up to 5× relative to 16 threads on an Intel Xeon platform with 16 cores for the same algorithm. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.
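The proposal-based idea behind suitor-style half-approximate matching can be sketched sequentially. This is a simplified illustration, not the GPU variants described in the abstract: the function name, the edge-list input format, and the assumption of positive, distinct edge weights are choices made here. Each vertex proposes to its heaviest neighbour that does not already hold a better offer, and a displaced suitor immediately looks for a new partner.

```python
def suitor_matching(n, edges):
    """Return a half-approximate weighted matching as sorted (u, v) pairs.

    n     : number of vertices, labelled 0 .. n-1
    edges : iterable of (u, v, weight) with positive, distinct weights
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    suitor = [None] * n   # suitor[v]: vertex whose proposal v currently holds
    ws = [0.0] * n        # ws[v]: weight of that proposal

    for u in range(n):
        current = u
        while current is not None:
            # Heaviest neighbour that does not already hold a better offer.
            partner, heaviest = None, 0.0
            for v, w in adj[current]:
                if w > heaviest and w > ws[v]:
                    partner, heaviest = v, w
            if partner is None:
                break  # no neighbour would accept; current stays unmatched
            displaced = suitor[partner]
            suitor[partner] = current
            ws[partner] = heaviest
            current = displaced  # a displaced suitor must propose again

    # Keep the pairs that hold each other's proposal.
    return sorted((u, v) for u, v in enumerate(suitor)
                  if v is not None and u < v and suitor[v] == u)
```

For the 4-cycle with edges `(0, 1, 4.0)`, `(1, 2, 5.0)`, `(2, 3, 3.0)`, `(0, 3, 1.0)`, the sketch returns `[(0, 3), (1, 2)]` with total weight 6, while the optimum is 7; this is consistent with the one-half approximation guarantee.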
El-Zawawy, Mohamed A.
2014-01-01
This paper introduces new approaches for the analysis of frequent statement and dereference elimination for imperative and object-oriented distributed programs running on parallel machines equipped with hierarchical memories. The paper uses languages whose address spaces are globally partitioned. Distributed programs allow defining data layout and threads writing to and reading from other thread memories. Three type systems (for imperative distributed programs) are the tools of the proposed techniques. The first type system defines for every program point a set of calculated (ready) statements and memory accesses. The second type system uses an enriched version of types of the first type system and determines which of the ready statements and memory accesses are used later in the program. The third type system uses the information gathered so far to eliminate unnecessary statement computations and memory accesses (the analysis of frequent statement and dereference elimination). Extensions to these type systems are also presented to cover object-oriented distributed programs. Two advantages of our work over related work are the following. The hierarchical style of concurrent parallel computers is similar to the memory model used in this paper. In our approach, each analysis result is assigned a type derivation, which serves as a correctness proof. PMID:24892098
Ling, Cheng; Hamada, Tsuyoshi; Gao, Jingyang; Zhao, Guoguang; Sun, Donghong; Shi, Weifeng
2016-01-01
MrBayes is a widespread phylogenetic inference tool harnessing empirical evolutionary models and Bayesian statistics. However, the likelihood estimation is computationally very expensive, resulting in undesirably long execution time. Although a number of multi-threaded optimizations have been proposed to speed up MrBayes, there are bottlenecks that severely limit the GPU thread-level parallelism of likelihood estimations. This study proposes a high performance and resource-efficient method for GPU-oriented parallelization of likelihood estimations. Instead of having to rely on empirical programming, the proposed novel decomposition storage model implements high performance data transfers implicitly. In terms of performance improvement, a speedup factor of up to 178 can be achieved on the analysis of simulated datasets by four Tesla K40 cards. In comparison to the other publicly available GPU-oriented MrBayes, the tgMC³++ method (proposed herein) outperforms the tgMC³ (v1.0), nMC³ (v2.1.1) and oMC³ (v1.00) methods by speedup factors of up to 1.6, 1.9 and 2.9, respectively. Moreover, tgMC³++ supports more evolutionary models and gamma categories, which previous GPU-oriented methods fail to take into analysis.
A comparison of parallel and diverging screw angles in the stability of locked plate constructs.
Wähnert, D; Windolf, M; Brianza, S; Rothstock, S; Radtke, R; Brighenti, V; Schwieger, K
2011-09-01
We investigated the static and cyclical strength of parallel and angulated locking plate screws using rigid polyurethane foam (0.32 g/cm³) and bovine cancellous bone blocks. Custom-made stainless steel plates with two conically threaded screw holes with different angulations (parallel, 10° and 20° divergent) and 5 mm self-tapping locking screws underwent pull-out and cyclical pull and bending tests. The bovine cancellous blocks were only subjected to static pull-out testing. We also performed finite element analysis for the static pull-out test of the parallel and 20° configurations. In both the foam model and the bovine cancellous bone we found the significantly highest pull-out force for the parallel constructs. In the finite element analysis there was 47% more damage in the 20° divergent constructs than in the parallel configuration. Under cyclical loading, the mean number of cycles to failure was significantly higher for the parallel group, followed by the 10° and 20° divergent configurations. In our laboratory setting we clearly showed the biomechanical disadvantage of a diverging locking screw angle under static and cyclical loading.
Heterogeneous computing architecture for fast detection of SNP-SNP interactions.
Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros
2014-06-25
The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphics Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.
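To give a feel for why exhaustive pairwise evaluation takes so long, the number of unordered SNP-SNP pairs grows as n(n-1)/2. The following is a minimal sketch only: the function names and the throughput figure of 100,000 pairs per second are illustrative assumptions, not values reported in the abstract.

```python
from math import comb


def pair_count(n_snps: int) -> int:
    """Number of unordered SNP-SNP pairs among n_snps markers: n*(n-1)/2."""
    return comb(n_snps, 2)


def days_at_rate(n_snps: int, pairs_per_second: float) -> float:
    """Wall-clock days needed to score every pair at an assumed throughput."""
    seconds = pair_count(n_snps) / pairs_per_second
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical single-threaded throughput, chosen only for illustration.
    assumed_rate = 1e5
    for n in (100_000, 1_000_000):
        print(n, pair_count(n), round(days_at_rate(n, assumed_rate), 2))
```

Because every pair can be scored independently of the others, the workload divides naturally across many GPU or MIC threads, which is the property the abstract relies on.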
Heterogeneous computing architecture for fast detection of SNP-SNP interactions
2014-01-01
Background The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphics Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. Results We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. Conclusions General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems. PMID:24964802
NASA Astrophysics Data System (ADS)
Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.
2013-08-01
Space Situational Awareness[8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP[7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms[5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIA's Compute Unified Device Architecture (CUDA).
NASA Astrophysics Data System (ADS)
Xie, Lizhe; Hu, Yining; Chen, Yang; Shi, Luyao
2015-03-01
Projection and back-projection are the most computationally expensive parts of Computed Tomography (CT) reconstruction. Parallelization strategies using GPU computing techniques have been introduced. In this paper we present a new parallelization scheme for both projection and back-projection. The proposed method is based on the CUDA technology developed by NVIDIA Corporation. Instead of building a complex model, we aimed at optimizing the existing algorithm and making it suitable for CUDA implementation so as to gain fast computation speed. Besides making use of the texture fetching operation, which helps gain faster interpolation speed, we fixed the number of samples in the computation of projection to ensure the synchronization of blocks and threads, thus preventing the latency caused by inconsistent computation complexity. Experimental results demonstrate the computational efficiency and imaging quality of the proposed method.
Trapitz, P; Glätzer, K H; Bünemann, H
1992-11-01
The understanding of structure and function of the so-called fertility genes of Drosophila is very limited due to their unusual size--several megabases--and their location on the heterochromatic Y chromosome. Since mapping of these genes has mainly been done by classical cytogenetic analyses using a small number of cytologically visible lampbrush loops as the sole markers for particular fertility genes, the resolution of the genetic map of the Y chromosome is restricted to 3-5 Mb. Here we demonstrate that a substantially finer subdivision of the megabase-sized fertility genes in the subtelomeric regions of the Y chromosome of Drosophila hydei can be achieved by a combination of digestion with restriction enzymes having 6 bp recognition sequences, and pulsed field gel electrophoresis. The physical subdivision is based upon large conserved fragments of repetitive DNA in the size range from 50 up to 1600 kb and refers to the long-range organization of several families of repetitive DNA involved in Y chromosomal transcription processes in primary spermatocytes. We conclude from our results that at least five different families of repetitive DNA specifically transcribed on the lampbrush loops nooses and threads are organized as extended clusters of several hundred kb, essentially free of interspersed non-repetitive sequences.
NASA Astrophysics Data System (ADS)
Akil, Mohamed
2017-05-01
Real-time processing is becoming more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool, and it is a very data-intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper surveys the approaches for parallel implementation of sequential watershed algorithms on multicore general-purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we compare various parallelizations of sequential watershed algorithms on shared-memory multicore architectures. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare OpenMP (an application programming interface for multiprocessing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
ParTIES: a toolbox for Paramecium interspersed DNA elimination studies.
Denby Wilkes, Cyril; Arnaiz, Olivier; Sperling, Linda
2016-02-15
Developmental DNA elimination occurs in a wide variety of multicellular organisms, but ciliates are the only single-celled eukaryotes in which this phenomenon has been reported. Despite considerable interest in ciliates as models for DNA elimination, no standard methods for identification and characterization of the eliminated sequences are currently available. We present the Paramecium Toolbox for Interspersed DNA Elimination Studies (ParTIES), designed for Paramecium species, that (i) identifies eliminated sequences, (ii) measures their presence in a sequencing sample and (iii) detects rare elimination polymorphisms. ParTIES is multi-threaded Perl software available at https://github.com/oarnaiz/ParTIES. ParTIES is distributed under the GNU General Public Licence v3. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
A pervasive parallel framework for visualization: final report for FWP 10-014707
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moreland, Kenneth D.
2014-01-01
We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much more fine thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.
Modeling of outgassing and matrix decomposition in carbon-phenolic composites
NASA Technical Reports Server (NTRS)
Mcmanus, Hugh L.
1994-01-01
Work done in the period Jan. - June 1994 is summarized. Two threads of research have been followed. First, the thermodynamics approach was used to model the chemical and mechanical responses of composites exposed to high temperatures. The thermodynamics approach lends itself easily to the usage of variational principles. This thermodynamic-variational approach has been applied to the transpiration cooling problem. The second thread is the development of a better algorithm to solve the governing equations resulting from the modeling. Explicit finite difference method is explored for solving the governing nonlinear, partial differential equations. The method allows detailed material models to be included and solution on massively parallel supercomputers. To demonstrate the feasibility of the explicit scheme in solving nonlinear partial differential equations, a transpiration cooling problem was solved. Some interesting transient behaviors were captured such as stress waves and small spatial oscillations of transient pressure distribution.
Characterizing Task-Based OpenMP Programs
Muddukrishna, Ananya; Jonsson, Peter A.; Brorsson, Mats
2015-01-01
Programmers struggle to understand performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. Programmers can tune performance faster and understand performance tradeoffs more effectively than existing tools by using our method to characterize task-based performance. PMID:25860023
NASA Astrophysics Data System (ADS)
Childers, J. T.; Uram, T. D.; LeCompte, T. J.; Papka, M. E.; Benjamin, D. P.
2017-01-01
As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
DNA looping by FokI: the impact of synapse geometry on loop topology at varied site orientations
Rusling, David A.; Laurens, Niels; Pernstich, Christian; Wuite, Gijs J. L.; Halford, Stephen E.
2012-01-01
Most restriction endonucleases, including FokI, interact with two copies of their recognition sequence before cutting DNA. On DNA with two sites they act in cis looping out the intervening DNA. While many restriction enzymes operate symmetrically at palindromic sites, FokI acts asymmetrically at a non-palindromic site. The directionality of its sequence means that two FokI sites can be bridged in either parallel or anti-parallel alignments. Here we show by biochemical and single-molecule biophysical methods that FokI aligns two recognition sites on separate DNA molecules in parallel and that the parallel arrangement holds for sites in the same DNA regardless of whether they are in inverted or repeated orientations. The parallel arrangement dictates the topology of the loop trapped between sites in cis: the loop from inverted sites has a simple 180° bend, while that with repeated sites has a convoluted 360° turn. The ability of FokI to act at asymmetric sites thus enabled us to identify the synapse geometry for sites in trans and in cis, which in turn revealed the relationship between synapse geometry and loop topology. PMID:22362745
Drug-DNA interactions at single molecule level: A view with optical tweezers
NASA Astrophysics Data System (ADS)
Paramanathan, Thayaparan
Studies of small molecule--DNA interactions are essential for developing new drugs for challenging diseases like cancer and HIV. The main idea behind developing these molecules is to target and inhibit the reproduction of the tumor cells and infected cells. We mechanically manipulate single DNA molecules using optical tweezers to investigate two molecules that have complex and multiple binding modes. Mononuclear ruthenium complexes have been extensively studied as a test for rational drug design. Potential drug candidates should have high affinity to DNA and slow dissociation kinetics. To achieve this, motifs of the ruthenium complexes are altered. Our collaborators designed a dumb-bell shaped binuclear ruthenium complex that can only intercalate DNA by threading through its bases. Studying the binding properties of this complex in bulk studies took hours. By mechanically manipulating a single DNA molecule held with optical tweezers, we lower the barrier to threading and make it fast compared to the bulk experiments. Stretching single DNA molecules at different concentrations of drug molecules and holding them at a constant force allows the binding to reach equilibrium. From this we can obtain the equilibrium fractional ligand binding and the length of DNA at saturated binding. Fitting these results yields quantitative measurements of the binding thermodynamics and kinetics of this complex process. The second complex discussed in this study is Actinomycin D (ActD), a well-studied anti-cancer agent that is used as a prototype for developing new generations of drugs. However, the biophysical basis of its activity is still unclear. Because ActD is known to intercalate double stranded DNA (dsDNA), it was assumed to block replication by stabilizing dsDNA in front of the replication fork. However, recent studies have shown that ActD binds with even higher affinity to imperfect duplexes and some sequences of single stranded DNA (ssDNA). 
We directly measure the on and off rates by stretching the DNA molecule to a certain force and holding it at constant force while adding the drug and then while washing off the drug. Our finding resolves the long lasting controversy of ActD binding modes, clearly showing that both the dsDNA binding and ssDNA binding converge to the same single mode. The result supports the hypothesis that the primary characteristic of ActD that contributes to its biological activity is its ability to inhibit cellular replication by binding to transcription bubbles and causing cell death.
Distributed parallel computing in stochastic modeling of groundwater systems.
Dong, Yanhui; Li, Guomin; Xu, Haizhen
2013-03-01
Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling. © 2012, The Author(s). Groundwater © 2012, National Ground Water Association.
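The figures quoted in this abstract (50 processing threads, run time reduced to 3% of the serial time) can be turned into a speedup and a parallel efficiency. The following is a minimal sketch of that arithmetic; the function names are illustrative, and the calculation assumes the 3% refers to total wall-clock time for the same 500 realizations.

```python
def speedup_from_time_ratio(parallel_over_serial: float) -> float:
    """Speedup S = T_serial / T_parallel, given the ratio T_parallel / T_serial."""
    return 1.0 / parallel_over_serial


def parallel_efficiency(speedup: float, workers: int) -> float:
    """Efficiency E = S / p: the fraction of ideal linear speedup achieved."""
    return speedup / workers


if __name__ == "__main__":
    s = speedup_from_time_ratio(0.03)  # run time reduced to 3% of serial
    e = parallel_efficiency(s, 50)     # 50 processing threads
    print(f"speedup ~ {s:.1f}x, efficiency ~ {e:.0%}")
```

Under these assumptions the reported reduction corresponds to roughly a 33-fold speedup, or about two thirds of the ideal 50-fold figure, with the remainder presumably absorbed by scheduling, data transfer and other overheads.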
1995-01-01
possible to determine communication points. For this version, a C program spawning Posix threads and using semaphores to synchronize would have to...performance such as the time required for network communication and synchronization as well as issues of asynchrony and memory hierarchy. For example...enhances reusability. Process (or task) parallel computations can also be succinctly expressed with a small set of process creation and synchronization
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design uses fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design uses fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
Optics Program Modified for Multithreaded Parallel Computing
NASA Technical Reports Server (NTRS)
Lou, John; Bedding, Dave; Basinger, Scott
2006-01-01
A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.
Multi-threaded ATLAS simulation on Intel Knights Landing processors
NASA Astrophysics Data System (ADS)
Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration
2017-10-01
The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
Event Reconstruction for Many-core Architectures using Java
DOE Office of Scientific and Technical Information (OSTI.GOV)
Graf, Norman A.; /SLAC
Although Moore's Law remains technically valid, the performance enhancements in computing which traditionally resulted from increased CPU speeds ended years ago. Chip manufacturers have chosen to increase the number of core CPUs per chip instead of increasing clock speed. Unfortunately, these extra CPUs do not automatically result in improvements in simulation or reconstruction times. To take advantage of this extra computing power requires changing how software is written. Event reconstruction is globally serial, in the sense that raw data has to be unpacked first, channels have to be clustered to produce hits before those hits are identified as belonging to a track or shower, tracks have to be found and fit before they are vertexed, etc. However, many of the individual procedures along the reconstruction chain are intrinsically independent and are perfect candidates for optimization using multi-core architecture. Threading is perhaps the simplest approach to parallelizing a program and Java includes a powerful threading facility built into the language. We have developed a fast and flexible reconstruction package (org.lcsim) written in Java that has been used for numerous physics and detector optimization studies. In this paper we present the results of our studies on optimizing the performance of this toolkit using multiple threads on many-core architectures.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek
2016-10-06
We profile and optimize calculations performed with the BerkeleyGW code on the Xeon-Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels as well as on BLAS and FFT libraries. We describe the optimization process and performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread and node-level parallelism. We discuss locality changes (including the consequence of the lack of L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights-Landing including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well-suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band-pairs, and frequencies.
Automation of Data Traffic Control on DSM Architecture
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry
2001-01-01
The design of distributed shared memory (DSM) computers liberates users from the duty to distribute data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. DSM architecture greatly simplifies the development of parallel programs having good performance on a few processors. However, to achieve a good program scalability on DSM computers requires that the user understand data flow in the application and use various techniques to avoid data traffic congestions. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition and page size control and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestions in Fortran array oriented codes and advises the user on code transformations for improving data traffic in the application.
Parallel Eclipse Project Checkout
NASA Technical Reports Server (NTRS)
Crockett, Thomas M.; Joswig, Joseph C.; Shams, Khawaja S.; Powell, Mark W.; Bachmann, Andrew G.
2011-01-01
Parallel Eclipse Project Checkout (PEPC) is a program written to leverage parallelism and to automate the checkout process of plug-ins created in Eclipse RCP (Rich Client Platform). Eclipse plug-ins can be aggregated in a feature project. This innovation digests a feature description (xml file) and automatically checks out all of the plug-ins listed in the feature. This resolves the issue of manually checking out each plug-in required to work on the project. To minimize the amount of time necessary to check out the plug-ins, this program makes the plug-in checkouts parallel. After the feature is parsed, a checkout request is queued for each plug-in in the feature. These requests are handled by a thread pool with a configurable number of threads. By checking out the plug-ins in parallel, the checkout process is streamlined before getting started on the project. For instance, projects that took 30 minutes to check out now take less than 5 minutes. The effect is especially clear on a Mac, which has a network monitor displaying the bandwidth use. When running the client from a developer's home, the checkout process now saturates the bandwidth in order to get all the plug-ins checked out as fast as possible. For comparison, a checkout process that ranged from 8-200 Kbps from a developer's home is now able to saturate a pipe of 1.3 Mbps, resulting in significantly faster checkouts. Eclipse IDE (integrated development environment) tries to build a project as soon as it is downloaded. As part of another optimization, this innovation programmatically tells Eclipse to stop building while checkouts are happening, which dramatically reduces lock contention and enables plug-ins to continue downloading until all of them finish. Furthermore, the software re-enables automatic building, and forces Eclipse to do a clean build once it finishes checking out all of the plug-ins. This software is fully generic and does not contain any NASA-specific code. 
It can be applied to any Eclipse-based repository with a similar structure. It also can apply build parameters and preferences automatically at the end of the checkout.
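The pattern described in this abstract (parse a feature description, queue one checkout request per plug-in, and serve the requests from a thread pool of configurable size) can be sketched as follows. This is not the PEPC implementation, which is not shown in the source: the sample XML, the `checkout_plugin` placeholder and the `max_threads` parameter are all hypothetical, and the real tool would invoke a version-control client inside Eclipse rather than return a string.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical feature description, loosely modelled on an Eclipse feature.xml.
FEATURE_XML = """<feature id="example.feature" version="1.0.0">
  <plugin id="example.core" version="0.0.0"/>
  <plugin id="example.ui" version="0.0.0"/>
  <plugin id="example.net" version="0.0.0"/>
</feature>"""


def plugin_ids(feature_xml: str) -> list[str]:
    """Return the id of every <plugin> element listed in the feature."""
    root = ET.fromstring(feature_xml)
    return [plugin.attrib["id"] for plugin in root.iter("plugin")]


def checkout_plugin(plugin_id: str) -> str:
    """Placeholder for a real checkout; a network-bound call would go here."""
    return f"checked out {plugin_id}"


def checkout_all(feature_xml: str, max_threads: int = 4) -> list[str]:
    """Check out every plug-in using a pool with a configurable thread count."""
    ids = plugin_ids(feature_xml)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() returns results in the same order as the submitted ids.
        return list(pool.map(checkout_plugin, ids))


if __name__ == "__main__":
    for line in checkout_all(FEATURE_XML, max_threads=2):
        print(line)
```

A thread pool suits this workload because checkouts spend most of their time waiting on the network, so several can overlap even on a single core, which is consistent with the bandwidth saturation the abstract reports.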
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tripathi, S.; Zhang, D.; Paukstelis, P. J.
DNA has proved to be an excellent material for nanoscale construction because complementary DNA duplexes are programmable and structurally predictable. However, in the absence of Watson–Crick pairings, DNA can be structurally more diverse. Here, we describe the crystal structures of d(ACTCGGATGAT) and the brominated derivative, d(AC BrUCGGA BrUGAT). These oligonucleotides form parallel-stranded duplexes with a crystallographically equivalent strand, resulting in the first examples of DNA crystal structures that contain four different symmetric homo base pairs. Two of the parallel-stranded duplexes are coaxially stacked in opposite directions and locked together to form a tetraplex through intercalation of the 5'-most A–A base pairs between adjacent G–G pairs in the partner duplex. The intercalation region is a new type of DNA tertiary structural motif with similarities to the i-motif. ¹H–¹H nuclear magnetic resonance and native gel electrophoresis confirmed the formation of a parallel-stranded duplex in solution. Finally, we modified specific nucleotide positions and added d(GAY) motifs to oligonucleotides and were readily able to obtain similar crystals. This suggests that this parallel-stranded DNA structure may be useful in the rational design of DNA crystals and nanostructures.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
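The gap between a large speedup of the parallel part and a smaller overall speedup is what Amdahl's law describes. The sketch below is an illustration only: the function names are hypothetical, and plugging in the abstract's two figures (155 for the parallel part on synthetic data, about 7 per model on experimental data) mixes two different datasets, so the resulting fraction is indicative rather than a reported result.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only a fraction of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


def implied_parallel_fraction(overall_speedup, parallel_speedup):
    """Solve 1/S = (1 - p) + p/s for the parallel fraction p."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / parallel_speedup)


# Illustrative use of the abstract's figures (assumption: they can be combined).
p = implied_parallel_fraction(overall_speedup=7.0, parallel_speedup=155.0)
print(f"implied parallel fraction: {p:.3f}")
```

Under that assumption, the remaining serial share of the runtime, not the GPU kernel itself, would be what limits the end-to-end gain.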
Parallel satellite orbital situational problems solver for space missions design and control
NASA Astrophysics Data System (ADS)
Atanassov, Atanas Marinov
2016-11-01
Solving scientific problems for space applications requires observations, measurements, or active experiments to be carried out during time intervals in which specific geometric and physical conditions are fulfilled. Solving situational problems, that is, determining the time intervals in which the satellite instruments work optimally, is a very important part of every stage of the preparation and realization of space missions. Developing a universal, flexible and robust approach to situation analysis that is easily portable to new satellite missions is significant for reducing mission preparation times and costs. Every situation problem can be based on one or more situation conditions. Simultaneously solving different kinds of situation problems, based on different numbers and types of situational conditions, each satisfied on different segments of the satellite orbit, requires irregular calculations. Three formal approaches are presented. The first concerns the description of situation problems, which allows flexibility in assembling situation problems and representing them in computer memory. The second concerns the development of a situation problem solver organized as a processor that executes specific code for every particular situational condition. The third concerns parallelization of the solver using threads and dynamic scheduling based on a "pool of threads" abstraction, which ensures good load balance. The developed situation problem solver is intended for incorporation into multi-physics, multi-satellite space mission design and simulation tools.
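The three ideas in this abstract (problems assembled from conditions, a solver that evaluates each condition, and a thread pool that hands out problems dynamically) can be sketched as below. Everything here is an assumption made for illustration: conditions are modelled as plain predicates over a sampled time, and the names `solve_problem` and `solve_all` are hypothetical rather than taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_problem(conditions, times):
    """Return (start, end) sample-time pairs where every condition holds."""
    intervals, start, prev = [], None, None
    for t in times:
        ok = all(cond(t) for cond in conditions)
        if ok and start is None:
            start = t                        # an interval opens here
        elif not ok and start is not None:
            intervals.append((start, prev))  # it closed at the previous sample
            start = None
        prev = t
    if start is not None:
        intervals.append((start, prev))      # interval still open at the end
    return intervals


def solve_all(problems, times, workers=4):
    """Solve several situation problems on a pool of threads.

    Idle workers pick up the next pending problem, so problems with very
    different costs are balanced dynamically.
    """
    names = list(problems)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda name: solve_problem(problems[name], times), names)
        return dict(zip(names, results))


problems = {
    "instrument_window": [lambda t: t >= 2, lambda t: t <= 5],
    "late_pass": [lambda t: t > 7],
}
print(solve_all(problems, list(range(10))))
```

In CPython the threads in this sketch only show the scheduling structure; a compiled implementation would be needed for the computation itself to run in parallel.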
GPU-based Branchless Distance-Driven Projection and Backprojection
Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
2017-01-01
Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm. PMID:29333480
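The three-step factorization named in the abstract (integration, linear interpolation, differentiation) can be illustrated in one dimension. This is a simplified sketch under stated assumptions, not the authors' GPU code: it resamples cell values from one set of cell boundaries onto another, normalizes by destination cell width, and uses hypothetical function names. On a GPU the interpolation step would be done by texture hardware; here it is ordinary Python.

```python
from bisect import bisect_right


def integrate(bounds, values):
    """Running integral of piecewise-constant values at each cell boundary."""
    total = [0.0]
    for i, v in enumerate(values):
        total.append(total[-1] + v * (bounds[i + 1] - bounds[i]))
    return total


def interp(bounds, total, x):
    """Linearly interpolate the running integral at position x (clamped)."""
    x = min(max(x, bounds[0]), bounds[-1])
    i = min(bisect_right(bounds, x) - 1, len(bounds) - 2)
    t = (x - bounds[i]) / (bounds[i + 1] - bounds[i])
    return total[i] + t * (total[i + 1] - total[i])


def dd_resample(src_bounds, src_values, dst_bounds):
    """Integrate, interpolate at destination boundaries, then difference."""
    total = integrate(src_bounds, src_values)
    sampled = [interp(src_bounds, total, x) for x in dst_bounds]
    return [
        (sampled[j + 1] - sampled[j]) / (dst_bounds[j + 1] - dst_bounds[j])
        for j in range(len(dst_bounds) - 1)
    ]


print(dd_resample([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [0.0, 1.5, 3.0]))
```

Every destination cell goes through the same arithmetic regardless of how its boundaries fall relative to the source cells, which is the property that removes the data-dependent inner loop.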
Jiang, Hanyu; Ganesan, Narayan
2016-02-27
The HMMER software suite is widely used for highly sensitive analysis of homologous protein and nucleotide sequences. The latest version of hmmsearch in HMMER 3.x utilizes a heuristic pipeline, which consists of the MSV/SSV (Multiple/Single ungapped Segment Viterbi) stage, the P7Viterbi stage and the Forward scoring stage, to accelerate homology detection. Since the latest version is highly optimized for performance on modern multi-core CPUs with SSE capabilities, only a few acceleration attempts report speedup. However, the most compute-intensive tasks within the pipeline (viz., the MSV/SSV and P7Viterbi stages) still stand to benefit from the computational capabilities of massively parallel processors. A Multi-Tiered Parallel Framework (CUDAMPF) implemented on CUDA-enabled GPUs, presented here, offers finer-grained parallelism for the MSV/SSV and Viterbi algorithms. We couple the SIMT (Single Instruction Multiple Threads) mechanism with SIMD (Single Instruction Multiple Data) video instructions with warp-synchronism to achieve high-throughput processing and eliminate thread idling. We also propose a hardware-aware optimal allocation scheme for scarce resources like on-chip memory and caches in order to boost the performance and scalability of CUDAMPF. In addition, runtime compilation via NVRTC, available with CUDA 7.0, is incorporated into the presented framework; it not only helps unroll the innermost loop to yield up to a 2- to 3-fold speedup over static compilation but also enables dynamic loading and switching of kernels depending on the query model size, in order to achieve optimal performance. CUDAMPF is designed as a hardware-aware parallel framework for accelerating computational hotspots within the hmmsearch pipeline as well as other sequence alignment applications. It achieves significant speedup by exploiting hierarchical parallelism on a single GPU and takes full advantage of limited resources based on their own performance features. 
In addition to exceeding the performance of other acceleration attempts, comprehensive evaluations against high-end CPUs (Intel i5, i7 and Xeon) show that CUDAMPF yields up to 440 GCUPS for SSV, 277 GCUPS for MSV and 14.3 GCUPS for P7Viterbi, all with 100% accuracy, which translates to a maximum speedup of 37.5-, 23.1- and 11.6-fold for MSV, SSV and P7Viterbi respectively. The source code is available at https://github.com/Super-Hippo/CUDAMPF.
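To make the "single ungapped segment" idea concrete, the recurrence behind such a filter can be sketched as a gap-free local alignment: each cell extends the score on its diagonal or restarts at zero. This is a toy illustration, not HMMER's or CUDAMPF's kernel; the per-position score dictionaries, the flat mismatch penalty, and the function name are assumptions, and the real filters use quantized profile scores and vectorized memory layouts.

```python
def ungapped_segment_score(match_scores, seq, mismatch=-1):
    """Best score of any ungapped segment aligning model positions to seq.

    match_scores[k] maps a residue to its score at model position k;
    residues missing from the map receive the flat mismatch penalty.
    """
    best = 0
    prev = [0] * (len(match_scores) + 1)
    for residue in seq:
        cur = [0] * (len(match_scores) + 1)
        for k, column in enumerate(match_scores):
            # Extend along the diagonal (previous residue, previous model
            # position) or restart the segment at zero.
            score = prev[k] + column.get(residue, mismatch)
            cur[k + 1] = max(0, score)
            best = max(best, cur[k + 1])
        prev = cur
    return best


# Hypothetical three-position model that prefers A, then C, then G.
model = [{"A": 2}, {"C": 2}, {"G": 2}]
print(ungapped_segment_score(model, "TTACGTT"))
```

Each inner-loop cell counts as one cell update, which is the unit behind the GCUPS (giga cell updates per second) figures quoted in the abstract.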
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Rizk, Yehia M.
1999-01-01
The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable for distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. The programming effort for the _MPI code is greater than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones to a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. 
The total volume of grid points in each group is approximately balanced. A proper number of threads is initially allocated to each group, and in subsequent iterations during run time the number of threads is adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in OVERFLOW.
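One simple way to form groups whose total grid-point counts are approximately balanced is a greedy largest-first assignment. The abstract does not say which grouping strategy the MLP code uses, so the sketch below is an assumption for illustration, with hypothetical zone names and sizes.

```python
import heapq


def group_zones(zone_sizes, n_groups):
    """Assign zones to groups, largest zone first, always to the lightest group.

    zone_sizes maps a zone name to its number of grid points.
    Returns (groups, loads), where loads[g] is the point count of group g.
    """
    heap = [(0, g) for g in range(n_groups)]   # (current load, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    for zone, size in sorted(zone_sizes.items(), key=lambda kv: -kv[1]):
        load, g = heapq.heappop(heap)          # lightest group so far
        groups[g].append(zone)
        heapq.heappush(heap, (load + size, g))
    loads = [sum(zone_sizes[z] for z in grp) for grp in groups]
    return groups, loads


zones = {"z1": 8, "z2": 7, "z3": 6, "z4": 5, "z5": 4}
print(group_zones(zones, 2))
```

Any imbalance left after such a static grouping is the kind of residue that adjusting per-group thread counts at run time, as the abstract describes, could then absorb.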
DOE Office of Scientific and Technical Information (OSTI.GOV)
Childers, J. T.; Uram, T. D.; LeCompte, T. J.
As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Childers, J. T.; Uram, T. D.; LeCompte, T. J.; ...
2016-09-29
As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. Finally, this paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Liu, Lei; Zhao, Jing
2014-01-01
An efficient location-based query algorithm that protects user privacy in distributed networks is presented. The algorithm uses the users' location indexes and multiple parallel threads to quickly search for and select all candidate anonymous sets that contain more users and whose location information is more uniformly distributed, which accelerates the temporal-spatial anonymity operations, and it allows users to configure custom privacy-preserving location query requests. Simulation results show that the proposed algorithm can serve location queries for more users simultaneously, improve the performance of the anonymous server, and satisfy the users' anonymous location requests. PMID:24790579
2nd-Order CESE Results For C1.4: Vortex Transport by Uniform Flow
NASA Technical Reports Server (NTRS)
Friedlander, David J.
2015-01-01
The Conservation Element and Solution Element (CESE) method was used as implemented in the NASA research code ez4d. The CESE method is a time-accurate formulation with flux conservation in both space and time. The method treats the discretized derivatives of space and time identically; high-order versions exist, but the 2nd-order accurate version was used here. The ez4d code is an unstructured Navier-Stokes solver written in C++, with serial and parallel versions available. As part of its architecture, ez4d can use multithreading and the Message Passing Interface (MPI) for parallel runs.
Zhong, Cheng; Liu, Lei; Zhao, Jing
2014-01-01
An efficient location-based query algorithm that protects user privacy in distributed networks is presented. The algorithm uses the users' location indexes and multiple parallel threads to quickly search for and select all candidate anonymous sets that contain more users and whose location information is more uniformly distributed, which accelerates the temporal-spatial anonymity operations, and it allows users to configure custom privacy-preserving location query requests. Simulation results show that the proposed algorithm can serve location queries for more users simultaneously, improve the performance of the anonymous server, and satisfy the users' anonymous location requests.
Parallel gene analysis with allele-specific padlock probes and tag microarrays
Banér, Johan; Isaksson, Anders; Waldenström, Erik; Jarvius, Jonas; Landegren, Ulf; Nilsson, Mats
2003-01-01
Parallel, highly specific analysis methods are required to take advantage of the extensive information about DNA sequence variation and of expressed sequences. We present a scalable laboratory technique suitable to analyze numerous target sequences in multiplexed assays. Sets of padlock probes were applied to analyze single nucleotide variation directly in total genomic DNA or cDNA for parallel genotyping or gene expression analysis. All reacted probes were then co-amplified and identified by hybridization to a standard tag oligonucleotide array. The technique was illustrated by analyzing normal and pathogenic variation within the Wilson disease-related ATP7B gene, both at the level of DNA and RNA, using allele-specific padlock probes. PMID:12930977
Bradshaw, Charles Richard; Surendranath, Vineeth; Henschel, Robert; Mueller, Matthias Stefan; Habermann, Bianca Hermine
2011-01-01
Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. 
The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at http://vm1-hmmerthread.age.mpg.de. PMID:21423752
Comprehensive Synchronization Elimination for Java (PREPRINT)
2003-01-01
[Figure: legend categories % thread-local, % reentrant, % enclosed.] [Figure: per-benchmark plot, axis ticks 0.2 to 1.6; benchmarks: cassowary, javac, javacup, javadoc, jgl, jlex, pizza, array, instantdb, jlogo, plasma, slice.] Figure 6...1998. [DR98] P. Diniz and M. Rinard. Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-based Programs. In Journal
Parallel heuristics for scalable community detection
Lu, Hao; Halappanavar, Mahantesh; Kalyanaraman, Ananth
2015-08-14
Community detection has become a fundamental operation in numerous graph-theoretic applications. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains. Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing real speedups of up to 16x using 32 threads.
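The quantity that the Louvain method optimizes, modularity, can be computed directly for a given partition. The sketch below covers only this scoring step, for an unweighted, undirected graph without self-loops; it is not the paper's parallel heuristic, and the function name and example graph are made up for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges is a list of (u, v) pairs; community maps each vertex to a label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / (2m))^2.
    """
    m = len(edges)
    internal = {}   # edges with both endpoints inside a community
    degree = {}     # sum of vertex degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items()
    )


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, split))
```

In the serial method, each vertex move changes these per-community totals before the next vertex is considered, which is one way to see why the heuristic is hard to parallelize without altering its behaviour.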
The architecture of a eukaryotic replisome
Sun, Jingchuan; Yuan, Zuanning; Shi, Yi; ...
2015-11-02
At the eukaryotic DNA replication fork, it is widely believed that the Cdc45–Mcm2–7–GINS (CMG) helicase is positioned in front to unwind DNA and that DNA polymerases trail behind the helicase. Here we used single-particle EM to directly image a Saccharomyces cerevisiae replisome. Contrary to expectations, the leading strand Pol ε is positioned ahead of CMG helicase, whereas Ctf4 and the lagging-strand polymerase (Pol) α–primase are behind the helicase. This unexpected architecture indicates that the leading-strand DNA travels a long distance before reaching Pol ε, first threading through the Mcm2–7 ring and then making a U-turn at the bottom and reaching Pol ε at the top of CMG. Lastly, our work reveals an unexpected configuration of the eukaryotic replisome, suggests possible reasons for this architecture and provides a basis for further structural and biochemical replisome studies.
Miyoshi, Daisuke; Ueda, Yu-Mi; Shimada, Naohiko; Nakano, Shu-Ichi; Sugimoto, Naoki; Maruyama, Atsushi
2014-09-01
Electrostatic interactions play a major role in protein-DNA interactions. As a model system of a cationic protein, herein we focused on a comb-type copolymer of a polycation backbone and dextran side chains, poly(L-lysine)-graft-dextran (PLL-g-Dex), which has been reported to form soluble interpolyelectrolyte complexes with DNA strands. We investigated the effects of PLL-g-Dex on the conformation and thermodynamics of DNA oligonucleotides forming various secondary structures. Thermodynamic analysis of the DNA structures showed that the parallel conformations involved in both DNA duplexes and triplexes were significantly and specifically stabilized by PLL-g-Dex. On the basis of thermodynamic parameters, it was further possible to design DNA switches that undergo structural transition responding to PLL-g-Dex from an antiparallel duplex to a parallel triplex even with mismatches in the third strand hybridization. These results suggest that polycationic molecules are able to induce structural polymorphism of DNA oligonucleotides, because of the conformation-selective stabilization effects. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas
2015-01-01
Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
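The two ingredients named in this abstract, mutual information and permutation testing, can be sketched for a single gene pair as follows. This is a simplified illustration under stated assumptions: expression values are taken to be already discretized, a plain plug-in estimator is used (which may differ from the estimator in TINGe), and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in nats) of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def permutation_p_value(x, y, n_perm=200, seed=0):
    """Share of shuffles of y whose MI with x reaches the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)               # break any dependence on x
        if mutual_information(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)        # add-one correction


gene_a = [0, 1] * 20
gene_b = list(gene_a)                       # perfectly dependent toy pair
print(mutual_information(gene_a, gene_b), permutation_p_value(gene_a, gene_b))
```

For a network of G genes this test is repeated over on the order of G²/2 pairs, each independent of the others, which is why the computation maps well onto many cores and vector units.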
Magnetoresistance devices based on single-walled carbon nanotubes
NASA Astrophysics Data System (ADS)
Hod, Oded; Rabani, Eran; Baer, Roi
2005-08-01
We demonstrate the physical principles for the construction of a nanometer-sized magnetoresistance device based on the Aharonov-Bohm effect [Phys. Rev. 115, 485 (1959)]. The proposed device is made of a short single-walled carbon nanotube (SWCNT) placed on a substrate and coupled to a tip/contacts. We consider conductance due to the motion of electrons along the circumference of the tube (as opposed to the motion parallel to its axis). We find that the circumference conductance is sensitive to magnetic fields threading the SWCNT due to the Aharonov-Bohm effect, and show that by retracting the tip/contacts, so that the coupling to the SWCNT is reduced, very high sensitivity to the threading magnetic field develops. This is due to the formation of a narrow resonance through which the tunneling current flows. Using a bias potential the resonance can be shifted to low magnetic fields, allowing the control of conductance with magnetic fields of the order of 1 T.
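A quick estimate shows why sensitivity at fields near 1 T is notable. A full Aharonov-Bohm period corresponds to one flux quantum h/e through the tube cross-section, so the field needed is h/(e·πr²). The sketch below evaluates this for a tube radius of 0.5 nm, which is an assumed illustrative value and not a figure from the abstract.

```python
import math

PLANCK = 6.62607015e-34      # Planck constant, J s
E_CHARGE = 1.602176634e-19   # elementary charge, C


def field_for_one_flux_quantum(radius_m):
    """Axial field (tesla) that threads one flux quantum h/e through a tube."""
    flux_quantum = PLANCK / E_CHARGE          # about 4.14e-15 Wb
    return flux_quantum / (math.pi * radius_m ** 2)


# Assumed radius of 0.5 nm, chosen only for illustration.
print(f"{field_for_one_flux_quantum(0.5e-9):.0f} T")
```

With that assumed radius the full period lies in the thousands of tesla, so a response at about 1 T has to come from a very sharp feature, consistent with the narrow resonance the abstract describes.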
Separation and parallel sequencing of the genomes and transcriptomes of single cells using G&T-seq.
Macaulay, Iain C; Teng, Mabel J; Haerty, Wilfried; Kumar, Parveen; Ponting, Chris P; Voet, Thierry
2016-11-01
Parallel sequencing of a single cell's genome and transcriptome provides a powerful tool for dissecting genetic variation and its relationship with gene expression. Here we present a detailed protocol for G&T-seq, a method for separation and parallel sequencing of genomic DNA and full-length polyA(+) mRNA from single cells. We provide step-by-step instructions for the isolation and lysis of single cells; the physical separation of polyA(+) mRNA from genomic DNA using a modified oligo-dT bead capture and the respective whole-transcriptome and whole-genome amplifications; and library preparation and sequence analyses of these amplification products. The method allows the detection of thousands of transcripts in parallel with the genetic variants captured by the DNA-seq data from the same single cell. G&T-seq differs from other currently available methods for parallel DNA and RNA sequencing from single cells, as it involves physical separation of the DNA and RNA and does not require bespoke microfluidics platforms. The process can be implemented manually or through automation. When performed manually, paired genome and transcriptome sequencing libraries from eight single cells can be produced in ∼3 d by researchers experienced in molecular laboratory work. For users with experience in the programming and operation of liquid-handling robots, paired DNA and RNA libraries from 96 single cells can be produced in the same time frame. Sequence analysis and integration of single-cell G&T-seq DNA and RNA data requires a high level of bioinformatics expertise and familiarity with a wide range of informatics tools.
Yoon, M.; Park, W.; Nam, Y. K.; Kim, D. S.
2012-01-01
Genetic diversities, population genetic structures and demographic histories of the thread-sail filefish Stephanolepis cirrhifer were investigated by nucleotide sequencing of 336 base pairs of the mitochondrial DNA (mtDNA) control region in 111 individuals collected from six populations in Korean coastal waters. A total of 70 haplotypes were defined by 58 variable nucleotide sites. The neighbor-joining tree of the 70 haplotypes was shallow and did not provide evidence of geographical associations. Expansion of S. cirrhifer populations began approximately 51,000 to 102,000 years before present, correlating with the period of sea level rise since the late Pleistocene glacial maximum. High levels of haplotype diversities (0.974±0.029 to 1.000±0.076) and nucleotide diversities (0.014 to 0.019), and low levels of genetic differentiation among populations inferred from pairwise population FST values (−0.007 to 0.107), support an expansion of the S. cirrhifer population. Hierarchical analysis of molecular variance (AMOVA) revealed weak but significant genetic structures among three groups (FCT = 0.028, p<0.05), and no genetic variation within groups (0.53%; FSC = 0.005, p = 0.23). These results may help establish appropriate fishery management strategies for stocks of S. cirrhifer and related species. PMID:25049547
Practical Formal Verification of MPI and Thread Programs
NASA Astrophysics Data System (ADS)
Gopalakrishnan, Ganesh; Kirby, Robert M.
Large-scale simulation codes in science and engineering are written using the Message Passing Interface (MPI). Shared memory threads are widely used directly, or to implement higher level programming abstractions. Traditional debugging methods for MPI or thread programs are incapable of providing useful formal guarantees about coverage. They get bogged down in the sheer number of interleavings (schedules), often missing shallow bugs. In this tutorial we will introduce two practical formal verification tools: ISP (for MPI C programs) and Inspect (for Pthread C programs). Unlike other formal verification tools, ISP and Inspect run directly on user source codes (much like a debugger). They pursue only the relevant set of process interleavings, using our own customized Dynamic Partial Order Reduction algorithms. For a given test harness, DPOR allows these tools to guarantee the absence of deadlocks, instrumented MPI object leaks and communication races (using ISP), and shared memory races (using Inspect). ISP and Inspect have been used to verify large pieces of code: in excess of 10,000 lines of MPI/C for ISP in under 5 seconds, and about 5,000 lines of Pthread/C code in a few hours (and much faster with the use of a cluster or by exploiting special cases such as symmetry) for Inspect. We will also demonstrate the Microsoft Visual Studio and Eclipse Parallel Tools Platform integrations of ISP (these will be available on the LiveCD).
Initial Kernel Timing Using a Simple PIM Performance Model
NASA Technical Reports Server (NTRS)
Katz, Daniel S.; Block, Gary L.; Springer, Paul L.; Sterling, Thomas; Brockman, Jay B.; Callahan, David
2005-01-01
This presentation will describe some initial results of paper-and-pencil studies of 4 or 5 application kernels applied to a processor-in-memory (PIM) system roughly similar to the Cascade Lightweight Processor (LWP). The application kernels are: linked list traversal, sum of leaf nodes on a tree, bitonic sort, vector sum, and Gaussian elimination. The intent of this work is to guide and validate work on the Cascade project in the areas of compilers, simulators, and languages. We will first discuss the generic PIM structure. Then, we will explain the concepts needed to program a parallel PIM system (locality, threads, parcels). Next, we will present a simple PIM performance model that will be used in the remainder of the presentation. For each kernel, we will then present a set of codes, including codes for a single PIM node, and codes for multiple PIM nodes that move data to threads and move threads to data. These codes are written at a fairly low level, between assembly and C, but much closer to C than to assembly. For each code, we will present some hand-drafted timing forecasts, based on the simple PIM performance model. Finally, we will conclude by discussing what we have learned from this work, including what programming styles seem to work best, from the point-of-view of both expressiveness and performance.
Condensin confers the longitudinal rigidity of chromosomes.
Houlard, Martin; Godwin, Jonathan; Metson, Jean; Lee, Jibak; Hirano, Tatsuya; Nasmyth, Kim
2015-06-01
In addition to inter-chromatid cohesion, mitotic and meiotic chromatids must have three physical properties: compaction into 'threads' roughly co-linear with their DNA sequence, intra-chromatid cohesion determining their rigidity, and a mechanism to promote sister chromatid disentanglement. A fundamental issue in chromosome biology is whether a single molecular process accounts for all three features. There is universal agreement that a pair of Smc-kleisin complexes called condensin I and II facilitate sister chromatid disentanglement, but whether they also confer thread formation or longitudinal rigidity is either controversial or has never been directly addressed respectively. We show here that condensin II (beta-kleisin) has an essential role in all three processes during meiosis I in mouse oocytes and that its function overlaps with that of condensin I (gamma-kleisin), which is otherwise redundant. Pre-assembled meiotic bivalents unravel when condensin is inactivated by TEV cleavage, proving that it actually holds chromatin fibres together.
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm
NASA Astrophysics Data System (ADS)
Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian
2017-03-01
DNA is one of the carriers of genetic information in living organisms. Encoding, sequencing, and clustering DNA sequences have become key routine tasks in molecular biology, particularly in bioinformatics applications. There are two types of clustering: hierarchical clustering and partitioning clustering. In this paper, we combine the two types, K-Means (partitioning clustering) and DIANA (hierarchical clustering), into what we call hybrid clustering. Hybrid clustering using the Parallel K-Means algorithm and the DIANA algorithm is applied to cluster DNA sequences of Human Papillomavirus (HPV). The clustering process starts with collecting HPV DNA sequences from NCBI (National Center for Biotechnology Information), followed by feature extraction from the DNA sequences. The extracted features are stored in matrix form; this matrix is normalized using Min-Max normalization, and genetic distances are calculated using the Euclidean distance. The hybrid clustering is then applied using implementations of the Parallel K-Means and DIANA algorithms. The aim of hybrid clustering is to obtain better clustering results. To validate the resulting clusters and find the optimum number of clusters, we use the Davies-Bouldin Index (DBI). In this study, Parallel K-Means clustering groups the data into 5 clusters with a minimal DBI value of 0.8741, and hybrid clustering groups the data into 13 sub-clusters with minimal DBI values of 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The DBI values of hybrid clustering are lower than the DBI value of Parallel K-Means clustering, which is performed only at the first stage. This means that hybrid clustering gives better results for clustering HPV DNA sequences than Parallel K-Means clustering alone.
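The normalization, distance, and validation steps named in the abstract above can be sketched in a few lines. This is only a minimal illustration, assuming a small numeric feature matrix and precomputed cluster labels; the function names are hypothetical and the code is not taken from the paper.

```python
import math

def min_max_normalize(rows):
    """Rescale each column of a feature matrix to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    """Davies-Bouldin index; needs at least two clusters with distinct centroids."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: [sum(c) / len(ps) for c in zip(*ps)]
                 for lab, ps in clusters.items()}
    # Mean distance of each cluster's members to its own centroid.
    scatter = {lab: sum(euclidean(p, centroids[lab]) for p in ps) / len(ps)
               for lab, ps in clusters.items()}
    keys = list(clusters)
    total = 0.0
    for i in keys:
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
                     for j in keys if j != i)
    return total / len(keys)

# Toy feature matrix (assumed values) with two obvious groups.
features = min_max_normalize([[1, 10], [2, 20], [9, 90], [10, 100]])
dbi = davies_bouldin(features, [0, 0, 1, 1])
```

A lower index means more compact and better separated clusters, which is the sense in which the abstract compares the hybrid and K-Means-only results.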
THREaD Mapper Studio: a novel, visual web server for the estimation of genetic linkage maps
Cheema, Jitender; Ellis, T. H. Noel; Dicks, Jo
2010-01-01
The estimation of genetic linkage maps is a key component in plant and animal research, providing both an indication of the genetic structure of an organism and a mechanism for identifying candidate genes associated with traits of interest. Because of this importance, several computational solutions to genetic map estimation exist, mostly implemented as stand-alone software packages. However, the estimation process is often largely hidden from the user. Consequently, problems such as a program crashing may occur that leave a user baffled. THREaD Mapper Studio (http://cbr.jic.ac.uk/threadmapper) is a new web site that implements a novel, visual and interactive method for the estimation of genetic linkage maps from DNA markers. The rationale behind the web site is to make the estimation process as transparent and robust as possible, while also allowing users to use their expert knowledge during analysis. Indeed, the 3D visual nature of the tool allows users to spot features in a data set, such as outlying markers and potential structural rearrangements that could cause problems with the estimation procedure and to account for them in their analysis. Furthermore, THREaD Mapper Studio facilitates the visual comparison of genetic map solutions from third party software, aiding users in developing robust solutions for their data sets. PMID:20494977
NASA Astrophysics Data System (ADS)
Huang, Melin; Huang, Bormin; Huang, Allen H.
2014-10-01
For weather forecasting and research, the Weather Research and Forecasting (WRF) model has been developed, consisting of several components such as dynamic solvers and physical simulation modules. WRF includes several Land-Surface Models (LSMs). The LSMs use atmospheric information, the radiative and precipitation forcing from the surface layer scheme, the radiation scheme, and the microphysics/convective scheme all together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are very suitable for massively parallel computation as there are no interactions among horizontal grid points. The features, efficient parallelization and vectorization essentials, of Intel Many Integrated Core (MIC) architecture allow us to optimize this WRF 5-layer thermal diffusion scheme. In this work, we present the results of the computing performance on this scheme with Intel MIC architecture. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.1x. Accordingly, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
Parallel mutual information estimation for inferring gene regulatory networks on GPUs
2011-01-01
Background Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced B-spline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity. Results We present a new approach to accelerate the B-spline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDA-MI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multi-threaded implementation on a quad-core CPU for large microarray datasets. We have used the results obtained by CUDA-MI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDA-MI produces GRNs of higher quality in less time. Conclusions CUDA-MI is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multi-threaded implementation by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:21672264
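For orientation, the simple histogram-based estimator that the abstract above treats as the lower-quality baseline can be sketched as follows. This is not the B-spline method or the CUDA-MI implementation; the function name, the equal-width binning, and the default bin count are assumptions made for illustration.

```python
import math
from collections import Counter

def histogram_mutual_information(x, y, bins=4):
    """Estimate mutual information (in bits) from equal-width histograms."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = 0.0
    for (i, j), count in pxy.items():
        # p(i,j) * log2( p(i,j) / (p(i) * p(j)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi

identical = histogram_mutual_information([0, 1, 2, 3], [0, 1, 2, 3], bins=4)
independent = histogram_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2)
```

For gene regulatory network inference, an estimate of this kind is computed for every pair of genes, so the pairwise computations are independent of one another and map naturally onto many GPU threads.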
Jali - Unstructured Mesh Infrastructure for Multi-Physics Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Garimella, Rao V; Berndt, Markus; Coon, Ethan
2017-04-13
Jali is a parallel unstructured mesh infrastructure library designed for use by multi-physics simulations. It supports 2D and 3D arbitrary polyhedral meshes distributed over hundreds to thousands of nodes. Jali can read and write Exodus II meshes along with fields and sets on the mesh and support for other formats is partially implemented or is (https://github.com/MeshToolkit/MSTK), an open source general purpose unstructured mesh infrastructure library from Los Alamos National Laboratory. While it has been made to work with other mesh frameworks such as MOAB and STKmesh in the past, support for maintaining the interface to these frameworks has been suspended for now. Jali supports distributed as well as on-node parallelism. Support of on-node parallelism is through direct use of the mesh in multi-threaded constructs or through the use of "tiles" which are submeshes or sub-partitions of a partition destined for a compute node.
Performance enhancement of various real-time image processing techniques via speculative execution
NASA Astrophysics Data System (ADS)
Younis, Mohamed F.; Sinha, Purnendu; Marlowe, Thomas J.; Stoyenko, Alexander D.
1996-03-01
In real-time image processing, an application must satisfy a set of timing constraints while ensuring the semantic correctness of the system. Because of the natural structure of digital data, pure data and task parallelism have been used extensively in real-time image processing to accelerate the handling time of image data. These types of parallelism are based on splitting the execution load performed by a single processor across multiple nodes. However, execution of all parallel threads is mandatory for correctness of the algorithm. On the other hand, speculative execution is an optimistic execution of part(s) of the program based on assumptions on program control flow or variable values. Rollback may be required if the assumptions turn out to be invalid. Speculative execution can enhance average, and sometimes worst-case, execution time. In this paper, we target various image processing techniques to investigate applicability of speculative execution. We identify opportunities for safe and profitable speculative execution in image compression, edge detection, morphological filters, and blob recognition.
The Software Correlator of the Chinese VLBI Network
NASA Technical Reports Server (NTRS)
Zheng, Weimin; Quan, Ying; Shu, Fengchun; Chen, Zhong; Chen, Shanshan; Wang, Weihua; Wang, Guangli
2010-01-01
The software correlator of the Chinese VLBI Network (CVN) has played an irreplaceable role in the CVN routine data processing, e.g., in the Chinese lunar exploration project. This correlator will be upgraded to process geodetic and astronomical observation data. In the future, with several new stations joining the network, CVN will carry out crustal movement observations, quick UT1 measurements, astrophysical observations, and deep space exploration activities. For the geodetic or astronomical observations, we need a wide-band 10-station correlator. For spacecraft tracking, a real-time and highly reliable correlator is essential. To meet the scientific and navigation requirements of CVN, two parallel software correlators in the multiprocessor environments are under development. A high speed, 10-station prototype correlator using the mixed Pthreads and MPI (Message Passing Interface) parallel algorithm on a computer cluster platform is being developed. Another real-time software correlator for spacecraft tracking adopts the thread-parallel technology, and it runs on the SMP (Symmetric Multiple Processor) servers. Both correlators have the characteristic of flexible structure and scalability.
Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Su, Chun-Yi
By 2004, microprocessor design focused on multicore scaling—increasing the number of cores per die in each generation—as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. 
I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and determine thread mappings. In addition, I implement a runtime system that combines concurrent throttling and a novel thread mapping algorithm to manage thread resources and improve energy-efficient execution in multi-core, NUMA systems.
Screw-Thread Standards for Federal Services, 1957. Handbook H28 (1957), Part 3
1957-09-01
MOUNTING THREADS PHOTOGRAPHIC EQUIPMENT THREADS ISO METRIC THREADS; MISCELLANEOUS THREADS CLASS 5 INTERFERENCE-FIT THREADS, TRIAL STANDARD WRENCH...Bibliography on measurement of pitch diameter by means of wires 60 Appendix 14. Metric screw-thread standards 61 1. ISO thread profiles...61 2. Standard series for ISO metric threads 62 3. Designations for ISO metric threads 62 Tables Page Table XII. 1.—Basic
Yang, Haozhe; Mei, Hui; Seela, Frank
2015-07-06
Reverse Watson-Crick DNA with parallel-strand orientation (ps DNA) has been constructed. Pyrrolo-dC (PyrdC) nucleosides with phenyl and pyridinyl residues linked to the 6 position of the pyrrolo[2,3-d]pyrimidine base have been incorporated in 12- and 25-mer oligonucleotide duplexes and utilized as silver-ion binding sites. Thermal-stability studies on the parallel DNA strands demonstrated extremely strong silver-ion binding and strongly enhanced duplex stability. Stoichiometric UV and fluorescence titration experiments verified that a single (2py) PyrdC-(2py) PyrdC pair captures two silver ions in ps DNA. A structure for the PyrdC silver-ion base pair that aligns 7-deazapurine bases head-to-tail instead of head-to-head, as suggested for canonical DNA, is proposed. The silver DNA double helix represents the first example of a ps DNA structure built up of bidentate and tridentate reverse Watson-Crick base pairs stabilized by a dinuclear silver-mediated PyrdC pair. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Tsui, Nancy B. Y.; Jiang, Peiyong; Chow, Katherine C. K.; Su, Xiaoxi; Leung, Tak Y.; Sun, Hao; Chan, K. C. Allen; Chiu, Rossa W. K.; Lo, Y. M. Dennis
2012-01-01
Background Fetal DNA in maternal urine, if present, would be a valuable source of fetal genetic material for noninvasive prenatal diagnosis. However, the existence of fetal DNA in maternal urine has remained controversial. The issue is due to the lack of appropriate technology to robustly detect the potentially highly degraded fetal DNA in maternal urine. Methodology We have used massively parallel paired-end sequencing to investigate cell-free DNA molecules in maternal urine. Catheterized urine samples were collected from seven pregnant women during the third trimester of pregnancies. We detected fetal DNA by identifying sequenced reads that contained fetal-specific alleles of the single nucleotide polymorphisms. The sizes of individual urinary DNA fragments were deduced from the alignment positions of the paired reads. We measured the fractional fetal DNA concentration as well as the size distributions of fetal and maternal DNA in maternal urine. Principal Findings Cell-free fetal DNA was detected in five of the seven maternal urine samples, with the fractional fetal DNA concentrations ranging from 1.92% to 4.73%. Fetal DNA became undetectable in maternal urine after delivery. The total urinary cell-free DNA molecules were less intact when compared with plasma DNA. Urinary fetal DNA fragments were very short, and the most dominant fetal sequences were between 29 bp and 45 bp in length. Conclusions With the use of massively parallel sequencing, we have confirmed the existence of transrenal fetal DNA in maternal urine, and have shown that urinary fetal DNA was heavily degraded. PMID:23118982
Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D
2008-08-01
Readily available temporal imaging, or time series volumetric (4D) imaging, has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because of the constraints of the available programming platforms. In addition, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance is compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 × 10^6 to 14.2 × 10^6 pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. 
Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI = 0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
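The TPMI figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes that the shortest reported GPU time (1.8 s) belongs to the smallest image and the longest (13.5 s) to the largest, which the abstract implies but does not state outright; the function name is hypothetical.

```python
def tpmi(seconds, pixels, iterations):
    """Time per megapixel per iteration, in seconds (spmi)."""
    return seconds / (pixels / 1e6) / iterations

# Endpoints of the reported GPU range, 100 iterations each (pairing assumed).
gpu_small = tpmi(1.8, 2.0e6, 100)    # about 0.0090 spmi
gpu_large = tpmi(13.5, 14.2e6, 100)  # about 0.0095 spmi

# Ratios of the reported mean CPU TPMIs to the reported mean GPU TPMI.
single_thread_speedup = 0.527 / 0.00916  # about 57.5
multi_thread_speedup = 0.335 / 0.00916   # about 36.6
```

Both endpoint values sit close to the reported mean of 0.00916 spmi, and both ratios fall inside the reported speedup ranges of 55-61 and 34-39.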
A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma, Kwan-Liu
Most of today’s visualization libraries and applications are based on what is known today as the visualization pipeline. In the visualization pipeline model, algorithms are encapsulated as “filtering” components with inputs and outputs. These components can be combined by connecting the outputs of one filter to the inputs of another filter. The visualization pipeline model is popular because it provides a convenient abstraction that allows users to combine algorithms in powerful ways. Unfortunately, the visualization pipeline cannot run effectively on exascale computers. Experts agree that the exascale machine will comprise processors that contain many cores. Furthermore, physical limitations will prevent data movement in and out of the chip (that is, between main memory and the processing cores) from keeping pace with improvements in overall compute performance. To use these processors to their fullest capability, it is essential to carefully consider memory access. This is where the visualization pipeline fails. Each filtering component in the visualization library is expected to take a data set in its entirety, perform some computation across all of the elements, and output the complete results. The process of iterating over all elements must be repeated in each filter, which is one of the worst possible ways to traverse memory when trying to maximize the number of executions per memory access. This project investigates a new type of visualization framework that exhibits a pervasive parallelism necessary to run on exascale machines. Our framework achieves this by defining algorithms in terms of functors, which are localized, stateless operations. Functors can be composited in much the same way as filters in the visualization pipeline. But, functors’ design allows them to be concurrently running on massive amounts of lightweight threads. 
Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale computer. This project concludes with a functional prototype containing pervasively parallel algorithms that perform demonstrably well on many-core processors. These algorithms are fundamental for performing data analysis and visualization at extreme scale.
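The contrast drawn above between pipeline filters and functors can be illustrated with a small sketch. It is an assumption-laden toy, not the project's framework: the function names and the two example operations are hypothetical, and plain Python lists stand in for large data sets.

```python
def compose(*functors):
    """Fuse stateless per-element functors into one per-element operation."""
    def fused(value):
        for functor in functors:
            value = functor(value)
        return value
    return fused

def to_celsius(kelvin):
    return kelvin - 273.15

def clamp_nonnegative(value):
    return max(value, 0.0)

def run_pipeline(data, filters):
    """Pipeline style: every filter traverses the whole data set again."""
    for f in filters:
        data = [f(v) for v in data]
    return data

def run_fused(data, functors):
    """Functor style: one traversal, all operations applied per element."""
    op = compose(*functors)
    return [op(v) for v in data]

samples = [250.0, 273.15, 300.0]
pipeline_result = run_pipeline(samples, [to_celsius, clamp_nonnegative])
fused_result = run_fused(samples, [to_celsius, clamp_nonnegative])
```

Because the fused operation keeps no state between elements, each element could in principle be handed to its own lightweight thread, which is the property the abstract relies on.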
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chiu, George L.; Eichenberger, Alexandre E.; O'Brien, John K. P.
The present disclosure relates generally to a dedicated memory structure (that is, hardware device) holding data for detecting available worker thread(s) and informing available worker thread(s) of task(s) to execute.
A dynamic bead-based microarray for parallel DNA detection
NASA Astrophysics Data System (ADS)
Sochol, R. D.; Casavant, B. P.; Dueck, M. E.; Lee, L. P.; Lin, L.
2011-05-01
A microfluidic system has been designed and constructed by means of micromachining processes to integrate both microfluidic mixing of mobile microbeads and hydrodynamic microbead arraying capabilities on a single chip to simultaneously detect multiple bio-molecules. The prototype system has four parallel reaction chambers, which include microchannels of 18 × 50 µm² cross-sectional area and a microfluidic mixing section of 22 cm length. Parallel detection of multiple DNA oligonucleotide sequences was achieved via molecular beacon probes immobilized on polystyrene microbeads of 16 µm diameter. Experimental results show quantitative detection of three distinct DNA oligonucleotide sequences from the Hepatitis C viral (HCV) genome with single base-pair mismatch specificity. Our dynamic bead-based microarray offers an effective microfluidic platform to increase parallelization of reactions and improve microbead handling for various biological applications, including bio-molecule detection, medical diagnostics and drug screening.
Combinatorial algorithms for design of DNA arrays.
Hannenhalli, Sridhar; Hubell, Earl; Lipshutz, Robert; Pevzner, Pavel A
2002-01-01
Optimal design of DNA arrays requires the development of algorithms with two-fold goals: reducing the effects caused by unintended illumination (border length minimization problem) and reducing the complexity of masks (mask decomposition problem). We describe algorithms that reduce the number of rectangles in mask decomposition by 20-30% as compared to a standard array design under the assumption that the arrangement of oligonucleotides on the array is fixed. This algorithm produces a provably optimal solution for all studied real instances of array design. We also address the difficult problem of finding an arrangement which minimizes the border length and come up with a new idea of threading that significantly reduces the border length as compared to standard designs.
A DNA-Based Semantic Fusion Model for Remote Sensing Data
Sun, Heng; Weng, Jian; Yu, Guangchuang; Massawe, Richard H.
2013-01-01
Semantic technology plays a key role in various domains, from conversation understanding to algorithm analysis. As the most efficient semantic tool, ontology can represent, process and manage widespread knowledge. Nowadays, many researchers use ontology to collect and organize the semantic information of data in order to maximize research productivity. In this paper, we first describe our work on the development of a remote sensing data ontology, with a primary focus on semantic fusion-driven research for big data. Our ontology is made up of 1,264 concepts and 2,030 semantic relationships. However, the growth of big data is straining the capacities of current semantic fusion and reasoning practices. Considering the massive parallelism of DNA strands, we propose a novel DNA-based semantic fusion model. In this model, a parallel strategy is developed to encode the semantic information in DNA for a large volume of remote sensing data. The semantic information is read in a parallel and bit-wise manner and each individual bit is converted to a base. By doing so, a considerable amount of conversion time can be saved: the cluster-based multi-process program reduces the conversion time from 81,536 seconds to 4,937 seconds for 4.34 GB of source data files. Moreover, the size of the result file recording the DNA sequences is 54.51 GB for the parallel C program compared with 57.89 GB for the sequential Perl program. This shows that our parallel method can also reduce the DNA synthesis cost. In addition, data types are encoded in our model, which is a basis for building a type system in our future DNA computer. Finally, we describe theoretically an algorithm for DNA-based semantic fusion. This algorithm enables the integration of knowledge from disparate remote sensing data sources into a consistent, accurate, and complete representation. This process depends solely on ligation reaction and screening operations instead of the ontology. PMID:24116207
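The figures quoted in this abstract can be turned into a speedup and a size reduction with a few lines of arithmetic. The following is only a reader's sketch: the variable names are illustrative, and the only inputs are the four numbers stated in the abstract.

```python
# Figures taken directly from the abstract above.
sequential_seconds = 81_536   # sequential conversion time
parallel_seconds = 4_937      # cluster-based multi-process conversion time
perl_output_gb = 57.89        # result file size, sequential Perl program
c_output_gb = 54.51           # result file size, parallel C program

# Speedup is the ratio of the sequential time to the parallel time.
speedup = sequential_seconds / parallel_seconds

# Relative reduction in the size of the DNA-sequence result file.
size_reduction_pct = 100 * (perl_output_gb - c_output_gb) / perl_output_gb

print(f"speedup: {speedup:.2f}x")
print(f"result file is {size_reduction_pct:.2f}% smaller")
```

Under these numbers the parallel program is roughly 16.5 times faster, while the output file shrinks by just under 6%.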
Rhizobium sp. Degradation of Legume Root Hair Cell Wall at the Site of Infection Thread Origin
Ridge, Robert W.; Rolfe, Barry G.
1985-01-01
Using a new microinoculation technique, we demonstrated that penetration of Rhizobium sp. into the host root hair cell occurs at 20 to 22 h after inoculation. The bacteria penetrated by dissolving the cell wall matrix, leaving a layer of depolymerized wall microfibrils. Colony growth pressure “stretched” the weakened wall, forming a bulge into an interfacial zone between the wall and plasmalemma. At the same time vesicular bodies, similar to plasmalemmasomes, accumulated at the penetration site in a manner which parallels host-pathogen systems. PMID:16346892
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor
NASA Astrophysics Data System (ADS)
Hristov, Ivan; Goranov, Goran; Hristova, Radoslava
2018-02-01
We consider a standing wave simulation that is interesting from a computational point of view, obtained by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy-equivalent Intel architectures: 2× Xeon E5-2695 v2 processors (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and a Xeon Phi 7250 processor (code-named "Knights Landing", KNL). The results show 2 times better performance on the KNL processor.
OpenMP parallelization of a gridded SWAT (SWATG)
NASA Astrophysics Data System (ADS)
Zhang, Ying; Hou, Jinliang; Cao, Yongpan; Gu, Juan; Huang, Chunlin
2017-12-01
Large-scale, long-term and high spatial resolution simulation is a common issue in environmental modeling. A Gridded Hydrologic Response Unit (HRU)-based Soil and Water Assessment Tool (SWATG) that integrates a grid modeling scheme with different spatial representations also presents such problems. The long run times limit applications of very high resolution large-scale watershed modeling. The OpenMP (Open Multi-Processing) parallel application interface is integrated with SWATG (the result is called SWATGP) to accelerate grid modeling at the HRU level. Such a parallel implementation takes better advantage of the computational power of a shared memory computer system. We conducted two experiments at multiple temporal and spatial scales of hydrological modeling using SWATG and SWATGP on a high-end server. At 500-m resolution, SWATGP was found to be up to nine times faster than SWATG in modeling a watershed of roughly 2000 km² with 1 CPU and a 15-thread configuration. The study results demonstrate that parallel models save considerable time relative to traditional sequential simulation runs. Parallel computations of environmental models are beneficial for model applications, especially at large spatial and temporal scales and at high resolutions. The proposed SWATGP model is thus a promising tool for large-scale and high-resolution water resources research and management in addition to offering data fusion and model coupling ability.
Kenney, Rachael M; Buxton, Katherine E; Glazier, Samantha
2016-09-01
Doxorubicin and nogalamycin are antitumor antibiotics that interact with DNA via intercalation and threading mechanisms, respectively. Because the importance of water, particularly its impact on entropy changes, has been established in other biological processes, we investigated the role of water in these two drug-DNA binding events. We used the osmotic stress method to calculate the number of water molecules exchanged (Δnwater), and isothermal titration calorimetry to measure Kbinding, ΔH, and ΔS for two synthetic DNAs, poly(dA·dT) and poly(dG·dC), and calf thymus DNA (CT DNA). For nogalamycin, Δnwater<0 for CT DNA and poly(dG·dC). For doxorubicin, Δnwater>0 for CT DNA and Δnwater<0 for poly(dG·dC). For poly(dA·dT), Δnwater~0 with both drugs. Net enthalpy changes were always negative, but net entropy changes depended on the drug. The effect of water exchange on the overall sign of entropy change appears to be smaller than other contributions. Copyright © 2016 Elsevier B.V. All rights reserved.
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences is a common operation frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture, i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance of up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi.
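For readers unfamiliar with the kernel being parallelized, the following is a minimal scalar sketch of Smith-Waterman local alignment scoring, together with the GCUPS (billion cell updates per second) metric used in the abstract. It is not the authors' implementation: the linear gap penalty, the scoring values, and the function names are assumptions chosen for illustration, and none of the three parallelization levels is reproduced here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Assumed scoring scheme (illustrative only): linear gap penalty,
    fixed match/mismatch scores. Uses two rows of the DP matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(query_length, database_residues, seconds):
    """Billion DP-cell updates per second for one database scan."""
    return query_length * database_residues / seconds / 1e9


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Each database sequence can be scored independently of the others, which is what makes the cluster-level and thread-level decomposition described above natural; the inner loop over cells is the part typically mapped to vector units.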
Synchronizing compute node time bases in a parallel computer
Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip
2015-01-27
Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.
Synchronizing compute node time bases in a parallel computer
Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip
2014-12-30
Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.
NASA Astrophysics Data System (ADS)
Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.
2016-05-01
In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Heber, Gerd; Biswas, Rupak
2000-01-01
The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
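As a reminder of where the sparse matrix-vector multiply sits inside a CG iteration, here is a minimal serial sketch in pure Python. The CSR (compressed sparse row) storage format and all function names are assumptions made for illustration; the paper's ordering, partitioning, and parallel programming strategies are not reproduced.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (CSR storage)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = spmv_csr(values, col_idx, row_ptr, p)   # the one SPMV per iteration
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# 2x2 example: A = [[4, 1], [1, 3]], b = [1, 2]
solution = conjugate_gradient([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [1.0, 2.0])
print(solution)
```

The indirect access `x[col_idx[k]]` is the reason row and column ordering affects cache reuse: a reordering that clusters the column indices of each row keeps the touched entries of `x` close together in memory.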
Pre-Assembly of Near-Infrared Fluorescent Multivalent Molecular Probes for Biological Imaging.
Peck, Evan M; Battles, Paul M; Rice, Douglas R; Roland, Felicia M; Norquest, Kathryn A; Smith, Bradley D
2016-05-18
A programmable pre-assembly method is described and shown to produce near-infrared fluorescent molecular probes with tunable multivalent binding properties. The modular assembly process threads one or two copies of a tetralactam macrocycle onto a fluorescent PEGylated squaraine scaffold containing a complementary number of docking stations. Appended to the macrocycle periphery are multiple copies of a ligand that is known to target a biomarker. The structure and high purity of each threaded complex was determined by independent spectrometric methods and also by gel electrophoresis. Especially helpful were diagnostic red-shift and energy transfer features in the absorption and fluorescence spectra. The threaded complexes were found to be effective multivalent molecular probes for fluorescence microscopy and in vivo fluorescence imaging of living subjects. Two multivalent probes were prepared and tested for targeting of bone in mice. A pre-assembled probe with 12 bone-targeting iminodiacetate ligands produced more bone accumulation than an analogous pre-assembled probe with six iminodiacetate ligands. Notably, there was no loss in probe fluorescence at the bone target site after 24 h in the living animal, indicating that the pre-assembled fluorescent probe maintained very high mechanical and chemical stability on the skeletal surface. The study shows how this versatile pre-assembly method can be used in a parallel combinatorial manner to produce libraries of near-infrared fluorescent multivalent molecular probes for different types of imaging and diagnostic applications, with incremental structural changes in the number of targeting groups, linker lengths, linker flexibility, and degree of PEGylation.
Hierarchical resilience with lightweight threads.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wheeler, Kyle Bruce
2011-10-01
This paper proposes a methodology for providing robustness and resilience for a highly threaded distributed- and shared-memory environment based on well-defined inputs and outputs to lightweight tasks. These inputs and outputs form a failure 'barrier', allowing tasks to be restarted or duplicated as necessary. These barriers must be expanded based on task behavior, such as communication between tasks, but do not prohibit any given behavior. One of the trends in high-performance computing codes seems to be a move toward self-contained functions that mimic functional programming. Software designers are trending toward a model of software design where their core functions are specified in side-effect free or low-side-effect ways, wherein the inputs and outputs of the functions are well-defined. This provides the ability to copy the inputs to wherever they need to be - whether that's the other side of the PCI bus or the other side of the network - do work on that input using local memory, and then copy the outputs back (as needed). This design pattern is popular among new distributed threading environment designs. Such designs include the Barcelona STARS system, distributed OpenMP systems, the Habanero-C and Habanero-Java systems from Vivek Sarkar at Rice University, the HPX/ParalleX model from LSU, as well as our own Scalable Parallel Runtime effort (SPR) and the Trilinos stateless kernels. This design pattern is also shared by CUDA and several OpenMP extensions for GPU-type accelerators (e.g. the PGI OpenMP extensions).
Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver
NASA Astrophysics Data System (ADS)
Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre
2014-06-01
This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 10^6 spatial cells and 1 × 10^12 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.
Lee, Anthony; Yau, Christopher; Giles, Michael B.; Doucet, Arnaud; Holmes, Christopher C.
2011-01-01
We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design. PMID:22003276
An Intrinsic Algorithm for Parallel Poisson Disk Sampling on Arbitrary Surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-03-08
Poisson disk sampling plays an important role in a variety of visual computing applications, due to its useful statistical properties and the absence of aliasing artifacts. While many effective techniques have been proposed to generate Poisson disk distributions in Euclidean space, relatively little work has been reported on the surface counterpart. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. We propose a new technique for parallelizing dart throwing. Unlike conventional approaches, which explicitly partition the spatial domain to generate the samples in parallel, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. It is worth noting that our algorithm is accurate, as the generated Poisson disks are uniformly and randomly distributed without bias. Our method is intrinsic in that all the computations are based on the intrinsic metric and are independent of the embedding space. This intrinsic feature allows us to generate Poisson disk distributions on arbitrary surfaces. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
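The priority idea can be illustrated with a small sketch. This is not the authors' intrinsic algorithm: it works in the Euclidean plane rather than with a surface metric, it simulates the parallel rounds sequentially, and the function name and round structure are assumptions made for illustration. In each round every remaining candidate independently checks whether it holds the highest priority among the remaining candidates it conflicts with; the winners are accepted and everything within the disk radius of a winner is discarded.

```python
import math
import random


def priority_dart_throwing(candidates, radius, seed=0):
    """Return indices of accepted candidates (pairwise distance >= radius).

    Euclidean stand-in for the surface metric; rounds are simulated
    sequentially, but each candidate's check in a round only reads
    shared data, so the checks could run on separate threads.
    """
    rng = random.Random(seed)
    priority = list(range(len(candidates)))
    rng.shuffle(priority)                      # random, unique priorities

    def conflict(i, j):
        return math.dist(candidates[i], candidates[j]) < radius

    active = set(range(len(candidates)))
    accepted = []
    while active:
        # Round, phase 1: a candidate wins if it outranks every active neighbour.
        winners = [
            i for i in active
            if all(priority[i] > priority[j]
                   for j in active if j != i and conflict(i, j))
        ]
        accepted.extend(winners)
        # Round, phase 2: drop the winners and every candidate they cover.
        winner_set = set(winners)
        active = {
            i for i in active
            if i not in winner_set and not any(conflict(i, w) for w in winners)
        }
    return sorted(accepted)


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
samples = priority_dart_throwing(points, 0.1)
print(len(samples), "samples accepted")
```

Because priorities are unique, two conflicting candidates can never both win the same round, and the highest-priority remaining candidate always wins, so the loop terminates.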
Threading DNA through nanopores for biosensing applications
NASA Astrophysics Data System (ADS)
Fyta, Maria
2015-07-01
This review outlines the recent achievements in the field of nanopore research. Nanopores are typically used in single-molecule experiments and are believed to have a high potential to realize an ultra-fast and very cheap genome sequencer. Here, the various types of nanopore materials, ranging from biological to 2D nanopores, are discussed together with their advantages and disadvantages. These nanopores can utilize different protocols to read out the DNA nucleobases. Although the first nanopore devices have reached the market, many still have issues which do not allow a full realization of a nanopore sequencer able to sequence the human genome in about a day. Ways to control the DNA, its dynamics and speed as the biomolecule translocates the nanopore in order to increase the signal-to-noise ratio in the reading-out process are examined in this review. Finally, the advantages, as well as the drawbacks in distinguishing the DNA nucleotides, i.e., the genetic information, are presented in view of their importance in the field of nanopore sequencing.
Thread selection according to power characteristics during context switching on compute nodes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Archer, Charles J.; Blocksome, Michael A.; Randles, Amanda E.
Methods, apparatus, and products are disclosed for thread selection during context switching on a plurality of compute nodes that includes: executing, by a compute node, an application using a plurality of threads of execution, including executing one or more of the threads of execution; selecting, by the compute node from a plurality of available threads of execution for the application, a next thread of execution in dependence upon power characteristics for each of the available threads; determining, by the compute node, whether criteria for a thread context switch are satisfied; and performing, by the compute node, the thread context switch if the criteria for a thread context switch are satisfied, including executing the next thread of execution.
Evaluation of massively parallel sequencing for forensic DNA methylation profiling.
Richards, Rebecca; Patel, Jayshree; Stevenson, Kate; Harbison, SallyAnn
2018-05-11
Epigenetics is an emerging area of interest in forensic science. DNA methylation, a type of epigenetic modification, can be applied to chronological age estimation, identical twin differentiation and body fluid identification. However, there is not yet an agreed, established methodology for targeted detection and analysis of DNA methylation markers in forensic research. Recently a massively parallel sequencing-based approach has been suggested. The use of massively parallel sequencing is well established in clinical epigenetics and is emerging as a new technology in the forensic field. This review investigates the potential benefits, limitations and considerations of this technique for the analysis of DNA methylation in a forensic context. The importance of a robust protocol, regardless of the methodology used, that minimises potential sources of bias is highlighted. This article is protected by copyright. All rights reserved.
Modified locking thread form for fastener
NASA Technical Reports Server (NTRS)
Roopnarine, (Inventor); Vranish, John D. (Inventor)
1998-01-01
A threaded fastener has a standard part with a standard thread form characterized by thread walls with a standard included angle, and a modified part complementary to the standard part having a modified thread form characterized by thread walls which are symmetrically inclined with a modified included angle that is different from the standard included angle of the standard part's thread walls, such that the threads of one part make pre-loaded edge contact with the thread walls of the other part. The thread form of the modified part can have an included angle that is greater, less, or compound as compared to the included angle of the standard part. The standard part may be a bolt and the modified part a nut, or vice versa. The modified thread form holds securely even under large vibrational forces, it permits bi-directional use of standard mating threads, is impervious to the build up of tolerances and can be manufactured with a wider range of tolerances without loss of functionality, and distributes loading stresses (per thread) in a manner that decreases the possibility of single thread failure.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Druinsky, Alex; Ghysels, Pieter; Li, Xiaoye S.
In this paper, we study the performance of a two-level algebraic-multigrid algorithm, with a focus on the impact of the coarse-grid solver on performance. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly for high thread count. We also developed a bounds-and-bottlenecks performance model of the solver which we used to guide us through the optimization effort, and also carried out performance tuning in the solver’s large parameter space. As a result, significant speedups were obtained on both machines.
Challenges in scaling NLO generators to leadership computers
NASA Astrophysics Data System (ADS)
Benjamin, D.; Childers, JT; Hoeche, S.; LeCompte, T.; Uram, T.
2017-10-01
Exascale computing resources are roughly a decade away and will be capable of 100 times more computing than current supercomputers. In the last year, Energy Frontier experiments crossed a milestone of 100 million core-hours used at the Argonne Leadership Computing Facility, Oak Ridge Leadership Computing Facility, and NERSC. The Fortran-based leading-order parton generator called Alpgen was successfully scaled to millions of threads to achieve this level of usage on Mira. Sherpa and MadGraph are next-to-leading order generators used heavily by LHC experiments for simulation. Integration times for high-multiplicity or rare processes can take a week or more on standard Grid machines, even using all 16 cores. We will describe our ongoing work to scale the Sherpa generator to thousands of threads on leadership-class machines and reduce run-times to less than a day. This work allows the experiments to leverage large-scale parallel supercomputers for event generation today, freeing tens of millions of grid hours for other work, and paving the way for future applications (simulation, reconstruction) on these and future supercomputers.
Designing Next Generation Massively Multithreaded Architectures for Irregular Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tumeo, Antonino; Secchi, Simone; Villa, Oreste
Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node count, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and next generation network interconnect. We advocate the use of hardware support for remote memory reference aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next generation manycore processors.
NASA Astrophysics Data System (ADS)
Hauth, T.; Innocente, V.; Piparo, D.
2012-12-01
The processing of data acquired by the CMS detector at LHC is carried out with an object-oriented C++ software framework: CMSSW. With the increasing luminosity delivered by the LHC, the treatment of recorded data requires extraordinarily large computing resources, also in terms of CPU usage. A possible solution to cope with this task is the exploitation of the features offered by the latest microprocessor architectures. Modern CPUs present several vector units, the capacity of which is growing steadily with the introduction of new processor generations. Moreover, an increasing number of cores per die is offered by the main vendors, even on consumer hardware. Most recent C++ compilers provide facilities to take advantage of such innovations, either by explicit statements in the program sources or by automatically adapting the generated machine instructions to the available hardware, without the need to modify the existing code base. Programming techniques to implement reconstruction algorithms and optimised data structures are presented that aim at scalable vectorization and parallelization of the calculations. One of their features is the usage of new language features of the C++11 standard. Portions of the CMSSW framework are illustrated which have been found to be especially profitable for the application of vectorization and multi-threading techniques. Specific utility components have been developed to help vectorization and parallelization. They can easily become part of a larger common library. To conclude, careful measurements are described, which show the execution speedups achieved via vectorised and multi-threaded code in the context of CMSSW.
Lannan, Ford M; Mamajanov, Irena; Hud, Nicholas V
2012-09-19
Structures formed by human telomere sequence (HTS) DNA are of interest due to the implication of telomeres in the aging process and cancer. We present studies of HTS DNA folding in an anhydrous, high viscosity deep eutectic solvent (DES) comprised of choline chloride and urea. In this solvent, the HTS DNA forms a G-quadruplex with the parallel-stranded ("propeller") fold, consistent with observations that reduced water activity favors the parallel fold, whereas alternative folds are favored at high water activity. Surprisingly, adoption of the parallel structure by HTS DNA in the DES, after thermal denaturation and quick cooling to room temperature, requires several months, as opposed to less than 2 min in an aqueous solution. This extended folding time in the DES is, in part, due to HTS DNA becoming kinetically trapped in a folded state that is apparently not accessed in lower viscosity solvents. A comparison of times required for the G-quadruplex to convert from its aqueous-preferred folded state to its parallel fold also reveals a dependence on solvent viscosity that is consistent with Kramers rate theory, which predicts that diffusion-controlled transitions will slow proportionally with solvent friction. These results provide an enhanced view of a G-quadruplex folding funnel and highlight the necessity to consider solvent viscosity in studies of G-quadruplex formation in vitro and in vivo. Additionally, the solvents and analyses presented here should prove valuable for understanding the folding of many other nucleic acids and potentially have applications in DNA-based nanotechnology where time-dependent structures are desired.
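For reference, the viscosity dependence invoked here follows from the standard high-friction (overdamped) form of the Kramers rate. The expression below is the textbook result, not a formula taken from this abstract; the symbols are introduced only for illustration: ω₀ and ω_b are the angular frequencies characterizing the well and the barrier top, γ is the friction coefficient per unit mass, ΔG‡ is the barrier height, and η is the solvent viscosity. The step γ ∝ η assumes Stokes-like friction.

```latex
k \;\approx\; \frac{\omega_0\,\omega_b}{2\pi\,\gamma}\,
\exp\!\left(-\frac{\Delta G^{\ddagger}}{k_B T}\right),
\qquad
\gamma \propto \eta
\;\;\Longrightarrow\;\;
\tau = \frac{1}{k} \;\propto\; \eta
```

Under these assumptions, and with the barrier height unchanged, the transition time scales linearly with solvent viscosity.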
GeNemo: a search engine for web-based functional genomic data.
Zhang, Yongqing; Cao, Xiaoyi; Zhong, Sheng
2016-07-08
A set of new data types emerged from functional genomic assays, including ChIP-seq, DNase-seq, FAIRE-seq and others. The results are typically stored as genome-wide intensities (WIG/bigWig files) or functional genomic regions (peak/BED files). These data types present new challenges to big data science. Here, we present GeNemo, a web-based search engine for functional genomic data. GeNemo searches user-input data against online functional genomic datasets, including the entire collection of ENCODE and mouse ENCODE datasets. Unlike text-based search engines, GeNemo's searches are based on pattern matching of functional genomic regions. This distinguishes GeNemo from text or DNA sequence searches. The user can input any complete or partial functional genomic dataset, for example, a binding intensity file (bigWig) or a peak file. GeNemo reports any genomic regions, ranging from hundred bases to hundred thousand bases, from any of the online ENCODE datasets that share similar functional (binding, modification, accessibility) patterns. This is enabled by a Markov Chain Monte Carlo-based maximization process, executed on up to 24 parallel computing threads. By clicking on a search result, the user can visually compare her/his data with the found datasets and navigate the identified genomic regions. GeNemo is available at www.genemo.org. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
2012-09-11
In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
Cutting thread at flexible endoscopy.
Gong, F; Swain, P; Kadirkamanathan, S; Hepworth, C; Laufer, J; Shelton, J; Mills, T
1996-12-01
New thread-cutting techniques were developed for use at flexible endoscopy. A guillotine was designed to follow and cut thread at the endoscope tip. A new method was developed for guiding suture cutters. Efficacy of Nd:YAG laser cutting of threads was studied. Experimental and clinical experience with thread-cutting methods is presented. A 2.4 mm diameter flexible thread-cutting guillotine was constructed featuring two lateral holes with sharp edges through which sutures to be cut are passed. Standard suture cutters were guided by backloading thread through the cutters extracorporeally. A snare cutter was constructed to retrieve objects sewn to tissue. Efficacy and speed of Nd:YAG laser in cutting twelve different threads were studied. The guillotine cut thread faster (p < 0.05) than standard suture cutters. Backloading thread shortened time taken to cut thread (p < 0.001) compared with free-hand cutting. Nd:YAG laser was ineffective in cutting uncolored threads and slower than mechanical cutters. Results of thread cutting in clinical studies using a sewing machine (n = 77 cutting episodes in 21 patients), in-vivo experiments (n = 156), and postsurgical cases (n = 15 over 15 years) are presented. New thread-cutting methods are described and their efficacy demonstrated in experimental and clinical studies.
Tool Removes Coil-Spring Thread Inserts
NASA Technical Reports Server (NTRS)
Collins, Gerald J., Jr.; Swenson, Gary J.; Mcclellan, J. Scott
1991-01-01
Tool removes coil-spring thread inserts from threaded holes. Threads into hole, pries insert loose, grips insert, then pulls insert to thread it out of hole. Effects essentially reverse of insertion process to ease removal and avoid further damage to threaded inner surface of hole.
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF
NASA Astrophysics Data System (ADS)
Mielikainen, J.; Huang, B.; Huang, A.
2011-12-01
The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size distributions, fall speeds and densities. The Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language that allows GPUs to be programmed directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of the CPU. CUDA takes a bottom-up point of view of parallelism in which a thread is the atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for a 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground.
Each grid point has multiple levels, which correspond to various vertical heights in the atmosphere. The size of the CONUS 12 km domain is 433 x 308 horizontal grid points with 35 vertical levels. First, the entire SBU-YLIN Fortran code was rewritten in C in preparation for a GPU-accelerated version. After that, the C code was verified against the Fortran code for identical outputs. Default compiler options from WRF were used for the gfortran and gcc compilers. The processing time is 12274 ms for the original Fortran code and 12893 ms for the C version. The processing times for the GPU implementation of the SBU-YLIN microphysics scheme with I/O are 57.7 ms and 37.2 ms for 1 and 2 GPUs, respectively. The corresponding speedups are 213x and 330x compared to the Fortran implementation. Without I/O the speedup is 896x on 1 GPU. When I/O time is ignored, the speedup scales linearly with the number of GPUs. Thus, 2 GPUs have a speedup of 1788x without I/O. Microphysics computation is just a small part of the whole WRF model. Once WRF has been completely implemented on the GPU, the inputs for SBU-YLIN will not have to be transferred from the CPU. Instead they will be the results of previous WRF modules. Therefore, the role of I/O is greatly diminished once all of WRF has been converted to run on GPUs. In the near future, we expect to have WRF running completely on GPUs for superior performance.
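The speedup figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the timings and speedups stated in the abstract. The variable names are illustrative, and the split into kernel time and I/O share assumes that the with-I/O time is simply kernel time plus transfer time, which the abstract does not state explicitly.

```python
# Timings quoted in the abstract, in milliseconds.
fortran_ms = 12274.0
gpu_with_io_ms = {1: 57.7, 2: 37.2}      # keyed by number of GPUs
speedup_no_io = {1: 896.0, 2: 1788.0}    # quoted speedups without I/O


def speedup(baseline_ms, accelerated_ms):
    """Speedup is the ratio of baseline time to accelerated time."""
    return baseline_ms / accelerated_ms


# Speedups including I/O, relative to the Fortran baseline.
speedups_with_io = {n: speedup(fortran_ms, t) for n, t in gpu_with_io_ms.items()}

# Kernel-only time implied by the quoted no-I/O speedups.
kernel_ms = {n: fortran_ms / s for n, s in speedup_no_io.items()}

# Assumption: with-I/O time = kernel time + transfer time.
io_share = {n: 1.0 - kernel_ms[n] / gpu_with_io_ms[n] for n in gpu_with_io_ms}

# Ratio of the quoted 2-GPU speedup to perfectly linear scaling (2 x 896).
scaling_efficiency = speedup_no_io[2] / (2 * speedup_no_io[1])

for n in (1, 2):
    print(f"{n} GPU(s): {speedups_with_io[n]:.1f}x with I/O, "
          f"kernel ~{kernel_ms[n]:.1f} ms, I/O share ~{io_share[n]:.0%}")
print(f"2-GPU scaling efficiency without I/O: {scaling_efficiency:.4f}")
```

Under that assumption, data transfer accounts for roughly three quarters of the 1-GPU run time, which is consistent with the abstract's point that I/O matters much less once the inputs already reside on the GPU.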
Probe DNA-Cisplatin Interaction with Solid-State Nanopores
NASA Astrophysics Data System (ADS)
Zhou, Zhi; Hu, Ying; Li, Wei; Xu, Zhi; Wang, Pengye; Bai, Xuedong; Shan, Xinyan; Lu, Xinghua; Nanopore Collaboration
2014-03-01
Understanding the mechanism of DNA-cisplatin interaction is essential for clinical application and novel drug design. As an emerging single-molecule technology, the solid-state nanopore has been employed in biomolecule detection and in probing DNA-molecule interactions. Herein, we report real-time monitoring of DNA-cisplatin interaction using solid-state SiN nanopores. The DNA-cisplatin interaction process is clearly classified into three stages by measuring the capture rate of DNA-cisplatin adducts. In the first stage, the negatively charged DNA molecules are partially discharged due to the bonding of positively charged cisplatin and the formation of mono-adducts. In the second stage, the formation of DNA-cisplatin di-adducts with the adjacent bases results in DNA bending and softening. The capture rate increases since the softened di-adducts experience a lower barrier to thread into the nanopores. In the third stage, complex structures, such as micro-loops, are formed and the DNA-cisplatin adducts aggregate. The capture rate decreases to zero as the aggregated adduct grows to the size of the pore. The characteristic time of this stage was found to be linear in the diameter of the nanopore, and this dynamic process can be described with a second-order reaction model. We are grateful to Laboratory of Microfabrication, Dr. Y. Yao, and Prof. R.C. Yu (Institute of Physics, Chinese Academy of Sciences) for technical assistance.
Poudel, Lokendra; Steinmetz, Nicole F; French, Roger H; Parsegian, V Adrian; Podgornik, Rudolf; Ching, Wai-Yim
2016-08-03
We present a first-principles density functional study elucidating the effects of solvent, metal ions and topology on the electronic structure and hydrogen bonding of 12 well-designed three dimensional G-quadruplex (G4-DNA) models in different environments. Our study shows that the parallel strand structures are more stable in dry environments and aqueous solutions containing K(+) ions within the tetrad of guanine but conversely, that the anti-parallel structure is more stable in solutions containing the Na(+) ions within the tetrad of guanine. The presence of metal ions within the tetrad of the guanine channel always enhances the stability of the G4-DNA models. The parallel strand structures have larger HOMO-LUMO gaps than antiparallel structures, which are in the range of 0.98 eV to 3.11 eV. Partial charge calculations show that sugar and alkali ions are positively charged whereas nucleobases, PO4 groups and water molecules are all negatively charged. Partial charges on each functional group with different signs and magnitudes contribute differently to the electrostatic interactions involving G4-DNA and favor the parallel structure. A comparative study between specific pairs of different G4-DNA models shows that the Hoogsteen OH and NH hydrogen bonds in the guanine tetrad are significantly influenced by the presence of metal ions and water molecules, collectively affecting the structure and the stability of G4-DNA.
Climate Modeling with a Million CPUs
NASA Astrophysics Data System (ADS)
Tobis, M.; Jackson, C. S.
2010-12-01
Michael Tobis, Ph.D., Research Scientist Associate, University of Texas Institute for Geophysics; Charles S. Jackson, Research Scientist, University of Texas Institute for Geophysics. Meteorological, oceanographic, and climatological applications have been at the forefront of scientific computing since its inception. The trend toward ever larger and more capable computing installations is unabated. However, much of the increase in capacity is accompanied by an increase in parallelism and a concomitant increase in complexity. An increase of at least four additional orders of magnitude in the computational power of scientific platforms is anticipated. It is unclear how individual climate simulations can continue to make effective use of the largest platforms. Conversion of existing community codes to higher resolution, or to more complex phenomenology, or both, presents daunting design and validation challenges. Our alternative approach is to use the expected resources to run very large ensembles of simulations of modest size, rather than to await the emergence of very large simulations. We are already doing this in exploring the parameter space of existing models using the Multiple Very Fast Simulated Annealing algorithm, which was developed for seismic imaging. Our experiments have the dual intentions of tuning the model and identifying ranges of parameter uncertainty. Our approach is less strongly constrained by the dimensionality of the parameter space than are competing methods. Nevertheless, scaling up remains costly. Much could be achieved by increasing the dimensionality of the search and adding complexity to the search algorithms. Such ensemble approaches scale naturally to very large platforms. Extensions of the approach are anticipated. For example, structurally different models can be tuned to comparable effectiveness. This can provide an objective test for which there is no realistic precedent with smaller computations.
We find ourselves inventing new code to manage our ensembles. Component computations involve tens to hundreds of CPUs and tens to hundreds of hours. The results of these moderately large parallel jobs influence the scheduling of subsequent jobs, and complex algorithms may be easily contemplated for this. The operating system concept of a "thread" re-emerges at a very coarse level, where each thread manages atomic computations of thousands of CPU-hours. That is, rather than multiple threads operating on a processor, at this level, multiple processors operate within a single thread. In collaboration with the Texas Advanced Computing Center, we are developing a software library at the system level, which should facilitate the development of computations involving complex strategies which invoke large numbers of moderately large multi-processor jobs. While this may have applications in other sciences, our key intent is to better characterize the coupled behavior of a very large set of climate model configurations.
Shi, Zhenyu; Wedd, Anthony G.; Gras, Sally L.
2013-01-01
The development of synthetic biology requires rapid batch construction of large gene networks from combinations of smaller units. Despite the availability of computational predictions for well-characterized enzymes, the optimization of most synthetic biology projects requires combinational constructions and tests. A new building-brick-style parallel DNA assembly framework for simple and flexible batch construction is presented here. It is based on robust recombination steps and allows a variety of DNA assembly techniques to be organized for complex constructions (with or without scars). The assembly of five DNA fragments into a host genome was performed as an experimental demonstration. PMID:23468883
Thread gauge for measuring thread pitch diameters
Brewster, A.L.
1985-11-19
A thread gauge which attaches to a vernier caliper to measure the thread pitch diameter of both externally threaded and internally threaded parts is disclosed. A pair of anvils are externally threaded with threads having the same pitch as those of the threaded part. Each anvil is mounted on a stem having a ball on which the anvil can rotate to properly mate with the parts to which the anvils are applied. The stems are detachably secured to the caliper blades by attachment collars having keyhole openings for receiving the stems and caliper blades. A set screw is used to secure each collar on its caliper blade. 2 figs.
Thread gauge for measuring thread pitch diameters
Brewster, Albert L.
1985-01-01
A thread gauge which attaches to a vernier caliper to measure the thread pitch diameter of both externally threaded and internally threaded parts. A pair of anvils are externally threaded with threads having the same pitch as those of the threaded part. Each anvil is mounted on a stem having a ball on which the anvil can rotate to properly mate with the parts to which the anvils are applied. The stems are detachably secured to the caliper blades by attachment collars having keyhole openings for receiving the stems and caliper blades. A set screw is used to secure each collar on its caliper blade.
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-10-23
The unbalanced assignment problem (UAP) is the problem of optimally assigning n jobs to m individuals (m < n) such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operations management and applied mathematics, with numerous real-life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and obtain the solutions of the UAP in the proper length range and in O(mn) time. We extend the application of DNA molecular operations and exploit their simultaneity to reduce the complexity of the computation.
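To make the problem statement concrete, the following is a minimal brute-force sketch of the UAP on a conventional computer. It is not the DNA algorithm of the paper. It assumes one common formulation that the abstract does not spell out: every job goes to exactly one individual and every individual receives at least one job. The function name `solve_uap` and the cost matrix are hypothetical.

```python
from itertools import product


def solve_uap(cost):
    """Exhaustively solve a small unbalanced assignment problem.

    cost[i][j] is the cost of individual i doing job j, with m individuals
    and n jobs (m < n). Assumed formulation: each job is assigned to exactly
    one individual and each individual gets at least one job.
    """
    m, n = len(cost), len(cost[0])
    best_total, best_assignment = None, None
    # assignment[j] is the individual chosen for job j: m**n candidates.
    for assignment in product(range(m), repeat=n):
        if len(set(assignment)) < m:
            continue  # some individual would be left without a job
        total = sum(cost[worker][job] for job, worker in enumerate(assignment))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, assignment
    return best_total, best_assignment


# Hypothetical instance: 2 individuals (rows), 3 jobs (columns).
cost = [
    [4, 2, 8],
    [3, 7, 1],
]
best_total, best_assignment = solve_uap(cost)
print(best_total, best_assignment)
```

The enumeration visits m**n candidate assignments, which is the exponential search that the paper proposes to carry out in parallel across DNA strands.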
ng: What next-generation languages can teach us about HENP frameworks in the manycore era
NASA Astrophysics Data System (ADS)
Binet, Sébastien
2011-12-01
Current High Energy and Nuclear Physics (HENP) frameworks were written before multicore systems became widely deployed. A 'single-thread' execution model naturally emerged from that environment; however, this no longer fits the processing model at the dawn of the manycore era. Although previous work focused on minimizing the changes to be applied to the LHC frameworks (because of the data taking phase) while still trying to reap the benefits of the parallel-enhanced CPU architectures, this paper explores what new languages could bring to the design of the next-generation frameworks. Parallel programming is still in an intensive phase of R&D and no silver bullet exists despite the 30+ years of literature on the subject. Yet, several parallel programming styles have emerged: actors, message passing, communicating sequential processes, task-based programming, data flow programming, ... to name a few. We present the prototyping of a next-generation framework in new and expressive languages (Python and Go) to investigate how code clarity and robustness are affected and what the downsides are of using languages younger than FORTRAN/C/C++.
NASA Astrophysics Data System (ADS)
Stuart, J. A.
2011-12-01
This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors, and more specifically GPUs. As a case study, we design and implement the "DCGN" API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully-dynamic communication.
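The idea of sleep-based polling mentioned above can be illustrated with ordinary CPU threads. The sketch below is not the DCGN API. `PolledMailbox`, its methods, and the timing parameters are hypothetical names chosen for illustration. The receiver repeatedly checks a shared buffer and sleeps between checks instead of blocking on a notification.

```python
import threading
import time


class PolledMailbox:
    """A shared message buffer that receivers poll, sleeping between checks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._messages = []

    def post(self, message):
        """Store a message for later retrieval."""
        with self._lock:
            self._messages.append(message)

    def poll(self, interval_s=0.001, timeout_s=5.0):
        """Return the oldest message, or None if none arrives before timeout."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if self._messages:
                    return self._messages.pop(0)
            time.sleep(interval_s)  # yield the core instead of busy-waiting
        return None


def delayed_send(mailbox, message, delay_s):
    """Sender side: wait briefly, then post the message."""
    time.sleep(delay_s)
    mailbox.post(message)


mailbox = PolledMailbox()
sender = threading.Thread(target=delayed_send, args=(mailbox, "hello", 0.01))
sender.start()
received = mailbox.poll()
sender.join()
print(received)
```

The polling interval trades latency against wasted wake-ups, which is one plausible place for the one to five percent overhead reported in the abstract to arise, although the abstract does not attribute it to any specific cause.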
System, methods and apparatus for program optimization for multi-threaded processor architectures
Bastoul, Cedric; Lethin, Richard A; Leung, Allen K; Meister, Benoit J; Szilagyi, Peter; Vasilache, Nicolas T; Wohlford, David E
2015-01-06
Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shipman, Galen M.
These are the slides for a presentation on programming models in HPC, at the Los Alamos National Laboratory's Parallel Computing Summer School. The following topics are covered: Flynn's Taxonomy of computer architectures; single instruction single data; single instruction multiple data; multiple instruction multiple data; address space organization; definition of Trinity (Intel Xeon-Phi is a MIMD architecture); single program multiple data; multiple program multiple data; ExMatEx workflow overview; definition of a programming model, programming languages, runtime systems; programming model and environments; MPI (Message Passing Interface); OpenMP; Kokkos (Performance Portable Thread-Parallel Programming Model); Kokkos abstractions, patterns, policies, and spaces; RAJA, a systematic approach to node-level portability and tuning; overview of the Legion Programming Model; mapping tasks and data to hardware resources; interoperability: supporting task-level models; Legion S3D execution and performance details; workflow, integration of external resources into the programming model.
Cache Locality Optimization for Recursive Programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lifflander, Jonathan; Krishnamoorthy, Sriram
We present an approach to optimize the cache locality for recursive programs by dynamically splicing--recursively interleaving--the execution of distinct function invocations. By utilizing data effect annotations, we identify concurrency and data reuse opportunities across function invocations and interleave them to reduce reuse distance. We present algorithms that efficiently track effects in recursive programs, detect interference and dependencies, and interleave execution of function invocations using user-level (non-kernel) lightweight threads. To enable multi-core execution, a program is parallelized using a nested fork/join programming model. Our cache optimization strategy is designed to work in the context of a random work stealing scheduler. We present an implementation using the MIT Cilk framework that demonstrates significant improvements in sequential and parallel performance, competitive with a state-of-the-art compile-time optimizer for loop programs and a domain-specific optimizer for stencil programs.
UPC++ Programmer’s Guide (v1.0 2017.9)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bachan, J.; Baden, S.; Bonachea, D.
UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
UPC++ Programmer’s Guide, v1.0-2018.3.0
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bachan, J.; Baden, S.; Bonachea, Dan
UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
Intershot Analysis of Flows in DIII-D
NASA Astrophysics Data System (ADS)
Meyer, W. H.; Allen, S. L.; Samuell, C. M.; Howard, J.
2016-10-01
Analysis of the DIII-D flow diagnostic data requires demodulation of interference images and inversion of the resultant line-integrated emissivity and flow (phase) images. Four response matrices are pre-calculated: the emissivity line integral and the line integral of the scalar product of the lines of sight with the orthogonal unit vectors of parallel flow. Equilibrium data determine the relative weight of the component matrices used in the final flow inversion matrix. Serial processing has been used for the 800x600 pixel image from the flow camera viewing the lower divertor. The camera viewing the full cross section will require parallel processing of its 2160x2560 pixel image. We will discuss using a POSIX thread pool and a Tesla K40c GPU in the processing of this data. Prepared by LLNL under Contract DE-AC52-07NA27344. This material is based upon work supported by the U.S. DOE, Office of Science, Fusion Energy Sciences.
LAMMPS strong scaling performance optimization on Blue Gene/Q
DOE Office of Scientific and Technical Information (OSTI.GOV)
Coffman, Paul; Jiang, Wei; Romero, Nichols A.
2014-11-12
LAMMPS "Large-scale Atomic/Molecular Massively Parallel Simulator" is an open-source molecular dynamics package from Sandia National Laboratories. Significant performance improvements in strong-scaling and time-to-solution for this application on IBM's Blue Gene/Q have been achieved through computational optimizations of the OpenMP versions of the short-range Lennard-Jones term of the CHARMM force field and the long-range Coulombic interaction implemented with the PPPM (particle-particle-particle mesh) algorithm, enhanced by runtime parameter settings controlling thread utilization. Additionally, MPI communication performance improvements were made to the PPPM calculation by re-engineering the parallel 3D FFT to use MPICH collectives instead of point-to-point. Performance testing was done using an 8.4-million atom simulation scaling up to 16 racks on the Mira system at Argonne Leadership Computing Facility (ALCF). Speedups resulting from this effort were in some cases over 2x.
Parallel optimization algorithm for drone inspection in the building industry
NASA Astrophysics Data System (ADS)
Walczyński, Maciej; Bożejko, Wojciech; Skorupka, Dariusz
2017-07-01
In this paper we present an approach to the Vehicle Routing Problem with Drones (VRPD) for the case of building inspection from the air. In an autonomous inspection process, the optimal route for the inspection drone must be determined. This is especially important because of the very limited flight time of modern multicopters. The method of determining solutions for the Traveling Salesman Problem (TSP) described in this paper is based on a Parallel Evolutionary Algorithm (ParEA) with cooperative and independent approaches to communication between threads. This method, first described by Bożejko and Wodecki [1], is based on the observation that if some number of elements occupy the same positions in a number of permutations that are local minima, then those elements will be in the same positions in the optimal solution of the TSP. Numerical experiments were run on the BEM computational cluster using the MPI library.
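The observation behind the method can be sketched as a small helper that finds the positions on which a set of locally optimal permutations agree. This is only an illustration of that one step, not the ParEA algorithm itself. The function name `fixed_positions`, the `min_share` threshold, and the sample permutations are hypothetical.

```python
from collections import Counter


def fixed_positions(local_minima, min_share=1.0):
    """Return {position: element} where enough local minima agree.

    local_minima is a list of equal-length permutations. A position is
    treated as fixed when its most common element appears in at least
    min_share of the permutations (1.0 means all of them must agree).
    """
    needed = min_share * len(local_minima)
    fixed = {}
    for pos in range(len(local_minima[0])):
        counts = Counter(perm[pos] for perm in local_minima)
        elem, count = counts.most_common(1)[0]
        if count >= needed:
            fixed[pos] = elem
    return fixed


# Hypothetical local minima found by three independent search threads.
local_minima = [
    [0, 3, 1, 2, 4],
    [0, 2, 1, 3, 4],
    [0, 3, 1, 4, 2],
]
unanimous = fixed_positions(local_minima)
majority = fixed_positions(local_minima, min_share=0.6)
print(unanimous, majority)
```

In an evolutionary search, positions reported as fixed could be frozen so that later generations explore only the remaining free positions, which shrinks the search space each thread has to cover.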
Gregarious Data Re-structuring in a Many Core Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shrestha, Sunil; Manzano Franco, Joseph B.; Marquez, Andres
In this paper, we have developed a new methodology that takes into consideration the access patterns from a single parallel actor (e.g. a thread), as well as the access patterns of “grouped” parallel actors that share a resource (e.g. a distributed Level 3 cache). We start with a hierarchical tile code for our target machine and apply a series of transformations at the tile level to improve data residence in a given memory hierarchy level. The contributions of this paper include (a) collaborative data restructuring for group reuse and (b) a low-overhead transformation technique to improve access patterns and bring closely connected data elements together. Preliminary results on a many-core architecture, the Tilera TileGX, show promising improvements over optimized OpenMP code (up to 31% increase in GFLOPS) and over our own previous work on fine-grained runtimes (up to 16%) for selected kernels.
Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing
NASA Technical Reports Server (NTRS)
Sterling, T. L.; Zima, H. P.
2002-01-01
Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commodity clusters. In this paper we describe the design of Gilgamesh, a PIM-based massively parallel architecture, and elements of its execution model. Gilgamesh extends existing PIM capabilities by incorporating advanced mechanisms for virtualizing tasks and data and providing adaptive resource management for load balancing and latency tolerance. The Gilgamesh execution model is based on macroservers, a middleware layer which supports object-based runtime management of data and threads allowing explicit and dynamic control of locality and load balancing. The paper concludes with a discussion of related research activities and an outlook to future work.
Federal Register 2010, 2011, 2012, 2013, 2014
2013-12-19
... DEPARTMENT OF COMMERCE International Trade Administration [C-533-856] Steel Threaded Rod From... exporters of steel threaded rod from India. The period of investigation (``POI'') is January 1, 2012... this investigation is steel threaded rod. Steel threaded rod is certain threaded rod, bar, or studs, of...
A simple procedure for parallel sequence analysis of both strands of 5'-labeled DNA.
Razvi, F; Gargiulo, G; Worcel, A
1983-08-01
Ligation of a 5'-labeled DNA restriction fragment results in a circular DNA molecule carrying the two 32Ps at the reformed restriction site. Double digestions of the circular DNA with the original enzyme and a second restriction enzyme cleavage near the labeled site allows direct chemical sequencing of one 5'-labeled DNA strand. Similar double digestions, using an isoschizomer that cleaves differently at the 32P-labeled site, allows direct sequencing of the now 3'-labeled complementary DNA strand. It is possible to directly sequence both strands of cloned DNA inserts by using the above protocol and a multiple cloning site vector that provides the necessary restriction sites. The simultaneous and parallel visualization of both DNA strands eliminates sequence ambiguities. In addition, the labeled circular molecules are particularly useful for single-hit DNA cleavage studies and DNA footprint analysis. As an example, we show here an analysis of the micrococcal nuclease-induced breaks on the two strands of the somatic 5S RNA gene of Xenopus borealis, which suggests that the enzyme may recognize and cleave small AT-containing palindromes along the DNA helix.
Morphological relationships in the chromospheric H-alpha fine structure
NASA Technical Reports Server (NTRS)
Foukal, P.
1971-01-01
A continuous relationship is proposed between the basic elements of the dark fine structure of the quiet and active chromosphere. A progression from chromospheric bushes to fibrils, then to chromospheric threads and active region filaments, and finally to diffuse quiescent filaments, is described. It is shown that the horizontal component of the field on opposite sides of an active region quiescent filament can be in the same direction and closely parallel to the filament axis. Consequently, it is unnecessary to postulate twisted or otherwise complex field configurations to reconcile the support mechanism of filaments with the observed motion along their axis.
NASA Astrophysics Data System (ADS)
Lai, Siyan; Xu, Ying; Shao, Bo; Guo, Menghan; Lin, Xiaola
2017-04-01
In this paper we study the Monte Carlo method for solving systems of linear algebraic equations (SLAE) based on shared memory. Former research demonstrated that GPUs can effectively speed up these computations. Our purpose is to optimize the Monte Carlo simulation specifically for the GPU memory architecture. Random numbers are organized to be stored in shared memory, which aims to accelerate the parallel algorithm. Bank conflicts can be avoided by our Collaborative Thread Arrays (CTA) scheme. The experimental results show that the shared-memory-based strategy can speed up the computations by more than 3X in the best case.
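The abstract does not spell out the estimator, so the following is only a minimal sequential sketch of one textbook Monte Carlo scheme for linear systems (weighted random walks that sample the Neumann series of x = Hx + b). The function name, the uniform transition probabilities, the fixed walk length and the 2×2 test system are all assumptions made for illustration; nothing here reproduces the authors' GPU kernel or their shared-memory layout. Each walk is independent, which is what makes the method a natural fit for one-walk-per-thread execution.

```python
import random

def monte_carlo_solve(H, b, walks=5000, length=20, seed=0):
    """Estimate x in x = H x + b by averaging weighted random walks.

    Assumes the Neumann series sum_k H^k b converges (e.g. ||H|| < 1).
    All names and defaults are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    n = len(b)
    p = 1.0 / n                          # uniform transition probability (assumption)
    x = []
    for i in range(n):
        total = 0.0
        for _ in range(walks):           # independent walks: one per GPU thread in principle
            state, weight = i, 1.0
            score = b[state]             # k = 0 term of the Neumann series
            for _ in range(length):
                nxt = rng.randrange(n)
                weight *= H[state][nxt] / p   # importance weight for this step
                state = nxt
                score += weight * b[state]
            total += score
        x.append(total / walks)
    return x

# Small contraction used only as a sanity check.
H = [[0.1, 0.2], [0.3, 0.1]]
b = [1.0, 2.0]
estimate = monte_carlo_solve(H, b)
```

For this test system the exact solution of (I − H)x = b is x = (1.3/0.75, 2.8), so the estimate can be compared against it directly.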
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lorenz, Daniel; Wolf, Felix
2016-02-17
The PRIMA-X (Performance Retargeting of Instrumentation, Measurement, and Analysis Technologies for Exascale Computing) project is the successor of the DOE PRIMA (Performance Refactoring of Instrumentation, Measurement, and Analysis Technologies for Petascale Computing) project, which addressed the challenge of creating a core measurement infrastructure that would serve as a common platform for both integrating leading parallel performance systems (notably TAU and Scalasca) and developing next-generation scalable performance tools. The PRIMA-X project shifts the focus away from refactorization of robust performance tools towards a re-targeting of the parallel performance measurement and analysis architecture for extreme scales. The massive concurrency, asynchronous execution dynamics, hardware heterogeneity, and multi-objective prerequisites (performance, power, resilience) that identify exascale systems introduce fundamental constraints on the ability to carry forward existing performance methodologies. In particular, there must be a deemphasis of per-thread observation techniques to significantly reduce the otherwise unsustainable flood of redundant performance data. Instead, it will be necessary to assimilate multi-level resource observations into macroscopic performance views, from which resilient performance metrics can be attributed to the computational features of the application. This requires a scalable framework for node-level and system-wide monitoring and runtime analyses of dynamic performance information. Also, the interest in optimizing parallelism parameters with respect to performance and energy drives the integration of tool capabilities in the exascale environment further. Initially, PRIMA-X was a collaborative project between the University of Oregon (lead institution) and the German Research School for Simulation Sciences (GRS). Because Prof. Wolf, the PI at GRS, accepted a position as full professor at Technische Universität Darmstadt (TU Darmstadt) starting February 1st, 2015, the project ended at GRS on January 31st, 2015. This report reflects the work accomplished at GRS until then. The work of GRS is expected to be continued at TU Darmstadt. The first main accomplishment of GRS is the design of different thread-level aggregation techniques. We created a prototype capable of aggregating the thread-level information in performance profiles using these techniques. The next step will be the integration of the most promising techniques into the Score-P measurement system and their evaluation. The second main accomplishment is a substantial increase of Score-P’s scalability, achieved by improving the design of the system-tree representation in Score-P’s profile format. We developed a new representation and a distributed algorithm to create the scalable system tree representation. Finally, we developed a lightweight approach to MPI wait-state profiling. Former algorithms either needed piggy-backing, which can cause significant runtime overhead, or tracing, which comes with its own set of scaling challenges. Our approach works with local data only and, thus, is scalable and has very little overhead.
Parallel, Distributed Scripting with Python
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, P J
2002-05-24
Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadmin tools such as password crackers, file purgers, etc. These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.
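The dictionary check described above can be sketched as follows. This is an illustration only: it uses worker threads from the Python standard library as a stand-in for distributed MPI ranks, and SHA-256 as a stand-in for whatever password encryption the report had in mind. The helper names (`digest`, `check_chunk`, `parallel_check`) and the synthetic 25,000-word list are hypothetical. The point is the structure: distribute slices of the dictionary, let each worker search its slice, then combine the answers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(word):
    # SHA-256 is a stand-in; real password files use salted, deliberately slow hashes.
    return hashlib.sha256(word.encode("utf-8")).hexdigest()

def check_chunk(words, target):
    """Return the first word in this chunk whose digest matches, else None."""
    for word in words:
        if digest(word) == target:
            return word
    return None

def parallel_check(dictionary, target, workers=4):
    """Give each worker a strided slice of the dictionary and search the slices concurrently."""
    chunks = [dictionary[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for found in pool.map(check_chunk, chunks, [target] * workers):
            if found is not None:
                return found
    return None

# Synthetic 25,000-word dictionary (hypothetical data for the sketch).
words = ["word%05d" % i for i in range(25000)]
target = digest("word12345")
match = parallel_check(words, target)
```

In a distributed setting each rank would receive its slice over the messaging layer instead of sharing one in-memory list, but the split, search and combine steps would look the same.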
Wesemann, Dorette; Grunwald, Martin
2008-09-01
Online discussion forums are often used by people with eating disorders. This study analyses 2,072 threads containing a total of 14,903 postings from an unmoderated German "prorecovery" forum for persons suffering from bulimia nervosa (www.ab-server.de) during the period from October 2004 to May 2006. The threads were inductively analyzed for underlying structural types, and the various types found were then analyzed for differences in temporal and quantitative parameters. Communication in the online discussion forum occurred in three types of thread: (1) problem-oriented threads (78.8% of threads), (2) communication-oriented threads (15.3% of threads), and (3) metacommunication threads (2.6% of threads). Metacommunication threads contained significantly more postings than problem-oriented and communication-oriented threads, and they were viewed significantly more often. Moreover, there are temporal differences between the structural types. Topics relating to active management of the disorder receive great attention in prorecovery forums. (c) 2008 by Wiley Periodicals, Inc.
Ali, Yasser Helmy
2018-02-01
Thread-lifting rejuvenation procedures have evolved again with the development of absorbable threads. Although they have gained popularity among plastic surgeons and dermatologists, very few articles about absorbable threads have been written in the literature. This study aims to evaluate the two-year outcome of thread lifting using absorbable barbed threads for facial rejuvenation. This was a prospective comparative study, assessed both objectively and subjectively, with follow-up assessment for 24 months. Thread lifting for face rejuvenation has significant long-lasting effects that include skin lifting of 3-10 mm and a high degree of patient satisfaction, with a low incidence of complications, about 4.8%. Augmented results are obtained when thread lifting is combined with other lifting and rejuvenation modalities. Significant facial rejuvenation is achieved by thread lifting, and highly augmented results are observed when it is combined with Botox, fillers, and/or platelet rich plasma (PRP) rejuvenation.
Thread gauge for tapered threads
Brewster, Albert L.
1994-01-11
The thread gauge permits the user to determine the pitch diameter of tapered threads at the intersection of the pitch cone and the end face of the object being measured. A pair of opposed anvils having lines of threads which match the configuration and taper of the threads on the part being measured are brought into meshing engagement with the threads on opposite sides of the part. The anvils are located linearly into their proper positions by stop fingers on the anvils that are brought into abutting engagement with the end face of the part. This places predetermined reference points of the pitch cone of the thread anvils in registration with corresponding points on the end face of the part being measured, resulting in an accurate determination of the pitch diameter at that location. The thread anvils can be arranged for measuring either internal or external threads.
Thread gauge for tapered threads
Brewster, A.L.
1994-01-11
The thread gauge permits the user to determine the pitch diameter of tapered threads at the intersection of the pitch cone and the end face of the object being measured. A pair of opposed anvils having lines of threads which match the configuration and taper of the threads on the part being measured are brought into meshing engagement with the threads on opposite sides of the part. The anvils are located linearly into their proper positions by stop fingers on the anvils that are brought into abutting engagement with the end face of the part. This places predetermined reference points of the pitch cone of the thread anvils in registration with corresponding points on the end face of the part being measured, resulting in an accurate determination of the pitch diameter at that location. The thread anvils can be arranged for measuring either internal or external threads. 13 figures.
CNT coated thread micro-electro-mechanical system for finger proprioception sensing
NASA Astrophysics Data System (ADS)
Shafi, A. A.; Wicaksono, D. H. B.
2017-04-01
In this paper, we aim to fabricate a cotton-thread-based sensor for proprioceptive applications. Cotton threads are utilized as the structural component of flexible sensors. The thread is coated with a multi-walled carbon nanotube (MWCNT) dispersion using a facile conventional dipping-drying method. Electrical characterization found that the resistance per meter of the coated thread decreased as the number of dipping cycles increased. The CNT-coated thread sensor works on the piezoresistive principle, in which the resistance of the coated thread changes when force is applied. The thread sensor is sewn onto a glove at the index finger, between the middle and proximal phalanges, and the resistance change is measured during grasping. The thread-based microelectromechanical system (MEMS) allows the flexible sensor to fit easily on the finger joint and gives a reliable response for proprioceptive sensing.
A force-based, parallel assay for the quantification of protein-DNA interactions.
Limmer, Katja; Pippig, Diana A; Aschenbrenner, Daniela; Gaub, Hermann E
2014-01-01
Analysis of transcription factor binding to DNA sequences is of utmost importance to understand the intricate regulatory mechanisms that underlie gene expression. Several techniques exist that quantify DNA-protein affinity, but they are either very time-consuming or suffer from possible misinterpretation due to complicated algorithms or approximations like many high-throughput techniques. We present a more direct method to quantify DNA-protein interaction in a force-based assay. In contrast to single-molecule force spectroscopy, our technique, the Molecular Force Assay (MFA), parallelizes force measurements so that it can test one or multiple proteins against several DNA sequences in a single experiment. The interaction strength is quantified by comparison to the well-defined rupture stability of different DNA duplexes. As a proof-of-principle, we measured the interaction of the zinc finger construct Zif268/NRE against six different DNA constructs. We could show the specificity of our approach and quantify the strength of the protein-DNA interaction.
Nanomechanical DNA origami pH sensors.
Kuzuya, Akinori; Watanabe, Ryosuke; Yamanaka, Yusei; Tamaki, Takuya; Kaino, Masafumi; Ohya, Yuichi
2014-10-16
Single-molecule pH sensors have been developed by utilizing molecular imaging of the pH-responsive shape transition of nanomechanical DNA origami devices with atomic force microscopy (AFM). Short DNA fragments that can form i-motifs were introduced to nanomechanical DNA origami devices with a pliers-like shape (DNA Origami Pliers), which consist of two levers, each 170 nm long and 20 nm wide, connected at a Holliday-junction fulcrum. DNA Origami Pliers can be observed in three distinct forms (cross, antiparallel, and parallel), and the cross form is the dominant species when no additional interaction is introduced. Introduction of nine pairs of a 12-mer sequence (5'-AACCCCAACCCC-3'), which dimerize into i-motif quadruplexes upon protonation of cytosine, drives the transition of DNA Origami Pliers from the open cross form into the closed parallel form under acidic conditions. This pH-dependent transition was clearly imaged on mica at molecular resolution by AFM, showing the potential application of the system to single-molecule pH sensors.
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-01-01
The unbalanced assignment problem (UAP) is the problem of optimally assigning n jobs to m individuals (m < n) such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operations management and applied mathematics, with numerous real-life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and obtain the solutions of the UAP in the proper length range and in O(mn) time. We extend the application of DNA molecular operations and exploit their simultaneity to reduce the complexity of the computation. PMID:26512650
Design of internal screw thread measuring device based on the Three-Line method principle
NASA Astrophysics Data System (ADS)
Hu, Dachao; Chen, Jianguo
2010-08-01
In accordance with the Three-Line principle, this paper analyzes the correlation of the main parameters of internal screw threads and then designs a device to measure them. Internal thread parameters, such as the pitch diameter, thread angle and screw-pitch of common screw thread, terraced screw thread, and zigzag screw thread, were obtained through calculation and measurement. Practical applications have shown that the device is convenient to use and that the measurements have high accuracy. Meanwhile, the application for the patent of invention has been accepted by the Patent Office (filing number: 200710044081.5).
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images.
Du, Xiaogang; Dang, Jianwu; Wang, Yangping; Wang, Song; Lei, Tao
2016-01-01
The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU).
Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K
2010-01-01
An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.
Parallel protein secondary structure prediction based on neural networks.
Zhong, Wei; Altun, Gulsah; Tian, Xinmin; Harrison, Robert; Tai, Phang C; Pan, Yi
2004-01-01
Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers of protein secondary structure prediction are implemented on the Denoeux belief neural network (DBNN) architecture. Hydrophobicity matrix, orthogonal matrix, BLOSUM62 and PSSM (position specific scoring matrix) are tested separately as the encoding schemes for DBNN. The experimental results contribute to the design of new encoding schemes. The new binary classifier for Helix versus not Helix (~H) for DBNN produces a prediction accuracy of 87% when PSSM is used for the input profile. The performance of the DBNN binary classifier is comparable to other best prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Because training the neural networks is time-consuming, Pthread and OpenMP are employed to parallelize DBNN on the hyperthreading-enabled Intel architecture. Speedup for 16 Pthreads is 4.9 and speedup for 16 OpenMP threads is 4 on the 4-processor shared memory architecture. The speedups of both OpenMP and Pthread are superior to those of other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that hyperthreading technology for Intel architecture is efficient for parallel biological algorithms.
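To put the two reported speedups in context, the short calculation below converts them into parallel efficiency (speedup divided by the number of execution units). It is a worked-arithmetic sketch: the only inputs are the figures quoted in the abstract (4.9 and 4 with 16 threads on 4 processors), and the helper name is hypothetical. It assumes the speedups are measured against a single-threaded run, which the abstract does not state.

```python
def parallel_efficiency(speedup, units):
    """Speedup divided by the number of execution units it was obtained with."""
    return speedup / units

reported = {"Pthreads": 4.9, "OpenMP": 4.0}   # speedups quoted in the abstract

# Efficiency relative to the 16 software threads and to the 4 physical processors.
per_thread = {name: parallel_efficiency(s, 16) for name, s in reported.items()}
per_processor = {name: parallel_efficiency(s, 4) for name, s in reported.items()}
```

Per thread the efficiencies are about 0.31 and 0.25; per physical processor they are about 1.2 and 1.0. A per-processor value above 1 would be consistent with hyperthreading contributing extra throughput, under the baseline assumption stated above.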
An intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-09-01
Poisson disk sampling has excellent spatial and spectral properties, and plays an important role in a variety of visual computing. Although many promising algorithms have been proposed for multidimensional sampling in Euclidean space, very few studies have been reported with regard to the problem of generating Poisson disks on surfaces due to the complicated nature of the surface. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. In sharp contrast to the conventional parallel approaches, our method neither partitions the given surface into small patches nor uses any spatial data structure to maintain the voids in the sampling domain. Instead, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. Our algorithm guarantees that the generated Poisson disks are uniformly and randomly distributed without bias. It is worth noting that our method is intrinsic and independent of the embedding space. This intrinsic feature allows us to generate Poisson disk patterns on arbitrary surfaces in ℝⁿ. To our knowledge, this is the first intrinsic, parallel, and accurate algorithm for surface Poisson disk sampling. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
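The priority idea can be illustrated with a small sketch. This is not the authors' algorithm: it works in the flat unit square with Euclidean distance instead of geodesic distance on a surface, it uses a fixed pool of candidates, and it emulates the parallel step with sequential rounds. All names are hypothetical. In each round, every active candidate checks independently whether it holds the highest priority among the active candidates inside its disk; winners are accepted and the candidates they cover are discarded.

```python
import math
import random

def priority_poisson_disk(candidates, radius, seed=1):
    """Pick a maximal subset of points that are pairwise at least `radius` apart.

    Each while-iteration mimics one parallel step: the per-candidate test only
    reads priorities, so all candidates could be tested simultaneously.
    """
    n = len(candidates)
    priority = list(range(n))
    random.Random(seed).shuffle(priority)      # random, unique priorities

    def close(i, j):
        return math.dist(candidates[i], candidates[j]) < radius

    active = set(range(n))
    accepted = []
    while active:
        # A candidate wins if no active neighbour in its disk has higher priority.
        winners = [i for i in active
                   if all(priority[i] > priority[j]
                          for j in active if j != i and close(i, j))]
        accepted.extend(winners)
        # Winners and every candidate they cover leave the active set.
        active = {i for i in active
                  if i not in winners and not any(close(i, w) for w in winners)}
    return [candidates[i] for i in accepted]

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(400)]
samples = priority_poisson_disk(points, 0.1)
```

Two winners can never conflict, because each would need a higher priority than the other, and the loop always terminates because the highest-priority active candidate wins its round.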
Convergent and parallel evolution in life habit of the scallops (Bivalvia: Pectinidae)
2011-01-01
Background: We employed a phylogenetic framework to identify patterns of life habit evolution in the marine bivalve family Pectinidae. Specifically, we examined the number of independent origins of each life habit and distinguished between convergent and parallel trajectories of life habit evolution using ancestral state estimation. We also investigated whether ancestral character states influence the frequency or type of evolutionary trajectories. Results: We determined that temporary attachment to substrata by byssal threads is the most likely ancestral condition for the Pectinidae, with subsequent transitions to the five remaining habit types. Nearly all transitions between life habit classes were repeated in our phylogeny and the majority of these transitions were the result of parallel evolution from byssate ancestors. Convergent evolution also occurred within the Pectinidae and produced two additional gliding clades and two recessing lineages. Furthermore, our analysis indicates that byssal attaching gave rise to significantly more of the transitions than any other life habit and that the cementing and nestling classes are only represented as evolutionary outcomes in our phylogeny, never as progenitor states. Conclusions: Collectively, our results illustrate that both convergence and parallelism generated repeated life habit states in the scallops. Bias in the types of habit transitions observed may indicate constraints due to physical or ontogenetic limitations of particular phenotypes. PMID:21672233
Thread angle dependency on flame spread shape over kenaf/polyester combined fabric
NASA Astrophysics Data System (ADS)
Azahari Razali, Mohd; Sapit, Azwan; Nizam Mohammed, Akmal; Nor Anuar Mohamad, Md; Nordin, Normayati; Sadikin, Azmahani; Faisal Hushim, Mohd; Jaat, Norrizam; Khalid, Amir
2017-09-01
Understanding flame spread behavior is crucial to fire safety engineering. Natural fiber is known to exhibit different flame spread behavior from synthetic fiber, and this difference may influence the flame spread behavior over combined fabric. Previous research examined the flame spread behavior over kenaf/polyester fabric and found that the flame spread shape depends on the thread angle. However, that research did not explain the phenomenon in detail. This study gives a detailed explanation. Results show that the flame spread shape depends on the position of the synthetic thread. For thread angle θ = 0°, the polyester thread breaks as the flame approaches it, and the kenaf thread tends to move in the breaking direction. This behavior produces a ‘V’-shaped flame. However, for thread angle θ = 90°, the polyester thread melts while the kenaf thread decomposes and burns. At this angle, the distance between kenaf threads remains constant as the flame approaches.
The effect of thread pattern upon implant osseointegration.
Abuhussein, Heba; Pagni, Giorgio; Rebaudi, Alberto; Wang, Hom-Lay
2010-02-01
Implant design features such as macro- and micro-design may influence overall implant success. Limited information is currently available. Therefore, the purpose of this paper is to examine how factors such as thread pitch, thread geometry, helix angle, thread depth and width, and the implant crestal module may affect implant stability. A literature search was conducted using MEDLINE to identify studies related to this topic, ranging from simulated laboratory models to animal and human studies, using the keywords implant thread, implant macrodesign, thread pitch, thread geometry, helix angle, thread depth, thread width and implant crestal module. The results showed how thread geometry affects the distribution of stress forces around the implant. A decreased thread pitch may positively influence implant stability. Excessive helix angles, despite allowing faster insertion, may jeopardize the ability of implants to sustain axial load. Deeper threads seem to have an important effect on stabilization in poorer bone quality situations. The addition of threads or microthreads up to the crestal module of an implant might contribute positively to bone-to-implant contact as well as to the preservation of marginal bone; nonetheless, this remains to be determined. Appraising the current literature on this subject and combining existing data to verify the presence of any association between the selected characteristics may be critical in the achievement of overall implant success.
Method for molding threads in graphite panels
Short, W.W.; Spencer, C.
1994-11-29
A graphite panel with a hole having a damaged thread is repaired by drilling the hole to remove all of the thread and making a new hole of larger diameter. A bolt with a lubricated thread is placed in the new hole and the hole is packed with graphite cement to fill the hole and the thread on the bolt. The graphite cement is cured, and the bolt is unscrewed therefrom to leave a thread in the cement which is at least as strong as that of the original thread. 8 figures.
The measure method of internal screw thread and the measure device design
NASA Astrophysics Data System (ADS)
Hu, Dachao; Chen, Jianguo
2008-12-01
In accordance with the Three-Line principle, this paper analyzes the correlation of the main parameters of internal screw threads and then designs a device to measure them. Based on the measured values and the corresponding formulas, we can obtain the internal thread parameters, such as the pitch diameter, thread angle and screw-pitch of common screw thread, terraced screw thread, zigzag screw thread and others. Practical application has shown that the device is convenient to operate and that the measured data have high accuracy. Meanwhile, the patent application for this device has been accepted by the Patent Office (filing number: 200710044081.5).
Insertion tube methods and apparatus
Casper, William L.; Clark, Don T.; Grover, Blair K.; Mathewson, Rodney O.; Seymour, Craig A.
2007-02-20
A drill string comprises a first drill string member having a male end; and a second drill string member having a female end configured to be joined to the male end of the first drill string member, the male end having a threaded portion including generally square threads, the male end having a non-threaded extension portion coaxial with the threaded portion, and the male end further having a bearing surface, the female end having a female threaded portion having corresponding female threads, the female end having a non-threaded extension portion coaxial with the female threaded portion, and the female end having a bearing surface. Installation methods, including methods of installing instrumented probes are also provided.
Casper, William L [Rigby, ID; Clark, Don T [Idaho Falls, ID; Grover, Blair K [Idaho Falls, ID; Mathewson, Rodney O [Idaho Falls, ID; Seymour, Craig A [Idaho Falls, ID
2008-10-07
A drill string comprises a first drill string member having a male end; and a second drill string member having a female end configured to be joined to the male end of the first drill string member, the male end having a threaded portion including generally square threads, the male end having a non-threaded extension portion coaxial with the threaded portion, and the male end further having a bearing surface, the female end having a female threaded portion having corresponding female threads, the female end having a non-threaded extension portion coaxial with the female threaded portion, and the female end having a bearing surface. Installation methods, including methods of installing instrumented probes are also provided.
Yamaguchi, Yoko; Shiota, Makoto; FuJii, Masaki; Sekiya, Michi; Ozeki, Masahiko
2016-01-01
Primary stability after implant placement is essential for osseointegration. It is important to understand the bone/implant interface for analyzing the influence of implant design on primary stability. In this study rigid polyurethane foam is used as artificial bone to evaluate the bone-implant interface and to identify where the torque is being generated during placement. Five implant systems were used for this experiment: Straumann-Standard (ST), Straumann-Bone Level (BL), Straumann-Tapered Effect (TE), Nobel Biocare-Brånemark MKIII (MK3), and Nobel Biocare-Brånemark MKIV (MK4). Artificial bone blocks were prepared and the implant was installed. After placement, a metal jig and the artificial bone block on one side were removed, exposing the implant embedded in the artificial bone so that the bone-implant interface could be observed. A digital micro-analyzer was used for observing the contact interface. The insertion torque values were 39.35, 23.78, 12.53, 26.35, and 17.79 N cm for MK4, BL, ST, TE, and MK3, respectively. In ST, MK3, TE, MK4, and BL the white layer areas were 61 × 10³ μm², 37 × 10³ μm², 103 × 10³ μm² in the tapered portion and 84 × 10³ μm² in the parallel portion, 134 × 10³ μm², and 98 × 10³ μm² in the tapered portion and 87 × 10³ μm² in the parallel portion, respectively. The direct observation method of the implant/artificial bone interface is a simple and useful method that enables the identification of the area where implant retention occurs. A white layer at the site of stress concentration during implant placement was identified and the magnitude of the stress was quantitatively estimated. The site where the highest torque occurred was the area from the thread crest to the thread root and the under and lateral aspect of the platform. The artificial bone debris created by the self-tapping blade accumulated in both the cutting chamber and in the space between the threads and artificial bone.
Parallel Task Management Library for MARTe
NASA Astrophysics Data System (ADS)
Valcarcel, Daniel F.; Alves, Diogo; Neto, Andre; Reux, Cedric; Carvalho, Bernardo B.; Felton, Robert; Lomas, Peter J.; Sousa, Jorge; Zabeo, Luca
2014-06-01
The Multithreaded Application Real-Time executor (MARTe) is a real-time framework with increasing popularity and support in the thermonuclear fusion community. It allows modular code to run in a multi-threaded environment leveraging on the current multi-core processor (CPU) technology. One application that relies on the MARTe framework is the Joint European Torus (JET) tokamak WAll Load Limiter System (WALLS). It calculates and monitors the temperature on metal tiles and plasma facing components (PFCs) that can melt or flake if their temperature gets too high when exposed to power loads. One of the main time consuming tasks in WALLS is the calculation of thermal diffusion models in real-time. These models tend to be described by very large state-space models thus making them perfect candidates for parallelisation. MARTe's traditional approach for task parallelisation is to split the problem into several Real-Time Threads, each responsible for a self-contained sequential execution of an input-to-output chain. This is usually possible, but it might not always be practical for algorithmic or technical reasons. Also, it might not be easily scalable with an increase in the number of available CPU cores. The WorkLibrary introduces a “GPU-like approach” of splitting work among the available cores of modern CPUs that is (i) straightforward to use in an application, (ii) scalable with the availability of cores and all of this (iii) without rewriting or recompiling the source code. The first part of this article explains the motivation behind the library, its architecture and implementation. The second part presents a real application for WALLS, a parallel version of a large state-space model describing the 2D thermal diffusion on a JET tile.
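The idea of splitting one large state-space update among cores can be sketched as follows. This is a minimal illustration in Python, not the WorkLibrary API: the function names, the row-block partitioning, and the update form x_next = A x + B u are assumptions made for the example. Python threads also do not give true CPU parallelism for this kind of arithmetic, so the sketch shows only how the work is partitioned, not the achievable speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def row_block_update(A, B, x, u, rows):
    """Compute the given rows of x_next = A x + B u (hypothetical helper)."""
    return [
        sum(a * xi for a, xi in zip(A[r], x))
        + sum(b * ui for b, ui in zip(B[r], u))
        for r in rows
    ]


def parallel_state_update(A, B, x, u, n_workers=4):
    """Split the rows of the update into contiguous blocks, one per worker."""
    n = len(A)
    n_workers = max(1, min(n_workers, n))
    chunk = -(-n // n_workers)  # ceiling division
    blocks = [range(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each block only reads shared inputs and returns its own rows,
        # so no two workers ever write to the same output element.
        parts = list(pool.map(lambda rows: row_block_update(A, B, x, u, rows), blocks))
    return [value for part in parts for value in part]
```

Because every output row depends only on the previous state and input, the blocks are independent, which is what makes large state-space models natural candidates for this kind of split.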
NASA Astrophysics Data System (ADS)
Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.
2016-12-01
The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled with a Hg transport module to investigate mercury pollution. In this study, we present our work on porting the GNAQPMS model to the Intel Xeon Phi processor, Knights Landing (KNL), to accelerate the model. KNL is the second-generation product adopting the Many Integrated Core (MIC) architecture. Compared with the first-generation Knights Corner (KNC), KNL has new hardware features, including the ability to run as a standalone processor as well as a coprocessor alongside another CPU. Using the VTune tool, the high-overhead modules in the GNAQPMS model were identified, including the CBMZ gas chemistry, advection and convection, and wet deposition modules. These high-overhead modules were accelerated by optimizing the code and using new techniques of KNL. The following optimization measures were applied: 1) changing the pure MPI parallel mode to a hybrid parallel mode with MPI and OpenMP; 2) vectorizing the code to use the 512-bit wide vector computation unit; 3) reducing unnecessary memory access and calculation; 4) reducing Thread Local Storage (TLS) for common variables within each OpenMP thread in CBMZ; 5) changing the method of global communication from writing and reading files to MPI functions. After optimization, the performance of GNAQPMS is greatly increased on both the CPU and KNL platforms. The single-node test showed that the optimized version has a 2.6x speedup on the two-socket CPU platform and a 3.3x speedup on the one-socket KNL platform compared with the baseline version of the code, which means that KNL has a 1.29x speedup compared with the two-socket CPU platform.
A feasibility study on porting the community land model onto accelerators using OpenACC
Wang, Dali; Wu, Wei; Winkler, Frank; ...
2014-01-01
As environmental models (such as the Accelerated Climate Model for Energy (ACME), the Parallel Reactive Flow and Transport Model (PFLOTRAN), the Arctic Terrestrial Simulator (ATS), etc.) become more and more complicated, we are facing enormous challenges in porting those applications onto hybrid computing architectures. OpenACC appears to be a very promising technology; therefore, we have conducted a feasibility analysis on porting the Community Land Model (CLM), a terrestrial ecosystem model within the Community Earth System Models (CESM). Specifically, we used an automatic function testing platform to extract a small computing kernel out of CLM, then applied this kernel in the actual CLM dataflow procedure and investigated the strategy of data parallelization and the benefit of data movement provided by the current implementation of OpenACC. Even though it is a non-intensive kernel, on a single 16-core computing node the performance (based on the actual computation time using one GPU) of the OpenACC implementation is 2.3 times faster than that of the OpenMP implementation using a single OpenMP thread, but it is 2.8 times slower than the performance of the OpenMP implementation using 16 threads. On multiple nodes, the MPI_OpenACC implementation demonstrated very good scalability on up to 128 GPUs on 128 computing nodes. This study also provides useful information for us to look into the potential benefits of the “deep copy” capability and “routine” feature of the OpenACC standards. In conclusion, we believe that our experience with the environmental model, CLM, can be beneficial to many other scientific research programs that are interested in porting their large-scale scientific code using OpenACC onto high-end computers empowered by hybrid computing architectures.
Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology
NASA Astrophysics Data System (ADS)
Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.
2009-04-01
Currently, the basic problem of tsunami modeling is the low speed of calculations, which is unacceptable for operational warning services. Existing algorithms for numerical modeling of the hydrodynamic processes of tsunami waves were developed without taking advantage of modern computing facilities. Calculations can be accelerated considerably by using parallel algorithms. We discuss here a new approach to parallelizing tsunami modeling code using OpenMP technology (for multiprocessing systems with shared memory). Nowadays, multiprocessing systems are easily accessible to everyone, and the cost of using such systems is much lower than the cost of clusters. This also allows programmers to apply multithreaded algorithms on researchers' desktop computers. Another important advantage of this approach is the shared-memory mechanism: there is no need to send data over slow networks (for example, Ethernet). All memory is common to all computing processes, which results in almost linear scalability of the program and processes. In the new version of NAMI DANCE, OpenMP technology and a multi-threaded algorithm provide an 80% gain in speed compared with the single-thread version on a dual-processor unit. The speed increased further, and a 320% gain was attained, on a four-core PC processor. Thus, it was possible to reduce considerably the calculation time on scientific workstations (desktops) without a complete change of the program and user interfaces. Further modernization of the algorithms for preparing initial data and processing results using OpenMP looks reasonable. The final version of NAMI DANCE with the increased computational speed can be used not only for research purposes but also in real-time tsunami warning systems.
Molecular Sticker Model Stimulation on Silicon for a Maximum Clique Problem
Ning, Jianguo; Li, Yanmei; Yu, Wen
2015-01-01
Molecular computers (also called DNA computers), as an alternative to traditional electronic computers, are smaller in size but more energy efficient, and have massive parallel processing capacity. However, DNA computers may not outperform electronic computers owing to their higher error rates and some limitations of the biological laboratory. The stickers model, as a typical DNA-based computer, is computationally complete and universal, and can be viewed as a bit-vertically operating machine. This makes it attractive for silicon implementation. Inspired by the information processing method on the stickers computer, we propose a novel parallel computing model called DEM (DNA Electronic Computing Model) on System-on-a-Programmable-Chip (SOPC) architecture. Except for the significant difference in the computing medium—transistor chips rather than bio-molecules—the DEM works similarly to DNA computers in immense parallel information processing. Additionally, a plasma display panel (PDP) is used to show the change of solutions, and helps us directly see the distribution of assignments. The feasibility of the DEM is tested by applying it to compute a maximum clique problem (MCP) with eight vertices. Owing to the limited computing sources on SOPC architecture, the DEM could solve moderate-size problems in polynomial time. PMID:26075867
NASA Technical Reports Server (NTRS)
Macmartin, Malcolm
1995-01-01
Improved screw-thread lock engaged after screw tightened in nut or other mating threaded part. Device does not release contaminating material during tightening of screw. Includes pellet of soft material encased in screw and retained by pin. Hammer blow on pin extrudes pellet into slot, engaging threads in threaded hole or in nut.
Method for molding threads in graphite panels
Short, William W.; Spencer, Cecil
1994-01-01
A graphite panel (10) with a hole (11) having a damaged thread (12) is repaired by drilling the hole (11) to remove all of the thread and make a new hole (13) of larger diameter. A bolt (14) with a lubricated thread (17) is placed in the new hole (13) and the hole (13) is packed with graphite cement (16) to fill the hole and the thread on the bolt. The graphite cement (16) is cured, and the bolt is unscrewed therefrom to leave a thread (20) in the cement (16) which is at least as strong as that of the original thread (12).
Self-locking threaded fasteners
Glovan, Ronald J.; Tierney, John C.; McLean, Leroy L.; Johnson, Lawrence L.
1996-01-01
A threaded fastener with a shape memory alloy (SMA) coating on its threads is disclosed. The fastener has special usefulness in high temperature applications where high reliability is important. The SMA coated fastener is threaded into or onto a mating threaded part at room temperature to produce a fastened object. The SMA coating is distorted during the assembly. At elevated temperatures the coating tries to recover its original shape and thereby exerts locking forces on the threads. When the fastened object is returned to room temperature the locking forces dissipate. Consequently the threaded fasteners can be readily disassembled at room temperature but remain securely fastened at high temperatures. A spray technique is disclosed as a particularly useful method of coating the threads of a fastener with a shape memory alloy.
Buechner, Claudia N.; Heil, Korbinian; Michels, Gudrun; Carell, Thomas; Kisker, Caroline; Tessmer, Ingrid
2014-01-01
Recognition and removal of DNA damages is essential for cellular and organismal viability. Nucleotide excision repair (NER) is the sole mechanism in humans for the repair of carcinogenic UV irradiation-induced photoproducts in the DNA, such as cyclobutane pyrimidine dimers. The broad substrate versatility of NER further includes, among others, various bulky DNA adducts. It has been proposed that the 5′-3′ helicase XPD (xeroderma pigmentosum group D) protein plays a decisive role in damage verification. However, despite recent advances such as the identification of a DNA-binding channel and central pore in the protein, through which the DNA is threaded, as well as a dedicated lesion recognition pocket near the pore, the exact process of target site recognition and verification in eukaryotic NER still remained elusive. Our single molecule analysis by atomic force microscopy reveals for the first time that XPD utilizes different recognition strategies to verify structurally diverse lesions. Bulky fluorescein damage is preferentially detected on the translocated strand, whereas the opposite strand preference is observed for a cyclobutane pyrimidine dimer lesion. Both states, however, lead to similar conformational changes in the resulting specific complexes, indicating a merge to a “final” verification state, which may then trigger the recruitment of further NER proteins. PMID:24338567
Efficient molecular dynamics simulations with many-body potentials on graphics processing units
NASA Astrophysics Data System (ADS)
Fan, Zheyong; Chen, Wei; Vierimaa, Ville; Harju, Ari
2017-09-01
Graphics processing units have been extensively used to accelerate classical molecular dynamics simulations. However, there is much less progress on the acceleration of force evaluations for many-body potentials compared to pairwise ones. In the conventional force evaluation algorithm for many-body potentials, the force, virial stress, and heat current for a given atom are accumulated within different loops, which could result in write conflict between different threads in a CUDA kernel. In this work, we provide a new force evaluation algorithm, which is based on an explicit pairwise force expression for many-body potentials derived recently (Fan et al., 2015). In our algorithm, the force, virial stress, and heat current for a given atom can be accumulated within a single thread and is free of write conflicts. We discuss the formulations and algorithms and evaluate their performance. A new open-source code, GPUMD, is developed based on the proposed formulations. For the Tersoff many-body potential, the double precision performance of GPUMD using a Tesla K40 card is equivalent to that of the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) molecular dynamics code running with about 100 CPU cores (Intel Xeon CPU X5670 @ 2.93 GHz).
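The difference between an accumulation pattern that can cause write conflicts and one that cannot is sketched below. This is a toy 1D illustration in Python, not the GPUMD algorithm: the spring-like `pair_force` is a hypothetical stand-in for a real potential, and the only property it shares with the explicit pairwise expression mentioned above is antisymmetry (the force on i due to j is minus the force on j due to i). Each iteration of the outer loop plays the role of one GPU thread.

```python
def pair_force(xi, xj, k=1.0):
    """Toy 1D force on atom i due to atom j; antisymmetric by construction."""
    return k * (xj - xi)


def forces_scatter(x):
    """Pair loop: the 'thread' for atom i also writes into atom j's entry."""
    f = [0.0] * len(x)
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            fij = pair_force(x[i], x[j])
            f[i] += fij
            f[j] -= fij  # write to another atom's slot: a conflict if i-loops run concurrently
    return f


def forces_gather(x):
    """Per-atom loop: the 'thread' for atom i reads neighbors, writes only f[i]."""
    f = [0.0] * len(x)
    for i in range(len(x)):
        total = 0.0
        for j in range(len(x)):
            if j != i:
                total += pair_force(x[i], x[j])
        f[i] = total  # single write, owned by this atom
    return f
```

Both functions give the same forces; the gather form does roughly twice the pair evaluations but needs no atomic operations or synchronization, which is the trade-off a one-thread-per-atom GPU kernel typically accepts.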
NASA Astrophysics Data System (ADS)
Nardi, Albert; Idiart, Andrés; Trinchero, Paolo; de Vries, Luis Manuel; Molinero, Jorge
2014-08-01
This paper presents the development, verification and application of an efficient interface, denoted as iCP, which couples two standalone simulation programs: the general purpose Finite Element framework COMSOL Multiphysics® and the geochemical simulator PHREEQC. The main goal of the interface is to maximize the synergies between the aforementioned codes, providing a numerical platform that can efficiently simulate a wide number of multiphysics problems coupled with geochemistry. iCP is written in Java and uses the IPhreeqc C++ dynamic library and the COMSOL Java-API. Given the large computational requirements of the aforementioned coupled models, special emphasis has been placed on numerical robustness and efficiency. To this end, the geochemical reactions are solved in parallel by balancing the computational load over multiple threads. First, a benchmark exercise is used to test the reliability of iCP regarding flow and reactive transport. Then, a large scale thermo-hydro-chemical (THC) problem is solved to show the code capabilities. The results of the verification exercise are successfully compared with those obtained using PHREEQC and the application case demonstrates the scalability of a large scale model, at least up to 32 threads.
NASA Astrophysics Data System (ADS)
Abadjiev, Valentin; Abadjieva, Emilia
2016-06-01
Hyperboloid gear drives with face mating gears are used to transform rotations between shafts with non-parallel and non-intersecting axes. Special cases of these transmissions are Spiroid and Helicon gear drives. The classical gear drives of this type are the Archimedean ones. The objects of this study are hyperboloid gear drives with face meshing, in which the pinion possesses threads of conic convolute, Archimedean and involute types, or threads of cylindrical convolute, Archimedean and involute types. For simplicity, all three types of transmissions with face mating gears and a conic pinion are titled Spiroid and all three types of transmissions with face mating gears and a cylindrical pinion are titled Helicon. Principles of the mathematical modelling of tooth contact synthesis are discussed in this study. The presented research shows that the synthesis is realized by application of two mathematical models: pitch contact point and mesh region models. Two approaches for synthesis of the gear drives in accordance with Olivier's principles are illustrated. The algorithms and computer programs for optimization synthesis and design of the studied hyperboloid gear drives are presented.
Method for Estimating Thread Strength Reduction of Damaged Parent Holes with Inserts
NASA Technical Reports Server (NTRS)
Johnson, David L.; Stratton, Troy C.
2005-01-01
During normal assembly and disassembly of bolted-joint components, thread damage and/or deformation may occur. If threads are overloaded, thread damage/deformation can also be anticipated. Typical inspection techniques (e.g. using GO-NO GO gages) may not provide adequate visibility of the extent of thread damage. More detailed inspection techniques have provided actual pitch-diameter profiles of damaged-hardware holes. A method to predict the reduction in thread shear-out capacity of damaged threaded holes has been developed. This method was based on testing and analytical modeling. Test samples were machined to simulate damaged holes in the hardware of interest. Test samples containing pristine parent-holes were also manufactured from the same bar-stock material to provide baseline results for comparison purposes. After the particular parent-hole thread profile was machined into each sample a helical insert was installed into the threaded hole. These samples were tested in a specially designed fixture to determine the maximum load required to shear out the parent threads. It was determined from the pristine-hole samples that, for the specific material tested, each individual thread could resist an average load of 3980 pounds. The shear-out loads of the holes having modified pitch diameters were compared to the ultimate loads of the specimens with pristine holes. An equivalent number of missing helical coil threads was then determined based on the ratio of shear-out loads for each thread configuration. These data were compared with the results from a finite element model (FEM). The model gave insights into the ability of the thread loads to redistribute for both pristine and simulated damage configurations. In this case, it was determined that the overall potential reduction in thread load-carrying capability in the hardware of interest was equal to having up to three fewer threads in the hole that bolt threads could engage. 
One-half of this potential reduction was due to local pitch-diameter variations and the other half was due to overall pitch-diameter enlargement beyond Class 2 fit. This result was important in that the thread shear capacity for this particular hardware design was the limiting structural capability. The details of the method development, including the supporting testing, data reduction and analytical model results comparison will be discussed hereafter.
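One simple way to turn a ratio of shear-out loads into an equivalent number of missing threads is sketched below. The formula is an assumption made for illustration (every engaged thread is taken to carry an equal share of the load), not the method of the study, which also used a finite element model to examine load redistribution. The function name and the thread count are hypothetical; only the 3980-pound average load per thread comes from the abstract.

```python
def equivalent_missing_threads(pristine_load, damaged_load, engaged_threads):
    """Estimate missing threads from the ratio of shear-out loads.

    Assumes (hypothetically) that each engaged thread carries an equal
    share of the shear-out load.
    """
    if pristine_load <= 0 or engaged_threads <= 0:
        raise ValueError("pristine_load and engaged_threads must be positive")
    ratio = damaged_load / pristine_load
    return engaged_threads * (1.0 - ratio)


# Illustrative inputs only: 10 engaged threads at 3980 lb each for the
# pristine hole, and a made-up shear-out load for a damaged hole.
pristine = 10 * 3980.0
damaged = 27860.0
missing = equivalent_missing_threads(pristine, damaged, 10)
```

Under this equal-share assumption, a damaged hole that shears out at 70% of the pristine load behaves like a hole with about three of its ten threads missing.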
NASA Astrophysics Data System (ADS)
Hu, Yuanyuan; Xu, Yingying; Hao, Qun; Hu, Yao
2013-12-01
The tubing internal thread plays an irreplaceable role in petroleum equipment. Unqualified tubing can directly lead to leakage and slippage and bring huge losses for the oil industry. To improve the efficiency and precision of tubing internal thread detection, we develop a new non-contact tubing internal thread measurement system based on the laser triangulation principle. Firstly, considering that the tubing thread has a small diameter and a relatively smooth surface, we built an optical system with a line-structured light to irradiate the internal thread surface and obtain, through a photoelectric sensor, an image which contains the internal thread profile information. Secondly, image processing techniques were used to detect the edges of the internal thread in the obtained image. One key method was the sub-pixel technique, which greatly improved the detection accuracy under the same hardware conditions. Finally, we restored the real internal thread contour information on the basis of the laser triangulation method and calculated tubing thread parameters such as the pitch, taper and tooth type angle. In this system, the profile of several thread teeth can be obtained at the same time. Compared with other existing scanning methods using point light and a stepper motor, this system greatly improves the detection efficiency. Experimental results indicate that this system can achieve high-precision, non-contact measurement of the tubing internal thread.
Measurement of Sound Speed in Thread
NASA Astrophysics Data System (ADS)
Saito, Shigemi; Shibata, Yasuhiro; Ichiki, Akira; Miyazaki, Akiho
2006-05-01
By employing thin wires, human hairs and threads, the measurement of sound speed in a thread whose diameter is smaller than 0.2 mm has been attempted. Preparing two cylindrical ceramic transducers with a 300 kHz resonance frequency, a perforated glass bead to be knotted by a sample thread is bonded to the center of the end surface of each transducer. After connecting these transducers with a sample thread, a receiving transducer is attached at a ceiling so as to hang another transmitting transducer with the thread. A glass bead is bonded to another end surface of the transmitting transducer so that tension, varied with a hanged plumb, can be applied to the sample thread. The time delay of the received signal relative to the transmitting pulse is measured while gradually shortening the thread. Sound speed is determined by the proportionality of time delay with thread length. Although the measured values for metallic wires are somewhat different from the values derived from the density and Young’s modulus cited in references, they are reproducible. The sound speed for human hairs of over twenty samples, which varies between 2000 and 2500 m/s, seems to depend on hair quality. Sound speed in a cotton thread is found to approach a constant value under large tension. An advanced measurement system available for uncut threads is also presented, where semi cylindrical transducers pinch the thread.
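Determining sound speed from the proportionality of time delay to thread length amounts to fitting a straight line and inverting its slope. The sketch below is a minimal illustration, assuming the delay follows t = t0 + L / c, where t0 is a fixed offset from the transducers and beads; the function name and this model are assumptions, not details taken from the measurement system.

```python
def fit_sound_speed(lengths_m, delays_s):
    """Least-squares line through (length, delay); returns (speed, offset).

    Assumes delay = offset + length / speed, so speed is 1 / slope.
    """
    n = len(lengths_m)
    if n < 2 or n != len(delays_s):
        raise ValueError("need at least two matching (length, delay) pairs")
    mean_l = sum(lengths_m) / n
    mean_t = sum(delays_s) / n
    sxx = sum((l - mean_l) ** 2 for l in lengths_m)
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(lengths_m, delays_s))
    slope = sxy / sxx               # seconds per metre
    offset = mean_t - slope * mean_l
    return 1.0 / slope, offset      # metres per second, seconds
```

Fitting the slope rather than dividing a single length by a single delay removes the constant offset, which is why the delay is measured repeatedly while the thread is shortened.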
Federal Register 2010, 2011, 2012, 2013, 2014
2013-12-31
... DEPARTMENT OF COMMERCE International Trade Administration [A-549-831] Steel Threaded Rod From... ``Department'') preliminarily determines that steel threaded rod from Thailand is being, or is likely to be... Investigation The merchandise covered by this investigation is steel threaded rod. Steel threaded rod is certain...
49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.
Code of Federal Regulations, 2012 CFR
2012-10-01
... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...
49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.
Code of Federal Regulations, 2014 CFR
2014-10-01
... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...
49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.
Code of Federal Regulations, 2013 CFR
2013-10-01
... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...
49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.
Code of Federal Regulations, 2011 CFR
2011-10-01
... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...
AN MHD AVALANCHE IN A MULTI-THREADED CORONAL LOOP
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hood, A. W.; Cargill, P. J.; Tam, K. V.
For the first time, we demonstrate how an MHD avalanche might occur in a multithreaded coronal loop. Considering 23 non-potential magnetic threads within a loop, we use 3D MHD simulations to show that only one thread needs to be unstable in order to start an avalanche even when the others are below marginal stability. This has significant implications for coronal heating in that it provides for energy dissipation with a trigger mechanism. The instability of the unstable thread follows the evolution determined in many earlier investigations. However, once one stable thread is disrupted, it coalesces with a neighboring thread and this process disrupts other nearby threads. Coalescence with these disrupted threads then occurs leading to the disruption of yet more threads as the avalanche develops. Magnetic energy is released in discrete bursts as the surrounding stable threads are disrupted. The volume integrated heating, as a function of time, shows short spikes suggesting that the temporal form of the heating is more like that of nanoflares than of constant heating.
Self-locking threaded fasteners
Glovan, R.J.; Tierney, J.C.; McLean, L.L.; Johnson, L.L.
1996-01-16
A threaded fastener with a shape memory alloy (SMA) coating on its threads is disclosed. The fastener has special usefulness in high temperature applications where high reliability is important. The SMA coated fastener is threaded into or onto a mating threaded part at room temperature to produce a fastened object. The SMA coating is distorted during the assembly. At elevated temperatures the coating tries to recover its original shape and thereby exerts locking forces on the threads. When the fastened object is returned to room temperature the locking forces dissipate. Consequently the threaded fasteners can be readily disassembled at room temperature but remain securely fastened at high temperatures. A spray technique is disclosed as a particularly useful method of coating the threads of a fastener with a shape memory alloy. 13 figs.
Bahira, Meriem; McCauley, Micah J; Almaqwashi, Ali A; Lincoln, Per; Westerlund, Fredrik; Rouzina, Ioulia; Williams, Mark C
2015-10-15
Several multi-component DNA intercalating small molecules have been designed around ruthenium-based intercalating monomers to optimize DNA binding properties for therapeutic use. Here we probe the DNA binding ligand [μ-C4(cpdppz)2(phen)4Ru2](4+), which consists of two Ru(phen)2dppz(2+) moieties joined by a flexible linker. To quantify ligand binding, double-stranded DNA is stretched with optical tweezers and exposed to ligand under constant applied force. In contrast to other bis-intercalators, we find that ligand association is described by a two-step process, which consists of fast bimolecular intercalation of the first dppz moiety followed by ∼10-fold slower intercalation of the second dppz moiety. The second step is rate-limited by the requirement for a DNA-ligand conformational change that allows the flexible linker to pass through the DNA duplex. Based on our measured force-dependent binding rates and ligand-induced DNA elongation measurements, we are able to map out the energy landscape and structural dynamics for both ligand binding steps. In addition, we find that at zero force the overall binding process involves fast association (∼10 s), slow dissociation (∼300 s), and very high affinity (Kd ∼10 nM). The methodology developed in this work will be useful for studying the mechanism of DNA binding by other multi-step intercalating ligands and proteins. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
FastID: Extremely Fast Forensic DNA Comparisons
2017-05-19
FastID: Extremely Fast Forensic DNA Comparisons Darrell O. Ricke, PhD Bioengineering Systems & Technologies Massachusetts Institute of...Technology Lincoln Laboratory Lexington, MA USA Darrell.Ricke@ll.mit.edu Abstract—Rapid analysis of DNA forensic samples can have a critical impact on...time sensitive investigations. Analysis of forensic DNA samples by massively parallel sequencing is creating the next gold standard for DNA
Wunnicke, Dorith; Ding, Ping; Yang, Haozhe; Seela, Frank; Steinhoff, Heinz-Jürgen
2015-10-29
Parallel-stranded (ps) DNA characterized by its sugar-phosphate backbones pointing in the same direction represents an alternative pairing system to antiparallel-stranded (aps) DNA with the potential to inhibit transcription and translation. 25-mer oligonucleotides were selected containing only dA·dT base pairs to compare spin-labeled nucleobase distances over a range of 10 or 15 base pairs in ps DNA with those in aps DNA. By means of the copper(I)-catalyzed Huisgen-Meldal-Sharpless alkyne-azide cycloaddition, the spin label 4-azido-2,2,6,6-tetramethylpiperidine-1-oxyl was clicked to 7-ethynyl-7-deaza-2'-deoxyadenosine or 5-ethynyl-2'-deoxyuridine to yield 25-mer oligonucleotides incorporating two spin labels. The interspin distances between spin labeled residues were determined by pulse EPR spectroscopy. The results reveal that in ps DNA these distances are between 5 and 10% longer than in aps DNA when the labeled DNA segment is located near the center of the double helix. The interspin distance in ps DNA becomes shorter compared with aps DNA when one of the spin labels occupies a position near the end of the double helix.
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
NASA Astrophysics Data System (ADS)
Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin
2016-08-01
This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used to fill the occupancy of each GPU with many replicas, providing a performance boost that is most noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that the spin-level implementation is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance scales well under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32, 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
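A single step of the mid-point insertion strategy can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the function name is hypothetical, `exchange_rates[i]` is taken to be the measured exchange rate between `temps[i]` and `temps[i + 1]`, and the arithmetic mid-point is used. The stopping criterion and the load balancing across GPUs described above are not modelled.

```python
def insert_midpoint(temps, exchange_rates):
    """Insert one temperature in the middle of the gap with the lowest exchange rate.

    temps must be sorted; exchange_rates has one entry per adjacent pair.
    Returns a new list and leaves the inputs unchanged.
    """
    if len(temps) < 2 or len(exchange_rates) != len(temps) - 1:
        raise ValueError("need one exchange rate per adjacent temperature pair")
    # Index of the most compromised gap (lowest measured exchange rate).
    worst = min(range(len(exchange_rates)), key=exchange_rates.__getitem__)
    mid = 0.5 * (temps[worst] + temps[worst + 1])
    return temps[: worst + 1] + [mid] + temps[worst + 1 :]
```

In an adaptive loop this step would be repeated: simulate, measure the exchange rates for the enlarged temperature set, and insert again until no gap falls below a chosen threshold.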
Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas
2016-04-01
Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes through systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station; however, because of multiple invocations for a large number of subtasks, the full task requires significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods and a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new software tool, mpiWrapper, has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads - one for task management and communication, and another for subtask execution - are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .
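The idea of farming out many invocations of a serial program, with resubmission on failure, can be illustrated on a single machine. The sketch below is an analogue only: it uses a Python thread pool rather than MPI, it is not mpiWrapper's interface, and `run_with_retry`, `run_all` and `max_attempts` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, arg, max_attempts=3):
    """Run one subtask, resubmitting it if it raises
    (a stand-in for a node failure)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(arg)
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
    raise RuntimeError(f"subtask {arg!r} failed {max_attempts} times") from last_error


def run_all(task, args, workers=4):
    """Apply a serial function to many inputs concurrently.
    Results are returned in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda a: run_with_retry(task, a), args))
```

In a real deployment `task` would typically launch an external program per subtask; that part is left out so the sketch stays self-contained.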
Understanding thread properties for red blood cell antigen assays: weak ABO blood typing.
Nilghaz, Azadeh; Zhang, Liyuan; Li, Miaosi; Ballerini, David R; Shen, Wei
2014-12-24
"Thread-based microfluidics" research has so far focused on utilizing and manipulating the wicking properties of threads to form controllable microfluidic channels. In this study we aim to understand the separation properties of threads, which are important to their microfluidic detection applications for blood analysis. Confocal microscopy was utilized to investigate the effect of the microscale surface morphologies of fibers on the thread's separation efficiency of red blood cells. We demonstrated the remarkably different separation properties of threads made using silk and cotton fibers. Thread separation properties dominate the clarity of blood typing assays of the ABO groups and some of their weak subgroups (Ax and A3). The microfluidic thread-based analytical devices (μTADs) designed in this work were used to accurately type different blood samples, including 89 normal ABO and 6 weak A subgroups. By selecting thread with the right surface morphology, we were able to build μTADs capable of providing rapid and accurate typing of the weak blood groups with high clarity.
NASA Astrophysics Data System (ADS)
Ji, Shude; Li, Zhengwei; Zhou, Zhenlu; Wu, Baosheng
2017-10-01
This study focused on the effects of thread on hook and cold lap formation, lap shear property and impact toughness of alclad 2024-T4 friction stir lap welding (FSLW) joints. In addition to the traditional threaded pin tool (TR-tool), three new tools with different thread locations and orientations were designed. Results showed that thread significantly affected hook, cold lap morphologies and lap shear properties. The tool with tip-threaded pin (T-tool) fabricated a joint with flat hook and cold lap, which resulted in shear fracture mode. The tool with bottom-threaded pin (B-tool) eliminated the hook. The tool with reverse-threaded pin (R-tool) widened the stir zone. When using configuration A, the joints fabricated by the three new tools showed higher failure loads than the joint fabricated by the TR-tool. The joint using the T-tool had the optimum impact toughness. This study demonstrated the significance of thread during FSLW and provided a reference to optimize tool geometry.
A Parallel Processing Algorithm for Remote Sensing Classification
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony
2005-01-01
A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level computers using the Linux operating system. For example, on the Medusa cluster at NASA/GSFC, this provides supercomputing performance, 130 Gflops (Linpack Benchmark), at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
Parallel algorithm of real-time infrared image restoration based on total variation theory
NASA Astrophysics Data System (ADS)
Zhu, Ran; Li, Miao; Long, Yunli; Zeng, Yaoyuan; An, Wei
2015-10-01
Image restoration is a necessary preprocessing step for infrared remote sensing applications. Traditional methods remove the noise but over-penalize the gradients corresponding to edges. Image restoration techniques based on variational approaches can solve this over-smoothing problem owing to their well-defined mathematical modeling of the restoration procedure. The total variation (TV) of the infrared image is introduced as an L1 regularization term added to the objective energy functional. It converts the restoration process to an optimization problem of a functional involving a fidelity term to the image data plus a regularization term. Infrared image restoration with the TV-L1 model makes full use of the acquired remote sensing data and preserves information at edges caused by clouds. The numerical implementation algorithm is presented in detail. Analysis indicates that the structure of this algorithm can be easily parallelized. Therefore a parallel implementation of the TV-L1 filter based on multicore architecture with shared memory is proposed for infrared real-time remote sensing systems. Massive computation of image data is performed in parallel by cooperating threads running simultaneously on multiple cores. Several groups of synthetic infrared image data are used to validate the feasibility and effectiveness of the proposed parallel algorithm. Quantitative analysis of the restored image quality compared to the input image is presented. Experimental results show that the TV-L1 filter can restore the varying background image reasonably, and that its performance meets the requirements of real-time image processing.
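The abstract does not state the functional explicitly. A commonly used TV-L1 form, assumed here for illustration, combines a total variation regularization term with an L1 fidelity term, where $f$ is the observed image on the domain $\Omega$, $u$ is the restored image, and $\lambda > 0$ is a weight (all symbols are assumptions, not taken from the paper):

```latex
\min_{u} \; E(u) = \int_{\Omega} \lvert \nabla u \rvert \, dx
                 \; + \; \lambda \int_{\Omega} \lvert u - f \rvert \, dx
```

A larger $\lambda$ keeps $u$ closer to the data, while a smaller $\lambda$ favours smoothing; the TV term penalizes oscillations without forbidding sharp edges.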
A Verification System for Distributed Objects with Asynchronous Method Calls
NASA Astrophysics Data System (ADS)
Ahrendt, Wolfgang; Dylla, Maximilian
We present a verification system for Creol, an object-oriented modeling language for concurrent distributed applications. The system is an instance of KeY, a framework for object-oriented software verification, which has so far been applied foremost to sequential Java. Building on KeY characteristic concepts, like dynamic logic, sequent calculus, explicit substitutions, and the taclet rule language, the system presented in this paper addresses functional correctness of Creol models featuring local cooperative thread parallelism and global communication via asynchronous method calls. The calculus heavily operates on communication histories which describe the interfaces of Creol units. Two example scenarios demonstrate the usage of the system.
Kalman filter tracking on parallel architectures
NASA Astrophysics Data System (ADS)
Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.
2017-10-01
We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
A wavelet approach to binary blackholes with asynchronous multitasking
NASA Astrophysics Data System (ADS)
Lim, Hyun; Hirschmann, Eric; Neilsen, David; Anderson, Matthew; Debuhr, Jackson; Zhang, Bo
2016-03-01
Highly accurate simulations of binary black holes and neutron stars are needed to address a variety of interesting problems in relativistic astrophysics. We present a new method for solving the Einstein equations (BSSN formulation) using iterated interpolating wavelets. Wavelet coefficients provide a direct measure of the local approximation error for the solution and place collocation points that naturally adapt to features of the solution. Further, they exhibit exponential convergence on unevenly spaced collocation points. The parallel implementation of the wavelet simulation framework presented here deviates from conventional practice in combining multi-threading with a form of message-driven computation sometimes referred to as asynchronous multitasking.
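How a wavelet coefficient can act as a local error measure and drive point placement can be shown in one dimension. The sketch below is a simplification under stated assumptions: it uses linear interpolation on a dyadic grid (the paper's interpolation order is not given in the abstract), and the function names and `eps` threshold are hypothetical.

```python
def wavelet_coefficients(f, a, b, levels):
    """Return {x: d}, where d is f(x) minus the linear interpolation of f
    from the two neighbouring points of the next-coarser dyadic grid on [a, b]."""
    coeffs = {}
    for level in range(1, levels + 1):
        n = 2 ** level                # number of intervals at this level
        h = (b - a) / n
        for k in range(1, n, 2):      # odd indices are the points new at this level
            x = a + k * h
            predicted = 0.5 * (f(x - h) + f(x + h))
            coeffs[x] = f(x) - predicted
    return coeffs


def adapted_points(f, a, b, levels, eps):
    """Keep the endpoints plus every point whose coefficient exceeds eps
    in magnitude, so points cluster where interpolation is poor."""
    kept = {a, b}
    coeffs = wavelet_coefficients(f, a, b, levels)
    kept.update(x for x, d in coeffs.items() if abs(d) > eps)
    return sorted(kept)
```

For a function that the interpolant reproduces exactly, every coefficient vanishes and only the endpoints survive; a kink or steep feature leaves large coefficients, and therefore retained points, only in its neighbourhood.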
Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2
Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.
2014-01-01
Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64 bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000 × 32000 pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745
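For readers unfamiliar with preflow-push, a minimal sequential sketch follows. It is a textbook FIFO variant written for clarity, not the multithreaded XMT-2 implementation the abstract describes; the edge-dictionary input format and the function name are assumptions.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Sequential FIFO preflow-push (push-relabel) maximum flow.

    capacity -- dict mapping (u, v) to a non-negative edge capacity
    Returns the value of the maximum flow from source to sink.
    """
    nodes = {source, sink}
    residual, neighbours = {}, {}
    for (u, v), c in capacity.items():
        nodes.update((u, v))
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)           # reverse residual edge
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    height = {n: 0 for n in nodes}
    excess = {n: 0 for n in nodes}
    height[source] = len(nodes)
    active = deque()

    def push(u, v, delta):
        residual[(u, v)] -= delta
        residual[(v, u)] += delta
        excess[u] -= delta
        # A node becomes active when it first receives excess.
        if excess[v] == 0 and v not in (source, sink):
            active.append(v)
        excess[v] += delta

    # Initial preflow: saturate every edge leaving the source.
    for v in neighbours.get(source, ()):
        if residual[(source, v)] > 0:
            push(source, v, residual[(source, v)])

    while active:
        u = active.popleft()
        while excess[u] > 0:                     # discharge u completely
            pushed = False
            for v in neighbours[u]:
                if residual[(u, v)] > 0 and height[u] == height[v] + 1:
                    push(u, v, min(excess[u], residual[(u, v)]))
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Relabel: lift u just above its lowest residual neighbour.
                height[u] = 1 + min(height[v] for v in neighbours[u]
                                    if residual[(u, v)] > 0)
    return excess[sink]
```

Because each push and relabel touches only a node and its neighbours, the operations are local, which is the property that makes the algorithm a candidate for fine-grained multithreading.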
Data Acquisition with GPUs: The DAQ for the Muon $g$-$2$ Experiment at Fermilab
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gohn, W.
Graphical Processing Units (GPUs) have recently become a valuable computing tool for the acquisition of data at high rates and for a relatively low cost. The devices work by parallelizing the code into thousands of threads, each executing a simple process, such as identifying pulses from a waveform digitizer. The CUDA programming library can be used to effectively write code to parallelize such tasks on Nvidia GPUs, providing a significant upgrade in performance over CPU based acquisition systems. The muon $g$-$2$ experiment at Fermilab is heavily relying on GPUs to process its data. The data acquisition system for this experiment must have the ability to create deadtime-free records from 700 $\mu$s muon spills at a raw data rate of 18 GB per second. Data will be collected using 1296 channels of $\mu$TCA-based 800 MSPS, 12 bit waveform digitizers and processed in a layered array of networked commodity processors with 24 GPUs working in parallel to perform a fast recording of the muon decays during the spill. The described data acquisition system is currently being constructed, and will be fully operational before the start of the experiment in 2017.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feo, J.T.
1993-10-01
This report contains papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernel of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal Implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architectures; Compiling technique based on dataflow analysis for functional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal Vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor; A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architecture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.
Arima, Kunimasa
2006-10-01
The microtubule-associated protein tau aggregates into filaments in the form of neurofibrillary tangles, neuropil threads and argyrophilic grains in neurons, in the form of variable astrocytic tangles in astrocytes and in the form of coiled bodies and argyrophilic threads in oligodendrocytes. These tau filaments may be classified into two types, straight filaments or tubules with 9-18 nm diameters and "twisted ribbons" composed of two parallel aligned components. In the same disease, the fine structure of tau filaments in glial cells roughly resembles that in neurons. In sporadic tauopathies, individual tau filaments show characteristic sizes, shapes and arrangements, and therefore contribute to neuropathologic differential diagnosis. In frontotemporal dementias caused by tau gene mutations, variable filamentous profiles were observed in association with mutation sites and insoluble tau isoforms, including straight filaments or tubules, paired helical filament-like filaments, and twisted ribbons. Pre-embedding immunoelectron microscopic studies were carried out using anti-3-repeat tau and anti-4-repeat tau specific antibodies, RD3 and RD4. Straight tubules in neuronal and astrocytic Pick bodies were immunolabeled by the anti-3-repeat tau antibody. The anti-4-repeat tau antibody recognized abnormal tubules comprising neurofibrillary tangles, coiled bodies and argyrophilic threads in progressive supranuclear palsy (PSP) and corticobasal degeneration. In the pre-embedding immunoelectron microscopic study using the phosphorylated tau AT8 antibody, tuft-shaped astrocytes of PSP were found to be composed of bundles of abnormal tubules in processes and perikarya of protoplasmic astrocytes. In this study, the 3-repeat tau or 4-repeat tau epitope was detected in situ at the ultrastructural level in abnormal tubules in representative pathological lesions in Pick's disease, PSP and corticobasal degeneration.
Study of a Fine Grained Threaded Framework Design
NASA Astrophysics Data System (ADS)
Jones, C. D.
2012-12-01
Traditionally, HEP experiments exploit the multiple cores in a CPU by having each core process one event. However, future PC designs are expected to use CPUs which double the number of processing cores at the same rate as the cost of memory falls by a factor of two. This effectively means the amount of memory per processing core will remain constant. This is a major challenge for LHC processing frameworks since the LHC is expected to deliver more complex events (e.g. greater pileup events) in the coming years while the LHC experiment's frameworks are already memory constrained. Therefore in the not so distant future we may need to be able to efficiently use multiple cores to process one event. In this presentation we will discuss a design for an HEP processing framework which can allow very fine grained parallelization within one event as well as supporting processing multiple events simultaneously while minimizing the memory footprint of the job. The design is built around the libdispatch framework created by Apple Inc. (a port for Linux is available) whose central concept is the use of task queues. This design also accommodates the reality that not all code will be thread safe and therefore allows one to easily mark modules or sub parts of modules as being thread unsafe. In addition, the design efficiently handles the requirement that events in one run must all be processed before starting to process events from a different run. After explaining the design we will provide measurements from simulating different processing scenarios where the processing times used for the simulation are drawn from processing times measured from actual CMS event processing.
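The handling of thread-unsafe modules can be sketched in a few lines. This is an illustration in Python rather than libdispatch, and it shows only concurrency across events, not the fine-grained parallelism within one event that the design targets; the `Module` class, its `thread_safe` flag and `process_events` are hypothetical names, with a per-module lock standing in for a serial task queue.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Module:
    """A processing step. Modules marked thread-unsafe are serialised
    by a private lock, so only one event runs through them at a time."""

    def __init__(self, func, thread_safe=True):
        self.func = func
        self._lock = None if thread_safe else threading.Lock()

    def __call__(self, event):
        if self._lock is None:
            return self.func(event)
        with self._lock:              # plays the role of a serial queue
            return self.func(event)


def process_events(events, modules, workers=4):
    """Run every module, in order, on every event.
    Different events are processed concurrently."""
    def run_one(event):
        return [module(event) for module in modules]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, events))
```

Marking a module rather than the whole job as unsafe keeps the serialised region small, so the remaining modules still run concurrently.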
Simultaneous G-Quadruplex DNA Logic.
Bader, Antoine; Cockroft, Scott L
2018-04-03
A fundamental principle of digital computer operation is Boolean logic, where inputs and outputs are described by binary integer voltages. Similarly, inputs and outputs may be processed on the molecular level as exemplified by synthetic circuits that exploit the programmability of DNA base-pairing. Unlike modern computers, which execute large numbers of logic gates in parallel, most implementations of molecular logic have been limited to single computing tasks, or sensing applications. This work reports three G-quadruplex-based logic gates that operate simultaneously in a single reaction vessel. The gates respond to unique Boolean DNA inputs by undergoing topological conversion from duplex to G-quadruplex states that were resolved using a thioflavin T dye and gel electrophoresis. The modular, addressable, and label-free approach could be incorporated into DNA-based sensors, or used for resolving and debugging parallel processes in DNA computing applications. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Structural Turnbuckle Bears Compressive or Tensile Loads
NASA Technical Reports Server (NTRS)
Bateman, W. A.; Lang, C. H.
1985-01-01
Column length adjuster based on turnbuckle principle. Device consists of internally and externally threaded bushing, threaded housing and threaded rod. Housing attached to one part and threaded rod attached to other part of structure. Turning double threaded bushing contracts or extends rod in relation to housing. Once adjusted, bushing secured with jamnuts. Device used for axially loaded members requiring length adjustment during installation.
Do dual-thread orthodontic mini-implants improve bone/tissue mechanical retention?
Lin, Yang-Sung; Chang, Yau-Zen; Yu, Jian-Hong; Lin, Chun-Li
2014-12-01
The aim of this study was to understand whether the pitch relationship between micro and macro thread designs with a parametrical relationship in a dual-thread mini-implant can improve primary stability. Three types of mini-implants were used for insertion and pull-out testing: single-thread (ST) (0.75 mm pitch over the whole length), dual-thread A (DTA) with double-start 0.375 mm pitch, and dual-thread B (DTB) with single-start 0.2 mm pitch in the upper 2-mm micro thread region. Histomorphometric analysis was performed on these specimens to evaluate peri-implant bone defects using a non-contact vision measuring system. The maximum insertion torque (Tmax) of type DTA was significantly the smallest, whereas no significant difference was found between ST and DTB. The largest pull-out strength (Fmax), found in the DTA mini-implant, was significantly greater than that of the ST mini-implant regardless of implant insertion orientation. The mini-implants engaged the cortical bone well in the ST and DTA types. A dual-thread mini-implant with the correct micro thread pitch (parametrical relationship with macro thread pitch) in the cortical bone region can improve primary stability and enhance mechanical retention.
Geramizadeh, Maryam; Katoozian, Hamidreza; Amid, Reza; Kadkhodazadeh, Mahdi
2018-04-01
This study aimed to optimize the thread depth and pitch of a recently designed dental implant to provide uniform stress distribution by means of a response surface optimization method available in finite element (FE) software. The sensitivity of simulation to different mechanical parameters was also evaluated. A three-dimensional model of a tapered dental implant with micro-threads in the upper area and V-shaped threads in the rest of the body was modeled and analyzed using finite element analysis (FEA). An axial load of 100 N was applied to the top of the implants. The model was optimized for thread depth and pitch to determine the optimal stress distribution. In this analysis, micro-threads had 0.25 to 0.3 mm depth and 0.27 to 0.33 mm pitch, and V-shaped threads had 0.405 to 0.495 mm depth and 0.66 to 0.8 mm pitch. The optimized depth and pitch were 0.307 and 0.286 mm for micro-threads and 0.405 and 0.808 mm for V-shaped threads, respectively. In this design, the most effective parameters on stress distribution were the depth and pitch of the micro-threads based on sensitivity analysis results. Based on the results of this study, the optimal implant design has micro-threads with 0.307 and 0.286 mm depth and pitch, respectively, in the upper area and V-shaped threads with 0.405 and 0.808 mm depth and pitch in the rest of the body. These results indicate that micro-thread parameters have a greater effect on stress and strain values.
DNA Assembly with De Bruijn Graphs Using an FPGA Platform.
Poirier, Carl; Gosselin, Benoit; Fortier, Paul
2018-01-01
This paper presents an FPGA implementation of a DNA assembly algorithm, called Ray, initially developed to run on parallel CPUs. The OpenCL language is used and the focus is placed on modifying and optimizing the original algorithm to better suit the new parallelization tool and the radically different hardware architecture. The results show that the execution time is roughly one fourth that of the CPU and factoring in energy consumption yields a tenfold savings.
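As background for the De Bruijn graph named in the title, the construction step can be sketched as below. This is a generic textbook construction, not Ray's algorithm or its OpenCL port; the function name and the adjacency-list output format are assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph from sequencing reads.

    Nodes are (k-1)-mers. Each k-mer found in a read adds one edge
    from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

Each k-mer is processed independently of the others, which is one reason the construction maps well onto parallel hardware; assembling contigs then amounts to walking paths through the graph.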
Deploy Nalu/Kokkos algorithmic infrastructure with performance benchmarking.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Domino, Stefan P.; Ananthan, Shreyas; Knaus, Robert C.
The former Nalu interior heterogeneous algorithm design, which was originally designed to manage matrix assembly operations over all elemental topology types, has been modified to operate over homogeneous collections of mesh entities. This newly templated kernel design allows for removal of workset variable resize operations that were formerly required at each loop over a Sierra ToolKit (STK) bucket (nominally, 512 entities in size). Extensive usage of the Standard Template Library (STL) std::vector has been removed in favor of intrinsic Kokkos memory views. In this milestone effort, the transition to Kokkos as the underlying infrastructure to support performance and portability on many-core architectures has been deployed for key matrix algorithmic kernels. A unit-test driven design effort has developed a homogeneous entity algorithm that employs a team-based thread parallelism construct. The STK Single Instruction Multiple Data (SIMD) infrastructure is used to interleave data for improved vectorization. The collective algorithm design, which allows for concurrent threading and SIMD management, has been deployed for the core low-Mach element-based algorithm. Several tests to ascertain SIMD performance on Intel KNL and Haswell architectures have been carried out. The performance test matrix includes evaluation of both low- and higher-order methods. The higher-order low-Mach methodology builds on polynomial promotion of the core low-order control volume finite element method (CVFEM). Performance testing of the Kokkos-view/SIMD design indicates low-order matrix assembly kernel speed-up ranging between two and four times depending on mesh loading and node count. Better speedups are observed for higher-order meshes (currently only P=2 has been tested) especially on KNL. The increased workload per element on higher-order meshes benefits from the wide SIMD width on KNL machines.
Combining multiple threads with SIMD on KNL achieves a 4.6x speedup over the baseline, with assembly timings faster than that observed on Haswell architecture. The computational workload of higher-order meshes, therefore, seems ideally suited for the many-core architecture and justifies further exploration of higher-order on NGP platforms. A Trilinos/Tpetra-based multi-threaded GMRES preconditioned by symmetric Gauss Seidel (SGS) represents the core solver infrastructure for the low-Mach advection/diffusion implicit solves. The threaded solver stack has been tested on small problems on NREL's Peregrine system using the newly developed and deployed Kokkos-view/SIMD kernels. Efforts are underway to deploy the Tpetra-based solver stack on the NERSC Cori system to benchmark its performance at scale on KNL machines.
Flow cytometry for enrichment and titration in massively parallel DNA sequencing
Sandberg, Julia; Ståhl, Patrik L.; Ahmadian, Afshin; Bjursell, Magnus K.; Lundeberg, Joakim
2009-01-01
Massively parallel DNA sequencing is revolutionizing genomics research throughout the life sciences. However, the reagent costs and labor requirements in current sequencing protocols are still substantial, although improvements are continuously being made. Here, we demonstrate an effective alternative to existing sample titration protocols for the Roche/454 system using Fluorescence Activated Cell Sorting (FACS) technology to determine the optimal DNA-to-bead ratio prior to large-scale sequencing. Our method, which eliminates the need for the costly pilot sequencing of samples during titration, is capable of rapidly providing accurate DNA-to-bead ratios that are not biased by the quantification and sedimentation steps included in current protocols. Moreover, we demonstrate that FACS sorting can be readily used to highly enrich fractions of beads carrying template DNA, with near total elimination of empty beads and no downstream sacrifice of DNA sequencing quality. Automated enrichment by FACS is a simple approach to obtain pure samples for bead-based sequencing systems, and offers an efficient, low-cost alternative to current enrichment protocols. PMID:19304748
SEM and fractography analysis of screw thread loosening in dental implants.
Scarano, A; Quaranta, M; Traini, T; Piattelli, M; Piattelli, A
2007-01-01
Biological and technical failures of implants have already been reported. Mechanical factors are certainly of importance in implant failures, even if their exact nature has not yet been established. Abutment screw fracture or loosening represents a rare, but quite unpleasant failure. The aim of the present research is an analysis and structural examination of screw threads with abutment loosening compared with screw threads without loosening. Loosened screw threads were compared to screw threads without loosening in three different implant systems: Branemark (Nobel Biocare, Gothenburg, Sweden), T.B.R. implant systems (Benax, Ancona, Italy) and Restore (Lifecore Biomedical, Chaska, Minnesota, USA). In this study broken screws were excluded. A total of 16 screw thread loosenings were observed (Group I) (4 Branemark, 4 T.B.R and 5 Restore), 10 screw threads without loosening were removed (Group II), and 6 screw threads as received from the manufacturer (unused) (Group III) were used as control (2 Branemark, 2 T.B.R and 2 Restore). The loosened abutment screws were retrieved and analyzed under SEM. Many alterations and deformations were present in concavities and convexities of screw threads in group I. No macroscopic alterations or deformations were observed in groups II and III. A statistical difference in the presence of microcracks was observed between screw threads with abutment loosening and screw threads without abutment loosening.
Fast parallel molecular algorithms for DNA-based computation: factoring integers.
Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui
2005-06-01
The RSA public-key cryptosystem is an algorithm that converts input data to an unrecognizable encryption and converts the unrecognizable data back into its original decryption form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates how to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that public-key cryptosystems are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
A Moiré Pattern-Based Thread Counter
ERIC Educational Resources Information Center
Reich, Gary
2017-01-01
Thread count is a term used in the textile industry as a measure of how closely woven a fabric is. It is usually defined as the sum of the number of warp threads per inch (or cm) and the number of weft threads per inch. (It is sometimes confusingly described as the number of threads per square inch.) In recent years it has also become a subject of…
Iwatsubo, T; Hasegawa, M; Esaki, Y; Ihara, Y
1992-02-01
Immunocytochemically, neuropil threads (curly fibers) were investigated in the Alzheimer's disease brain using a confocal laser scanning fluorescence microscope by double labeling with tau/ubiquitin antibodies. Ubiquitin immunoreactivities were found to be lacking at one or both ends in more than 40% of tau-positive threads. Immunoelectron microscopy showed that bundles of paired helical filaments, which constitute neuropil threads, were positive for ubiquitin around their midportions, but often negative at their ends. Since it is reasonable to postulate that tau deposition as paired helical filaments precedes ubiquitination, the aforementioned observation suggests that the ends of the threads are newly formed portions, and thus the threads are often growing bidirectionally in small neuronal processes.
Parallel Online Temporal Difference Learning for Motor Control.
Caarls, Wouter; Schuitema, Erik
2016-07-01
Temporal difference (TD) learning, a key concept in reinforcement learning, is a popular method for solving simulated control problems. However, in real systems, this method is often avoided in favor of policy search methods because of its long learning time. But policy search suffers from its own drawbacks, such as the necessity of informed policy parameterization and initialization. In this paper, we show that TD learning can work effectively in real robotic systems as well, using parallel model learning and planning. Using locally weighted linear regression and trajectory sampled planning with 14 concurrent threads, we can achieve a speedup of almost two orders of magnitude over regular TD control on simulated control benchmarks. For a real-world pendulum swing-up task and a two-link manipulator movement task, we report a speedup of 20× to 60× , with a real-time learning speed of less than half a minute. The results are competitive with state-of-the-art policy search.
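For readers unfamiliar with the basic update that the abstract builds on, here is a minimal tabular TD(0) sketch in Python. It is not the parallel model-learning and planning method of the paper; the chain environment, reward, step size, and function name are assumptions made only to show the temporal-difference update itself.

```python
def td0_chain(n_states=5, gamma=0.9, alpha=0.5, episodes=200):
    """Tabular TD(0) on a deterministic chain 0 -> 1 -> ... -> terminal.

    A reward of 1 is given on the transition into the terminal state.
    """
    values = [0.0] * n_states   # the terminal state keeps value 0
    terminal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != terminal:
            next_state = state + 1
            reward = 1.0 if next_state == terminal else 0.0
            # TD error: bootstrapped one-step target minus current estimate
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values


values = td0_chain()
```

In this toy chain each state's value approaches the discount factor raised to the number of remaining steps before the rewarded transition. Even here, information propagates backward only one state per episode at first, which hints at why plain TD learning can be slow on real systems.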
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brogi, Bharat Bhushan, E-mail: brogi-221179@yahoo.in; Ahluwalia, P. K.; Chand, Shyam
2015-06-24
A theoretical study of the Coulomb blockade effect on the transport properties (transmission probability and I-V characteristics) of variously configured coupled quantum dot systems has been carried out using the Non-Equilibrium Green Function (NEGF) formalism and the Equation of Motion (EOM) method in the presence of magnetic flux. The self-consistent approach and intra-dot Coulomb interaction are taken into account. As the key parameters of the coupled quantum dot system, such as dot-lead coupling, inter-dot tunneling and magnetic flux threading through the system, can be tuned, the effect of the asymmetry parameter and magnetic flux on this tuning is explored in the Coulomb blockade regime. The presence of the Coulomb blockade due to on-dot Coulomb interaction decreases the width of the transmission peak at energy level ε + U, and by adjusting the magnetic flux the swapping effect in the Fano peaks in asymmetric and symmetric parallel configurations is sustained despite the strong Coulomb blockade effect.
Pythran: enabling static optimization of scientific Python programs
NASA Astrophysics Data System (ADS)
Guelton, Serge; Brunet, Pierrick; Amini, Mehdi; Merlini, Adrien; Corbillon, Xavier; Raynaud, Alan
2015-01-01
Pythran is an open source static compiler that turns modules written in a subset of Python language into native ones. Assuming that scientific modules do not rely much on the dynamic features of the language, it trades them for powerful, possibly inter-procedural, optimizations. These optimizations include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE. In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical idioms such as expression templates. Unlike the Cython approach, Pythran input code remains compatible with the Python interpreter. Output code is generally as efficient as the annotated Cython equivalent, if not more, but without the backward compatibility loss.
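The claim that Pythran input stays compatible with the Python interpreter can be illustrated with a small module whose annotations live entirely in comments. This is a sketch: the function name `dprod` and its body are illustrative, and the export and OpenMP comment forms follow the annotation style described in the Pythran documentation as best understood here, so treat the exact annotation syntax as an assumption to check against that documentation.

```python
#pythran export dprod(float list, float list)
def dprod(xs, ys):
    """Dot product of two equal-length lists of floats."""
    total = 0.0
    # The next comment is an OpenMP-style annotation; plain Python ignores it.
    #omp parallel for reduction(+:total)
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


# Runs unchanged under the standard interpreter.
result = dprod([1.0, 2.0], [3.0, 4.0])
```

Because the type signature and the parallelism hint are ordinary comments, the same file can be imported directly or handed to the compiler, which is the contrast with Cython that the abstract draws.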
RCrawler: An R package for parallel web crawling and scraping
NASA Astrophysics Data System (ADS)
Khalil, Salim; Fakir, Mohamed
RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robot.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results.
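Two of the features listed above, robots.txt handling and duplicate content detection, can be sketched in a few lines. The example below is in Python rather than R and does not reproduce RCrawler's actual implementation; the robots.txt text, the URLs, the agent name, and the helper names are hypothetical, and hashing the page body is just one simple way to detect exact duplicates.

```python
import hashlib
from urllib import robotparser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def build_robot_rules(text):
    """Parse robots.txt text into a rule object from the standard library."""
    rules = robotparser.RobotFileParser()
    rules.parse(text.splitlines())
    return rules


def crawl_filter(pages, rules, agent="ExampleBot"):
    """Keep URLs that robots.txt allows and whose body has not been seen before."""
    seen_digests = set()
    kept = []
    for url, body in pages:
        if not rules.can_fetch(agent, url):
            continue  # excluded by robots.txt
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # exact duplicate of an earlier page
        seen_digests.add(digest)
        kept.append(url)
    return kept


pages = [
    ("http://example.com/a", "hello"),
    ("http://example.com/b", "hello"),
    ("http://example.com/private/c", "secret"),
    ("http://example.com/d", "world"),
]
kept = crawl_filter(pages, build_robot_rules(ROBOTS_TXT))
```

An exact hash only catches byte-identical pages; near-duplicate detection would need a similarity measure instead.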
Thermal stability analysis of the fine structure of solar prominences
NASA Technical Reports Server (NTRS)
Demoulin, Pascal; Malherbe, Jean-Marie; Schmieder, Brigitte; Raadu, Mickael A.
1986-01-01
The linear thermal stability of a 2D periodic structure (alternately hot and cold) in a uniform magnetic field is analyzed. The energy equation includes wave heating (assumed proportional to density), radiative cooling, and conduction both parallel and orthogonal to the magnetic field lines. The equilibrium is perturbed at constant gas pressure. With parallel conduction only, it is found to be unstable when the length scale l∥ is greater than 45 Mm. In that case, orthogonal conduction becomes important and stabilizes the structure when the length scale is smaller than 5 km. On the other hand, when the length scale is greater than 5 km, the thermal equilibrium is unstable, and the corresponding time scale is about 10,000 s: this result may be compared to observations showing that the lifetime of the fine structure of solar prominences is about one hour; consequently, our computations suggest that the size of the unresolved threads could be of the order of 10 km only.
Gold thread implantation promotes hair growth in human and mice
Kim, Jong-Hwan; Cho, Eun-Young; Kwon, Euna; Kim, Woo-Ho; Park, Jin-Sung; Lee, Yong-Soon
2017-01-01
Thread-embedding therapy has been widely applied for cosmetic purposes such as wrinkle reduction and skin tightening. In particular, gold thread has been reported to support connective tissue regeneration, but its role in hair biology remains largely unknown due to a lack of investigation. When we implanted gold thread and Happy Lift™ in a human patient for facial lifting, we unexpectedly found an increase in hair regrowth despite no use of hair growth medications. When embedded into the depilated dorsal skin of mice, gold thread or polyglycolic acid (PGA) thread, similarly to 5% minoxidil, significantly increased the number of hair follicles on day 14 after implantation. Moreover, hair re-growth promotion in the gold thread-implanted mice was significantly higher than that in the PGA thread group on day 11 after depilation. In particular, the skin tissue of gold thread-implanted mice showed stronger PCNA staining and higher collagen density compared with control mice. These results indicate that gold thread implantation can be an effective way to promote hair re-growth, although further confirmatory study is needed for more information on therapeutic mechanisms and long-term safety. PMID:29399026
NASA Astrophysics Data System (ADS)
Wu, Yuanfeng; Gao, Lianru; Zhang, Bing; Zhao, Haina; Li, Jun
2014-01-01
We present a parallel implementation of the optimized maximum noise fraction (G-OMNF) transform algorithm for feature extraction of hyperspectral images on commodity graphics processing units (GPUs). The proposed approach explored the algorithm data-level concurrency and optimized the computing flow. We first defined a three-dimensional grid, in which each thread calculates a sub-block data to easily facilitate the spatial and spectral neighborhood data searches in noise estimation, which is one of the most important steps involved in OMNF. Then, we optimized the processing flow and computed the noise covariance matrix before computing the image covariance matrix to reduce the original hyperspectral image data transmission. These optimization strategies can greatly improve the computing efficiency and can be applied to other feature extraction algorithms. The proposed parallel feature extraction algorithm was implemented on an Nvidia Tesla GPU using the compute unified device architecture and basic linear algebra subroutines library. Through the experiments on several real hyperspectral images, our GPU parallel implementation provides a significant speedup of the algorithm compared with the CPU implementation, especially for highly data parallelizable and arithmetically intensive algorithm parts, such as noise estimation. In order to further evaluate the effectiveness of G-OMNF, we used two different applications: spectral unmixing and classification for evaluation. Considering the sensor scanning rate and the data acquisition time, the proposed parallel implementation met the on-board real-time feature extraction.
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoemmen, Mark
2010-11-01
Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
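The core idea of TSQR, factoring row blocks independently and then factoring the stacked triangular factors, can be sketched in plain Python. This is not the Trilinos implementation: the local factorization below uses modified Gram-Schmidt instead of Householder QR to keep the code short, the reduction is a single flat step instead of a tree, only the R factor is returned, and all names and sizes are assumptions for illustration.

```python
import random


def qr_mgs(a):
    """Reduced QR of a row-major matrix by modified Gram-Schmidt.

    Returns (q_columns, r); assumes full column rank and rows >= columns.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    q = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for k in range(j):
            r[k][j] = sum(qi * vi for qi, vi in zip(q[k], v))
            v = [vi - r[k][j] * qi for vi, qi in zip(v, q[k])]
        r[j][j] = sum(x * x for x in v) ** 0.5
        q.append([x / r[j][j] for x in v])
    return q, r


def tsqr_r(a, block_rows):
    """R factor of a tall, skinny matrix via block-wise QR.

    Each block could be factored on a different processor; only the small
    triangular factors need to be communicated for the final step.
    """
    stacked = []
    for start in range(0, len(a), block_rows):
        _, r_block = qr_mgs(a[start:start + block_rows])
        stacked.extend(r_block)
    _, r_final = qr_mgs(stacked)
    return r_final


random.seed(0)
tall = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(400)]
r_tsqr = tsqr_r(tall, block_rows=100)
_, r_direct = qr_mgs(tall)
max_diff = max(
    abs(x - y)
    for row_t, row_d in zip(r_tsqr, r_direct)
    for x, y in zip(row_t, row_d)
)
```

Since both routes produce an upper-triangular R with a positive diagonal, the block-wise result should agree with the direct factorization up to rounding, while exchanging only four-by-four blocks between the hypothetical workers.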
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images
Wang, Yangping; Wang, Song
2016-01-01
The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU). PMID:28053653
Immobilization of human papillomavirus DNA probe for surface plasmon resonance imaging
NASA Astrophysics Data System (ADS)
Chong, Xinyuan; Ji, Yanhong; Ma, Suihua; Liu, Le; Liu, Zhiyi; Li, Yao; He, Yonghong; Guo, Jihua
2009-08-01
Human papillomavirus (HPV) is a double-stranded DNA virus with a diverse range of subspecies. Nearly 40 subspecies can infect the reproductive organs and cause high-risk diseases such as cervical carcinoma. In order to identify the subspecies of HPV DNA, we used the parallel scan spectral surface plasmon resonance (SPR) imaging technique, a novel two-dimensional bio-sensing method based on surface plasmon resonance that was proposed in our previous work, to study the immobilization of HPV DNA probes on a gold film. In the experiment, probes for four HPV DNA subspecies (HPV16, HPV18, HPV31, HPV58) are fixed on one gold film and incubated at constant temperature to obtain an HPV DNA probe microarray. We use the parallel scan spectral SPR imaging system to detect the refractive indices of the HPV DNA subspecies probes. The benefits of this new approach are high sensitivity, label-free detection, strong specificity and high throughput.
System and method for a parallel immunoassay system
Stevens, Fred J.
2002-01-01
A method and system for detecting a target antigen using massively parallel immunoassay technology. In this system, high affinity antibodies of the antigen are covalently linked to small beads or particles. The beads are exposed to a solution containing DNA-oligomer-mimics of the antigen. The mimics which are reactive with the covalently attached antibody or antibodies will bind to the appropriate antibody molecule on the bead. The particles or beads are then washed to remove any unbound DNA-oligomer-mimics and are then immobilized or trapped. The bead-antibody complexes are then exposed to a test solution which may contain the targeted antigens. If the antigen is present it will replace the mimic since it has a greater affinity for the respective antibody. The particles are then removed from the solution leaving a residual solution. This residual solution is applied to a DNA chip containing many samples of complementary DNA. If the DNA tag from a mimic binds with its complementary DNA, it indicates the presence of the target antigen. A fluorescent tag can be used to more easily identify the bound DNA tag.
Scheduler for multiprocessor system switch with selective pairing
Gara, Alan; Gschwind, Michael Karl; Salapura, Valentina
2015-01-06
System, method and computer program product for scheduling threads in a multiprocessing system with selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). The method configures the selective pairing facility to use checking to provide one highly reliable thread for high-reliability operation and allocates threads indicating a need for hardware checking to the corresponding processor cores. The method configures the selective pairing facility to provide multiple independent cores and allocates threads indicating inherent resilience to the corresponding processor cores.
Biological Nanomotors with a Revolution, Linear, or Rotation Motion Mechanism
Noji, Hiroyuki; Yengo, Christopher M.; Zhao, Zhengyi; Grainge, Ian
2016-01-01
SUMMARY The ubiquitous biological nanomotors were classified into two categories in the past: linear and rotation motors. In 2013, a third type of biomotor, revolution without rotation (http://rnanano.osu.edu/movie.html), was discovered and found to be widespread among bacteria, eukaryotic viruses, and double-stranded DNA (dsDNA) bacteriophages. This review focuses on recent findings about various aspects of motors, including chirality, stoichiometry, channel size, entropy, conformational change, and energy usage rate, in a variety of well-studied motors, including FoF1 ATPase, helicases, viral dsDNA-packaging motors, bacterial chromosome translocases, myosin, kinesin, and dynein. In particular, dsDNA translocases are used to illustrate how these features relate to the motion mechanism and how nature elegantly evolved a revolution mechanism to avoid coiling and tangling during lengthy dsDNA genome transportation in cell division. Motor chirality and channel size are two factors that distinguish rotation motors from revolution motors. Rotation motors use right-handed channels to drive the right-handed dsDNA, similar to the way a nut drives the bolt with threads in same orientation; revolution motors use left-handed motor channels to revolve the right-handed dsDNA. Rotation motors use small channels (<2 nm in diameter) for the close contact of the channel wall with single-stranded DNA (ssDNA) or the 2-nm dsDNA bolt; revolution motors use larger channels (>3 nm) with room for the bolt to revolve. Binding and hydrolysis of ATP are linked to different conformational entropy changes in the motor that lead to altered affinity for the substrate and allow work to be done, for example, helicase unwinding of DNA or translocase directional movement of DNA. PMID:26819321
Inatomi, Osamu; Bamba, Shigeki; Shioya, Makoto; Mochizuki, Yosuke; Ban, Hiromitsu; Tsujikawa, Tomoyuki; Saito, Yasuharu; Andoh, Akira; Fujiyama, Yoshihide
2013-02-14
Although endoscopic biliary stents have been accepted as part of palliative therapy for cases of malignant hilar obstruction, the optimal endoscopic management regime remains controversial. In this study, we evaluated the safety and efficacy of placing a threaded stent above the sphincter of Oddi (threaded inside plastic stents, threaded PS) and compared the results with those of other stent types. Patients with malignant hilar obstruction, including those requiring biliary drainage for stent occlusion, were selected. Patients received one of the following endoscopic indwelling stents: threaded PS, conventional plastic stents (conventional PS), or metallic stents (MS). Duration of stent patency and the incidence of complications were compared in these patients. Forty-two patients underwent placement of endoscopic indwelling stents (threaded PS = 12, conventional PS = 17, MS = 13). The median duration of threaded PS patency was significantly longer than that of conventional PS patency (142 vs. 32 days; P = 0.04, logrank test). The median duration of threaded PS and MS patency was not significantly different (142 vs. 150 days, P = 0.83). Stent migration did not occur in any group. Among patients who underwent threaded PS placement as a salvage therapy after MS obstruction due to tumor ingrowth, the median duration of MS patency was significantly shorter than that of threaded PS patency (123 vs. 240 days). Threaded PS are safe and effective in cases of malignant hilar obstruction; moreover, they are a suitable therapeutic option not only for initial drainage but also for salvage therapy.
Exploration of microfluidic devices based on multi-filament threads and textiles: A review
Nilghaz, A.; Ballerini, D. R.; Shen, W.
2013-01-01
In this paper, we review the recent progress in the development of low-cost microfluidic devices based on multifilament threads and textiles for semi-quantitative diagnostic and environmental assays. Hydrophilic multifilament threads are capable of transporting aqueous and non-aqueous fluids via capillary action and possess desirable properties for building fluid transport pathways in microfluidic devices. Thread can be sewn onto various support materials to form fluid transport channels without the need for the patterned hydrophobic barriers essential for paper-based microfluidic devices. Thread can also be used to manufacture fabrics which can be patterned to achieve suitable hydrophilic-hydrophobic contrast, creating hydrophilic channels which allow control of fluid flow. Furthermore, well-established textile patterning methods and combinations of hydrophilic and hydrophobic threads can be applied to fabricate microfluidic devices that meet low-cost and low-volume requirements. In this paper, we review the current limitations and shortcomings of multifilament thread and textile-based microfluidics, and the research efforts to date on the development of fluid flow control concepts and fabrication methods. We also present a summary of different methods for modelling capillary fluid flow in microfluidic thread and textile-based systems. Finally, we summarize the published work on thread surface treatment methods and the potential of combining multifilament thread with other materials to construct devices with greater functionality. We believe these will be important research focuses of thread- and textile-based microfluidics in future. PMID:24086179
Single-molecule DNA detection with an engineered MspA protein nanopore
Butler, Tom Z.; Pavlenok, Mikhail; Derrington, Ian M.; Niederweis, Michael; Gundlach, Jens H.
2008-01-01
Nanopores hold great promise as single-molecule analytical devices and biophysical model systems because the ionic current blockades they produce contain information about the identity, concentration, structure, and dynamics of target molecules. The porin MspA of Mycobacterium smegmatis has remarkable stability against environmental stresses and can be rationally modified based on its crystal structure. Further, MspA has a short and narrow channel constriction that is promising for DNA sequencing because it may enable improved characterization of short segments of a ssDNA molecule that is threaded through the pore. By eliminating the negative charge in the channel constriction, we designed and constructed an MspA mutant capable of electronically detecting and characterizing single molecules of ssDNA as they are electrophoretically driven through the pore. A second mutant with additional exchanges of negatively-charged residues for positively-charged residues in the vestibule region exhibited a factor of ≈20 higher interaction rates, required only half as much voltage to observe interaction, and allowed ssDNA to reside in the vestibule ≈100 times longer than the first mutant. Our results introduce MspA as a nanopore for nucleic acid analysis and highlight its potential as an engineerable platform for single-molecule detection and characterization applications. PMID:19098105
Iwatsubo, T.; Hasegawa, M.; Esaki, Y.; Ihara, Y.
1992-01-01
Immunocytochemically, neuropil threads (curly fibers) were investigated in the Alzheimer's disease brain using a confocal laser scanning fluorescence microscope by double labeling with tau/ubiquitin antibodies. Ubiquitin immunoreactivities were found to be lacking at one or both ends in more than 40% of tau-positive threads. Immunoelectron microscopy showed that bundles of paired helical filaments, which constitute neuropil threads, were positive for ubiquitin around their midportions, but often negative at their ends. Since it is reasonable to postulate that tau deposition as paired helical filaments precedes ubiquitination, the aforementioned observation suggests that the ends of the threads are newly formed portions, and thus the threads are often growing bidirectionally in small neuronal processes. PMID:1310831
Fatigue acceptance test limit criterion for larger diameter rolled thread fasteners
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kephart, A.R.
1997-05-01
This document describes a fatigue lifetime acceptance test criterion by which studs having rolled threads, larger than 1.0 inches in diameter, can be assured to meet minimum quality attributes associated with a controlled rolling process. This criterion is derived from a stress dependent, room temperature air fatigue database for test studs having 0.625 inch diameter threads of Alloys X-750 HTH and direct aged 625. Anticipated fatigue lives of larger threads are based on thread root elastic stress concentration factors which increase with increasing thread diameters. Over the thread size range of interest, a 30% increase in notch stress is equivalent to a factor of five (5X) reduction in fatigue life. The resulting diameter dependent fatigue acceptance criterion is normalized to the aerospace rolled thread acceptance standards for a 1.0 inch diameter, 0.125 inch pitch, Unified National thread with a controlled Root radius (UNR). Testing was conducted at a stress of 50% of the minimum specified material ultimate strength, 80 Ksi, and at a stress ratio (R) of 0.10. Limited test data for fastener diameters of 1.00 to 2.25 inches are compared to the acceptance criterion. Sensitivity of fatigue life of threads to test nut geometry variables was also shown to be dependent on notch stress conditions. Bearing surface concavity of the compression nuts and thread flank contact mismatch conditions can significantly affect the fastener fatigue life. Without improved controls these conditions could potentially provide misleading acceptance data. Alternate test nut geometry features are described and implemented in the rolled thread stud specification, MIL-DTL-24789(SH), to mitigate the potential effects on fatigue acceptance data.
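The statement that a 30% increase in notch stress corresponds to a fivefold reduction in fatigue life can be turned into a quick back-of-the-envelope calculation. The sketch below assumes a simple power-law relation between life and stress (life proportional to stress raised to the power of minus m); that functional form is an assumption for illustration and is not stated in the abstract, and the function names are hypothetical.

```python
import math


def implied_life_exponent(stress_factor=1.30, life_reduction=5.0):
    """Exponent m such that stress_factor ** m == life_reduction.

    Assumes fatigue life scales as stress ** (-m); this form is an assumption.
    """
    return math.log(life_reduction) / math.log(stress_factor)


def life_reduction_factor(stress_factor, m):
    """Factor by which life drops when notch stress rises by stress_factor."""
    return stress_factor ** m


m = implied_life_exponent()                 # about 6.1 under the assumed power law
check = life_reduction_factor(1.30, m)      # recovers the quoted factor of five
```

Under that assumption the implied exponent is a little above 6, which gives a feel for how strongly small changes in thread root stress concentration would affect the expected life of larger-diameter studs.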
Wedges for ultrasonic inspection
Gavin, Donald A.
1982-01-01
An ultrasonic transducer device is provided which is used in ultrasonic inspection of the material surrounding a threaded hole and which comprises a wedge of plastic or the like including a curved threaded surface adapted to be screwed into the threaded hole and a generally planar surface on which a conventional ultrasonic transducer is mounted. The plastic wedge can be rotated within the threaded hole to inspect for flaws in the material surrounding the threaded hole.
Apparatus for accurately preloading auger attachment means for frangible protective material
NASA Technical Reports Server (NTRS)
Wood, K. E.
1983-01-01
Apparatus for preloading a spring loaded threaded member is described. The apparatus is formed of three telescoping tubes. The innermost tube has means to prevent rotation of the threaded member. The middle tube is threadedly engaged with the threaded member and by axial movement applies a preload thereto. The outer tube engages a nut which may be rotated to retain the threaded member in axial position to maintain the preload.
Park, Duck-Gun; Song, Hoon; Kishore, M B; Vértesy, G; Lee, Duk-Hyun
2013-11-01
In this study, a magnetic sensor utilizing Planar Hall Resistance (PHR) and cyclic Voltammetry (CV) for detecting the radiation effect was fabricated. Specifically, we applied in parallel a PHR sensor and CV device to monitor the irradiation effect on DNA and protein respectively. Through parallel measurements, we demonstrated that the PHR sensor and CV are sensitive enough to measure irradiation effect. The PHR voltage decreased by magnetic nanobead labeled DNA was slightly recovered after gamma ray irradiation. The behavior of cdk inhibitor protein p21 having a sandwich structure of Au/protein G/Ab/Ag/Ab was checked by monitoring the cyclic Voltammetry signal in analyzing the gamma ray irradiation effect.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Otto, C.; Thomas, G.A.; Peticolas, W.L.; Rippe, K.
Raman spectra of the parallel-stranded duplex formed from the deoxyoligonucleotides 5′-d-((A)₁₀TAATTTTAAATATTT)-3′ (D1) and 5′-d((T)₁₀ATTAAAATTTATAAA)-3′ (D2) in H₂O and D₂O have been acquired. The spectra of the parallel-stranded DNA are then compared to the spectra of the antiparallel double helix formed from the deoxyoligonucleotides D1 and 5′-d(AAATATTTAAAATTA-(T)₁₀)-3′ (D3). The Raman spectra of the antiparallel-stranded (aps) duplex are reminiscent of the spectra of poly(d(A))·poly(d(T)) and a B-form structure similar to that adopted by the homopolymer duplex is assigned to the antiparallel double helix. The spectra of the parallel-stranded (ps) and antiparallel-stranded duplexes differ significantly due to changes in helical organization, i.e., base pairing, base stacking, and backbone conformation. Large changes observed in the carbonyl stretching region implicate the involvement of the C(2) carbonyl of thymine in base pairing. The interaction of adenine with the C(2) carbonyl of thymine is consistent with formation of reverse Watson-Crick base pairing in parallel-stranded DNA. Phosphate-furanose vibrations similar to those observed for B-form DNA of heterogeneous sequence and high A,T content are observed at 843 and 1,092 cm⁻¹ in the spectra of the parallel-stranded duplex.
Effect of thread shape on screw stress concentration by photoelastic measurements
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dragoni, E.
1994-11-01
The screw stress concentration for six nut-bolt connections embodying three different thread profiles and two nut shapes is measured photoelastically. Buttress (nearly zero flank angle), trapezoidal (15-deg flank angle), and triangular (30-deg flank angle) thread forms are examined in combination with standard and lip-type nuts. The effect of the thread profile on the screw stress concentration appears to be dependent upon the kind of nut considered. If the fastening incorporates a standard nut, the buttress thread is stronger than the triangular one, which, in turn, behaves better than the trapezoidal contour. The improvement is roughly a 20% reduction in the stress concentration factor from the trapezoidal to the buttress thread. In the case of the lip nut, conversely, this tendency is somewhat reversed, with the trapezoidal thread performing slightly (but not decidedly) better than the other two shapes. Finally, averaged over all three thread forms, the lip nut exhibits a stress concentration factor which is about 50% lower than that of the standard nut.
NASA Technical Reports Server (NTRS)
Weddendorf, Bruce (Inventor)
1994-01-01
A quick connect fastener and method of use is presented wherein the quick connect fastener is suitable for replacing available bolts and screws, the quick connect fastener being capable of installation by simply pushing a threaded portion of the connector into a member receptacle hole, the inventive apparatus being comprised of an externally threaded fastener having a threaded portion slidably mounted upon a stud or bolt shaft, wherein the externally threaded fastener portion is expandable by a preloaded spring member. The fastener, upon contact with the member receptacle hole, has the capacity of presenting cylindrical threads of a reduced diameter for insertion purposes and once inserted into the receiving threads of the receptacle member hole, are expandable for engagement of the receptacle hole threads forming a quick connect of the fastener and the member to be fastened, the quick connect fastener can be further secured by rotation after insertion, even to the point of locking engagement, the quick connect fastener being disengagable only by reverse rotation of the mated thread engagement.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mariscal, R.N.; McLean, R.B.; Hand, C.
1977-01-01
Unlike most nematocysts, undischarged spirocyst threads bear hollow tubules rather than spines. The undischarged tubules are interconnected in hexagonal arrays and appear to be arranged in bundles along the length of the thread. Although the wall of the thread is folded in length and width, the tubules are not. Upon discharge and contact with sea water, the tubules solubilize and adhere to various substrates and prey. Traction between such objects and the everting thread causes the tubules to spin out into a web or meshwork of fine microfibrillae. Lack of contact of the everting thread with objects results in the tubules forming small droplets of partially solubilized material, some of which appear to be arranged in a helical pattern around the thread. The web or meshwork formed by the solubilized tubules in contact with various substrates probably serves to increase significantly the surface area and adhesive properties of the everted spirocyst thread.
DNA Barcoding of Zooplankton in the Hampton Roads Area: A Biodiversity Assessment
NASA Astrophysics Data System (ADS)
Salcedo, A.; Rodríguez, Á. E.; Gibson, D. M.
2016-02-01
The study of zooplankton biodiversity and distribution is crucial to understanding oceanic ecosystems and anticipating the effects of climate change. Previously, identification of zooplankton relied on morphological identification by expert taxonomists. DNA barcoding, a technique that uses the mitochondrial DNA (mtDNA) Cytochrome Oxidase 1 (CO1) gene, is widely used for taxonomic identification. This molecular technique will therefore be used to begin a detailed characterization of zooplankton diversity, abundance and community structure in the Hampton Roads Area (HRA). Stations 1 (Jones Creek) and 3 (lower Chesapeake Bay) were sampled on June 19, 2015. Stations 1, 2 (James River), and 3 were sampled in September 2015. Zooplankton samples were collected in triplicate with a 0.5 m, 200 µm mesh net. Physical parameters (dissolved oxygen, salinity, temperature, and water transparency) were measured. Species identified as Opisthonema oglinum (Atlantic Thread Herring) and Paracalanus parvus copepods were found at station 3; Anchoa mitchilli and Acartia tonsa copepods were found at stations 1 and 3. This study indicates that mtDNA-CO1 barcoding is suitable for identifying zooplankton to the species level and helps validate DNA barcoding as a faster, more accurate taxonomic approach. The long-term objective of this project is to provide a comprehensive assessment of zooplankton in the HRA and to generate a reference record for broad monitoring programs, which is vital for better understanding and management of ecologically and commercially important species.
Study on the SPR responses of various DNA probe concentrations by parallel scan spectral SPR imaging
NASA Astrophysics Data System (ADS)
Ma, Suihua; Liu, Le; Lu, Weiping; Zhang, Yaou; He, Yonghong; Guo, Jihua
2008-12-01
SPR sensors have become a highly sensitive, label-free method for characterizing and quantifying chemical and biochemical interactions. However, the relations between the SPR refractive index response and the properties (such as concentration) of biochemical probes are still lacking. In this paper, an experimental study on the SPR responses of various concentrations of Legionella pneumophila mip DNA probes is presented. We developed a novel two-dimensional SPR sensing technique, parallel scan spectral SPR imaging, to detect an array of mip gene probes. This technique offers quantitative refractive index information with a high sensing throughput. By detecting mip DNA probes at different concentrations, we obtained the relations between the SPR refractive index response and the concentrations of mip DNA probes. These results are valuable for designing and developing SPR-based mip gene biochips.
Wang, Zhaocai; Ji, Zuwen; Wang, Xiaoming; Wu, Tunhua; Huang, Wei
2017-12-01
DNA computing, a promising approach to computationally intractable problems, is an emerging research area spanning mathematics, computer science and molecular biology. The task scheduling problem, a well-known NP-complete problem, assigns n jobs to m individuals and seeks the minimum execution time of the last individual to finish. In this paper, we use a biologically inspired computational model and describe a new parallel algorithm that solves the task scheduling problem with basic DNA molecular operations. We design flexible-length DNA strands to represent elements of the allocation matrix, apply appropriate biological experiment operations, and obtain solutions of the task scheduling problem in the proper length range with less than O(n^2) time complexity. Copyright © 2017. Published by Elsevier B.V.
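To make the problem statement above concrete, the following is a minimal brute-force sketch of the task scheduling objective in ordinary software. It is not the DNA-based algorithm of the paper. The function name `min_makespan` and the assumption that each job takes the same time on every individual are illustrative choices, not details taken from the abstract.

```python
from itertools import product


def min_makespan(job_times, m):
    """Exhaustively assign jobs to m individuals and minimise the finish
    time of the last individual (the makespan).

    Assumption (hypothetical): job_times[j] is the duration of job j and
    does not depend on which individual executes it.
    """
    best_time, best_assignment = None, None
    # Every tuple maps job index -> individual index; there are m**n tuples,
    # which is why exhaustive search is only feasible for tiny instances.
    for assignment in product(range(m), repeat=len(job_times)):
        loads = [0] * m
        for job, individual in enumerate(assignment):
            loads[individual] += job_times[job]
        makespan = max(loads)
        if best_time is None or makespan < best_time:
            best_time, best_assignment = makespan, assignment
    return best_time, best_assignment
```

The search space grows as m^n, which illustrates why the problem is treated as intractable for conventional exhaustive search and why massively parallel models are of interest.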
Waugh, Caryll; Cromer, Deborah; Grimm, Andrew; Chopra, Abha; Mallal, Simon; Davenport, Miles; Mak, Johnson
2015-04-09
Massive, parallel sequencing is a potent tool for dissecting the regulation of biological processes by revealing the dynamics of the cellular RNA profile under different conditions. Similarly, massive, parallel sequencing can be used to reveal the complexity of viral quasispecies that are often found in the RNA virus-infected host. However, the production of cDNA libraries for next-generation sequencing (NGS) necessitates the reverse transcription of RNA into cDNA and the amplification of the cDNA template using PCR, which may introduce artefacts in the form of phantom nucleic acid species that can bias the composition and interpretation of original RNA profiles. Using HIV as a model, we have characterised the major sources of error during the conversion of viral RNA to cDNA, namely excess RNA template and the RNaseH activity of the polymerase enzyme, reverse transcriptase. In addition, we have analysed the effect of PCR cycle on detection of recombinants and assessed the contribution of transfection of highly similar plasmid DNA to the formation of recombinant species during the production of our control viruses. We have identified RNA template concentrations, RNaseH activity of reverse transcriptase, and PCR conditions as key parameters that must be carefully optimised to minimise chimeric artefacts. Using our optimised RT-PCR conditions, in combination with our modified PCR amplification procedure, we have developed a reliable technique for accurate determination of RNA species using NGS technology.
Self-assembled catalytic DNA nanostructures for synthesis of para-directed polyaniline.
Wang, Zhen-Gang; Zhan, Pengfei; Ding, Baoquan
2013-02-26
Templated synthesis has been considered as an efficient approach to produce polyaniline (PANI) nanostructures. The features of DNA molecules enable a DNA template to be an intriguing template for fabrication of emeraldine PANI. In this work, we assembled HRP-mimicking DNAzyme with different artificial DNA nanostructures, aiming to manipulate the molecular structures and morphologies of PANI nanostructures through the controlled DNA self-assembly. UV-vis absorption spectra were used to investigate the molecular structures of PANI and monitor kinetic growth of PANI. It was found that PANI was well-doped at neutral pH and the redox behaviors of the resultant PANI were dependent on the charge density of the template, which was controlled by the template configurations. CD spectra indicated that the PANI threaded tightly around the helical DNA backbone, resulting in the right handedness of PANI. These reveal the formation of the emeraldine form of PANI that was doped by the DNA. The morphologies of the resultant PANI were studied by AFM and SEM. It was concluded from the imaging and spectroscopic kinetic results that PANI grew preferably from the DNAzyme sites and then expanded over the template to form 1D PANI nanostructures. The strategy of the DNAzyme-DNA template assembly brings several advantages in the synthesis of para-coupling PANI, including the region-selective growth of PANI, facilitating the formation of a para-coupling structure and facile regulation. We believe this study contributes significantly to the fabrication of doped PANI nanopatterns with controlled complexity, and the development of DNA nanotechnology.
NASA Astrophysics Data System (ADS)
Ivlev, B.
2017-07-01
Unusual chemical bonds are proposed. Each bond is characterized by a thread of small radius, 10^-11 cm, extended between two nuclei in a molecule. An analogue of a potential well, with a depth on the MeV scale, is formed within the thread. This occurs due to the local reduction of zero point electromagnetic energy, similar to the formation of the Casimir well. The electron-photon interaction alone is not sufficient for formation of the thread state. The mechanism of electron mass generation is involved in the close vicinity, 10^-16 cm, of the thread. Thread bonds are stable and cannot be created or destroyed in chemical or optical processes.
Flare particle acceleration in the interaction of twisted coronal flux ropes
NASA Astrophysics Data System (ADS)
Threlfall, J.; Hood, A. W.; Browning, P. K.
2018-03-01
Aims: The aim of this work is to investigate and characterise non-thermal particle behaviour in a three-dimensional (3D) magnetohydrodynamical (MHD) model of unstable multi-threaded flaring coronal loops. Methods: We have used a numerical scheme which solves the relativistic guiding centre approximation to study the motion of electrons and protons. The scheme uses snapshots from high resolution numerical MHD simulations of coronal loops containing two threads, where a single thread becomes unstable and (in one case) destabilises and merges with an additional thread. Results: The particle responses to the reconnection and fragmentation in MHD simulations of two loop threads are examined in detail. We illustrate the role played by uniform background resistivity and distinguish this from the role of anomalous resistivity using orbits in an MHD simulation where only one thread becomes unstable without destabilising further loop threads. We examine the (scalable) orbit energy gains and final positions recovered at different stages of a second MHD simulation wherein a secondary loop thread is destabilised by (and merges with) the first thread. We compare these results with other theoretical particle acceleration models in the context of observed energetic particle populations during solar flares.
Long-term effect of the insoluble thread-lifting technique.
Fukaya, Mototsugu
2017-01-01
Although the thread-lifting technique for sagging faces has become more common and popular, medical literature evaluating its effects is scarce. Studies on its long-term prognosis are particularly uncommon. One hundred individuals who had previously undergone insoluble thread-lifting were retrospectively investigated. Photos in frontal and oblique views from the first and last visits were evaluated by six female individuals by guessing the patients' ages. The mean guessed age was defined as the apparent age, and the difference between the real and apparent ages was defined as the youth value. The difference between the youth values before and after the thread-lift was defined as the rejuvenation effect and analyzed in relation to the time since the operation, the number of threads used and the number of thread-lift operations performed. The rejuvenation effect decreased over the first year after the operation, but showed an increasing trend thereafter. The rejuvenation effect increased with the number of threads used and the number of thread-lift operations performed. The insoluble thread-lifting technique appears to be associated with both early and late effects. The rejuvenation effect appeared to decrease during the first year, but increased thereafter. A multicenter trial is necessary to confirm these findings.
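The chain of definitions in this abstract (apparent age, youth value, rejuvenation effect) can be written out as a short worked sketch. The function names and the sign convention (real age minus apparent age, so that a positive youth value means the patient looks younger than they are) are assumptions made for illustration; the abstract does not state the sign explicitly.

```python
from statistics import mean


def apparent_age(guessed_ages):
    """Apparent age: the mean of the ages guessed by the evaluators."""
    return mean(guessed_ages)


def youth_value(real_age, guessed_ages):
    """Youth value: real age minus apparent age (assumed sign convention)."""
    return real_age - apparent_age(guessed_ages)


def rejuvenation_effect(real_before, guesses_before, real_after, guesses_after):
    """Rejuvenation effect: youth value after the thread-lift minus the
    youth value before it."""
    return (youth_value(real_after, guesses_after)
            - youth_value(real_before, guesses_before))
```

For example, with hypothetical numbers, a patient aged 50 at the first visit who is guessed to be 50 on average, and aged 52 at the last visit who is guessed to be 48 on average, has youth values of 0 and 4 and therefore a rejuvenation effect of 4 years.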
Lighting Up the Thioflavin T by Parallel-Stranded TG(GA)n DNA Homoduplexes.
Zhu, Jinbo; Yan, Zhiqiang; Zhou, Weijun; Liu, Chuanbo; Wang, Jin; Wang, Erkang
2018-06-22
Thioflavin T (ThT) was once regarded as a specific fluorescent probe for the human telomeric G-quadruplex, but in recent years other kinds of DNA have also been found to bind ThT. Herein, we focus on G-rich parallel-stranded DNA and use fluorescence, absorbance, circular dichroism, and surface plasmon resonance spectroscopy to investigate its interaction with ThT. Pyrene labeling and molecular modeling are applied to unveil the binding mechanism. We find that a new class of non-G-quadruplex G-rich parallel-stranded (ps) DNA with the sequence TG(GA)n can bind ThT and increase its fluorescence, with an enhancement ability superior to that of the G-quadruplex. The optimal binding specificity for ThT is conferred by two parts. The first part is composed of the two bases TG at the 5' end, which form a critical domain and play an important role in the formation of the binding site for ThT. The second part consists of the remaining alternating d(GA) bases, which form the ps homoduplex and cooperate with the TG bases at the 5' end to bind ThT.
Nanopore arrays in a silicon membrane for parallel single-molecule detection: DNA translocation
NASA Astrophysics Data System (ADS)
Zhang, Miao; Schmidt, Torsten; Jemt, Anders; Sahlén, Pelin; Sychugov, Ilya; Lundeberg, Joakim; Linnros, Jan
2015-08-01
Optical nanopore sensing offers great potential in single-molecule detection, genotyping, or DNA sequencing for high-throughput applications. However, one of the bottlenecks for fluorophore-based biomolecule sensing is the lack of an optically optimized membrane with a large array of nanopores that has a large pore-to-pore distance, small variation in pore size and low background photoluminescence (PL). Here, we demonstrate parallel detection of single-fluorophore-labeled DNA strands (450 bp) translocating through an array of silicon nanopores that fulfills the above-mentioned requirements for optical sensing. The nanopore array was fabricated using electron beam lithography and anisotropic etching followed by electrochemical etching, resulting in pore diameters down to ∼7 nm. The DNA translocation measurements were performed in a conventional wide-field microscope tailored for effective background PL control. The individual nanopore diameter was found to have a substantial effect on the translocation velocity, where smaller openings slow the translocation enough for the event to be clearly detectable in the fluorescence. Our results demonstrate that a uniform silicon nanopore array combined with wide-field optical detection is a promising alternative with which to realize massively parallel single-molecule detection.
Thread Migration in the Presence of Pointers
NASA Technical Reports Server (NTRS)
Cronk, David; Haines, Matthew; Mehrotra, Piyush
1996-01-01
Dynamic migration of lightweight threads supports both data locality and load balancing. However, migrating threads that contain pointers referencing data in both the stack and heap remains an open problem. In this paper we describe a technique by which threads with pointers referencing both stack and non-shared heap data can be migrated such that the pointers remain valid after migration. As a result, threads containing pointers can now be migrated between processors in a homogeneous distributed memory environment.
Real-time inextensible surgical thread simulation.
Xu, Lang; Liu, Qian
2018-03-27
This paper discusses a real-time simulation method for inextensible surgical thread based on the Cosserat rod theory using position-based dynamics (PBD). The method realizes stable twining and knotting of surgical thread while including inextensibility, bending, twisting and coupling effects. The Cosserat rod theory is used to model the nonlinear elastic behavior of surgical thread. The surgical thread model is solved with PBD to achieve a real-time, extremely stable simulation. Because surgical thread has a one-dimensional linear structure, a direct solution of the distance constraint based on the tridiagonal matrix algorithm is used to enhance stretching resistance in every constraint projection iteration. In addition, continuous collision detection and collision response guarantee a large time step and high performance. Furthermore, friction is integrated into the constraint projection process to stabilize the twining of multiple threads and complex contact situations. Comparisons with existing methods show that the surgical thread maintains constant length under large deformation after the direct distance constraint is applied in our method. The twining and knotting of multiple threads correspond to stable solutions to contact and friction forces. A surgical suture scene is also modeled to demonstrate the practicality and simplicity of our method. Our method achieves stable and fast simulation of inextensible surgical thread. Benefiting from the unified particle framework, the rigid body, elastic rod, and soft body can be simultaneously simulated. The method is appropriate for applications in virtual surgery that require multiple dynamic bodies.
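The tridiagonal matrix algorithm named in this abstract (also known as the Thomas algorithm) solves a tridiagonal linear system in O(n) operations, which is what makes a direct solve over a chain of particles affordable per iteration. The sketch below is a generic version of that algorithm only; the function name `solve_tridiagonal` and its argument layout are illustrative assumptions, and the way the authors assemble their distance-constraint system is not described in the abstract.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower: sub-diagonal entries, length n-1 (lower[i-1] sits in row i)
    diag:  main diagonal entries, length n
    upper: super-diagonal entries, length n-1 (upper[i] sits in row i)
    rhs:   right-hand side, length n
    Assumes the system is well conditioned (e.g. diagonally dominant),
    since no pivoting is performed.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward sweep: eliminate the sub-diagonal.
    if n > 1:
        c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

A chain of particles couples each constraint only to its two neighbours, so the resulting system has this banded shape, which is consistent with the abstract's remark about the one-dimensional linear structure of the thread.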
High precision optomechanical assembly using threads as mechanical reference
NASA Astrophysics Data System (ADS)
Lamontagne, Frédéric; Desnoyers, Nichola; Bergeron, Guy; Cantin, Mario
2016-09-01
A convenient method to assemble optomechanical components is to use a threaded interface. For example, lenses are often secured inside barrels using threaded rings. In other cases, multiple optical sub-assemblies such as lens barrels can be threaded to each other. Threads have the advantages of providing a simple assembly method, being easy to manufacture, and offering a compact mechanical design. On the other hand, threads are not considered to provide accurate centering between parts because of the assembly clearance between the inner and outer threads. For that reason, threads are often used in conjunction with precision cylindrical surfaces to limit the radial clearance between the parts to be centered. Tight manufacturing tolerances are therefore needed on these pilot diameters, which affects the cost of the optical assembly. This paper presents a new optomechanical approach that uses threads as a mechanical reference. This method relies on geometric principles to auto-center parts to each other with a very low centering error that is usually less than 5 μm. The method makes it possible to auto-center an optical group in a main barrel, to perform an axial adjustment of an optical group inside a main barrel, and to stack multiple barrels. In conjunction with the lens auto-centering method, which also uses threads as a mechanical reference, this solution opens new possibilities for realizing a variety of high precision optomechanical assemblies at lower cost.
NASA Astrophysics Data System (ADS)
Olson, Richard F.
2013-05-01
Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity, emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Massively Parallel DNA Sequencing Facilitates Diagnosis of Patients with Usher Syndrome Type 1
Yoshimura, Hidekane; Iwasaki, Satoshi; Nishio, Shin-ya; Kumakawa, Kozo; Tono, Tetsuya; Kobayashi, Yumiko; Sato, Hiroaki; Nagai, Kyoko; Ishikawa, Kotaro; Ikezono, Tetsuo; Naito, Yasushi; Fukushima, Kunihiro; Oshikawa, Chie; Kimitsuki, Takashi; Nakanishi, Hiroshi; Usami, Shin-ichi
2014-01-01
Usher syndrome is an autosomal recessive disorder manifesting hearing loss, retinitis pigmentosa and vestibular dysfunction, and having three clinical subtypes. Usher syndrome type 1 is the most severe subtype due to its profound hearing loss, lack of vestibular responses, and retinitis pigmentosa that appears in prepuberty. Six of the corresponding genes have been identified, making early diagnosis through DNA testing possible, with many immediate and several long-term advantages for patients and their families. However, the conventional genetic techniques, such as direct sequence analysis, are both time-consuming and expensive. Targeted exon sequencing of selected genes using massively parallel DNA sequencing technology will potentially enable us to systematically tackle previously intractable monogenic disorders and improve molecular diagnosis. Using this technique combined with direct sequence analysis, we screened 17 unrelated Usher syndrome type 1 patients and detected probable pathogenic variants in 16 of them (94.1%), each of whom carried at least one mutation. Seven patients had the MYO7A mutation (41.2%), which is the most common type in Japanese. Most of the mutations were detected only by massively parallel DNA sequencing. We report here four patients who had probable pathogenic mutations in two different Usher syndrome type 1 genes, and one case of MYO7A/PCDH15 digenic inheritance. This is the first report of Usher syndrome mutation analysis using massively parallel DNA sequencing and of the frequency of Usher syndrome type 1 genes in Japanese. Mutation screening using this technique has the power to quickly identify mutations of many causative genes while maintaining cost-benefit performance. In addition, the simultaneous mutation analysis of large numbers of genes is useful for detecting mutations in different genes that are possibly disease modifiers or of digenic inheritance. PMID:24618850
2013-01-01
Background: Although endoscopic biliary stents have been accepted as part of palliative therapy for cases of malignant hilar obstruction, the optimal endoscopic management regime remains controversial. In this study, we evaluated the safety and efficacy of placing a threaded stent above the sphincter of Oddi (threaded inside plastic stents, threaded PS) and compared the results with those of other stent types. Methods: Patients with malignant hilar obstruction, including those requiring biliary drainage for stent occlusion, were selected. Patients received one of the following endoscopic indwelling stents: threaded PS, conventional plastic stents (conventional PS), or metallic stents (MS). Duration of stent patency and the incidence of complications were compared in these patients. Results: Forty-two patients underwent placement of endoscopic indwelling stents (threaded PS = 12, conventional PS = 17, MS = 13). The median duration of threaded PS patency was significantly longer than that of conventional PS patency (142 vs. 32 days; P = 0.04, logrank test). The median durations of threaded PS and MS patency were not significantly different (142 vs. 150 days, P = 0.83). Stent migration did not occur in any group. Among patients who underwent threaded PS placement as a salvage therapy after MS obstruction due to tumor ingrowth, the median duration of MS patency was significantly shorter than that of threaded PS patency (123 vs. 240 days). Conclusions: Threaded PS are safe and effective in cases of malignant hilar obstruction; moreover, they are a suitable therapeutic option not only for initial drainage but also for salvage therapy. PMID:23410217
Federal Register 2010, 2011, 2012, 2013, 2014
2013-02-25
... DEPARTMENT OF COMMERCE International Trade Administration [A-570-932] Certain Steel Threaded Rod... Preliminary Determination of the circumvention inquiry concerning the antidumping duty order on certain steel threaded rod ("steel threaded rod") from the People's Republic of China ("PRC"). The period of...
Grindon, Christina; Harris, Sarah; Evans, Tom; Novik, Keir; Coveney, Peter; Laughton, Charles
2004-07-15
Molecular modelling played a central role in the discovery of the structure of DNA by Watson and Crick. Today, such modelling is done on computers: the more powerful these computers are, the more detailed and extensive can be the study of the dynamics of such biological macromolecules. To fully harness the power of modern massively parallel computers, however, we need to develop and deploy algorithms which can exploit the structure of such hardware. The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a scalable molecular dynamics code including long-range Coulomb interactions, which has been specifically designed to function efficiently on parallel platforms. Here we describe the implementation of the AMBER98 force field in LAMMPS and its validation for molecular dynamics investigations of DNA structure and flexibility against the benchmark of results obtained with the long-established code AMBER6 (Assisted Model Building with Energy Refinement, version 6). Extended molecular dynamics simulations on the hydrated DNA dodecamer d(CTTTTGCAAAAG)(2), which has previously been the subject of extensive dynamical analysis using AMBER6, show that it is possible to obtain excellent agreement in terms of static, dynamic and thermodynamic parameters between AMBER6 and LAMMPS. In comparison with AMBER6, LAMMPS shows greatly improved scalability in massively parallel environments, opening up the possibility of efficient simulations of order-of-magnitude larger systems and/or for order-of-magnitude greater simulation times.
A hierarchical wavefront reconstruction algorithm for gradient sensors
NASA Astrophysics Data System (ADS)
Bharmal, Nazim; Bitenc, Urban; Basden, Alastair; Myers, Richard
2013-12-01
ELT-scale extreme adaptive optics systems will require new approaches to compute the wavefront suitably quickly, when the computational burden of applying a matrix-vector multiplication (MVM) is no longer practical. An approach is demonstrated here which is hierarchical in transforming wavefront slopes from a wavefront sensor (WFS) into a wavefront, and then to actuator values. First, simple integration in 1D is used to create 1D-wavefront estimates with unknown starting points at the edges of independent spatial domains. Second, these starting points are estimated globally. Because these starting points are a sub-set of the overall grid where wavefront values are to be estimated, sparse representations are produced and numerical complexity can be chosen by the spacing of the starting point grid relative to the overall grid. Using a combination of algebraic expressions, sparse representation, and a conjugate gradient solver, the number of non-parallelized operations for reconstruction on a 100x100 sub-aperture sized problem is ~600,000 or O(N^(3/2)), which is approximately the same as for each thread of a MVM solution parallelized over 100 threads. To reduce the effects of noise propagation within each domain, a noise reduction algorithm can be applied which ensures the continuity of the wavefront. Applying this additional step has a cost of ~1,200,000 operations. We conclude by briefly discussing how the final step of converting from wavefront to actuator values can be achieved.
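The first stage described in this abstract, simple 1D integration of slopes within an independent domain, can be illustrated with a short sketch. This covers only that first step under simplifying assumptions: the function name `integrate_slopes`, the uniform `spacing` parameter and the rectangle-rule summation are illustrative choices, and the global estimation of starting points with a sparse conjugate gradient solve is not shown.

```python
def integrate_slopes(slopes, spacing=1.0, start=0.0):
    """Cumulatively sum measured slopes along one line of sub-apertures.

    Returns len(slopes) + 1 wavefront samples. The value `start` is the
    unknown starting point of the domain: changing it shifts the whole
    1D estimate by a constant, which is why the starting points have to
    be estimated separately in a second, global step.
    """
    wavefront = [start]
    for slope in slopes:
        # Rectangle-rule integration: each slope advances the wavefront
        # by slope * spacing across one sub-aperture.
        wavefront.append(wavefront[-1] + slope * spacing)
    return wavefront
```

Two calls that differ only in `start` return profiles with identical differences between neighbouring samples, so the slopes alone cannot fix the offset of each domain.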
Valve actuator for internal combustion engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Uchida, T.
1987-06-16
A valve actuating mechanism is described for an overhead valve and overhead cam type internal combustion engine in which the camshaft is positioned above and between the valve and a cam follower seat member in a cylinder head of the engine. The cam follower seat member is threadedly mounted in the cylinder head and has a semispherical recess facing upwardly. A cam follower has an adjustable bolt threadedly received in one end of the cam follower. The adjustable bolt has a spherical fulcrum engaging the semispherical recess of the seat member. The cam follower also has a downwardly facing means on the other end for engaging the valve and an upwardly facing slipper face for sliding engagement with a cam on the camshaft. The cam is adapted to rotate across the slipper face in the direction of the valve. The slipper face has a surface shape for engaging the cam at the start of valve-lifting movement of the cam follower at a point through which a line tangent to the slipper face is substantially parallel to a line through contact points between the cam follower, the seat member and the valve, for minimizing the lateral forces imposed on the cam follower by the cam at the start of the valve-lifting movement.