k-RP*{sub s}: A scalable distributed data structure for high-performance multi-attribute access
DOE Office of Scientific and Technical Information (OSTI.GOV)
Litwin, W.; Neimat, M.A.
k-RP*{sub s} is a new data structure for scalable multicomputer files with multi-attribute (k-d) keys. We discuss the k-RP*{sub s} file evolution and search algorithms. Performance analysis shows that a k-RP*{sub s} file can be much larger and orders of magnitude faster than a traditional k-d file. The speed-up is especially important for range and partial match searches that are often impractical with traditional k-d files. This opens up a new perspective for many applications.
LVFS: A Scalable Petabye/Exabyte Data Storage System
NASA Astrophysics Data System (ADS)
Golpayegani, N.; Halem, M.; Masuoka, E. J.; Ye, G.; Devine, N. K.
2013-12-01
Managing petabytes of data with hundreds of millions of files is the first step necessary towards an effective big data computing and collaboration environment in a distributed system. We describe here the MODAPS LAADS Virtual File System (LVFS), a new storage architecture which replaces the previous MODAPS operational Level 1 Land Atmosphere Archive Distribution System (LAADS) NFS based approach to storing and distributing datasets from several instruments, such as MODIS, MERIS, and VIIRS. LAADS is responsible for the distribution of over 4 petabytes of data and over 300 million files across more than 500 disks. We present here the first LVFS big data comparative performance results and new capabilities not previously possible with the LAADS system. We consider two aspects in addressing inefficiencies of massive scales of data. First, is dealing in a reliable and resilient manner with the volume and quantity of files in such a dataset, and, second, minimizing the discovery and lookup times for accessing files in such large datasets. There are several popular file systems that successfully deal with the first aspect of the problem. Their solution, in general, is through distribution, replication, and parallelism of the storage architecture. The Hadoop Distributed File System (HDFS), Parallel Virtual File System (PVFS), and Lustre are examples of such file systems that deal with petabyte data volumes. The second aspect deals with data discovery among billions of files, the largest bottleneck in reducing access time. The metadata of a file, generally represented in a directory layout, is stored in ways that are not readily scalable. This is true for HDFS, PVFS, and Lustre as well. Recent experimental file systems, such as Spyglass or Pantheon, have attempted to address this problem through redesign of the metadata directory architecture. LVFS takes a radically different architectural approach by eliminating the need for a separate directory within the file system. The LVFS system replaces the NFS disk mounting approach of LAADS and utilizes the already existing highly optimized metadata database server, which is applicable to most scientific big data intensive compute systems. Thus, LVFS ties the existing storage system with the existing metadata infrastructure system which we believe leads to a scalable exabyte virtual file system. The uniqueness of the implemented design is not limited to LAADS but can be employed with most scientific data processing systems. By utilizing the Filesystem In Userspace (FUSE), a kernel module available in many operating systems, LVFS was able to replace the NFS system while staying POSIX compliant. As a result, the LVFS system becomes scalable to exabyte sizes owing to the use of highly scalable database servers optimized for metadata storage. The flexibility of the LVFS design allows it to organize data on the fly in different ways, such as by region, date, instrument or product without the need for duplication, symbolic links, or any other replication methods. We proposed here a strategic reference architecture that addresses the inefficiencies of scientific petabyte/exabyte file system access through the dynamic integration of the observing system's large metadata file.
An Ephemeral Burst-Buffer File System for Scientific Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Teng; Moody, Adam; Yu, Weikuan
BurstFS is a distributed file system for node-local burst buffers on high performance computing systems. BurstFS presents a shared file system space across the burst buffers so that applications that use shared files can access the highly-scalable burst buffers without changing their applications.
Distributed PACS using distributed file system with hierarchical meta data servers.
Hiroyasu, Tomoyuki; Minamitani, Yoshiyuki; Miki, Mitsunori; Yokouchi, Hisatake; Yoshimi, Masato
2012-01-01
In this research, we propose a new distributed PACS (Picture Archiving and Communication Systems) which is available to integrate several PACSs that exist in each medical institution. The conventional PACS controls DICOM file into one data-base. On the other hand, in the proposed system, DICOM file is separated into meta data and image data and those are stored individually. Using this mechanism, since file is not always accessed the entire data, some operations such as finding files, changing titles, and so on can be performed in high-speed. At the same time, as distributed file system is utilized, accessing image files can also achieve high-speed access and high fault tolerant. The introduced system has a more significant point. That is the simplicity to integrate several PACSs. In the proposed system, only the meta data servers are integrated and integrated system can be constructed. This system also has the scalability of file access with along to the number of file numbers and file sizes. On the other hand, because meta-data server is integrated, the meta data server is the weakness of this system. To solve this defect, hieratical meta data servers are introduced. Because of this mechanism, not only fault--tolerant ability is increased but scalability of file access is also increased. To discuss the proposed system, the prototype system using Gfarm was implemented. For evaluating the implemented system, file search operating time of Gfarm and NFS were compared.
Scalable global grid catalogue for Run3 and beyond
NASA Astrophysics Data System (ADS)
Martinez Pedreira, M.; Grigoras, C.;
2017-10-01
The AliEn (ALICE Environment) file catalogue is a global unique namespace providing mapping between a UNIX-like logical name structure and the corresponding physical files distributed over 80 storage elements worldwide. Powerful search tools and hierarchical metadata information are integral parts of the system and are used by the Grid jobs as well as local users to store and access all files on the Grid storage elements. The catalogue has been in production since 2005 and over the past 11 years has grown to more than 2 billion logical file names. The backend is a set of distributed relational databases, ensuring smooth growth and fast access. Due to the anticipated fast future growth, we are looking for ways to enhance the performance and scalability by simplifying the catalogue schema while keeping the functionality intact. We investigated different backend solutions, such as distributed key value stores, as replacement for the relational database. This contribution covers the architectural changes in the system, together with the technology evaluation, benchmark results and conclusions.
Deceit: A flexible distributed file system
NASA Technical Reports Server (NTRS)
Siegel, Alex; Birman, Kenneth; Marzullo, Keith
1989-01-01
Deceit, a distributed file system (DFS) being developed at Cornell, focuses on flexible file semantics in relation to efficiency, scalability, and reliability. Deceit servers are interchangeable and collectively provide the illusion of a single, large server machine to any clients of the Deceit service. Non-volatile replicas of each file are stored on a subset of the file servers. The user is able to set parameters on a file to achieve different levels of availability, performance, and one-copy serializability. Deceit also supports a file version control mechanism. In contrast with many recent DFS efforts, Deceit can behave like a plain Sun Network File System (NFS) server and can be used by any NFS client without modifying any client software. The current Deceit prototype uses the ISIS Distributed Programming Environment for all communication and process group management, an approach that reduces system complexity and increases system robustness.
DOE Office of Scientific and Technical Information (OSTI.GOV)
2013-04-04
Spindle is software infrastructure that solves file system scalabiltiy problems associated with starting dynamically linked applications in HPC environments. When an HPC applications starts up thousands of pricesses at once, and those processes simultaneously access a shared file system to look for shared libraries, it can cause significant performance problems for both the application and other users. Spindle scalably coordinates the distribution of shared libraries to an application to avoid hammering the shared file system.
Reliable file sharing in distributed operating system using web RTC
NASA Astrophysics Data System (ADS)
Dukiya, Rajesh
2017-12-01
Since, the evolution of distributed operating system, distributed file system is come out to be important part in operating system. P2P is a reliable way in Distributed Operating System for file sharing. It was introduced in 1999, later it became a high research interest topic. Peer to Peer network is a type of network, where peers share network workload and other load related tasks. A P2P network can be a period of time connection, where a bunch of computers connected by a USB (Universal Serial Bus) port to transfer or enable disk sharing i.e. file sharing. Currently P2P requires special network that should be designed in P2P way. Nowadays, there is a big influence of browsers in our life. In this project we are going to study of file sharing mechanism in distributed operating system in web browsers, where we will try to find performance bottlenecks which our research will going to be an improvement in file sharing by performance and scalability in distributed file systems. Additionally, we will discuss the scope of Web Torrent file sharing and free-riding in peer to peer networks.
Architectural Considerations for Highly Scalable Computing to Support On-demand Video Analytics
2017-04-19
enforcement . The system was tested in the wild using video files as well as a commercial Video Management System supporting more than 100 surveillance...research were used to implement a distributed on-demand video analytics system that was prototyped for the use of forensics investigators in law...cameras as video sources. The architectural considerations of this system are presented. Issues to be reckoned with in implementing a scalable
Accessing Data Federations with CVMFS
Weitzel, Derek; Bockelman, Brian; Dykstra, Dave; ...
2017-11-23
Data federations have become an increasingly common tool for large collaborations such as CMS and Atlas to efficiently distribute large data files. Unfortunately, these typically are implemented with weak namespace semantics and a non-POSIX API. On the other hand, CVMFS has provided a POSIX-compliant read-only interface for use cases with a small working set size (such as software distribution). The metadata required for the CVMFS POSIX interface is distributed through a caching hierarchy, allowing it to scale to the level of about a hundred thousand hosts. In this paper, we will describe our contributions to CVMFS that merges the datamore » scalability of XRootD-based data federations (such as AAA) with metadata scalability and POSIX interface of CVMFS. We modified CVMFS so it can serve unmodified files without copying them to the repository server. CVMFS 2.2.0 is also able to redirect requests for data files to servers outside of the CVMFS content distribution network. Finally, we added the ability to manage authorization and authentication using security credentials such as X509 proxy certificates. We combined these modifications with the OSGs StashCache regional XRootD caching infrastructure to create a cached data distribution network. Here, we will show performance metrics accessing the data federation through CVMFS compared to direct data federation access. Additionally, we will discuss the improved user experience of providing access to a data federation through a POSIX filesystem.« less
Accessing Data Federations with CVMFS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Weitzel, Derek; Bockelman, Brian; Dykstra, Dave
Data federations have become an increasingly common tool for large collaborations such as CMS and Atlas to efficiently distribute large data files. Unfortunately, these typically are implemented with weak namespace semantics and a non-POSIX API. On the other hand, CVMFS has provided a POSIX-compliant read-only interface for use cases with a small working set size (such as software distribution). The metadata required for the CVMFS POSIX interface is distributed through a caching hierarchy, allowing it to scale to the level of about a hundred thousand hosts. In this paper, we will describe our contributions to CVMFS that merges the datamore » scalability of XRootD-based data federations (such as AAA) with metadata scalability and POSIX interface of CVMFS. We modified CVMFS so it can serve unmodified files without copying them to the repository server. CVMFS 2.2.0 is also able to redirect requests for data files to servers outside of the CVMFS content distribution network. Finally, we added the ability to manage authorization and authentication using security credentials such as X509 proxy certificates. We combined these modifications with the OSGs StashCache regional XRootD caching infrastructure to create a cached data distribution network. Here, we will show performance metrics accessing the data federation through CVMFS compared to direct data federation access. Additionally, we will discuss the improved user experience of providing access to a data federation through a POSIX filesystem.« less
Accessing Data Federations with CVMFS
NASA Astrophysics Data System (ADS)
Weitzel, Derek; Bockelman, Brian; Dykstra, Dave; Blomer, Jakob; Meusel, Ren
2017-10-01
Data federations have become an increasingly common tool for large collaborations such as CMS and Atlas to efficiently distribute large data files. Unfortunately, these typically are implemented with weak namespace semantics and a non-POSIX API. On the other hand, CVMFS has provided a POSIX-compliant read-only interface for use cases with a small working set size (such as software distribution). The metadata required for the CVMFS POSIX interface is distributed through a caching hierarchy, allowing it to scale to the level of about a hundred thousand hosts. In this paper, we will describe our contributions to CVMFS that merges the data scalability of XRootD-based data federations (such as AAA) with metadata scalability and POSIX interface of CVMFS. We modified CVMFS so it can serve unmodified files without copying them to the repository server. CVMFS 2.2.0 is also able to redirect requests for data files to servers outside of the CVMFS content distribution network. Finally, we added the ability to manage authorization and authentication using security credentials such as X509 proxy certificates. We combined these modifications with the OSGs StashCache regional XRootD caching infrastructure to create a cached data distribution network. We will show performance metrics accessing the data federation through CVMFS compared to direct data federation access. Additionally, we will discuss the improved user experience of providing access to a data federation through a POSIX filesystem.
Sun, Xiaobo; Gao, Jingjing; Jin, Peng; Eng, Celeste; Burchard, Esteban G; Beaty, Terri H; Ruczinski, Ingo; Mathias, Rasika A; Barnes, Kathleen; Wang, Fusheng; Qin, Zhaohui S
2018-06-01
Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
Gao, Jingjing; Jin, Peng; Eng, Celeste; Burchard, Esteban G; Beaty, Terri H; Ruczinski, Ingo; Mathias, Rasika A; Barnes, Kathleen; Wang, Fusheng
2018-01-01
Abstract Background Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)–based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems. PMID:29762754
AsyncStageOut: Distributed user data management for CMS Analysis
NASA Astrophysics Data System (ADS)
Riahi, H.; Wildish, T.; Ciangottini, D.; Hernández, J. M.; Andreeva, J.; Balcas, J.; Karavakis, E.; Mascheroni, M.; Tanasijczuk, A. J.; Vaandering, E. W.
2015-12-01
AsyncStageOut (ASO) is a new component of the distributed data analysis system of CMS, CRAB, designed for managing users' data. It addresses a major weakness of the previous model, namely that mass storage of output data was part of the job execution resulting in inefficient use of job slots and an unacceptable failure rate at the end of the jobs. ASO foresees the management of up to 400k files per day of various sizes, spread worldwide across more than 60 sites. It must handle up to 1000 individual users per month, and work with minimal delay. This creates challenging requirements for system scalability, performance and monitoring. ASO uses FTS to schedule and execute the transfers between the storage elements of the source and destination sites. It has evolved from a limited prototype to a highly adaptable service, which manages and monitors the user file placement and bookkeeping. To ensure system scalability and data monitoring, it employs new technologies such as a NoSQL database and re-uses existing components of PhEDEx and the FTS Dashboard. We present the asynchronous stage-out strategy and the architecture of the solution we implemented to deal with those issues and challenges. The deployment model for the high availability and scalability of the service is discussed. The performance of the system during the commissioning and the first phase of production are also shown, along with results from simulations designed to explore the limits of scalability.
AsyncStageOut: Distributed User Data Management for CMS Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Riahi, H.; Wildish, T.; Ciangottini, D.
2015-12-23
AsyncStageOut (ASO) is a new component of the distributed data analysis system of CMS, CRAB, designed for managing users' data. It addresses a major weakness of the previous model, namely that mass storage of output data was part of the job execution resulting in inefficient use of job slots and an unacceptable failure rate at the end of the jobs. ASO foresees the management of up to 400k files per day of various sizes, spread worldwide across more than 60 sites. It must handle up to 1000 individual users per month, and work with minimal delay. This creates challenging requirementsmore » for system scalability, performance and monitoring. ASO uses FTS to schedule and execute the transfers between the storage elements of the source and destination sites. It has evolved from a limited prototype to a highly adaptable service, which manages and monitors the user file placement and bookkeeping. To ensure system scalability and data monitoring, it employs new technologies such as a NoSQL database and re-uses existing components of PhEDEx and the FTS Dashboard. We present the asynchronous stage-out strategy and the architecture of the solution we implemented to deal with those issues and challenges. The deployment model for the high availability and scalability of the service is discussed. The performance of the system during the commissioning and the first phase of production are also shown, along with results from simulations designed to explore the limits of scalability.« less
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework
2012-01-01
Background For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. Results We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. Conclusion The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources. PMID:23216909
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework.
Lewis, Steven; Csordas, Attila; Killcoyne, Sarah; Hermjakob, Henning; Hoopmann, Michael R; Moritz, Robert L; Deutsch, Eric W; Boyle, John
2012-12-05
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
The Scalable Checkpoint/Restart Library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moody, A.
The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to worite our and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes.more » It has been benchmarked as high as 720GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster thanthe parallel file system.« less
Scalable Trust of Next-Generation Management (STRONGMAN)
2004-10-01
remote logins might be policy controlled to allow only strongly encrypted IPSec tunnels to log in remotely, to access selected files, etc. The...and Angelos D. Keromytis. Drop-in Security for Distributed and Portable Computing Elements. Emerald Journal of Internet Research. Electronic...Security and Privacy, pp. 17-31, May 1999. [2] S. M. Bellovin. Distributed Firewalls. ; login : magazine, special issue on security, November 1999. [3] M
pcircle - A Suite of Scalable Parallel File System Tools
DOE Office of Scientific and Technical Information (OSTI.GOV)
WANG, FEIYI
2015-10-01
Most of the software related to file system are written for conventional local file system, they are serialized and can't take advantage of the benefit of a large scale parallel file system. "pcircle" software builds on top of ubiquitous MPI in cluster computing environment and "work-stealing" pattern to provide a scalable, high-performance suite of file system tools. In particular - it implemented parallel data copy and parallel data checksumming, with advanced features such as async progress report, checkpoint and restart, as well as integrity checking.
New directions in the CernVM file system
NASA Astrophysics Data System (ADS)
Blomer, Jakob; Buncic, Predrag; Ganis, Gerardo; Hardi, Nikola; Meusel, Rene; Popescu, Radu
2017-10-01
The CernVM File System today is commonly used to host and distribute application software stacks. In addition to this core task, recent developments expand the scope of the file system into two new areas. Firstly, CernVM-FS emerges as a good match for container engines to distribute the container image contents. Compared to native container image distribution (e.g. through the “Docker registry”), CernVM-FS massively reduces the network traffic for image distribution. This has been shown, for instance, by a prototype integration of CernVM-FS into Mesos developed by Mesosphere, Inc. We present a path for a smooth integration of CernVM-FS and Docker. Secondly, CernVM-FS recently raised new interest as an option for the distribution of experiment conditions data. Here, the focus is on improved versioning capabilities of CernVM-FS that allows to link the conditions data of a run period to the state of a CernVM-FS repository. Lastly, CernVM-FS has been extended to provide a name space for physics data for the LIGO and CMS collaborations. Searching through a data namespace is often done by a central, experiment specific database service. A name space on CernVM-FS can particularly benefit from an existing, scalable infrastructure and from the POSIX file system interface.
Jaschob, Daniel; Riffle, Michael
2012-07-30
Laboratories engaged in computational biology or bioinformatics frequently need to run lengthy, multistep, and user-driven computational jobs. Each job can tie up a computer for a few minutes to several days, and many laboratories lack the expertise or resources to build and maintain a dedicated computer cluster. JobCenter is a client-server application and framework for job management and distributed job execution. The client and server components are both written in Java and are cross-platform and relatively easy to install. All communication with the server is client-driven, which allows worker nodes to run anywhere (even behind external firewalls or "in the cloud") and provides inherent load balancing. Adding a worker node to the worker pool is as simple as dropping the JobCenter client files onto any computer and performing basic configuration, which provides tremendous ease-of-use, flexibility, and limitless horizontal scalability. Each worker installation may be independently configured, including the types of jobs it is able to run. Executed jobs may be written in any language and may include multistep workflows. JobCenter is a versatile and scalable distributed job management system that allows laboratories to very efficiently distribute all computational work among available resources. JobCenter is freely available at http://code.google.com/p/jobcenter/.
A scalable infrastructure for CMS data analysis based on OpenStack Cloud and Gluster file system
NASA Astrophysics Data System (ADS)
Toor, S.; Osmani, L.; Eerola, P.; Kraemer, O.; Lindén, T.; Tarkoma, S.; White, J.
2014-06-01
The challenge of providing a resilient and scalable computational and data management solution for massive scale research environments requires continuous exploration of new technologies and techniques. In this project the aim has been to design a scalable and resilient infrastructure for CERN HEP data analysis. The infrastructure is based on OpenStack components for structuring a private Cloud with the Gluster File System. We integrate the state-of-the-art Cloud technologies with the traditional Grid middleware infrastructure. Our test results show that the adopted approach provides a scalable and resilient solution for managing resources without compromising on performance and high availability.
Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.
Niemenmaa, Matti; Kallio, Aleksi; Schumacher, André; Klemelä, Petri; Korpelainen, Eija; Heljanko, Keijo
2012-03-15
Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps.
Cost Considerations in Cloud Computing
2014-01-01
investments. 2. Database Options The potential promise that “ big data ” analytics holds for many enterprise mission areas makes relevant the question of the...development of a range of new distributed file systems and data - bases that have better scalability properties than traditional SQL databases. Hadoop ... data . Many systems exist that extend or supplement Hadoop —such as Apache Accumulo, which provides a highly granular mechanism for managing security
JuxtaView - A tool for interactive visualization of large imagery on scalable tiled displays
Krishnaprasad, N.K.; Vishwanath, V.; Venkataraman, S.; Rao, A.G.; Renambot, L.; Leigh, J.; Johnson, A.E.; Davis, B.
2004-01-01
JuxtaView is a cluster-based application for viewing ultra-high-resolution images on scalable tiled displays. We present in JuxtaView, a new parallel computing and distributed memory approach for out-of-core montage visualization, using LambdaRAM, a software-based network-level cache system. The ultimate goal of JuxtaView is to enable a user to interactively roam through potentially terabytes of distributed, spatially referenced image data such as those from electron microscopes, satellites and aerial photographs. In working towards this goal, we describe our first prototype implemented over a local area network, where the image is distributed using LambdaRAM, on the memory of all nodes of a PC cluster driving a tiled display wall. Aggressive pre-fetching schemes employed by LambdaRAM help to reduce latency involved in remote memory access. We compare LambdaRAM with a more traditional memory-mapped file approach for out-of-core visualization. ?? 2004 IEEE.
2012-01-01
Background Laboratories engaged in computational biology or bioinformatics frequently need to run lengthy, multistep, and user-driven computational jobs. Each job can tie up a computer for a few minutes to several days, and many laboratories lack the expertise or resources to build and maintain a dedicated computer cluster. Results JobCenter is a client–server application and framework for job management and distributed job execution. The client and server components are both written in Java and are cross-platform and relatively easy to install. All communication with the server is client-driven, which allows worker nodes to run anywhere (even behind external firewalls or “in the cloud”) and provides inherent load balancing. Adding a worker node to the worker pool is as simple as dropping the JobCenter client files onto any computer and performing basic configuration, which provides tremendous ease-of-use, flexibility, and limitless horizontal scalability. Each worker installation may be independently configured, including the types of jobs it is able to run. Executed jobs may be written in any language and may include multistep workflows. Conclusions JobCenter is a versatile and scalable distributed job management system that allows laboratories to very efficiently distribute all computational work among available resources. JobCenter is freely available at http://code.google.com/p/jobcenter/. PMID:22846423
A General-purpose Framework for Parallel Processing of Large-scale LiDAR Data
NASA Astrophysics Data System (ADS)
Li, Z.; Hodgson, M.; Li, W.
2016-12-01
Light detection and ranging (LiDAR) technologies have proven efficiency to quickly obtain very detailed Earth surface data for a large spatial extent. Such data is important for scientific discoveries such as Earth and ecological sciences and natural disasters and environmental applications. However, handling LiDAR data poses grand geoprocessing challenges due to data intensity and computational intensity. Previous studies received notable success on parallel processing of LiDAR data to these challenges. However, these studies either relied on high performance computers and specialized hardware (GPUs) or focused mostly on finding customized solutions for some specific algorithms. We developed a general-purpose scalable framework coupled with sophisticated data decomposition and parallelization strategy to efficiently handle big LiDAR data. Specifically, 1) a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerable Hadoop distributed file system, 2) two spatial decomposition techniques are developed to enable efficient parallelization of different types of LiDAR processing tasks, and 3) by coupling existing LiDAR processing tools with Hadoop, this framework is able to conduct a variety of LiDAR data processing tasks in parallel in a highly scalable distributed computing environment. The performance and scalability of the framework is evaluated with a series of experiments conducted on a real LiDAR dataset using a proof-of-concept prototype system. The results show that the proposed framework 1) is able to handle massive LiDAR data more efficiently than standalone tools; and 2) provides almost linear scalability in terms of either increased workload (data volume) or increased computing nodes with both spatial decomposition strategies. We believe that the proposed framework provides valuable references on developing a collaborative cyberinfrastructure for processing big earth science data in a highly scalable environment.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Inman, Jeffrey; Bonnie, David; Broomfield, Matthew
There is a sea (mar is Spanish for sea) of data out there that needs to be handled efficiently. Object Stores are filling the hole of managing large amounts of data efficiently. However, in many cases, and our HPC case in particular, we need a traditional file (POSIX) interface to this data as HPC I/O models have not moved to object interfaces, such as Amazon S3, CDMI, etc.Eventually Object Store providers may deliver file interfaces to their object stores, but at this point those interfaces are not ready to do the job that we need done. MarFS will glue togethermore » two existing scalable components: a file system's scalable metadata component that provides the file interface; and existing scalable object stores (from one or more providers). There will be utilities to do work that is not critical to be done in real-time so that MarFS can manage the space used by objects and allocated to individual users.« less
Information Security Considerations for Applications Using Apache Accumulo
2014-09-01
Distributed File System INSCOM United States Army Intelligence and Security Command JPA Java Persistence API JSON JavaScript Object Notation MAC Mandatory... MySQL [13]. BigTable can process 20 petabytes per day [14]. High degree of scalability on commodity hardware. NoSQL databases do not rely on highly...manipulation in relational databases. NoSQL databases each have a unique programming interface that uses a lower level procedural language (e.g., Java
Linear Time Algorithms to Restrict Insider Access using Multi-Policy Access Control Systems
Mell, Peter; Shook, James; Harang, Richard; Gavrila, Serban
2017-01-01
An important way to limit malicious insiders from distributing sensitive information is to as tightly as possible limit their access to information. This has always been the goal of access control mechanisms, but individual approaches have been shown to be inadequate. Ensemble approaches of multiple methods instantiated simultaneously have been shown to more tightly restrict access, but approaches to do so have had limited scalability (resulting in exponential calculations in some cases). In this work, we take the Next Generation Access Control (NGAC) approach standardized by the American National Standards Institute (ANSI) and demonstrate its scalability. The existing publicly available reference implementations all use cubic algorithms and thus NGAC was widely viewed as not scalable. The primary NGAC reference implementation took, for example, several minutes to simply display the set of files accessible to a user on a moderately sized system. In our approach, we take these cubic algorithms and make them linear. We do this by reformulating the set theoretic approach of the NGAC standard into a graph theoretic approach and then apply standard graph algorithms. We thus can answer important access control decision questions (e.g., which files are available to a user and which users can access a file) using linear time graph algorithms. We also provide a default linear time mechanism to visualize and review user access rights for an ensemble of access control mechanisms. Our visualization appears to be a simple file directory hierarchy but in reality is an automatically generated structure abstracted from the underlying access control graph that works with any set of simultaneously instantiated access control policies. It also provide an implicit mechanism for symbolic linking that provides a powerful access capability. Our work thus provides the first efficient implementation of NGAC while enabling user privilege review through a novel visualization approach. This may help transition from concept to reality the idea of using ensembles of simultaneously instantiated access control methodologies, thereby limiting insider threat. PMID:28758045
Hadoop-BAM: directly manipulating next generation sequencing data in the cloud
Niemenmaa, Matti; Kallio, Aleksi; Schumacher, André; Klemelä, Petri; Korpelainen, Eija; Heljanko, Keijo
2012-01-01
Summary: Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps. Availability: Available under the open-source MIT license at http://sourceforge.net/projects/hadoop-bam/ Contact: matti.niemenmaa@aalto.fi Supplementary information: Supplementary material is available at Bioinformatics online. PMID:22302568
The Jade File System. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Rao, Herman Chung-Hwa
1991-01-01
File systems have long been the most important and most widely used form of shared permanent storage. File systems in traditional time-sharing systems, such as Unix, support a coherent sharing model for multiple users. Distributed file systems implement this sharing model in local area networks. However, most distributed file systems fail to scale from local area networks to an internet. Four characteristics of scalability were recognized: size, wide area, autonomy, and heterogeneity. Owing to size and wide area, techniques such as broadcasting, central control, and central resources, which are widely adopted by local area network file systems, are not adequate for an internet file system. An internet file system must also support the notion of autonomy because an internet is made up by a collection of independent organizations. Finally, heterogeneity is the nature of an internet file system, not only because of its size, but also because of the autonomy of the organizations in an internet. The Jade File System, which provides a uniform way to name and access files in the internet environment, is presented. Jade is a logical system that integrates a heterogeneous collection of existing file systems, where heterogeneous means that the underlying file systems support different file access protocols. Because of autonomy, Jade is designed under the restriction that the underlying file systems may not be modified. In order to avoid the complexity of maintaining an internet-wide, global name space, Jade permits each user to define a private name space. In Jade's design, we pay careful attention to avoiding unnecessary network messages between clients and file servers in order to achieve acceptable performance. Jade's name space supports two novel features: (1) it allows multiple file systems to be mounted under one direction; and (2) it permits one logical name space to mount other logical name spaces. A prototype of Jade was implemented to examine and validate its design. The prototype consists of interfaces to the Unix File System, the Sun Network File System, and the File Transfer Protocol.
LHCb experience with LFC replication
NASA Astrophysics Data System (ADS)
Bonifazi, F.; Carbone, A.; Perez, E. D.; D'Apice, A.; dell'Agnello, L.; Duellmann, D.; Girone, M.; Re, G. L.; Martelli, B.; Peco, G.; Ricci, P. P.; Sapunenko, V.; Vagnoni, V.; Vitlacil, D.
2008-07-01
Database replication is a key topic in the framework of the LHC Computing Grid to allow processing of data in a distributed environment. In particular, the LHCb computing model relies on the LHC File Catalog, i.e. a database which stores information about files spread across the GRID, their logical names and the physical locations of all the replicas. The LHCb computing model requires the LFC to be replicated at Tier-1s. The LCG 3D project deals with the database replication issue and provides a replication service based on Oracle Streams technology. This paper describes the deployment of the LHC File Catalog replication to the INFN National Center for Telematics and Informatics (CNAF) and to other LHCb Tier-1 sites. We performed stress tests designed to evaluate any delay in the propagation of the streams and the scalability of the system. The tests show the robustness of the replica implementation with performance going much beyond the LHCb requirements.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
NASA Astrophysics Data System (ADS)
Farhan Husain, Mohammad; Doshi, Pankil; Khan, Latifur; Thuraisingham, Bhavani
Handling huge amount of data scalably is a matter of concern for a long time. Same is true for semantic web data. Current semantic web frameworks lack this ability. In this paper, we describe a framework that we built using Hadoop to store and retrieve large number of RDF triples. We describe our schema to store RDF data in Hadoop Distribute File System. We also present our algorithms to answer a SPARQL query. We make use of Hadoop's MapReduce framework to actually answer the queries. Our results reveal that we can store huge amount of semantic web data in Hadoop clusters built mostly by cheap commodity class hardware and still can answer queries fast enough. We conclude that ours is a scalable framework, able to handle large amount of RDF data efficiently.
High-performance metadata indexing and search in petascale data storage systems
NASA Astrophysics Data System (ADS)
Leung, A. W.; Shao, M.; Bisson, T.; Pasupathy, S.; Miller, E. L.
2008-07-01
Large-scale storage systems used for scientific applications can store petabytes of data and billions of files, making the organization and management of data in these systems a difficult, time-consuming task. The ability to search file metadata in a storage system can address this problem by allowing scientists to quickly navigate experiment data and code while allowing storage administrators to gather the information they need to properly manage the system. In this paper, we present Spyglass, a file metadata search system that achieves scalability by exploiting storage system properties, providing the scalability that existing file metadata search tools lack. In doing so, Spyglass can achieve search performance up to several thousand times faster than existing database solutions. We show that Spyglass enables important functionality that can aid data management for scientists and storage administrators.
NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
NASA Astrophysics Data System (ADS)
Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. A.
2010-09-01
The latest release of NWChem delivers an open-source computational chemistry package with extensive capabilities for large scale simulations of chemical and biological systems. Utilizing a common computational framework, diverse theoretical descriptions can be used to provide the best solution for a given scientific problem. Scalable parallel implementations and modular software design enable efficient utilization of current computational architectures. This paper provides an overview of NWChem focusing primarily on the core theoretical modules provided by the code and their parallel performance. Program summaryProgram title: NWChem Catalogue identifier: AEGI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGI_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Open Source Educational Community License No. of lines in distributed program, including test data, etc.: 11 709 543 No. of bytes in distributed program, including test data, etc.: 680 696 106 Distribution format: tar.gz Programming language: Fortran 77, C Computer: all Linux based workstations and parallel supercomputers, Windows and Apple machines Operating system: Linux, OS X, Windows Has the code been vectorised or parallelized?: Code is parallelized Classification: 2.1, 2.2, 3, 7.3, 7.7, 16.1, 16.2, 16.3, 16.10, 16.13 Nature of problem: Large-scale atomistic simulations of chemical and biological systems require efficient and reliable methods for ground and excited solutions of many-electron Hamiltonian, analysis of the potential energy surface, and dynamics. Solution method: Ground and excited solutions of many-electron Hamiltonian are obtained utilizing density-functional theory, many-body perturbation approach, and coupled cluster expansion. These solutions or a combination thereof with classical descriptions are then used to analyze potential energy surface and perform dynamical simulations. Additional comments: Full documentation is provided in the distribution file. This includes an INSTALL file giving details of how to build the package. A set of test runs is provided in the examples directory. The distribution file for this program is over 90 Mbytes and therefore is not delivered directly when download or Email is requested. Instead a html file giving details of how the program can be obtained is sent. Running time: Running time depends on the size of the chemical system, complexity of the method, number of cpu's and the computational task. It ranges from several seconds for serial DFT energy calculations on a few atoms to several hours for parallel coupled cluster energy calculations on tens of atoms or ab-initio molecular dynamics simulation on hundreds of atoms.
NASA Astrophysics Data System (ADS)
Appel, Marius; Lahn, Florian; Pebesma, Edzer; Buytaert, Wouter; Moulds, Simon
2016-04-01
Today's amount of freely available data requires scientists to spend large parts of their work on data management. This is especially true in environmental sciences when working with large remote sensing datasets, such as obtained from earth-observation satellites like the Sentinel fleet. Many frameworks like SpatialHadoop or Apache Spark address the scalability but target programmers rather than data analysts, and are not dedicated to imagery or array data. In this work, we use the open-source data management and analytics system SciDB to bring large earth-observation datasets closer to analysts. Its underlying data representation as multidimensional arrays fits naturally to earth-observation datasets, distributes storage and computational load over multiple instances by multidimensional chunking, and also enables efficient time-series based analyses, which is usually difficult using file- or tile-based approaches. Existing interfaces to R and Python furthermore allow for scalable analytics with relatively little learning effort. However, interfacing SciDB and file-based earth-observation datasets that come as tiled temporal snapshots requires a lot of manual bookkeeping during ingestion, and SciDB natively only supports loading data from CSV-like and custom binary formatted files, which currently limits its practical use in earth-observation analytics. To make it easier to work with large multi-temporal datasets in SciDB, we developed software tools that enrich SciDB with earth observation metadata and allow working with commonly used file formats: (i) the SciDB extension library scidb4geo simplifies working with spatiotemporal arrays by adding relevant metadata to the database and (ii) the Geospatial Data Abstraction Library (GDAL) driver implementation scidb4gdal allows to ingest and export remote sensing imagery from and to a large number of file formats. Using added metadata on temporal resolution and coverage, the GDAL driver supports time-based ingestion of imagery to existing multi-temporal SciDB arrays. While our SciDB plugin works directly in the database, the GDAL driver has been specifically developed using a minimum amount of external dependencies (i.e. CURL). Source code for both tools is available from github [1]. We present these tools in a case-study that demonstrates the ingestion of multi-temporal tiled earth-observation data to SciDB, followed by a time-series analysis using R and SciDBR. Through the exclusive use of open-source software, our approach supports reproducibility in scalable large-scale earth-observation analytics. In the future, these tools can be used in an automated way to let scientists only work on ready-to-use SciDB arrays to significantly reduce the data management workload for domain scientists. [1] https://github.com/mappl/scidb4geo} and \\url{https://github.com/mappl/scidb4gdal
Nosql for Storage and Retrieval of Large LIDAR Data Collections
NASA Astrophysics Data System (ADS)
Boehm, J.; Liu, K.
2015-08-01
Developments in LiDAR technology over the past decades have made LiDAR to become a mature and widely accepted source of geospatial information. This in turn has led to an enormous growth in data volume. The central idea for a file-centric storage of LiDAR point clouds is the observation that large collections of LiDAR data are typically delivered as large collections of files, rather than single files of terabyte size. This split of the dataset, commonly referred to as tiling, was usually done to accommodate a specific processing pipeline. It makes therefore sense to preserve this split. A document oriented NoSQL database can easily emulate this data partitioning, by representing each tile (file) in a separate document. The document stores the metadata of the tile. The actual files are stored in a distributed file system emulated by the NoSQL database. We demonstrate the use of MongoDB a highly scalable document oriented NoSQL database for storing large LiDAR files. MongoDB like any NoSQL database allows for queries on the attributes of the document. As a specialty MongoDB also allows spatial queries. Hence we can perform spatial queries on the bounding boxes of the LiDAR tiles. Inserting and retrieving files on a cloud-based database is compared to native file system and cloud storage transfer speed.
GLAD: a system for developing and deploying large-scale bioinformatics grid.
Teo, Yong-Meng; Wang, Xianbing; Ng, Yew-Kwong
2005-03-01
Grid computing is used to solve large-scale bioinformatics problems with gigabytes database by distributing the computation across multiple platforms. Until now in developing bioinformatics grid applications, it is extremely tedious to design and implement the component algorithms and parallelization techniques for different classes of problems, and to access remotely located sequence database files of varying formats across the grid. In this study, we propose a grid programming toolkit, GLAD (Grid Life sciences Applications Developer), which facilitates the development and deployment of bioinformatics applications on a grid. GLAD has been developed using ALiCE (Adaptive scaLable Internet-based Computing Engine), a Java-based grid middleware, which exploits the task-based parallelism. Two bioinformatics benchmark applications, such as distributed sequence comparison and distributed progressive multiple sequence alignment, have been developed using GLAD.
Jefferson Lab Mass Storage and File Replication Services
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ian Bird; Ying Chen; Bryan Hess
Jefferson Lab has implemented a scalable, distributed, high performance mass storage system - JASMine. The system is entirely implemented in Java, provides access to robotic tape storage and includes disk cache and stage manager components. The disk manager subsystem may be used independently to manage stand-alone disk pools. The system includes a scheduler to provide policy-based access to the storage systems. Security is provided by pluggable authentication modules and is implemented at the network socket level. The tape and disk cache systems have well defined interfaces in order to provide integration with grid-based services. The system is in production andmore » being used to archive 1 TB per day from the experiments, and currently moves over 2 TB per day total. This paper will describe the architecture of JASMine; discuss the rationale for building the system, and present a transparent 3rd party file replication service to move data to collaborating institutes using JASMine, XM L, and servlet technology interfacing to grid-based file transfer mechanisms.« less
SeqHBase: a big data toolset for family based sequencing data analysis.
He, Min; Person, Thomas N; Hebbring, Scott J; Heinzen, Ethan; Ye, Zhan; Schrodi, Steven J; McPherson, Elizabeth W; Lin, Simon M; Peissig, Peggy L; Brilliant, Murray H; O'Rawe, Jason; Robison, Reid J; Lyon, Gholson J; Wang, Kai
2015-04-01
Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis. Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with number of data nodes. With 20 data nodes, SeqHBase took about 5 secs to analyse WES familial data and approximately 1 min to analyse WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
MarFS, a Near-POSIX Interface to Cloud Objects
DOE Office of Scientific and Technical Information (OSTI.GOV)
Inman, Jeffrey Thornton; Vining, William Flynn; Ransom, Garrett Wilson
The engineering forces driving development of “cloud” storage have produced resilient, cost-effective storage systems that can scale to 100s of petabytes, with good parallel access and bandwidth. These features would make a good match for the vast storage needs of High-Performance Computing datacenters, but cloud storage gains some of its capability from its use of HTTP-style Representational State Transfer (REST) semantics, whereas most large datacenters have legacy applications that rely on POSIX file-system semantics. MarFS is an open-source project at Los Alamos National Laboratory that allows us to present cloud-style object-storage as a scalable near-POSIX file system. We have alsomore » developed a new storage architecture to improve bandwidth and scalability beyond what’s available in commodity object stores, while retaining their resilience and economy. Additionally, we present a scheme for scaling the POSIX interface to allow billions of files in a single directory and trillions of files in total.« less
MarFS, a Near-POSIX Interface to Cloud Objects
Inman, Jeffrey Thornton; Vining, William Flynn; Ransom, Garrett Wilson; ...
2017-01-01
The engineering forces driving development of “cloud” storage have produced resilient, cost-effective storage systems that can scale to 100s of petabytes, with good parallel access and bandwidth. These features would make a good match for the vast storage needs of High-Performance Computing datacenters, but cloud storage gains some of its capability from its use of HTTP-style Representational State Transfer (REST) semantics, whereas most large datacenters have legacy applications that rely on POSIX file-system semantics. MarFS is an open-source project at Los Alamos National Laboratory that allows us to present cloud-style object-storage as a scalable near-POSIX file system. We have alsomore » developed a new storage architecture to improve bandwidth and scalability beyond what’s available in commodity object stores, while retaining their resilience and economy. Additionally, we present a scheme for scaling the POSIX interface to allow billions of files in a single directory and trillions of files in total.« less
Cloud Engineering Principles and Technology Enablers for Medical Image Processing-as-a-Service.
Bao, Shunxing; Plassard, Andrew J; Landman, Bennett A; Gokhale, Aniruddha
2017-04-01
Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based "medical image processing-as-a-service" offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop's distributed file system. Despite this promise, HBase's load distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). This paper makes two contributions to address these concerns by describing key cloud engineering principles and technology enhancements we made to the Apache Hadoop ecosystem for medical imaging applications. First, we propose a row-key design for HBase, which is a necessary step that is driven by the hierarchical organization of imaging data. Second, we propose a novel data allocation policy within HBase to strongly enforce collocation of hierarchically related imaging data. The proposed enhancements accelerate data processing by minimizing network usage and localizing processing to machines where the data already exist. Moreover, our approach is amenable to the traditional scan, subject, and project-level analysis procedures, and is compatible with standard command line/scriptable image processing software. Experimental results for an illustrative sample of imaging data reveals that our new HBase policy results in a three-fold time improvement in conversion of classic DICOM to NiFTI file formats when compared with the default HBase region split policy, and nearly a six-fold improvement over a commonly available network file system (NFS) approach even for relatively small file sets. Moreover, file access latency is lower than network attached storage.
Developing a Hadoop-based Middleware for Handling Multi-dimensional NetCDF
NASA Astrophysics Data System (ADS)
Li, Z.; Yang, C. P.; Schnase, J. L.; Duffy, D.; Lee, T. J.
2014-12-01
Climate observations and model simulations are collecting and generating vast amounts of climate data, and these data are ever-increasing and being accumulated in a rapid speed. Effectively managing and analyzing these data are essential for climate change studies. Hadoop, a distributed storage and processing framework for large data sets, has attracted increasing attentions in dealing with the Big Data challenge. The maturity of Infrastructure as a Service (IaaS) of cloud computing further accelerates the adoption of Hadoop in solving Big Data problems. However, Hadoop is designed to process unstructured data such as texts, documents and web pages, and cannot effectively handle the scientific data format such as array-based NetCDF files and other binary data format. In this paper, we propose to build a Hadoop-based middleware for transparently handling big NetCDF data by 1) designing a distributed climate data storage mechanism based on POSIX-enabled parallel file system to enable parallel big data processing with MapReduce, as well as support data access by other systems; 2) modifying the Hadoop framework to transparently processing NetCDF data in parallel without sequencing or converting the data into other file formats, or loading them to HDFS; and 3) seamlessly integrating Hadoop, cloud computing and climate data in a highly scalable and fault-tolerance framework.
NASA Astrophysics Data System (ADS)
Zhu, F.; Yu, H.; Rilee, M. L.; Kuo, K. S.; Yu, L.; Pan, Y.; Jiang, H.
2017-12-01
Since the establishment of data archive centers and the standardization of file formats, scientists are required to search metadata catalogs for data needed and download the data files to their local machines to carry out data analysis. This approach has facilitated data discovery and access for decades, but it inevitably leads to data transfer from data archive centers to scientists' computers through low-bandwidth Internet connections. Data transfer becomes a major performance bottleneck in such an approach. Combined with generally constrained local compute/storage resources, they limit the extent of scientists' studies and deprive them of timely outcomes. Thus, this conventional approach is not scalable with respect to both the volume and variety of geoscience data. A much more viable solution is to couple analysis and storage systems to minimize data transfer. In our study, we compare loosely coupled approaches (exemplified by Spark and Hadoop) and tightly coupled approaches (exemplified by parallel distributed database management systems, e.g., SciDB). In particular, we investigate the optimization of data placement and movement to effectively tackle the variety challenge, and boost the popularization of parallelization to address the volume challenge. Our goal is to enable high-performance interactive analysis for a good portion of geoscience data analysis exercise. We show that tightly coupled approaches can concentrate data traffic between local storage systems and compute units, and thereby optimizing bandwidth utilization to achieve a better throughput. Based on our observations, we develop a geoscience data analysis system that tightly couples analysis engines with storages, which has direct access to the detailed map of data partition locations. Through an innovation data partitioning and distribution scheme, our system has demonstrated scalable and interactive performance in real-world geoscience data analysis applications.
NASA Astrophysics Data System (ADS)
Magnoni, L.; Suthakar, U.; Cordeiro, C.; Georgiou, M.; Andreeva, J.; Khan, A.; Smith, D. R.
2015-12-01
Monitoring the WLCG infrastructure requires the gathering and analysis of a high volume of heterogeneous data (e.g. data transfers, job monitoring, site tests) coming from different services and experiment-specific frameworks to provide a uniform and flexible interface for scientists and sites. The current architecture, where relational database systems are used to store, to process and to serve monitoring data, has limitations in coping with the foreseen increase in the volume (e.g. higher LHC luminosity) and the variety (e.g. new data-transfer protocols and new resource-types, as cloud-computing) of WLCG monitoring events. This paper presents a new scalable data store and analytics platform designed by the Support for Distributed Computing (SDC) group, at the CERN IT department, which uses a variety of technologies each one targeting specific aspects of big-scale distributed data-processing (commonly referred as lambda-architecture approach). Results of data processing on Hadoop for WLCG data activities monitoring are presented, showing how the new architecture can easily analyze hundreds of millions of transfer logs in a few minutes. Moreover, a comparison of data partitioning, compression and file format (e.g. CSV, Avro) is presented, with particular attention given to how the file structure impacts the overall MapReduce performance. In conclusion, the evolution of the current implementation, which focuses on data storage and batch processing, towards a complete lambda-architecture is discussed, with consideration of candidate technology for the serving layer (e.g. Elasticsearch) and a description of a proof of concept implementation, based on Apache Spark and Esper, for the real-time part which compensates for batch-processing latency and automates problem detection and failures.
NASA Astrophysics Data System (ADS)
Beach, A. L., III; Northup, E. A.; Early, A. B.; Chen, G.
2016-12-01
Airborne field studies are an effective way to gain a detailed understanding of atmospheric processes for scientific research on climate change and air quality relevant issues. One major function of airborne project data management is to maintain seamless data access within the science team. This allows individual instrument principal investigators (PIs) to process and validate their own data, which requires analysis of data sets from other PIs (or instruments). The project's web platform streamlines data ingest, distribution processes, and data format validation. In May 2016, the NASA Langley Research Center (LaRC) Atmospheric Science Data Center (ASDC) developed a new data management capability to help support the Korea U.S.-Air Quality (KORUS-AQ) science team. This effort is aimed at providing direct NASA Distributed Active Archive Center (DAAC) support to an airborne field study. Working closely with the science team, the ASDC developed a scalable architecture that allows investigators to easily upload and distribute their data and documentation within a secure collaborative environment. The user interface leverages modern design elements to intuitively guide the PI through each step of the data management process. In addition, the new framework creates an abstraction layer between how the data files are stored and how the data itself is organized(i.e. grouping files by PI). This approach makes it easy for PIs to simply transfer their data to one directory, while the system itself can automatically group/sort data as needed. Moreover, the platform is "server agnostic" to a certain degree, making deployment and customization more straightforward as hardware needs change. This flexible design will improve development efficiency and can be leveraged for future field campaigns. This presentation will examine the KORUS-AQ data portal as a scalable solution that applies consistent and intuitive usability design practices to support ingest and management of airborne data.
An analysis of image storage systems for scalable training of deep neural networks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lim, Seung-Hwan; Young, Steven R; Patton, Robert M
This study presents a principled empirical evaluation of image storage systems for training deep neural networks. We employ the Caffe deep learning framework to train neural network models for three different data sets, MNIST, CIFAR-10, and ImageNet. While training the models, we evaluate five different options to retrieve training image data: (1) PNG-formatted image files on local file system; (2) pushing pixel arrays from image files into a single HDF5 file on local file system; (3) in-memory arrays to hold the pixel arrays in Python and C++; (4) loading the training data into LevelDB, a log-structured merge tree based key-valuemore » storage; and (5) loading the training data into LMDB, a B+tree based key-value storage. The experimental results quantitatively highlight the disadvantage of using normal image files on local file systems to train deep neural networks and demonstrate reliable performance with key-value storage based storage systems. When training a model on the ImageNet dataset, the image file option was more than 17 times slower than the key-value storage option. Along with measurements on training time, this study provides in-depth analysis on the cause of performance advantages/disadvantages of each back-end to train deep neural networks. We envision the provided measurements and analysis will shed light on the optimal way to architect systems for training neural networks in a scalable manner.« less
RAIN: A Bio-Inspired Communication and Data Storage Infrastructure.
Monti, Matteo; Rasmussen, Steen
2017-01-01
We summarize the results and perspectives from a companion article, where we presented and evaluated an alternative architecture for data storage in distributed networks. We name the bio-inspired architecture RAIN, and it offers file storage service that, in contrast with current centralized cloud storage, has privacy by design, is open source, is more secure, is scalable, is more sustainable, has community ownership, is inexpensive, and is potentially faster, more efficient, and more reliable. We propose that a RAIN-style architecture could form the backbone of the Internet of Things that likely will integrate multiple current and future infrastructures ranging from online services and cryptocurrency to parts of government administration.
NASA Astrophysics Data System (ADS)
Poat, M. D.; Lauret, J.; Betts, W.
2015-12-01
The STAR online computing infrastructure has become an intensive dynamic system used for first-hand data collection and analysis resulting in a dense collection of data output. As we have transitioned to our current state, inefficient, limited storage systems have become an impediment to fast feedback to online shift crews. Motivation for a centrally accessible, scalable and redundant distributed storage system had become a necessity in this environment. OpenStack Swift Object Storage and Ceph Object Storage are two eye-opening technologies as community use and development have led to success elsewhere. In this contribution, OpenStack Swift and Ceph have been put to the test with single and parallel I/O tests, emulating real world scenarios for data processing and workflows. The Ceph file system storage, offering a POSIX compliant file system mounted similarly to an NFS share was of particular interest as it aligned with our requirements and was retained as our solution. I/O performance tests were run against the Ceph POSIX file system and have presented surprising results indicating true potential for fast I/O and reliability. STAR'S online compute farm historical use has been for job submission and first hand data analysis. The goal of reusing the online compute farm to maintain a storage cluster and job submission will be an efficient use of the current infrastructure.
MicROS-drt: supporting real-time and scalable data distribution in distributed robotic systems.
Ding, Bo; Wang, Huaimin; Fan, Zedong; Zhang, Pengfei; Liu, Hui
A primary requirement in distributed robotic software systems is the dissemination of data to all interested collaborative entities in a timely and scalable manner. However, providing such a service in a highly dynamic and resource-limited robotic environment is a challenging task, and existing robot software infrastructure has limitations in this aspect. This paper presents a novel robot software infrastructure, micROS-drt, which supports real-time and scalable data distribution. The solution is based on a loosely coupled data publish-subscribe model with the ability to support various time-related constraints. And to realize this model, a mature data distribution standard, the data distribution service for real-time systems (DDS), is adopted as the foundation of the transport layer of this software infrastructure. By elaborately adapting and encapsulating the capability of the underlying DDS middleware, micROS-drt can meet the requirement of real-time and scalable data distribution in distributed robotic systems. Evaluation results in terms of scalability, latency jitter and transport priority as well as the experiment on real robots validate the effectiveness of this work.
High performance data transfer
NASA Astrophysics Data System (ADS)
Cottrell, R.; Fang, C.; Hanushevsky, A.; Kreuger, W.; Yang, W.
2017-10-01
The exponentially increasing need for high speed data transfer is driven by big data, and cloud computing together with the needs of data intensive science, High Performance Computing (HPC), defense, the oil and gas industry etc. We report on the Zettar ZX software. This has been developed since 2013 to meet these growing needs by providing high performance data transfer and encryption in a scalable, balanced, easy to deploy and use way while minimizing power and space utilization. In collaboration with several commercial vendors, Proofs of Concept (PoC) consisting of clusters have been put together using off-the- shelf components to test the ZX scalability and ability to balance services using multiple cores, and links. The PoCs are based on SSD flash storage that is managed by a parallel file system. Each cluster occupies 4 rack units. Using the PoCs, between clusters we have achieved almost 200Gbps memory to memory over two 100Gbps links, and 70Gbps parallel file to parallel file with encryption over a 5000 mile 100Gbps link.
RAID Unbound: Storage Fault Tolerance in a Distributed Environment
NASA Technical Reports Server (NTRS)
Ritchie, Brian
1996-01-01
Mirroring, data replication, backup, and more recently, redundant arrays of independent disks (RAID) are all technologies used to protect and ensure access to critical company data. A new set of problems has arisen as data becomes more and more geographically distributed. Each of the technologies listed above provides important benefits; but each has failed to adapt fully to the realities of distributed computing. The key to data high availability and protection is to take the technologies' strengths and 'virtualize' them across a distributed network. RAID and mirroring offer high data availability, which data replication and backup provide strong data protection. If we take these concepts at a very granular level (defining user, record, block, file, or directory types) and them liberate them from the physical subsystems with which they have traditionally been associated, we have the opportunity to create a highly scalable network wide storage fault tolerance. The network becomes the virtual storage space in which the traditional concepts of data high availability and protection are implemented without their corresponding physical constraints.
Design and Execution of make-like, distributed Analyses based on Spotify’s Pipelining Package Luigi
NASA Astrophysics Data System (ADS)
Erdmann, M.; Fischer, B.; Fischer, R.; Rieger, M.
2017-10-01
In high-energy particle physics, workflow management systems are primarily used as tailored solutions in dedicated areas such as Monte Carlo production. However, physicists performing data analyses are usually required to steer their individual workflows manually which is time-consuming and often leads to undocumented relations between particular workloads. We present a generic analysis design pattern that copes with the sophisticated demands of end-to-end HEP analyses and provides a make-like execution system. It is based on the open-source pipelining package Luigi which was developed at Spotify and enables the definition of arbitrary workloads, so-called Tasks, and the dependencies between them in a lightweight and scalable structure. Further features are multi-user support, automated dependency resolution and error handling, central scheduling, and status visualization in the web. In addition to already built-in features for remote jobs and file systems like Hadoop and HDFS, we added support for WLCG infrastructure such as LSF and CREAM job submission, as well as remote file access through the Grid File Access Library. Furthermore, we implemented automated resubmission functionality, software sandboxing, and a command line interface with auto-completion for a convenient working environment. For the implementation of a t \\overline{{{t}}} H cross section measurement, we created a generic Python interface that provides programmatic access to all external information such as datasets, physics processes, statistical models, and additional files and values. In summary, the setup enables the execution of the entire analysis in a parallelized and distributed fashion with a single command.
A Comparison of Different Database Technologies for the CMS AsyncStageOut Transfer Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ciangottini, D.; Balcas, J.; Mascheroni, M.
AsyncStageOut (ASO) is the component of the CMS distributed data analysis system (CRAB) that manages users transfers in a centrally controlled way using the File Transfer System (FTS3) at CERN. It addresses a major weakness of the previous, decentralized model, namely that the transfer of the user’s output data to a single remote site was part of the job execution, resulting in inefficient use of job slots and an unacceptable failure rate. Currently ASO manages up to 600k files of various sizes per day from more than 500 users per month, spread over more than 100 sites. ASO uses amore » NoSQL database (CouchDB) as internal bookkeeping and as way to communicate with other CRAB components. Since ASO/CRAB were put in production in 2014, the number of transfers constantly increased up to a point where the pressure to the central CouchDB instance became critical, creating new challenges for the system scalability, performance, and monitoring. This forced a re-engineering of the ASO application to increase its scalability and lowering its operational effort. In this contribution we present a comparison of the performance of the current NoSQL implementation and a new SQL implementation, and how their different strengths and features influenced the design choices and operational experience. We also discuss other architectural changes introduced in the system to handle the increasing load and latency in delivering output to the user.« less
A comparison of different database technologies for the CMS AsyncStageOut transfer database
NASA Astrophysics Data System (ADS)
Ciangottini, D.; Balcas, J.; Mascheroni, M.; Rupeika, E. A.; Vaandering, E.; Riahi, H.; Silva, J. M. D.; Hernandez, J. M.; Belforte, S.; Ivanov, T. T.
2017-10-01
AsyncStageOut (ASO) is the component of the CMS distributed data analysis system (CRAB) that manages users transfers in a centrally controlled way using the File Transfer System (FTS3) at CERN. It addresses a major weakness of the previous, decentralized model, namely that the transfer of the user’s output data to a single remote site was part of the job execution, resulting in inefficient use of job slots and an unacceptable failure rate. Currently ASO manages up to 600k files of various sizes per day from more than 500 users per month, spread over more than 100 sites. ASO uses a NoSQL database (CouchDB) as internal bookkeeping and as way to communicate with other CRAB components. Since ASO/CRAB were put in production in 2014, the number of transfers constantly increased up to a point where the pressure to the central CouchDB instance became critical, creating new challenges for the system scalability, performance, and monitoring. This forced a re-engineering of the ASO application to increase its scalability and lowering its operational effort. In this contribution we present a comparison of the performance of the current NoSQL implementation and a new SQL implementation, and how their different strengths and features influenced the design choices and operational experience. We also discuss other architectural changes introduced in the system to handle the increasing load and latency in delivering output to the user.
Cloud Engineering Principles and Technology Enablers for Medical Image Processing-as-a-Service
Bao, Shunxing; Plassard, Andrew J.; Landman, Bennett A.; Gokhale, Aniruddha
2017-01-01
Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based “medical image processing-as-a-service” offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop’s distributed file system. Despite this promise, HBase’s load distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). This paper makes two contributions to address these concerns by describing key cloud engineering principles and technology enhancements we made to the Apache Hadoop ecosystem for medical imaging applications. First, we propose a row-key design for HBase, which is a necessary step that is driven by the hierarchical organization of imaging data. Second, we propose a novel data allocation policy within HBase to strongly enforce collocation of hierarchically related imaging data. The proposed enhancements accelerate data processing by minimizing network usage and localizing processing to machines where the data already exist. Moreover, our approach is amenable to the traditional scan, subject, and project-level analysis procedures, and is compatible with standard command line/scriptable image processing software. Experimental results for an illustrative sample of imaging data reveals that our new HBase policy results in a three-fold time improvement in conversion of classic DICOM to NiFTI file formats when compared with the default HBase region split policy, and nearly a six-fold improvement over a commonly available network file system (NFS) approach even for relatively small file sets. Moreover, file access latency is lower than network attached storage. PMID:28884169
Convergence of Internet and TV: The Commercial Viability of P2P Content Delivery
NASA Astrophysics Data System (ADS)
de Boever, Jorn
The popularity of (illegal) P2P (peer-to-peer) file sharing has a disruptive impact on Internet traffic and business models of content providers. In addition, several studies have found an increasing demand for bandwidth consuming content, such as video, on the Internet. Although P2P systems have been put forward as a scalable and inexpensive model to deliver such content, there has been relatively little economic analysis of the potentials and obstacles of P2P systems as a legal and commercial content distribution model. Many content providers encounter uncertainties regarding the adoption or rejection of P2P networks to spread content over the Internet. The recent launch of several commercial, legal P2P content distribution platforms increases the importance of an integrated analysis of the Strengths, Weaknesses, Opportunities and Threats (SWOT).
Multiple Independent File Parallel I/O with HDF5
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, M. C.
2016-07-13
The HDF5 library has supported the I/O requirements of HPC codes at Lawrence Livermore National Labs (LLNL) since the late 90’s. In particular, HDF5 used in the Multiple Independent File (MIF) parallel I/O paradigm has supported LLNL code’s scalable I/O requirements and has recently been gainfully used at scales as large as O(10 6) parallel tasks.
Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST)
Dowd, Scot E; Zaragoza, Joaquin; Rodriguez, Javier R; Oliver, Melvin J; Payton, Paxton R
2005-01-01
Background BLAST is one of the most common and useful tools for Genetic Research. This paper describes a software application we have termed Windows .NET Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST), which enhances the BLAST utility by improving usability, fault recovery, and scalability in a Windows desktop environment. Our goal was to develop an easy to use, fault tolerant, high-throughput BLAST solution that incorporates a comprehensive BLAST result viewer with curation and annotation functionality. Results W.ND-BLAST is a comprehensive Windows-based software toolkit that targets researchers, including those with minimal computer skills, and provides the ability increase the performance of BLAST by distributing BLAST queries to any number of Windows based machines across local area networks (LAN). W.ND-BLAST provides intuitive Graphic User Interfaces (GUI) for BLAST database creation, BLAST execution, BLAST output evaluation and BLAST result exportation. This software also provides several layers of fault tolerance and fault recovery to prevent loss of data if nodes or master machines fail. This paper lays out the functionality of W.ND-BLAST. W.ND-BLAST displays close to 100% performance efficiency when distributing tasks to 12 remote computers of the same performance class. A high throughput BLAST job which took 662.68 minutes (11 hours) on one average machine was completed in 44.97 minutes when distributed to 17 nodes, which included lower performance class machines. Finally, there is a comprehensive high-throughput BLAST Output Viewer (BOV) and Annotation Engine components, which provides comprehensive exportation of BLAST hits to text files, annotated fasta files, tables, or association files. Conclusion W.ND-BLAST provides an interactive tool that allows scientists to easily utilizing their available computing resources for high throughput and comprehensive sequence analyses. The install package for W.ND-BLAST is freely downloadable from . With registration the software is free, installation, networking, and usage instructions are provided as well as a support forum. PMID:15819992
Implementing Journaling in a Linux Shared Disk File System
NASA Technical Reports Server (NTRS)
Preslan, Kenneth W.; Barry, Andrew; Brassow, Jonathan; Cattelan, Russell; Manthei, Adam; Nygaard, Erling; VanOort, Seth; Teigland, David; Tilstra, Mike; O'Keefe, Matthew;
2000-01-01
In computer systems today, speed and responsiveness is often determined by network and storage subsystem performance. Faster, more scalable networking interfaces like Fibre Channel and Gigabit Ethernet provide the scaffolding from which higher performance computer systems implementations may be constructed, but new thinking is required about how machines interact with network-enabled storage devices. In this paper we describe how we implemented journaling in the Global File System (GFS), a shared-disk, cluster file system for Linux. Our previous three papers on GFS at the Mass Storage Symposium discussed our first three GFS implementations, their performance, and the lessons learned. Our fourth paper describes, appropriately enough, the evolution of GFS version 3 to version 4, which supports journaling and recovery from client failures. In addition, GFS scalability tests extending to 8 machines accessing 8 4-disk enclosures were conducted: these tests showed good scaling. We describe the GFS cluster infrastructure, which is necessary for proper recovery from machine and disk failures in a collection of machines sharing disks using GFS. Finally, we discuss the suitability of Linux for handling the big data requirements of supercomputing centers.
Virtual file system on NoSQL for processing high volumes of HL7 messages.
Kimura, Eizen; Ishihara, Ken
2015-01-01
The Standardized Structured Medical Information Exchange (SS-MIX) is intended to be the standard repository for HL7 messages that depend on a local file system. However, its scalability is limited. We implemented a virtual file system using NoSQL to incorporate modern computing technology into SS-MIX and allow the system to integrate local patient IDs from different healthcare systems into a universal system. We discuss its implementation using the database MongoDB and describe its performance in a case study.
a Hadoop-Based Distributed Framework for Efficient Managing and Processing Big Remote Sensing Images
NASA Astrophysics Data System (ADS)
Wang, C.; Hu, F.; Hu, X.; Zhao, S.; Wen, W.; Yang, C.
2015-07-01
Various sensors from airborne and satellite platforms are producing large volumes of remote sensing images for mapping, environmental monitoring, disaster management, military intelligence, and others. However, it is challenging to efficiently storage, query and process such big data due to the data- and computing- intensive issues. In this paper, a Hadoop-based framework is proposed to manage and process the big remote sensing data in a distributed and parallel manner. Especially, remote sensing data can be directly fetched from other data platforms into the Hadoop Distributed File System (HDFS). The Orfeo toolbox, a ready-to-use tool for large image processing, is integrated into MapReduce to provide affluent image processing operations. With the integration of HDFS, Orfeo toolbox and MapReduce, these remote sensing images can be directly processed in parallel in a scalable computing environment. The experiment results show that the proposed framework can efficiently manage and process such big remote sensing data.
Sequential data access with Oracle and Hadoop: a performance comparison
NASA Astrophysics Data System (ADS)
Baranowski, Zbigniew; Canali, Luca; Grancher, Eric
2014-06-01
The Hadoop framework has proven to be an effective and popular approach for dealing with "Big Data" and, thanks to its scaling ability and optimised storage access, Hadoop Distributed File System-based projects such as MapReduce or HBase are seen as candidates to replace traditional relational database management systems whenever scalable speed of data processing is a priority. But do these projects deliver in practice? Does migrating to Hadoop's "shared nothing" architecture really improve data access throughput? And, if so, at what cost? Authors answer these questions-addressing cost/performance as well as raw performance- based on a performance comparison between an Oracle-based relational database and Hadoop's distributed solutions like MapReduce or HBase for sequential data access. A key feature of our approach is the use of an unbiased data model as certain data models can significantly favour one of the technologies tested.
Dynamic Non-Hierarchical File Systems for Exascale Storage
DOE Office of Scientific and Technical Information (OSTI.GOV)
Long, Darrell E.; Miller, Ethan L
This constitutes the final report for “Dynamic Non-Hierarchical File Systems for Exascale Storage”. The ultimate goal of this project was to improve data management in scientific computing and high-end computing (HEC) applications, and to achieve this goal we proposed: to develop the first, HEC-targeted, file system featuring rich metadata and provenance collection, extreme scalability, and future storage hardware integration as core design goals, and to evaluate and develop a flexible non-hierarchical file system interface suitable for providing more powerful and intuitive data management interfaces to HEC and scientific computing users. Data management is swiftly becoming a serious problem in themore » scientific community – while copious amounts of data are good for obtaining results, finding the right data is often daunting and sometimes impossible. Scientists participating in a Department of Energy workshop noted that most of their time was spent “...finding, processing, organizing, and moving data and it’s going to get much worse”. Scientists should not be forced to become data mining experts in order to retrieve the data they want, nor should they be expected to remember the naming convention they used several years ago for a set of experiments they now wish to revisit. Ideally, locating the data you need would be as easy as browsing the web. Unfortunately, existing data management approaches are usually based on hierarchical naming, a 40 year-old technology designed to manage thousands of files, not exabytes of data. Today’s systems do not take advantage of the rich array of metadata that current high-end computing (HEC) file systems can gather, including content-based metadata and provenance1 information. As a result, current metadata search approaches are typically ad hoc and often work by providing a parallel management system to the “main” file system, as is done in Linux (the locate utility), personal computers, and enterprise search appliances. These search applications are often optimized for a single file system, making it difficult to move files and their metadata between file systems. Users have tried to solve this problem in several ways, including the use of separate databases to index file properties, the encoding of file properties into file names, and separately gathering and managing provenance data, but none of these approaches has worked well, either due to limited usefulness or scalability, or both. Our research addressed several key issues: High-performance, real-time metadata harvesting: extracting important attributes from files dynamically and immediately updating indexes used to improve search; Transparent, automatic, and secure provenance capture: recording the data inputs and processing steps used in the production of each file in the system; Scalable indexing: indexes that are optimized for integration with the file system; Dynamic file system structure: our approach provides dynamic directories similar to those in semantic file systems, but these are the native organization rather than a feature grafted onto a conventional system. In addition to these goals, our research effort will include evaluating the impact of new storage technologies on the file system design and performance. In particular, the indexing and metadata harvesting functions can potentially benefit from the performance improvements promised by new storage class memories.« less
Architecture Knowledge for Evaluating Scalable Databases
2015-01-16
problems, arising from the proliferation of new data models and distributed technologies for building scalable, available data stores . Architects must...longer are relational databases the de facto standard for building data repositories. Highly distributed, scalable “ NoSQL ” databases [11] have emerged...This is especially challenging at the data storage layer. The multitude of competing NoSQL database technologies creates a complex and rapidly
Scalable Quantum Networks for Distributed Computing and Sensing
2016-04-01
probabilistic measurement , so we developed quantum memories and guided-wave implementations of same, demonstrating controlled delay of a heralded single...Second, fundamental scalability requires a method to synchronize protocols based on quantum measurements , which are inherently probabilistic. To meet...AFRL-AFOSR-UK-TR-2016-0007 Scalable Quantum Networks for Distributed Computing and Sensing Ian Walmsley THE UNIVERSITY OF OXFORD Final Report 04/01
CMS users data management service integration and first experiences with its NoSQL data storage
NASA Astrophysics Data System (ADS)
Riahi, H.; Spiga, D.; Boccali, T.; Ciangottini, D.; Cinquilli, M.; Hernàndez, J. M.; Konstantinov, P.; Mascheroni, M.; Santocchia, A.
2014-06-01
The distributed data analysis workflow in CMS assumes that jobs run in a different location to where their results are finally stored. Typically the user outputs must be transferred from one site to another by a dedicated CMS service, AsyncStageOut. This new service is originally developed to address the inefficiency in using the CMS computing resources when transferring the analysis job outputs, synchronously, once they are produced in the job execution node to the remote site. The AsyncStageOut is designed as a thin application relying only on the NoSQL database (CouchDB) as input and data storage. It has progressed from a limited prototype to a highly adaptable service which manages and monitors the whole user files steps, namely file transfer and publication. The AsyncStageOut is integrated with the Common CMS/Atlas Analysis Framework. It foresees the management of nearly nearly 200k users' files per day of close to 1000 individual users per month with minimal delays, and providing a real time monitoring and reports to users and service operators, while being highly available. The associated data volume represents a new set of challenges in the areas of database scalability and service performance and efficiency. In this paper, we present an overview of the AsyncStageOut model and the integration strategy with the Common Analysis Framework. The motivations for using the NoSQL technology are also presented, as well as data design and the techniques used for efficient indexing and monitoring of the data. We describe deployment model for the high availability and scalability of the service. We also discuss the hardware requirements and the results achieved as they were determined by testing with actual data and realistic loads during the commissioning and the initial production phase with the Common Analysis Framework.
Survey of MapReduce frame operation in bioinformatics.
Zou, Quan; Li, Xu-Bin; Jiang, Wen-Rui; Lin, Zi-Yu; Li, Gui-Lin; Chen, Ke
2014-07-01
Bioinformatics is challenged by the fact that traditional analysis tools have difficulty in processing large-scale data from high-throughput sequencing. The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce frame-based applications that can be employed in the next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as the future works on parallel computing in bioinformatics. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
Scalable Unix tools on parallel processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gropp, W.; Lusk, E.
1994-12-31
The introduction of parallel processors that run a separate copy of Unix on each process has introduced new problems in managing the user`s environment. This paper discusses some generalizations of common Unix commands for managing files (e.g. 1s) and processes (e.g. ps) that are convenient and scalable. These basic tools, just like their Unix counterparts, are text-based. We also discuss a way to use these with a graphical user interface (GUI). Some notes on the implementation are provided. Prototypes of these commands are publicly available.
Simple, Scalable, Script-Based Science Processor (S4P)
NASA Technical Reports Server (NTRS)
Lynnes, Christopher; Vollmer, Bruce; Berrick, Stephen; Mack, Robert; Pham, Long; Zhou, Bryan; Wharton, Stephen W. (Technical Monitor)
2001-01-01
The development and deployment of data processing systems to process Earth Observing System (EOS) data has proven to be costly and prone to technical and schedule risk. Integration of science algorithms into a robust operational system has been difficult. The core processing system, based on commercial tools, has demonstrated limitations at the rates needed to produce the several terabytes per day for EOS, primarily due to job management overhead. This has motivated an evolution in the EOS Data Information System toward a more distributed one incorporating Science Investigator-led Processing Systems (SIPS). As part of this evolution, the Goddard Earth Sciences Distributed Active Archive Center (GES DAAC) has developed a simplified processing system to accommodate the increased load expected with the advent of reprocessing and launch of a second satellite. This system, the Simple, Scalable, Script-based Science Processor (S42) may also serve as a resource for future SIPS. The current EOSDIS Core System was designed to be general, resulting in a large, complex mix of commercial and custom software. In contrast, many simpler systems, such as the EROS Data Center AVHRR IKM system, rely on a simple directory structure to drive processing, with directories representing different stages of production. The system passes input data to a directory, and the output data is placed in a "downstream" directory. The GES DAAC's Simple Scalable Script-based Science Processing System is based on the latter concept, but with modifications to allow varied science algorithms and improve portability. It uses a factory assembly-line paradigm: when work orders arrive at a station, an executable is run, and output work orders are sent to downstream stations. The stations are implemented as UNIX directories, while work orders are simple ASCII files. The core S4P infrastructure consists of a Perl program called stationmaster, which detects newly arrived work orders and forks a job to run the appropriate executable (registered in a configuration file for that station). Although S4P is written in Perl, the executables associated with a station can be any program that can be run from the command line, i.e., non-interactively. An S4P instance is typically monitored using a simple Graphical User Interface. However, the reliance of S4P on UNIX files and directories also allows visibility into the state of stations and jobs using standard operating system commands, permitting remote monitor/control over low-bandwidth connections. S4P is being used as the foundation for several small- to medium-size systems for data mining, on-demand subsetting, processing of direct broadcast Moderate Resolution Imaging Spectroradiometer (MODIS) data, and Quick-Response MODIS processing. It has also been used to implement a large-scale system to process MODIS Level 1 and Level 2 Standard Products, which will ultimately process close to 2 TB/day.
Sideloading - Ingestion of Large Point Clouds Into the Apache Spark Big Data Engine
NASA Astrophysics Data System (ADS)
Boehm, J.; Liu, K.; Alis, C.
2016-06-01
In the geospatial domain we have now reached the point where data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore naturally lucrative to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not naturally supported by the existing big data frameworks. Instead such file formats are supported by software libraries that are restricted to single CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method to load billions of points into a commodity hardware compute cluster and we discuss the implications on scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
Boubela, Roland N.; Kalcher, Klaudius; Huf, Wolfgang; Našel, Christian; Moser, Ewald
2016-01-01
Technologies for scalable analysis of very large datasets have emerged in the domain of internet computing, but are still rarely used in neuroimaging despite the existence of data and research questions in need of efficient computation tools especially in fMRI. In this work, we present software tools for the application of Apache Spark and Graphics Processing Units (GPUs) to neuroimaging datasets, in particular providing distributed file input for 4D NIfTI fMRI datasets in Scala for use in an Apache Spark environment. Examples for using this Big Data platform in graph analysis of fMRI datasets are shown to illustrate how processing pipelines employing it can be developed. With more tools for the convenient integration of neuroimaging file formats and typical processing steps, big data technologies could find wider endorsement in the community, leading to a range of potentially useful applications especially in view of the current collaborative creation of a wealth of large data repositories including thousands of individual fMRI datasets. PMID:26778951
Zebra: A striped network file system
NASA Technical Reports Server (NTRS)
Hartman, John H.; Ousterhout, John K.
1992-01-01
The design of Zebra, a striped network file system, is presented. Zebra applies ideas from log-structured file system (LFS) and RAID research to network file systems, resulting in a network file system that has scalable performance, uses its servers efficiently even when its applications are using small files, and provides high availability. Zebra stripes file data across multiple servers, so that the file transfer rate is not limited by the performance of a single server. High availability is achieved by maintaining parity information for the file system. If a server fails its contents can be reconstructed using the contents of the remaining servers and the parity information. Zebra differs from existing striped file systems in the way it stripes file data: Zebra does not stripe on a per-file basis; instead it stripes the stream of bytes written by each client. Clients write to the servers in units called stripe fragments, which are analogous to segments in an LFS. Stripe fragments contain file blocks that were written recently, without regard to which file they belong. This method of striping has numerous advantages over per-file striping, including increased server efficiency, efficient parity computation, and elimination of parity update.
Monitoring Data-Structure Evolution in Distributed Message-Passing Programs
NASA Technical Reports Server (NTRS)
Sarukkai, Sekhar R.; Beers, Andrew; Woodrow, Thomas S. (Technical Monitor)
1996-01-01
Monitoring the evolution of data structures in parallel and distributed programs, is critical for debugging its semantics and performance. However, the current state-of-art in tracking and presenting data-structure information on parallel and distributed environments is cumbersome and does not scale. In this paper we present a methodology that automatically tracks memory bindings (not the actual contents) of static and dynamic data-structures of message-passing C programs, using PVM. With the help of a number of examples we show that in addition to determining the impact of memory allocation overheads on program performance, graphical views can help in debugging the semantics of program execution. Scalable animations of virtual address bindings of source-level data-structures are used for debugging the semantics of parallel programs across all processors. In conjunction with light-weight core-files, this technique can be used to complement traditional debuggers on single processors. Detailed information (such as data-structure contents), on specific nodes, can be determined using traditional debuggers after the data structure evolution leading to the semantic error is observed graphically.
geoKepler Workflow Module for Computationally Scalable and Reproducible Geoprocessing and Modeling
NASA Astrophysics Data System (ADS)
Cowart, C.; Block, J.; Crawl, D.; Graham, J.; Gupta, A.; Nguyen, M.; de Callafon, R.; Smarr, L.; Altintas, I.
2015-12-01
The NSF-funded WIFIRE project has developed an open-source, online geospatial workflow platform for unifying geoprocessing tools and models for for fire and other geospatially dependent modeling applications. It is a product of WIFIRE's objective to build an end-to-end cyberinfrastructure for real-time and data-driven simulation, prediction and visualization of wildfire behavior. geoKepler includes a set of reusable GIS components, or actors, for the Kepler Scientific Workflow System (https://kepler-project.org). Actors exist for reading and writing GIS data in formats such as Shapefile, GeoJSON, KML, and using OGC web services such as WFS. The actors also allow for calling geoprocessing tools in other packages such as GDAL and GRASS. Kepler integrates functions from multiple platforms and file formats into one framework, thus enabling optimal GIS interoperability, model coupling, and scalability. Products of the GIS actors can be fed directly to models such as FARSITE and WRF. Kepler's ability to schedule and scale processes using Hadoop and Spark also makes geoprocessing ultimately extensible and computationally scalable. The reusable workflows in geoKepler can be made to run automatically when alerted by real-time environmental conditions. Here, we show breakthroughs in the speed of creating complex data for hazard assessments with this platform. We also demonstrate geoKepler workflows that use Data Assimilation to ingest real-time weather data into wildfire simulations, and for data mining techniques to gain insight into environmental conditions affecting fire behavior. Existing machine learning tools and libraries such as R and MLlib are being leveraged for this purpose in Kepler, as well as Kepler's Distributed Data Parallel (DDP) capability to provide a framework for scalable processing. geoKepler workflows can be executed via an iPython notebook as a part of a Jupyter hub at UC San Diego for sharing and reporting of the scientific analysis and results from various runs of geoKepler workflows. The communication between iPython and Kepler workflow executions is established through an iPython magic function for Kepler that we have implemented. In summary, geoKepler is an ecosystem that makes geospatial processing and analysis of any kind programmable, reusable, scalable and sharable.
A Massively Parallel Computational Method of Reading Index Files for SOAPsnv.
Zhu, Xiaoqian; Peng, Shaoliang; Liu, Shaojie; Cui, Yingbo; Gu, Xiang; Gao, Ming; Fang, Lin; Fang, Xiaodong
2015-12-01
SOAPsnv is the software used for identifying the single nucleotide variation in cancer genes. However, its performance is yet to match the massive amount of data to be processed. Experiments reveal that the main performance bottleneck of SOAPsnv software is the pileup algorithm. The original pileup algorithm's I/O process is time-consuming and inefficient to read input files. Moreover, the scalability of the pileup algorithm is also poor. Therefore, we designed a new algorithm, named BamPileup, aiming to improve the performance of sequential read, and the new pileup algorithm implemented a parallel read mode based on index. Using this method, each thread can directly read the data start from a specific position. The results of experiments on the Tianhe-2 supercomputer show that, when reading data in a multi-threaded parallel I/O way, the processing time of algorithm is reduced to 3.9 s and the application program can achieve a speedup up to 100×. Moreover, the scalability of the new algorithm is also satisfying.
Integration experiences and performance studies of A COTS parallel archive systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Bary
2010-01-01
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf(COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and lessmore » robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petaflop/s computing system, LANL's Roadrunner, and demonstrated its capability to address requirements of future archival storage systems.« less
Integration experiments and performance studies of a COTS parallel archive system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Gary
2010-06-16
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching andmore » less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, Is, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petafiop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address requirements of future archival storage systems.« less
WLCG Transfers Dashboard: a Unified Monitoring Tool for Heterogeneous Data Transfers
NASA Astrophysics Data System (ADS)
Andreeva, J.; Beche, A.; Belov, S.; Kadochnikov, I.; Saiz, P.; Tuckett, D.
2014-06-01
The Worldwide LHC Computing Grid provides resources for the four main virtual organizations. Along with data processing, data distribution is the key computing activity on the WLCG infrastructure. The scale of this activity is very large, the ATLAS virtual organization (VO) alone generates and distributes more than 40 PB of data in 100 million files per year. Another challenge is the heterogeneity of data transfer technologies. Currently there are two main alternatives for data transfers on the WLCG: File Transfer Service and XRootD protocol. Each LHC VO has its own monitoring system which is limited to the scope of that particular VO. There is a need for a global system which would provide a complete cross-VO and cross-technology picture of all WLCG data transfers. We present a unified monitoring tool - WLCG Transfers Dashboard - where all the VOs and technologies coexist and are monitored together. The scale of the activity and the heterogeneity of the system raise a number of technical challenges. Each technology comes with its own monitoring specificities and some of the VOs use several of these technologies. This paper describes the implementation of the system with particular focus on the design principles applied to ensure the necessary scalability and performance, and to easily integrate any new technology providing additional functionality which might be specific to that technology.
Estimates of the Sampling Distribution of Scalability Coefficient H
ERIC Educational Resources Information Center
Van Onna, Marieke J. H.
2004-01-01
Coefficient "H" is used as an index of scalability in nonparametric item response theory (NIRT). It indicates the degree to which a set of items rank orders examinees. Theoretical sampling distributions, however, have only been derived asymptotically and only under restrictive conditions. Bootstrap methods offer an alternative possibility to…
Equalizer: a scalable parallel rendering framework.
Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato
2009-01-01
Continuing improvements in CPU and GPU performances as well as increasing multi-core processor and cluster-based parallelism demand for flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture, the basic API, discuss its advantages over previous approaches, present example configurations and usage scenarios as well as scalability results.
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo
2014-01-01
Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts. Available under the open source MIT license at http://sourceforge.net/projects/seqpig/
AstroVis: Visualizing astronomical data cubes
NASA Astrophysics Data System (ADS)
Finniss, Stephen; Tyler, Robin; Questiaux, Jacques
2016-08-01
AstroVis enables rapid visualization of large data files on platforms supporting the OpenGL rendering library. Radio astronomical observations are typically three dimensional and stored as data cubes. AstroVis implements a scalable approach to accessing these files using three components: a File Access Component (FAC) that reduces the impact of reading time, which speeds up access to the data; the Image Processing Component (IPC), which breaks up the data cube into smaller pieces that can be processed locally and gives a representation of the whole file; and Data Visualization, which implements an approach of Overview + Detail to reduces the dimensions of the data being worked with and the amount of memory required to store it. The result is a 3D display paired with a 2D detail display that contains a small subsection of the original file in full resolution without reducing the data in any way.
Alves, F.
2015-01-01
We prepared new and scalable, hybrid inorganic–organic step-growth hydrogels with polyhedral oligomeric silsesquioxane (POSS) network knot construction elements and hydrolytically degradable poly(ethylene glycol) (PEG) di-ester macromonomers by in situ radical-mediated thiol–ene photopolymerization. The physicochemical properties of the gels are fine-tailored over orders of magnitude including functionalization of their interior, a hierarchical gel structure, and biodegradability. PMID:25821524
Commercial imagery archive, management, exploitation, and distribution project development
NASA Astrophysics Data System (ADS)
Hollinger, Bruce; Sakkas, Alysa
1999-10-01
The Lockheed Martin (LM) team had garnered over a decade of operational experience on the U.S. Government's IDEX II (Imagery Dissemination and Exploitation) system. Recently, it set out to create a new commercial product to serve the needs of large-scale imagery archiving and analysis markets worldwide. LM decided to provide a turnkey commercial solution to receive, store, retrieve, process, analyze and disseminate in 'push' or 'pull' modes imagery, data and data products using a variety of sources and formats. LM selected 'best of breed' hardware and software components and adapted and developed its own algorithms to provide added functionality not commercially available elsewhere. The resultant product, Intelligent Library System (ILS)TM, satisfies requirements for (1) a potentially unbounded, data archive (5000 TB range) (2) automated workflow management for increased user productivity; (3) automatic tracking and management of files stored on shelves; (4) ability to ingest, process and disseminate data volumes with bandwidths ranging up to multi- gigabit per second; (5) access through a thin client-to-server network environment; (6) multiple interactive users needing retrieval of files in seconds from both archived images or in real time, and (7) scalability that maintains information throughput performance as the size of the digital library grows.
Commercial imagery archive, management, exploitation, and distribution product development
NASA Astrophysics Data System (ADS)
Hollinger, Bruce; Sakkas, Alysa
1999-12-01
The Lockheed Martin (LM) team had garnered over a decade of operational experience on the U.S. Government's IDEX II (Imagery Dissemination and Exploitation) system. Recently, it set out to create a new commercial product to serve the needs of large-scale imagery archiving and analysis markets worldwide. LM decided to provide a turnkey commercial solution to receive, store, retrieve, process, analyze and disseminate in 'push' or 'pull' modes imagery, data and data products using a variety of sources and formats. LM selected 'best of breed' hardware and software components and adapted and developed its own algorithms to provide added functionality not commercially available elsewhere. The resultant product, Intelligent Library System (ILS)TM, satisfies requirements for (a) a potentially unbounded, data archive (5000 TB range) (b) automated workflow management for increased user productivity; (c) automatic tracking and management of files stored on shelves; (d) ability to ingest, process and disseminate data volumes with bandwidths ranging up to multi- gigabit per second; (e) access through a thin client-to-server network environment; (f) multiple interactive users needing retrieval of files in seconds from both archived images or in real time, and (g) scalability that maintains information throughput performance as the size of the digital library grows.
A Grid Metadata Service for Earth and Environmental Sciences
NASA Astrophysics Data System (ADS)
Fiore, Sandro; Negro, Alessandro; Aloisio, Giovanni
2010-05-01
Critical challenges for climate modeling researchers are strongly connected with the increasingly complex simulation models and the huge quantities of produced datasets. Future trends in climate modeling will only increase computational and storage requirements. For this reason the ability to transparently access to both computational and data resources for large-scale complex climate simulations must be considered as a key requirement for Earth Science and Environmental distributed systems. From the data management perspective (i) the quantity of data will continuously increases, (ii) data will become more and more distributed and widespread, (iii) data sharing/federation will represent a key challenging issue among different sites distributed worldwide, (iv) the potential community of users (large and heterogeneous) will be interested in discovery experimental results, searching of metadata, browsing collections of files, compare different results, display output, etc.; A key element to carry out data search and discovery, manage and access huge and distributed amount of data is the metadata handling framework. What we propose for the management of distributed datasets is the GRelC service (a data grid solution focusing on metadata management). Despite the classical approaches, the proposed data-grid solution is able to address scalability, transparency, security and efficiency and interoperability. The GRelC service we propose is able to provide access to metadata stored in different and widespread data sources (relational databases running on top of MySQL, Oracle, DB2, etc. leveraging SQL as query language, as well as XML databases - XIndice, eXist, and libxml2 based documents, adopting either XPath or XQuery) providing a strong data virtualization layer in a grid environment. Such a technological solution for distributed metadata management leverages on well known adopted standards (W3C, OASIS, etc.); (ii) supports role-based management (based on VOMS), which increases flexibility and scalability; (iii) provides full support for Grid Security Infrastructure, which means (authorization, mutual authentication, data integrity, data confidentiality and delegation); (iv) is compatible with existing grid middleware such as gLite and Globus and finally (v) is currently adopted at the Euro-Mediterranean Centre for Climate Change (CMCC - Italy) to manage the entire CMCC data production activity as well as in the international Climate-G testbed.
Scalable cloud without dedicated storage
NASA Astrophysics Data System (ADS)
Batkovich, D. V.; Kompaniets, M. V.; Zarochentsev, A. K.
2015-05-01
We present a prototype of a scalable computing cloud. It is intended to be deployed on the basis of a cluster without the separate dedicated storage. The dedicated storage is replaced by the distributed software storage. In addition, all cluster nodes are used both as computing nodes and as storage nodes. This solution increases utilization of the cluster resources as well as improves fault tolerance and performance of the distributed storage. Another advantage of this solution is high scalability with a relatively low initial and maintenance cost. The solution is built on the basis of the open source components like OpenStack, CEPH, etc.
Tuning HDF5 for Lustre File Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Howison, Mark; Koziol, Quincey; Knaak, David
2010-09-24
HDF5 is a cross-platform parallel I/O library that is used by a wide variety of HPC applications for the flexibility of its hierarchical object-database representation of scientific data. We describe our recent work to optimize the performance of the HDF5 and MPI-IO libraries for the Lustre parallel file system. We selected three different HPC applications to represent the diverse range of I/O requirements, and measured their performance on three different systems to demonstrate the robustness of our optimizations across different file system configurations and to validate our optimization strategy. We demonstrate that the combined optimizations improve HDF5 parallel I/O performancemore » by up to 33 times in some cases running close to the achievable peak performance of the underlying file system and demonstrate scalable performance up to 40,960-way concurrency.« less
Design of an H.264/SVC resilient watermarking scheme
NASA Astrophysics Data System (ADS)
Van Caenegem, Robrecht; Dooms, Ann; Barbarien, Joeri; Schelkens, Peter
2010-01-01
The rapid dissemination of media technologies has lead to an increase of unauthorized copying and distribution of digital media. Digital watermarking, i.e. embedding information in the multimedia signal in a robust and imperceptible manner, can tackle this problem. Recently, there has been a huge growth in the number of different terminals and connections that can be used to consume multimedia. To tackle the resulting distribution challenges, scalable coding is often employed. Scalable coding allows the adaptation of a single bit-stream to varying terminal and transmission characteristics. As a result of this evolution, watermarking techniques that are robust against scalable compression become essential in order to control illegal copying. In this paper, a watermarking technique resilient against scalable video compression using the state-of-the-art H.264/SVC codec is therefore proposed and evaluated.
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop
Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo
2014-01-01
Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts. Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/ Contact: andre.schumacher@yahoo.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24149054
Medusa: A Scalable MR Console Using USB
Stang, Pascal P.; Conolly, Steven M.; Santos, Juan M.; Pauly, John M.; Scott, Greig C.
2012-01-01
MRI pulse sequence consoles typically employ closed proprietary hardware, software, and interfaces, making difficult any adaptation for innovative experimental technology. Yet MRI systems research is trending to higher channel count receivers, transmitters, gradient/shims, and unique interfaces for interventional applications. Customized console designs are now feasible for researchers with modern electronic components, but high data rates, synchronization, scalability, and cost present important challenges. Implementing large multi-channel MR systems with efficiency and flexibility requires a scalable modular architecture. With Medusa, we propose an open system architecture using the Universal Serial Bus (USB) for scalability, combined with distributed processing and buffering to address the high data rates and strict synchronization required by multi-channel MRI. Medusa uses a modular design concept based on digital synthesizer, receiver, and gradient blocks, in conjunction with fast programmable logic for sampling and synchronization. Medusa is a form of synthetic instrument, being reconfigurable for a variety of medical/scientific instrumentation needs. The Medusa distributed architecture, scalability, and data bandwidth limits are presented, and its flexibility is demonstrated in a variety of novel MRI applications. PMID:21954200
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
2012-12-14
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Matei Zaharia Tathagata Das Haoyuan Li Timothy Hunter Scott Shenker Ion...SUBTITLE Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing 5a. CONTRACT NUMBER 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER...time. However, current programming models for distributed stream processing are relatively low-level often leaving the user to worry about consistency of
Modular Universal Scalable Ion-trap Quantum Computer
2016-06-02
SECURITY CLASSIFICATION OF: The main goal of the original MUSIQC proposal was to construct and demonstrate a modular and universally- expandable ion...Distribution Unlimited UU UU UU UU 02-06-2016 1-Aug-2010 31-Jan-2016 Final Report: Modular Universal Scalable Ion-trap Quantum Computer The views...P.O. Box 12211 Research Triangle Park, NC 27709-2211 Ion trap quantum computation, scalable modular architectures REPORT DOCUMENTATION PAGE 11
BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data.
Ausmees, Kristiina; John, Aji; Toor, Salman Z; Hellander, Andreas; Nettelblad, Carl
2018-06-26
The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without the need to have access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API and an easy-to-use web user interface to execute custom filters on large genomic datasets. We present BAMSI, a Software as-a Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data is available in private or public clouds - an increasingly common scenario for both academic and commercial cloud providers - our framework allows for seamless deployment of filtering workers close to data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes data set. BAMSI constitutes a framework for efficient filtering of large genomic data sets that is flexible in the use of compute as well as storage resources. The data resulting from the filter is assumed to be greatly reduced in size, and can easily be downloaded or routed into e.g. a Hadoop cluster for subsequent interactive analysis using Hive, Spark or similar tools. In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible by offering the possibility for organizations to share the cost of hosting data on hot storage, without compromising the scalability of downstream analysis.
OpenPET Hardware, Firmware, Software, and Board Design Files
DOE Office of Scientific and Technical Information (OSTI.GOV)
Abu-Nimeh, Faisal; Choong, Woon-Sengq; Moses, William W.
OpenPET is an open source, flexible, high-performance, and modular data acquisition system for a variety of applications. The OpenPET electronics are capable of reading analog voltage or current signals from a wide variety of sensors. The electronics boards make extensive use of field programmable gate arrays (FPGAs) to provide flexibility and scalability. Firmware and software for the FPGAs and computer are used to control and acquire data from the system. The command and control flow is similar to the data flow, however, the commands are initiated from the computer similar to a tree topology (i.e., from top-to-bottom). Each node inmore » the tree discovers its parent and children, and all addresses are configured accordingly. A user (or a script) initiates a command from the computer. This command will be translated and encoded to the corresponding child (e.g., SB, MB, DB, etc.). Consecutively, each node will pass the command to its corresponding child(ren) by looking at the destination address. Finally, once the command reaches its desired destination(s) the corresponding node(s) execute(s) the command and send(s) a reply, if required. All the firmware, software, and the electronics board design files are distributed through the OpenPET website (http://openpet.lbl.gov).« less
Automation of multi-agent control for complex dynamic systems in heterogeneous computational network
NASA Astrophysics Data System (ADS)
Oparin, Gennady; Feoktistov, Alexander; Bogdanova, Vera; Sidorov, Ivan
2017-01-01
The rapid progress of high-performance computing entails new challenges related to solving large scientific problems for various subject domains in a heterogeneous distributed computing environment (e.g., a network, Grid system, or Cloud infrastructure). The specialists in the field of parallel and distributed computing give the special attention to a scalability of applications for problem solving. An effective management of the scalable application in the heterogeneous distributed computing environment is still a non-trivial issue. Control systems that operate in networks, especially relate to this issue. We propose a new approach to the multi-agent management for the scalable applications in the heterogeneous computational network. The fundamentals of our approach are the integrated use of conceptual programming, simulation modeling, network monitoring, multi-agent management, and service-oriented programming. We developed a special framework for an automation of the problem solving. Advantages of the proposed approach are demonstrated on the parametric synthesis example of the static linear regulator for complex dynamic systems. Benefits of the scalable application for solving this problem include automation of the multi-agent control for the systems in a parallel mode with various degrees of its detailed elaboration.
Velentgas, Priscilla; Bohn, Rhonda L; Brown, Jeffrey S; Chan, K Arnold; Gladowski, Patricia; Holick, Crystal N; Kramer, Judith M; Nakasato, Cynthia; Spettell, Claire M; Walker, Alexander M; Zhang, Fang; Platt, Richard
2008-12-01
We describe a multi-center post-marketing safety study that uses distributed data methods to minimize the need for covered entities to share protected health information (PHI). Implementation has addressed several issues relevant to creation of a large scale post-marketing drug safety surveillance system envisioned by the FDA's Sentinel Initiative. This retrospective cohort study of Guillain-Barré syndrome (GBS) following meningococcal conjugate vaccination incorporates the data and analytic expertise of five research organizations closely affiliated with US health insurers. The study uses administrative claims data, plus review of full text medical records to adjudicate the status of individuals with a diagnosis code for GBS (ICD9 357.0). A distributed network approach is used to create the analysis files and to perform most aspects of the analysis, allowing nearly all of the data to remain behind institutional firewalls. Pooled analysis files transferred to a central site will contain one record per person for approximately 0.2% of the study population, and contain PHI limited to the month and year of GBS onset for cases or the index date for matched controls. The first planned data extraction identified over 9 million eligible adolescents in the target age range of 11-21 years. They contributed an average of 14 months of eligible time on study over 27 months of calendar time. MCV4 vaccination coverage levels exceeded 20% among 17-18-year olds and 16% among 11-13 and 14-16-year-old age groups by the second quarter of 2007. This study demonstrates the feasibility of using a distributed data network approach to perform large scale post-marketing safety analyses and is scalable to include additional organizations and data sources. We believe these results can inform the development of a large national surveillance system. Copyright (c) 2008 John Wiley & Sons, Ltd.
Integrating Data Distribution and Data Assimilation Between the OOI CI and the NOAA DIF
NASA Astrophysics Data System (ADS)
Meisinger, M.; Arrott, M.; Clemesha, A.; Farcas, C.; Farcas, E.; Im, T.; Schofield, O.; Krueger, I.; Klacansky, I.; Orcutt, J.; Peach, C.; Chave, A.; Raymer, D.; Vernon, F.
2008-12-01
The Ocean Observatories Initiative (OOI) is an NSF funded program to establish the ocean observing infrastructure of the 21st century benefiting research and education. It is currently approaching final design and promises to deliver cyber and physical observatory infrastructure components as well as substantial core instrumentation to study environmental processes of the ocean at various scales, from coastal shelf-slope exchange processes to the deep ocean. The OOI's data distribution network lies at the heart of its cyber- infrastructure, which enables a multitude of science and education applications, ranging from data analysis, to processing, visualization and ontology supported query and mediation. In addition, it fundamentally supports a class of applications exploiting the knowledge gained from analyzing observational data for objective-driven ocean observing applications, such as automatically triggered response to episodic environmental events and interactive instrument tasking and control. The U.S. Department of Commerce through NOAA operates the Integrated Ocean Observing System (IOOS) providing continuous data in various formats, rates and scales on open oceans and coastal waters to scientists, managers, businesses, governments, and the public to support research and inform decision-making. The NOAA IOOS program initiated development of the Data Integration Framework (DIF) to improve management and delivery of an initial subset of ocean observations with the expectation of achieving improvements in a select set of NOAA's decision-support tools. Both OOI and NOAA through DIF collaborate on an effort to integrate the data distribution, access and analysis needs of both programs. We present details and early findings from this collaboration; one part of it is the development of a demonstrator combining web-based user access to oceanographic data through ERDDAP, efficient science data distribution, and scalable, self-healing deployment in a cloud computing environment. ERDDAP is a web-based front-end application integrating oceanographic data sources of various formats, for instance CDF data files as aggregated through NcML or presented using a THREDDS server. The OOI-designed data distribution network provides global traffic management and computational load balancing for observatory resources; it makes use of the OpenDAP Data Access Protocol (DAP) for efficient canonical science data distribution over the network. A cloud computing strategy is the basis for scalable, self-healing organization of an observatory's computing and storage resources, independent of the physical location and technical implementation of these resources.
The Spider Center Wide File System; From Concept to Reality
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shipman, Galen M; Dillow, David A; Oral, H Sarp
2009-01-01
The Leadership Computing Facility (LCF) at Oak Ridge National Laboratory (ORNL) has a diverse portfolio of computational resources ranging from a petascale XT4/XT5 simulation system (Jaguar) to numerous other systems supporting development, visualization, and data analytics. In order to support vastly different I/O needs of these systems Spider, a Lustre-based center wide file system was designed and deployed to provide over 240 GB/s of aggregate throughput with over 10 Petabytes of formatted capacity. A multi-stage InfiniBand network, dubbed as Scalable I/O Network (SION), with over 889 GB/s of bisectional bandwidth was deployed as part of Spider to provide connectivity tomore » our simulation, development, visualization, and other platforms. To our knowledge, while writing this paper, Spider is the largest and fastest POSIX-compliant parallel file system in production. This paper will detail the overall architecture of the Spider system, challenges in deploying and initial testings of a file system of this scale, and novel solutions to these challenges which offer key insights into file system design in the future.« less
SAM-FS: LSC's New Solaris-Based Storage Management Product
NASA Technical Reports Server (NTRS)
Angell, Kent
1996-01-01
SAM-FS is a full featured hierarchical storage management (HSM) device that operates as a file system on Solaris-based machines. The SAM-FS file system provides the user with all of the standard UNIX system utilities and calls, and adds some new commands, i.e. archive, release, stage, sls, sfind, and a family of maintenance commands. The system also offers enhancements such as high performance virtual disk read and write, control of the disk through an extent array, and the ability to dynamically allocate block size. SAM-FS provides 'archive sets' which are groupings of data to be copied to secondary storage. In practice, as soon as a file is written to disk, SAM-FS will make copies onto secondary media. SAM-FS is a scalable storage management system. The system can manage millions of files per system, though this is limited today by the speed of UNIX and its utilities. In the future, a new search algorithm will be implemented that will remove logical and performance restrictions on the number of files managed.
A distributed infrastructure for publishing VO services: an implementation
NASA Astrophysics Data System (ADS)
Cepparo, Francesco; Scagnetto, Ivan; Molinaro, Marco; Smareglia, Riccardo
2016-07-01
This contribution describes both the design and the implementation details of a new solution for publishing VO services, enlightening its maintainable, distributed, modular and scalable architecture. Indeed, the new publisher is multithreaded and multiprocess. Multiple instances of the modules can run on different machines to ensure high performance and high availability, and this will be true both for the interface modules of the services and the back end data access ones. The system uses message passing to let its components communicate through an AMQP message broker that can itself be distributed to provide better scalability and availability.
NASA Technical Reports Server (NTRS)
Sanyal, Soumya; Jain, Amit; Das, Sajal K.; Biswas, Rupak
2003-01-01
In this paper, we propose a distributed approach for mapping a single large application to a heterogeneous grid environment. To minimize the execution time of the parallel application, we distribute the mapping overhead to the available nodes of the grid. This approach not only provides a fast mapping of tasks to resources but is also scalable. We adopt a hierarchical grid model and accomplish the job of mapping tasks to this topology using a scheduler tree. Results show that our three-phase algorithm provides high quality mappings, and is fast and scalable.
Central Satellite Data Repository Supporting Research and Development
NASA Astrophysics Data System (ADS)
Han, W.; Brust, J.
2015-12-01
Near real-time satellite data is critical to many research and development activities of atmosphere, land, and ocean processes. Acquiring and managing huge volumes of satellite data without (or with less) latency in an organization is always a challenge in the big data age. An organization level data repository is a practical solution to meeting this challenge. The STAR (Center for Satellite Applications and Research of NOAA) Central Data Repository (SCDR) is a scalable, stable, and reliable repository to acquire, manipulate, and disseminate various types of satellite data in an effective and efficient manner. SCDR collects more than 200 data products, which are commonly used by multiple groups in STAR, from NOAA, GOES, Metop, Suomi NPP, Sentinel, Himawari, and other satellites. The processes of acquisition, recording, retrieval, organization, and dissemination are performed in parallel. Multiple data access interfaces, like FTP, FTPS, HTTP, HTTPS, and RESTful, are supported in the SCDR to obtain satellite data from their providers through high speed internet. The original satellite data in various raster formats can be parsed in the respective adapter to retrieve data information. The data information is ingested to the corresponding partitioned tables in the central database. All files are distributed equally on the Network File System (NFS) disks to balance the disk load. SCDR provides consistent interfaces (including Perl utility, portal, and RESTful Web service) to locate files of interest easily and quickly and access them directly by over 200 compute servers via NFS. SCDR greatly improves collection and integration of near real-time satellite data, addresses satellite data requirements of scientists and researchers, and facilitates their primary research and development activities.
A Mobile IPv6 based Distributed Mobility Management Mechanism of Mobile Internet
NASA Astrophysics Data System (ADS)
Yan, Shi; Jiayin, Cheng; Shanzhi, Chen
A flatter architecture is one of the trends of mobile Internet. Traditional centralized mobility management mechanism faces the challenges such as scalability and UE reachability. A MIPv6 based distributed mobility management mechanism is proposed in this paper. Some important network entities and signaling procedures are defined. UE reachability is also considered in this paper through extension to DNS servers. Simulation results show that the proposed approach can overcome the scalability problem of the centralized scheme.
Slices: A Scalable Partitioner for Finite Element Meshes
NASA Technical Reports Server (NTRS)
Ding, H. Q.; Ferraro, R. D.
1995-01-01
A parallel partitioner for partitioning unstructured finite element meshes on distributed memory architectures is developed. The element based partitioner can handle mixtures of different element types. All algorithms adopted in the partitioner are scalable, including a communication template for unpredictable incoming messages, as shown in actual timing measurements.
Scaleable wireless web-enabled sensor networks
NASA Astrophysics Data System (ADS)
Townsend, Christopher P.; Hamel, Michael J.; Sonntag, Peter A.; Trutor, B.; Arms, Steven W.
2002-06-01
Our goal was to develop a long life, low cost, scalable wireless sensing network, which collects and distributes data from a wide variety of sensors over the internet. Time division multiple access was employed with RF transmitter nodes (each w/unique16 bit address) to communicate digital data to a single receiver (range 1/3 mile). One thousand five channel nodes can communicate to one receiver (30 minute update). Current draw (sleep) is 20 microamps, allowing 5 year battery life w/one 3.6 volt Li-Ion AA size battery. The network nodes include sensor excitation (AC or DC), multiplexer, instrumentation amplifier, 16 bit A/D converter, microprocessor, and RF link. They are compatible with thermocouples, strain gauges, load/torque transducers, inductive/capacitive sensors. The receiver (418 MHz) includes a single board computer (SBC) with Ethernet capability, internet file transfer protocols (XML/HTML), and data storage. The receiver detects data from specific nodes, performs error checking, records the data. The web server interrogates the SBC (from Microsoft's Internet Explorer or Netscape's Navigator) to distribute data. This system can collect data from thousands of remote sensors on a smart structure, and be shared by an unlimited number of users.
Schilling, Lisa M.; Kwan, Bethany M.; Drolshagen, Charles T.; Hosokawa, Patrick W.; Brandt, Elias; Pace, Wilson D.; Uhrich, Christopher; Kamerick, Michael; Bunting, Aidan; Payne, Philip R.O.; Stephens, William E.; George, Joseph M.; Vance, Mark; Giacomini, Kelli; Braddy, Jason; Green, Mika K.; Kahn, Michael G.
2013-01-01
Introduction: Distributed Data Networks (DDNs) offer infrastructure solutions for sharing electronic health data from across disparate data sources to support comparative effectiveness research. Data sharing mechanisms must address technical and governance concerns stemming from network security and data disclosure laws and best practices, such as HIPAA. Methods: The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) deploys TRIAD grid technology, a common data model, detailed technical documentation, and custom software for data harmonization to facilitate data sharing in collaboration with stakeholders in the care of safety net populations. Data sharing partners host TRIAD grid nodes containing harmonized clinical data within their internal or hosted network environments. Authorized users can use a central web-based query system to request analytic data sets. Discussion: SAFTINet DDN infrastructure achieved a number of data sharing objectives, including scalable and sustainable systems for ensuring harmonized data structures and terminologies and secure distributed queries. Initial implementation challenges were resolved through iterative discussions, development and implementation of technical documentation, governance, and technology solutions. PMID:25848567
Schilling, Lisa M; Kwan, Bethany M; Drolshagen, Charles T; Hosokawa, Patrick W; Brandt, Elias; Pace, Wilson D; Uhrich, Christopher; Kamerick, Michael; Bunting, Aidan; Payne, Philip R O; Stephens, William E; George, Joseph M; Vance, Mark; Giacomini, Kelli; Braddy, Jason; Green, Mika K; Kahn, Michael G
2013-01-01
Distributed Data Networks (DDNs) offer infrastructure solutions for sharing electronic health data from across disparate data sources to support comparative effectiveness research. Data sharing mechanisms must address technical and governance concerns stemming from network security and data disclosure laws and best practices, such as HIPAA. The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) deploys TRIAD grid technology, a common data model, detailed technical documentation, and custom software for data harmonization to facilitate data sharing in collaboration with stakeholders in the care of safety net populations. Data sharing partners host TRIAD grid nodes containing harmonized clinical data within their internal or hosted network environments. Authorized users can use a central web-based query system to request analytic data sets. SAFTINet DDN infrastructure achieved a number of data sharing objectives, including scalable and sustainable systems for ensuring harmonized data structures and terminologies and secure distributed queries. Initial implementation challenges were resolved through iterative discussions, development and implementation of technical documentation, governance, and technology solutions.
Toward Millions of File System IOPS on Low-Cost, Commodity Hardware
Zheng, Da; Burns, Randal; Szalay, Alexander S.
2013-01-01
We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads. PMID:24402052
Toward Millions of File System IOPS on Low-Cost, Commodity Hardware.
Zheng, Da; Burns, Randal; Szalay, Alexander S
2013-01-01
We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads.
Algorithmic Coordination in Robotic Networks
2010-11-29
appropriate performance, robustness and scalability properties for various task allocation , surveillance, and information gathering applications is...networking, we envision designing and analyzing algorithms with appropriate performance, robustness and scalability properties for various task ...distributed algorithms for target assignments; based on the classic auction algorithms in static networks, we intend to design efficient algorithms in worst
NoSQL: collection document and cloud by using a dynamic web query form
NASA Astrophysics Data System (ADS)
Abdalla, Hemn B.; Lin, Jinzhao; Li, Guoquan
2015-07-01
Mongo-DB (from "humongous") is an open-source document database and the leading NoSQL database. A NoSQL (Not Only SQL, next generation databases, being non-relational, deal, open-source and horizontally scalable) presenting a mechanism for storage and retrieval of documents. Previously, we stored and retrieved the data using the SQL queries. Here, we use the MonogoDB that means we are not utilizing the MySQL and SQL queries. Directly importing the documents into our Drives, retrieving the documents on that drive by not applying the SQL queries, using the IO BufferReader and Writer, BufferReader for importing our type of document files to my folder (Drive). For retrieving the document files, the usage is BufferWriter from the particular folder (or) Drive. In this sense, providing the security for those storing files for what purpose means if we store the documents in our local folder means all or views that file and modified that file. So preventing that file, we are furnishing the security. The original document files will be changed to another format like in this paper; Binary format is used. Our documents will be converting to the binary format after that direct storing in one of our folder, that time the storage space will provide the private key for accessing that file. Wherever any user tries to discover the Document files means that file data are in the binary format, the document's file owner simply views that original format using that personal key from receive the secret key from the cloud.
A scalable parallel black oil simulator on distributed memory parallel computers
NASA Astrophysics Data System (ADS)
Wang, Kun; Liu, Hui; Chen, Zhangxin
2015-11-01
This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
CloVR-Comparative: automated, cloud-enabled comparative microbial genome sequence analysis pipeline.
Agrawal, Sonia; Arze, Cesar; Adkins, Ricky S; Crabtree, Jonathan; Riley, David; Vangala, Mahesh; Galens, Kevin; Fraser, Claire M; Tettelin, Hervé; White, Owen; Angiuoli, Samuel V; Mahurkar, Anup; Fricke, W Florian
2017-04-27
The benefit of increasing genomic sequence data to the scientific community depends on easy-to-use, scalable bioinformatics support. CloVR-Comparative combines commonly used bioinformatics tools into an intuitive, automated, and cloud-enabled analysis pipeline for comparative microbial genomics. CloVR-Comparative runs on annotated complete or draft genome sequences that are uploaded by the user or selected via a taxonomic tree-based user interface and downloaded from NCBI. CloVR-Comparative runs reference-free multiple whole-genome alignments to determine unique, shared and core coding sequences (CDSs) and single nucleotide polymorphisms (SNPs). Output includes short summary reports and detailed text-based results files, graphical visualizations (phylogenetic trees, circular figures), and a database file linked to the Sybil comparative genome browser. Data up- and download, pipeline configuration and monitoring, and access to Sybil are managed through CloVR-Comparative web interface. CloVR-Comparative and Sybil are distributed as part of the CloVR virtual appliance, which runs on local computers or the Amazon EC2 cloud. Representative datasets (e.g. 40 draft and complete Escherichia coli genomes) are processed in <36 h on a local desktop or at a cost of <$20 on EC2. CloVR-Comparative allows anybody with Internet access to run comparative genomics projects, while eliminating the need for on-site computational resources and expertise.
NASA Astrophysics Data System (ADS)
Sangaline, E.; Lauret, J.
2014-06-01
The quantity of information produced in Nuclear and Particle Physics (NPP) experiments necessitates the transmission and storage of data across diverse collections of computing resources. Robust solutions such as XRootD have been used in NPP, but as the usage of cloud resources grows, the difficulties in the dynamic configuration of these systems become a concern. Hadoop File System (HDFS) exists as a possible cloud storage solution with a proven track record in dynamic environments. Though currently not extensively used in NPP, HDFS is an attractive solution offering both elastic storage and rapid deployment. We will present the performance of HDFS in both canonical I/O tests and for a typical data analysis pattern within the RHIC/STAR experimental framework. These tests explore the scaling with different levels of redundancy and numbers of clients. Additionally, the performance of FUSE and NFS interfaces to HDFS were evaluated as a way to allow existing software to function without modification. Unfortunately, the complicated data structures in NPP are non-trivial to integrate with Hadoop and so many of the benefits of the MapReduce paradigm could not be directly realized. Despite this, our results indicate that using HDFS as a distributed filesystem offers reasonable performance and scalability and that it excels in its ease of configuration and deployment in a cloud environment.
Reducing I/O variability using dynamic I/O path characterization in petascale storage systems
Son, Seung Woo; Sehrish, Saba; Liao, Wei-keng; ...
2016-11-01
In petascale systems with a million CPU cores, scalable and consistent I/O performance is becoming increasingly difficult to sustain mainly because of I/O variability. Furthermore, the I/O variability is caused by concurrently running processes/jobs competing for I/O or a RAID rebuild when a disk drive fails. We present a mechanism that stripes across a selected subset of I/O nodes with the lightest workload at runtime to achieve the highest I/O bandwidth available in the system. In this paper, we propose a probing mechanism to enable application-level dynamic file striping to mitigate I/O variability. We also implement the proposed mechanism inmore » the high-level I/O library that enables memory-to-file data layout transformation and allows transparent file partitioning using subfiling. Subfiling is a technique that partitions data into a set of files of smaller size and manages file access to them, making data to be treated as a single, normal file to users. Here, we demonstrate that our bandwidth probing mechanism can successfully identify temporally slower I/O nodes without noticeable runtime overhead. Experimental results on NERSC’s systems also show that our approach isolates I/O variability effectively on shared systems and improves overall collective I/O performance with less variation.« less
Tackling the Four V's with NEXUS
NASA Astrophysics Data System (ADS)
Greguska, F. R., III; Gill, K. M.; Huang, T.; Jacob, J. C.; Quach, N.; Wilson, B. D.
2016-12-01
NASA's Earth Observing System Data and Information System (EOSDIS) reports that over 15 petabytes (PB) of Earth observing information are archived among the 12 NASA Distributed Active Archive Centers (DAACs); with more being archived daily. The upcoming Surface Water & Ocean Topography (SWOT) mission is expected to generate about 26 PB of data in 3 years. NEXUS is a state of the art deep data analytic program developed at the Jet Propulsion Laboratory with the goal of providing near real-time analytic capabilities for this vast trove of data. Rather than develop analytic services on traditional file archives, NEXUS organizes data into tiles in order to provide a platform for horizontal computing. To provide near real-time analytic solutions for missions such as SWOT, a highly scalable data ingestion solution is developed to quickly bring data into NEXUS. In order to accomplish this formidable challenge, the "Four V's" (Volume, Velocity, Veracity, and Variety) of Big Data must be considered. NEXUS consists of an ingestion subsystem that handles the Volume of data by utilizing a generic tiling strategy that subsets a given dataset into smaller tiles. These tiles are then indexed by a search engine and stored in a NoSQL database for fast retrieval. In addition to handling the Volume of data being indexed, the NEXUS ingestion subsystem is built for horizontal scalability in order to manage the Velocity of incoming data. As the load on the system increases, the components of the ingestion subsystem can be scaled to provide more capacity. During ingestion, NEXUS also takes a unique approach to the Veracity and Variety of Earth observing information being ingested. By allowing the processing and tiling mechanisms to be customized for each dataset, the NEXUS ingest system can discard erroneous or missing data as well as adapt to the many different data structures and file formats that can be found in satellite observation data. This talk will focus on the functionality and architecture of the data ingestion subsystem that is a part of the NEXUS software architecture and how it relates to the Four V's of Big Data.
An MPI-based MoSST core dynamics model
NASA Astrophysics Data System (ADS)
Jiang, Weiyuan; Kuang, Weijia
2008-09-01
Distributed systems are among the main cost-effective and expandable platforms for high-end scientific computing. Therefore scalable numerical models are important for effective use of such systems. In this paper, we present an MPI-based numerical core dynamics model for simulation of geodynamo and planetary dynamos, and for simulation of core-mantle interactions. The model is developed based on MPI libraries. Two algorithms are used for node-node communication: a "master-slave" architecture and a "divide-and-conquer" architecture. The former is easy to implement but not scalable in communication. The latter is scalable in both computation and communication. The model scalability is tested on Linux PC clusters with up to 128 nodes. This model is also benchmarked with a published numerical dynamo model solution.
CASTOR: Widely Distributed Scalable Infospaces
2008-11-01
1 i Progress against Planned Objectives Enable nimble apps that react fast as...generation of scalable, reliable, ultra- fast event notification in Linux data centers. • Maelstrom, a spin-off from Ricochet, offers a powerful new option...out potential enhancements to WS-EVENTING and WS-NOTIFICATION based on our work. Potential impact for the warflighter. QSM achieves extremely fast
Processing Diabetes Mellitus Composite Events in MAGPIE.
Brugués, Albert; Bromuri, Stefano; Barry, Michael; Del Toro, Óscar Jiménez; Mazurkiewicz, Maciej R; Kardas, Przemyslaw; Pegueroles, Josep; Schumacher, Michael
2016-02-01
The focus of this research is in the definition of programmable expert Personal Health Systems (PHS) to monitor patients affected by chronic diseases using agent oriented programming and mobile computing to represent the interactions happening amongst the components of the system. The paper also discusses issues of knowledge representation within the medical domain when dealing with temporal patterns concerning the physiological values of the patient. In the presented agent based PHS the doctors can personalize for each patient monitoring rules that can be defined in a graphical way. Furthermore, to achieve better scalability, the computations for monitoring the patients are distributed among their devices rather than being performed in a centralized server. The system is evaluated using data of 21 diabetic patients to detect temporal patterns according to a set of monitoring rules defined. The system's scalability is evaluated by comparing it with a centralized approach. The evaluation concerning the detection of temporal patterns highlights the system's ability to monitor chronic patients affected by diabetes. Regarding the scalability, the results show the fact that an approach exploiting the use of mobile computing is more scalable than a centralized approach. Therefore, more likely to satisfy the needs of next generation PHSs. PHSs are becoming an adopted technology to deal with the surge of patients affected by chronic illnesses. This paper discusses architectural choices to make an agent based PHS more scalable by using a distributed mobile computing approach. It also discusses how to model the medical knowledge in the PHS in such a way that it is modifiable at run time. The evaluation highlights the necessity of distributing the reasoning to the mobile part of the system and that modifiable rules are able to deal with the change in lifestyle of the patients affected by chronic illnesses.
Scalable parallel distance field construction for large-scale applications
Yu, Hongfeng; Xie, Jinrong; Ma, Kwan -Liu; ...
2015-10-01
Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. Anew distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking overtime, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate itsmore » efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. In conclusion, our work greatly extends the usability of distance fields for demanding applications.« less
Scalable Parallel Distance Field Construction for Large-Scale Applications.
Yu, Hongfeng; Xie, Jinrong; Ma, Kwan-Liu; Kolla, Hemanth; Chen, Jacqueline H
2015-10-01
Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.
Jeong, Seol Young; Jo, Hyeong Gon; Kang, Soon Ju
2015-01-01
Indoor location-based services (iLBS) are extremely dynamic and changeable, and include numerous resources and mobile devices. In particular, the network infrastructure requires support for high scalability in the indoor environment, and various resource lookups are requested concurrently and frequently from several locations based on the dynamic network environment. A traditional map-based centralized approach for iLBSs has several disadvantages: it requires global knowledge to maintain a complete geographic indoor map; the central server is a single point of failure; it can also cause low scalability and traffic congestion; and it is hard to adapt to a change of service area in real time. This paper proposes a self-organizing and fully distributed platform for iLBSs. The proposed self-organizing distributed platform provides a dynamic reconfiguration of locality accuracy and service coverage by expanding and contracting dynamically. In order to verify the suggested platform, scalability performance according to the number of inserted or deleted nodes composing the dynamic infrastructure was evaluated through a simulation similar to the real environment. PMID:26016908
NASA Astrophysics Data System (ADS)
Hayat, Khizar; Puech, William; Gesquière, Gilles
2010-04-01
We propose an adaptively synchronous scalable spread spectrum (A4S) data-hiding strategy to integrate disparate data, needed for a typical 3-D visualization, into a single JPEG2000 format file. JPEG2000 encoding provides a standard format on one hand and the needed multiresolution for scalability on the other. The method has the potential of being imperceptible and robust at the same time. While the spread spectrum (SS) methods are known for the high robustness they offer, our data-hiding strategy is removable at the same time, which ensures highest possible visualization quality. The SS embedding of the discrete wavelet transform (DWT)-domain depth map is carried out in transform domain YCrCb components from the JPEG2000 coding stream just after the DWT stage. To maintain synchronization, the embedding is carried out while taking into account the correspondence of subbands. Since security is not the immediate concern, we are at liberty with the strength of embedding. This permits us to increase the robustness and bring the reversibility of our method. To estimate the maximum tolerable error in the depth map according to a given viewpoint, a human visual system (HVS)-based psychovisual analysis is also presented.
Scalable Database Design of End-Game Model with Decoupled Countermeasure and Threat Information
2017-11-01
Threat Information by Decetria Akole and Michael Chen Approved for public release; distribution is unlimited...Scalable Database Design of End-Game Model with Decoupled Countermeasure and Threat Information by Decetria Akole The Thurgood Marshall...for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data
Simplex-stochastic collocation method with improved scalability
NASA Astrophysics Data System (ADS)
Edeling, W. N.; Dwight, R. P.; Cinnella, P.
2016-04-01
The Simplex-Stochastic Collocation (SSC) method is a robust tool used to propagate uncertain input distributions through a computer code. However, it becomes prohibitively expensive for problems with dimensions higher than 5. The main purpose of this paper is to identify bottlenecks, and to improve upon this bad scalability. In order to do so, we propose an alternative interpolation stencil technique based upon the Set-Covering problem, and we integrate the SSC method in the High-Dimensional Model-Reduction framework. In addition, we address the issue of ill-conditioned sample matrices, and we present an analytical map to facilitate uniformly-distributed simplex sampling.
The NCAR Research Data Archive's Hybrid Approach for Data Discovery and Access
NASA Astrophysics Data System (ADS)
Schuster, D.; Worley, S. J.
2013-12-01
The NCAR Research Data Archive (RDA http://rda.ucar.edu) maintains a variety of data discovery and access capabilities for it's 600+ dataset collections to support the varying needs of a diverse user community. In-house developed and standards-based community tools offer services to more than 10,000 users annually. By number of users the largest group is external and access the RDA through web based protocols; the internal NCAR HPC users are fewer in number, but typically access more data volume. This paper will detail the data discovery and access services maintained by the RDA to support both user groups, and show metrics that illustrate how the community is using the services. The distributed search capability enabled by standards-based community tools, such as Geoportal and an OAI-PMH access point that serves multiple metadata standards, provide pathways for external users to initially discover RDA holdings. From here, in-house developed web interfaces leverage primary discovery level metadata databases that support keyword and faceted searches. Internal NCAR HPC users, or those familiar with the RDA, may go directly to the dataset collection of interest and refine their search based on rich file collection metadata. Multiple levels of metadata have proven to be invaluable for discovery within terabyte-sized archives composed of many atmospheric or oceanic levels, hundreds of parameters, and often numerous grid and time resolutions. Once users find the data they want, their access needs may vary as well. A THREDDS data server running on targeted dataset collections enables remote file access through OPENDAP and other web based protocols primarily for external users. In-house developed tools give all users the capability to submit data subset extraction and format conversion requests through scalable, HPC based delayed mode batch processing. Users can monitor their RDA-based data processing progress and receive instructions on how to access the data when it is ready. External users are provided with RDA server generated scripts to download the resulting request output. Similarly they can download native dataset collection files or partial files using Wget or cURL based scripts supplied by the RDA server. Internal users can access the resulting request output or native dataset collection files directly from centralized file systems.
GSKY: A scalable distributed geospatial data server on the cloud
NASA Astrophysics Data System (ADS)
Rozas Larraondo, Pablo; Pringle, Sean; Antony, Joseph; Evans, Ben
2017-04-01
Earth systems, environmental and geophysical datasets are an extremely valuable sources of information about the state and evolution of the Earth. Being able to combine information coming from different geospatial collections is in increasing demand by the scientific community, and requires managing and manipulating data with different formats and performing operations such as map reprojections, resampling and other transformations. Due to the large data volume inherent in these collections, storing multiple copies of them is unfeasible and so such data manipulation must be performed on-the-fly using efficient, high performance techniques. Ideally this should be performed using a trusted data service and common system libraries to ensure wide use and reproducibility. Recent developments in distributed computing based on dynamic access to significant cloud infrastructure opens the door for such new ways of processing geospatial data on demand. The National Computational Infrastructure (NCI), hosted at the Australian National University (ANU), has over 10 Petabytes of nationally significant research data collections. Some of these collections, which comprise a variety of observed and modelled geospatial data, are now made available via a highly distributed geospatial data server, called GSKY (pronounced [jee-skee]). GSKY supports on demand processing of large geospatial data products such as satellite earth observation data as well as numerical weather products, allowing interactive exploration and analysis of the data. It dynamically and efficiently distributes the required computations among cloud nodes providing a scalable analysis framework that can adapt to serve large number of concurrent users. Typical geospatial workflows handling different file formats and data types, or blending data in different coordinate projections and spatio-temporal resolutions, is handled transparently by GSKY. This is achieved by decoupling the data ingestion and indexing process as an independent service. An indexing service crawls data collections either locally or remotely by extracting, storing and indexing all spatio-temporal metadata associated with each individual record. GSKY provides the user with the ability of specifying how ingested data should be aggregated, transformed and presented. It presents an OGC standards-compliant interface, allowing ready accessibility for users of the data via Web Map Services (WMS), Web Processing Services (WPS) or raw data arrays using Web Coverage Services (WCS). The presentation will show some cases where we have used this new capability to provide a significant improvement over previous approaches.
Towards scalable Byzantine fault-tolerant replication
NASA Astrophysics Data System (ADS)
Zbierski, Maciej
2017-08-01
Byzantine fault-tolerant (BFT) replication is a powerful technique, enabling distributed systems to remain available and correct even in the presence of arbitrary faults. Unfortunately, existing BFT replication protocols are mostly load-unscalable, i.e. they fail to respond with adequate performance increase whenever new computational resources are introduced into the system. This article proposes a universal architecture facilitating the creation of load-scalable distributed services based on BFT replication. The suggested approach exploits parallel request processing to fully utilize the available resources, and uses a load balancer module to dynamically adapt to the properties of the observed client workload. The article additionally provides a discussion on selected deployment scenarios, and explains how the proposed architecture could be used to increase the dependability of contemporary large-scale distributed systems.
Introducing a distributed unstructured mesh into gyrokinetic particle-in-cell code, XGC
NASA Astrophysics Data System (ADS)
Yoon, Eisung; Shephard, Mark; Seol, E. Seegyoung; Kalyanaraman, Kaushik
2017-10-01
XGC has shown good scalability for large leadership supercomputers. The current production version uses a copy of the entire unstructured finite element mesh on every MPI rank. Although an obvious scalability issue if the mesh sizes are to be dramatically increased, the current approach is also not optimal with respect to data locality of particles and mesh information. To address these issues we have initiated the development of a distributed mesh PIC method. This approach directly addresses the base scalability issue with respect to mesh size and, through the use of a mesh entity centric view of the particle mesh relationship, provides opportunities to address data locality needs of many core and GPU supported heterogeneous systems. The parallel mesh PIC capabilities are being built on the Parallel Unstructured Mesh Infrastructure (PUMI). The presentation will first overview the form of mesh distribution used and indicate the structures and functions used to support the mesh, the particles and their interaction. Attention will then focus on the node-level optimizations being carried out to ensure performant operation of all PIC operations on the distributed mesh. Partnership for Edge Physics Simulation (EPSI) Grant No. DE-SC0008449 and Center for Extended Magnetohydrodynamic Modeling (CEMM) Grant No. DE-SC0006618.
Multi-Resolution Playback of Network Trace Files
2015-06-01
a com- plete MySQL database, C++ developer tools and the libraries utilized in the development of the system (Boost and Libcrafter), and Wireshark...XE suite has a limit to the allowed size of each database. In order to be scalable, the project had to switch to the MySQL database suite. The...programs that access the database use the MySQL C++ connector, provided by Oracle, and the supplied methods and libraries. 4.4 Flow Generator Chapter 3
Optimal and Scalable Caching for 5G Using Reinforcement Learning of Space-Time Popularities
NASA Astrophysics Data System (ADS)
Sadeghi, Alireza; Sheikholeslami, Fatemeh; Giannakis, Georgios B.
2018-02-01
Small basestations (SBs) equipped with caching units have potential to handle the unprecedented demand growth in heterogeneous networks. Through low-rate, backhaul connections with the backbone, SBs can prefetch popular files during off-peak traffic hours, and service them to the edge at peak periods. To intelligently prefetch, each SB must learn what and when to cache, while taking into account SB memory limitations, the massive number of available contents, the unknown popularity profiles, as well as the space-time popularity dynamics of user file requests. In this work, local and global Markov processes model user requests, and a reinforcement learning (RL) framework is put forth for finding the optimal caching policy when the transition probabilities involved are unknown. Joint consideration of global and local popularity demands along with cache-refreshing costs allow for a simple, yet practical asynchronous caching approach. The novel RL-based caching relies on a Q-learning algorithm to implement the optimal policy in an online fashion, thus enabling the cache control unit at the SB to learn, track, and possibly adapt to the underlying dynamics. To endow the algorithm with scalability, a linear function approximation of the proposed Q-learning scheme is introduced, offering faster convergence as well as reduced complexity and memory requirements. Numerical tests corroborate the merits of the proposed approach in various realistic settings.
Field of genes: using Apache Kafka as a bioinformatic data repository.
Lawlor, Brendan; Lynch, Richard; Mac Aogáin, Micheál; Walsh, Paul
2018-04-01
Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI's) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI's RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files. The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files. Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data.
CERN data services for LHC computing
NASA Astrophysics Data System (ADS)
Espinal, X.; Bocchi, E.; Chan, B.; Fiorot, A.; Iven, J.; Lo Presti, G.; Lopez, J.; Gonzalez, H.; Lamanna, M.; Mascetti, L.; Moscicki, J.; Pace, A.; Peters, A.; Ponce, S.; Rousseau, H.; van der Ster, D.
2017-10-01
Dependability, resilience, adaptability and efficiency. Growing requirements require tailoring storage services and novel solutions. Unprecedented volumes of data coming from the broad number of experiments at CERN need to be quickly available in a highly scalable way for large-scale processing and data distribution while in parallel they are routed to tape for long-term archival. These activities are critical for the success of HEP experiments. Nowadays we operate at high incoming throughput (14GB/s during 2015 LHC Pb-Pb run and 11PB in July 2016) and with concurrent complex production work-loads. In parallel our systems provide the platform for the continuous user and experiment driven work-loads for large-scale data analysis, including end-user access and sharing. The storage services at CERN cover the needs of our community: EOS and CASTOR as a large-scale storage; CERNBox for end-user access and sharing; Ceph as data back-end for the CERN OpenStack infrastructure, NFS services and S3 functionality; AFS for legacy distributed-file-system services. In this paper we will summarise the experience in supporting LHC experiments and the transition of our infrastructure from static monolithic systems to flexible components providing a more coherent environment with pluggable protocols, tuneable QoS, sharing capabilities and fine grained ACLs management while continuing to guarantee dependable and robust services.
Leveraging the Cloud for Integrated Network Experimentation
2014-03-01
kernel settings, or any of the low-level subcomponents. 3. Scalable Solutions: Businesses can build scalable solutions for their clients , ranging from...values. These values 13 can assume several distributions that include normal, Pareto , uniform, exponential and Poisson, among others [21]. Additionally, D...communication, the web client establishes a connection to the server before traffic begins to flow. Web servers do not initiate connections to clients in
The storage system of PCM based on random access file system
NASA Astrophysics Data System (ADS)
Han, Wenbing; Chen, Xiaogang; Zhou, Mi; Li, Shunfen; Li, Gezi; Song, Zhitang
2016-10-01
Emerging memory technologies such as Phase change memory (PCM) tend to offer fast, random access to persistent storage with better scalability. It's a hot topic of academic and industrial research to establish PCM in storage hierarchy to narrow the performance gap. However, the existing file systems do not perform well with the emerging PCM storage, which access storage medium via a slow, block-based interface. In this paper, we propose a novel file system, RAFS, to bring about good performance of PCM, which is built in the embedded platform. We attach PCM chips to the memory bus and build RAFS on the physical address space. In the proposed file system, we simplify traditional system architecture to eliminate block-related operations and layers. Furthermore, we adopt memory mapping and bypassed page cache to reduce copy overhead between the process address space and storage device. XIP mechanisms are also supported in RAFS. To the best of our knowledge, we are among the first to implement file system on real PCM chips. We have analyzed and evaluated its performance with IOZONE benchmark tools. Our experimental results show that the RAFS on PCM outperforms Ext4fs on SDRAM with small record lengths. Based on DRAM, RAFS is significantly faster than Ext4fs by 18% to 250%.
Distributed numerical controllers
NASA Astrophysics Data System (ADS)
Orban, Peter E.
2001-12-01
While the basic principles of Numerical Controllers (NC) have not changed much during the years, the implementation of NCs' has changed tremendously. NC equipment has evolved from yesterday's hard-wired specialty control apparatus to today's graphics intensive, networked, increasingly PC based open systems, controlling a wide variety of industrial equipment with positioning needs. One of the newest trends in NC technology is the distributed implementation of the controllers. Distributed implementation promises to offer robustness, lower implementation costs, and a scalable architecture. Historically partitioning has been done along the hierarchical levels, moving individual modules into self contained units. The paper discusses various NC architectures, the underlying technology for distributed implementation, and relevant design issues. First the functional requirements of individual NC modules are analyzed. Module functionality, cycle times, and data requirements are examined. Next the infrastructure for distributed node implementation is reviewed. Various communication protocols and distributed real-time operating system issues are investigated and compared. Finally, a different, vertical system partitioning, offering true scalability and reconfigurability is presented.
Jeong, Seol Young; Jo, Hyeong Gon; Kang, Soon Ju
2014-03-21
A tracking service like asset management is essential in a dynamic hospital environment consisting of numerous mobile assets (e.g., wheelchairs or infusion pumps) that are continuously relocated throughout a hospital. The tracking service is accomplished based on the key technologies of an indoor location-based service (LBS), such as locating and monitoring multiple mobile targets inside a building in real time. An indoor LBS such as a tracking service entails numerous resource lookups being requested concurrently and frequently from several locations, as well as a network infrastructure requiring support for high scalability in indoor environments. A traditional centralized architecture needs to maintain a geographic map of the entire building or complex in its central server, which can cause low scalability and traffic congestion. This paper presents a self-organizing and fully distributed indoor mobile asset management (MAM) platform, and proposes an architecture for multiple trackees (such as mobile assets) and trackers based on the proposed distributed platform in real time. In order to verify the suggested platform, scalability performance according to increases in the number of concurrent lookups was evaluated in a real test bed. Tracking latency and traffic load ratio in the proposed tracking architecture was also evaluated.
75 FR 51032 - National Fuel Gas Distribution Corporation; Notice of Baseline Filing
Federal Register 2010, 2011, 2012, 2013, 2014
2010-08-18
... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission [Docket No. PR10-79-000] National Fuel Gas Distribution Corporation; Notice of Baseline Filing August 12, 2010. Take notice that on August 10, 2010, National fuel Gas Distribution Corporation submitted a baseline filing of its Statement of...
New capabilities in the HENP grand challenge storage access systemand its application at RHIC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernardo, L.; Gibbard, B.; Malon, D.
2000-04-25
The High Energy and Nuclear Physics Data Access GrandChallenge project has developed an optimizing storage access softwaresystem that was prototyped at RHIC. It is currently undergoingintegration with the STAR experiment in preparation for data taking thatstarts in mid-2000. The behavior and lessons learned in the RHIC MockData Challenge exercises are described as well as the observedperformance under conditions designed to characterize scalability. Up to250 simultaneous queries were tested and up to 10 million events across 7event components were involved in these queries. The system coordinatesthe staging of "bundles" of files from the HPSS tape system, so that allthe needed componentsmore » of each event are in disk cache when accessed bythe application software. The caching policy algorithm for thecoordinated bundle staging is described in the paper. The initialprototype implementation interfaced to the Objectivity/DB. In this latestversion, it evolved to work with arbitrary files and use CORBA interfacesto the tag database and file catalog services. The interface to the tagdatabase and the MySQL-based file catalog services used by STAR aredescribed along with the planned usage scenarios.« less
2006-09-01
STELLA and PowerLoomn. These modules comunicate with a knowledge basec using KIF and stan(lardl relational database systelnis using either standard...groups ontology as well as a rule that infers additional seed members based on joint participation in a terrorism event. EDB schema files are a special... terrorism links from the Ali Baba EDB. Our interpretation of such links is that they KOJAK Manual E-42 encode that two people committed an act of
Experience with ATLAS MySQL PanDA database service
NASA Astrophysics Data System (ADS)
Smirnov, Y.; Wlodek, T.; De, K.; Hover, J.; Ozturk, N.; Smith, J.; Wenaus, T.; Yu, D.
2010-04-01
The PanDA distributed production and analysis system has been in production use for ATLAS data processing and analysis since late 2005 in the US, and globally throughout ATLAS since early 2008. Its core architecture is based on a set of stateless web services served by Apache and backed by a suite of MySQL databases that are the repository for all PanDA information: active and archival job queues, dataset and file catalogs, site configuration information, monitoring information, system control parameters, and so on. This database system is one of the most critical components of PanDA, and has successfully delivered the functional and scaling performance required by PanDA, currently operating at a scale of half a million jobs per week, with much growth still to come. In this paper we describe the design and implementation of the PanDA database system, its architecture of MySQL servers deployed at BNL and CERN, backup strategy and monitoring tools. The system has been developed, thoroughly tested, and brought to production to provide highly reliable, scalable, flexible and available database services for ATLAS Monte Carlo production, reconstruction and physics analysis.
Efficient feature extraction from wide-area motion imagery by MapReduce in Hadoop
NASA Astrophysics Data System (ADS)
Cheng, Erkang; Ma, Liya; Blaisse, Adam; Blasch, Erik; Sheaff, Carolyn; Chen, Genshe; Wu, Jie; Ling, Haibin
2014-06-01
Wide-Area Motion Imagery (WAMI) feature extraction is important for applications such as target tracking, traffic management and accident discovery. With the increasing amount of WAMI collections and feature extraction from the data, a scalable framework is needed to handle the large amount of information. Cloud computing is one of the approaches recently applied in large scale or big data. In this paper, MapReduce in Hadoop is investigated for large scale feature extraction tasks for WAMI. Specifically, a large dataset of WAMI images is divided into several splits. Each split has a small subset of WAMI images. The feature extractions of WAMI images in each split are distributed to slave nodes in the Hadoop system. Feature extraction of each image is performed individually in the assigned slave node. Finally, the feature extraction results are sent to the Hadoop File System (HDFS) to aggregate the feature information over the collected imagery. Experiments of feature extraction with and without MapReduce are conducted to illustrate the effectiveness of our proposed Cloud-Enabled WAMI Exploitation (CAWE) approach.
Yigzaw, Kassaye Yitbarek; Michalas, Antonis; Bellika, Johan Gustav
2017-01-03
Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N - 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.
Design of Availability-Dependent Distributed Services in Large-Scale Uncooperative Settings
ERIC Educational Resources Information Center
Morales, Ramses Victor
2009-01-01
Thesis Statement: "Availability-dependent global predicates can be efficiently and scalably realized for a class of distributed services, in spite of specific selfish and colluding behaviors, using local and decentralized protocols". Several types of large-scale distributed systems spanning the Internet have to deal with availability variations…
Information Weighted Consensus for Distributed Estimation in Vision Networks
ERIC Educational Resources Information Center
Kamal, Ahmed Tashrif
2013-01-01
Due to their high fault-tolerance, ease of installation and scalability to large networks, distributed algorithms have recently gained immense popularity in the sensor networks community, especially in computer vision. Multi-target tracking in a camera network is one of the fundamental problems in this domain. Distributed estimation algorithms…
2006-04-01
and Scalability, (2) Sensors and Platforms, (3) Distributed Computing and Processing , (4) Information Management, (5) Fusion and Resource Management...use of the deployed system. 3.3 Distributed Computing and Processing Session The Distributed Computing and Processing Session consisted of three
NASA Astrophysics Data System (ADS)
Park, Soomyung; Joo, Seong-Soon; Yae, Byung-Ho; Lee, Jong-Hyun
2002-07-01
In this paper, we present the Optical Cross-Connect (OXC) Management Control System Architecture, which has the scalability and robust maintenance and provides the distributed managing environment in the optical transport network. The OXC system we are developing, which is divided into the hardware and the internal and external software for the OXC system, is made up the OXC subsystem with the Optical Transport Network (OTN) sub layers-hardware and the optical switch control system, the signaling control protocol subsystem performing the User-to-Network Interface (UNI) and Network-to-Network Interface (NNI) signaling control, the Operation Administration Maintenance & Provisioning (OAM&P) subsystem, and the network management subsystem. And the OXC management control system has the features that can support the flexible expansion of the optical transport network, provide the connectivity to heterogeneous external network elements, be added or deleted without interrupting OAM&P services, be remotely operated, provide the global view and detail information for network planner and operator, and have Common Object Request Broker Architecture (CORBA) based the open system architecture adding and deleting the intelligent service networking functions easily in future. To meet these considerations, we adopt the object oriented development method in the whole developing steps of the system analysis, design, and implementation to build the OXC management control system with the scalability, the maintenance, and the distributed managing environment. As a consequently, the componentification for the OXC operation management functions of each subsystem makes the robust maintenance, and increases code reusability. Also, the component based OXC management control system architecture will have the flexibility and scalability in nature.
NASA Astrophysics Data System (ADS)
Huang, T.; Alarcon, C.; Quach, N. T.
2014-12-01
Capture, curate, and analysis are the typical activities performed at any given Earth Science data center. Modern data management systems must be adaptable to heterogeneous science data formats, scalable to meet the mission's quality of service requirements, and able to manage the life-cycle of any given science data product. Designing a scalable data management doesn't happen overnight. It takes countless hours of refining, refactoring, retesting, and re-architecting. The Horizon data management and workflow framework, developed at the Jet Propulsion Laboratory, is a portable, scalable, and reusable framework for developing high-performance data management and product generation workflow systems to automate data capturing, data curation, and data analysis activities. The NASA's Physical Oceanography Distributed Active Archive Center (PO.DAAC)'s Data Management and Archive System (DMAS) is its core data infrastructure that handles capturing and distribution of hundreds of thousands of satellite observations each day around the clock. DMAS is an application of the Horizon framework. The NASA Global Imagery Browse Services (GIBS) is NASA's Earth Observing System Data and Information System (EOSDIS)'s solution for making high-resolution global imageries available to the science communities. The Imagery Exchange (TIE), an application of the Horizon framework, is a core subsystem for GIBS responsible for data capturing and imagery generation automation to support the EOSDIS' 12 distributed active archive centers and 17 Science Investigator-led Processing Systems (SIPS). This presentation discusses our ongoing effort in refining, refactoring, retesting, and re-architecting the Horizon framework to enable data-intensive science and its applications.
Scalable load balancing for massively parallel distributed Monte Carlo particle transport
DOE Office of Scientific and Technical Information (OSTI.GOV)
O'Brien, M. J.; Brantley, P. S.; Joy, K. I.
2013-07-01
In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time 0(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrencemore » Livermore National Laboratory. (authors)« less
High Performance Data Distribution for Scientific Community
NASA Astrophysics Data System (ADS)
Tirado, Juan M.; Higuero, Daniel; Carretero, Jesus
2010-05-01
Institutions such as NASA, ESA or JAXA find solutions to distribute data from their missions to the scientific community, and their long term archives. This is a complex problem, as it includes a vast amount of data, several geographically distributed archives, heterogeneous architectures with heterogeneous networks, and users spread around the world. We propose a novel architecture (HIDDRA) that solves this problem aiming to reduce user intervention in data acquisition and processing. HIDDRA is a modular system that provides a highly efficient parallel multiprotocol download engine, using a publish/subscribe policy which helps the final user to obtain data of interest transparently. Our system can deal simultaneously with multiple protocols (HTTP,HTTPS, FTP, GridFTP among others) to obtain the maximum bandwidth, reducing the workload in data server and increasing flexibility. It can also provide high reliability and fault tolerance, as several sources of data can be used to perform one file download. HIDDRA architecture can be arranged into a data distribution network deployed on several sites that can cooperate to provide former features. HIDDRA has been addressed by the 2009 e-IRG Report on Data Management as a promising initiative for data interoperability. Our first prototype has been evaluated in collaboration with the ESAC centre in Villafranca del Castillo (Spain) that shows a high scalability and performance, opening a wide spectrum of opportunities. Some preliminary results have been published in the Journal of Astrophysics and Space Science [1]. [1] D. Higuero, J.M. Tirado, J. Carretero, F. Félix, and A. de La Fuente. HIDDRA: a highly independent data distribution and retrieval architecture for space observation missions. Astrophysics and Space Science, 321(3):169-175, 2009
Autocorrel I: A Neural Network Based Network Event Correlation Approach
2005-05-01
which concern any component of the network. 2.1.1 Existing Intrusion Detection Systems EMERALD [8] is a distributed, scalable, hierarchal, customizable...writing this paper, the updaters of this system had not released their correlation unit to the public. EMERALD ex- plicitly divides statistical analysis... EMERALD , NetSTAT is scalable and composi- ble. QuidSCOR [12] is an open-source IDS, though it requires a subscription from its publisher, Qualys Inc
PhreeqcRM: A reaction module for transport simulators based on the geochemical model PHREEQC
Parkhurst, David L.; Wissmeier, Laurin
2015-01-01
PhreeqcRM is a geochemical reaction module designed specifically to perform equilibrium and kinetic reaction calculations for reactive transport simulators that use an operator-splitting approach. The basic function of the reaction module is to take component concentrations from the model cells of the transport simulator, run geochemical reactions, and return updated component concentrations to the transport simulator. If multicomponent diffusion is modeled (e.g., Nernst–Planck equation), then aqueous species concentrations can be used instead of component concentrations. The reaction capabilities are a complete implementation of the reaction capabilities of PHREEQC. In each cell, the reaction module maintains the composition of all of the reactants, which may include minerals, exchangers, surface complexers, gas phases, solid solutions, and user-defined kinetic reactants.PhreeqcRM assigns initial and boundary conditions for model cells based on standard PHREEQC input definitions (files or strings) of chemical compositions of solutions and reactants. Additional PhreeqcRM capabilities include methods to eliminate reaction calculations for inactive parts of a model domain, transfer concentrations and other model properties, and retrieve selected results. The module demonstrates good scalability for parallel processing by using multiprocessing with MPI (message passing interface) on distributed memory systems, and limited scalability using multithreading with OpenMP on shared memory systems. PhreeqcRM is written in C++, but interfaces allow methods to be called from C or Fortran. By using the PhreeqcRM reaction module, an existing multicomponent transport simulator can be extended to simulate a wide range of geochemical reactions. Results of the implementation of PhreeqcRM as the reaction engine for transport simulators PHAST and FEFLOW are shown by using an analytical solution and the reactive transport benchmark of MoMaS.
Music information retrieval in compressed audio files: a survey
NASA Astrophysics Data System (ADS)
Zampoglou, Markos; Malamos, Athanasios G.
2014-07-01
In this paper, we present an organized survey of the existing literature on music information retrieval systems in which descriptor features are extracted directly from the compressed audio files, without prior decompression to pulse-code modulation format. Avoiding the decompression step and utilizing the readily available compressed-domain information can significantly lighten the computational cost of a music information retrieval system, allowing application to large-scale music databases. We identify a number of systems relying on compressed-domain information and form a systematic classification of the features they extract, the retrieval tasks they tackle and the degree in which they achieve an actual increase in the overall speed-as well as any resulting loss in accuracy. Finally, we discuss recent developments in the field, and the potential research directions they open toward ultra-fast, scalable systems.
Lessons Learned in Deploying the World s Largest Scale Lustre File System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dillow, David A; Fuller, Douglas; Wang, Feiyi
2010-01-01
The Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) is the world's largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF's diverse computational platforms, the aggregate performance and storage capacity of Spider exceed that of our previously deployed systems by a factor of 6x - 240 GB/sec, and 17x - 10 Petabytes, respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing themore » file system, which exceeds our previously deployed systems by nearly 4x. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has resulted in a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present in this paper our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenges, such as stressed metadata performance and the need for file system quality of service alongside with our efforts to address them. Finally, operational aspects of managing a system of this scale are discussed along with real-world data and observations.« less
A Scalable proxy cache for Grid Data Access
NASA Astrophysics Data System (ADS)
Cristian Cirstea, Traian; Just Keijser, Jan; Koeroo, Oscar Arthur; Starink, Ronald; Templon, Jeffrey Alan
2012-12-01
We describe a prototype grid proxy cache system developed at Nikhef, motivated by a desire to construct the first building block of a future https-based Content Delivery Network for grid infrastructures. Two goals drove the project: firstly to provide a “native view” of the grid for desktop-type users, and secondly to improve performance for physics-analysis type use cases, where multiple passes are made over the same set of data (residing on the grid). We further constrained the design by requiring that the system should be made of standard components wherever possible. The prototype that emerged from this exercise is a horizontally-scalable, cooperating system of web server / cache nodes, fronted by a customized webDAV server. The webDAV server is custom only in the sense that it supports http redirects (providing horizontal scaling) and that the authentication module has, as back end, a proxy delegation chain that can be used by the cache nodes to retrieve files from the grid. The prototype was deployed at Nikhef and tested at a scale of several terabytes of data and approximately one hundred fast cores of computing. Both small and large files were tested, in a number of scenarios, and with various numbers of cache nodes, in order to understand the scaling properties of the system. For properly-dimensioned cache-node hardware, the system showed speedup of several integer factors for the analysis-type use cases. These results and others are presented and discussed.
Field of genes: using Apache Kafka as a bioinformatic data repository
Lynch, Richard; Walsh, Paul
2018-01-01
Abstract Background Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI’s) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI’s RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files. Results The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files. Conclusions Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data. PMID:29635394
Final Report: CNC Micromachines LDRD No.10793
DOE Office of Scientific and Technical Information (OSTI.GOV)
JOKIEL JR., BERNHARD; BENAVIDES, GILBERT L.; BIEG, LOTHAR F.
2003-04-01
The three-year LDRD ''CNC Micromachines'' was successfully completed at the end of FY02. The project had four major breakthroughs in spatial motion control in MEMS: (1) A unified method for designing scalable planar and spatial on-chip motion control systems was developed. The method relies on the use of parallel kinematic mechanisms (PKMs) that when properly designed provide different types of motion on-chip without the need for post-fabrication assembly, (2) A new type of actuator was developed--the linear stepping track drive (LSTD) that provides open loop linear position control that is scalable in displacement, output force and step size. Several versionsmore » of this actuator were designed, fabricated and successfully tested. (3) Different versions of XYZ translation only and PTT motion stages were designed, successfully fabricated and successfully tested demonstrating absolutely that on-chip spatial motion control systems are not only possible, but are a reality. (4) Control algorithms, software and infrastructure based on MATLAB were created and successfully implemented to drive the XYZ and PTT motion platforms in a controlled manner. The control software is capable of reading an M/G code machine tool language file, decode the instructions and correctly calculate and apply position and velocity trajectories to the motion devices linear drive inputs to position the device platform along the trajectory as specified by the input file. A full and detailed account of design methodology, theory and experimental results (failures and successes) is provided.« less
A Reusable Framework for Regional Climate Model Evaluation
NASA Astrophysics Data System (ADS)
Hart, A. F.; Goodale, C. E.; Mattmann, C. A.; Lean, P.; Kim, J.; Zimdars, P.; Waliser, D. E.; Crichton, D. J.
2011-12-01
Climate observations are currently obtained through a diverse network of sensors and platforms that include space-based observatories, airborne and seaborne platforms, and distributed, networked, ground-based instruments. These global observational measurements are critical inputs to the efforts of the climate modeling community and can provide a corpus of data for use in analysis and validation of climate models. The Regional Climate Model Evaluation System (RCMES) is an effort currently being undertaken to address the challenges of integrating this vast array of observational climate data into a coherent resource suitable for performing model analysis at the regional level. Developed through a collaboration between the NASA Jet Propulsion Laboratory (JPL) and the UCLA Joint Institute for Regional Earth System Science and Engineering (JIFRESSE), the RCMES uses existing open source technologies (MySQL, Apache Hadoop, and Apache OODT), to construct a scalable, parametric, geospatial data store that incorporates decades of observational data from a variety of NASA Earth science missions, as well as other sources into a consistently annotated, highly available scientific resource. By eliminating arbitrary partitions in the data (individual file boundaries, differing file formats, etc), and instead treating each individual observational measurement as a unique, geospatially referenced data point, the RCMES is capable of transforming large, heterogeneous collections of disparate observational data into a unified resource suitable for comparison to climate model output. This facility is further enhanced by the availability of a model evaluation toolkit which consists of a set of Python libraries, a RESTful web service layer, and a browser-based graphical user interface that allows for orchestration of model-to-data comparisons by composing them visually through web forms. This combination of tools and interfaces dramatically simplifies the process of interacting with and utilizing large volumes of observational data for model evaluation research. We feel that the RCMES is particularly appealing in that it represents a principled, reusable architectural approach rather than a one-off technological implementation. In fact, early RCMES prototypes have already utilized a variety of implementation technologies in an effort to address different performance and scalability concerns. This has been greatly facilitated by the fact that, at the architectural level, the RCMES is fundamentally domain agnostic. Strictly separating the data model from the implementation has enabled us to create a reusable architecture that we believe can be modified and configured to suit the demands of researchers in other domains.
Security on Cloud Revocation Authority using Identity Based Encryption
NASA Astrophysics Data System (ADS)
Rajaprabha, M. N.
2017-11-01
As due to the era of cloud computing most of the people are saving there documents, files and other things on cloud spaces. Due to this security over the cloud is also important because all the confidential things are there on the cloud. So to overcome private key infrastructure (PKI) issues some revocable Identity Based Encryption (IBE) techniques are introduced which eliminates the demand of PKI. The technique introduced is key update cloud service provider which is having two issues in it and they are computation and communication cost is high and second one is scalability issue. So to overcome this problem we come along with the system in which the Cloud Revocation Authority (CRA) is there for the security which will only hold the secret key for each user. And the secret key was send with the help of advanced encryption standard security. The key is encrypted and send to the CRA for giving the authentication to the person who wants to share the data or files or for the communication purpose. Through that key only the other user will able to access that file and if the user apply some invalid key on the particular file than the information of that user and file is send to the administrator and administrator is having rights to block that person of black list that person to use the system services.
ArrayBridge: Interweaving declarative array processing with high-performance computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xing, Haoyuan; Floratos, Sofoklis; Blanas, Spyros
Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats. Scientists, however, desire more than a passive access method that reads arrays from files. This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats, that aimsmore » to make declarative array manipulations interoperable with imperative file-centric analyses. Our prototype implementation of ArrayBridge uses HDF5 as the underlying array storage library and seamlessly integrates into the SciDB open-source array database system. In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it. ArrayBridge also supports time travel queries from imperative kernels through the unmodified HDF5 API, and automatically deduplicates between array versions for space efficiency. Our extensive performance evaluation in NERSC, a large-scale scientific computing facility, shows that ArrayBridge exhibits statistically indistinguishable performance and I/O scalability to the native SciDB storage engine.« less
Jeong, Seol Young; Jo, Hyeong Gon; Kang, Soon Ju
2014-01-01
A tracking service like asset management is essential in a dynamic hospital environment consisting of numerous mobile assets (e.g., wheelchairs or infusion pumps) that are continuously relocated throughout a hospital. The tracking service is accomplished based on the key technologies of an indoor location-based service (LBS), such as locating and monitoring multiple mobile targets inside a building in real time. An indoor LBS such as a tracking service entails numerous resource lookups being requested concurrently and frequently from several locations, as well as a network infrastructure requiring support for high scalability in indoor environments. A traditional centralized architecture needs to maintain a geographic map of the entire building or complex in its central server, which can cause low scalability and traffic congestion. This paper presents a self-organizing and fully distributed indoor mobile asset management (MAM) platform, and proposes an architecture for multiple trackees (such as mobile assets) and trackers based on the proposed distributed platform in real time. In order to verify the suggested platform, scalability performance according to increases in the number of concurrent lookups was evaluated in a real test bed. Tracking latency and traffic load ratio in the proposed tracking architecture was also evaluated. PMID:24662407
Serving ocean model data on the cloud
Meisinger, Michael; Farcas, Claudiu; Farcas, Emilia; Alexander, Charles; Arrott, Matthew; de La Beaujardiere, Jeff; Hubbard, Paul; Mendelssohn, Roy; Signell, Richard P.
2010-01-01
The NOAA-led Integrated Ocean Observing System (IOOS) and the NSF-funded Ocean Observatories Initiative Cyberinfrastructure Project (OOI-CI) are collaborating on a prototype data delivery system for numerical model output and other gridded data using cloud computing. The strategy is to take an existing distributed system for delivering gridded data and redeploy on the cloud, making modifications to the system that allow it to harness the scalability of the cloud as well as adding functionality that the scalability affords.
McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Islam, Tanzima Zerin; Mohror, Kathryn; Bagchi, Saurabh; ...
2013-01-01
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility ofmore » checkpoints up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints show that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.« less
Heterogeneous information sharing of sensor information in contested environments
NASA Astrophysics Data System (ADS)
Wampler, Jason A.; Hsieh, Chien; Toth, Andrew; Sheatsley, Ryan
2017-05-01
The inherent nature of unattended sensors makes these devices most vulnerable to detection, exploitation, and denial in contested environments. Physical access is often cited as the easiest way to compromise any device or network. A new mechanism for mitigating these types of attacks developed under the Assistant Secretary of Defense for Research and Engineering, ASD(R and E) project, "Smoke Screen in Cyberspace", was demonstrated in a live, over-the-air experiment. Smoke Screen encrypts, slices up, and disburses redundant fragments of files throughout the network. Recovery is only possible after recovering all fragments and attacking/denying one or more nodes does not limit the availability of other fragment copies in the network. This experiment proved the feasibility of redundant file fragmentation, and is the foundation for developing sophisticated methods to blacklist compromised nodes, move data fragments from risks of compromise, and forward stored data fragments closer to the anticipated retrieval point. This paper outlines initial results in scalability of node members, fragment size, file size, and performance in a heterogeneous network consisting of the Wireless Network after Next (WNaN) radio and Common Sensor Radio (CSR).
Web Extensible Display Manager
DOE Office of Scientific and Technical Information (OSTI.GOV)
Slominski, Ryan; Larrieu, Theodore L.
Jefferson Lab's Web Extensible Display Manager (WEDM) allows staff to access EDM control system screens from a web browser in remote offices and from mobile devices. Native browser technologies are leveraged to avoid installing and managing software on remote clients such as browser plugins, tunnel applications, or an EDM environment. Since standard network ports are used firewall exceptions are minimized. To avoid security concerns from remote users modifying a control system, WEDM exposes read-only access and basic web authentication can be used to further restrict access. Updates of monitored EPICS channels are delivered via a Web Socket using a webmore » gateway. The software translates EDM description files (denoted with the edl suffix) to HTML with Scalable Vector Graphics (SVG) following the EDM's edl file vector drawing rules to create faithful screen renderings. The WEDM server parses edl files and creates the HTML equivalent in real-time allowing existing screens to work without modification. Alternatively, the familiar drag and drop EDM screen creation tool can be used to create optimized screens sized specifically for smart phones and then rendered by WEDM.« less
Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds
2012-01-01
Background The MapReduce framework enables a scalable processing and analyzing of large datasets by distributing the computational load on connected computer nodes, referred to as a cluster. In Bioinformatics, MapReduce has already been adopted to various case scenarios such as mapping next generation sequencing data to a reference genome, finding SNPs from short read data or matching strings in genotype files. Nevertheless, tasks like installing and maintaining MapReduce on a cluster system, importing data into its distributed file system or executing MapReduce programs require advanced knowledge in computer science and could thus prevent scientists from usage of currently available and useful software solutions. Results Here we present Cloudgene, a freely available platform to improve the usability of MapReduce programs in Bioinformatics by providing a graphical user interface for the execution, the import and export of data and the reproducibility of workflows on in-house (private clouds) and rented clusters (public clouds). The aim of Cloudgene is to build a standardized graphical execution environment for currently available and future MapReduce programs, which can all be integrated by using its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all time and data transfer times are therefore minimized. Conclusions Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding any computational overhead to existing programs. This platform gives developers the opportunity to focus on the actual implementation task and provides scientists a platform with the aim to hide the complexity of MapReduce. In addition to MapReduce programs, Cloudgene can also be used to launch predefined systems (e.g. Cloud BioLinux, RStudio) in public clouds. Currently, five different bioinformatic programs using MapReduce and two systems are integrated and have been successfully deployed. Cloudgene is freely available at http://cloudgene.uibk.ac.at. PMID:22888776
Storing files in a parallel computing system based on user-specified parser function
Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Manzanares, Adam; Torres, Aaron
2014-10-21
Techniques are provided for storing files in a parallel computing system based on a user-specified parser function. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a parser from the distributed application for processing the plurality of files prior to storage; and storing one or more of the plurality of files in one or more storage nodes of the parallel computing system based on the processing by the parser. The plurality of files comprise one or more of a plurality of complete files and a plurality of sub-files. The parser can optionally store only those files that satisfy one or more semantic requirements of the parser. The parser can also extract metadata from one or more of the files and the extracted metadata can be stored with one or more of the plurality of files and used for searching for files.
Scalability of transport parameters with pore sizes in isodense disordered media
NASA Astrophysics Data System (ADS)
Reginald, S. William; Schmitt, V.; Vallée, R. A. L.
2014-09-01
We study light multiple scattering in complex disordered porous materials. High internal phase emulsion-based isodense polystyrene foams are designed. Two types of samples, exhibiting different pore size distributions, are investigated for different slab thicknesses varying from L = 1 \\text{mm} to 10 \\text{mm} . Optical measurements combining steady-state and time-resolved detection are used to characterize the photon transport parameters. Very interestingly, a clear scalability of the transport mean free path \\ellt with the average size of the pores S is observed, featuring a constant velocity of the transport energy in these isodense structures. This study strongly motivates further investigations into the limits of validity of this scalability as the scattering strength of the system increases.
Architectures Toward Reusable Science Data Systems
NASA Astrophysics Data System (ADS)
Moses, J. F.
2014-12-01
Science Data Systems (SDS) comprise an important class of data processing systems that support product generation from remote sensors and in-situ observations. These systems enable research into new science data products, replication of experiments and verification of results. NASA has been building ground systems for satellite data processing since the first Earth observing satellites launched and is continuing development of systems to support NASA science research, NOAA's weather satellites and USGS's Earth observing satellite operations. The basic data processing workflows and scenarios continue to be valid for remote sensor observations research as well as for the complex multi-instrument operational satellite data systems being built today. System functions such as ingest, product generation and distribution need to be configured and performed in a consistent and repeatable way with an emphasis on scalability. This paper will examine the key architectural elements of several NASA satellite data processing systems currently in operation and under development that make them suitable for scaling and reuse. Examples of architectural elements that have become attractive include virtual machine environments, standard data product formats, metadata content and file naming, workflow and job management frameworks, data acquisition, search, and distribution protocols. By highlighting key elements and implementation experience the goal is to recognize architectures that will outlast their original application and be readily adaptable for new applications. Concepts and principles are explored that lead to sound guidance for SDS developers and strategists.
Architectures Toward Reusable Science Data Systems
NASA Technical Reports Server (NTRS)
Moses, John
2015-01-01
Science Data Systems (SDS) comprise an important class of data processing systems that support product generation from remote sensors and in-situ observations. These systems enable research into new science data products, replication of experiments and verification of results. NASA has been building systems for satellite data processing since the first Earth observing satellites launched and is continuing development of systems to support NASA science research and NOAAs Earth observing satellite operations. The basic data processing workflows and scenarios continue to be valid for remote sensor observations research as well as for the complex multi-instrument operational satellite data systems being built today. System functions such as ingest, product generation and distribution need to be configured and performed in a consistent and repeatable way with an emphasis on scalability. This paper will examine the key architectural elements of several NASA satellite data processing systems currently in operation and under development that make them suitable for scaling and reuse. Examples of architectural elements that have become attractive include virtual machine environments, standard data product formats, metadata content and file naming, workflow and job management frameworks, data acquisition, search, and distribution protocols. By highlighting key elements and implementation experience we expect to find architectures that will outlast their original application and be readily adaptable for new applications. Concepts and principles are explored that lead to sound guidance for SDS developers and strategists.
Large Scale Frequent Pattern Mining using MPI One-Sided Model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vishnu, Abhinav; Agarwal, Khushbu
In this paper, we propose a work-stealing runtime --- Library for Work Stealing LibWS --- using MPI one-sided model for designing scalable FP-Growth --- {\\em de facto} frequent pattern mining algorithm --- on large scale systems. LibWS provides locality efficient and highly scalable work-stealing techniques for load balancing on a variety of data distributions. We also propose a novel communication algorithm for FP-growth data exchange phase, which reduces the communication complexity from state-of-the-art O(p) to O(f + p/f) for p processes and f frequent attributed-ids. FP-Growth is implemented using LibWS and evaluated on several work distributions and support counts. Anmore » experimental evaluation of the FP-Growth on LibWS using 4096 processes on an InfiniBand Cluster demonstrates excellent efficiency for several work distributions (87\\% efficiency for Power-law and 91% for Poisson). The proposed distributed FP-Tree merging algorithm provides 38x communication speedup on 4096 cores.« less
NASA Astrophysics Data System (ADS)
Prasad, Guru; Jayaram, Sanjay; Ward, Jami; Gupta, Pankaj
2004-08-01
In this paper, Aximetric proposes a decentralized Command and Control (C2) architecture for a distributed control of a cluster of on-board health monitoring and software enabled control systems called SimBOX that will use some of the real-time infrastructure (RTI) functionality from the current military real-time simulation architecture. The uniqueness of the approach is to provide a "plug and play environment" for various system components that run at various data rates (Hz) and the ability to replicate or transfer C2 operations to various subsystems in a scalable manner. This is possible by providing a communication bus called "Distributed Shared Data Bus" and a distributed computing environment used to scale the control needs by providing a self-contained computing, data logging and control function module that can be rapidly reconfigured to perform different functions. This kind of software-enabled control is very much needed to meet the needs of future aerospace command and control functions.
NASA Astrophysics Data System (ADS)
Prasad, Guru; Jayaram, Sanjay; Ward, Jami; Gupta, Pankaj
2004-09-01
In this paper, Aximetric proposes a decentralized Command and Control (C2) architecture for a distributed control of a cluster of on-board health monitoring and software enabled control systems called
Drawert, Brian; Trogdon, Michael; Toor, Salman; Petzold, Linda; Hellander, Andreas
2016-01-01
Computational experiments using spatial stochastic simulations have led to important new biological insights, but they require specialized tools and a complex software stack, as well as large and scalable compute and data analysis resources due to the large computational cost associated with Monte Carlo computational workflows. The complexity of setting up and managing a large-scale distributed computation environment to support productive and reproducible modeling can be prohibitive for practitioners in systems biology. This results in a barrier to the adoption of spatial stochastic simulation tools, effectively limiting the type of biological questions addressed by quantitative modeling. In this paper, we present PyURDME, a new, user-friendly spatial modeling and simulation package, and MOLNs, a cloud computing appliance for distributed simulation of stochastic reaction-diffusion models. MOLNs is based on IPython and provides an interactive programming platform for development of sharable and reproducible distributed parallel computational experiments.
NASA Technical Reports Server (NTRS)
Li, Zhenlong; Hu, Fei; Schnase, John L.; Duffy, Daniel Q.; Lee, Tsengdar; Bowen, Michael K.; Yang, Chaowei
2016-01-01
Climate observations and model simulations are producing vast amounts of array-based spatiotemporal data. Efficient processing of these data is essential for assessing global challenges such as climate change, natural disasters, and diseases. This is challenging not only because of the large data volume, but also because of the intrinsic high-dimensional nature of geoscience data. To tackle this challenge, we propose a spatiotemporal indexing approach to efficiently manage and process big climate data with MapReduce in a highly scalable environment. Using this approach, big climate data are directly stored in a Hadoop Distributed File System in its original, native file format. A spatiotemporal index is built to bridge the logical array-based data model and the physical data layout, which enables fast data retrieval when performing spatiotemporal queries. Based on the index, a data-partitioning algorithm is applied to enable MapReduce to achieve high data locality, as well as balancing the workload. The proposed indexing approach is evaluated using the National Aeronautics and Space Administration (NASA) Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. The experimental results show that the index can significantly accelerate querying and processing (10 speedup compared to the baseline test using the same computing cluster), while keeping the index-to-data ratio small (0.0328). The applicability of the indexing approach is demonstrated by a climate anomaly detection deployed on a NASA Hadoop cluster. This approach is also able to support efficient processing of general array-based spatiotemporal data in various geoscience domains without special configuration on a Hadoop cluster.
Fabrication of Scalable Indoor Light Energy Harvester and Study for Agricultural IoT Applications
NASA Astrophysics Data System (ADS)
Watanabe, M.; Nakamura, A.; Kunii, A.; Kusano, K.; Futagawa, M.
2015-12-01
A scalable indoor light energy harvester was fabricated by microelectromechanical system (MEMS) and printing hybrid technology and evaluated for agricultural IoT applications under different environmental input power density conditions, such as outdoor farming under the sun, greenhouse farming under scattered lighting, and a plant factory under LEDs. We fabricated and evaluated a dye- sensitized-type solar cell (DSC) as a low cost and “scalable” optical harvester device. We developed a transparent conductive oxide (TCO)-less process with a honeycomb metal mesh substrate fabricated by MEMS technology. In terms of the electrical and optical properties, we achieved scalable harvester output power by cell area sizing. Second, we evaluated the dependence of the input power scalable characteristics on the input light intensity, spectrum distribution, and light inlet direction angle, because harvested environmental input power is unstable. The TiO2 fabrication relied on nanoimprint technology, which was designed for optical optimization and fabrication, and we confirmed that the harvesters are robust to a variety of environments. Finally, we studied optical energy harvesting applications for agricultural IoT systems. These scalable indoor light harvesters could be used in many applications and situations in smart agriculture.
Theoretical and Empirical Analysis of a Spatial EA Parallel Boosting Algorithm.
Kamath, Uday; Domeniconi, Carlotta; De Jong, Kenneth
2018-01-01
Many real-world problems involve massive amounts of data. Under these circumstances learning algorithms often become prohibitively expensive, making scalability a pressing issue to be addressed. A common approach is to perform sampling to reduce the size of the dataset and enable efficient learning. Alternatively, one customizes learning algorithms to achieve scalability. In either case, the key challenge is to obtain algorithmic efficiency without compromising the quality of the results. In this article we discuss a meta-learning algorithm (PSBML) that combines concepts from spatially structured evolutionary algorithms (SSEAs) with concepts from ensemble and boosting methodologies to achieve the desired scalability property. We present both theoretical and empirical analyses which show that PSBML preserves a critical property of boosting, specifically, convergence to a distribution centered around the margin. We then present additional empirical analyses showing that this meta-level algorithm provides a general and effective framework that can be used in combination with a variety of learning classifiers. We perform extensive experiments to investigate the trade-off achieved between scalability and accuracy, and robustness to noise, on both synthetic and real-world data. These empirical results corroborate our theoretical analysis, and demonstrate the potential of PSBML in achieving scalability without sacrificing accuracy.
A Flexible Online Metadata Editing and Management System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aguilar, Raul; Pan, Jerry Yun; Gries, Corinna
2010-01-01
A metadata editing and management system is being developed employing state of the art XML technologies. A modular and distributed design was chosen for scalability, flexibility, options for customizations, and the possibility to add more functionality at a later stage. The system consists of a desktop design tool or schema walker used to generate code for the actual online editor, a native XML database, and an online user access management application. The design tool is a Java Swing application that reads an XML schema, provides the designer with options to combine input fields into online forms and give the fieldsmore » user friendly tags. Based on design decisions, the tool generates code for the online metadata editor. The code generated is an implementation of the XForms standard using the Orbeon Framework. The design tool fulfills two requirements: First, data entry forms based on one schema may be customized at design time and second data entry applications may be generated for any valid XML schema without relying on custom information in the schema. However, the customized information generated at design time is saved in a configuration file which may be re-used and changed again in the design tool. Future developments will add functionality to the design tool to integrate help text, tool tips, project specific keyword lists, and thesaurus services. Additional styling of the finished editor is accomplished via cascading style sheets which may be further customized and different look-and-feels may be accumulated through the community process. The customized editor produces XML files in compliance with the original schema, however, data from the current page is saved into a native XML database whenever the user moves to the next screen or pushes the save button independently of validity. Currently the system uses the open source XML database eXist for storage and management, which comes with third party online and desktop management tools. However, access to metadata files in the application introduced here is managed in a custom online module, using a MySQL backend accessed by a simple Java Server Faces front end. A flexible system with three grouping options, organization, group and single editing access is provided. Three levels were chosen to distribute administrative responsibilities and handle the common situation of an information manager entering the bulk of the metadata but leave specifics to the actual data provider.« less
Fully programmable and scalable optical switching fabric for petabyte data center.
Zhu, Zhonghua; Zhong, Shan; Chen, Li; Chen, Kai
2015-02-09
We present a converged EPS and OCS switching fabric for data center networks (DCNs) based on a distributed optical switching architecture leveraging both WDM & SDM technologies. The architecture is topology adaptive, well suited to dynamic and diverse *-cast traffic patterns. Compared to a typical folded-Clos network, the new architecture is more readily scalable to future multi-Petabyte data centers with 1000 + racks while providing a higher link bandwidth, reducing transceiver count by 50%, and improving cabling efficiency by more than 90%.
Next Generation Distributed Computing for Cancer Research
Agarwal, Pankaj; Owzar, Kouros
2014-01-01
Advances in next generation sequencing (NGS) and mass spectrometry (MS) technologies have provided many new opportunities and angles for extending the scope of translational cancer research while creating tremendous challenges in data management and analysis. The resulting informatics challenge is invariably not amenable to the use of traditional computing models. Recent advances in scalable computing and associated infrastructure, particularly distributed computing for Big Data, can provide solutions for addressing these challenges. In this review, the next generation of distributed computing technologies that can address these informatics problems is described from the perspective of three key components of a computational platform, namely computing, data storage and management, and networking. A broad overview of scalable computing is provided to set the context for a detailed description of Hadoop, a technology that is being rapidly adopted for large-scale distributed computing. A proof-of-concept Hadoop cluster, set up for performance benchmarking of NGS read alignment, is described as an example of how to work with Hadoop. Finally, Hadoop is compared with a number of other current technologies for distributed computing. PMID:25983539
Distributed controller clustering in software defined networks.
Abdelaziz, Ahmed; Fong, Ang Tan; Gani, Abdullah; Garba, Usman; Khan, Suleman; Akhunzada, Adnan; Talebian, Hamid; Choo, Kim-Kwang Raymond
2017-01-01
Software Defined Networking (SDN) is an emerging promising paradigm for network management because of its centralized network intelligence. However, the centralized control architecture of the software-defined networks (SDNs) brings novel challenges of reliability, scalability, fault tolerance and interoperability. In this paper, we proposed a novel clustered distributed controller architecture in the real setting of SDNs. The distributed cluster implementation comprises of multiple popular SDN controllers. The proposed mechanism is evaluated using a real world network topology running on top of an emulated SDN environment. The result shows that the proposed distributed controller clustering mechanism is able to significantly reduce the average latency from 8.1% to 1.6%, the packet loss from 5.22% to 4.15%, compared to distributed controller without clustering running on HP Virtual Application Network (VAN) SDN and Open Network Operating System (ONOS) controllers respectively. Moreover, proposed method also shows reasonable CPU utilization results. Furthermore, the proposed mechanism makes possible to handle unexpected load fluctuations while maintaining a continuous network operation, even when there is a controller failure. The paper is a potential contribution stepping towards addressing the issues of reliability, scalability, fault tolerance, and inter-operability.
Next generation distributed computing for cancer research.
Agarwal, Pankaj; Owzar, Kouros
2014-01-01
Advances in next generation sequencing (NGS) and mass spectrometry (MS) technologies have provided many new opportunities and angles for extending the scope of translational cancer research while creating tremendous challenges in data management and analysis. The resulting informatics challenge is invariably not amenable to the use of traditional computing models. Recent advances in scalable computing and associated infrastructure, particularly distributed computing for Big Data, can provide solutions for addressing these challenges. In this review, the next generation of distributed computing technologies that can address these informatics problems is described from the perspective of three key components of a computational platform, namely computing, data storage and management, and networking. A broad overview of scalable computing is provided to set the context for a detailed description of Hadoop, a technology that is being rapidly adopted for large-scale distributed computing. A proof-of-concept Hadoop cluster, set up for performance benchmarking of NGS read alignment, is described as an example of how to work with Hadoop. Finally, Hadoop is compared with a number of other current technologies for distributed computing.
Mapping RNA-seq Reads with STAR
Dobin, Alexander; Gingeras, Thomas R.
2015-01-01
Mapping of large sets of high-throughput sequencing reads to a reference genome is one of the foundational steps in RNA-seq data analysis. The STAR software package performs this task with high levels of accuracy and speed. In addition to detecting annotated and novel splice junctions, STAR is capable of discovering more complex RNA sequence arrangements, such as chimeric and circular RNA. STAR can align spliced sequences of any length with moderate error rates providing scalability for emerging sequencing technologies. STAR generates output files that can be used for many downstream analyses such as transcript/gene expression quantification, differential gene expression, novel isoform reconstruction, signal visualization, and so forth. In this unit we describe computational protocols that produce various output files, use different RNA-seq datatypes, and utilize different mapping strategies. STAR is Open Source software that can be run on Unix, Linux or Mac OS X systems. PMID:26334920
Mapping RNA-seq Reads with STAR.
Dobin, Alexander; Gingeras, Thomas R
2015-09-03
Mapping of large sets of high-throughput sequencing reads to a reference genome is one of the foundational steps in RNA-seq data analysis. The STAR software package performs this task with high levels of accuracy and speed. In addition to detecting annotated and novel splice junctions, STAR is capable of discovering more complex RNA sequence arrangements, such as chimeric and circular RNA. STAR can align spliced sequences of any length with moderate error rates, providing scalability for emerging sequencing technologies. STAR generates output files that can be used for many downstream analyses such as transcript/gene expression quantification, differential gene expression, novel isoform reconstruction, and signal visualization. In this unit, we describe computational protocols that produce various output files, use different RNA-seq datatypes, and utilize different mapping strategies. STAR is open source software that can be run on Unix, Linux, or Mac OS X systems. Copyright © 2015 John Wiley & Sons, Inc.
Final Report for File System Support for Burst Buffers on HPC Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, W.; Mohror, K.
Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. As they are being deployed on more supercomputers, a file system that efficiently manages these burst buffers for fast I/O operations carries great consequence. Over the past year, FSU team has undertaken several efforts to design, prototype and evaluate distributed file systems for burst buffers on HPC systems. These include MetaKV: a Key-Value Store for Metadata Management of Distributed Burst Buffers, a user-level file system with multiple backends, and a specialized file system for large datasets of deep neural networks. Our progress for these respectivemore » efforts are elaborated further in this report.« less
Emerging Cyber Infrastructure for NASA's Large-Scale Climate Data Analytics
NASA Astrophysics Data System (ADS)
Duffy, D.; Spear, C.; Bowen, M. K.; Thompson, J. H.; Hu, F.; Yang, C. P.; Pierce, D.
2016-12-01
The resolution of NASA climate and weather simulations have grown dramatically over the past few years with the highest-fidelity models reaching down to 1.5 KM global resolutions. With each doubling of the resolution, the resulting data sets grow by a factor of eight in size. As the climate and weather models push the envelope even further, a new infrastructure to store data and provide large-scale data analytics is necessary. The NASA Center for Climate Simulation (NCCS) has deployed the Data Analytics Storage Service (DASS) that combines scalable storage with the ability to perform in-situ analytics. Within this system, large, commonly used data sets are stored in a POSIX file system (write once/read many); examples of data stored include Landsat, MERRA2, observing system simulation experiments, and high-resolution downscaled reanalysis. The total size of this repository is on the order of 15 petabytes of storage. In addition to the POSIX file system, the NCCS has deployed file system connectors to enable emerging analytics built on top of the Hadoop File System (HDFS) to run on the same storage servers within the DASS. Coupled with a custom spatiotemporal indexing approach, users can now run emerging analytical operations built on MapReduce and Spark on the same data files stored within the POSIX file system without having to make additional copies. This presentation will discuss the architecture of this system and present benchmark performance measurements from traditional TeraSort and Wordcount to large-scale climate analytical operations on NetCDF data.
Storing files in a parallel computing system based on user or application specification
DOE Office of Scientific and Technical Information (OSTI.GOV)
Faibish, Sorin; Bent, John M.; Nick, Jeffrey M.
2016-03-29
Techniques are provided for storing files in a parallel computing system based on a user-specification. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a specification from the distributed application indicating how the plurality of files should be stored; and storing one or more of the plurality of files in one or more storage nodes of a multi-tier storage system based on the specification. The plurality of files comprise a plurality of complete files and/or a plurality of sub-files. The specification can optionally be processed by a daemon executing on onemore » or more nodes in a multi-tier storage system. The specification indicates how the plurality of files should be stored, for example, identifying one or more storage nodes where the plurality of files should be stored.« less
AEGIS: a robust and scalable real-time public health surveillance system.
Reis, Ben Y; Kirby, Chaim; Hadden, Lucy E; Olson, Karen; McMurry, Andrew J; Daniel, James B; Mandl, Kenneth D
2007-01-01
In this report, we describe the Automated Epidemiological Geotemporal Integrated Surveillance system (AEGIS), developed for real-time population health monitoring in the state of Massachusetts. AEGIS provides public health personnel with automated near-real-time situational awareness of utilization patterns at participating healthcare institutions, supporting surveillance of bioterrorism and naturally occurring outbreaks. As real-time public health surveillance systems become integrated into regional and national surveillance initiatives, the challenges of scalability, robustness, and data security become increasingly prominent. A modular and fault tolerant design helps AEGIS achieve scalability and robustness, while a distributed storage model with local autonomy helps to minimize risk of unauthorized disclosure. The report includes a description of the evolution of the design over time in response to the challenges of a regional and national integration environment.
A Scalable Data Access Layer to Manage Structured Heterogeneous Biomedical Data.
Delussu, Giovanni; Lianas, Luca; Frexia, Francesca; Zanetti, Gianluigi
2016-01-01
This work presents a scalable data access layer, called PyEHR, designed to support the implementation of data management systems for secondary use of structured heterogeneous biomedical and clinical data. PyEHR adopts the openEHR's formalisms to guarantee the decoupling of data descriptions from implementation details and exploits structure indexing to accelerate searches. Data persistence is guaranteed by a driver layer with a common driver interface. Interfaces for two NoSQL Database Management Systems are already implemented: MongoDB and Elasticsearch. We evaluated the scalability of PyEHR experimentally through two types of tests, called "Constant Load" and "Constant Number of Records", with queries of increasing complexity on synthetic datasets of ten million records each, containing very complex openEHR archetype structures, distributed on up to ten computing nodes.
A look at scalable dense linear algebra libraries
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dongarra, J.J.; Van de Geijn, R.A.; Walker, D.W.
1992-01-01
We discuss the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers. The square block scattered decomposition is proposed as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems. An object- oriented interface to the library permits more portable applications to be written, and is easy to learn and use, since details of the parallel implementation are hidden from the user. Experiments on the Intel Touchstone Delta system with a prototype code that uses the square block scattered decomposition to perform LU factorization aremore » presented and analyzed. It was found that the code was both scalable and efficient, performing at about 14 GFLOPS (double precision) for the largest problem considered.« less
A look at scalable dense linear algebra libraries
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dongarra, J.J.; Van de Geijn, R.A.; Walker, D.W.
1992-08-01
We discuss the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers. The square block scattered decomposition is proposed as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems. An object- oriented interface to the library permits more portable applications to be written, and is easy to learn and use, since details of the parallel implementation are hidden from the user. Experiments on the Intel Touchstone Delta system with a prototype code that uses the square block scattered decomposition to perform LU factorization aremore » presented and analyzed. It was found that the code was both scalable and efficient, performing at about 14 GFLOPS (double precision) for the largest problem considered.« less
Scaling the flood regime with the soil hydraulic properties of the catchment
NASA Astrophysics Data System (ADS)
Peña Rojas, Luis Eduardo; Francés García, Félix; Barrios Peña, Miguel
2015-04-01
The spatial land cover distribution and soil type affect the hydraulic properties of soils, facilitating or retarding the infiltration rate and the response of a catchment during flooding events. This research analyzes: 1) the effect of land cover use in different time periods as a source of annual maximum flood records nonstationarity; 2) the scalability of the relationship between soil hydraulic properties of the catchment (initial abstractions, upper soil capillary storage and vertical and horizontal hydraulic conductivity) and the flood regime. The study was conducted in Combeima River basin in Colombia - South America and it was modelled the changes in the land uses registered in 1991, 2000, 2002 and 2007, using distributed hydrological modelling and nonparametric tests. The results showed that changes in land use affect hydraulic properties of soil and it has influence on the magnitude of flood peaks. What is a new finding is that this behavior is scalable with the soil hydraulic properties of the catchment flood moments have a simple scaling behavior and the peaks flow increases with higher values of capillary soil storage, whereas higher values, the peaks decreased. Finally it was applied Generalized Extreme Values and it was found scalable behavior in the parameters of the probability distribution function. The results allowed us to find a relationship between soil hydraulic properties and the behavior of flood regime in the basin studied.
The Use of Binary Search Trees in External Distribution Sorting.
ERIC Educational Resources Information Center
Cooper, David; Lynch, Michael F.
1984-01-01
Suggests new method of external distribution called tree partitioning that involves use of binary tree to split incoming file into successively smaller partitions for internal sorting. Number of disc accesses during a tree-partitioning sort were calculated in simulation using files extracted from British National Bibliography catalog files. (19…
Analysis of scalability of high-performance 3D image processing platform for virtual colonoscopy
NASA Astrophysics Data System (ADS)
Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli
2014-03-01
One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. For this purpose, we previously developed a software platform for high-performance 3D medical image processing, called HPC 3D-MIP platform, which employs increasingly available and affordable commodity computing systems such as the multicore, cluster, and cloud computing systems. To achieve scalable high-performance computing, the platform employed size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D-MIP algorithms, supported task scheduling for efficient load distribution and balancing, and consisted of a layered parallel software libraries that allow image processing applications to share the common functionalities. We evaluated the performance of the HPC 3D-MIP platform by applying it to computationally intensive processes in virtual colonoscopy. Experimental results showed a 12-fold performance improvement on a workstation with 12-core CPUs over the original sequential implementation of the processes, indicating the efficiency of the platform. Analysis of performance scalability based on the Amdahl's law for symmetric multicore chips showed the potential of a high performance scalability of the HPC 3DMIP platform when a larger number of cores is available.
Medical image archive node simulation and architecture
NASA Astrophysics Data System (ADS)
Chiang, Ted T.; Tang, Yau-Kuo
1996-05-01
It is a well known fact that managed care and new treatment technologies are revolutionizing the health care provider world. Community Health Information Network and Computer-based Patient Record projects are underway throughout the United States. More and more hospitals are installing digital, `filmless' radiology (and other imagery) systems. They generate a staggering amount of information around the clock. For example, a typical 500-bed hospital might accumulate more than 5 terabytes of image data in a period of 30 years for conventional x-ray images and digital images such as Magnetic Resonance Imaging and Computer Tomography images. With several hospitals contributing to the archive, the storage required will be in the hundreds of terabytes. Systems for reliable, secure, and inexpensive storage and retrieval of digital medical information do not exist today. In this paper, we present a Medical Image Archive and Distribution Service (MIADS) concept. MIADS is a system shared by individual and community hospitals, laboratories, and doctors' offices that need to store and retrieve medical images. Due to the large volume and complexity of the data, as well as the diversified user access requirement, implementation of the MIADS will be a complex procedure. One of the key challenges to implementing a MIADS is to select a cost-effective, scalable system architecture to meet the ingest/retrieval performance requirements. We have performed an in-depth system engineering study, and developed a sophisticated simulation model to address this key challenge. This paper describes the overall system architecture based on our system engineering study and simulation results. In particular, we will emphasize system scalability and upgradability issues. Furthermore, we will discuss our simulation results in detail. The simulations study the ingest/retrieval performance requirements based on different system configurations and architectures for variables such as workload, tape access time, number of drives, number of exams per patient, number of Central Processing Units, patient grouping, and priority impacts. The MIADS, which could be a key component of a broader data repository system, will be able to communicate with and obtain data from existing hospital information systems. We will discuss the external interfaces enabling MIADS to communicate with and obtain data from existing Radiology Information Systems such as the Picture Archiving and Communication System (PACS). Our system design encompasses the broader aspects of the archive node, which could include multimedia data such as image, audio, video, and free text data. This system is designed to be integrated with current hospital PACS through a Digital Imaging and Communications in Medicine interface. However, the system can also be accessed through the Internet using Hypertext Transport Protocol or Simple File Transport Protocol. Our design and simulation work will be key to implementing a successful, scalable medical image archive and distribution system.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URL's (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
26 CFR 1.355-5 - Records to be kept and information to be filed.
Code of Federal Regulations, 2010 CFR
2010-04-01
... Records to be kept and information to be filed. (a) Distributing corporation—(1) In general. Every corporation that makes a distribution (the distributing corporation) of stock or securities of a controlled... (IF ANY) OF TAXPAYER], A DISTRIBUTING CORPORATION,” on or with its return for the year of the...
Lima, Jakelyne; Cerdeira, Louise Teixeira; Bol, Erick; Schneider, Maria Paula Cruz; Silva, Artur; Azevedo, Vasco; Abelém, Antônio Jorge Gomes
2012-01-01
Improvements in genome sequencing techniques have resulted in generation of huge volumes of data. As a consequence of this progress, the genome assembly stage demands even more computational power, since the incoming sequence files contain large amounts of data. To speed up the process, it is often necessary to distribute the workload among a group of machines. However, this requires hardware and software solutions specially configured for this purpose. Grid computing try to simplify this process of aggregate resources, but do not always offer the best performance possible due to heterogeneity and decentralized management of its resources. Thus, it is necessary to develop software that takes into account these peculiarities. In order to achieve this purpose, we developed an algorithm aimed to optimize the functionality of de novo assembly software ABySS in order to optimize its operation in grids. We run ABySS with and without the algorithm we developed in the grid simulator SimGrid. Tests showed that our algorithm is viable, flexible, and scalable even on a heterogeneous environment, which improved the genome assembly time in computational grids without changing its quality. PMID:22461785
Bringing the CMS distributed computing system into scalable operations
NASA Astrophysics Data System (ADS)
Belforte, S.; Fanfani, A.; Fisk, I.; Flix, J.; Hernández, J. M.; Kress, T.; Letts, J.; Magini, N.; Miccio, V.; Sciabà, A.
2010-04-01
Establishing efficient and scalable operations of the CMS distributed computing system critically relies on the proper integration, commissioning and scale testing of the data and workload management tools, the various computing workflows and the underlying computing infrastructure, located at more than 50 computing centres worldwide and interconnected by the Worldwide LHC Computing Grid. Computing challenges periodically undertaken by CMS in the past years with increasing scale and complexity have revealed the need for a sustained effort on computing integration and commissioning activities. The Processing and Data Access (PADA) Task Force was established at the beginning of 2008 within the CMS Computing Program with the mandate of validating the infrastructure for organized processing and user analysis including the sites and the workload and data management tools, validating the distributed production system by performing functionality, reliability and scale tests, helping sites to commission, configure and optimize the networking and storage through scale testing data transfers and data processing, and improving the efficiency of accessing data across the CMS computing system from global transfers to local access. This contribution reports on the tools and procedures developed by CMS for computing commissioning and scale testing as well as the improvements accomplished towards efficient, reliable and scalable computing operations. The activities include the development and operation of load generators for job submission and data transfers with the aim of stressing the experiment and Grid data management and workload management systems, site commissioning procedures and tools to monitor and improve site availability and reliability, as well as activities targeted to the commissioning of the distributed production, user analysis and monitoring systems.
26 CFR 1.562-3 - Distributions by a member of an affiliated group.
Code of Federal Regulations, 2011 CFR
2011-04-01
... Distributions by a member of an affiliated group. A personal holding company which files or is required to file a consolidated return with other members of an affiliated group may be required to file a separate personal holding company schedule by reason of the limitations and exceptions provided in section 542(b...
26 CFR 1.562-3 - Distributions by a member of an affiliated group.
Code of Federal Regulations, 2010 CFR
2010-04-01
... Distributions by a member of an affiliated group. A personal holding company which files or is required to file a consolidated return with other members of an affiliated group may be required to file a separate personal holding company schedule by reason of the limitations and exceptions provided in section 542(b...
Space-Filling Supercapacitor Carpets: Highly scalable fractal architecture for energy storage
NASA Astrophysics Data System (ADS)
Tiliakos, Athanasios; Trefilov, Alexandra M. I.; Tanasǎ, Eugenia; Balan, Adriana; Stamatin, Ioan
2018-04-01
Revamping ground-breaking ideas from fractal geometry, we propose an alternative micro-supercapacitor configuration realized by laser-induced graphene (LIG) foams produced via laser pyrolysis of inexpensive commercial polymers. The Space-Filling Supercapacitor Carpet (SFSC) architecture introduces the concept of nested electrodes based on the pre-fractal Peano space-filling curve, arranged in a symmetrical equilateral setup that incorporates multiple parallel capacitor cells sharing common electrodes for maximum efficiency and optimal length-to-area distribution. We elucidate on the theoretical foundations of the SFSC architecture, and we introduce innovations (high-resolution vector-mode printing) in the LIG method that allow for the realization of flexible and scalable devices based on low iterations of the Peano algorithm. SFSCs exhibit distributed capacitance properties, leading to capacitance, energy, and power ratings proportional to the number of nested electrodes (up to 4.3 mF, 0.4 μWh, and 0.2 mW for the largest tested model of low iteration using aqueous electrolytes), with competitively high energy and power densities. This can pave the road for full scalability in energy storage, reaching beyond the scale of micro-supercapacitors for incorporating into larger and more demanding applications.
Aphinyanaphongs, Yin; Fu, Lawrence D; Aliferis, Constantin F
2013-01-01
Building machine learning models that identify unproven cancer treatments on the Health Web is a promising approach for dealing with the dissemination of false and dangerous information to vulnerable health consumers. Aside from the obvious requirement of accuracy, two issues are of practical importance in deploying these models in real world applications. (a) Generalizability: The models must generalize to all treatments (not just the ones used in the training of the models). (b) Scalability: The models can be applied efficiently to billions of documents on the Health Web. First, we provide methods and related empirical data demonstrating strong accuracy and generalizability. Second, by combining the MapReduce distributed architecture and high dimensionality compression via Markov Boundary feature selection, we show how to scale the application of the models to WWW-scale corpora. The present work provides evidence that (a) a very small subset of unproven cancer treatments is sufficient to build a model to identify unproven treatments on the web; (b) unproven treatments use distinct language to market their claims and this language is learnable; (c) through distributed parallelization and state of the art feature selection, it is possible to prepare the corpora and build and apply models with large scalability.
A distributed-memory approximation algorithm for maximum weight perfect bipartite matching
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Buluc, Aydin; Li, Xiaoye S.
We design and implement an efficient parallel approximation algorithm for the problem of maximum weight perfect matching in bipartite graphs, i.e. the problem of finding a set of non-adjacent edges that covers all vertices and has maximum weight. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization where sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64, are widely used due to the lack of scalable alternatives. To overcome this limitation, we proposemore » a fully parallel distributed memory algorithm that first generates a perfect matching and then searches for weightaugmenting cycles of length four in parallel and iteratively augments the matching with a vertex disjoint set of such cycles. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.« less
A user-oriented synthetic workload generator
NASA Technical Reports Server (NTRS)
Kao, Wei-Lun
1991-01-01
A user oriented synthetic workload generator that simulates users' file access behavior based on real workload characterization is described. The model for this workload generator is user oriented and job specific, represents file I/O operations at the system call level, allows general distributions for the usage measures, and assumes independence in the file I/O operation stream. The workload generator consists of three parts which handle specification of distributions, creation of an initial file system, and selection and execution of file I/O operations. Experiments on SUN NFS are shown to demonstrate the usage of the workload generator.
Scalable Robust Principal Component Analysis Using Grassmann Averages.
Hauberg, Sren; Feragen, Aasa; Enficiaud, Raffi; Black, Michael J
2016-11-01
In large datasets, manual data verification is impossible, and we must expect the number of outliers to increase with data size. While principal component analysis (PCA) can reduce data size, and scalable solutions exist, it is well-known that outliers can arbitrarily corrupt the results. Unfortunately, state-of-the-art approaches for robust PCA are not scalable. We note that in a zero-mean dataset, each observation spans a one-dimensional subspace, giving a point on the Grassmann manifold. We show that the average subspace corresponds to the leading principal component for Gaussian data. We provide a simple algorithm for computing this Grassmann Average ( GA), and show that the subspace estimate is less sensitive to outliers than PCA for general distributions. Because averages can be efficiently computed, we immediately gain scalability. We exploit robust averaging to formulate the Robust Grassmann Average (RGA) as a form of robust PCA. The resulting Trimmed Grassmann Average ( TGA) is appropriate for computer vision because it is robust to pixel outliers. The algorithm has linear computational complexity and minimal memory requirements. We demonstrate TGA for background modeling, video restoration, and shadow removal. We show scalability by performing robust PCA on the entire Star Wars IV movie; a task beyond any current method. Source code is available online.
Novel Scalable 3-D MT Inverse Solver
NASA Astrophysics Data System (ADS)
Kuvshinov, A. V.; Kruglyakov, M.; Geraskin, A.
2016-12-01
We present a new, robust and fast, three-dimensional (3-D) magnetotelluric (MT) inverse solver. As a forward modelling engine a highly-scalable solver extrEMe [1] is used. The (regularized) inversion is based on an iterative gradient-type optimization (quasi-Newton method) and exploits adjoint sources approach for fast calculation of the gradient of the misfit. The inverse solver is able to deal with highly detailed and contrasting models, allows for working (separately or jointly) with any type of MT (single-site and/or inter-site) responses, and supports massive parallelization. Different parallelization strategies implemented in the code allow for optimal usage of available computational resources for a given problem set up. To parameterize an inverse domain a mask approach is implemented, which means that one can merge any subset of forward modelling cells in order to account for (usually) irregular distribution of observation sites. We report results of 3-D numerical experiments aimed at analysing the robustness, performance and scalability of the code. In particular, our computational experiments carried out at different platforms ranging from modern laptops to high-performance clusters demonstrate practically linear scalability of the code up to thousands of nodes. 1. Kruglyakov, M., A. Geraskin, A. Kuvshinov, 2016. Novel accurate and scalable 3-D MT forward solver based on a contracting integral equation method, Computers and Geosciences, in press.
NASA Astrophysics Data System (ADS)
Plaza, Antonio; Plaza, Javier; Paz, Abel
2010-10-01
Latest generation remote sensing instruments (called hyperspectral imagers) are now able to generate hundreds of images, corresponding to different wavelength channels, for the same area on the surface of the Earth. In previous work, we have reported that the scalability of parallel processing algorithms dealing with these high-dimensional data volumes is affected by the amount of data to be exchanged through the communication network of the system. However, large messages are common in hyperspectral imaging applications since processing algorithms are pixel-based, and each pixel vector to be exchanged through the communication network is made up of hundreds of spectral values. Thus, decreasing the amount of data to be exchanged could improve the scalability and parallel performance. In this paper, we propose a new framework based on intelligent utilization of wavelet-based data compression techniques for improving the scalability of a standard hyperspectral image processing chain on heterogeneous networks of workstations. This type of parallel platform is quickly becoming a standard in hyperspectral image processing due to the distributed nature of collected hyperspectral data as well as its flexibility and low cost. Our experimental results indicate that adaptive lossy compression can lead to improvements in the scalability of the hyperspectral processing chain without sacrificing analysis accuracy, even at sub-pixel precision levels.
Joint Experimentation on Scalable Parallel Processors (JESPP)
2006-04-01
made use of local embedded relational databases, implemented using sqlite on each node of an SPP to execute queries and return results via an ad hoc ...rl.af.mil 12a. DISTRIBUTION / AVAILABILITY STATEENT APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED. 12b. DISTRIBUTION CODE 13. ABSTRACT...Experimentation Directorate (J9) required expansion of its joint semi-automated forces (JSAF) code capabilities; including number of entities, behavior complexity
A Massively Parallel Code for Polarization Calculations
NASA Astrophysics Data System (ADS)
Akiyama, Shizuka; Höflich, Peter
2001-03-01
We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.
The Building of Multimedia Communications Network based on Session Initiation Protocol
NASA Astrophysics Data System (ADS)
Yuexiao, Han; Yanfu, Zhang
In this paper, we presented a novel design for a distributed multimedia communications network. We introduced the distributed tactic, flow procedure and particular structure. We also analyzed its scalability, stability, robustness, extension, and transmission delay of this architecture. Finally, the result shows our framework is suitable for very large scale communications.
A Scalable Distributed Approach to Mobile Robot Vision
NASA Technical Reports Server (NTRS)
Kuipers, Benjamin; Browning, Robert L.; Gribble, William S.
1997-01-01
This paper documents our progress during the first year of work on our original proposal entitled 'A Scalable Distributed Approach to Mobile Robot Vision'. We are pursuing a strategy for real-time visual identification and tracking of complex objects which does not rely on specialized image-processing hardware. In this system perceptual schemas represent objects as a graph of primitive features. Distributed software agents identify and track these features, using variable-geometry image subwindows of limited size. Active control of imaging parameters and selective processing makes simultaneous real-time tracking of many primitive features tractable. Perceptual schemas operate independently from the tracking of primitive features, so that real-time tracking of a set of image features is not hurt by latency in recognition of the object that those features make up. The architecture allows semantically significant features to be tracked with limited expenditure of computational resources, and allows the visual computation to be distributed across a network of processors. Early experiments are described which demonstrate the usefulness of this formulation, followed by a brief overview of our more recent progress (after the first year).
A Scalable Data Access Layer to Manage Structured Heterogeneous Biomedical Data
Lianas, Luca; Frexia, Francesca; Zanetti, Gianluigi
2016-01-01
This work presents a scalable data access layer, called PyEHR, designed to support the implementation of data management systems for secondary use of structured heterogeneous biomedical and clinical data. PyEHR adopts the openEHR’s formalisms to guarantee the decoupling of data descriptions from implementation details and exploits structure indexing to accelerate searches. Data persistence is guaranteed by a driver layer with a common driver interface. Interfaces for two NoSQL Database Management Systems are already implemented: MongoDB and Elasticsearch. We evaluated the scalability of PyEHR experimentally through two types of tests, called “Constant Load” and “Constant Number of Records”, with queries of increasing complexity on synthetic datasets of ten million records each, containing very complex openEHR archetype structures, distributed on up to ten computing nodes. PMID:27936191
Minimizing communication cost among distributed controllers in software defined networks
NASA Astrophysics Data System (ADS)
Arlimatti, Shivaleela; Elbreiki, Walid; Hassan, Suhaidi; Habbal, Adib; Elshaikh, Mohamed
2016-08-01
Software Defined Networking (SDN) is a new paradigm to increase the flexibility of today's network by promising for a programmable network. The fundamental idea behind this new architecture is to simplify network complexity by decoupling control plane and data plane of the network devices, and by making the control plane centralized. Recently controllers have distributed to solve the problem of single point of failure, and to increase scalability and flexibility during workload distribution. Even though, controllers are flexible and scalable to accommodate more number of network switches, yet the problem of intercommunication cost between distributed controllers is still challenging issue in the Software Defined Network environment. This paper, aims to fill the gap by proposing a new mechanism, which minimizes intercommunication cost with graph partitioning algorithm, an NP hard problem. The methodology proposed in this paper is, swapping of network elements between controller domains to minimize communication cost by calculating communication gain. The swapping of elements minimizes inter and intra communication cost among network domains. We validate our work with the OMNeT++ simulation environment tool. Simulation results show that the proposed mechanism minimizes the inter domain communication cost among controllers compared to traditional distributed controllers.
Distributed controller clustering in software defined networks
Gani, Abdullah; Akhunzada, Adnan; Talebian, Hamid; Choo, Kim-Kwang Raymond
2017-01-01
Software Defined Networking (SDN) is an emerging promising paradigm for network management because of its centralized network intelligence. However, the centralized control architecture of the software-defined networks (SDNs) brings novel challenges of reliability, scalability, fault tolerance and interoperability. In this paper, we proposed a novel clustered distributed controller architecture in the real setting of SDNs. The distributed cluster implementation comprises of multiple popular SDN controllers. The proposed mechanism is evaluated using a real world network topology running on top of an emulated SDN environment. The result shows that the proposed distributed controller clustering mechanism is able to significantly reduce the average latency from 8.1% to 1.6%, the packet loss from 5.22% to 4.15%, compared to distributed controller without clustering running on HP Virtual Application Network (VAN) SDN and Open Network Operating System (ONOS) controllers respectively. Moreover, proposed method also shows reasonable CPU utilization results. Furthermore, the proposed mechanism makes possible to handle unexpected load fluctuations while maintaining a continuous network operation, even when there is a controller failure. The paper is a potential contribution stepping towards addressing the issues of reliability, scalability, fault tolerance, and inter-operability. PMID:28384312
NASA Astrophysics Data System (ADS)
Manfredi, Sabato
2016-06-01
Large-scale dynamic systems are becoming highly pervasive in their occurrence with applications ranging from system biology, environment monitoring, sensor networks, and power systems. They are characterised by high dimensionality, complexity, and uncertainty in the node dynamic/interactions that require more and more computational demanding methods for their analysis and control design, as well as the network size and node system/interaction complexity increase. Therefore, it is a challenging problem to find scalable computational method for distributed control design of large-scale networks. In this paper, we investigate the robust distributed stabilisation problem of large-scale nonlinear multi-agent systems (briefly MASs) composed of non-identical (heterogeneous) linear dynamical systems coupled by uncertain nonlinear time-varying interconnections. By employing Lyapunov stability theory and linear matrix inequality (LMI) technique, new conditions are given for the distributed control design of large-scale MASs that can be easily solved by the toolbox of MATLAB. The stabilisability of each node dynamic is a sufficient assumption to design a global stabilising distributed control. The proposed approach improves some of the existing LMI-based results on MAS by both overcoming their computational limits and extending the applicative scenario to large-scale nonlinear heterogeneous MASs. Additionally, the proposed LMI conditions are further reduced in terms of computational requirement in the case of weakly heterogeneous MASs, which is a common scenario in real application where the network nodes and links are affected by parameter uncertainties. One of the main advantages of the proposed approach is to allow to move from a centralised towards a distributed computing architecture so that the expensive computation workload spent to solve LMIs may be shared among processors located at the networked nodes, thus increasing the scalability of the approach than the network size. Finally, a numerical example shows the applicability of the proposed method and its advantage in terms of computational complexity when compared with the existing approaches.
Drawert, Brian; Trogdon, Michael; Toor, Salman; Petzold, Linda; Hellander, Andreas
2017-01-01
Computational experiments using spatial stochastic simulations have led to important new biological insights, but they require specialized tools and a complex software stack, as well as large and scalable compute and data analysis resources due to the large computational cost associated with Monte Carlo computational workflows. The complexity of setting up and managing a large-scale distributed computation environment to support productive and reproducible modeling can be prohibitive for practitioners in systems biology. This results in a barrier to the adoption of spatial stochastic simulation tools, effectively limiting the type of biological questions addressed by quantitative modeling. In this paper, we present PyURDME, a new, user-friendly spatial modeling and simulation package, and MOLNs, a cloud computing appliance for distributed simulation of stochastic reaction-diffusion models. MOLNs is based on IPython and provides an interactive programming platform for development of sharable and reproducible distributed parallel computational experiments. PMID:28190948
DOE Office of Scientific and Technical Information (OSTI.GOV)
Visweswara Sathanur, Arun; Choudhury, Sutanay; Joslyn, Cliff A.
Property graphs can be used to represent heterogeneous networks with attributed vertices and edges. Given one property graph, simulating another graph with same or greater size with identical statistical properties with respect to the attributes and connectivity is critical for privacy preservation and benchmarking purposes. In this work we tackle the problem of capturing the statistical dependence of the edge connectivity on the vertex labels and using the same distribution to regenerate property graphs of the same or expanded size in a scalable manner. However, accurate simulation becomes a challenge when the attributes do not completely explain the network structure.more » We propose the Property Graph Model (PGM) approach that uses an attribute (or label) augmentation strategy to mitigate the problem and preserve the graph connectivity as measured via degree distribution, vertex label distributions and edge connectivity. Our proposed algorithm is scalable with a linear complexity in the number of edges in the target graph. We illustrate the efficacy of the PGM approach in regenerating and expanding the datasets by leveraging two distinct illustrations.« less
Advanced technologies for scalable ATLAS conditions database access on the grid
NASA Astrophysics Data System (ADS)
Basset, R.; Canali, L.; Dimitrov, G.; Girone, M.; Hawkings, R.; Nevski, P.; Valassi, A.; Vaniachine, A.; Viegas, F.; Walker, R.; Wong, A.
2010-04-01
During massive data reprocessing operations an ATLAS Conditions Database application must support concurrent access from numerous ATLAS data processing jobs running on the Grid. By simulating realistic work-flow, ATLAS database scalability tests provided feedback for Conditions Db software optimization and allowed precise determination of required distributed database resources. In distributed data processing one must take into account the chaotic nature of Grid computing characterized by peak loads, which can be much higher than average access rates. To validate database performance at peak loads, we tested database scalability at very high concurrent jobs rates. This has been achieved through coordinated database stress tests performed in series of ATLAS reprocessing exercises at the Tier-1 sites. The goal of database stress tests is to detect scalability limits of the hardware deployed at the Tier-1 sites, so that the server overload conditions can be safely avoided in a production environment. Our analysis of server performance under stress tests indicates that Conditions Db data access is limited by the disk I/O throughput. An unacceptable side-effect of the disk I/O saturation is a degradation of the WLCG 3D Services that update Conditions Db data at all ten ATLAS Tier-1 sites using the technology of Oracle Streams. To avoid such bottlenecks we prototyped and tested a novel approach for database peak load avoidance in Grid computing. Our approach is based upon the proven idea of pilot job submission on the Grid: instead of the actual query, an ATLAS utility library sends to the database server a pilot query first.
NASA Astrophysics Data System (ADS)
Sindrilaru, Elvin A.; Peters, Andreas J.; Adde, Geoffray M.; Duellmann, Dirk
2017-10-01
CERN has been developing and operating EOS as a disk storage solution successfully for over 6 years. The CERN deployment provides 135 PB and stores 1.2 billion replicas distributed over two computer centres. Deployment includes four LHC instances, a shared instance for smaller experiments and since last year an instance for individual user data as well. The user instance represents the backbone of the CERNBOX service for file sharing. New use cases like synchronisation and sharing, the planned migration to reduce AFS usage at CERN and the continuous growth has brought EOS to new challenges. Recent developments include the integration and evaluation of various technologies to do the transition from a single active in-memory namespace to a scale-out implementation distributed over many meta-data servers. The new architecture aims to separate the data from the application logic and user interface code, thus providing flexibility and scalability to the namespace component. Another important goal is to provide EOS as a CERN-wide mounted filesystem with strong authentication making it a single storage repository accessible via various services and front- ends (/eos initiative). This required new developments in the security infrastructure of the EOS FUSE implementation. Furthermore, there were a series of improvements targeting the end-user experience like tighter consistency and latency optimisations. In collaboration with Seagate as Openlab partner, EOS has a complete integration of OpenKinetic object drive cluster as a high-throughput, high-availability, low-cost storage solution. This contribution will discuss these three main development projects and present new performance metrics.
NASA Astrophysics Data System (ADS)
Capone, V.; Esposito, R.; Pardi, S.; Taurino, F.; Tortone, G.
2012-12-01
Over the last few years we have seen an increasing number of services and applications needed to manage and maintain cloud computing facilities. This is particularly true for computing in high energy physics, which often requires complex configurations and distributed infrastructures. In this scenario a cost effective rationalization and consolidation strategy is the key to success in terms of scalability and reliability. In this work we describe an IaaS (Infrastructure as a Service) cloud computing system, with high availability and redundancy features, which is currently in production at INFN-Naples and ATLAS Tier-2 data centre. The main goal we intended to achieve was a simplified method to manage our computing resources and deliver reliable user services, reusing existing hardware without incurring heavy costs. A combined usage of virtualization and clustering technologies allowed us to consolidate our services on a small number of physical machines, reducing electric power costs. As a result of our efforts we developed a complete solution for data and computing centres that can be easily replicated using commodity hardware. Our architecture consists of 2 main subsystems: a clustered storage solution, built on top of disk servers running GlusterFS file system, and a virtual machines execution environment. GlusterFS is a network file system able to perform parallel writes on multiple disk servers, providing this way live replication of data. High availability is also achieved via a network configuration using redundant switches and multiple paths between hypervisor hosts and disk servers. We also developed a set of management scripts to easily perform basic system administration tasks such as automatic deployment of new virtual machines, adaptive scheduling of virtual machines on hypervisor hosts, live migration and automated restart in case of hypervisor failures.
Implementation of a Big Data Accessing and Processing Platform for Medical Records in Cloud.
Yang, Chao-Tung; Liu, Jung-Chun; Chen, Shuo-Tsung; Lu, Hsin-Wen
2017-08-18
Big Data analysis has become a key factor of being innovative and competitive. Along with population growth worldwide and the trend aging of population in developed countries, the rate of the national medical care usage has been increasing. Due to the fact that individual medical data are usually scattered in different institutions and their data formats are varied, to integrate those data that continue increasing is challenging. In order to have scalable load capacity for these data platforms, we must build them in good platform architecture. Some issues must be considered in order to use the cloud computing to quickly integrate big medical data into database for easy analyzing, searching, and filtering big data to obtain valuable information.This work builds a cloud storage system with HBase of Hadoop for storing and analyzing big data of medical records and improves the performance of importing data into database. The data of medical records are stored in HBase database platform for big data analysis. This system performs distributed computing on medical records data processing through Hadoop MapReduce programming, and to provide functions, including keyword search, data filtering, and basic statistics for HBase database. This system uses the Put with the single-threaded method and the CompleteBulkload mechanism to import medical data. From the experimental results, we find that when the file size is less than 300MB, the Put with single-threaded method is used and when the file size is larger than 300MB, the CompleteBulkload mechanism is used to improve the performance of data import into database. This system provides a web interface that allows users to search data, filter out meaningful information through the web, and analyze and convert data in suitable forms that will be helpful for medical staff and institutions.
Jia, Jia; Chen, Jhensi; Yao, Jun; Chu, Daping
2017-03-17
A high quality 3D display requires a high amount of optical information throughput, which needs an appropriate mechanism to distribute information in space uniformly and efficiently. This study proposes a front-viewing system which is capable of managing the required amount of information efficiently from a high bandwidth source and projecting 3D images with a decent size and a large viewing angle at video rate in full colour. It employs variable gratings to support a high bandwidth distribution. This concept is scalable and the system can be made compact in size. A horizontal parallax only (HPO) proof-of-concept system is demonstrated by projecting holographic images from a digital micro mirror device (DMD) through rotational tiled gratings before they are realised on a vertical diffuser for front-viewing.
AFRL Research in Plasma-Assisted Combustion
2013-10-23
Scramjet propulsion Non-equilibrium flows Diagnostics for scramjet controls Boundary-layer transition Structural sciences for...hypersonic vehicles Computational sciences for hypersonic flight 3 DISTRIBUTION STATEMENT A – Unclassified, Unlimited Distribution Overview Research...within My Division HIFiRE-5 Vehicle Launched 23 April 2012 can payload transition section Orion S-30 Focus on hypersonic flight: scalability
Teng, Rui; Leibnitz, Kenji; Miura, Ryu
2013-01-01
An essential application of wireless sensor networks is to successfully respond to user queries. Query packet losses occur in the query dissemination due to wireless communication problems such as interference, multipath fading, packet collisions, etc. The losses of query messages at sensor nodes result in the failure of sensor nodes reporting the requested data. Hence, the reliable and successful dissemination of query messages to sensor nodes is a non-trivial problem. The target of this paper is to enable highly successful query delivery to sensor nodes by localized and energy-efficient discovery, and recovery of query losses. We adopt local and collective cooperation among sensor nodes to increase the success rate of distributed discoveries and recoveries. To enable the scalability in the operations of discoveries and recoveries, we employ a distributed name resolution mechanism at each sensor node to allow sensor nodes to self-detect the correlated queries and query losses, and then efficiently locally respond to the query losses. We prove that the collective discovery of query losses has a high impact on the success of query dissemination and reveal that scalability can be achieved by using the proposed approach. We further study the novel features of the cooperation and competition in the collective recovery at PHY and MAC layers, and show that the appropriate number of detectors can achieve optimal successful recovery rate. We evaluate the proposed approach with both mathematical analyses and computer simulations. The proposed approach enables a high rate of successful delivery of query messages and it results in short route lengths to recover from query losses. The proposed approach is scalable and operates in a fully distributed manner. PMID:23748172
A data distribution strategy for the 1990s (files are not enough)
NASA Technical Reports Server (NTRS)
Tankenson, Mike; Wright, Steven
1993-01-01
Virtually all of the data distribution strategies being contemplated for the EOSDIS era revolve around the use of files. Most, if not all, mass storage technologies are based around the file model. However, files may be the wrong primary abstraction for supporting scientific users in the 1990s and beyond. Other abstractions more closely matching the respective scientific discipline of the end user may be more appropriate. JPL has built a unique multimission data distribution system based on a strategy of telemetry stream emulation to match the responsibilities of spacecraft team and ground data system operators supporting our nations suite of planetary probes. The current system, operational since 1989 and the launch of the Magellan spacecraft, is supporting over 200 users at 15 remote sites. This stream-oriented data distribution model can provide important lessons learned to builders of future data systems.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Loikith, P.; Lee, H.; McGibbney, L. J.; Whitehall, K. D.
2014-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark. Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk, and makes iterative algorithms feasible. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 100 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning (ML) based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. The goals of SciSpark are to: (1) Decrease the time to compute comparison statistics and plots from minutes to seconds; (2) Allow for interactive exploration of time-series properties over seasons and years; (3) Decrease the time for satellite data ingestion into RCMES to hours; (4) Allow for Level-2 comparisons with higher-order statistics or PDF's in minutes to hours; and (5) Move RCMES into a near real time decision-making platform. We will report on: the architecture and design of SciSpark, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning (sharding) of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
Scalable and portable visualization of large atomistic datasets
NASA Astrophysics Data System (ADS)
Sharma, Ashish; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya
2004-10-01
A scalable and portable code named Atomsviewer has been developed to interactively visualize a large atomistic dataset consisting of up to a billion atoms. The code uses a hierarchical view frustum-culling algorithm based on the octree data structure to efficiently remove atoms outside of the user's field-of-view. Probabilistic and depth-based occlusion-culling algorithms then select atoms, which have a high probability of being visible. Finally a multiresolution algorithm is used to render the selected subset of visible atoms at varying levels of detail. Atomsviewer is written in C++ and OpenGL, and it has been tested on a number of architectures including Windows, Macintosh, and SGI. Atomsviewer has been used to visualize tens of millions of atoms on a standard desktop computer and, in its parallel version, up to a billion atoms. Program summaryTitle of program: Atomsviewer Catalogue identifier: ADUM Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADUM Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Computer for which the program is designed and others on which it has been tested: 2.4 GHz Pentium 4/Xeon processor, professional graphics card; Apple G4 (867 MHz)/G5, professional graphics card Operating systems under which the program has been tested: Windows 2000/XP, Mac OS 10.2/10.3, SGI IRIX 6.5 Programming languages used: C++, C and OpenGL Memory required to execute with typical data: 1 gigabyte of RAM High speed storage required: 60 gigabytes No. of lines in the distributed program including test data, etc.: 550 241 No. of bytes in the distributed program including test data, etc.: 6 258 245 Number of bits in a word: Arbitrary Number of processors used: 1 Has the code been vectorized or parallelized: No Distribution format: tar gzip file Nature of physical problem: Scientific visualization of atomic systems Method of solution: Rendering of atoms using computer graphic techniques, culling algorithms for data minimization, and levels-of-detail for minimal rendering Restrictions on the complexity of the problem: None Typical running time: The program is interactive in its execution Unusual features of the program: None References: The conceptual foundation and subsequent implementation of the algorithms are found in [A. Sharma, A. Nakano, R.K. Kalia, P. Vashishta, S. Kodiyalam, P. Miller, W. Zhao, X.L. Liu, T.J. Campbell, A. Haas, Presence—Teleoperators and Virtual Environments 12 (1) (2003)].
Intrex Subject/Title Inverted-File Characteristics.
ERIC Educational Resources Information Center
Uemura, Syunsuke
The characteristics of the Intrex subject/title inverted file are analyzed. Basic statistics of the inverted file are presented including various distributions of the index words and terms from which the file was derived, and statistics on stems, the file growth process, and redundancy measurements. A study of stems both with extremely high and…
Urrios, Arturo; de Nadal, Eulàlia; Solé, Ricard; Posas, Francesc
2016-01-01
Engineered synthetic biological devices have been designed to perform a variety of functions from sensing molecules and bioremediation to energy production and biomedicine. Notwithstanding, a major limitation of in vivo circuit implementation is the constraint associated to the use of standard methodologies for circuit design. Thus, future success of these devices depends on obtaining circuits with scalable complexity and reusable parts. Here we show how to build complex computational devices using multicellular consortia and space as key computational elements. This spatial modular design grants scalability since its general architecture is independent of the circuit’s complexity, minimizes wiring requirements and allows component reusability with minimal genetic engineering. The potential use of this approach is demonstrated by implementation of complex logical functions with up to six inputs, thus demonstrating the scalability and flexibility of this method. The potential implications of our results are outlined. PMID:26829588
Scalability and Validation of Big Data Bioinformatics Software.
Yang, Andrian; Troup, Michael; Ho, Joshua W K
2017-01-01
This review examines two important aspects that are central to modern big data bioinformatics analysis - software scalability and validity. We argue that not only are the issues of scalability and validation common to all big data bioinformatics analyses, they can be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is defined as the ability for a program to scale based on workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless the surge of volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to effectively implement divide-and-conquer in a distributed computing environment. Validation of software is another important issue in big data bioinformatics that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult due to the large input space and complex algorithms involved. We discuss how state-of-the-art software testing techniques that are based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
Visual Analytics for Power Grid Contingency Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wong, Pak C.; Huang, Zhenyu; Chen, Yousu
2014-01-20
Contingency analysis is the process of employing different measures to model scenarios, analyze them, and then derive the best response to remove the threats. This application paper focuses on a class of contingency analysis problems found in the power grid management system. A power grid is a geographically distributed interconnected transmission network that transmits and delivers electricity from generators to end users. The power grid contingency analysis problem is increasingly important because of both the growing size of the underlying raw data that need to be analyzed and the urgency to deliver working solutions in an aggressive timeframe. Failure tomore » do so may bring significant financial, economic, and security impacts to all parties involved and the society at large. The paper presents a scalable visual analytics pipeline that transforms about 100 million contingency scenarios to a manageable size and form for grid operators to examine different scenarios and come up with preventive or mitigation strategies to address the problems in a predictive and timely manner. Great attention is given to the computational scalability, information scalability, visual scalability, and display scalability issues surrounding the data analytics pipeline. Most of the large-scale computation requirements of our work are conducted on a Cray XMT multi-threaded parallel computer. The paper demonstrates a number of examples using western North American power grid models and data.« less
Scalable implementation of boson sampling with trapped ions.
Shen, C; Zhang, Z; Duan, L-M
2014-02-07
Boson sampling solves a classically intractable problem by sampling from a probability distribution given by matrix permanents. We propose a scalable implementation of boson sampling using local transverse phonon modes of trapped ions to encode the bosons. The proposed scheme allows deterministic preparation and high-efficiency readout of the bosons in the Fock states and universal mode mixing. With the state-of-the-art trapped ion technology, it is feasible to realize boson sampling with tens of bosons by this scheme, which would outperform the most powerful classical computers and constitute an effective disproof of the famous extended Church-Turing thesis.
An Element-Based Concurrent Partitioner for Unstructured Finite Element Meshes
NASA Technical Reports Server (NTRS)
Ding, Hong Q.; Ferraro, Robert D.
1996-01-01
A concurrent partitioner for partitioning unstructured finite element meshes on distributed memory architectures is developed. The partitioner uses an element-based partitioning strategy. Its main advantage over the more conventional node-based partitioning strategy is its modular programming approach to the development of parallel applications. The partitioner first partitions element centroids using a recursive inertial bisection algorithm. Elements and nodes then migrate according to the partitioned centroids, using a data request communication template for unpredictable incoming messages. Our scalable implementation is contrasted to a non-scalable implementation which is a straightforward parallelization of a sequential partitioner.
Two-dimensional photonic crystal slab nanocavities on bulk single-crystal diamond
NASA Astrophysics Data System (ADS)
Wan, Noel H.; Mouradian, Sara; Englund, Dirk
2018-04-01
Color centers in diamond are promising spin qubits for quantum computing and quantum networking. In photon-mediated entanglement distribution schemes, the efficiency of the optical interface ultimately determines the scalability of such systems. Nano-scale optical cavities coupled to emitters constitute a robust spin-photon interface that can increase spontaneous emission rates and photon extraction efficiencies. In this work, we introduce the fabrication of 2D photonic crystal slab nanocavities with high quality factors and cubic wavelength mode volumes—directly in bulk diamond. This planar platform offers scalability and considerably expands the toolkit for classical and quantum nanophotonics in diamond.
Astro-WISE: Chaining to the Universe
NASA Astrophysics Data System (ADS)
Valentijn, E. A.; McFarland, J. P.; Snigula, J.; Begeman, K. G.; Boxhoorn, D. R.; Rengelink, R.; Helmich, E.; Heraudeau, P.; Verdoes Kleijn, G.; Vermeij, R.; Vriend, W.-J.; Tempelaar, M. J.; Deul, E.; Kuijken, K.; Capaccioli, M.; Silvotti, R.; Bender, R.; Neeser, M.; Saglia, R.; Bertin, E.; Mellier, Y.
2007-10-01
The recent explosion of recorded digital data and its processed derivatives threatens to overwhelm researchers when analysing their experimental data or looking up data items in archives and file systems. While current hardware developments allow the acquisition, processing and storage of hundreds of terabytes of data at the cost of a modern sports car, the software systems to handle these data are lagging behind. This problem is very general and is well recognized by various scientific communities; several large projects have been initiated, e.g., DATAGRID/EGEE {http://www.eu-egee.org/} federates compute and storage power over the high-energy physical community, while the international astronomical community is building an Internet geared Virtual Observatory {http://www.euro-vo.org/pub/} (Padovani 2006) connecting archival data. These large projects either focus on a specific distribution aspect or aim to connect many sub-communities and have a relatively long trajectory for setting standards and a common layer. Here, we report first light of a very different solution (Valentijn & Kuijken 2004) to the problem initiated by a smaller astronomical IT community. It provides an abstract scientific information layer which integrates distributed scientific analysis with distributed processing and federated archiving and publishing. By designing new abstractions and mixing in old ones, a Science Information System with fully scalable cornerstones has been achieved, transforming data systems into knowledge systems. This break-through is facilitated by the full end-to-end linking of all dependent data items, which allows full backward chaining from the observer/researcher to the experiment. Key is the notion that information is intrinsic in nature and thus is the data acquired by a scientific experiment. The new abstraction is that software systems guide the user to that intrinsic information by forcing full backward and forward chaining in the data modelling.
An Interactive Web-Based Analysis Framework for Remote Sensing Cloud Computing
NASA Astrophysics Data System (ADS)
Wang, X. Z.; Zhang, H. M.; Zhao, J. H.; Lin, Q. H.; Zhou, Y. C.; Li, J. H.
2015-07-01
Spatiotemporal data, especially remote sensing data, are widely used in ecological, geographical, agriculture, and military research and applications. With the development of remote sensing technology, more and more remote sensing data are accumulated and stored in the cloud. An effective way for cloud users to access and analyse these massive spatiotemporal data in the web clients becomes an urgent issue. In this paper, we proposed a new scalable, interactive and web-based cloud computing solution for massive remote sensing data analysis. We build a spatiotemporal analysis platform to provide the end-user with a safe and convenient way to access massive remote sensing data stored in the cloud. The lightweight cloud storage system used to store public data and users' private data is constructed based on open source distributed file system. In it, massive remote sensing data are stored as public data, while the intermediate and input data are stored as private data. The elastic, scalable, and flexible cloud computing environment is built using Docker, which is a technology of open-source lightweight cloud computing container in the Linux operating system. In the Docker container, open-source software such as IPython, NumPy, GDAL, and Grass GIS etc., are deployed. Users can write scripts in the IPython Notebook web page through the web browser to process data, and the scripts will be submitted to IPython kernel to be executed. By comparing the performance of remote sensing data analysis tasks executed in Docker container, KVM virtual machines and physical machines respectively, we can conclude that the cloud computing environment built by Docker makes the greatest use of the host system resources, and can handle more concurrent spatial-temporal computing tasks. Docker technology provides resource isolation mechanism in aspects of IO, CPU, and memory etc., which offers security guarantee when processing remote sensing data in the IPython Notebook. Users can write complex data processing code on the web directly, so they can design their own data processing algorithm.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Sripathi, Vamsi; Mills, Richard T
2013-01-01
Inefficient parallel I/O is known to be a major bottleneck among scientific applications employed on supercomputers as the number of processor cores grows into the thousands. Our prior experience indicated that parallel I/O libraries such as HDF5 that rely on MPI-IO do not scale well beyond 10K processor cores, especially on parallel file systems (like Lustre) with single point of resource contention. Our previous optimization efforts for a massively parallel multi-phase and multi-component subsurface simulator (PFLOTRAN) led to a two-phase I/O approach at the application level where a set of designated processes participate in the I/O process by splitting themore » I/O operation into a communication phase and a disk I/O phase. The designated I/O processes are created by splitting the MPI global communicator into multiple sub-communicators. The root process in each sub-communicator is responsible for performing the I/O operations for the entire group and then distributing the data to rest of the group. This approach resulted in over 25X speedup in HDF I/O read performance and 3X speedup in write performance for PFLOTRAN at over 100K processor cores on the ORNL Jaguar supercomputer. This research describes the design and development of a general purpose parallel I/O library, SCORPIO (SCalable block-ORiented Parallel I/O) that incorporates our optimized two-phase I/O approach. The library provides a simplified higher level abstraction to the user, sitting atop existing parallel I/O libraries (such as HDF5) and implements optimized I/O access patterns that can scale on larger number of processors. Performance results with standard benchmark problems and PFLOTRAN indicate that our library is able to maintain the same speedups as before with the added flexibility of being applicable to a wider range of I/O intensive applications.« less
NASA Astrophysics Data System (ADS)
Zhou, S.; Tao, W. K.; Li, X.; Matsui, T.; Sun, X. H.; Yang, X.
2015-12-01
A cloud-resolving model (CRM) is an atmospheric numerical model that can numerically resolve clouds and cloud systems at 0.25~5km horizontal grid spacings. The main advantage of the CRM is that it can allow explicit interactive processes between microphysics, radiation, turbulence, surface, and aerosols without subgrid cloud fraction, overlapping and convective parameterization. Because of their fine resolution and complex physical processes, it is challenging for the CRM community to i) visualize/inter-compare CRM simulations, ii) diagnose key processes for cloud-precipitation formation and intensity, and iii) evaluate against NASA's field campaign data and L1/L2 satellite data products due to large data volume (~10TB) and complexity of CRM's physical processes. We have been building the Super Cloud Library (SCL) upon a Hadoop framework, capable of CRM database management, distribution, visualization, subsetting, and evaluation in a scalable way. The current SCL capability includes (1) A SCL data model enables various CRM simulation outputs in NetCDF, including the NASA-Unified Weather Research and Forecasting (NU-WRF) and Goddard Cumulus Ensemble (GCE) model, to be accessed and processed by Hadoop, (2) A parallel NetCDF-to-CSV converter supports NU-WRF and GCE model outputs, (3) A technique visualizes Hadoop-resident data with IDL, (4) A technique subsets Hadoop-resident data, compliant to the SCL data model, with HIVE or Impala via HUE's Web interface, (5) A prototype enables a Hadoop MapReduce application to dynamically access and process data residing in a parallel file system, PVFS2 or CephFS, where high performance computing (HPC) simulation outputs such as NU-WRF's and GCE's are located. We are testing Apache Spark to speed up SCL data processing and analysis.With the SCL capabilities, SCL users can conduct large-domain on-demand tasks without downloading voluminous CRM datasets and various observations from NASA Field Campaigns and Satellite data to a local computer, and inter-compare CRM output and data with GCE and NU-WRF.
ACS from development to operations
NASA Astrophysics Data System (ADS)
Caproni, Alessandro; Colomer, Pau; Jeram, Bogdan; Sommer, Heiko; Chiozzi, Gianluca; Mañas, Miguel M.
2016-08-01
The ALMA Common Software (ACS), provides the infrastructure of the distributed software system of ALMA and other projects. ACS, built on top of CORBA and Data Distribution Service (DDS) middleware, is based on a Component- Container paradigm and hides the complexity of the middleware allowing the developer to focus on domain specific issues. The transition of the ALMA observatory from construction to operations brings with it that ACS effort focuses primarily on scalability, stability and robustness rather than on new features. The transition came together with a shorter release cycle and a more extensive testing. For scalability, the most problematic area has been the CORBA notification service, used to implement the publisher subscriber pattern because of the asynchronous nature of the paradigm: a lot of effort has been spent to improve its stability and recovery from run time errors. The original bulk data mechanism, implemented using the CORBA Audio/Video Streaming Service, showed its limitations and has been replaced with a more performant and scalable DDS implementation. Operational needs showed soon the difference between releases cycles for Online software (i.e. used during observations) and Offline software, which requires much more frequent releases. This paper attempts to describe the impact the transition from construction to operations had on ACS, the solution adopted so far and a look into future evolution.
Digital Libraries: The Next Generation in File System Technology.
ERIC Educational Resources Information Center
Bowman, Mic; Camargo, Bill
1998-01-01
Examines file sharing within corporations that use wide-area, distributed file systems. Applications and user interactions strongly suggest that the addition of services typically associated with digital libraries (content-based file location, strongly typed objects, representation of complex relationships between documents, and extrinsic…
Katzman, G L
2001-03-01
The goal of the project was to create a method by which an in-house digital teaching file could be constructed that was simple, inexpensive, independent of hypertext markup language (HTML) restrictions, and appears identical on multiple platforms. To accomplish this, Microsoft PowerPoint and Adobe Acrobat were used in succession to assemble digital teaching files in the Acrobat portable document file format. They were then verified to appear identically on computers running Windows, Macintosh Operating Systems (OS), and the Silicon Graphics Unix-based OS as either a free-standing file using Acrobat Reader software or from within a browser window using the Acrobat browser plug-in. This latter display method yields a file viewed through a browser window, yet remains independent of underlying HTML restrictions, which may confer an advantage over simple HTML teaching file construction. Thus, a hybrid of HTML-distributed Adobe Acrobat generated WWW documents may be a viable alternative for digital teaching file construction and distribution.
Arkansas and Louisiana Aeromagnetic and Gravity Maps and Data - A Website for Distribution of Data
Bankey, Viki; Daniels, David L.
2008-01-01
This report contains digital data, image files, and text files describing data formats for aeromagnetic and gravity data used to compile the State aeromagnetic and gravity maps of Arkansas and Louisiana. The digital files include grids, images, ArcInfo, and Geosoft compatible files. In some of the data folders, ASCII files with the extension 'txt' describe the format and contents of the data files. Read the 'txt' files before using the data files.
ms2: A molecular simulation tool for thermodynamic properties
NASA Astrophysics Data System (ADS)
Deublein, Stephan; Eckl, Bernhard; Stoll, Jürgen; Lishchuk, Sergey V.; Guevara-Carrion, Gabriela; Glass, Colin W.; Merker, Thorsten; Bernreuther, Martin; Hasse, Hans; Vrabec, Jadran
2011-11-01
This work presents the molecular simulation program ms2 that is designed for the calculation of thermodynamic properties of bulk fluids in equilibrium consisting of small electro-neutral molecules. ms2 features the two main molecular simulation techniques, molecular dynamics (MD) and Monte-Carlo. It supports the calculation of vapor-liquid equilibria of pure fluids and multi-component mixtures described by rigid molecular models on the basis of the grand equilibrium method. Furthermore, it is capable of sampling various classical ensembles and yields numerous thermodynamic properties. To evaluate the chemical potential, Widom's test molecule method and gradual insertion are implemented. Transport properties are determined by equilibrium MD simulations following the Green-Kubo formalism. ms2 is designed to meet the requirements of academia and industry, particularly achieving short response times and straightforward handling. It is written in Fortran90 and optimized for a fast execution on a broad range of computer architectures, spanning from single processor PCs over PC-clusters and vector computers to high-end parallel machines. The standard Message Passing Interface (MPI) is used for parallelization and ms2 is therefore easily portable to different computing platforms. Feature tools facilitate the interaction with the code and the interpretation of input and output files. The accuracy and reliability of ms2 has been shown for a large variety of fluids in preceding work. Program summaryProgram title:ms2 Catalogue identifier: AEJF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEJF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Special Licence supplied by the authors No. of lines in distributed program, including test data, etc.: 82 794 No. of bytes in distributed program, including test data, etc.: 793 705 Distribution format: tar.gz Programming language: Fortran90 Computer: The simulation tool ms2 is usable on a wide variety of platforms, from single processor machines over PC-clusters and vector computers to vector-parallel architectures. (Tested with Fortran compilers: gfortran, Intel, PathScale, Portland Group and Sun Studio.) Operating system: Unix/Linux, Windows Has the code been vectorized or parallelized?: Yes. Message Passing Interface (MPI) protocol Scalability. Excellent scalability up to 16 processors for molecular dynamics and >512 processors for Monte-Carlo simulations. RAM:ms2 runs on single processors with 512 MB RAM. The memory demand rises with increasing number of processors used per node and increasing number of molecules. Classification: 7.7, 7.9, 12 External routines: Message Passing Interface (MPI) Nature of problem: Calculation of application oriented thermodynamic properties for rigid electro-neutral molecules: vapor-liquid equilibria, thermal and caloric data as well as transport properties of pure fluids and multi-component mixtures. Solution method: Molecular dynamics, Monte-Carlo, various classical ensembles, grand equilibrium method, Green-Kubo formalism. Restrictions: No. The system size is user-defined. Typical problems addressed by ms2 can be solved by simulating systems containing typically 2000 molecules or less. Unusual features: Feature tools are available for creating input files, analyzing simulation results and visualizing molecular trajectories. Additional comments: Sample makefiles for multiple operation platforms are provided. Documentation is provided with the installation package and is available at http://www.ms-2.de. Running time: The running time of ms2 depends on the problem set, the system size and the number of processes used in the simulation. Running four processes on a "Nehalem" processor, simulations calculating VLE data take between two and twelve hours, calculating transport properties between six and 24 hours.
SU-E-T-142: Automatic Linac Log File: Analysis and Reporting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gainey, M; Rothe, T
Purpose: End to end QA for IMRT/VMAT is time consuming. Automated linac log file analysis and recalculation of daily recorded fluence, and hence dose, distribution bring this closer. Methods: Matlab (R2014b, Mathworks) software was written to read in and analyse IMRT/VMAT trajectory log files (TrueBeam 1.5, Varian Medical Systems) overnight, and are archived on a backed-up network drive (figure). A summary report (PDF) is sent by email to the duty linac physicist. A structured summary report (PDF) for each patient is automatically updated for embedding into the R&V system (Mosaiq 2.5, Elekta AG). The report contains cross-referenced hyperlinks to easemore » navigation between treatment fractions. Gamma analysis can be performed on planned (DICOM RTPlan) and treated (trajectory log) fluence distributions. Trajectory log files can be converted into RTPlan files for dose distribution calculation (Eclipse, AAA10.0.28, VMS). Results: All leaf positions are within +/−0.10mm: 57% within +/−0.01mm; 89% within 0.05mm. Mean leaf position deviation is 0.02mm. Gantry angle variations lie in the range −0.1 to 0.3 degrees, mean 0.04 degrees. Fluence verification shows excellent agreement between planned and treated fluence. Agreement between planned and treated dose distribution, the derived from log files, is very good. Conclusion: Automated log file analysis is a valuable tool for the busy physicist, enabling potential treated fluence distribution errors to be quickly identified. In the near future we will correlate trajectory log analysis with routine IMRT/VMAT QA analysis. This has the potential to reduce, but not eliminate, the QA workload.« less
A Scalable, Collaborative, Interactive Light-field Display System
2014-06-01
displays, 3D display, holographic video, integral photography, plenoptic , computed photography 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF...light-field, holographic displays, 3D display, holographic video, integral photography, plenoptic , computed photography 1 Distribution A: Approved
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ahn, Gail-Joon
The project seeks an innovative framework to enable users to access and selectively share resources in distributed environments, enhancing the scalability of information sharing. We have investigated secure sharing & assurance approaches for ad-hoc collaboration, focused on Grids, Clouds, and ad-hoc network environments.
Computational Cognition and Robust Decision Making
2013-03-06
much more powerful neuromorphic chips than current state of the art. L. Chua 10 DISTRIBUTION STATEMENT A – Unclassified, Unlimited Distribution 2...Cognition Program DARPA (Gill Pratt) • Systems of Neuromorphic Adaptive Plastic Scalable Electronics (SyNAPSE) Program IARPA (Brad Minnery...2012 - Four projects at SNU and KAIST co-funded with AOARD DARPA SyNAPSE Program: - Design, fabrication, and demonstration of neuromorphic
Scalable Energy Networks to Promote Energy Security
2011-07-01
commodity. Consider current challenges of converting energy and synchronizing sources with loads—for example, capturing solar energy to provide hot water...distributed micro-generation1 (for example, roof-mounted solar panels) and plug-in elec- tric/hybrid vehicles. The imperative extends to our national...transformers, battery chargers ■■ distribution: pumps, pipes, switches, cables ■■ applications: lighting, automobiles, personal electronic devices
Job Scheduling in a Heterogeneous Grid Environment
NASA Technical Reports Server (NTRS)
Shan, Hong-Zhang; Smith, Warren; Oliker, Leonid; Biswas, Rupak
2004-01-01
Computational grids have the potential for solving large-scale scientific problems using heterogeneous and geographically distributed resources. However, a number of major technical hurdles must be overcome before this potential can be realized. One problem that is critical to effective utilization of computational grids is the efficient scheduling of jobs. This work addresses this problem by describing and evaluating a grid scheduling architecture and three job migration algorithms. The architecture is scalable and does not assume control of local site resources. The job migration policies use the availability and performance of computer systems, the network bandwidth available between systems, and the volume of input and output data associated with each job. An extensive performance comparison is presented using real workloads from leading computational centers. The results, based on several key metrics, demonstrate that the performance of our distributed migration algorithms is significantly greater than that of a local scheduling framework and comparable to a non-scalable global scheduling approach.
Hadoop distributed batch processing for Gaia: a success story
NASA Astrophysics Data System (ADS)
Riello, Marco
2015-12-01
The DPAC Cambridge Data Processing Centre (DPCI) is responsible for the photometric calibration of the Gaia data including the low resolution spectra. The large data volume produced by Gaia (~26 billion transits/year), the complexity of its data stream and the self-calibrating approach pose unique challenges for scalability, reliability and robustness of both the software pipelines and the operations infrastructure. DPCI has been the first in DPAC to realise the potential of Hadoop and Map/Reduce and to adopt them as the core technologies for its infrastructure. This has proven a winning choice allowing DPCI unmatched processing throughput and reliability within DPAC to the point that other DPCs have started following our footsteps. In this talk we will present the software infrastructure developed to build the distributed and scalable batch data processing system that is currently used in production at DPCI and the excellent results in terms of performance of the system.
Links, Amanda E.; Draper, David; Lee, Elizabeth; Guzman, Jessica; Valivullah, Zaheer; Maduro, Valerie; Lebedev, Vlad; Didenko, Maxim; Tomlin, Garrick; Brudno, Michael; Girdea, Marta; Dumitriu, Sergiu; Haendel, Melissa A.; Mungall, Christopher J.; Smedley, Damian; Hochheiser, Harry; Arnold, Andrew M.; Coessens, Bert; Verhoeven, Steven; Bone, William; Adams, David; Boerkoel, Cornelius F.; Gahl, William A.; Sincan, Murat
2016-01-01
The National Institutes of Health Undiagnosed Diseases Program (NIH UDP) applies translational research systematically to diagnose patients with undiagnosed diseases. The challenge is to implement an information system enabling scalable translational research. The authors hypothesized that similar complex problems are resolvable through process management and the distributed cognition of communities. The team, therefore, built the NIH UDP integrated collaboration system (UDPICS) to form virtual collaborative multidisciplinary research networks or communities. UDPICS supports these communities through integrated process management, ontology-based phenotyping, biospecimen management, cloud-based genomic analysis, and an electronic laboratory notebook. UDPICS provided a mechanism for efficient, transparent, and scalable translational research and thereby addressed many of the complex and diverse research and logistical problems of the NIH UDP. Full definition of the strengths and deficiencies of UDPICS will require formal qualitative and quantitative usability and process improvement measurement. PMID:27785453
Jia, Jia; Chen, Jhensi; Yao, Jun; Chu, Daping
2017-01-01
A high quality 3D display requires a high amount of optical information throughput, which needs an appropriate mechanism to distribute information in space uniformly and efficiently. This study proposes a front-viewing system which is capable of managing the required amount of information efficiently from a high bandwidth source and projecting 3D images with a decent size and a large viewing angle at video rate in full colour. It employs variable gratings to support a high bandwidth distribution. This concept is scalable and the system can be made compact in size. A horizontal parallax only (HPO) proof-of-concept system is demonstrated by projecting holographic images from a digital micro mirror device (DMD) through rotational tiled gratings before they are realised on a vertical diffuser for front-viewing. PMID:28304371
Dynamic Load Balancing for Adaptive Computations on Distributed-Memory Machines
NASA Technical Reports Server (NTRS)
1999-01-01
Dynamic load balancing is central to adaptive mesh-based computations on large-scale parallel computers. The principal investigator has investigated various issues on the dynamic load balancing problem under NASA JOVE and JAG rants. The major accomplishments of the project are two graph partitioning algorithms and a load balancing framework. The S-HARP dynamic graph partitioner is known to be the fastest among the known dynamic graph partitioners to date. It can partition a graph of over 100,000 vertices in 0.25 seconds on a 64- processor Cray T3E distributed-memory multiprocessor while maintaining the scalability of over 16-fold speedup. Other known and widely used dynamic graph partitioners take over a second or two while giving low scalability of a few fold speedup on 64 processors. These results have been published in journals and peer-reviewed flagship conferences.
Networking and AI systems: Requirements and benefits
NASA Technical Reports Server (NTRS)
1988-01-01
The price performance benefits of network systems is well documented. The ability to share expensive resources sold timesharing for mainframes, department clusters of minicomputers, and now local area networks of workstations and servers. In the process, other fundamental system requirements emerged. These have now been generalized with open system requirements for hardware, software, applications and tools. The ability to interconnect a variety of vendor products has led to a specification of interfaces that allow new techniques to extend existing systems for new and exciting applications. As an example of the message passing system, local area networks provide a testbed for many of the issues addressed by future concurrent architectures: synchronization, load balancing, fault tolerance and scalability. Gold Hill has been working with a number of vendors on distributed architectures that range from a network of workstations to a hypercube of microprocessors with distributed memory. Results from early applications are promising both for performance and scalability.
DPM — efficient storage in diverse environments
NASA Astrophysics Data System (ADS)
Hellmich, Martin; Furano, Fabrizio; Smith, David; Brito da Rocha, Ricardo; Álvarez Ayllón, Alejandro; Manzi, Andrea; Keeble, Oliver; Calvet, Ivan; Regala, Miguel Antonio
2014-06-01
Recent developments, including low power devices, cluster file systems and cloud storage, represent an explosion in the possibilities for deploying and managing grid storage. In this paper we present how different technologies can be leveraged to build a storage service with differing cost, power, performance, scalability and reliability profiles, using the popular storage solution Disk Pool Manager (DPM/dmlite) as the enabling technology. The storage manager DPM is designed for these new environments, allowing users to scale up and down as they need it, and optimizing their computing centers energy efficiency and costs. DPM runs on high-performance machines, profiting from multi-core and multi-CPU setups. It supports separating the database from the metadata server, the head node, largely reducing its hard disk requirements. Since version 1.8.6, DPM is released in EPEL and Fedora, simplifying distribution and maintenance, but also supporting the ARM architecture beside i386 and x86_64, allowing it to run the smallest low-power machines such as the Raspberry Pi or the CuBox. This usage is facilitated by the possibility to scale horizontally using a main database and a distributed memcached-powered namespace cache. Additionally, DPM supports a variety of storage pools in the backend, most importantly HDFS, S3-enabled storage, and cluster file systems, allowing users to fit their DPM installation exactly to their needs. In this paper, we investigate the power-efficiency and total cost of ownership of various DPM configurations. We develop metrics to evaluate the expected performance of a setup both in terms of namespace and disk access considering the overall cost including equipment, power consumptions, or data/storage fees. The setups tested range from the lowest scale using Raspberry Pis with only 700MHz single cores and a 100Mbps network connections, over conventional multi-core servers to typical virtual machine instances in cloud settings. We evaluate the combinations of different name server setups, for example load-balanced clusters, with different storage setups, from using a classic local configuration to private and public clouds.
Striped Data Server for Scalable Parallel Data Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chang, Jin; Gutsche, Oliver; Mandrichenko, Igor
A columnar data representation is known to be an efficient way for data storage, specifically in cases when the analysis is often done based only on a small fragment of the available data structures. A data representation like Apache Parquet is a step forward from a columnar representation, which splits data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, working on the [LDRD Project], we have developed a striped data representation, which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approachmore » allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed no-SQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple so that it allows for a common data model and APIs for wide range of underlying storage mechanisms such as distributed no-SQL databases and local file systems. Striped storage adopts Numpy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service, which allows to hide the server implementation details from the end user, easily exposes data to WAN users, and allows to utilize well known and developed data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise scale data analysis platform for High Energy Physics and similar areas of data processing. We have been testing this architecture with a 2TB dataset from a CMS dark matter search and plan to expand it to multiple 100 TB or even PB scale. We will present the striped format, Striped Data Server architecture and performance test results.« less
A Distributed Approach to System-Level Prognostics
NASA Technical Reports Server (NTRS)
Daigle, Matthew J.; Bregon, Anibal; Roychoudhury, Indranil
2012-01-01
Prognostics, which deals with predicting remaining useful life of components, subsystems, and systems, is a key technology for systems health management that leads to improved safety and reliability with reduced costs. The prognostics problem is often approached from a component-centric view. However, in most cases, it is not specifically component lifetimes that are important, but, rather, the lifetimes of the systems in which these components reside. The system-level prognostics problem can be quite difficult due to the increased scale and scope of the prognostics problem and the relative Jack of scalability and efficiency of typical prognostics approaches. In order to address these is ues, we develop a distributed solution to the system-level prognostics problem, based on the concept of structural model decomposition. The system model is decomposed into independent submodels. Independent local prognostics subproblems are then formed based on these local submodels, resul ting in a scalable, efficient, and flexible distributed approach to the system-level prognostics problem. We provide a formulation of the system-level prognostics problem and demonstrate the approach on a four-wheeled rover simulation testbed. The results show that the system-level prognostics problem can be accurately and efficiently solved in a distributed fashion.
On scalable lossless video coding based on sub-pixel accurate MCTF
NASA Astrophysics Data System (ADS)
Yea, Sehoon; Pearlman, William A.
2006-01-01
We propose two approaches to scalable lossless coding of motion video. They achieve SNR-scalable bitstream up to lossless reconstruction based upon the subpixel-accurate MCTF-based wavelet video coding. The first approach is based upon a two-stage encoding strategy where a lossy reconstruction layer is augmented by a following residual layer in order to obtain (nearly) lossless reconstruction. The key advantages of our approach include an 'on-the-fly' determination of bit budget distribution between the lossy and the residual layers, freedom to use almost any progressive lossy video coding scheme as the first layer and an added feature of near-lossless compression. The second approach capitalizes on the fact that we can maintain the invertibility of MCTF with an arbitrary sub-pixel accuracy even in the presence of an extra truncation step for lossless reconstruction thanks to the lifting implementation. Experimental results show that the proposed schemes achieve compression ratios not obtainable by intra-frame coders such as Motion JPEG-2000 thanks to their inter-frame coding nature. Also they are shown to outperform the state-of-the-art non-scalable inter-frame coder H.264 (JM) lossless mode, with the added benefit of bitstream embeddedness.
Scalable splitting algorithms for big-data interferometric imaging in the SKA era
NASA Astrophysics Data System (ADS)
Onose, Alexandru; Carrillo, Rafael E.; Repetti, Audrey; McEwen, Jason D.; Thiran, Jean-Philippe; Pesquet, Jean-Christophe; Wiaux, Yves
2016-11-01
In the context of next-generation radio telescopes, like the Square Kilometre Array (SKA), the efficient processing of large-scale data sets is extremely important. Convex optimization tasks under the compressive sensing framework have recently emerged and provide both enhanced image reconstruction quality and scalability to increasingly larger data sets. We focus herein mainly on scalability and propose two new convex optimization algorithmic structures able to solve the convex optimization tasks arising in radio-interferometric imaging. They rely on proximal splitting and forward-backward iterations and can be seen, by analogy, with the CLEAN major-minor cycle, as running sophisticated CLEAN-like iterations in parallel in multiple data, prior, and image spaces. Both methods support any convex regularization function, in particular, the well-studied ℓ1 priors promoting image sparsity in an adequate domain. Tailored for big-data, they employ parallel and distributed computations to achieve scalability, in terms of memory and computational requirements. One of them also exploits randomization, over data blocks at each iteration, offering further flexibility. We present simulation results showing the feasibility of the proposed methods as well as their advantages compared to state-of-the-art algorithmic solvers. Our MATLAB code is available online on GitHub.
Maintaining a Distributed File System by Collection and Analysis of Metrics
NASA Technical Reports Server (NTRS)
Bromberg, Daniel
1997-01-01
AFS(originally, Andrew File System) is a widely-deployed distributed file system product used by companies, universities, and laboratories world-wide. However, it is not trivial to operate: runing an AFS cell is a formidable task. It requires a team of dedicated and experienced system administratores who must manage a user base numbring in the thousands, rather than the smaller range of 10 to 500 faced by the typical system administrator.
Shahi, Shahriar; Rahimi, Saeed; Shiezadeh, Vahab; Ashasi, Habib; Abdolrahimi, Majid; Foroughreyhani, Mohammad
2012-01-01
Aim: The aim of the present study was to electrochemically evaluate corrosion resistance of RaCe and Mtwo files after repeated sterilization and preparation procedures. Study Design: A total of 450 rotary files were used. In the working groups, 72 files from each file type were distributed into 4 groups. RaCe and Mtwo files were used to prepare one root canal of the mesial root of extracted human mandibular first molars. The procedure was repeated to prepare 2 to 8 canals. The following irrigation solutions were used: group 1, RaCe files with 2.5% NaOCl; group 2, RaCe files with normal saline; group 3, Mtwo files with 2.5% NaOCl; and group 4, Mtwo files with normal saline in the manner described. In autoclave groups, 72 files from each file type were evenly distributed into 2 groups. Files were used for a cycle of sterilization without the use of files for root canal preparation. Nine new unused files from each file type were used as controls. Then the instruments were sent for corrosion assessment. Mann-Whitney U and Wilcoxon tests were used for independent and dependent groups, respectively. Results: Statistical analysis indicated that there were significant differences in corrosion resistance of files associated with working and autoclave groups between RaCe and Mtwo file types (p<0.001). Conclusions: Corrosion resistance of #25, #30, and #35 Mtwo files is significantly higher than that in RaCe files with similar sizes. Key words:Corrosion, NiTi instruments, autoclave, RaCe, Mtwo. PMID:22143690
New algorithms for processing time-series big EEG data within mobile health monitoring systems.
Serhani, Mohamed Adel; Menshawy, Mohamed El; Benharref, Abdelghani; Harous, Saad; Navaz, Alramzana Nujum
2017-10-01
Recent advances in miniature biomedical sensors, mobile smartphones, wireless communications, and distributed computing technologies provide promising techniques for developing mobile health systems. Such systems are capable of monitoring epileptic seizures reliably, which are classified as chronic diseases. Three challenging issues raised in this context with regard to the transformation, compression, storage, and visualization of big data, which results from a continuous recording of epileptic seizures using mobile devices. In this paper, we address the above challenges by developing three new algorithms to process and analyze big electroencephalography data in a rigorous and efficient manner. The first algorithm is responsible for transforming the standard European Data Format (EDF) into the standard JavaScript Object Notation (JSON) and compressing the transformed JSON data to decrease the size and time through the transfer process and to increase the network transfer rate. The second algorithm focuses on collecting and storing the compressed files generated by the transformation and compression algorithm. The collection process is performed with respect to the on-the-fly technique after decompressing files. The third algorithm provides relevant real-time interaction with signal data by prospective users. It particularly features the following capabilities: visualization of single or multiple signal channels on a smartphone device and query data segments. We tested and evaluated the effectiveness of our approach through a software architecture model implementing a mobile health system to monitor epileptic seizures. The experimental findings from 45 experiments are promising and efficiently satisfy the approach's objectives in a price of linearity. Moreover, the size of compressed JSON files and transfer times are reduced by 10% and 20%, respectively, while the average total time is remarkably reduced by 67% through all performed experiments. Our approach successfully develops efficient algorithms in terms of processing time, memory usage, and energy consumption while maintaining a high scalability of the proposed solution. Our approach efficiently supports data partitioning and parallelism relying on the MapReduce platform, which can help in monitoring and automatic detection of epileptic seizures. Copyright © 2017 Elsevier B.V. All rights reserved.
75 FR 46919 - MidAmerican Energy Company; Notice of Filing
Federal Register 2010, 2011, 2012, 2013, 2014
2010-08-04
... reclassify high voltage assets and accumulated depreciation, from distribution plant accounts to transmission plant accounts. Any person desiring to intervene or to protest this filing must file in accordance with...
Processing Benefits of Resonance Acoustic Mixing on High Performance Propellants and Explosives
2012-02-01
slightly greater stress Modulus similar Dewetting Distribution Statement A: Approved for Public Release Tensile Comparison File: NAVAIR Brief 18...greater stress Modulus similar Dewetting Distribution Statement A: Approved for Public Release Resodyn Mixed Explosive 19 File: NAVAIR Brief
Preliminary surficial geologic map database of the Amboy 30 x 60 minute quadrangle, California
Bedford, David R.; Miller, David M.; Phelps, Geoffrey A.
2006-01-01
The surficial geologic map database of the Amboy 30x60 minute quadrangle presents characteristics of surficial materials for an area approximately 5,000 km2 in the eastern Mojave Desert of California. This map consists of new surficial mapping conducted between 2000 and 2005, as well as compilations of previous surficial mapping. Surficial geology units are mapped and described based on depositional process and age categories that reflect the mode of deposition, pedogenic effects occurring post-deposition, and, where appropriate, the lithologic nature of the material. The physical properties recorded in the database focus on those that drive hydrologic, biologic, and physical processes such as particle size distribution (PSD) and bulk density. This version of the database is distributed with point data representing locations of samples for both laboratory determined physical properties and semi-quantitative field-based information. Future publications will include the field and laboratory data as well as maps of distributed physical properties across the landscape tied to physical process models where appropriate. The database is distributed in three parts: documentation, spatial map-based data, and printable map graphics of the database. Documentation includes this file, which provides a discussion of the surficial geology and describes the format and content of the map data, a database 'readme' file, which describes the database contents, and FGDC metadata for the spatial map information. Spatial data are distributed as Arc/Info coverage in ESRI interchange (e00) format, or as tabular data in the form of DBF3-file (.DBF) file formats. Map graphics files are distributed as Postscript and Adobe Portable Document Format (PDF) files, and are appropriate for representing a view of the spatial database at the mapped scale.
TOASTing Your Images With Montage
NASA Astrophysics Data System (ADS)
Berriman, G. Bruce; Good, John
2017-01-01
The Montage image mosaic engine is a scalable toolkit for creating science-grade mosaics of FITS files, according to the user's specifications of coordinates, projection, sampling, and image rotation. It is written in ANSI-C and runs on all common *nix-based platforms. The code is freely available and is released with a BSD 3-clause license. Version 5 is a major upgrade to Montage, and provides support for creating images that can be consumed by the World Wide Telescope (WWT). Montage treats the TOAST sky tessellation scheme, used by the WWT, as a spherical projection like those in the WCStools library. Thus images in any projection can be converted to the TOAST projection by Montage’s reprojection services. These reprojections can be performed at scale on high-performance platforms and on desktops. WWT consumes PNG or JPEG files, organized according to WWT’s tiling and naming scheme. Montage therefore provides a set of dedicated modules to create the required files from FITS images that contain the TOAST projection. There are two other major features of Version 5. It supports processing of HEALPix files to any projection in the WCS tools library. And it can be built as a library that can be called from other languages, primarily Python. http://montage.ipac.caltech.edu.GitHub download page: https://github.com/Caltech-IPAC/Montage.ASCL record: ascl:1010.036. DOI: dx.doi.org/10.5281/zenodo.49418 Montage is funded by the National Science Foundation under Grant Number ACI-1440620,
A distributed parallel storage architecture and its potential application within EOSDIS
NASA Technical Reports Server (NTRS)
Johnston, William E.; Tierney, Brian; Feuquay, Jay; Butzer, Tony
1994-01-01
We describe the architecture, implementation, use of a scalable, high performance, distributed-parallel data storage system developed in the ARPA funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
A Lightweight, High-performance I/O Management Package for Data-intensive Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jun Wang
2007-07-17
File storage systems are playing an increasingly important role in high-performance computing as the performance gap between CPU and disk increases. It could take a long time to develop an entire system from scratch. Solutions will have to be built as extensions to existing systems. If new portable, customized software components are plugged into these systems, better sustained high I/O performance and higher scalability will be achieved, and the development cycle of next-generation of parallel file systems will be shortened. The overall research objective of this ECPI development plan aims to develop a lightweight, customized, high-performance I/O management package namedmore » LightI/O to extend and leverage current parallel file systems used by DOE. During this period, We have developed a novel component in LightI/O and prototype them into PVFS2, and evaluate the resultant prototype—extended PVFS2 system on data-intensive applications. The preliminary results indicate the extended PVFS2 delivers better performance and reliability to users. A strong collaborative effort between the PI at the University of Nebraska Lincoln and the DOE collaborators—Drs Rob Ross and Rajeev Thakur at Argonne National Laboratory who are leading the PVFS2 group makes the project more promising.« less
Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)
2002-01-01
The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(O) preconditioned CG (PCG) using different programming paradigms and architectures. Results show that for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems, that cache reuse may be more important than reducing communication, that it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution, and that a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gains. A implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread level parallelism.
Neuropsychological constraints to human data production on a global scale
NASA Astrophysics Data System (ADS)
Gros, C.; Kaczor, G.; Marković, D.
2012-01-01
Which are the factors underlying human information production on a global level? In order to gain an insight into this question we study a corpus of 252-633 mil. publicly available data files on the Internet corresponding to an overall storage volume of 284-675 Terabytes. Analyzing the file size distribution for several distinct data types we find indications that the neuropsychological capacity of the human brain to process and record information may constitute the dominant limiting factor for the overall growth of globally stored information, with real-world economic constraints having only a negligible influence. This supposition draws support from the observation that the files size distributions follow a power law for data without a time component, like images, and a log-normal distribution for multimedia files, for which time is a defining qualia.
Simple, Script-Based Science Processing Archive
NASA Technical Reports Server (NTRS)
Lynnes, Christopher; Hegde, Mahabaleshwara; Barth, C. Wrandle
2007-01-01
The Simple, Scalable, Script-based Science Processing (S4P) Archive (S4PA) is a disk-based archival system for remote sensing data. It is based on the data-driven framework of S4P and is used for data transfer, data preprocessing, metadata generation, data archive, and data distribution. New data are automatically detected by the system. S4P provides services such as data access control, data subscription, metadata publication, data replication, and data recovery. It comprises scripts that control the data flow. The system detects the availability of data on an FTP (file transfer protocol) server, initiates data transfer, preprocesses data if necessary, and archives it on readily available disk drives with FTP and HTTP (Hypertext Transfer Protocol) access, allowing instantaneous data access. There are options for plug-ins for data preprocessing before storage. Publication of metadata to external applications such as the Earth Observing System Clearinghouse (ECHO) is also supported. S4PA includes a graphical user interface for monitoring the system operation and a tool for deploying the system. To ensure reliability, S4P continuously checks stored data for integrity, Further reliability is provided by tape backups of disks made once a disk partition is full and closed. The system is designed for low maintenance, requiring minimal operator oversight.
A Scalable Monitoring for the CMS Filter Farm Based on Elasticsearch
DOE Office of Scientific and Technical Information (OSTI.GOV)
Andre, J.M.; et al.
2015-12-23
A flexible monitoring system has been designed for the CMS File-based Filter Farm making use of modern data mining and analytics components. All the metadata and monitoring information concerning data flow and execution of the HLT are generated locally in the form of small documents using the JSON encoding. These documents are indexed into a hierarchy of elasticsearch (es) clusters along with process and system log information. Elasticsearch is a search server based on Apache Lucene. It provides a distributed, multitenant-capable search and aggregation engine. Since es is schema-free, any new information can be added seamlessly and the unstructured informationmore » can be queried in non-predetermined ways. The leaf es clusters consist of the very same nodes that form the Filter Farm thus providing natural horizontal scaling. A separate central” es cluster is used to collect and index aggregated information. The fine-grained information, all the way to individual processes, remains available in the leaf clusters. The central es cluster provides quasi-real-time high-level monitoring information to any kind of client. Historical data can be retrieved to analyse past problems or correlate them with external information. We discuss the design and performance of this system in the context of the CMS DAQ commissioning for LHC Run 2.« less
TagDigger: user-friendly extraction of read counts from GBS and RAD-seq data.
Clark, Lindsay V; Sacks, Erik J
2016-01-01
In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult. We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output is in CSV format so that it can be opened by spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file can be imported listing the names of markers to retain. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files. TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.
10 CFR 60.22 - Filing and distribution of application.
Code of Federal Regulations, 2010 CFR
2010-01-01
... GEOLOGIC REPOSITORIES Licenses License Applications § 60.22 Filing and distribution of application. (a) An application for a construction authorization for a high-level radioactive waste repository at a geologic repository operations area, and an application for a license to receive and possess source, special nuclear...
Applications of Coding in Network Communications
ERIC Educational Resources Information Center
Chang, Christopher SungWook
2012-01-01
This thesis uses the tool of network coding to investigate fast peer-to-peer file distribution, anonymous communication, robust network construction under uncertainty, and prioritized transmission. In a peer-to-peer file distribution system, we use a linear optimization approach to show that the network coding framework significantly simplifies…
10 CFR 61.20 - Filing and distribution of application.
Code of Federal Regulations, 2012 CFR
2012-01-01
... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...
10 CFR 61.20 - Filing and distribution of application.
Code of Federal Regulations, 2013 CFR
2013-01-01
... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...
10 CFR 61.20 - Filing and distribution of application.
Code of Federal Regulations, 2011 CFR
2011-01-01
... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...
10 CFR 61.20 - Filing and distribution of application.
Code of Federal Regulations, 2014 CFR
2014-01-01
... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...
10 CFR 61.20 - Filing and distribution of application.
Code of Federal Regulations, 2010 CFR
2010-01-01
... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...
Scuba: scalable kernel-based gene prioritization.
Zampieri, Guido; Tran, Dinh Van; Donini, Michele; Navarin, Nicolò; Aiolli, Fabio; Sperduti, Alessandro; Valle, Giorgio
2018-01-25
The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge, however, their practical implementation is often precluded by their limited scalability. We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to efficiently deal both with a large amount of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba integrates also a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods. Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba .
Transparency in Distributed File Systems
1989-01-01
ORGANIZATION NAME AND ADDRESS 10. PROGRAM ELEMENT. PROJECT, TASK Computer Science Department AREA & WORK UNIT NUMBERS 734 Comouter Studies Bldc . University of...sistency control , file and director) placement, and file and directory migration in a way that pro- 3 vides full network transparency. This transparency...areas of naming, replication, con- sistency control , file and directory placement, and file and directory migration in a way that pro- 3 vides full
NASA Astrophysics Data System (ADS)
Leitão, João; Pereira, José; Rodrigues, Luís
Gossip, or epidemic, protocols have emerged as a powerful strategy to implement highly scalable and resilient reliable broadcast primitives on large scale peer-to-peer networks. Epidemic protocols are scalable because they distribute the load among all nodes in the system and resilient because they have an intrinsic level of redundancy that masks node and network failures. This chapter provides an introduction to gossip-based broadcast on large-scale unstructured peer-to-peer overlay networks: it surveys the main results in the field, discusses techniques to build and maintain the overlays that support efficient dissemination strategies, and provides an in-depth discussion and experimental evaluation of two concrete protocols, named HyParView and Plumtree.
CloudMan as a platform for tool, data, and analysis distribution.
Afgan, Enis; Chapman, Brad; Taylor, James
2012-11-27
Cloud computing provides an infrastructure that facilitates large scale computational analysis in a scalable, democratized fashion, However, in this context it is difficult to ensure sharing of an analysis environment and associated data in a scalable and precisely reproducible way. CloudMan (usecloudman.org) enables individual researchers to easily deploy, customize, and share their entire cloud analysis environment, including data, tools, and configurations. With the enabled customization and sharing of instances, CloudMan can be used as a platform for collaboration. The presented solution improves accessibility of cloud resources, tools, and data to the level of an individual researcher and contributes toward reproducibility and transparency of research solutions.
2017-05-01
Parallelizing PINT The main focus of our research into the parallelization of the PINT algorithm has been to find appropriately scalable matrix math algorithms...leading eigenvector of the adjacency matrix of the pairwise affinity graph. We reviewed the matrix math implementation currently being used in PINT and...the new versions support a feature called matrix.distributed, which is some level of support for distributed matrix math ; however our code is not
FOS: A Factored Operating Systems for High Assurance and Scalability on Multicores
2012-08-01
computing. It builds on previous work in distributed and microkernel OSes by factoring services out of the kernel, and then further distributing each...2 3.0 Methods, Assumptions, and Procedures (System Design) .................................................. 4 3.1 Microkernel ...cooperating servers. We term such a service a fleet. Figure 2 shows the high-level architecture of fos. A small microkernel runs on every core
Comparative-effectiveness research in distributed health data networks.
Toh, S; Platt, R; Steiner, J F; Brown, J S
2011-12-01
Comparative-effectiveness research (CER) can be conducted within a distributed health data network. Such networks allow secure access to separate data sets from different data partners and overcome many practical obstacles related to patient privacy, data security, and proprietary concerns. A scalable network architecture supports a wide range of CER activities and meets the data infrastructure needs envisioned by the Federal Coordinating Council for Comparative Effectiveness Research.
Scalable collaborative risk management technology for complex critical systems
NASA Technical Reports Server (NTRS)
Campbell, Scott; Torgerson, Leigh; Burleigh, Scott; Feather, Martin S.; Kiper, James D.
2004-01-01
We describe here our project and plans to develop methods, software tools, and infrastructure tools to address challenges relating to geographically distributed software development. Specifically, this work is creating an infrastructure that supports applications working over distributed geographical and organizational domains and is using this infrastructure to develop a tool that supports project development using risk management and analysis techniques where the participants are not collocated.
NCI's Distributed Geospatial Data Server
NASA Astrophysics Data System (ADS)
Larraondo, P. R.; Evans, B. J. K.; Antony, J.
2016-12-01
Earth systems, environmental and geophysics datasets are an extremely valuable source of information about the state and evolution of the Earth. However, different disciplines and applications require this data to be post-processed in different ways before it can be used. For researchers experimenting with algorithms across large datasets or combining multiple data sets, the traditional approach to batch data processing and storing all the output for later analysis rapidly becomes unfeasible, and often requires additional work to publish for others to use. Recent developments on distributed computing using interactive access to significant cloud infrastructure opens the door for new ways of processing data on demand, hence alleviating the need for storage space for each individual copy of each product. The Australian National Computational Infrastructure (NCI) has developed a highly distributed geospatial data server which supports interactive processing of large geospatial data products, including satellite Earth Observation data and global model data, using flexible user-defined functions. This system dynamically and efficiently distributes the required computations among cloud nodes and thus provides a scalable analysis capability. In many cases this completely alleviates the need to preprocess and store the data as products. This system presents a standards-compliant interface, allowing ready accessibility for users of the data. Typical data wrangling problems such as handling different file formats and data types, or harmonising the coordinate projections or temporal and spatial resolutions, can now be handled automatically by this service. The geospatial data server exposes functionality for specifying how the data should be aggregated and transformed. The resulting products can be served using several standards such as the Open Geospatial Consortium's (OGC) Web Map Service (WMS) or Web Feature Service (WFS), Open Street Map tiles, or raw binary arrays under different conventions. We will show some cases where we have used this new capability to provide a significant improvement over previous approaches.
A Domain-Specific Language for Aviation Domain Interoperability
ERIC Educational Resources Information Center
Comitz, Paul
2013-01-01
Modern information systems require a flexible, scalable, and upgradeable infrastructure that allows communication and collaboration between heterogeneous information processing and computing environments. Aviation systems from different organizations often use differing representations and distribution policies for the same data and messages,…
26 CFR 1.305-3 - Disproportionate distributions.
Code of Federal Regulations, 2010 CFR
2010-04-01
... 26 Internal Revenue 4 2010-04-01 2010-04-01 false Disproportionate distributions. 1.305-3 Section 1.305-3 Internal Revenue INTERNAL REVENUE SERVICE, DEPARTMENT OF THE TREASURY (CONTINUED) INCOME TAX... filed with the Internal Revenue Service Center with which the income tax return was filed. (4) See § 1...
2010-01-01
Background An important focus of genomic science is the discovery and characterization of all functional elements within genomes. In silico methods are used in genome studies to discover putative regulatory genomic elements (called words or motifs). Although a number of methods have been developed for motif discovery, most of them lack the scalability needed to analyze large genomic data sets. Methods This manuscript presents WordSeeker, an enumerative motif discovery toolkit that utilizes multi-core and distributed computational platforms to enable scalable analysis of genomic data. A controller task coordinates activities of worker nodes, each of which (1) enumerates a subset of the DNA word space and (2) scores words with a distributed Markov chain model. Results A comprehensive suite of performance tests was conducted to demonstrate the performance, speedup and efficiency of WordSeeker. The scalability of the toolkit enabled the analysis of the entire genome of Arabidopsis thaliana; the results of the analysis were integrated into The Arabidopsis Gene Regulatory Information Server (AGRIS). A public version of WordSeeker was deployed on the Glenn cluster at the Ohio Supercomputer Center. Conclusion WordSeeker effectively utilizes concurrent computing platforms to enable the identification of putative functional elements in genomic data sets. This capability facilitates the analysis of the large quantity of sequenced genomic data. PMID:21210985
GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simmhan, Yogesh; Kumbhare, Alok; Wickramaarachchi, Charith
2014-08-25
Large scale graph processing is a major research area for Big Data exploration. Vertex centric programming models like Pregel are gaining traction due to their simple abstraction that allows for scalable execution on distributed systems naturally. However, there are limitations to this approach which cause vertex centric algorithms to under-perform due to poor compute to communication overhead ratio and slow convergence of iterative superstep. In this paper we introduce GoFFish a scalable sub-graph centric framework co-designed with a distributed persistent graph storage for large scale graph analytics on commodity clusters. We introduce a sub-graph centric programming abstraction that combines themore » scalability of a vertex centric approach with the flexibility of shared memory sub-graph computation. We map Connected Components, SSSP and PageRank algorithms to this model to illustrate its flexibility. Further, we empirically analyze GoFFish using several real world graphs and demonstrate its significant performance improvement, orders of magnitude in some cases, compared to Apache Giraph, the leading open source vertex centric implementation. We map Connected Components, SSSP and PageRank algorithms to this model to illustrate its flexibility. Further, we empirically analyze GoFFish using several real world graphs and demonstrate its significant performance improvement, orders of magnitude in some cases, compared to Apache Giraph, the leading open source vertex centric implementation.« less
Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R; Bock, Davi D; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R Clay; Smith, Stephen J; Szalay, Alexander S; Vogelstein, Joshua T; Vogelstein, R Jacob
2013-01-01
We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes - neural connectivity maps of the brain-using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems-reads to parallel disk arrays and writes to solid-state storage-to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effec-tiveness of spatial data organization.
Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R.; Bock, Davi D.; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C.; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R. Clay; Smith, Stephen J.; Szalay, Alexander S.; Vogelstein, Joshua T.; Vogelstein, R. Jacob
2013-01-01
We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes— neural connectivity maps of the brain—using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems—reads to parallel disk arrays and writes to solid-state storage—to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effec-tiveness of spatial data organization. PMID:24401992
NASA Astrophysics Data System (ADS)
Vincenti, Henri; Vay, Jean-Luc
2018-07-01
The advent of massively parallel supercomputers, with their distributed-memory technology using many processing units, has favored the development of highly-scalable local low-order solvers at the expense of harder-to-scale global very high-order spectral methods. Indeed, FFT-based methods, which were very popular on shared memory computers, have been largely replaced by finite-difference (FD) methods for the solution of many problems, including plasmas simulations with electromagnetic Particle-In-Cell methods. For some problems, such as the modeling of so-called "plasma mirrors" for the generation of high-energy particles and ultra-short radiations, we have shown that the inaccuracies of standard FD-based PIC methods prevent the modeling on present supercomputers at sufficient accuracy. We demonstrate here that a new method, based on the use of local FFTs, enables ultrahigh-order accuracy with unprecedented scalability, and thus for the first time the accurate modeling of plasma mirrors in 3D.
Trucks involved in fatal accidents codebook 2008.
DOT National Transportation Integrated Search
2011-01-01
This report provides documentation for UMTRIs file of Trucks Involved in Fatal Accidents : (TIFA), 2008, including distributions of the code values for each variable in the file. The 2008 : TIFA file is a census of all medium and heavy trucks invo...
Buses involved in fatal accidents codebook 2008.
DOT National Transportation Integrated Search
2011-03-01
This report provides documentation for UMTRIs file of Buses Involved in Fatal Accidents (BIFA), 2008, : including distributions of the code values for each variable in the file. The 2008 BIFA file is a census of all : buses involved in a fatal acc...
Buses involved in fatal accidents codebook 2007.
DOT National Transportation Integrated Search
2009-12-01
This report provides documentation for UMTRIs file of Buses Involved in Fatal Accidents (BIFA), 2007, : including distributions of the code values for each variable in the file. The 2007 BIFA file is a census of all : buses involved in a fatal acc...
Analysis of the access patterns at GSFC distributed active archive center
NASA Technical Reports Server (NTRS)
Johnson, Theodore; Bedet, Jean-Jacques
1996-01-01
The Goddard Space Flight Center (GSFC) Distributed Active Archive Center (DAAC) has been operational for more than two years. Its mission is to support existing and pre Earth Observing System (EOS) Earth science datasets, facilitate the scientific research, and test Earth Observing System Data and Information System (EOSDIS) concepts. Over 550,000 files and documents have been archived, and more than six Terabytes have been distributed to the scientific community. Information about user request and file access patterns, and their impact on system loading, is needed to optimize current operations and to plan for future archives. To facilitate the management of daily activities, the GSFC DAAC has developed a data base system to track correspondence, requests, ingestion and distribution. In addition, several log files which record transactions on Unitree are maintained and periodically examined. This study identifies some of the users' requests and file access patterns at the GSFC DAAC during 1995. The analysis is limited to the subset of orders for which the data files are under the control of the Hierarchical Storage Management (HSM) Unitree. The results show that most of the data volume ordered was for two data products. The volume was also mostly made up of level 3 and 4 data and most of the volume was distributed on 8 mm and 4 mm tapes. In addition, most of the volume ordered was for deliveries in North America although there was a significant world-wide use. There was a wide range of request sizes in terms of volume and number of files ordered. On an average 78.6 files were ordered per request. Using the data managed by Unitree, several caching algorithms have been evaluated for both hit rate and the overhead ('cost') associated with the movement of data from near-line devices to disks. The algorithm called LRU/2 bin was found to be the best for this workload, but the STbin algorithm also worked well.
Extensible Interest Management for Scalable Persistent Distributed Virtual Environments
1999-12-01
Calvin, Cebula et al. 1995; Morse, Bic et al. 2000) uses a two grid, with each grid cell having two multicast addresses. An entity expresses interest...Entity distribution for experimental runs 78 s I * • ...... ^..... * * a» Sis*«*»* 1 ***** Jj |r...Multiple Users and Shared Applications with VRML. VRML 97, Monterey, CA. pp. 33-40. Calvin, J. O., D. P. Cebula , et al. (1995). Data Subscription in
DOT National Transportation Integrated Search
2006-01-01
The project focuses on two major issues - the improvement of current work zone design practices and an analysis of : vehicle interarrival time (IAT) and speed distributions for the development of a digital computer simulation model for : queues and t...
Chelsea Lancelle
2013-09-11
In September 2013, an experiment using Distributed Acoustic Sensing (DAS) was conducted at Garner Valley, a test site of the University of California Santa Barbara (Lancelle et al., 2014). This submission includes all DAS data recorded during the experiment. The sampling rate for all files is 1000 samples per second. Any files with the same filename but ending in _01, _02, etc. represent sequential files from the same test. Locations of the sources are plotted on the basemap in GDR submission 481, titled: "PoroTomo Subtask 3.2 Sample data from a Distributed Acoustic Sensing experiment at Garner Valley, California (PoroTomo Subtask 3.2)." Lancelle, C., N. Lord, H. Wang, D. Fratta, R. Nigbor, A. Chalari, R. Karaulanov, J. Baldwin, and E. Castongia (2014), Directivity and Sensitivity of Fiber-Optic Cable Measuring Ground Motion using a Distributed Acoustic Sensing Array (abstract # NS31C-3935), AGU Fall Meeting. https://agu.confex.com/agu/fm1/meetingapp.cgi#Paper/19828 The e-poster is available at: https://agu.confex.com/data/handout/agu/fm14/Paper_19828_handout_696_0.pdf
Scalable and balanced dynamic hybrid data assimilation
NASA Astrophysics Data System (ADS)
Kauranne, Tuomo; Amour, Idrissa; Gunia, Martin; Kallio, Kari; Lepistö, Ahti; Koponen, Sampsa
2017-04-01
Scalability of complex weather forecasting suites is dependent on the technical tools available for implementing highly parallel computational kernels, but to an equally large extent also on the dependence patterns between various components of the suite, such as observation processing, data assimilation and the forecast model. Scalability is a particular challenge for 4D variational assimilation methods that necessarily couple the forecast model into the assimilation process and subject this combination to an inherently serial quasi-Newton minimization process. Ensemble based assimilation methods are naturally more parallel, but large models force ensemble sizes to be small and that results in poor assimilation accuracy, somewhat akin to shooting with a shotgun in a million-dimensional space. The Variational Ensemble Kalman Filter (VEnKF) is an ensemble method that can attain the accuracy of 4D variational data assimilation with a small ensemble size. It achieves this by processing a Gaussian approximation of the current error covariance distribution, instead of a set of ensemble members, analogously to the Extended Kalman Filter EKF. Ensemble members are re-sampled every time a new set of observations is processed from a new approximation of that Gaussian distribution which makes VEnKF a dynamic assimilation method. After this a smoothing step is applied that turns VEnKF into a dynamic Variational Ensemble Kalman Smoother VEnKS. In this smoothing step, the same process is iterated with frequent re-sampling of the ensemble but now using past iterations as surrogate observations until the end result is a smooth and balanced model trajectory. In principle, VEnKF could suffer from similar scalability issues as 4D-Var. However, this can be avoided by isolating the forecast model completely from the minimization process by implementing the latter as a wrapper code whose only link to the model is calling for many parallel and totally independent model runs, all of them implemented as parallel model runs themselves. The only bottleneck in the process is the gathering and scattering of initial and final model state snapshots before and after the parallel runs which requires a very efficient and low-latency communication network. However, the volume of data communicated is small and the intervening minimization steps are only 3D-Var, which means their computational load is negligible compared with the fully parallel model runs. We present example results of scalable VEnKF with the 4D lake and shallow sea model COHERENS, assimilating simultaneously continuous in situ measurements in a single point and infrequent satellite images that cover a whole lake, with the fully scalable VEnKF.
NASA Astrophysics Data System (ADS)
Yang, W.; Min, M.; Bai, Y.; Lynnes, C.; Holloway, D.; Enloe, Y.; di, L.
2008-12-01
In the past few years, there have been growing interests, among major earth observing satellite (EOS) data providers, in serving data through the interoperable Web Coverage Service (WCS) interface protocol, developed by the Open Geospatial Consortium (OGC). The interface protocol defined in WCS specifications allows client software to make customized requests of multi-dimensional EOS data, including spatial and temporal subsetting, resampling and interpolation, and coordinate reference system (CRS) transformation. A WCS server describes an offered coverage, i.e., a data product, through a response to a client's DescribeCoverage request. The description includes the offered coverage's spatial/temporal extents and resolutions, supported CRSs, supported interpolation methods, and supported encoding formats. Based on such information, a client can request the entire or a subset of coverage in any spatial/temporal resolutions and in any one of the supported CRSs, formats, and interpolation methods. When implementing a WCS server, a data provider has different approaches to present its data holdings to clients. One of the most straightforward, and commonly used, approaches is to offer individual physical data files as separate coverages. Such implementation, however, will result in too many offered coverages for large data holdings and it also cannot fully present the relationship among different, but spatially and/or temporally associated, data files. It is desirable to disconnect offered coverages from physical data files so that the former is more coherent, especially in spatial and temporal domains. Therefore, some servers offer one single coverage for a set of spatially coregistered time series data files such as a daily global precipitation coverage linked to many global single- day precipitation files; others offer one single coverage for multiple temporally coregistered files together forming a large spatial extent. In either case, a server needs to assemble an output coverage real-time by combining potentially large number of physical files, which can be operationally difficult. The task becomes more challenging if an offered coverage involves spatially and temporally un-registered physical files. In this presentation, we will discuss issues and lessons learned in providing NASA's AIRS Level 2 atmospheric products, which are in satellite swath CRS and in 6-minute segment granule files, as virtual global coverages. We"ll discuss the WCS server's on- the-fly georectification, mosaicking, quality screening, performance, and scalability.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-09-02
... deregistered. Filing Dates: The application was filed on March 17, 2011, and amended and restated on June 24.... On March 30, 2011, applicant made a final liquidating distribution to its shareholders, based on net.... Filing Dates: The application was filed on June 24, 2011 and amended on August 5, 2011. Applicant's...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-01-21
...), Rockville, Maryland 20852 and is accessible from the NRC's Agencywide Documents Access and Management System... receipt of the document. The E-Filing system also distributes an e-mail notice that provides access to the... intervene is filed so that they can obtain access to the document via the E-Filing system. A person filing...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-07-21
... NRC's Agencywide Documents Access and Management System (ADAMS) Public Electronic Reading Room on the... receipt of the document. The E-Filing system also distributes an e-mail notice that provides access to the... intervene is filed so that they can obtain access to the document via the E-Filing system. A person filing...
JBrowse: a dynamic web platform for genome visualization and analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Buels, Robert; Yao, Eric; Diesh, Colin M.
JBrowse is a fast and full-featured genome browser built with JavaScript and HTML5. It is easily embedded into websites or apps but can also be served as a standalone web page. Overall improvements to speed and scalability are accompanied by specific enhancements that support complex interactive queries on large track sets. Analysis functions can readily be added using the plugin framework; most visual aspects of tracks can also be customized, along with clicks, mouseovers, menus, and popup boxes. JBrowse can also be used to browse local annotation files offline and to generate high-resolution figures for publication. JBrowse is a maturemore » web application suitable for genome visualization and analysis.« less
Grange, Rebecca L.; Clizbe, Elizabeth A.; Counsell, Emma J.
2015-01-01
We have devised a highly regio- and enantioselective iridium-catalyzed allylic amination reaction with the sulfur-stabilized aza-ylide, S,S-diphenylsulfilimine. This process provides a robust and scalable method for the construction of aryl-, alkyl- and alkenyl-substituted C-chiral allylic sulfilimines, which are important functional groups for organic synthesis. Additionally, the combination of the allylic amination with an in situ deprotection of the sulfilimine constitutes a convenient one-pot protocol for the construction of chiral nonracemic primary allylic amines. PMID:28936319
JBrowse: a dynamic web platform for genome visualization and analysis.
Buels, Robert; Yao, Eric; Diesh, Colin M; Hayes, Richard D; Munoz-Torres, Monica; Helt, Gregg; Goodstein, David M; Elsik, Christine G; Lewis, Suzanna E; Stein, Lincoln; Holmes, Ian H
2016-04-12
JBrowse is a fast and full-featured genome browser built with JavaScript and HTML5. It is easily embedded into websites or apps but can also be served as a standalone web page. Overall improvements to speed and scalability are accompanied by specific enhancements that support complex interactive queries on large track sets. Analysis functions can readily be added using the plugin framework; most visual aspects of tracks can also be customized, along with clicks, mouseovers, menus, and popup boxes. JBrowse can also be used to browse local annotation files offline and to generate high-resolution figures for publication. JBrowse is a mature web application suitable for genome visualization and analysis.
Trucks involved in fatal accidents codebook 2004 (Version March 23, 2007).
DOT National Transportation Integrated Search
2007-03-01
"This report provides documentation for UMTRIs file of Trucks Involved in Fatal Accidents (TIFA), : 2004, including distributions of the code values for each variable in the file. The 2004 TIFA file is : a census of all medium and heavy trucks inv...
Trucks involved in fatal accidents codebook 2010 (Version October 22, 2012).
DOT National Transportation Integrated Search
2012-11-01
This report provides documentation for UMTRIs file of Trucks Involved in Fatal Accidents : (TIFA), 2010, including distributions of the code values for each variable in the file. The 2010 : TIFA file is a census of all medium and heavy trucks invo...
21 CFR 720.3 - How and where to file.
Code of Federal Regulations, 2011 CFR
2011-04-01
... Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...
The Open Microscopy Environment: open image informatics for the biological sciences
NASA Astrophysics Data System (ADS)
Blackburn, Colin; Allan, Chris; Besson, Sébastien; Burel, Jean-Marie; Carroll, Mark; Ferguson, Richard K.; Flynn, Helen; Gault, David; Gillen, Kenneth; Leigh, Roger; Leo, Simone; Li, Simon; Lindner, Dominik; Linkert, Melissa; Moore, Josh; Moore, William J.; Ramalingam, Balaji; Rozbicki, Emil; Rustici, Gabriella; Tarkowska, Aleksandra; Walczysko, Petr; Williams, Eleanor; Swedlow, Jason R.
2016-07-01
Despite significant advances in biological imaging and analysis, major informatics challenges remain unsolved: file formats are proprietary, storage and analysis facilities are lacking, as are standards for sharing image data and results. While the open FITS file format is ubiquitous in astronomy, astronomical imaging shares many challenges with biological imaging, including the need to share large image sets using secure, cross-platform APIs, and the need for scalable applications for processing and visualization. The Open Microscopy Environment (OME) is an open-source software framework developed to address these challenges. OME tools include: an open data model for multidimensional imaging (OME Data Model); an open file format (OME-TIFF) and library (Bio-Formats) enabling free access to images (5D+) written in more than 145 formats from many imaging domains, including FITS; and a data management server (OMERO). The Java-based OMERO client-server platform comprises an image metadata store, an image repository, visualization and analysis by remote access, allowing sharing and publishing of image data. OMERO provides a means to manage the data through a multi-platform API. OMERO's model-based architecture has enabled its extension into a range of imaging domains, including light and electron microscopy, high content screening, digital pathology and recently into applications using non-image data from clinical and genomic studies. This is made possible using the Bio-Formats library. The current release includes a single mechanism for accessing image data of all types, regardless of original file format, via Java, C/C++ and Python and a variety of applications and environments (e.g. ImageJ, Matlab and R).
A Debugger for Computational Grid Applications
NASA Technical Reports Server (NTRS)
Hood, Robert; Jost, Gabriele; Biegel, Bryan (Technical Monitor)
2001-01-01
This viewgraph presentation gives an overview of a debugger for computational grid applications. Details are given on NAS parallel tools groups (including parallelization support tools, evaluation of various parallelization strategies, and distributed and aggregated computing), debugger dependencies, scalability, initial implementation, the process grid, and information on Globus.
2015-06-01
raspberry pi , robotic operation system (ros), arduino 15. NUMBER OF PAGES 123 16. PRICE CODE 17. SECURITY CLASSIFICATION OF REPORT...51 2. Raspberry Pi ...52 Figure 21. The Raspberry Pi B+ model, from [24
Ultrascale collaborative visualization using a display-rich global cyberinfrastructure.
Jeong, Byungil; Leigh, Jason; Johnson, Andrew; Renambot, Luc; Brown, Maxine; Jagodic, Ratko; Nam, Sungwon; Hur, Hyejung
2010-01-01
The scalable adaptive graphics environment (SAGE) is high-performance graphics middleware for ultrascale collaborative visualization using a display-rich global cyberinfrastructure. Dozens of sites worldwide use this cyberinfrastructure middleware, which connects high-performance-computing resources over high-speed networks to distributed ultraresolution displays.
Vulnerability detection using data-flow graphs and SMT solvers
2016-10-31
concerns. The framework is modular and pipelined to allow scalable analysis on distributed systems. Our vulnerability detection framework employs machine...Design We designed the framework to be modular to enable flexible reuse and extendibility. In its current form, our framework performs the following
NASA Astrophysics Data System (ADS)
Zorzano, M.-P.; Martín-Torres, J.; Mathanlal, T.; Vakkada Ramachandran, A.; Ramirez-Luque, J.-A.
2018-04-01
The purpose of this work is to demonstrate the operability of a network of small-sized detectors of the PACKMAN instrument, operated simultaneously to provide real time cosmic ray and solar activity monitoring over the entire planet.
Halligan, Brian D.; Geiger, Joey F.; Vallejos, Andrew K.; Greene, Andrew S.; Twigger, Simon N.
2009-01-01
One of the major difficulties for many laboratories setting up proteomics programs has been obtaining and maintaining the computational infrastructure required for the analysis of the large flow of proteomics data. We describe a system that combines distributed cloud computing and open source software to allow laboratories to set up scalable virtual proteomics analysis clusters without the investment in computational hardware or software licensing fees. Additionally, the pricing structure of distributed computing providers, such as Amazon Web Services, allows laboratories or even individuals to have large-scale computational resources at their disposal at a very low cost per run. We provide detailed step by step instructions on how to implement the virtual proteomics analysis clusters as well as a list of current available preconfigured Amazon machine images containing the OMSSA and X!Tandem search algorithms and sequence databases on the Medical College of Wisconsin Proteomics Center website (http://proteomics.mcw.edu/vipdac). PMID:19358578
Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Daily, Jeffrey A.; Lewis, Robert R.
2011-11-30
Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement calledmore » Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.« less
Tierney, Brian D.; Choi, Sukwon; DasGupta, Sandeepan; ...
2017-08-16
A distributed impedance “field cage” structure is proposed and evaluated for electric field control in GaN-based, lateral high electron mobility transistors (HEMTs) operating as kilovolt-range power devices. In this structure, a resistive voltage divider is used to control the electric field throughout the active region. The structure complements earlier proposals utilizing floating field plates that did not employ resistively connected elements. Transient results, not previously reported for field plate schemes using either floating or resistively connected field plates, are presented for ramps of dV ds /dt = 100 V/ns. For both DC and transient results, the voltage between the gatemore » and drain is laterally distributed, ensuring the electric field profile between the gate and drain remains below the critical breakdown field as the source-to-drain voltage is increased. Our scheme indicates promise for achieving breakdown voltage scalability to a few kV.« less
A universal quantum information processor for scalable quantum communication and networks
Yang, Xihua; Xue, Bolin; Zhang, Junxiang; Zhu, Shiyao
2014-01-01
Entanglement provides an essential resource for quantum computation, quantum communication, and quantum networks. How to conveniently and efficiently realize the generation, distribution, storage, retrieval, and control of multipartite entanglement is the basic requirement for realistic quantum information processing. Here, we present a theoretical proposal to efficiently and conveniently achieve a universal quantum information processor (QIP) via atomic coherence in an atomic ensemble. The atomic coherence, produced through electromagnetically induced transparency (EIT) in the Λ-type configuration, acts as the QIP and has full functions of quantum beam splitter, quantum frequency converter, quantum entangler, and quantum repeater. By employing EIT-based nondegenerate four-wave mixing processes, the generation, exchange, distribution, and manipulation of light-light, atom-light, and atom-atom multipartite entanglement can be efficiently and flexibly achieved in a deterministic way with only coherent light fields. This method greatly facilitates the operations in quantum information processing, and holds promising applications in realistic scalable quantum communication and quantum networks. PMID:25316514
Precise and Efficient Static Array Bound Checking for Large Embedded C Programs
NASA Technical Reports Server (NTRS)
Venet, Arnaud
2004-01-01
In this paper we describe the design and implementation of a static array-bound checker for a family of embedded programs: the flight control software of recent Mars missions. These codes are large (up to 250 KLOC), pointer intensive, heavily multithreaded and written in an object-oriented style, which makes their analysis very challenging. We designed a tool called C Global Surveyor (CGS) that can analyze the largest code in a couple of hours with a precision of 80%. The scalability and precision of the analyzer are achieved by using an incremental framework in which a pointer analysis and a numerical analysis of array indices mutually refine each other. CGS has been designed so that it can distribute the analysis over several processors in a cluster of machines. To the best of our knowledge this is the first distributed implementation of static analysis algorithms. Throughout the paper we will discuss the scalability setbacks that we encountered during the construction of the tool and their impact on the initial design decisions.
Halligan, Brian D; Geiger, Joey F; Vallejos, Andrew K; Greene, Andrew S; Twigger, Simon N
2009-06-01
One of the major difficulties for many laboratories setting up proteomics programs has been obtaining and maintaining the computational infrastructure required for the analysis of the large flow of proteomics data. We describe a system that combines distributed cloud computing and open source software to allow laboratories to set up scalable virtual proteomics analysis clusters without the investment in computational hardware or software licensing fees. Additionally, the pricing structure of distributed computing providers, such as Amazon Web Services, allows laboratories or even individuals to have large-scale computational resources at their disposal at a very low cost per run. We provide detailed step-by-step instructions on how to implement the virtual proteomics analysis clusters as well as a list of current available preconfigured Amazon machine images containing the OMSSA and X!Tandem search algorithms and sequence databases on the Medical College of Wisconsin Proteomics Center Web site ( http://proteomics.mcw.edu/vipdac ).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tierney, Brian D.; Choi, Sukwon; DasGupta, Sandeepan
A distributed impedance “field cage” structure is proposed and evaluated for electric field control in GaN-based, lateral high electron mobility transistors (HEMTs) operating as kilovolt-range power devices. In this structure, a resistive voltage divider is used to control the electric field throughout the active region. The structure complements earlier proposals utilizing floating field plates that did not employ resistively connected elements. Transient results, not previously reported for field plate schemes using either floating or resistively connected field plates, are presented for ramps of dV ds /dt = 100 V/ns. For both DC and transient results, the voltage between the gatemore » and drain is laterally distributed, ensuring the electric field profile between the gate and drain remains below the critical breakdown field as the source-to-drain voltage is increased. Our scheme indicates promise for achieving breakdown voltage scalability to a few kV.« less
Building analytical platform with Big Data solutions for log files of PanDA infrastructure
NASA Astrophysics Data System (ADS)
Alekseev, A. A.; Barreiro Megino, F. G.; Klimentov, A. A.; Korchuganova, T. A.; Maendo, T.; Padolski, S. V.
2018-05-01
The paper describes the implementation of a high-performance system for the processing and analysis of log files for the PanDA infrastructure of the ATLAS experiment at the Large Hadron Collider (LHC), responsible for the workload management of order of 2M daily jobs across the Worldwide LHC Computing Grid. The solution is based on the ELK technology stack, which includes several components: Filebeat, Logstash, ElasticSearch (ES), and Kibana. Filebeat is used to collect data from logs. Logstash processes data and export to Elasticsearch. ES are responsible for centralized data storage. Accumulated data in ES can be viewed using a special software Kibana. These components were integrated with the PanDA infrastructure and replaced previous log processing systems for increased scalability and usability. The authors will describe all the components and their configuration tuning for the current tasks, the scale of the actual system and give several real-life examples of how this centralized log processing and storage service is used to showcase the advantages for daily operations.
The demonstration of significant ferroelectricity in epitaxial Y-doped HfO2 film
Shimizu, Takao; Katayama, Kiliha; Kiguchi, Takanori; Akama, Akihiro; Konno, Toyohiko J.; Sakata, Osami; Funakubo, Hiroshi
2016-01-01
Ferroelectricity and Curie temperature are demonstrated for epitaxial Y-doped HfO2 film grown on (110) yttrium oxide-stabilized zirconium oxide (YSZ) single crystal using Sn-doped In2O3 (ITO) as bottom electrodes. The XRD measurements for epitaxial film enabled us to investigate its detailed crystal structure including orientations of the film. The ferroelectricity was confirmed by electric displacement filed – electric filed hysteresis measurement, which revealed saturated polarization of 16 μC/cm2. Estimated spontaneous polarization based on the obtained saturation polarization and the crystal structure analysis was 45 μC/cm2. This value is the first experimental estimations of the spontaneous polarization and is in good agreement with the theoretical value from first principle calculation. Curie temperature was also estimated to be about 450 °C. This study strongly suggests that the HfO2-based materials are promising for various ferroelectric applications because of their comparable ferroelectric properties including polarization and Curie temperature to conventional ferroelectric materials together with the reported excellent scalability in thickness and compatibility with practical manufacturing processes. PMID:27608815
Gene Graphics: a genomic neighborhood data visualization web application.
Harrison, Katherine J; Crécy-Lagard, Valérie de; Zallot, Rémi
2018-04-15
The examination of gene neighborhood is an integral part of comparative genomics but no tools to produce publication quality graphics of gene clusters are available. Gene Graphics is a straightforward web application for creating such visuals. Supported inputs include National Center for Biotechnology Information gene and protein identifiers with automatic fetching of neighboring information, GenBank files and data extracted from the SEED database. Gene representations can be customized for many parameters including gene and genome names, colors and sizes. Gene attributes can be copied and pasted for rapid and user-friendly customization of homologous genes between species. In addition to Portable Network Graphics and Scalable Vector Graphics, produced representations can be exported as Tagged Image File Format or Encapsulated PostScript, formats that are standard for publication. Hands-on tutorials with real life examples inspired from publications are available for training. Gene Graphics is freely available at https://katlabs.cc/genegraphics/ and source code is hosted at https://github.com/katlabs/genegraphics. katherinejh@ufl.edu or remizallot@ufl.edu. Supplementary data are available at Bioinformatics online.
SLURM: Simple Linux Utility for Resource Management
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jette, M; Dunlap, C; Garlick, J
2002-04-24
Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure. Development will take place in four phases: Phase I results in a solid infrastructure; Phase II produces a functional but limited interactive job initiation capability without use of the interconnect/switch; Phase III provides switch support and documentation; Phase IV provides job status, fault-tolerance, and job queuing and control through Livermore's Distributed Productionmore » Control System (DPCS), a meta-batch and resource management system.« less
Electrohydrodynamic printing for scalable MoS2 flake coating: application to gas sensing device
NASA Astrophysics Data System (ADS)
Lim, Sooman; Cho, Byungjin; Bae, Jaehyun; Kim, Ah Ra; Lee, Kyu Hwan; Kim, Se Hyun; Hahm, Myung Gwan; Nam, Jaewook
2016-10-01
Scalable sub-micrometer molybdenum disulfide ({{MoS}}2) flake films with highly uniform coverage were created using a systematic approach. An electrohydrodynamic (EHD) printing process realized a remarkably uniform distribution of exfoliated {{MoS}}2 flakes on desired substrates. In combination with a fast evaporating dispersion medium and an optimal choice of operating parameters, the EHD printing can produce a film rapidly on a substrate without excessive agglomeration or cluster formation, which can be problems in previously reported liquid-based continuous film methods. The printing of exfoliated {{MoS}}2 flakes enabled the fabrication of a gas sensor with high performance and reproducibility for {{NO}}2 and {{NH}}3.
CloudMan as a platform for tool, data, and analysis distribution
2012-01-01
Background Cloud computing provides an infrastructure that facilitates large scale computational analysis in a scalable, democratized fashion, However, in this context it is difficult to ensure sharing of an analysis environment and associated data in a scalable and precisely reproducible way. Results CloudMan (usecloudman.org) enables individual researchers to easily deploy, customize, and share their entire cloud analysis environment, including data, tools, and configurations. Conclusions With the enabled customization and sharing of instances, CloudMan can be used as a platform for collaboration. The presented solution improves accessibility of cloud resources, tools, and data to the level of an individual researcher and contributes toward reproducibility and transparency of research solutions. PMID:23181507
Distributed file management for remote clinical image-viewing stations
NASA Astrophysics Data System (ADS)
Ligier, Yves; Ratib, Osman M.; Girard, Christian; Logean, Marianne; Trayser, Gerhard
1996-05-01
The Geneva PACS is based on a distributed architecture, with different archive servers used to store all the image files produced by digital imaging modalities. Images can then be visualized on different display stations with the Osiris software. Image visualization require to have the image file physically present on the local station. Thus, images must be transferred from archive servers to local display stations in an acceptable way, which means fast and user friendly where the notion of file must be hidden to users. The transfer of image files is done according to different schemes including prefetching and direct image selection. Prefetching allows the retrieval of previous studies of a patient in advance. A direct image selection is also provided in order to retrieve images on request. When images are transferred locally on the display station, they are stored in Papyrus files, each file containing a set of images. File names are used by the Osiris viewing software to open image sequences. But file names alone are not explicit enough to properly describe the content of the file. A specific utility has been developed to present a list of patients, and for each patient a list of exams which can be selected and automatically displayed. The system has been successfully tested in different clinical environments. It will be soon extended on a hospital wide basis.
Narth, Christophe; Lagardère, Louis; Polack, Étienne; Gresh, Nohad; Wang, Qiantao; Bell, David R; Rackers, Joshua A; Ponder, Jay W; Ren, Pengyu Y; Piquemal, Jean-Philip
2016-02-15
We propose a general coupling of the Smooth Particle Mesh Ewald SPME approach for distributed multipoles to a short-range charge penetration correction modifying the charge-charge, charge-dipole and charge-quadrupole energies. Such an approach significantly improves electrostatics when compared to ab initio values and has been calibrated on Symmetry-Adapted Perturbation Theory reference data. Various neutral molecular dimers have been tested and results on the complexes of mono- and divalent cations with a water ligand are also provided. Transferability of the correction is adressed in the context of the implementation of the AMOEBA and SIBFA polarizable force fields in the TINKER-HP software. As the choices of the multipolar distribution are discussed, conclusions are drawn for the future penetration-corrected polarizable force fields highlighting the mandatory need of non-spurious procedures for the obtention of well balanced and physically meaningful distributed moments. Finally, scalability and parallelism of the short-range corrected SPME approach are addressed, demonstrating that the damping function is computationally affordable and accurate for molecular dynamics simulations of complex bio- or bioinorganic systems in periodic boundary conditions. Copyright © 2016 Wiley Periodicals, Inc.
Research in Distributed Real-Time Systems
NASA Technical Reports Server (NTRS)
Mukkamala, R.
1997-01-01
This document summarizes the progress we have made on our study of issues concerning the schedulability of real-time systems. Our study has produced several results in the scalability issues of distributed real-time systems. In particular, we have used our techniques to resolve schedulability issues in distributed systems with end-to-end requirements. During the next year (1997-98), we propose to extend the current work to address the modeling and workload characterization issues in distributed real-time systems. In particular, we propose to investigate the effect of different workload models and component models on the design and the subsequent performance of distributed real-time systems.
Simulation of Tasks Distribution in Horizontally Scalable Management System
NASA Astrophysics Data System (ADS)
Kustov, D.; Sherstneva, A.; Botygin, I.
2016-08-01
This paper presents an imitational model of the task distribution system for the components of territorially-distributed automated management system with a dynamically changing topology. Each resource of the distributed automated management system is represented with an agent, which allows to set behavior of every resource in the best possible way and ensure their interaction. The agent work load imitation was done via service query imitation formed in a system dynamics style using a stream diagram. The query generation took place in the abstract-represented center - afterwards, they were sent to the drive to be distributed to management system resources according to a ranking table.
Cooperative Learning for Distributed In-Network Traffic Classification
NASA Astrophysics Data System (ADS)
Joseph, S. B.; Loo, H. R.; Ismail, I.; Andromeda, T.; Marsono, M. N.
2017-04-01
Inspired by the concept of autonomic distributed/decentralized network management schemes, we consider the issue of information exchange among distributed network nodes to network performance and promote scalability for in-network monitoring. In this paper, we propose a cooperative learning algorithm for propagation and synchronization of network information among autonomic distributed network nodes for online traffic classification. The results show that network nodes with sharing capability perform better with a higher average accuracy of 89.21% (sharing data) and 88.37% (sharing clusters) compared to 88.06% for nodes without cooperative learning capability. The overall performance indicates that cooperative learning is promising for distributed in-network traffic classification.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-07-18
... SECURITIES AND EXCHANGE COMMISSION [File No. 500-1] BioMETRX, Inc., Biopure Corp. (n/k/a PBBPC, Inc.), Distributed Energy Systems Corp., Fortified Holdings Corp., Knobias, Inc., and One IP Voice... concerning the securities of BioMETRX, Inc. because it has not filed any periodic reports since the period...
48 CFR 552.238-81 - Modification (Federal Supply Schedule).
Code of Federal Regulations, 2014 CFR
2014-10-01
...) Electronic File Updates. The Contractor shall update electronic file submissions to reflect all modifications... electronic file updates. The Contractor may transmit price reductions, item deletions, and corrections... workdays from the last day of the calendar quarter. (2) At a minimum, the Contractor shall distribute each...
75 FR 18201 - Wisconsin Electric Power Company; Notice of Filing
Federal Register 2010, 2011, 2012, 2013, 2014
2010-04-09
... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission [Docket No. ER10-911-001] Wisconsin Electric Power Company; Notice of Filing April 2, 2010. Take notice that on March 26, 2010, Wisconsin Electric Power Company filed counterpart signature pages to the executed Wholesale Distribution Service...
NASA Astrophysics Data System (ADS)
Verma, R. V.
2018-04-01
The Archive Inventory Management System (AIMS) is a software package for understanding the distribution, characteristics, integrity, and nuances of files and directories in large file-based data archives on a continuous basis.
21 CFR 720.3 - How and where to file.
Code of Federal Regulations, 2012 CFR
2012-04-01
... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...
21 CFR 720.3 - How and where to file.
Code of Federal Regulations, 2014 CFR
2014-04-01
... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...
21 CFR 720.3 - How and where to file.
Code of Federal Regulations, 2013 CFR
2013-04-01
... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...
21 CFR 720.3 - How and where to file.
Code of Federal Regulations, 2010 CFR
2010-04-01
... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...
NASA Technical Reports Server (NTRS)
Soltis, Steven R.; Ruwart, Thomas M.; OKeefe, Matthew T.
1996-01-01
The global file system (GFS) is a prototype design for a distributed file system in which cluster nodes physically share storage devices connected via a network-like fiber channel. Networks and network-attached storage devices have advanced to a level of performance and extensibility so that the previous disadvantages of shared disk architectures are no longer valid. This shared storage architecture attempts to exploit the sophistication of storage device technologies whereas a server architecture diminishes a device's role to that of a simple component. GFS distributes the file system responsibilities across processing nodes, storage across the devices, and file system resources across the entire storage pool. GFS caches data on the storage devices instead of the main memories of the machines. Consistency is established by using a locking mechanism maintained by the storage devices to facilitate atomic read-modify-write operations. The locking mechanism is being prototyped in the Silicon Graphics IRIX operating system and is accessed using standard Unix commands and modules.
Distributed-Memory Breadth-First Search on Massive Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Buluc, Aydin; Beamer, Scott; Madduri, Kamesh
This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.
Difet: Distributed Feature Extraction Tool for High Spatial Resolution Remote Sensing Images
NASA Astrophysics Data System (ADS)
Eken, S.; Aydın, E.; Sayar, A.
2017-11-01
In this paper, we propose distributed feature extraction tool from high spatial resolution remote sensing images. Tool is based on Apache Hadoop framework and Hadoop Image Processing Interface. Two corner detection (Harris and Shi-Tomasi) algorithms and five feature descriptors (SIFT, SURF, FAST, BRIEF, and ORB) are considered. Robustness of the tool in the task of feature extraction from LandSat-8 imageries are evaluated in terms of horizontal scalability.
Optimal File-Distribution in Heterogeneous and Asymmetric Storage Networks
NASA Astrophysics Data System (ADS)
Langner, Tobias; Schindelhauer, Christian; Souza, Alexander
We consider an optimisation problem which is motivated from storage virtualisation in the Internet. While storage networks make use of dedicated hardware to provide homogeneous bandwidth between servers and clients, in the Internet, connections between storage servers and clients are heterogeneous and often asymmetric with respect to upload and download. Thus, for a large file, the question arises how it should be fragmented and distributed among the servers to grant "optimal" access to the contents. We concentrate on the transfer time of a file, which is the time needed for one upload and a sequence of n downloads, using a set of m servers with heterogeneous bandwidths. We assume that fragments of the file can be transferred in parallel to and from multiple servers. This model yields a distribution problem that examines the question of how these fragments should be distributed onto those servers in order to minimise the transfer time. We present an algorithm, called FlowScaling, that finds an optimal solution within running time {O}(m log m). We formulate the distribution problem as a maximum flow problem, which involves a function that states whether a solution with a given transfer time bound exists. This function is then used with a scaling argument to determine an optimal solution within the claimed time complexity.
NASA Technical Reports Server (NTRS)
Tinetti, Ana F.; Maglieri, Domenic J.; Driver, Cornelius; Bobbitt, Percy J.
2011-01-01
A detailed geometric description, in wave drag format, has been developed for the Convair B-58 and North American XB-70-1 delta wing airplanes. These descriptions have been placed on electronic files, the contents of which are described in this paper They are intended for use in wave drag and sonic boom calculations. Included in the electronic file and in the present paper are photographs and 3-view drawings of the two airplanes, tabulated geometric descriptions of each vehicle and its components, and comparisons of the electronic file outputs with existing data. The comparisons include a pictorial of the two airplanes based on the present geometric descriptions, and cross-sectional area distributions for both the normal Mach cuts and oblique Mach cuts above and below the vehicles. Good correlation exists between the area distributions generated in the late 1950s and 1960s and the present files. The availability of these electronic files facilitates further validation of sonic boom prediction codes through the use of two existing data bases on these airplanes, which were acquired in the 1960s and have not been fully exploited.
78 FR 17650 - Combined Notice of Filings #2
Federal Register 2010, 2011, 2012, 2013, 2014
2013-03-22
...: Spectrum Nevada Solar, LLC. Description: Application and Initial Baseline Tariff Filing to be effective 4... Service Agreement CA PV Energy, LLC at 1670 Champagne Ave to be effective 3/16/2013. Filed Date: 3/15/13...: Southern California Edison Company. Description: GIA and Distribution Service Agreement CA PV Energy, LLC...
78 FR 49504 - Combined Notice of Filings #2
Federal Register 2010, 2011, 2012, 2013, 2014
2013-08-14
... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings 2 Take notice that the Commission received the following electric rate filings: Docket Numbers: ER13-2065-000. Applicants: Southern California Edison Company. Description: Amended SGIA & Distribution Service Agmt with Lancaster Little Rock C LLC to be effective ...
78 FR 38705 - Combined Notice of Filings #2
Federal Register 2010, 2011, 2012, 2013, 2014
2013-06-27
... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings 2 Take notice that the Commission received the following electric rate filings: Docket Numbers: ER13-1188-010. Applicants: Pacific Gas and Electric Company. Description: Pacific Gas and Electric Company. submits Wholesale Distribution Tariff Rate Case 2013 (WDT2...
MISR HDF-to-Binary Converter and Radiance/BRF Calculation Tools
Atmospheric Science Data Center
2013-04-01
... to have the HDF and HDF-EOS libraries for the target computer. The HDF libraries are available from The HDF Group (THG) . The ... and the HDF-EOS include and library files on the target computer. The following files are included in the distribution tar file for ...
The Distributed Structure-Searchable Toxicity (DSSTox) ARYEXP and GEOGSE files are newly published, structure-annotated files of the chemical-associated and chemical exposure-related summary experimental content contained in the ArrayExpress Repository and Gene Expression Omnibus...
Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Torres, Aaron
2015-10-20
Techniques are provided for storing files in a parallel computing system using different resolutions. A method is provided for storing at least one file generated by a distributed application in a parallel computing system. The file comprises one or more of a complete file and a sub-file. The method comprises the steps of obtaining semantic information related to the file; generating a plurality of replicas of the file with different resolutions based on the semantic information; and storing the file and the plurality of replicas of the file in one or more storage nodes of the parallel computing system. The different resolutions comprise, for example, a variable number of bits and/or a different sub-set of data elements from the file. A plurality of the sub-files can be merged to reproduce the file.
Unleashing spatially distributed ecohydrology modeling using Big Data tools
NASA Astrophysics Data System (ADS)
Miles, B.; Idaszak, R.
2015-12-01
Physically based spatially distributed ecohydrology models are useful for answering science and management questions related to the hydrology and biogeochemistry of prairie, savanna, forested, as well as urbanized ecosystems. However, these models can produce hundreds of gigabytes of spatial output for a single model run over decadal time scales when run at regional spatial scales and moderate spatial resolutions (~100-km2+ at 30-m spatial resolution) or when run for small watersheds at high spatial resolutions (~1-km2 at 3-m spatial resolution). Numerical data formats such as HDF5 can store arbitrarily large datasets. However even in HPC environments, there are practical limits on the size of single files that can be stored and reliably backed up. Even when such large datasets can be stored, querying and analyzing these data can suffer from poor performance due to memory limitations and I/O bottlenecks, for example on single workstations where memory and bandwidth are limited, or in HPC environments where data are stored separately from computational nodes. The difficulty of storing and analyzing spatial data from ecohydrology models limits our ability to harness these powerful tools. Big Data tools such as distributed databases have the potential to surmount the data storage and analysis challenges inherent to large spatial datasets. Distributed databases solve these problems by storing data close to computational nodes while enabling horizontal scalability and fault tolerance. Here we present the architecture of and preliminary results from PatchDB, a distributed datastore for managing spatial output from the Regional Hydro-Ecological Simulation System (RHESSys). The initial version of PatchDB uses message queueing to asynchronously write RHESSys model output to an Apache Cassandra cluster. Once stored in the cluster, these data can be efficiently queried to quickly produce both spatial visualizations for a particular variable (e.g. maps and animations), as well as point time series of arbitrary variables at arbitrary points in space within a watershed or river basin. By treating ecohydrology modeling as a Big Data problem, we hope to provide a platform for answering transformative science and management questions related to water quantity and quality in a world of non-stationary climate.
Massive Signal Analysis with Hadoop (Invited)
NASA Astrophysics Data System (ADS)
Addair, T.
2013-12-01
The Geophysical Monitoring Program (GMP) at Lawrence Livermore National Laboratory is in the process of transitioning from a primarily human-driven analysis pipeline to a more automated and exploratory system. Waveform correlation represents a significant part of this effort, and the results that come out of this processing could lead to the development of more sophisticated event detection and analysis systems that require less human interaction, and address fundamental shortcomings in existing systems. Furthermore, use of distributed IO systems fundamentally addresses a scalability concern for the GMP as our data holdings continue to grow rapidly. As the data volume increases, it becomes less reasonable to rely upon human analysts to sift through all the information. Not only is more automation essential to keeping up with the ingestion rate, but so too do we require faster and more sophisticated tools for visualizing and interacting with the data. These issues of scalability are not unique to GMP or the seismic domain. All across the lab, and throughout industry, we hear about the promise of 'big data' to address the need of quickly analyzing vast amounts of data in fundamentally new ways. Our waveform correlation system finds and correlates nearby seismic events across the entire Earth. In our original implementation of the system, we processed some 50 TB of data on an in-house traditional HPC cluster (44 cores, 1 filesystem) over the span of 42 days. Having determined the primary bottleneck in the performance to be reading waveforms off a single BlueArc file server, we began investigating distributed IO solutions like Hadoop. As a test case, we took a 1 TB subset of our data and ported it to Livermore Computing's development Hadoop cluster. Through a pilot project sponsored by Livermore Computing (LC), the GMP successfully implemented the waveform correlation system in the Hadoop distributed MapReduce computing framework. Hadoop is an open source implementation of the MapReduce distributed programming framework. We used the Hadoop scripting framework known as Pig for putting together the multi-job MapReduce pipeline used to extract as much parallelism as possible from the algorithms. We also made use the Sqoop data ingestion tool to pull metadata tables from our Oracle database into HDFS (the Hadoop Distributed Filesystem). Running on our in-house HPC cluster, processing this test dataset took 58 hours to complete. In contrast, running our Hadoop implementation on LC's 10 node (160 core) cluster, we were able to cross-correlate the 1 TB of nearby seismic events in just under 3 hours, over a factor of 19 improvement from our existing implementation. This project is one of the first major data mining and analysis tasks performed at the lab or anywhere else correlating the entire Earth's seismicity. Through the success of this project, we believe we've shown that a MapReduce solution can be appropriate for many large-scale Earth science data analysis and exploration problems. Given Hadoop's position as the dominant data analytics solution in industry, we believe Hadoop can be applied to many previously intractable Earth science problems.
The novel high-performance 3-D MT inverse solver
NASA Astrophysics Data System (ADS)
Kruglyakov, Mikhail; Geraskin, Alexey; Kuvshinov, Alexey
2016-04-01
We present novel, robust, scalable, and fast 3-D magnetotelluric (MT) inverse solver. The solver is written in multi-language paradigm to make it as efficient, readable and maintainable as possible. Separation of concerns and single responsibility concepts go through implementation of the solver. As a forward modelling engine a modern scalable solver extrEMe, based on contracting integral equation approach, is used. Iterative gradient-type (quasi-Newton) optimization scheme is invoked to search for (regularized) inverse problem solution, and adjoint source approach is used to calculate efficiently the gradient of the misfit. The inverse solver is able to deal with highly detailed and contrasting models, allows for working (separately or jointly) with any type of MT responses, and supports massive parallelization. Moreover, different parallelization strategies implemented in the code allow optimal usage of available computational resources for a given problem statement. To parameterize an inverse domain the so-called mask parameterization is implemented, which means that one can merge any subset of forward modelling cells in order to account for (usually) irregular distribution of observation sites. We report results of 3-D numerical experiments aimed at analysing the robustness, performance and scalability of the code. In particular, our computational experiments carried out at different platforms ranging from modern laptops to HPC Piz Daint (6th supercomputer in the world) demonstrate practically linear scalability of the code up to thousands of nodes.
Scalability improvements to NRLMOL for DFT calculations of large molecules
NASA Astrophysics Data System (ADS)
Diaz, Carlos Manuel
Advances in high performance computing (HPC) have provided a way to treat large, computationally demanding tasks using thousands of processors. With the development of more powerful HPC architectures, the need to create efficient and scalable code has grown more important. Electronic structure calculations are valuable in understanding experimental observations and are routinely used for new materials predictions. For the electronic structure calculations, the memory and computation time are proportional to the number of atoms. Memory requirements for these calculations scale as N2, where N is the number of atoms. While the recent advances in HPC offer platforms with large numbers of cores, the limited amount of memory available on a given node and poor scalability of the electronic structure code hinder their efficient usage of these platforms. This thesis will present some developments to overcome these bottlenecks in order to study large systems. These developments, which are implemented in the NRLMOL electronic structure code, involve the use of sparse matrix storage formats and the use of linear algebra using sparse and distributed matrices. These developments along with other related development now allow ground state density functional calculations using up to 25,000 basis functions and the excited state calculations using up to 17,000 basis functions while utilizing all cores on a node. An example on a light-harvesting triad molecule is described. Finally, future plans to further improve the scalability will be presented.
OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets.
García-Pedrajas, Nicolás; Perez-Rodríguez, Javier; de Haro-García, Aida
2013-02-01
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
Oceans 2.0: Interactive tools for the Visualization of Multi-dimensional Ocean Sensor Data
NASA Astrophysics Data System (ADS)
Biffard, B.; Valenzuela, M.; Conley, P.; MacArthur, M.; Tredger, S.; Guillemot, E.; Pirenne, B.
2016-12-01
Ocean Networks Canada (ONC) operates ocean observatories on all three of Canada's coasts. The instruments produce 280 gigabytes of data per day with 1/2 petabyte archived so far. In 2015, 13 terabytes were downloaded by over 500 users from across the world. ONC's data management system is referred to as "Oceans 2.0" owing to its interactive, participative features. A key element of Oceans 2.0 is real time data acquisition and processing: custom device drivers implement the input-output protocol of each instrument. Automatic parsing and calibration takes place on the fly, followed by event detection and quality control. All raw data are stored in a file archive, while the processed data are copied to fast databases. Interactive access to processed data is provided through data download and visualization/quick look features that are adapted to diverse data types (scalar, acoustic, video, multi-dimensional, etc). Data may be post or re-processed to add features, analysis or correct errors, update calibrations, etc. A robust storage structure has been developed consisting of an extensive file system and a no-SQL database (Cassandra). Cassandra is a node-based open source distributed database management system. It is scalable and offers improved performance for big data. A key feature is data summarization. The system has also been integrated with web services and an ERDDAP OPeNDAP server, capable of serving scalar and multidimensional data from Cassandra for fixed or mobile devices.A complex data viewer has been developed making use of the big data capability to interactively display live or historic echo sounder and acoustic Doppler current profiler data, where users can scroll, apply processing filters and zoom through gigabytes of data with simple interactions. This new technology brings scientists one step closer to a comprehensive, web-based data analysis environment in which visual assessment, filtering, event detection and annotation can be integrated.
Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Torres, Aaron
2015-02-03
Techniques are provided for storing files in a parallel computing system using sub-files with semantically meaningful boundaries. A method is provided for storing at least one file generated by a distributed application in a parallel computing system. The file comprises one or more of a complete file and a plurality of sub-files. The method comprises the steps of obtaining a user specification of semantic information related to the file; providing the semantic information as a data structure description to a data formatting library write function; and storing the semantic information related to the file with one or more of the sub-files in one or more storage nodes of the parallel computing system. The semantic information provides a description of data in the file. The sub-files can be replicated based on semantically meaningful boundaries.
A Rich Metadata Filesystem for Scientific Data
ERIC Educational Resources Information Center
Bui, Hoang
2012-01-01
As scientific research becomes more data intensive, there is an increasing need for scalable, reliable, and high performance storage systems. Such data repositories must provide both data archival services and rich metadata, and cleanly integrate with large scale computing resources. ROARS is a hybrid approach to distributed storage that provides…
Sparse Distributed Representation and Hierarchy: Keys to Scalable Machine Intelligence
2016-04-01
Lesher, Jasmin Leveille, and Oliver Layton Neurithmic Systems, LLC APRIL 2016 Final Report Approved for public release...61101E 6. AUTHOR(S) Gerard (Rod) Rinkus, Greg Lesher, Jasmin Leveille, and Oliver Layton 5d. PROJECT NUMBER 1000 5e. TASK NUMBER N/A 5f. WORK
Simulation of the Impact of Packet Errors on the Kademlia Peer-to-Peer Routing
2010-09-01
during the routing process. Pastry [15] switches to a proximity based metric when approaching a node closely. This complicates the implementation...and Peter Druschel. Pastry : Scalable, distributed object location and routing for large-scale peer-to-peer systems. IFIP/ACM International Conference
A Highly Scalable Data Service (HSDS) using Cloud-based Storage Technologies for Earth Science Data
NASA Astrophysics Data System (ADS)
Michaelis, A.; Readey, J.; Votava, P.; Henderson, J.; Willmore, F.
2017-12-01
Cloud based infrastructure may offer several key benefits of scalability, built in redundancy, security mechanisms and reduced total cost of ownership as compared with a traditional data center approach. However, most of the tools and legacy software systems developed for online data repositories within the federal government were not developed with a cloud based infrastructure in mind and do not fully take advantage of commonly available cloud-based technologies. Moreover, services bases on object storage are well established and provided through all the leading cloud service providers (Amazon Web Service, Microsoft Azure, Google Cloud, etc…) of which can often provide unmatched "scale-out" capabilities and data availability to a large and growing consumer base at a price point unachievable from in-house solutions. We describe a system that utilizes object storage rather than traditional file system based storage to vend earth science data. The system described is not only cost effective, but shows a performance advantage for running many different analytics tasks in the cloud. To enable compatibility with existing tools and applications, we outline client libraries that are API compatible with existing libraries for HDF5 and NetCDF4. Performance of the system is demonstrated using clouds services running on Amazon Web Services.
Building an organic block storage service at CERN with Ceph
NASA Astrophysics Data System (ADS)
van der Ster, Daniel; Wiebalck, Arne
2014-06-01
Emerging storage requirements, such as the need for block storage for both OpenStack VMs and file services like AFS and NFS, have motivated the development of a generic backend storage service for CERN IT. The goals for such a service include (a) vendor neutrality, (b) horizontal scalability with commodity hardware, (c) fault tolerance at the disk, host, and network levels, and (d) support for geo-replication. Ceph is an attractive option due to its native block device layer RBD which is built upon its scalable, reliable, and performant object storage system, RADOS. It can be considered an "organic" storage solution because of its ability to balance and heal itself while living on an ever-changing set of heterogeneous disk servers. This work will present the outcome of a petabyte-scale test deployment of Ceph by CERN IT. We will first present the architecture and configuration of our cluster, including a summary of best practices learned from the community and discovered internally. Next the results of various functionality and performance tests will be shown: the cluster has been used as a backend block storage system for AFS and NFS servers as well as a large OpenStack cluster at CERN. Finally, we will discuss the next steps and future possibilities for Ceph at CERN.
Incorporating uncertainty in RADTRAN 6.0 input files.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dennis, Matthew L.; Weiner, Ruth F.; Heames, Terence John
Uncertainty may be introduced into RADTRAN analyses by distributing input parameters. The MELCOR Uncertainty Engine (Gauntt and Erickson, 2004) has been adapted for use in RADTRAN to determine the parameter shape and minimum and maximum of the distribution, to sample on the distribution, and to create an appropriate RADTRAN batch file. Coupling input parameters is not possible in this initial application. It is recommended that the analyst be very familiar with RADTRAN and able to edit or create a RADTRAN input file using a text editor before implementing the RADTRAN Uncertainty Analysis Module. Installation of the MELCOR Uncertainty Engine ismore » required for incorporation of uncertainty into RADTRAN. Gauntt and Erickson (2004) provides installation instructions as well as a description and user guide for the uncertainty engine.« less
Distributed Prognostics based on Structural Model Decomposition
NASA Technical Reports Server (NTRS)
Daigle, Matthew J.; Bregon, Anibal; Roychoudhury, I.
2014-01-01
Within systems health management, prognostics focuses on predicting the remaining useful life of a system. In the model-based prognostics paradigm, physics-based models are constructed that describe the operation of a system and how it fails. Such approaches consist of an estimation phase, in which the health state of the system is first identified, and a prediction phase, in which the health state is projected forward in time to determine the end of life. Centralized solutions to these problems are often computationally expensive, do not scale well as the size of the system grows, and introduce a single point of failure. In this paper, we propose a novel distributed model-based prognostics scheme that formally describes how to decompose both the estimation and prediction problems into independent local subproblems whose solutions may be easily composed into a global solution. The decomposition of the prognostics problem is achieved through structural decomposition of the underlying models. The decomposition algorithm creates from the global system model a set of local submodels suitable for prognostics. Independent local estimation and prediction problems are formed based on these local submodels, resulting in a scalable distributed prognostics approach that allows the local subproblems to be solved in parallel, thus offering increases in computational efficiency. Using a centrifugal pump as a case study, we perform a number of simulation-based experiments to demonstrate the distributed approach, compare the performance with a centralized approach, and establish its scalability. Index Terms-model-based prognostics, distributed prognostics, structural model decomposition ABBREVIATIONS
Scalable and expressive medical terminologies.
Mays, E; Weida, R; Dionne, R; Laker, M; White, B; Liang, C; Oles, F J
1996-01-01
The K-Rep system, based on description logic, is used to represent and reason with large and expressive controlled medical terminologies. Expressive concept descriptions incorporate semantically precise definitions composed using logical operators, together with important non-semantic information such as synonyms and codes. Examples are drawn from our experience with K-Rep in modeling the InterMed laboratory terminology and also developing a large clinical terminology now in production use at Kaiser-Permanente. System-level scalability of performance is achieved through an object-oriented database system which efficiently maps persistent memory to virtual memory. Equally important is conceptual scalability-the ability to support collaborative development, organization, and visualization of a substantial terminology as it evolves over time. K-Rep addresses this need by logically completing concept definitions and automatically classifying concepts in a taxonomy via subsumption inferences. The K-Rep system includes a general-purpose GUI environment for terminology development and browsing, a custom interface for formulary term maintenance, a C+2 application program interface, and a distributed client-server mode which provides lightweight clients with efficient run-time access to K-Rep by means of a scripting language.
Miao, Wang; Luo, Jun; Di Lucente, Stefano; Dorren, Harm; Calabretta, Nicola
2014-02-10
We propose and demonstrate an optical flat datacenter network based on scalable optical switch system with optical flow control. Modular structure with distributed control results in port-count independent optical switch reconfiguration time. RF tone in-band labeling technique allowing parallel processing of the label bits ensures the low latency operation regardless of the switch port-count. Hardware flow control is conducted at optical level by re-using the label wavelength without occupying extra bandwidth, space, and network resources which further improves the performance of latency within a simple structure. Dynamic switching including multicasting operation is validated for a 4 x 4 system. Error free operation of 40 Gb/s data packets has been achieved with only 1 dB penalty. The system could handle an input load up to 0.5 providing a packet loss lower that 10(-5) and an average latency less that 500 ns when a buffer size of 16 packets is employed. Investigation on scalability also indicates that the proposed system could potentially scale up to large port count with limited power penalty.
Statistical exchange-coupling errors and the practicality of scalable silicon donor qubits
NASA Astrophysics Data System (ADS)
Song, Yang; Das Sarma, S.
2016-12-01
Recent experimental efforts have led to considerable interest in donor-based localized electron spins in Si as viable qubits for a scalable silicon quantum computer. With the use of isotopically purified 28Si and the realization of extremely long spin coherence time in single-donor electrons, the recent experimental focus is on two-coupled donors with the eventual goal of a scaled-up quantum circuit. Motivated by this development, we simulate the statistical distribution of the exchange coupling J between a pair of donors under realistic donor placement straggles, and quantify the errors relative to the intended J value. With J values in a broad range of donor-pair separation ( 5 <|R |<60 nm), we work out various cases systematically, for a target donor separation R0 along the [001], [110] and [111] Si crystallographic directions, with |R0|=10 ,20 or 30 nm and standard deviation σR=1 ,2 ,5 or 10 nm. Our extensive theoretical results demonstrate the great challenge for a prescribed J gate even with just a donor pair, a first step for any scalable Si-donor-based quantum computer.
NASA Technical Reports Server (NTRS)
Rustay, R. C.; Gajjar, J. T.; Rankin, R. W.; Wentz, R. C.; Wooding, R.
1982-01-01
Listings of source programs and some illustrative examples of various ASCII data base files are presented. The listings are grouped into the following categories: main programs, subroutine programs, illustrative ASCII data base files. Within each category files are listed alphabetically.
Packet-aware transport for video distribution [Invited
NASA Astrophysics Data System (ADS)
Aguirre-Torres, Luis; Rosenfeld, Gady; Bruckman, Leon; O'Connor, Mannix
2006-05-01
We describe a solution based on resilient packet rings (RPR) for the distribution of broadcast video and video-on-demand (VoD) content over a packet-aware transport network. The proposed solution is based on our experience in the design and deployment of nationwide Triple Play networks and relies on technologies such as RPR, multiprotocol label switching (MPLS), and virtual private LAN service (VPLS) to provide the most efficient solution in terms of utilization, scalability, and availability.
Scalable Light Module for Low-Cost, High-Efficiency Light- Emitting Diode Luminaires
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tarsa, Eric
2015-08-31
During this two-year program Cree developed a scalable, modular optical architecture for low-cost, high-efficacy light emitting diode (LED) luminaires. Stated simply, the goal of this architecture was to efficiently and cost-effectively convey light from LEDs (point sources) to broad luminaire surfaces (area sources). By simultaneously developing warm-white LED components and low-cost, scalable optical elements, a high system optical efficiency resulted. To meet program goals, Cree evaluated novel approaches to improve LED component efficacy at high color quality while not sacrificing LED optical efficiency relative to conventional packages. Meanwhile, efficiently coupling light from LEDs into modular optical elements, followed by optimallymore » distributing and extracting this light, were challenges that were addressed via novel optical design coupled with frequent experimental evaluations. Minimizing luminaire bill of materials and assembly costs were two guiding principles for all design work, in the effort to achieve luminaires with significantly lower normalized cost ($/klm) than existing LED fixtures. Chief project accomplishments included the achievement of >150 lm/W warm-white LEDs having primary optics compatible with low-cost modular optical elements. In addition, a prototype Light Module optical efficiency of over 90% was measured, demonstrating the potential of this scalable architecture for ultra-high-efficacy LED luminaires. Since the project ended, Cree has continued to evaluate optical element fabrication and assembly methods in an effort to rapidly transfer this scalable, cost-effective technology to Cree production development groups. The Light Module concept is likely to make a strong contribution to the development of new cost-effective, high-efficacy luminaries, thereby accelerating widespread adoption of energy-saving SSL in the U.S.« less
Scalable Integrated Multi-Mission Support System Simulator Release 3.0
NASA Technical Reports Server (NTRS)
Kim, John; Velamuri, Sarma; Casey, Taylor; Bemann, Travis
2012-01-01
The Scalable Integrated Multi-mission Support System (SIMSS) is a tool that performs a variety of test activities related to spacecraft simulations and ground segment checks. SIMSS is a distributed, component-based, plug-and-play client-server system useful for performing real-time monitoring and communications testing. SIMSS runs on one or more workstations and is designed to be user-configurable or to use predefined configurations for routine operations. SIMSS consists of more than 100 modules that can be configured to create, receive, process, and/or transmit data. The SIMSS/GMSEC innovation is intended to provide missions with a low-cost solution for implementing their ground systems, as well as significantly reducing a mission s integration time and risk.
Chelonia: A self-healing, replicated storage system
NASA Astrophysics Data System (ADS)
Kerr Nilsen, Jon; Toor, Salman; Nagy, Zsombor; Read, Alex
2011-12-01
Chelonia is a novel grid storage system designed to fill the requirements gap between those of large, sophisticated scientific collaborations which have adopted the grid paradigm for their distributed storage needs, and of corporate business communities gravitating towards the cloud paradigm. Chelonia is an integrated system of heterogeneous, geographically dispersed storage sites which is easily and dynamically expandable and optimized for high availability and scalability. The architecture and implementation in term of web-services running inside the Advanced Resource Connector Hosting Environment Dameon (ARC HED) are described and results of tests in both local -area and wide-area networks that demonstrate the fault tolerance, stability and scalability of Chelonia will be presented. In addition, example setups for production deployments for small and medium-sized VO's are described.
Scalable Molecular Dynamics with NAMD
Phillips, James C.; Braun, Rosemary; Wang, Wei; Gumbart, James; Tajkhorshid, Emad; Villa, Elizabeth; Chipot, Christophe; Skeel, Robert D.; Kalé, Laxmikant; Schulten, Klaus
2008-01-01
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD scales to hundreds of processors on high-end parallel platforms, as well as tens of processors on low-cost commodity clusters, and also runs on individual desktop and laptop computers. NAMD works with AMBER and CHARMM potential functions, parameters, and file formats. This paper, directed to novices as well as experts, first introduces concepts and methods used in the NAMD program, describing the classical molecular dynamics force field, equations of motion, and integration methods along with the efficient electrostatics evaluation algorithms employed and temperature and pressure controls used. Features for steering the simulation across barriers and for calculating both alchemical and conformational free energy differences are presented. The motivations for and a roadmap to the internal design of NAMD, implemented in C++ and based on Charm++ parallel objects, are outlined. The factors affecting the serial and parallel performance of a simulation are discussed. Next, typical NAMD use is illustrated with representative applications to a small, a medium, and a large biomolecular system, highlighting particular features of NAMD, e.g., the Tcl scripting language. Finally, the paper provides a list of the key features of NAMD and discusses the benefits of combining NAMD with the molecular graphics/sequence analysis software VMD and the grid computing/collaboratory software BioCoRE. NAMD is distributed free of charge with source code at www.ks.uiuc.edu. PMID:16222654
Rasdaman for Big Spatial Raster Data
NASA Astrophysics Data System (ADS)
Hu, F.; Huang, Q.; Scheele, C. J.; Yang, C. P.; Yu, M.; Liu, K.
2015-12-01
Spatial raster data have grown exponentially over the past decade. Recent advancements on data acquisition technology, such as remote sensing, have allowed us to collect massive observation data of various spatial resolution and domain coverage. The volume, velocity, and variety of such spatial data, along with the computational intensive nature of spatial queries, pose grand challenge to the storage technologies for effective big data management. While high performance computing platforms (e.g., cloud computing) can be used to solve the computing-intensive issues in big data analysis, data has to be managed in a way that is suitable for distributed parallel processing. Recently, rasdaman (raster data manager) has emerged as a scalable and cost-effective database solution to store and retrieve massive multi-dimensional arrays, such as sensor, image, and statistics data. Within this paper, the pros and cons of using rasdaman to manage and query spatial raster data will be examined and compared with other common approaches, including file-based systems, relational databases (e.g., PostgreSQL/PostGIS), and NoSQL databases (e.g., MongoDB and Hive). Earth Observing System (EOS) data collected from NASA's Atmospheric Scientific Data Center (ASDC) will be used and stored in these selected database systems, and a set of spatial and non-spatial queries will be designed to benchmark their performance on retrieving large-scale, multi-dimensional arrays of EOS data. Lessons learnt from using rasdaman will be discussed as well.
OceanXtremes: Scalable Anomaly Detection in Oceanographic Time-Series
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Armstrong, E. M.; Chin, T. M.; Gill, K. M.; Greguska, F. R., III; Huang, T.; Jacob, J. C.; Quach, N.
2016-12-01
The oceanographic community must meet the challenge to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we are developing an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of 15 to 30-year ocean science datasets.Our parallel analytics engine is extending the NEXUS system and exploits multiple open-source technologies: Apache Cassandra as a distributed spatial "tile" cache, Apache Spark for in-memory parallel computation, and Apache Solr for spatial search and storing pre-computed tile statistics and other metadata. OceanXtremes provides these key capabilities: Parallel generation (Spark on a compute cluster) of 15 to 30-year Ocean Climatologies (e.g. sea surface temperature or SST) in hours or overnight, using simple pixel averages or customizable Gaussian-weighted "smoothing" over latitude, longitude, and time; Parallel pre-computation, tiling, and caching of anomaly fields (daily variables minus a chosen climatology) with pre-computed tile statistics; Parallel detection (over the time-series of tiles) of anomalies or phenomena by regional area-averages exceeding a specified threshold (e.g. high SST in El Nino or SST "blob" regions), or more complex, custom data mining algorithms; Shared discovery and exploration of ocean phenomena and anomalies (facet search using Solr), along with unexpected correlations between key measured variables; Scalable execution for all capabilities on a hybrid Cloud, using our on-premise OpenStack Cloud cluster or at Amazon. The key idea is that the parallel data-mining operations will be run "near" the ocean data archives (a local "network" hop) so that we can efficiently access the thousands of files making up a three decade time-series. The presentation will cover the architecture of OceanXtremes, parallelization of the climatology computation and anomaly detection algorithms using Spark, example results for SST and other time-series, and parallel performance metrics.
ECFS: A decentralized, distributed and fault-tolerant FUSE filesystem for the LHCb online farm
NASA Astrophysics Data System (ADS)
Rybczynski, Tomasz; Bonaccorsi, Enrico; Neufeld, Niko
2014-06-01
The LHCb experiment records millions of proton collisions every second, but only a fraction of them are useful for LHCb physics. In order to filter out the "bad events" a large farm of x86-servers (~2000 nodes) has been put in place. These servers boot from and run from NFS, however they use their local disk to temporarily store data, which cannot be processed in real-time ("data-deferring"). These events are subsequently processed, when there are no live-data coming in. The effective CPU power is thus greatly increased. This gain in CPU power depends critically on the availability of the local disks. For cost and power-reasons, mirroring (RAID-1) is not used, leading to a lot of operational headache with failing disks and disk-errors or server failures induced by faulty disks. To mitigate these problems and increase the reliability of the LHCb farm, while at same time keeping cost and power-consumption low, an extensive research and study of existing highly available and distributed file systems has been done. While many distributed file systems are providing reliability by "file replication", none of the evaluated ones supports erasure algorithms. A decentralised, distributed and fault-tolerant "write once read many" file system has been designed and implemented as a proof of concept providing fault tolerance without using expensive - in terms of disk space - file replication techniques and providing a unique namespace as a main goals. This paper describes the design and the implementation of the Erasure Codes File System (ECFS) and presents the specialised FUSE interface for Linux. Depending on the encoding algorithm ECFS will use a certain number of target directories as a backend to store the segments that compose the encoded data. When target directories are mounted via nfs/autofs - ECFS will act as a file-system over network/block-level raid over multiple servers.
Usage analysis of user files in UNIX
NASA Technical Reports Server (NTRS)
Devarakonda, Murthy V.; Iyer, Ravishankar K.
1987-01-01
Presented is a user-oriented analysis of short term file usage in a 4.2 BSD UNIX environment. The key aspect of this analysis is a characterization of users and files, which is a departure from the traditional approach of analyzing file references. Two characterization measures are employed: accesses-per-byte (combining fraction of a file referenced and number of references) and file size. This new approach is shown to distinguish differences in files as well as users, which cam be used in efficient file system design, and in creating realistic test workloads for simulations. A multi-stage gamma distribution is shown to closely model the file usage measures. Even though overall file sharing is small, some files belonging to a bulletin board system are accessed by many users, simultaneously and otherwise. Over 50% of users referenced files owned by other users, and over 80% of all files were involved in such references. Based on the differences in files and users, suggestions to improve the system performance were also made.
al3c: high-performance software for parameter inference using Approximate Bayesian Computation.
Stram, Alexander H; Marjoram, Paul; Chen, Gary K
2015-11-01
The development of Approximate Bayesian Computation (ABC) algorithms for parameter inference which are both computationally efficient and scalable in parallel computing environments is an important area of research. Monte Carlo rejection sampling, a fundamental component of ABC algorithms, is trivial to distribute over multiple processors but is inherently inefficient. While development of algorithms such as ABC Sequential Monte Carlo (ABC-SMC) help address the inherent inefficiencies of rejection sampling, such approaches are not as easily scaled on multiple processors. As a result, current Bayesian inference software offerings that use ABC-SMC lack the ability to scale in parallel computing environments. We present al3c, a C++ framework for implementing ABC-SMC in parallel. By requiring only that users define essential functions such as the simulation model and prior distribution function, al3c abstracts the user from both the complexities of parallel programming and the details of the ABC-SMC algorithm. By using the al3c framework, the user is able to scale the ABC-SMC algorithm in parallel computing environments for his or her specific application, with minimal programming overhead. al3c is offered as a static binary for Linux and OS-X computing environments. The user completes an XML configuration file and C++ plug-in template for the specific application, which are used by al3c to obtain the desired results. Users can download the static binaries, source code, reference documentation and examples (including those in this article) by visiting https://github.com/ahstram/al3c. astram@usc.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
SIOExplorer: Advances Across Disciplinary and Institutional Boundaries
NASA Astrophysics Data System (ADS)
Miller, S. P.; Clark, D.; Helly, J.; Sutton, D.; Houghton, T.
2004-12-01
Strategies for interoperability have been an underlying theme in the development of the SIOExplorer Digital Library. The project was launched three years ago to stabilize data from 700 cruises by the Scripps Institution of Oceanography (SIO), scattered across distributed laboratories and on various media, mostly off-line, including paper and at-risk magnetic tapes. The need for a comprehensive scalable approach to harvesting data from 40 years of evolving instrumentation, media and formats has resulted in the implementation of a digital library architecture that is ready for interoperability. Key metadata template files maintain the integrity of the metadata and data structures, allowing forward and backward compatibility throughout the project as metadata blocks evolve or data types are added. The overall growth of the library is managed by federating new collections in disciplines as needed, each with their own independent data publishing authority. We now have a total of four collections: SIO Cruises, SIO Photo Archives, the Seamount Catalog, and the new Educators' Collection for learning resources. The data types include high resolution meteorological observations, water profiles, biological and geological samples, gravity, magnetics, seafloor swath mapping sonar files, maps and visualization files. The library transactions across the Internet amount to approximately 50,000 hits and 6 GB of downloads each month. We are currently building a new Geological Collection with thousands of dredged rocks and cores, a Seismic Collection with 30 years of reflection data, and a Physical Oceanography Collection with 50 cruises of Hydrographic Doppler Sonar System (HDSS) deep acoustic current profiling data. For the user, a Java CruiseViewer provides an interactive portal to the all the federated collections. With CruiseViewer, contents can be discovered by keyword or geographic searches over a global map, metadata can be browsed, and objects can be displayed or scheduled for download. For computer applications, REST and SOAP web services are being implemented to allow computer-to-computer interoperability for applications to search and receive data across the Internet. Discussions are underway to extend this approach and establish a digital library at the Woods Hole Oceanographic Institution for cruise data as well as extensive submersible and ROV digital video and mapping data. These efforts have been supported by NSF NSDL, ITR and OCE awards.
Expanding Access and Usage of NASA Near Real-Time Imagery and Data
NASA Astrophysics Data System (ADS)
Cechini, M.; Murphy, K. J.; Boller, R. A.; Schmaltz, J. E.; Thompson, C. K.; Huang, T.; McGann, J. M.; Ilavajhala, S.; Alarcon, C.; Roberts, J. T.
2013-12-01
In late 2009, the Land Atmosphere Near-real-time Capability for EOS (LANCE) was created to greatly expand the range of near real-time data products from a variety of Earth Observing System (EOS) instruments. Since that time, NASA's Earth Observing System Data and Information System (EOSDIS) developed the Global Imagery Browse Services (GIBS) to provide highly responsive, scalable, and expandable imagery services that distribute near real-time imagery in an intuitive and geo-referenced format. The GIBS imagery services provide access through standards-based protocols such as the Open Geospatial Consortium (OGC) Web Map Tile Service (WMTS) and standard mapping file formats such as the Keyhole Markup Language (KML). Leveraging these standard mechanisms opens NASA near real-time imagery to a broad landscape of mapping libraries supporting mobile applications. By easily integrating with mobile application development libraries, GIBS makes it possible for NASA imagery to become a reliable and valuable source for end-user applications. Recently, EOSDIS has taken steps to integrate near real-time metadata products into the EOS ClearingHOuse (ECHO) metadata repository. Registration of near real-time metadata allows for near real-time data discovery through ECHO clients. In kind with the near real-time data processing requirements, the ECHO ingest model allows for low-latency metadata insertion and updates. Combining with the ECHO repository, the fast visual access of GIBS imagery can now be linked directly back to the source data file(s). Through the use of discovery standards such as OpenSearch, desktop and mobile applications can connect users to more than just an image. As data services, such as OGC Web Coverage Service, become more prevalent within the EOSDIS system, applications may even be able to connect users from imagery to data values. In addition, the full resolution GIBS imagery provides visual context to other GIS data and tools. The NASA near real-time imagery covers a broad set of Earth science disciplines. By leveraging the ECHO and GIBS services, these data can become a visual context within which other GIS activities are performed. The focus of this presentation is to discuss the GIBS imagery and ECHO metadata services facilitating near real-time discovery and usage. Existing synergies and future possibilities will also be discussed. The NASA Worldview demonstration client will be used to show an existing application combining the ECHO and GIBS services.
Heartbeat-based error diagnosis framework for distributed embedded systems
NASA Astrophysics Data System (ADS)
Mishra, Swagat; Khilar, Pabitra Mohan
2012-01-01
Distributed Embedded Systems have significant applications in automobile industry as steer-by-wire, fly-by-wire and brake-by-wire systems. In this paper, we provide a general framework for fault detection in a distributed embedded real time system. We use heartbeat monitoring, check pointing and model based redundancy to design a scalable framework that takes care of task scheduling, temperature control and diagnosis of faulty nodes in a distributed embedded system. This helps in diagnosis and shutting down of faulty actuators before the system becomes unsafe. The framework is designed and tested using a new simulation model consisting of virtual nodes working on a message passing system.
Heartbeat-based error diagnosis framework for distributed embedded systems
NASA Astrophysics Data System (ADS)
Mishra, Swagat; Khilar, Pabitra Mohan
2011-12-01
Distributed Embedded Systems have significant applications in automobile industry as steer-by-wire, fly-by-wire and brake-by-wire systems. In this paper, we provide a general framework for fault detection in a distributed embedded real time system. We use heartbeat monitoring, check pointing and model based redundancy to design a scalable framework that takes care of task scheduling, temperature control and diagnosis of faulty nodes in a distributed embedded system. This helps in diagnosis and shutting down of faulty actuators before the system becomes unsafe. The framework is designed and tested using a new simulation model consisting of virtual nodes working on a message passing system.
NASA Astrophysics Data System (ADS)
Dykstra, D.; Bockelman, B.; Blomer, J.; Herner, K.; Levshina, T.; Slyz, M.
2015-12-01
A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data is auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Relatively speaking, conditions data also tends to be relatively small per job where both event data and auxiliary data are larger per job. Unlike event data, auxiliary data comes from a limited working set of shared files. Since there is spatial locality of the auxiliary data access, the use case appears to be identical to that of the CernVM- Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called "alien cache" to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid and cached locally on the site with a convenient POSIX interface. This paper discusses the details of the architecture and reports performance measurements.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dykstra, D.; Bockelman, B.; Blomer, J.
A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data is auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Relatively speaking, conditions data also tends to be relatively small per job where both event data and auxiliary data are larger per job. Unlike event data, auxiliarymore » data comes from a limited working set of shared files. Since there is spatial locality of the auxiliary data access, the use case appears to be identical to that of the CernVM- Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called 'alien cache' to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid and cached locally on the site with a convenient POSIX interface. This paper discusses the details of the architecture and reports performance measurements.« less
In-database processing of a large collection of remote sensing data: applications and implementation
NASA Astrophysics Data System (ADS)
Kikhtenko, Vladimir; Mamash, Elena; Chubarov, Dmitri; Voronina, Polina
2016-04-01
Large archives of remote sensing data are now available to scientists, yet the need to work with individual satellite scenes or product files constrains studies that span a wide temporal range or spatial extent. The resources (storage capacity, computing power and network bandwidth) required for such studies are often beyond the capabilities of individual geoscientists. This problem has been tackled before in remote sensing research and inspired several information systems. Some of them such as NASA Giovanni [1] and Google Earth Engine have already proved their utility for science. Analysis tasks involving large volumes of numerical data are not unique to Earth Sciences. Recent advances in data science are enabled by the development of in-database processing engines that bring processing closer to storage, use declarative query languages to facilitate parallel scalability and provide high-level abstraction of the whole dataset. We build on the idea of bridging the gap between file archives containing remote sensing data and databases by integrating files into relational database as foreign data sources and performing analytical processing inside the database engine. Thereby higher level query language can efficiently address problems of arbitrary size: from accessing the data associated with a specific pixel or a grid cell to complex aggregation over spatial or temporal extents over a large number of individual data files. This approach was implemented using PostgreSQL for a Siberian regional archive of satellite data products holding hundreds of terabytes of measurements from multiple sensors and missions taken over a decade-long span. While preserving the original storage layout and therefore compatibility with existing applications the in-database processing engine provides a toolkit for provisioning remote sensing data in scientific workflows and applications. The use of SQL - a widely used higher level declarative query language - simplifies interoperability between desktop GIS, web applications and geographic web services and interactive scientific applications (MATLAB, IPython). The system is also automatically ingesting direct readout data from meteorological and research satellites in near-real time with distributed acquisition workflows managed by Taverna workflow engine [2]. The system has demonstrated its utility in performing non-trivial analytic processing such as the computation of the Robust Satellite Technique (RST) indices [3]. It had been useful in different tasks such as studying urban heat islands, analyzing patterns in the distribution of wildfire occurrences, detecting phenomena related to seismic and earthquake activity. Initial experience has highlighted several limitations of the proposed approach yet it has demonstrated ability to facilitate the use of large archives of remote sensing data by geoscientists. 1. J.G. Acker, G. Leptoukh, Online analysis enhances use of NASA Earth science data. EOS Trans. AGU, 2007, 88(2), P. 14-17. 2. D. Hull, K. Wolsfencroft, R. Stevens, C. Goble, M.R. Pocock, P. Li and T. Oinn, Taverna: a tool for building and running workflows of services. Nucleic Acids Research. 2006. V. 34. P. W729-W732. 3. V. Tramutoli, G. Di Bello, N. Pergola, S. Piscitelli, Robust satellite techniques for remote sensing of seismically active areas // Annals of Geophysics. 2001. no. 44(2). P. 295-312.
Snake River Plain Geothermal Play Fairway Analysis - Phase 1 KMZ files
John Shervais
2015-10-10
This dataset contain raw data files in kmz files (Google Earth georeference format). These files include volcanic vent locations and age, the distribution of fine-grained lacustrine sediments (which act as both a seal and an insulating layer for hydrothermal fluids), and post-Miocene faults compiled from the Idaho Geological Survey, the USGS Quaternary Fault database, and unpublished mapping. It also contains the Composite Common Risk Segment Map created during Phase 1 studies, as well as a file with locations of select deep wells used to interrogate the subsurface.
Transportation Network Topologies
NASA Technical Reports Server (NTRS)
Holmes, Bruce J.; Scott, John M.
2004-01-01
A discomforting reality has materialized on the transportation scene: our existing air and ground infrastructures will not scale to meet our nation's 21st century demands and expectations for mobility, commerce, safety, and security. The consequence of inaction is diminished quality of life and economic opportunity in the 21st century. Clearly, new thinking is required for transportation that can scale to meet to the realities of a networked, knowledge-based economy in which the value of time is a new coin of the realm. This paper proposes a framework, or topology, for thinking about the problem of scalability of the system of networks that comprise the aviation system. This framework highlights the role of integrated communication-navigation-surveillance systems in enabling scalability of future air transportation networks. Scalability, in this vein, is a goal of the recently formed Joint Planning and Development Office for the Next Generation Air Transportation System. New foundations for 21PstP thinking about air transportation are underpinned by several technological developments in the traditional aircraft disciplines as well as in communication, navigation, surveillance and information systems. Complexity science and modern network theory give rise to one of the technological developments of importance. Scale-free (i.e., scalable) networks represent a promising concept space for modeling airspace system architectures, and for assessing network performance in terms of scalability, efficiency, robustness, resilience, and other metrics. The paper offers an air transportation system topology as framework for transportation system innovation. Successful outcomes of innovation in air transportation could lay the foundations for new paradigms for aircraft and their operating capabilities, air transportation system architectures, and airspace architectures and procedural concepts. The topology proposed considers air transportation as a system of networks, within which strategies for scalability of the topology may be enabled by technologies and policies. In particular, the effects of scalable ICNS concepts are evaluated within this proposed topology. Alternative business models are appearing on the scene as the old centralized hub-and-spoke model reaches the limits of its scalability. These models include growth of point-to-point scheduled air transportation service (e.g., the RJ phenomenon and the 'Southwest Effect'). Another is a new business model for on-demand, widely distributed, air mobility in jet taxi services. The new businesses forming around this vision are targeting personal air mobility to virtually any of the thousands of origins and destinations throughout suburban, rural, and remote communities and regions. Such advancement in air mobility has many implications for requirements for airports, airspace, and consumers. These new paradigms could support scalable alternatives for the expansion of future air mobility to more consumers in more places.
Transportation Network Topologies
NASA Technical Reports Server (NTRS)
Holmes, Bruce J.; Scott, John
2004-01-01
A discomforting reality has materialized on the transportation scene: our existing air and ground infrastructures will not scale to meet our nation's 21st century demands and expectations for mobility, commerce, safety, and security. The consequence of inaction is diminished quality of life and economic opportunity in the 21st century. Clearly, new thinking is required for transportation that can scale to meet to the realities of a networked, knowledge-based economy in which the value of time is a new coin of the realm. This paper proposes a framework, or topology, for thinking about the problem of scalability of the system of networks that comprise the aviation system. This framework highlights the role of integrated communication-navigation-surveillance systems in enabling scalability of future air transportation networks. Scalability, in this vein, is a goal of the recently formed Joint Planning and Development Office for the Next Generation Air Transportation System. New foundations for 21st thinking about air transportation are underpinned by several technological developments in the traditional aircraft disciplines as well as in communication, navigation, surveillance and information systems. Complexity science and modern network theory give rise to one of the technological developments of importance. Scale-free (i.e., scalable) networks represent a promising concept space for modeling airspace system architectures, and for assessing network performance in terms of scalability, efficiency, robustness, resilience, and other metrics. The paper offers an air transportation system topology as framework for transportation system innovation. Successful outcomes of innovation in air transportation could lay the foundations for new paradigms for aircraft and their operating capabilities, air transportation system architectures, and airspace architectures and procedural concepts. The topology proposed considers air transportation as a system of networks, within which strategies for scalability of the topology may be enabled by technologies and policies. In particular, the effects of scalable ICNS concepts are evaluated within this proposed topology. Alternative business models are appearing on the scene as the old centralized hub-and-spoke model reaches the limits of its scalability. These models include growth of point-to-point scheduled air transportation service (e.g., the RJ phenomenon and the Southwest Effect). Another is a new business model for on-demand, widely distributed, air mobility in jet taxi services. The new businesses forming around this vision are targeting personal air mobility to virtually any of the thousands of origins and destinations throughout suburban, rural, and remote communities and regions. Such advancement in air mobility has many implications for requirements for airports, airspace, and consumers. These new paradigms could support scalable alternatives for the expansion of future air mobility to more consumers in more places.
EDOS Evolution to Support NASA Future Earth Sciences Missions
NASA Technical Reports Server (NTRS)
Cordier, Guy R.; McLemore, Bruce; Wood, Terri; Wilkinson, Chris
2010-01-01
This paper presents a ground system architecture to service future NASA decadal missions and in particular, the high rate science data downlinks, by evolving EDOS current infrastructure and upgrading high rate network lines. The paper will also cover EDOS participation to date in formulation and operations concepts for the respective missions to understand the particular mission needs and derived requirements such as data volumes, downlink rates, data encoding, and data latencies. Future decadal requirements such as onboard data recorder management and file protocols drive the need to emulate these requirements within the ground system. The EDOS open system modular architecture is scalable to accommodate additional missions using the current sites antennas and future sites as well and meet the data security requirements and fulfill mission's objectives
Russell, David A.; Freudenreich, Julien J.; Ciardiello, Joe J.; Sore, Hannah F.
2017-01-01
We describe stereocontrolled semi-syntheses of deguelin and tephrosin, anti-cancer rotenoids isolated from Tephrosia vogelii. Firstly, we present a new two-step transformation of rotenone into rot-2′-enonic acid via a zinc-mediated ring opening of rotenone hydrobromide. Secondly, following conversion of rot-2′-enonic acid into deguelin, a chromium-mediated hydroxylation provides tephrosin as a single diastereoisomer. An Étard-like reaction mechanism is proposed to account for the stereochemical outcome. Our syntheses of deguelin and tephrosin are operationally simple, scalable and high yielding, offering considerable advantages over previous methods. PMID:28134391
JBrowse: A dynamic web platform for genome visualization and analysis
Buels, Robert; Yao, Eric; Diesh, Colin M.; ...
2016-04-12
Background: JBrowse is a fast and full-featured genome browser built with JavaScript and HTML5. It is easily embedded into websites or apps but can also be served as a standalone web page. Results: Overall improvements to speed and scalability are accompanied by specific enhancements that support complex interactive queries on large track sets. Analysis functions can readily be added using the plugin framework; most visual aspects of tracks can also be customized, along with clicks, mouseovers, menus, and popup boxes. JBrowse can also be used to browse local annotation files offline and to generate high-resolution figures for publication. Conclusions: JBrowsemore » is a mature web application suitable for genome visualization and analysis.« less
Utilizing ORACLE tools within Unix
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ferguson, R.
1995-07-01
Large databases, by their very nature, often serve as repositories of data which may be needed by other systems. The transmission of this data to other systems has in the past involved several layers of human intervention. The Integrated Cargo Data Base (ICDB) developed by Martin Marietta Energy Systems for the Military Traffic Management Command as part of the Worldwide Port System provides data integration and worldwide tracking of cargo that passes through common-user ocean cargo ports. One of the key functions of ICDB is data distribution of a variety of data files to a number of other systems. Developmentmore » of automated data distribution procedures had to deal with the following constraints: (1) variable generation time for data files, (2) use of only current data for data files, (3) use of a minimum number of select statements, (4) creation of unique data files for multiple recipients, (5) automatic transmission of data files to recipients, and (6) avoidance of extensive and long-term data storage.« less
Accessing files in an Internet: The Jade file system
NASA Technical Reports Server (NTRS)
Peterson, Larry L.; Rao, Herman C.
1991-01-01
Jade is a new distribution file system that provides a uniform way to name and access files in an internet environment. It makes two important contributions. First, Jade is a logical system that integrates a heterogeneous collection of existing file systems, where heterogeneous means that the underlying file systems support different file access protocols. Jade is designed under the restriction that the underlying file system may not be modified. Second, rather than providing a global name space, Jade permits each user to define a private name space. These private name spaces support two novel features: they allow multiple file systems to be mounted under one directory, and they allow one logical name space to mount other logical name spaces. A prototype of the Jade File System was implemented on Sun Workstations running Unix. It consists of interfaces to the Unix file system, the Sun Network File System, the Andrew File System, and FTP. This paper motivates Jade's design, highlights several aspects of its implementation, and illustrates applications that can take advantage of its features.
Accessing files in an internet - The Jade file system
NASA Technical Reports Server (NTRS)
Rao, Herman C.; Peterson, Larry L.
1993-01-01
Jade is a new distribution file system that provides a uniform way to name and access files in an internet environment. It makes two important contributions. First, Jade is a logical system that integrates a heterogeneous collection of existing file systems, where heterogeneous means that the underlying file systems support different file access protocols. Jade is designed under the restriction that the underlying file system may not be modified. Second, rather than providing a global name space, Jade permits each user to define a private name space. These private name spaces support two novel features: they allow multiple file systems to be mounted under one directory, and they allow one logical name space to mount other logical name spaces. A prototype of the Jade File System was implemented on Sun Workstations running Unix. It consists of interfaces to the Unix file system, the Sun Network File System, the Andrew File System, and FTP. This paper motivates Jade's design, highlights several aspects of its implementation, and illustrates applications that can take advantage of its features.
Monte Carlo based, patient-specific RapidArc QA using Linac log files.
Teke, Tony; Bergman, Alanah M; Kwa, William; Gill, Bradford; Duzenli, Cheryl; Popescu, I Antoniu
2010-01-01
A Monte Carlo (MC) based QA process to validate the dynamic beam delivery accuracy for Varian RapidArc (Varian Medical Systems, Palo Alto, CA) using Linac delivery log files (DynaLog) is presented. Using DynaLog file analysis and MC simulations, the goal of this article is to (a) confirm that adequate sampling is used in the RapidArc optimization algorithm (177 static gantry angles) and (b) to assess the physical machine performance [gantry angle and monitor unit (MU) delivery accuracy]. Ten clinically acceptable RapidArc treatment plans were generated for various tumor sites and delivered to a water-equivalent cylindrical phantom on the treatment unit. Three Monte Carlo simulations were performed to calculate dose to the CT phantom image set: (a) One using a series of static gantry angles defined by 177 control points with treatment planning system (TPS) MLC control files (planning files), (b) one using continuous gantry rotation with TPS generated MLC control files, and (c) one using continuous gantry rotation with actual Linac delivery log files. Monte Carlo simulated dose distributions are compared to both ionization chamber point measurements and with RapidArc TPS calculated doses. The 3D dose distributions were compared using a 3D gamma-factor analysis, employing a 3%/3 mm distance-to-agreement criterion. The dose difference between MC simulations, TPS, and ionization chamber point measurements was less than 2.1%. For all plans, the MC calculated 3D dose distributions agreed well with the TPS calculated doses (gamma-factor values were less than 1 for more than 95% of the points considered). Machine performance QA was supplemented with an extensive DynaLog file analysis. A DynaLog file analysis showed that leaf position errors were less than 1 mm for 94% of the time and there were no leaf errors greater than 2.5 mm. The mean standard deviation in MU and gantry angle were 0.052 MU and 0.355 degrees, respectively, for the ten cases analyzed. The accuracy and flexibility of the Monte Carlo based RapidArc QA system were demonstrated. Good machine performance and accurate dose distribution delivery of RapidArc plans were observed. The sampling used in the TPS optimization algorithm was found to be adequate.
7 CFR 283.22 - Form; filing; service; proof of service; computation of time; and extensions of time.
Code of Federal Regulations, 2010 CFR
2010-01-01
... Agriculture (Continued) FOOD AND NUTRITION SERVICE, DEPARTMENT OF AGRICULTURE FOOD STAMP AND FOOD DISTRIBUTION...) Filing. Papers are considered filed when they are postmarked, or, received, if hand delivered. Date of... serving the document by personal delivery or by mail, setting forth the date, time and manner of service...
Heterogeneous distributed query processing: The DAVID system
NASA Technical Reports Server (NTRS)
Jacobs, Barry E.
1985-01-01
The objective of the Distributed Access View Integrated Database (DAVID) project is the development of an easy to use computer system with which NASA scientists, engineers and administrators can uniformly access distributed heterogeneous databases. Basically, DAVID will be a database management system that sits alongside already existing database and file management systems. Its function is to enable users to access the data in other languages and file systems without having to learn the data manipulation languages. Given here is an outline of a talk on the DAVID project and several charts.
A Distributed Platform for Global-Scale Agent-Based Models of Disease Transmission
Parker, Jon; Epstein, Joshua M.
2013-01-01
The Global-Scale Agent Model (GSAM) is presented. The GSAM is a high-performance distributed platform for agent-based epidemic modeling capable of simulating a disease outbreak in a population of several billion agents. It is unprecedented in its scale, its speed, and its use of Java. Solutions to multiple challenges inherent in distributing massive agent-based models are presented. Communication, synchronization, and memory usage are among the topics covered in detail. The memory usage discussion is Java specific. However, the communication and synchronization discussions apply broadly. We provide benchmarks illustrating the GSAM’s speed and scalability. PMID:24465120
The deployment of routing protocols in distributed control plane of SDN.
Jingjing, Zhou; Di, Cheng; Weiming, Wang; Rong, Jin; Xiaochun, Wu
2014-01-01
Software defined network (SDN) provides a programmable network through decoupling the data plane, control plane, and application plane from the original closed system, thus revolutionizing the existing network architecture to improve the performance and scalability. In this paper, we learned about the distributed characteristics of Kandoo architecture and, meanwhile, improved and optimized Kandoo's two levels of controllers based on ideological inspiration of RCP (routing control platform). Finally, we analyzed the deployment strategies of BGP and OSPF protocol in a distributed control plane of SDN. The simulation results show that our deployment strategies are superior to the traditional routing strategies.
Continuous Production of Discrete Plasmid DNA-Polycation Nanoparticles Using Flash Nanocomplexation.
Santos, Jose Luis; Ren, Yong; Vandermark, John; Archang, Maani M; Williford, John-Michael; Liu, Heng-Wen; Lee, Jason; Wang, Tza-Huei; Mao, Hai-Quan
2016-12-01
Despite successful demonstration of linear polyethyleneimine (lPEI) as an effective carrier for a wide range of gene medicine, including DNA plasmids, small interfering RNAs, mRNAs, etc., and continuous improvement of the physical properties and biological performance of the polyelectrolyte complex nanoparticles prepared from lPEI and nucleic acids, there still exist major challenges to produce these nanocomplexes in a scalable manner, particularly for lPEI/DNA nanoparticles. This has significantly hindered the progress toward clinical translation of these nanoparticle-based gene medicine. Here the authors report a flash nanocomplexation (FNC) method that achieves continuous production of lPEI/plasmid DNA nanoparticles with narrow size distribution using a confined impinging jet device. The method involves the complex coacervation of negatively charged DNA plasmid and positive charged lPEI under rapid, highly dynamic, and homogeneous mixing conditions, producing polyelectrolyte complex nanoparticles with narrow distribution of particle size and shape. The average number of plasmid DNA packaged per nanoparticles and its distribution are similar between the FNC method and the small-scale batch mixing method. In addition, the nanoparticles prepared by these two methods exhibit similar cell transfection efficiency. These results confirm that FNC is an effective and scalable method that can produce well-controlled lPEI/plasmid DNA nanoparticles. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Continuous Production of Discrete Plasmid DNA-Polycation Nanoparticles Using Flash Nanocomplexation
Santos, Jose Luis; Ren, Yong; Vandermark, John; Archang, Maani M.; Williford, John-Michael; Liu, Heng-wen; Lee, Jason; Wang, Tza-Huei; Mao, Hai-Quan
2016-01-01
Despite successful demonstration of linear polyethyleneimine (lPEI) as an effective carrier for a wide range of gene medicine, including DNA plasmids, small interfering RNAs, mRNAs, etc., and continuous improvement of the physical properties and biological performance of the polyelectrolyte complex nanoparticles prepared from lPEI and nucleic acids, there still exist major challenges to produce these nanocomplexes in a scalable manner, particularly for lPEI/DNA nanoparticles. This has significantly hindered the progress towards clinical translation of these nanoparticle-based gene medicine. Here we report a flash nanocomplexation (FNC) method that achieves continuous production of lPEI/plasmid DNA nanoparticles with narrow size distribution using a confined impinging jet device. The method involves the complex coacervation of negatively charged DNA plasmid and positive charged lPEI under rapid, highly dynamic, and homogeneous mixing conditions, producing polyelectrolyte complex nanoparticles with narrow distribution of particle size and shape. The average number of plasmid DNA packaged per nanoparticles and its distribution are similar between the FNC method and the small-scale batch mixing method. In addition, the nanoparticles prepared by these two methods exhibit similar cell transfection efficiency. These results confirm that FNC is an effective and scalable method that can produce well-controlled lPEI/plasmid DNA nanoparticles. PMID:27717227
A Key Pre-Distribution Scheme Based on µ-PBIBD for Enhancing Resilience in Wireless Sensor Networks.
Yuan, Qi; Ma, Chunguang; Yu, Haitao; Bian, Xuefen
2018-05-12
Many key pre-distribution (KPD) schemes based on combinatorial design were proposed for secure communication of wireless sensor networks (WSNs). Due to complexity of constructing the combinatorial design, it is infeasible to generate key rings using the corresponding combinatorial design in large scale deployment of WSNs. In this paper, we present a definition of new combinatorial design, termed “µ-partially balanced incomplete block design (µ-PBIBD)”, which is a refinement of partially balanced incomplete block design (PBIBD), and then describe a 2-D construction of µ-PBIBD which is mapped to KPD in WSNs. Our approach is of simple construction which provides a strong key connectivity and a poor network resilience. To improve the network resilience of KPD based on 2-D µ-PBIBD, we propose a KPD scheme based on 3-D Ex-µ-PBIBD which is a construction of µ-PBIBD from 2-D space to 3-D space. Ex-µ-PBIBD KPD scheme improves network scalability and resilience while has better key connectivity. Theoretical analysis and comparison with the related schemes show that key pre-distribution scheme based on Ex-µ-PBIBD provides high network resilience and better key scalability, while it achieves a trade-off between network resilience and network connectivity.
Flexible session management in a distributed environment
NASA Astrophysics Data System (ADS)
Miller, Zach; Bradley, Dan; Tannenbaum, Todd; Sfiligoi, Igor
2010-04-01
Many secure communication libraries used by distributed systems, such as SSL, TLS, and Kerberos, fail to make a clear distinction between the authentication, session, and communication layers. In this paper we introduce CEDAR, the secure communication library used by the Condor High Throughput Computing software, and present the advantages to a distributed computing system resulting from CEDAR's separation of these layers. Regardless of the authentication method used, CEDAR establishes a secure session key, which has the flexibility to be used for multiple capabilities. We demonstrate how a layered approach to security sessions can avoid round-trips and latency inherent in network authentication. The creation of a distinct session management layer allows for optimizations to improve scalability by way of delegating sessions to other components in the system. This session delegation creates a chain of trust that reduces the overhead of establishing secure connections and enables centralized enforcement of system-wide security policies. Additionally, secure channels based upon UDP datagrams are often overlooked by existing libraries; we show how CEDAR's structure accommodates this as well. As an example of the utility of this work, we show how the use of delegated security sessions and other techniques inherent in CEDAR's architecture enables US CMS to meet their scalability requirements in deploying Condor over large-scale, wide-area grid systems.
A Key Pre-Distribution Scheme Based on µ-PBIBD for Enhancing Resilience in Wireless Sensor Networks
Yuan, Qi; Ma, Chunguang; Yu, Haitao; Bian, Xuefen
2018-01-01
Many key pre-distribution (KPD) schemes based on combinatorial design were proposed for secure communication of wireless sensor networks (WSNs). Due to complexity of constructing the combinatorial design, it is infeasible to generate key rings using the corresponding combinatorial design in large scale deployment of WSNs. In this paper, we present a definition of new combinatorial design, termed “µ-partially balanced incomplete block design (µ-PBIBD)”, which is a refinement of partially balanced incomplete block design (PBIBD), and then describe a 2-D construction of µ-PBIBD which is mapped to KPD in WSNs. Our approach is of simple construction which provides a strong key connectivity and a poor network resilience. To improve the network resilience of KPD based on 2-D µ-PBIBD, we propose a KPD scheme based on 3-D Ex-µ-PBIBD which is a construction of µ-PBIBD from 2-D space to 3-D space. Ex-µ-PBIBD KPD scheme improves network scalability and resilience while has better key connectivity. Theoretical analysis and comparison with the related schemes show that key pre-distribution scheme based on Ex-µ-PBIBD provides high network resilience and better key scalability, while it achieves a trade-off between network resilience and network connectivity. PMID:29757244
Towards a Cloud Based Smart Traffic Management Framework
NASA Astrophysics Data System (ADS)
Rahimi, M. M.; Hakimpour, F.
2017-09-01
Traffic big data has brought many opportunities for traffic management applications. However several challenges like heterogeneity, storage, management, processing and analysis of traffic big data may hinder their efficient and real-time applications. All these challenges call for well-adapted distributed framework for smart traffic management that can efficiently handle big traffic data integration, indexing, query processing, mining and analysis. In this paper, we present a novel, distributed, scalable and efficient framework for traffic management applications. The proposed cloud computing based framework can answer technical challenges for efficient and real-time storage, management, process and analyse of traffic big data. For evaluation of the framework, we have used OpenStreetMap (OSM) real trajectories and road network on a distributed environment. Our evaluation results indicate that speed of data importing to this framework exceeds 8000 records per second when the size of datasets is near to 5 million. We also evaluate performance of data retrieval in our proposed framework. The data retrieval speed exceeds 15000 records per second when the size of datasets is near to 5 million. We have also evaluated scalability and performance of our proposed framework using parallelisation of a critical pre-analysis in transportation applications. The results show that proposed framework achieves considerable performance and efficiency in traffic management applications.
NASA Technical Reports Server (NTRS)
Jefferies, K.
1994-01-01
OFFSET is a ray tracing computer code for optical analysis of a solar collector. The code models the flux distributions within the receiver cavity produced by reflections from the solar collector. It was developed to model the offset solar collector of the solar dynamic electric power system being developed for Space Station Freedom. OFFSET has been used to improve the understanding of the collector-receiver interface and to guide the efforts of NASA contractors also researching the optical components of the power system. The collector for Space Station Freedom consists of 19 hexagonal panels each containing 24 triangular, reflective facets. Current research is geared toward optimizing flux distribution inside the receiver via changes in collector design and receiver orientation. OFFSET offers many options for experimenting with the design of the system. The offset parabolic collector model configuration is determined by an input file of facet corner coordinates. The user may choose other configurations by changing this file, but to simulate collectors that have other than 19 groups of 24 triangular facets would require modification of the FORTRAN code. Each of the roughly 500 facets in the assembled collector may be independently aimed to smooth out, or tailor, the flux distribution on the receiver's wall. OFFSET simulates the effects of design changes such as in receiver aperture location, tilt angle, and collector facet contour. Unique features of OFFSET include: 1) equations developed to pseudo-randomly select ray originating sources on the Sun which appear evenly distributed and include solar limb darkening; 2) Cone-optics technique used to add surface specular error to the ray originating sources to determine the apparent ray sources of the reflected sun; 3) choice of facet reflective surface contour -- spherical, ideal parabolic, or toroidal; 4) Gaussian distributions of radial and tangential components of surface slope error added to the surface normals at the ten nodal points on each facet; and 5) color contour plots of receiver incident flux distribution generated by PATRAN processing of FORTRAN computer code output. OFFSET output includes a file of input data for confirmation, a PATRAN results file containing the values necessary to plot the flux distribution at the receiver surface, a PATRAN results file containing the intensity distribution on a 40 x 40 cm area of the receiver aperture plane, a data file containing calculated information on the system configuration, a file including the X-Y coordinates of the target points of each collector facet on the aperture opening, and twelve P/PLOT input data files to allow X-Y plotting of various results data. OFFSET is written in FORTRAN (70%) for the IBM VM operating system. The code contains PATRAN statements (12%) and P/PLOT statements (18%) for generating plots. Once the program has been run on VM (or an equivalent system), the PATRAN and P/PLOT files may be transferred to a DEC VAX (or equivalent system) with access to PATRAN for PATRAN post processing. OFFSET was written in 1988 and last updated in 1989. PATRAN is a registered trademark of PDA Engineering. IBM is a registered trademark of International Business Machines Corporation. DEC VAX is a registered trademark of Digital Equipment Corporation.
Security in the CernVM File System and the Frontier Distributed Database Caching System
NASA Astrophysics Data System (ADS)
Dykstra, D.; Blomer, J.
2014-06-01
Both the CernVM File System (CVMFS) and the Frontier Distributed Database Caching System (Frontier) distribute centrally updated data worldwide for LHC experiments using http proxy caches. Neither system provides privacy or access control on reading the data, but both control access to updates of the data and can guarantee the authenticity and integrity of the data transferred to clients over the internet. CVMFS has since its early days required digital signatures and secure hashes on all distributed data, and recently Frontier has added X.509-based authenticity and integrity checking. In this paper we detail and compare the security models of CVMFS and Frontier.
Distributed Storage Algorithm for Geospatial Image Data Based on Data Access Patterns.
Pan, Shaoming; Li, Yongkai; Xu, Zhengquan; Chong, Yanwen
2015-01-01
Declustering techniques are widely used in distributed environments to reduce query response time through parallel I/O by splitting large files into several small blocks and then distributing those blocks among multiple storage nodes. Unfortunately, however, many small geospatial image data files cannot be further split for distributed storage. In this paper, we propose a complete theoretical system for the distributed storage of small geospatial image data files based on mining the access patterns of geospatial image data using their historical access log information. First, an algorithm is developed to construct an access correlation matrix based on the analysis of the log information, which reveals the patterns of access to the geospatial image data. Then, a practical heuristic algorithm is developed to determine a reasonable solution based on the access correlation matrix. Finally, a number of comparative experiments are presented, demonstrating that our algorithm displays a higher total parallel access probability than those of other algorithms by approximately 10-15% and that the performance can be further improved by more than 20% by simultaneously applying a copy storage strategy. These experiments show that the algorithm can be applied in distributed environments to help realize parallel I/O and thereby improve system performance.
Architectural Principles and Experimentation of Distributed High Performance Virtual Clusters
ERIC Educational Resources Information Center
Younge, Andrew J.
2016-01-01
With the advent of virtualization and Infrastructure-as-a-Service (IaaS), the broader scientific computing community is considering the use of clouds for their scientific computing needs. This is due to the relative scalability, ease of use, advanced user environment customization abilities, and the many novel computing paradigms available for…
Scalable and Precise Abstraction of Programs for Trustworthy Software
2017-01-01
calculus for core Java. • 14 months: A systematic abstraction of core Java. • 18 months: A security auditor for core Java. • 24 months: A contract... auditor for full Java. • 42 months: A web-deployed service for security auditing. Approved for Public Release; Distribution Unlimited 4 4.0 RESULTS
Resource Management In Peer-To-Peer Networks: A Nadse Approach
NASA Astrophysics Data System (ADS)
Patel, R. B.; Garg, Vishal
2011-12-01
This article presents a common solution to Peer-to-Peer (P2P) network problems and distributed computing with the help of "Neighbor Assisted Distributed and Scalable Environment" (NADSE). NADSE supports both device and code mobility. In this article mainly we focus on the NADSE based resource management technique. How information dissemination and searching is speedup when using the NADSE service provider node in large network. Results show that performance of the NADSE network is better in comparison to Gnutella, and Freenet.
Lyceum: A Multi-Protocol Digital Library Gateway
NASA Technical Reports Server (NTRS)
Maa, Ming-Hokng; Nelson, Michael L.; Esler, Sandra L.
1997-01-01
Lyceum is a prototype scalable query gateway that provides a logically central interface to multi-protocol and physically distributed, digital libraries of scientific and technical information. Lyceum processes queries to multiple syntactically distinct search engines used by various distributed information servers from a single logically central interface without modification of the remote search engines. A working prototype (http://www.larc.nasa.gov/lyceum/) demonstrates the capabilities, potentials, and advantages of this type of meta-search engine by providing access to over 50 servers covering over 20 disciplines.
Proton beam therapy control system
Baumann, Michael A [Riverside, CA; Beloussov, Alexandre V [Bernardino, CA; Bakir, Julide [Alta Loma, CA; Armon, Deganit [Redlands, CA; Olsen, Howard B [Colton, CA; Salem, Dana [Riverside, CA
2008-07-08
A tiered communications architecture for managing network traffic in a distributed system. Communication between client or control computers and a plurality of hardware devices is administered by agent and monitor devices whose activities are coordinated to reduce the number of open channels or sockets. The communications architecture also improves the transparency and scalability of the distributed system by reducing network mapping dependence. The architecture is desirably implemented in a proton beam therapy system to provide flexible security policies which improve patent safety and facilitate system maintenance and development.
Proton beam therapy control system
Baumann, Michael A.; Beloussov, Alexandre V.; Bakir, Julide; Armon, Deganit; Olsen, Howard B.; Salem, Dana
2010-09-21
A tiered communications architecture for managing network traffic in a distributed system. Communication between client or control computers and a plurality of hardware devices is administered by agent and monitor devices whose activities are coordinated to reduce the number of open channels or sockets. The communications architecture also improves the transparency and scalability of the distributed system by reducing network mapping dependence. The architecture is desirably implemented in a proton beam therapy system to provide flexible security policies which improve patent safety and facilitate system maintenance and development.
Proton beam therapy control system
Baumann, Michael A; Beloussov, Alexandre V; Bakir, Julide; Armon, Deganit; Olsen, Howard B; Salem, Dana
2013-06-25
A tiered communications architecture for managing network traffic in a distributed system. Communication between client or control computers and a plurality of hardware devices is administered by agent and monitor devices whose activities are coordinated to reduce the number of open channels or sockets. The communications architecture also improves the transparency and scalability of the distributed system by reducing network mapping dependence. The architecture is desirably implemented in a proton beam therapy system to provide flexible security policies which improve patent safety and facilitate system maintenance and development.
Proton beam therapy control system
Baumann, Michael A; Beloussov, Alexandre V; Bakir, Julide; Armon, Deganit; Olsen, Howard B; Salem, Dana
2013-12-03
A tiered communications architecture for managing network traffic in a distributed system. Communication between client or control computers and a plurality of hardware devices is administered by agent and monitor devices whose activities are coordinated to reduce the number of open channels or sockets. The communications architecture also improves the transparency and scalability of the distributed system by reducing network mapping dependence. The architecture is desirably implemented in a proton beam therapy system to provide flexible security policies which improve patent safety and facilitate system maintenance and development.
On delay adjustment for dynamic load balancing in distributed virtual environments.
Deng, Yunhua; Lau, Rynson W H
2012-04-01
Distributed virtual environments (DVEs) are becoming very popular in recent years, due to the rapid growing of applications, such as massive multiplayer online games (MMOGs). As the number of concurrent users increases, scalability becomes one of the major challenges in designing an interactive DVE system. One solution to address this scalability problem is to adopt a multi-server architecture. While some methods focus on the quality of partitioning the load among the servers, others focus on the efficiency of the partitioning process itself. However, all these methods neglect the effect of network delay among the servers on the accuracy of the load balancing solutions. As we show in this paper, the change in the load of the servers due to network delay would affect the performance of the load balancing algorithm. In this work, we conduct a formal analysis of this problem and discuss two efficient delay adjustment schemes to address the problem. Our experimental results show that our proposed schemes can significantly improve the performance of the load balancing algorithm with neglectable computation overhead.
Yuan, Dajun; Lin, Wei; Guo, Rui; Wong, C P; Das, Suman
2012-06-01
Scalable fabrication of carbon nanotube (CNT) bundles is essential to future advances in several applications. Here, we report on the development of a simple, two-step method for fabricating vertically aligned and periodically distributed CNT bundles and periodically porous CNT films at the sub-micron scale. The method involves laser interference ablation (LIA) of an iron film followed by CNT growth via iron-catalyzed chemical vapor deposition. CNT bundles with square widths ranging from 0.5 to 1.5 µm in width, and 50-200 µm in length, are grown atop the patterned catalyst over areas spanning 8 cm(2). The CNT bundles exhibit a high degree of control over square width, orientation, uniformity, and periodicity. This simple scalable method of producing well-placed and oriented CNT bundles demonstrates a high application potential for wafer-scale integration of CNT structures into various device applications, including IC interconnects, field emitters, sensors, batteries, and optoelectronics, etc.
Hamiltonian Monte Carlo acceleration using surrogate functions with random bases.
Zhang, Cheng; Shahbaba, Babak; Zhao, Hongkai
2017-11-01
For big data analysis, high computational cost for Bayesian methods often limits their applications in practice. In recent years, there have been many attempts to improve computational efficiency of Bayesian inference. Here we propose an efficient and scalable computational technique for a state-of-the-art Markov chain Monte Carlo methods, namely, Hamiltonian Monte Carlo. The key idea is to explore and exploit the structure and regularity in parameter space for the underlying probabilistic model to construct an effective approximation of its geometric properties. To this end, we build a surrogate function to approximate the target distribution using properly chosen random bases and an efficient optimization process. The resulting method provides a flexible, scalable, and efficient sampling algorithm, which converges to the correct target distribution. We show that by choosing the basis functions and optimization process differently, our method can be related to other approaches for the construction of surrogate functions such as generalized additive models or Gaussian process models. Experiments based on simulated and real data show that our approach leads to substantially more efficient sampling algorithms compared to existing state-of-the-art methods.
NASA Astrophysics Data System (ADS)
Grasso, J. R.; Bachèlery, P.
Self-organized systems are often used to describe natural phenomena where power laws and scale invariant geometry are observed. The Piton de la Fournaise volcano shows power-law behavior in many aspects. These include the temporal distribution of eruptions, the frequency-size distributions of induced earthquakes, dikes, fissures, lava flows and interflow periods, all evidence of self-similarity over a finite scale range. We show that the bounds to scale-invariance can be used to derive geomechanical constraints on both the volcano structure and the volcano mechanics. We ascertain that the present magma bodies are multi-lens reservoirs in a quasi-eruptive condition, i.e. a marginally critical state. The scaling organization of dynamic fluid-induced observables on the volcano, such as fluid induced earthquakes, dikes and surface fissures, appears to be controlled by underlying static hierarchical structure (geology) similar to that proposed for fluid circulations in human physiology. The emergence of saturation lengths for the scalable volcanic observable argues for the finite scalability of complex naturally self-organized critical systems, including volcano dynamics.
Distributed reinforcement learning for adaptive and robust network intrusion response
NASA Astrophysics Data System (ADS)
Malialis, Kleanthis; Devlin, Sam; Kudenko, Daniel
2015-07-01
Distributed denial of service (DDoS) attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to defend against DDoS attacks where multiple reinforcement learning agents are installed on a set of routers and learn to rate-limit or throttle traffic towards a victim server. The focus of this paper is on online learning and scalability. We propose an approach that incorporates task decomposition, team rewards and a form of reward shaping called difference rewards. One of the novel characteristics of the proposed system is that it provides a decentralised coordinated response to the DDoS problem, thus being resilient to DDoS attacks themselves. The proposed system learns remarkably fast, thus being suitable for online learning. Furthermore, its scalability is successfully demonstrated in experiments involving 1000 learning agents. We compare our approach against a baseline and a popular state-of-the-art throttling technique from the network security literature and show that the proposed approach is more effective, adaptive to sophisticated attack rate dynamics and robust to agent failures.
Twiddlenet: Metadata Tagging and Data Dissemination in Mobile Device Networks
2007-09-01
hosting a distributed data dissemination application. Stated simply, there are a multitude of handheld devices on the market that can communicate in...content ( UGC ) across a network of distributed devices. This sharing is accomplished through the use of descriptive metadata tags that are assigned to a...file once it has been shared. These metadata files are uploaded to a centralized portal and arranged for efficient UGC location and searching
DSSTox chemical-index files for exposure-related ...
The Distributed Structure-Searchable Toxicity (DSSTox) ARYEXP and GEOGSE files are newly published, structure-annotated files of the chemical-associated and chemical exposure-related summary experimental content contained in the ArrayExpress Repository and Gene Expression Omnibus (GEO) Series (based on data extracted on September 20, 2008). ARYEXP and GEOGSE contain 887 and 1064 unique chemical substances mapped to 1835 and 2381 chemical exposure-related experiment accession IDs, respectively. The standardized files allow one to assess, compare and search the chemical content in each resource, in the context of the larger DSSTox toxicology data network, as well as across large public cheminformatics resources such as PubChem (http://pubchem.ncbi.nlm.nih.gov). The Distributed Structure-Searchable Toxicity (DSSTox) ARYEXP and GEOGSE files are newly published, structure-annotated files of the chemical-associated and chemical exposure-related summary experimental content contained in the ArrayExpress Repository and Gene Expression Omnibus (GEO) Series (based on data extracted on September 20, 2008). ARYEXP and GEOGSE contain 887 and 1064 unique chemical substances mapped to 1835 and 2381 chemical exposure-related experiment accession IDs, respectively. The standardized files allow one to assess, compare and search the chemical content in each resource, in the context of the larger DSSTox toxicology data network, as well as across large public cheminformatics resourc
Distributing File-Based Data to Remote Sites Within the BABAR Collaboration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gowdy, Stephen J.
BABAR [1] uses two formats for its data: Objectivity database and root [2] files. This poster concerns the distribution of the latter--for Objectivity data see [3]. The BABAR analysis data is stored in root files--one per physics run and analysis selection channel--maintained in a large directory tree. Currently BABAR has more than 4.5 TBytes in 200,000 root files. This data is (mostly) produced at SLAC, but is required for analysis at universities and research centers throughout the us and Europe. Two basic problems confront us when we seek to import bulk data from slac to an institute's local storage viamore » the network. We must determine which files must be imported (depending on the local site requirements and which files have already been imported), and we must make the optimum use of the network when transferring the data. Basic ftp-like tools (ftp, scp, etc) do not attempt to solve the first problem. More sophisticated tools like rsync [4], the widely-used mirror/synchronization program, compare local and remote file systems, checking for changes (based on file date, size and, if desired, an elaborate checksum) in order to only copy new or modified files. However rsync allows for only limited file selection. Also when, as in BABAR, an extremely large directory structure must be scanned, rsync can take several hours just to determine which files need to be copied. Although rsync (and scp) provides on-the-fly compression, it does not allow us to optimize the network transfer by using multiple streams, adjusting the tcp window size, or separating encrypted authentication from unencrypted data channels.« less
fastBMA: scalable network inference and transitive reduction.
Hung, Ling-Hong; Shi, Kaiyuan; Wu, Migao; Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee
2017-10-01
Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10 000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the 1 based on LASSO. The improved scalability allows it to calculate networks from genome scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (https://github.com/lhhunghimself/fastBMA), as part of the updated networkBMA Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as ready-to-deploy Docker images (https://hub.docker.com/r/biodepot/fastbma/). © The Authors 2017. Published by Oxford University Press.
A cloud-based multimodality case file for mobile devices.
Balkman, Jason D; Loehfelm, Thomas W
2014-01-01
Recent improvements in Web and mobile technology, along with the widespread use of handheld devices in radiology education, provide unique opportunities for creating scalable, universally accessible, portable image-rich radiology case files. A cloud database and a Web-based application for radiologic images were developed to create a mobile case file with reasonable usability, download performance, and image quality for teaching purposes. A total of 75 radiology cases related to breast, thoracic, gastrointestinal, musculoskeletal, and neuroimaging subspecialties were included in the database. Breast imaging cases are the focus of this article, as they best demonstrate handheld display capabilities across a wide variety of modalities. This case subset also illustrates methods for adapting radiologic content to cloud platforms and mobile devices. Readers will gain practical knowledge about storage and retrieval of cloud-based imaging data, an awareness of techniques used to adapt scrollable and high-resolution imaging content for the Web, and an appreciation for optimizing images for handheld devices. The evaluation of this software demonstrates the feasibility of adapting images from most imaging modalities to mobile devices, even in cases of full-field digital mammograms, where high resolution is required to represent subtle pathologic features. The cloud platform allows cases to be added and modified in real time by using only a standard Web browser with no application-specific software. Challenges remain in developing efficient ways to generate, modify, and upload radiologic and supplementary teaching content to this cloud-based platform. Online supplemental material is available for this article. ©RSNA, 2014.
A high performance hierarchical storage management system for the Canadian tier-1 centre at TRIUMF
NASA Astrophysics Data System (ADS)
Deatrich, D. C.; Liu, S. X.; Tafirout, R.
2010-04-01
We describe in this paper the design and implementation of Tapeguy, a high performance non-proprietary Hierarchical Storage Management (HSM) system which is interfaced to dCache for efficient tertiary storage operations. The system has been successfully implemented at the Canadian Tier-1 Centre at TRIUMF. The ATLAS experiment will collect a large amount of data (approximately 3.5 Petabytes each year). An efficient HSM system will play a crucial role in the success of the ATLAS Computing Model which is driven by intensive large-scale data analysis activities that will be performed on the Worldwide LHC Computing Grid infrastructure continuously. Tapeguy is Perl-based. It controls and manages data and tape libraries. Its architecture is scalable and includes Dataset Writing control, a Read-back Queuing mechanism and I/O tape drive load balancing as well as on-demand allocation of resources. A central MySQL database records metadata information for every file and transaction (for audit and performance evaluation), as well as an inventory of library elements. Tapeguy Dataset Writing was implemented to group files which are close in time and of similar type. Optional dataset path control dynamically allocates tape families and assign tapes to it. Tape flushing is based on various strategies: time, threshold or external callbacks mechanisms. Tapeguy Read-back Queuing reorders all read requests by using an elevator algorithm, avoiding unnecessary tape loading and unloading. Implementation of priorities will guarantee file delivery to all clients in a timely manner.
Fair-share scheduling algorithm for a tertiary storage system
NASA Astrophysics Data System (ADS)
Jakl, Pavel; Lauret, Jérôme; Šumbera, Michal
2010-04-01
Any experiment facing Peta bytes scale problems is in need for a highly scalable mass storage system (MSS) to keep a permanent copy of their valuable data. But beyond the permanent storage aspects, the sheer amount of data makes complete data-set availability onto live storage (centralized or aggregated space such as the one provided by Scalla/Xrootd) cost prohibitive implying that a dynamic population from MSS to faster storage is needed. One of the most challenging aspects of dealing with MSS is the robotic tape component. If a robotic system is used as the primary storage solution, the intrinsically long access times (latencies) can dramatically affect the overall performance. To speed the retrieval of such data, one could organize the requests according to criterion with an aim to deliver maximal data throughput. However, such approaches are often orthogonal to fair resource allocation and a trade-off between quality of service, responsiveness and throughput is necessary for achieving an optimal and practical implementation of a truly faire-share oriented file restore policy. Starting from an explanation of the key criterion of such a policy, we will present evaluations and comparisons of three different MSS file restoration algorithms which meet fair-share requirements, and discuss their respective merits. We will quantify their impact on a typical file restoration cycle for the RHIC/STAR experimental setup and this, within a development, analysis and production environment relying on a shared MSS service [1].
GOES-R GS Product Generation Infrastructure Operations
NASA Astrophysics Data System (ADS)
Blanton, M.; Gundy, J.
2012-12-01
GOES-R GS Product Generation Infrastructure Operations: The GOES-R Ground System (GS) will produce a much larger set of products with higher data density than previous GOES systems. This requires considerably greater compute and memory resources to achieve the necessary latency and availability for these products. Over time, new algorithms could be added and existing ones removed or updated, but the GOES-R GS cannot go down during this time. To meet these GOES-R GS processing needs, the Harris Corporation will implement a Product Generation (PG) infrastructure that is scalable, extensible, extendable, modular and reliable. The primary parts of the PG infrastructure are the Service Based Architecture (SBA), which includes the Distributed Data Fabric (DDF). The SBA is the middleware that encapsulates and manages science algorithms that generate products. The SBA is divided into three parts, the Executive, which manages and configures the algorithm as a service, the Dispatcher, which provides data to the algorithm, and the Strategy, which determines when the algorithm can execute with the available data. The SBA is a distributed architecture, with services connected to each other over a compute grid and is highly scalable. This plug-and-play architecture allows algorithms to be added, removed, or updated without affecting any other services or software currently running and producing data. Algorithms require product data from other algorithms, so a scalable and reliable messaging is necessary. The SBA uses the DDF to provide this data communication layer between algorithms. The DDF provides an abstract interface over a distributed and persistent multi-layered storage system (memory based caching above disk-based storage) and an event system that allows algorithm services to know when data is available and to get the data that they need to begin processing when they need it. Together, the SBA and the DDF provide a flexible, high performance architecture that can meet the needs of product processing now and as they grow in the future.
Computer-aided diagnosis of mammographic masses using geometric verification-based image retrieval
NASA Astrophysics Data System (ADS)
Li, Qingliang; Shi, Weili; Yang, Huamin; Zhang, Huimao; Li, Guoxin; Chen, Tao; Mori, Kensaku; Jiang, Zhengang
2017-03-01
Computer-Aided Diagnosis of masses in mammograms is an important indicator of breast cancer. The use of retrieval systems in breast examination is increasing gradually. In this respect, the method of exploiting the vocabulary tree framework and the inverted file in the mammographic masse retrieval have been proved high accuracy and excellent scalability. However it just considered the features in each image as a visual word and had ignored the spatial configurations of features. It greatly affect the retrieval performance. To overcome this drawback, we introduce the geometric verification method to retrieval in mammographic masses. First of all, we obtain corresponding match features based on the vocabulary tree framework and the inverted file. After that, we grasps the main point of local similarity characteristic of deformations in the local regions by constructing the circle regions of corresponding pairs. Meanwhile we segment the circle to express the geometric relationship of local matches in the area and generate the spatial encoding strictly. Finally we judge whether the matched features are correct or not, based on verifying the all spatial encoding are whether satisfied the geometric consistency. Experiments show the promising results of our approach.
Comparing NetCDF and SciDB on managing and querying 5D hydrologic dataset
NASA Astrophysics Data System (ADS)
Liu, Haicheng; Xiao, Xiao
2016-11-01
Efficiently extracting information from high dimensional hydro-meteorological modelling datasets requires smart solutions. Traditional methods are mostly based on files, which can be edited and accessed handily. But they have problems of efficiency due to contiguous storage structure. Others propose databases as an alternative for advantages such as native functionalities for manipulating multidimensional (MD) arrays, smart caching strategy and scalability. In this research, NetCDF file based solutions and the multidimensional array database management system (DBMS) SciDB applying chunked storage structure are benchmarked to determine the best solution for storing and querying 5D large hydrologic modelling dataset. The effect of data storage configurations including chunk size, dimension order and compression on query performance is explored. Results indicate that dimension order to organize storage of 5D data has significant influence on query performance if chunk size is very large. But the effect becomes insignificant when chunk size is properly set. Compression of SciDB mostly has negative influence on query performance. Caching is an advantage but may be influenced by execution of different query processes. On the whole, NetCDF solution without compression is in general more efficient than the SciDB DBMS.
Tracking linkage to HIV care for former prisoners
Montague, Brian T.; Rosen, David L.; Solomon, Liza; Nunn, Amy; Green, Traci; Costa, Michael; Baillargeon, Jacques; Wohl, David A.; Paar, David P.; Rich, Josiah D.; Study Group, on behalf of the LINCS
2012-01-01
Improving testing and uptake to care among highly impacted populations is a critical element of Seek, Test, Treat and Retain strategies for reducing HIV incidence in the community. HIV disproportionately impacts prisoners. Though, incarceration provides an opportunity to diagnose and initiate therapy, treatment is frequently disrupted after release. Though model programs exist to support linkage to care on release, there is a lack of scalable metrics with which to assess adequacy of linkage to care after release. The linking data from Ryan White program Client Level Data (CLD) files reported to HRSA with corrections release data offers an attractive means of generating these metrics. Identified only by use of a confidential encrypted Unique Client Identifier (eUCI) these CLD files allow collection of key clinical indicators across the system of Ryan White funded providers. Using eUCIs generated from corrections release data sets as a linkage tool, the time to the first service at community providers along with key clinical indicators of patient status at entry into care can be determined as measures of linkage adequacy. Using this strategy, high and low performing sites can be identified and best practices can be identified to reproduce these successes in other settings. PMID:22561157
VizieR Online Data Catalog: REFLEX Galaxy Cluster Survey catalogue (Boehringer+, 2004)
NASA Astrophysics Data System (ADS)
Boehringer, H.; Schuecker, P.; Guzzo, L.; Collins, C. A.; Voges, W.; Cruddace, R. G.; Ortiz-Gil, A.; Chincarini, G.; de Grandi, S.; Edge, A. C.; MacGillivray, H. T.; Neumann, D. M.; Schindler, S.; Shaver, P.
2004-05-01
The following tables provide the catalogue as well as several data files necessary to reproduce the sample preparation. These files are also required for the cosmological modeling of these observations in e.g. the study of the statistics of the large-scale structure of the matter distribution in the Universe and related cosmological tests. (13 data files).
A scalable neuroinformatics data flow for electrophysiological signals using MapReduce.
Jayapandian, Catherine; Wei, Annan; Ramesh, Priya; Zonjy, Bilal; Lhatoo, Samden D; Loparo, Kenneth; Zhang, Guo-Qiang; Sahoo, Satya S
2015-01-01
Data-driven neuroscience research is providing new insights in progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow that uses new data partitioning techniques to store and analyze electrophysiological signal in distributed computing infrastructure. The Cloudwave data flow uses MapReduce parallel programming algorithm to implement an integrated signal data processing pipeline that scales with large volume of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volume of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications.
A scalable neuroinformatics data flow for electrophysiological signals using MapReduce
Jayapandian, Catherine; Wei, Annan; Ramesh, Priya; Zonjy, Bilal; Lhatoo, Samden D.; Loparo, Kenneth; Zhang, Guo-Qiang; Sahoo, Satya S.
2015-01-01
Data-driven neuroscience research is providing new insights in progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow that uses new data partitioning techniques to store and analyze electrophysiological signal in distributed computing infrastructure. The Cloudwave data flow uses MapReduce parallel programming algorithm to implement an integrated signal data processing pipeline that scales with large volume of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volume of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications. PMID:25852536
Generative model selection using a scalable and size-independent complex network classifier
NASA Astrophysics Data System (ADS)
Motallebi, Sadegh; Aliakbary, Sadegh; Habibi, Jafar
2013-12-01
Real networks exhibit nontrivial topological features, such as heavy-tailed degree distribution, high clustering, and small-worldness. Researchers have developed several generative models for synthesizing artificial networks that are structurally similar to real networks. An important research problem is to identify the generative model that best fits to a target network. In this paper, we investigate this problem and our goal is to select the model that is able to generate graphs similar to a given network instance. By the means of generating synthetic networks with seven outstanding generative models, we have utilized machine learning methods to develop a decision tree for model selection. Our proposed method, which is named "Generative Model Selection for Complex Networks," outperforms existing methods with respect to accuracy, scalability, and size-independence.
Scalable Creation of Long-Lived Multipartite Entanglement
NASA Astrophysics Data System (ADS)
Kaufmann, H.; Ruster, T.; Schmiegelow, C. T.; Luda, M. A.; Kaushal, V.; Schulz, J.; von Lindenfels, D.; Schmidt-Kaler, F.; Poschinger, U. G.
2017-10-01
We demonstrate the deterministic generation of multipartite entanglement based on scalable methods. Four qubits are encoded in 40Ca+, stored in a microstructured segmented Paul trap. These qubits are sequentially entangled by laser-driven pairwise gate operations. Between these, the qubit register is dynamically reconfigured via ion shuttling operations, where ion crystals are separated and merged, and ions are moved in and out of a fixed laser interaction zone. A sequence consisting of three pairwise entangling gates yields a four-ion Greenberger-Horne-Zeilinger state |ψ ⟩=(1 /√{2 })(|0000 ⟩+|1111 ⟩) , and full quantum state tomography reveals a state fidelity of 94.4(3)%. We analyze the decoherence of this state and employ dynamic decoupling on the spatially distributed constituents to maintain 69(5)% coherence at a storage time of 1.1 sec.
The high performance parallel algorithm for Unified Gas-Kinetic Scheme
NASA Astrophysics Data System (ADS)
Li, Shiyi; Li, Qibing; Fu, Song; Xu, Jinxiu
2016-11-01
A high performance parallel algorithm for UGKS is developed to simulate three-dimensional flows internal and external on arbitrary grid system. The physical domain and velocity domain are divided into different blocks and distributed according to the two-dimensional Cartesian topology with intra-communicators in physical domain for data exchange and other intra-communicators in velocity domain for sum reduction to moment integrals. Numerical results of three-dimensional cavity flow and flow past a sphere agree well with the results from the existing studies and validate the applicability of the algorithm. The scalability of the algorithm is tested both on small (1-16) and large (729-5832) scale processors. The tested speed-up ratio is near linear ashind thus the efficiency is around 1, which reveals the good scalability of the present algorithm.
NASA Astrophysics Data System (ADS)
Yamamoto, K.; Murata, K.; Kimura, E.; Honda, R.
2006-12-01
In the Solar-Terrestrial Physics (STP) field, the amount of satellite observation data has been increasing every year. It is necessary to solve the following three problems to achieve large-scale statistical analyses of plenty of data. (i) More CPU power and larger memory and disk size are required. However, total powers of personal computers are not enough to analyze such amount of data. Super-computers provide a high performance CPU and rich memory area, but they are usually separated from the Internet or connected only for the purpose of programming or data file transfer. (ii) Most of the observation data files are managed at distributed data sites over the Internet. Users have to know where the data files are located. (iii) Since no common data format in the STP field is available now, users have to prepare reading program for each data by themselves. To overcome the problems (i) and (ii), we constructed a parallel and distributed data analysis environment based on the Gfarm reference implementation of the Grid Datafarm architecture. The Gfarm shares both computational resources and perform parallel distributed processings. In addition, the Gfarm provides the Gfarm filesystem which can be as virtual directory tree among nodes. The Gfarm environment is composed of three parts; a metadata server to manage distributed files information, filesystem nodes to provide computational resources and a client to throw a job into metadata server and manages data processing schedulings. In the present study, both data files and data processes are parallelized on the Gfarm with 6 file system nodes: CPU clock frequency of each node is Pentium V 1GHz, 256MB memory and40GB disk. To evaluate performances of the present Gfarm system, we scanned plenty of data files, the size of which is about 300MB for each, in three processing methods: sequential processing in one node, sequential processing by each node and parallel processing by each node. As a result, in comparison between the number of files and the elapsed time, parallel and distributed processing shorten the elapsed time to 1/5 than sequential processing. On the other hand, sequential processing times were shortened in another experiment, whose file size is smaller than 100KB. In this case, the elapsed time to scan one file is within one second. It implies that disk swap took place in case of parallel processing by each node. We note that the operation became unstable when the number of the files exceeded 1000. To overcome the problem (iii), we developed an original data class. This class supports our reading of data files with various data formats since it converts them into an original data format since it defines schemata for every type of data and encapsulates the structure of data files. In addition, since this class provides a function of time re-sampling, users can easily convert multiple data (array) with different time resolution into the same time resolution array. Finally, using the Gfarm, we achieved a high performance environment for large-scale statistical data analyses. It should be noted that the present method is effective only when one data file size is large enough. At present, we are restructuring the new Gfarm environment with 8 nodes: CPU is Athlon 64 x2 Dual Core 2GHz, 2GB memory and 1.2TB disk (using RAID0) for each node. Our original class is to be implemented on the new Gfarm environment. In the present talk, we show the latest results with applying the present system for data analyses with huge number of satellite observation data files.
NASA Astrophysics Data System (ADS)
Davies, Nigel; Raymond, Kerry; Blair, Gordon
1999-03-01
In recent years the distributed systems community has witnessed a growth in the number of conferences, leading to difficulties in tracking the literature and a consequent loss of awareness of work done by others in this important research domain. In an attempt to synthesize many of the smaller workshops and conferences in the field, and to bring together research communities which were becoming fragmented, IFIP staged Middleware'98: The IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing. The conference was widely publicized and attracted over 150 technical submissions including 135 full paper submissions. The final programme consisted of 28 papers, giving an acceptance ratio of a little over one in five. More crucially, the programme accurately reflected the state of the art in middleware research, addressing issues such as ORB architectures, engineering of large-scale systems and multimedia. The traditional role of middleware as a point of integration and service provision was clearly intact, but the programme stressed the importance of emerging `must-have' features such as support for extensibility, mobility and quality of service. The Middleware'98 conference was held in the Lake District, UK in September 1998. Over 160 delegates made the journey to one of the UK's most beautiful regions and contributed to a lively series of presentations and debates. A permanent record of the conference, including transcripts of the panel discussions which took place, is available at: http://www.comp.lancs.ac.uk/computing/middleware98/ Based on their original reviews and the reactions of delegates to the ensuing presentations we have selected six papers from the conference for publication in this special issue of Distributed Systems Engineering. The first paper, entitled `Jonathan: an open distributed processing environment in Java', by Dumant et al describes a minimal, modular ORB framework which can be used for supporting real-time and multimedia applications. The framework provides mechanisms by which services such as CORBA ORBs can be constructed as personalities which exploit the services provided by the underlying minimal kernel. The issue of engineering ORBs is taken further in the second paper, `The implementation of a high-performance ORB over multiple network transports' by Lo and Pope. This paper is of particular interest since it presents the concrete results of running a modern ORB, i.e. omniORB2, over a range of transport mechanisms, including TCP/IP, shared memory and ATM AAL5. However, in order for middleware to progress, future platforms must tackle the issue of scalability as well as that of performance. For this reason we have included two papers, `Systems support for scalable and fault tolerant Internet services' by Chawathe and Brewer and `A scalable middleware solution for advanced wide-area Web services' by van Steen et al, which address the problems inherent in developing scalable middleware. Although the two papers focus on different problems in this area, they are both motivated by the explosion of services and information made available through the World Wide Web. Indeed, the role of the World Wide Web as a component in middleware platforms featured prominently in the conference and this is reflected in our choice of the paper by Cao et al entitled `Active Cache: caching dynamic contents on the Web'. Motivated once again by the problems of scalability, Cao et al propose a system to support the caching of dynamic documents. This is achieved by enabling small applets to be cached along with pages and run by the cache servers. The issues of security, trust and resource utilization raised by such a system are explored in detail by the authors. Finally, `Mobile Java objects' by Hayton et al considers these issues still further as part of the authors' work on adding object mobility to Java. Together, the six papers contained within this issue of Distributed Systems Engineering capture the essence of Middleware'98 and demonstrate the progress that has been made in the field. Of particular note is the systems-oriented focus of these papers: the field has clearly matured beyond modelling and into the domain of advanced systems development. We hope that the papers contained here stimulate and inform you and we look forward to meeting you at a future Middleware conference.
Parallel file system with metadata distributed across partitioned key-value store c
Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron
2017-09-19
Improved techniques are provided for storing metadata associated with a plurality of sub-files associated with a single shared file in a parallel file system. The shared file is generated by a plurality of applications executing on a plurality of compute nodes. A compute node implements a Parallel Log Structured File System (PLFS) library to store at least one portion of the shared file generated by an application executing on the compute node and metadata for the at least one portion of the shared file on one or more object storage servers. The compute node is also configured to implement a partitioned data store for storing a partition of the metadata for the shared file, wherein the partitioned data store communicates with partitioned data stores on other compute nodes using a message passing interface. The partitioned data store can be implemented, for example, using Multidimensional Data Hashing Indexing Middleware (MDHIM).
Standard Populations (Millions) for Age-Adjustment - SEER Population Datasets
Download files containing standard population data for use in statististical software. The files contain the same data distributed with SEER*Stat software. You can also view the standard populations, either 19 age groups or single ages.
NASA Astrophysics Data System (ADS)
Wang, Z.; Gu, Z.; Chen, B.; Yuan, J.; Wang, C.
2016-12-01
The CHAOS-6 geomagnetic field model, presented in 2016 by the Denmark's national space institute (DTU Space), is a model of the near-Earth magnetic field. According the CHAOS-6 model, seven component data of geomagnetic filed at 30 observatories in China in 2015 and at 3 observatories in China spanning the time interval 2008.0-2016.5 were calculated. Also seven component data of geomagnetic filed from the geomagnetic data of practical observations in China was obtained. Based on the model calculated data and the practical data, we have compared and analyzed the spatial distribution and the secular variation of the geomagnetic field in China. There is obvious difference between the two type data. The CHAOS-6 model cannot describe the spatial distribution and the secular variation of the geomagnetic field in China with comparative precision because of the regional and local magnetic anomalies in China.
A History of the Andrew File System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bashear, Derrick
2011-02-22
Derrick Brashear and Jeffrey Altman will present a technical history of the evolution of Andrew File System starting with the early days of the Andrew Project at Carnegie Mellon through the commercialization by Transarc Corporation and IBM and a decade of OpenAFS. The talk will be technical with a focus on the various decisions and implementation trade-offs that were made over the course of AFS versions 1 through 4, the development of the Distributed Computing Environment Distributed File System (DCE DFS), and the course of the OpenAFS development community. The speakers will also discuss the various AFS branches developed atmore » the University of Michigan, Massachusetts Institute of Technology and Carnegie Mellon University.« less
Jungle Computing: Distributed Supercomputing Beyond Clusters, Grids, and Clouds
NASA Astrophysics Data System (ADS)
Seinstra, Frank J.; Maassen, Jason; van Nieuwpoort, Rob V.; Drost, Niels; van Kessel, Timo; van Werkhoven, Ben; Urbani, Jacopo; Jacobs, Ceriel; Kielmann, Thilo; Bal, Henri E.
In recent years, the application of high-performance and distributed computing in scientific practice has become increasingly wide spread. Among the most widely available platforms to scientists are clusters, grids, and cloud systems. Such infrastructures currently are undergoing revolutionary change due to the integration of many-core technologies, providing orders-of-magnitude speed improvements for selected compute kernels. With high-performance and distributed computing systems thus becoming more heterogeneous and hierarchical, programming complexity is vastly increased. Further complexities arise because urgent desire for scalability and issues including data distribution, software heterogeneity, and ad hoc hardware availability commonly force scientists into simultaneous use of multiple platforms (e.g., clusters, grids, and clouds used concurrently). A true computing jungle.
Robot-Beacon Distributed Range-Only SLAM for Resource-Constrained Operation
Torres-González, Arturo; Martínez-de Dios, Jose Ramiro; Ollero, Anibal
2017-01-01
This work deals with robot-sensor network cooperation where sensor nodes (beacons) are used as landmarks for Range-Only (RO) Simultaneous Localization and Mapping (SLAM). Most existing RO-SLAM techniques consider beacons as passive devices disregarding the sensing, computational and communication capabilities with which they are actually endowed. SLAM is a resource-demanding task. Besides the technological constraints of the robot and beacons, many applications impose further resource consumption limitations. This paper presents a scalable distributed RO-SLAM scheme for resource-constrained operation. It is capable of exploiting robot-beacon cooperation in order to improve SLAM accuracy while meeting a given resource consumption bound expressed as the maximum number of measurements that are integrated in SLAM per iteration. The proposed scheme combines a Sparse Extended Information Filter (SEIF) SLAM method, in which each beacon gathers and integrates robot-beacon and inter-beacon measurements, and a distributed information-driven measurement allocation tool that dynamically selects the measurements that are integrated in SLAM, balancing uncertainty improvement and resource consumption. The scheme adopts a robot-beacon distributed approach in which each beacon participates in the selection, gathering and integration in SLAM of robot-beacon and inter-beacon measurements, resulting in significant estimation accuracies, resource-consumption efficiency and scalability. It has been integrated in an octorotor Unmanned Aerial System (UAS) and evaluated in 3D SLAM outdoor experiments. The experimental results obtained show its performance and robustness and evidence its advantages over existing methods. PMID:28425946
pSCANNER: patient-centered Scalable National Network for Effectiveness Research
Ohno-Machado, Lucila; Agha, Zia; Bell, Douglas S; Dahm, Lisa; Day, Michele E; Doctor, Jason N; Gabriel, Davera; Kahlon, Maninder K; Kim, Katherine K; Hogarth, Michael; Matheny, Michael E; Meeker, Daniella; Nebeker, Jonathan R
2014-01-01
This article describes the patient-centered Scalable National Network for Effectiveness Research (pSCANNER), which is part of the recently formed PCORnet, a national network composed of learning healthcare systems and patient-powered research networks funded by the Patient Centered Outcomes Research Institute (PCORI). It is designed to be a stakeholder-governed federated network that uses a distributed architecture to integrate data from three existing networks covering over 21 million patients in all 50 states: (1) VA Informatics and Computing Infrastructure (VINCI), with data from Veteran Health Administration's 151 inpatient and 909 ambulatory care and community-based outpatient clinics; (2) the University of California Research exchange (UC-ReX) network, with data from UC Davis, Irvine, Los Angeles, San Francisco, and San Diego; and (3) SCANNER, a consortium of UCSD, Tennessee VA, and three federally qualified health systems in the Los Angeles area supplemented with claims and health information exchange data, led by the University of Southern California. Initial use cases will focus on three conditions: (1) congestive heart failure; (2) Kawasaki disease; (3) obesity. Stakeholders, such as patients, clinicians, and health service researchers, will be engaged to prioritize research questions to be answered through the network. We will use a privacy-preserving distributed computation model with synchronous and asynchronous modes. The distributed system will be based on a common data model that allows the construction and evaluation of distributed multivariate models for a variety of statistical analyses. PMID:24780722
Robot-Beacon Distributed Range-Only SLAM for Resource-Constrained Operation.
Torres-González, Arturo; Martínez-de Dios, Jose Ramiro; Ollero, Anibal
2017-04-20
This work deals with robot-sensor network cooperation where sensor nodes (beacons) are used as landmarks for Range-Only (RO) Simultaneous Localization and Mapping (SLAM). Most existing RO-SLAM techniques consider beacons as passive devices disregarding the sensing, computational and communication capabilities with which they are actually endowed. SLAM is a resource-demanding task. Besides the technological constraints of the robot and beacons, many applications impose further resource consumption limitations. This paper presents a scalable distributed RO-SLAM scheme for resource-constrained operation. It is capable of exploiting robot-beacon cooperation in order to improve SLAM accuracy while meeting a given resource consumption bound expressed as the maximum number of measurements that are integrated in SLAM per iteration. The proposed scheme combines a Sparse Extended Information Filter (SEIF) SLAM method, in which each beacon gathers and integrates robot-beacon and inter-beacon measurements, and a distributed information-driven measurement allocation tool that dynamically selects the measurements that are integrated in SLAM, balancing uncertainty improvement and resource consumption. The scheme adopts a robot-beacon distributed approach in which each beacon participates in the selection, gathering and integration in SLAM of robot-beacon and inter-beacon measurements, resulting in significant estimation accuracies, resource-consumption efficiency and scalability. It has been integrated in an octorotor Unmanned Aerial System (UAS) and evaluated in 3D SLAM outdoor experiments. The experimental results obtained show its performance and robustness and evidence its advantages over existing methods.
Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation
Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana; Schulze Grotthoff, Stefan
2013-01-01
The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to failures compared to existing aggregation methods. It is illustrated that on a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and that it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method rdmGS, which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms. PMID:24748902
NASA Technical Reports Server (NTRS)
2002-01-01
TRMM has acquired more than four years of data since its launch in November 1997. All TRMM standard products are processed by the TRMM Science Data and Information System (TSDIS) and archived and distributed to general users by the GES DAAC. Table 1 shows the total archive and distribution as of February 28, 2002. The Utilization Ratio (UR), defined as the ratio of the number of distributed files to the number of archived files, of the TRMM standard products has been steadily increasing since 1998 and is currently at 6.98.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Duro, Francisco Rodrigo; Garcia Blas, Javier; Isaila, Florin
This paper explores novel techniques for improving the performance of many-task workflows based on the Swift scripting language. We propose novel programmer options for automated distributed data placement and task scheduling. These options trigger a data placement mechanism used for distributing intermediate workflow data over the servers of Hercules, a distributed key-value store that can be used to cache file system data. We demonstrate that these new mechanisms can significantly improve the aggregated throughput of many-task workflows with up to 86x, reduce the contention on the shared file system, exploit the data locality, and trade off locality and load balance.
Understanding Customer Dissatisfaction with Underutilized Distributed File Servers
NASA Technical Reports Server (NTRS)
Riedel, Erik; Gibson, Garth
1996-01-01
An important trend in the design of storage subsystems is a move toward direct network attachment. Network-attached storage offers the opportunity to off-load distributed file system functionality from dedicated file server machines and execute many requests directly at the storage devices. For this strategy to lead to better performance, as perceived by users, the response time of distributed operations must improve. In this paper we analyze measurements of an Andrew file system (AFS) server that we recently upgraded in an effort to improve client performance in our laboratory. While the original server's overall utilization was only about 3%, we show how burst loads were sufficiently intense to lead to period of poor response time significant enough to trigger customer dissatisfaction. In particular, we show how, after adjusting for network load and traffic to non-project servers, 50% of the variation in client response time was explained by variation in server central processing unit (CPU) use. That is, clients saw long response times in large part because the server was often over-utilized when it was used at all. Using these measures, we see that off-loading file server work in a network-attached storage architecture has to potential to benefit user response time. Computational power in such a system scales directly with storage capacity, so the slowdown during burst period should be reduced.
Scalable Cloning on Large-Scale GPU Platforms with Application to Time-Stepped Simulations on Grids
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoginath, Srikanth B.; Perumalla, Kalyan S.
Cloning is a technique to efficiently simulate a tree of multiple what-if scenarios that are unraveled during the course of a base simulation. However, cloned execution is highly challenging to realize on large, distributed memory computing platforms, due to the dynamic nature of the computational load across clones, and due to the complex dependencies spanning the clone tree. In this paper, we present the conceptual simulation framework, algorithmic foundations, and runtime interface of CloneX, a new system we designed for scalable simulation cloning. It efficiently and dynamically creates whole logical copies of a dynamic tree of simulations across a largemore » parallel system without full physical duplication of computation and memory. The performance of a prototype implementation executed on up to 1,024 graphical processing units of a supercomputing system has been evaluated with three benchmarks—heat diffusion, forest fire, and disease propagation models—delivering a speed up of over two orders of magnitude compared to replicated runs. Finally, the results demonstrate a significantly faster and scalable way to execute many what-if scenario ensembles of large simulations via cloning using the CloneX interface.« less
Continuous Variable Cluster State Generation over the Optical Spatial Mode Comb
Pooser, Raphael C.; Jing, Jietai
2014-10-20
One way quantum computing uses single qubit projective measurements performed on a cluster state (a highly entangled state of multiple qubits) in order to enact quantum gates. The model is promising due to its potential scalability; the cluster state may be produced at the beginning of the computation and operated on over time. Continuous variables (CV) offer another potential benefit in the form of deterministic entanglement generation. This determinism can lead to robust cluster states and scalable quantum computation. Recent demonstrations of CV cluster states have made great strides on the path to scalability utilizing either time or frequency multiplexingmore » in optical parametric oscillators (OPO) both above and below threshold. The techniques relied on a combination of entangling operators and beam splitter transformations. Here we show that an analogous transformation exists for amplifiers with Gaussian inputs states operating on multiple spatial modes. By judicious selection of local oscillators (LOs), the spatial mode distribution is analogous to the optical frequency comb consisting of axial modes in an OPO cavity. We outline an experimental system that generates cluster states across the spatial frequency comb which can also scale the amount of quantum noise reduction to potentially larger than in other systems.« less
Scalable Multiprocessor for High-Speed Computing in Space
NASA Technical Reports Server (NTRS)
Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard
2004-01-01
A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.
Using Swarming Agents for Scalable Security in Large Network Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Crouse, Michael; White, Jacob L.; Fulp, Errin W.
2011-09-23
The difficulty of securing computer infrastructures increases as they grow in size and complexity. Network-based security solutions such as IDS and firewalls cannot scale because of exponentially increasing computational costs inherent in detecting the rapidly growing number of threat signatures. Hostbased solutions like virus scanners and IDS suffer similar issues, and these are compounded when enterprises try to monitor these in a centralized manner. Swarm-based autonomous agent systems like digital ants and artificial immune systems can provide a scalable security solution for large network environments. The digital ants approach offers a biologically inspired design where each ant in the virtualmore » colony can detect atoms of evidence that may help identify a possible threat. By assembling the atomic evidences from different ant types the colony may detect the threat. This decentralized approach can require, on average, fewer computational resources than traditional centralized solutions; however there are limits to its scalability. This paper describes how dividing a large infrastructure into smaller managed enclaves allows the digital ant framework to effectively operate in larger environments. Experimental results will show that using smaller enclaves allows for more consistent distribution of agents and results in faster response times.« less
Scalable Cloning on Large-Scale GPU Platforms with Application to Time-Stepped Simulations on Grids
Yoginath, Srikanth B.; Perumalla, Kalyan S.
2018-01-31
Cloning is a technique to efficiently simulate a tree of multiple what-if scenarios that are unraveled during the course of a base simulation. However, cloned execution is highly challenging to realize on large, distributed memory computing platforms, due to the dynamic nature of the computational load across clones, and due to the complex dependencies spanning the clone tree. In this paper, we present the conceptual simulation framework, algorithmic foundations, and runtime interface of CloneX, a new system we designed for scalable simulation cloning. It efficiently and dynamically creates whole logical copies of a dynamic tree of simulations across a largemore » parallel system without full physical duplication of computation and memory. The performance of a prototype implementation executed on up to 1,024 graphical processing units of a supercomputing system has been evaluated with three benchmarks—heat diffusion, forest fire, and disease propagation models—delivering a speed up of over two orders of magnitude compared to replicated runs. Finally, the results demonstrate a significantly faster and scalable way to execute many what-if scenario ensembles of large simulations via cloning using the CloneX interface.« less
NEXUS Scalable and Distributed Next-Generation Avionics Bus for Space Missions
NASA Technical Reports Server (NTRS)
He, Yutao; Shalom, Eddy; Chau, Savio N.; Some, Raphael R.; Bolotin, Gary S.
2011-01-01
A paper discusses NEXUS, a common, next-generation avionics interconnect that is transparently compatible with wired, fiber-optic, and RF physical layers; provides a flexible, scalable, packet switched topology; is fault-tolerant with sub-microsecond detection/recovery latency; has scalable bandwidth from 1 Kbps to 10 Gbps; has guaranteed real-time determinism with sub-microsecond latency/jitter; has built-in testability; features low power consumption (< 100 mW per Gbps); is lightweight with about a 5,000-logic-gate footprint; and is implemented in a small Bus Interface Unit (BIU) with reconfigurable back-end providing interface to legacy subsystems. NEXUS enhances a commercial interconnect standard, Serial RapidIO, to meet avionics interconnect requirements without breaking the standard. This unified interconnect technology can be used to meet performance, power, size, and reliability requirements of all ranges of equipment, sensors, and actuators at chip-to-chip, board-to-board, or box-to-box boundary. Early results from in-house modeling activity of Serial RapidIO using VisualSim indicate that the use of a switched, high-performance avionics network will provide a quantum leap in spacecraft onboard science and autonomy capability for science and exploration missions.
Extending DIRAC File Management with Erasure-Coding for efficient storage.
NASA Astrophysics Data System (ADS)
Cadellin Skipsey, Samuel; Todev, Paulin; Britton, David; Crooks, David; Roy, Gareth
2015-12-01
The state of the art in Grid style data management is to achieve increased resilience of data via multiple complete replicas of data files across multiple storage endpoints. While this is effective, it is not the most space-efficient approach to resilience, especially when the reliability of individual storage endpoints is sufficiently high that only a few will be inactive at any point in time. We report on work performed as part of GridPP[1], extending the Dirac File Catalogue and file management interface to allow the placement of erasure-coded files: each file distributed as N identically-sized chunks of data striped across a vector of storage endpoints, encoded such that any M chunks can be lost and the original file can be reconstructed. The tools developed are transparent to the user, and, as well as allowing up and downloading of data to Grid storage, also provide the possibility of parallelising access across all of the distributed chunks at once, improving data transfer and IO performance. We expect this approach to be of most interest to smaller VOs, who have tighter bounds on the storage available to them, but larger (WLCG) VOs may be interested as their total data increases during Run 2. We provide an analysis of the costs and benefits of the approach, along with future development and implementation plans in this area. In general, overheads for multiple file transfers provide the largest issue for competitiveness of this approach at present.
Solving data-at-rest for the storage and retrieval of files in ad hoc networks
NASA Astrophysics Data System (ADS)
Knobler, Ron; Scheffel, Peter; Williams, Jonathan; Gaj, Kris; Kaps, Jens-Peter
2013-05-01
Based on current trends for both military and commercial applications, the use of mobile devices (e.g. smartphones and tablets) is greatly increasing. Several military applications consist of secure peer to peer file sharing without a centralized authority. For these military applications, if one or more of these mobile devices are lost or compromised, sensitive files can be compromised by adversaries, since COTS devices and operating systems are used. Complete system files cannot be stored on a device, since after compromising a device, an adversary can attack the data at rest, and eventually obtain the original file. Also after a device is compromised, the existing peer to peer system devices must still be able to access all system files. McQ has teamed with the Cryptographic Engineering Research Group at George Mason University to develop a custom distributed file sharing system to provide a complete solution to the data at rest problem for resource constrained embedded systems and mobile devices. This innovative approach scales very well to a large number of network devices, without a single point of failure. We have implemented the approach on representative mobile devices as well as developed an extensive system simulator to benchmark expected system performance based on detailed modeling of the network/radio characteristics, CONOPS, and secure distributed file system functionality. The simulator is highly customizable for the purpose of determining expected system performance for other network topologies and CONOPS.
NASA Astrophysics Data System (ADS)
Frey, Davide; Guerraoui, Rachid; Kermarrec, Anne-Marie; Koldehofe, Boris; Mogensen, Martin; Monod, Maxime; Quéma, Vivien
Gossip-based information dissemination protocols are considered easy to deploy, scalable and resilient to network dynamics. Load-balancing is inherent in these protocols as the dissemination work is evenly spread among all nodes. Yet, large-scale distributed systems are usually heterogeneous with respect to network capabilities such as bandwidth. In practice, a blind load-balancing strategy might significantly hamper the performance of the gossip dissemination.
2009-12-01
other services for early UNIX systems at Bell labs. In many UNIX based systems, the field added to ‘etc/ passwd ’ file to carry GCOS ID information was...charset, and external. struct options_main { /* Option flags */ opt_flags flags; /* Password files */ struct list_main * passwd ; /* Password file...object PASSWD . It is part of several other data structures. struct PASSWD { int id; char *login; char *passwd_hash; int UID
Dynamic full-scalability conversion in scalable video coding
NASA Astrophysics Data System (ADS)
Lee, Dong Su; Bae, Tae Meon; Thang, Truong Cong; Ro, Yong Man
2007-02-01
For outstanding coding efficiency with scalability functions, SVC (Scalable Video Coding) is being standardized. SVC can support spatial, temporal and SNR scalability and these scalabilities are useful to provide a smooth video streaming service even in a time varying network such as a mobile environment. But current SVC is insufficient to support dynamic video conversion with scalability, thereby the adaptation of bitrate to meet a fluctuating network condition is limited. In this paper, we propose dynamic full-scalability conversion methods for QoS adaptive video streaming in SVC. To accomplish full scalability dynamic conversion, we develop corresponding bitstream extraction, encoding and decoding schemes. At the encoder, we insert the IDR NAL periodically to solve the problems of spatial scalability conversion. At the extractor, we analyze the SVC bitstream to get the information which enable dynamic extraction. Real time extraction is achieved by using this information. Finally, we develop the decoder so that it can manage the changing scalability. Experimental results showed that dynamic full-scalability conversion was verified and it was necessary for time varying network condition.
Analyzing microtomography data with Python and the scikit-image library.
Gouillart, Emmanuelle; Nunez-Iglesias, Juan; van der Walt, Stéfan
2017-01-01
The exploration and processing of images is a vital aspect of the scientific workflows of many X-ray imaging modalities. Users require tools that combine interactivity, versatility, and performance. scikit-image is an open-source image processing toolkit for the Python language that supports a large variety of file formats and is compatible with 2D and 3D images. The toolkit exposes a simple programming interface, with thematic modules grouping functions according to their purpose, such as image restoration, segmentation, and measurements. scikit-image users benefit from a rich scientific Python ecosystem that contains many powerful libraries for tasks such as visualization or machine learning. scikit-image combines a gentle learning curve, versatile image processing capabilities, and the scalable performance required for the high-throughput analysis of X-ray imaging data.
Simms, Andrew M; Toofanny, Rudesh D; Kehl, Catherine; Benson, Noah C; Daggett, Valerie
2008-06-01
Dynameomics is a project to investigate and catalog the native-state dynamics and thermal unfolding pathways of representatives of all protein folds using solvated molecular dynamics simulations, as described in the preceding paper. Here we introduce the design of the molecular dynamics data warehouse, a scalable, reliable repository that houses simulation data that vastly simplifies management and access. In the succeeding paper, we describe the development of a complementary multidimensional database. A single protein unfolding or native-state simulation can take weeks to months to complete, and produces gigabytes of coordinate and analysis data. Mining information from over 3000 completed simulations is complicated and time-consuming. Even the simplest queries involve writing intricate programs that must be built from low-level file system access primitives and include significant logic to correctly locate and parse data of interest. As a result, programs to answer questions that require data from hundreds of simulations are very difficult to write. Thus, organization and access to simulation data have been major obstacles to the discovery of new knowledge in the Dynameomics project. This repository is used internally and is the foundation of the Dynameomics portal site http://www.dynameomics.org. By organizing simulation data into a scalable, manageable and accessible form, we can begin to address substantial questions that move us closer to solving biomedical and bioengineering problems.
VLBA Archive &Distribution Architecture
NASA Astrophysics Data System (ADS)
Wells, D. C.
1994-01-01
Signals from the 10 antennas of NRAO's VLBA [Very Long Baseline Array] are processed by a Correlator. The complex fringe visibilities produced by the Correlator are archived on magnetic cartridges using a low-cost architecture which is capable of scaling and evolving. Archive files are copied to magnetic media to be distributed to users in FITS format, using the BINTABLE extension. Archive files are labelled using SQL INSERT statements, in order to bind the DBMS-based archive catalog to the archive media.
The Deployment of Routing Protocols in Distributed Control Plane of SDN
Jingjing, Zhou; Di, Cheng; Weiming, Wang; Rong, Jin; Xiaochun, Wu
2014-01-01
Software defined network (SDN) provides a programmable network through decoupling the data plane, control plane, and application plane from the original closed system, thus revolutionizing the existing network architecture to improve the performance and scalability. In this paper, we learned about the distributed characteristics of Kandoo architecture and, meanwhile, improved and optimized Kandoo's two levels of controllers based on ideological inspiration of RCP (routing control platform). Finally, we analyzed the deployment strategies of BGP and OSPF protocol in a distributed control plane of SDN. The simulation results show that our deployment strategies are superior to the traditional routing strategies. PMID:25250395
Resource Management for Distributed Parallel Systems
NASA Technical Reports Server (NTRS)
Neuman, B. Clifford; Rao, Santosh
1993-01-01
Multiprocessor systems should exist in the the larger context of distributed systems, allowing multiprocessor resources to be shared by those that need them. Unfortunately, typical multiprocessor resource management techniques do not scale to large networks. The Prospero Resource Manager (PRM) is a scalable resource allocation system that supports the allocation of processing resources in large networks and multiprocessor systems. To manage resources in such distributed parallel systems, PRM employs three types of managers: system managers, job managers, and node managers. There exist multiple independent instances of each type of manager, reducing bottlenecks. The complexity of each manager is further reduced because each is designed to utilize information at an appropriate level of abstraction.
ESUSA: US endangered species distribution file
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nagy, J.; Calef, C.E.
1979-10-01
This report describes a file containing distribution data on endangered species of the United States of Federal concern pursuant to the Endangered Species Act of 1973. Included for each species are (a) the common name, (b) the scientific name, (c) the family, (d) the group (mammal, bird, etc.), (e) Fish and Wildlife Service (FWS) listing and recovery priorities, (f) the Federal legal status, (g) the geographic distribution by counties or islands, (h) Federal Register citations and (i) the sources of the information on distribution of the species. Status types are endangered, threatened, proposed, formally under review, candidate, deleted, and rejected.more » Distribution is by Federal Information Processing Standard (FIPS) county code and is of four types: designated critical habitat, present range, potential range, and historic range.« less
A Disciplined Architectural Approach to Scaling Data Analysis for Massive, Scientific Data
NASA Astrophysics Data System (ADS)
Crichton, D. J.; Braverman, A. J.; Cinquini, L.; Turmon, M.; Lee, H.; Law, E.
2014-12-01
Data collections across remote sensing and ground-based instruments in astronomy, Earth science, and planetary science are outpacing scientists' ability to analyze them. Furthermore, the distribution, structure, and heterogeneity of the measurements themselves pose challenges that limit the scalability of data analysis using traditional approaches. Methods for developing science data processing pipelines, distribution of scientific datasets, and performing analysis will require innovative approaches that integrate cyber-infrastructure, algorithms, and data into more systematic approaches that can more efficiently compute and reduce data, particularly distributed data. This requires the integration of computer science, machine learning, statistics and domain expertise to identify scalable architectures for data analysis. The size of data returned from Earth Science observing satellites and the magnitude of data from climate model output, is predicted to grow into the tens of petabytes challenging current data analysis paradigms. This same kind of growth is present in astronomy and planetary science data. One of the major challenges in data science and related disciplines defining new approaches to scaling systems and analysis in order to increase scientific productivity and yield. Specific needs include: 1) identification of optimized system architectures for analyzing massive, distributed data sets; 2) algorithms for systematic analysis of massive data sets in distributed environments; and 3) the development of software infrastructures that are capable of performing massive, distributed data analysis across a comprehensive data science framework. NASA/JPL has begun an initiative in data science to address these challenges. Our goal is to evaluate how scientific productivity can be improved through optimized architectural topologies that identify how to deploy and manage the access, distribution, computation, and reduction of massive, distributed data, while managing the uncertainties of scientific conclusions derived from such capabilities. This talk will provide an overview of JPL's efforts in developing a comprehensive architectural approach to data science.
Federal Register 2010, 2011, 2012, 2013, 2014
2013-06-13
... generation of 600,000 kilowatt-hours. m. This filing is available for review and reproduction at the..., and distributing and consulting on a draft exemption application. Dated: June 6, 2013. Kimberly D...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-09-28
...; Notice of Baseline Filings September 21, 2010. Take notice that on September 14, 2010, September 16, 2010... submitted their baseline filing of its Statement of Operating Conditions for services provided under section...
NASA Astrophysics Data System (ADS)
Bauerdick, L. A. T.; Bloom, K.; Bockelman, B.; Bradley, D. C.; Dasu, S.; Dost, J. M.; Sfiligoi, I.; Tadel, A.; Tadel, M.; Wuerthwein, F.; Yagil, A.; Cms Collaboration
2014-06-01
Following the success of the XRootd-based US CMS data federation, the AAA project investigated extensions of the federation architecture by developing two sample implementations of an XRootd, disk-based, caching proxy. The first one simply starts fetching a whole file as soon as a file open request is received and is suitable when completely random file access is expected or it is already known that a whole file be read. The second implementation supports on-demand downloading of partial files. Extensions to the Hadoop Distributed File System have been developed to allow for an immediate fallback to network access when local HDFS storage fails to provide the requested block. Both cache implementations are in pre-production testing at UCSD.
Simple Automatic File Exchange (SAFE) to Support Low-Cost Spacecraft Operation via the Internet
NASA Technical Reports Server (NTRS)
Baker, Paul; Repaci, Max; Sames, David
1998-01-01
Various issues associated with Simple Automatic File Exchange (SAFE) are presented in viewgraph form. Specific topics include: 1) Packet telemetry, Internet IP networks and cost reduction; 2) Basic functions and technical features of SAFE; 3) Project goals, including low-cost satellite transmission to data centers to be distributed via an Internet; 4) Operations with a replicated file protocol; 5) File exchange operation; 6) Ground stations as gateways; 7) Lessons learned from demonstrations and tests with SAFE; and 8) Feedback and future initiatives.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Forslund, D.W.; Cook, J.L.
One of the most powerful tools available for telemedicine is a multimedia medical record accessible over a wide area and simultaneously editable by multiple physicians. The ability to do this through an intuitive interface linking multiple distributed data repositories while maintaining full data integrity is a fundamental enabling technology in healthcare. The authors discuss the role of distributed object technology using Java and CORBA in providing this capability including an example of such a system (TeleMed) which can be accessed through the World Wide Web. Issues of security, scalability, data integrity, and usability are emphasized.
Finding idle machines in a workstation-based distributed system
NASA Technical Reports Server (NTRS)
Theimer, Marvin M.; Lantz, Keith A.
1989-01-01
The authors describe the design and performance of scheduling facilities for finding idle hosts in a workstation-based distributed system. They focus on the tradeoffs between centralized and decentralized architectures with respect to scalability, fault tolerance, and simplicity of design, as well as several implementation issues of interest when multicast communication is used. They conclude that the principal tradeoff between the two approaches is that a centralized architecture can be scaled to a significantly greater degree and can more easily monitor global system statistics, whereas a decentralized architecture is simpler to implement.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dmitriy Morozov, Tom Peterka
2014-07-29
Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets. As the scale of simulations and observations surpasses billions of particles, a distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this software is a distributed-memory parallel Delaunay and Voronoi tessellation algorithm based on existing serial computational geometry libraries that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include the addition of periodic and wall boundary conditions.
The role of CORBA in enabling telemedicine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Forslund, D.W.
1997-07-01
One of the most powerful tools available for telemedicine is a multimedia medical record accessible over a wide area and simultaneously editable by multiple physicians. The ability to do this through an intuitive interface linking multiple distributed data repositories while maintaining full data integrity is a fundamental enabling technology in healthcare. The author discusses the role of distributed object technology using CORBA in providing this capability including an example of such a system (TeleMed) which can be accessed through the World Wide Web. Issues of security, scalability, data integrity, and useability are emphasized.
Common Accounting System for Monitoring the ATLAS Distributed Computing Resources
NASA Astrophysics Data System (ADS)
Karavakis, E.; Andreeva, J.; Campana, S.; Gayazov, S.; Jezequel, S.; Saiz, P.; Sargsyan, L.; Schovancova, J.; Ueda, I.; Atlas Collaboration
2014-06-01
This paper covers in detail a variety of accounting tools used to monitor the utilisation of the available computational and storage resources within the ATLAS Distributed Computing during the first three years of Large Hadron Collider data taking. The Experiment Dashboard provides a set of common accounting tools that combine monitoring information originating from many different information sources; either generic or ATLAS specific. This set of tools provides quality and scalable solutions that are flexible enough to support the constantly evolving requirements of the ATLAS user community.
An Overview of ARL’s Multimodal Signatures Database and Web Interface
2007-12-01
ActiveX components, which hindered distribution due to license agreements and run-time license software to use such components. g. Proprietary...Overview The database consists of multimodal signature data files in the HDF5 format. Generally, each signature file contains all the ancillary...only contains information in the database, Web interface, and signature files that is releasable to the public. The Web interface consists of static
Performance Analysis of the Unitree Central File
NASA Technical Reports Server (NTRS)
Pentakalos, Odysseas I.; Flater, David
1994-01-01
This report consists of two parts. The first part briefly comments on the documentation status of two major systems at NASA#s Center for Computational Sciences, specifically the Cray C98 and the Convex C3830. The second part describes the work done on improving the performance of file transfers between the Unitree Mass Storage System running on the Convex file server and the users workstations distributed over a large georgraphic area.
DIRAC File Replica and Metadata Catalog
NASA Astrophysics Data System (ADS)
Tsaregorodtsev, A.; Poss, S.
2012-12-01
File replica and metadata catalogs are essential parts of any distributed data management system, which are largely determining its functionality and performance. A new File Catalog (DFC) was developed in the framework of the DIRAC Project that combines both replica and metadata catalog functionality. The DFC design is based on the practical experience with the data management system of the LHCb Collaboration. It is optimized for the most common patterns of the catalog usage in order to achieve maximum performance from the user perspective. The DFC supports bulk operations for replica queries and allows quick analysis of the storage usage globally and for each Storage Element separately. It supports flexible ACL rules with plug-ins for various policies that can be adopted by a particular community. The DFC catalog allows to store various types of metadata associated with files and directories and to perform efficient queries for the data based on complex metadata combinations. Definition of file ancestor-descendent relation chains is also possible. The DFC catalog is implemented in the general DIRAC distributed computing framework following the standard grid security architecture. In this paper we describe the design of the DFC and its implementation details. The performance measurements are compared with other grid file catalog implementations. The experience of the DFC Catalog usage in the CLIC detector project are discussed.
Workload Characterization and Performance Implications of Large-Scale Blog Servers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jeon, Myeongjae; Kim, Youngjae; Hwang, Jeaho
With the ever-increasing popularity of social network services (SNSs), an understanding of the characteristics of these services and their effects on the behavior of their host servers is critical. However, there has been a lack of research on the workload characterization of servers running SNS applications such as blog services. To fill this void, we empirically characterized real-world web server logs collected from one of the largest South Korean blog hosting sites for 12 consecutive days. The logs consist of more than 96 million HTTP requests and 4.7 TB of network traffic. Our analysis reveals the followings: (i) The transfermore » size of non-multimedia files and blog articles can be modeled using a truncated Pareto distribution and a log-normal distribution, respectively; (ii) User access for blog articles does not show temporal locality, but is strongly biased towards those posted with image or audio files. We additionally discuss the potential performance improvement through clustering of small files on a blog page into contiguous disk blocks, which benefits from the observed file access patterns. Trace-driven simulations show that, on average, the suggested approach achieves 60.6% better system throughput and reduces the processing time for file access by 30.8% compared to the best performance of the Ext4 file system.« less
A comparison of decentralized, distributed, and centralized vibro-acoustic control.
Frampton, Kenneth D; Baumann, Oliver N; Gardonio, Paolo
2010-11-01
Direct velocity feedback control of structures is well known to increase structural damping and thus reduce vibration. In multi-channel systems the way in which the velocity signals are used to inform the actuators ranges from decentralized control, through distributed or clustered control to fully centralized control. The objective of distributed controllers is to exploit the anticipated performance advantage of the centralized control while maintaining the scalability, ease of implementation, and robustness of decentralized control. However, and in seeming contradiction, some investigations have concluded that decentralized control performs as well as distributed and centralized control, while other results have indicated that distributed control has significant performance advantages over decentralized control. The purpose of this work is to explain this seeming contradiction in results, to explore the effectiveness of decentralized, distributed, and centralized vibro-acoustic control, and to expand the concept of distributed control to include the distribution of the optimization process and the cost function employed.
A scalable architecture for online anomaly detection of WLCG batch jobs
NASA Astrophysics Data System (ADS)
Kuehn, E.; Fischer, M.; Giffels, M.; Jung, C.; Petzold, A.
2016-10-01
For data centres it is increasingly important to monitor the network usage, and learn from network usage patterns. Especially configuration issues or misbehaving batch jobs preventing a smooth operation need to be detected as early as possible. At the GridKa data and computing centre we therefore operate a tool BPNetMon for monitoring traffic data and characteristics of WLCG batch jobs and pilots locally on different worker nodes. On the one hand local information itself are not sufficient to detect anomalies for several reasons, e.g. the underlying job distribution on a single worker node might change or there might be a local misconfiguration. On the other hand a centralised anomaly detection approach does not scale regarding network communication as well as computational costs. We therefore propose a scalable architecture based on concepts of a super-peer network.
Simplified Parallel Domain Traversal
DOE Office of Scientific and Technical Information (OSTI.GOV)
Erickson III, David J
2011-01-01
Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. In order to deliver both simplicity to users as well as scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep bymore » performing teleconnection analysis across ensemble runs of terascale atmospheric CO{sub 2} and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.« less