Science.gov

Sample records for hadoop distributed file

  1. Developing a Hadoop-based Middleware for Handling Multi-dimensional NetCDF

    NASA Astrophysics Data System (ADS)

    Li, Z.; Yang, C. P.; Schnase, J. L.; Duffy, D.; Lee, T. J.

    2014-12-01

    Climate observations and model simulations are collecting and generating vast amounts of climate data, and these data are ever-increasing and being accumulated in a rapid speed. Effectively managing and analyzing these data are essential for climate change studies. Hadoop, a distributed storage and processing framework for large data sets, has attracted increasing attentions in dealing with the Big Data challenge. The maturity of Infrastructure as a Service (IaaS) of cloud computing further accelerates the adoption of Hadoop in solving Big Data problems. However, Hadoop is designed to process unstructured data such as texts, documents and web pages, and cannot effectively handle the scientific data format such as array-based NetCDF files and other binary data format. In this paper, we propose to build a Hadoop-based middleware for transparently handling big NetCDF data by 1) designing a distributed climate data storage mechanism based on POSIX-enabled parallel file system to enable parallel big data processing with MapReduce, as well as support data access by other systems; 2) modifying the Hadoop framework to transparently processing NetCDF data in parallel without sequencing or converting the data into other file formats, or loading them to HDFS; and 3) seamlessly integrating Hadoop, cloud computing and climate data in a highly scalable and fault-tolerance framework.

  2. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.

    PubMed

    Niemenmaa, Matti; Kallio, Aleksi; Schumacher, André; Klemelä, Petri; Korpelainen, Eija; Heljanko, Keijo

    2012-03-15

    Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps.

  3. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud

    PubMed Central

    Niemenmaa, Matti; Kallio, Aleksi; Schumacher, André; Klemelä, Petri; Korpelainen, Eija; Heljanko, Keijo

    2012-01-01

    Summary: Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps. Availability: Available under the open-source MIT license at http://sourceforge.net/projects/hadoop-bam/ Contact: matti.niemenmaa@aalto.fi Supplementary information: Supplementary material is available at Bioinformatics online. PMID:22302568

  4. Hadoop neural network for parallel and distributed feature selection.

    PubMed

    Hodge, Victoria J; O'Keefe, Simon; Austin, Jim

    2016-06-01

    In this paper, we introduce a theoretical basis for a Hadoop-based neural network for parallel and distributed feature selection in Big Data sets. It is underpinned by an associative memory (binary) neural network which is highly amenable to parallel and distributed processing and fits with the Hadoop paradigm. There are many feature selectors described in the literature which all have various strengths and weaknesses. We present the implementation details of five feature selection algorithms constructed using our artificial neural network framework embedded in Hadoop YARN. Hadoop allows parallel and distributed processing. Each feature selector can be divided into subtasks and the subtasks can then be processed in parallel. Multiple feature selectors can also be processed simultaneously (in parallel) allowing multiple feature selectors to be compared. We identify commonalities among the five features selectors. All can be processed in the framework using a single representation and the overall processing can also be greatly reduced by only processing the common aspects of the feature selectors once and propagating these aspects across all five feature selectors as necessary. This allows the best feature selector and the actual features to select to be identified for large and high dimensional data sets through exploiting the efficiency and flexibility of embedding the binary associative-memory neural network in Hadoop. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.

  5. Evaluation of Apache Hadoop for parallel data analysis with ROOT

    NASA Astrophysics Data System (ADS)

    Lehrack, S.; Duckeck, G.; Ebke, J.

    2014-06-01

    The Apache Hadoop software is a Java based framework for distributed processing of large data sets across clusters of computers, using the Hadoop file system (HDFS) for data storage and backup and MapReduce as a processing platform. Hadoop is primarily designed for processing large textual data sets which can be processed in arbitrary chunks, and must be adapted to the use case of processing binary data files which cannot be split automatically. However, Hadoop offers attractive features in terms of fault tolerance, task supervision and control, multi-user functionality and job management. For this reason, we evaluated Apache Hadoop as an alternative approach to PROOF for ROOT data analysis. Two alternatives in distributing analysis data were discussed: either the data was stored in HDFS and processed with MapReduce, or the data was accessed via a standard Grid storage system (dCache Tier-2) and MapReduce was used only as execution back-end. The focus in the measurements were on the one hand to safely store analysis data on HDFS with reasonable data rates and on the other hand to process data fast and reliably with MapReduce. In the evaluation of the HDFS, read/write data rates from local Hadoop cluster have been measured and compared to standard data rates from the local NFS installation. In the evaluation of MapReduce, realistic ROOT analyses have been used and event rates have been compared to PROOF.

  6. a Hadoop-Based Distributed Framework for Efficient Managing and Processing Big Remote Sensing Images

    NASA Astrophysics Data System (ADS)

    Wang, C.; Hu, F.; Hu, X.; Zhao, S.; Wen, W.; Yang, C.

    2015-07-01

    Various sensors from airborne and satellite platforms are producing large volumes of remote sensing images for mapping, environmental monitoring, disaster management, military intelligence, and others. However, it is challenging to efficiently storage, query and process such big data due to the data- and computing- intensive issues. In this paper, a Hadoop-based framework is proposed to manage and process the big remote sensing data in a distributed and parallel manner. Especially, remote sensing data can be directly fetched from other data platforms into the Hadoop Distributed File System (HDFS). The Orfeo toolbox, a ready-to-use tool for large image processing, is integrated into MapReduce to provide affluent image processing operations. With the integration of HDFS, Orfeo toolbox and MapReduce, these remote sensing images can be directly processed in parallel in a scalable computing environment. The experiment results show that the proposed framework can efficiently manage and process such big remote sensing data.

  7. Sequential data access with Oracle and Hadoop: a performance comparison

    NASA Astrophysics Data System (ADS)

    Baranowski, Zbigniew; Canali, Luca; Grancher, Eric

    2014-06-01

    The Hadoop framework has proven to be an effective and popular approach for dealing with "Big Data" and, thanks to its scaling ability and optimised storage access, Hadoop Distributed File System-based projects such as MapReduce or HBase are seen as candidates to replace traditional relational database management systems whenever scalable speed of data processing is a priority. But do these projects deliver in practice? Does migrating to Hadoop's "shared nothing" architecture really improve data access throughput? And, if so, at what cost? Authors answer these questions-addressing cost/performance as well as raw performance- based on a performance comparison between an Oracle-based relational database and Hadoop's distributed solutions like MapReduce or HBase for sequential data access. A key feature of our approach is the use of an unbiased data model as certain data models can significantly favour one of the technologies tested.

  8. Design of material management system of mining group based on Hadoop

    NASA Astrophysics Data System (ADS)

    Xia, Zhiyuan; Tan, Zhuoying; Qi, Kuan; Li, Wen

    2018-01-01

    Under the background of persistent slowdown in mining market at present, improving the management level in mining group has become the key link to improve the economic benefit of the mine. According to the practical material management in mining group, three core components of Hadoop are applied: distributed file system HDFS, distributed computing framework Map/Reduce and distributed database HBase. Material management system of mining group based on Hadoop is constructed with the three core components of Hadoop and SSH framework technology. This system was found to strengthen collaboration between mining group and affiliated companies, and then the problems such as inefficient management, server pressure, hardware equipment performance deficiencies that exist in traditional mining material-management system are solved, and then mining group materials management is optimized, the cost of mining management is saved, the enterprise profit is increased.

  9. FASTdoop: a versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications.

    PubMed

    Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele

    2017-05-15

    MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need for managing efficiently sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the Literature, it offers versatility and efficiency. That is, it can handle collections of reads, with or without quality scores, as well as long genomic sequences while the existing routines concentrate mainly on NGS sequence data. Moreover, in the domain where a comparison is possible, the routines proposed here are faster than the available ones. In conclusion, FASTdoop is a much needed addition to Hadoop-BAM. The software and the datasets are available at http://www.di.unisa.it/FASTdoop/ . umberto.ferraro@uniroma1.it. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  10. Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

    NASA Astrophysics Data System (ADS)

    Farhan Husain, Mohammad; Doshi, Pankil; Khan, Latifur; Thuraisingham, Bhavani

    Handling huge amount of data scalably is a matter of concern for a long time. Same is true for semantic web data. Current semantic web frameworks lack this ability. In this paper, we describe a framework that we built using Hadoop to store and retrieve large number of RDF triples. We describe our schema to store RDF data in Hadoop Distribute File System. We also present our algorithms to answer a SPARQL query. We make use of Hadoop's MapReduce framework to actually answer the queries. Our results reveal that we can store huge amount of semantic web data in Hadoop clusters built mostly by cheap commodity class hardware and still can answer queries fast enough. We conclude that ours is a scalable framework, able to handle large amount of RDF data efficiently.

  11. Massive Signal Analysis with Hadoop (Invited)

    NASA Astrophysics Data System (ADS)

    Addair, T.

    2013-12-01

    The Geophysical Monitoring Program (GMP) at Lawrence Livermore National Laboratory is in the process of transitioning from a primarily human-driven analysis pipeline to a more automated and exploratory system. Waveform correlation represents a significant part of this effort, and the results that come out of this processing could lead to the development of more sophisticated event detection and analysis systems that require less human interaction, and address fundamental shortcomings in existing systems. Furthermore, use of distributed IO systems fundamentally addresses a scalability concern for the GMP as our data holdings continue to grow rapidly. As the data volume increases, it becomes less reasonable to rely upon human analysts to sift through all the information. Not only is more automation essential to keeping up with the ingestion rate, but so too do we require faster and more sophisticated tools for visualizing and interacting with the data. These issues of scalability are not unique to GMP or the seismic domain. All across the lab, and throughout industry, we hear about the promise of 'big data' to address the need of quickly analyzing vast amounts of data in fundamentally new ways. Our waveform correlation system finds and correlates nearby seismic events across the entire Earth. In our original implementation of the system, we processed some 50 TB of data on an in-house traditional HPC cluster (44 cores, 1 filesystem) over the span of 42 days. Having determined the primary bottleneck in the performance to be reading waveforms off a single BlueArc file server, we began investigating distributed IO solutions like Hadoop. As a test case, we took a 1 TB subset of our data and ported it to Livermore Computing's development Hadoop cluster. Through a pilot project sponsored by Livermore Computing (LC), the GMP successfully implemented the waveform correlation system in the Hadoop distributed MapReduce computing framework. Hadoop is an open source

  12. XRootD popularity on hadoop clusters

    NASA Astrophysics Data System (ADS)

    Meoni, Marco; Boccali, Tommaso; Magini, Nicolò; Menichetti, Luca; Giordano, Domenico; CMS Collaboration

    2017-10-01

    Performance data and metadata of the computing operations at the CMS experiment are collected through a distributed monitoring infrastructure, currently relying on a traditional Oracle database system. This paper shows how to harness Big Data architectures in order to improve the throughput and the efficiency of such monitoring. A large set of operational data - user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs - is being injected into a readily available Hadoop cluster, via several data streamers. The collected metadata is further organized running fast arbitrary queries; this offers the ability to test several Map&Reduce-based frameworks and measure the system speed-up when compared to the original database infrastructure. By leveraging a quality Hadoop data store and enabling an analytics framework on top, it is possible to design a mining platform to predict dataset popularity and discover patterns and correlations.

  13. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data.

    PubMed

    Siretskiy, Alexey; Sundqvist, Tore; Voznesenskiy, Mikhail; Spjuth, Ola

    2015-01-01

    New high-throughput technologies, such as massively parallel sequencing, have transformed the life sciences into a data-intensive field. The most common e-infrastructure for analyzing this data consists of batch systems that are based on high-performance computing resources; however, the bioinformatics software that is built on this platform does not scale well in the general case. Recently, the Hadoop platform has emerged as an interesting option to address the challenges of increasingly large datasets with distributed storage, distributed processing, built-in data locality, fault tolerance, and an appealing programming methodology. In this work we introduce metrics and report on a quantitative comparison between Hadoop and a single node of conventional high-performance computing resources for the tasks of short read mapping and variant calling. We calculate efficiency as a function of data size and observe that the Hadoop platform is more efficient for biologically relevant data sizes in terms of computing hours for both split and un-split data files. We also quantify the advantages of the data locality provided by Hadoop for NGS problems, and show that a classical architecture with network-attached storage will not scale when computing resources increase in numbers. Measurements were performed using ten datasets of different sizes, up to 100 gigabases, using the pipeline implemented in Crossbow. To make a fair comparison, we implemented an improved preprocessor for Hadoop with better performance for splittable data files. For improved usability, we implemented a graphical user interface for Crossbow in a private cloud environment using the CloudGene platform. All of the code and data in this study are freely available as open source in public repositories. From our experiments we can conclude that the improved Hadoop pipeline scales better than the same pipeline on high-performance computing resources, we also conclude that Hadoop is an economically viable

  14. Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework.

    PubMed

    Lewis, Steven; Csordas, Attila; Killcoyne, Sarah; Hermjakob, Henning; Hoopmann, Michael R; Moritz, Robert L; Deutsch, Eric W; Boyle, John

    2012-12-05

    For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.

  15. Astronomical Image Processing with Hadoop

    NASA Astrophysics Data System (ADS)

    Wiley, K.; Connolly, A.; Krughoff, S.; Gardner, J.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.

    2011-07-01

    In the coming decade astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. With a requirement that these images be analyzed in real time to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. In the commercial world, new techniques that utilize cloud computing have been developed to handle massive data streams. In this paper we describe how cloud computing, and in particular the map-reduce paradigm, can be used in astronomical data processing. We will focus on our experience implementing a scalable image-processing pipeline for the SDSS database using Hadoop (http://hadoop.apache.org). This multi-terabyte imaging dataset approximates future surveys such as those which will be conducted with the LSST. Our pipeline performs image coaddition in which multiple partially overlapping images are registered, integrated and stitched into a single overarching image. We will first present our initial implementation, then describe several critical optimizations that have enabled us to achieve high performance, and finally describe how we are incorporating a large in-house existing image processing library into our Hadoop system. The optimizations involve prefiltering of the input to remove irrelevant images from consideration, grouping individual FITS files into larger, more efficient indexed files, and a hybrid system in which a relational database is used to determine the input images relevant to the task. The incorporation of an existing image processing library, written in C++, presented difficult challenges since Hadoop is programmed primarily in Java. We will describe how we achieved this integration and the sophisticated image processing routines that were made feasible as a result. We will end by briefly describing the longer term goals of our work, namely detection and classification

  16. Experience, use, and performance measurement of the Hadoop File System in a typical nuclear physics analysis workflow

    NASA Astrophysics Data System (ADS)

    Sangaline, E.; Lauret, J.

    2014-06-01

    The quantity of information produced in Nuclear and Particle Physics (NPP) experiments necessitates the transmission and storage of data across diverse collections of computing resources. Robust solutions such as XRootD have been used in NPP, but as the usage of cloud resources grows, the difficulties in the dynamic configuration of these systems become a concern. Hadoop File System (HDFS) exists as a possible cloud storage solution with a proven track record in dynamic environments. Though currently not extensively used in NPP, HDFS is an attractive solution offering both elastic storage and rapid deployment. We will present the performance of HDFS in both canonical I/O tests and for a typical data analysis pattern within the RHIC/STAR experimental framework. These tests explore the scaling with different levels of redundancy and numbers of clients. Additionally, the performance of FUSE and NFS interfaces to HDFS were evaluated as a way to allow existing software to function without modification. Unfortunately, the complicated data structures in NPP are non-trivial to integrate with Hadoop and so many of the benefits of the MapReduce paradigm could not be directly realized. Despite this, our results indicate that using HDFS as a distributed filesystem offers reasonable performance and scalability and that it excels in its ease of configuration and deployment in a cloud environment.

  17. Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework

    PubMed Central

    2012-01-01

    Background For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. Results We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. Conclusion The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources. PMID:23216909

  18. Spatial coding-based approach for partitioning big spatial data in Hadoop

    NASA Astrophysics Data System (ADS)

    Yao, Xiaochuang; Mokbel, Mohamed F.; Alarabi, Louai; Eldawy, Ahmed; Yang, Jianyu; Yun, Wenju; Li, Lin; Ye, Sijing; Zhu, Dehai

    2017-09-01

    Spatial data partitioning (SDP) plays a powerful role in distributed storage and parallel computing for spatial data. However, due to skew distribution of spatial data and varying volume of spatial vector objects, it leads to a significant challenge to ensure both optimal performance of spatial operation and data balance in the cluster. To tackle this problem, we proposed a spatial coding-based approach for partitioning big spatial data in Hadoop. This approach, firstly, compressed the whole big spatial data based on spatial coding matrix to create a sensing information set (SIS), including spatial code, size, count and other information. SIS was then employed to build spatial partitioning matrix, which was used to spilt all spatial objects into different partitions in the cluster finally. Based on our approach, the neighbouring spatial objects can be partitioned into the same block. At the same time, it also can minimize the data skew in Hadoop distributed file system (HDFS). The presented approach with a case study in this paper is compared against random sampling based partitioning, with three measurement standards, namely, the spatial index quality, data skew in HDFS, and range query performance. The experimental results show that our method based on spatial coding technique can improve the query performance of big spatial data, as well as the data balance in HDFS. We implemented and deployed this approach in Hadoop, and it is also able to support efficiently any other distributed big spatial data systems.

  19. Hadoop-MCC: Efficient Multiple Compound Comparison Algorithm Using Hadoop.

    PubMed

    Hua, Guan-Jie; Hung, Che-Lun; Tang, Chuan Yi

    2018-01-01

    In the past decade, the drug design technologies have been improved enormously. The computer-aided drug design (CADD) has played an important role in analysis and prediction in drug development, which makes the procedure more economical and efficient. However, computation with big data, such as ZINC containing more than 60 million compounds data and GDB-13 with more than 930 million small molecules, is a noticeable issue of time-consuming problem. Therefore, we propose a novel heterogeneous high performance computing method, named as Hadoop-MCC, integrating Hadoop and GPU, to copy with big chemical structure data efficiently. Hadoop-MCC gains the high availability and fault tolerance from Hadoop, as Hadoop is used to scatter input data to GPU devices and gather the results from GPU devices. Hadoop framework adopts mapper/reducer computation model. In the proposed method, mappers response for fetching SMILES data segments and perform LINGO method on GPU, then reducers collect all comparison results produced by mappers. Due to the high availability of Hadoop, all of LINGO computational jobs on mappers can be completed, even if some of the mappers encounter problems. A comparison of LINGO is performed on each the GPU device in parallel. According to the experimental results, the proposed method on multiple GPU devices can achieve better computational performance than the CUDA-MCC on a single GPU device. Hadoop-MCC is able to achieve scalability, high availability, and fault tolerance granted by Hadoop, and high performance as well by integrating computational power of both of Hadoop and GPU. It has been shown that using the heterogeneous architecture as Hadoop-MCC effectively can enhance better computational performance than on a single GPU device. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  20. Remote sensing image segmentation based on Hadoop cloud platform

    NASA Astrophysics Data System (ADS)

    Li, Jie; Zhu, Lingling; Cao, Fubin

    2018-01-01

    To solve the problem that the remote sensing image segmentation speed is slow and the real-time performance is poor, this paper studies the method of remote sensing image segmentation based on Hadoop platform. On the basis of analyzing the structural characteristics of Hadoop cloud platform and its component MapReduce programming, this paper proposes a method of image segmentation based on the combination of OpenCV and Hadoop cloud platform. Firstly, the MapReduce image processing model of Hadoop cloud platform is designed, the input and output of image are customized and the segmentation method of the data file is rewritten. Then the Mean Shift image segmentation algorithm is implemented. Finally, this paper makes a segmentation experiment on remote sensing image, and uses MATLAB to realize the Mean Shift image segmentation algorithm to compare the same image segmentation experiment. The experimental results show that under the premise of ensuring good effect, the segmentation rate of remote sensing image segmentation based on Hadoop cloud Platform has been greatly improved compared with the single MATLAB image segmentation, and there is a great improvement in the effectiveness of image segmentation.

  1. Developing and Optimizing Applications in Hadoop

    NASA Astrophysics Data System (ADS)

    Kothuri, P.; Garcia, D.; Hermans, J.

    2017-10-01

    This contribution is about sharing our recent experiences of building Hadoop based application. Hadoop ecosystem now offers myriad of tools which can overwhelm new users, yet there are successful ways these tools can be leveraged to solve problems. We look at factors to consider when using Hadoop to model and store data, best practices for moving data in and out of the system and common processing patterns, at each stage relating with the real world experience gained while developing such application. We share many of the design choices, tools developed and how to profile a distributed application which can be applied for other scenarios as well. In conclusion, the goal of the presentation is to provide guidance to architect Hadoop based application and share some of the reusable components developed in this process.

  2. Efficient feature extraction from wide-area motion imagery by MapReduce in Hadoop

    NASA Astrophysics Data System (ADS)

    Cheng, Erkang; Ma, Liya; Blaisse, Adam; Blasch, Erik; Sheaff, Carolyn; Chen, Genshe; Wu, Jie; Ling, Haibin

    2014-06-01

    Wide-Area Motion Imagery (WAMI) feature extraction is important for applications such as target tracking, traffic management and accident discovery. With the increasing amount of WAMI collections and feature extraction from the data, a scalable framework is needed to handle the large amount of information. Cloud computing is one of the approaches recently applied in large scale or big data. In this paper, MapReduce in Hadoop is investigated for large scale feature extraction tasks for WAMI. Specifically, a large dataset of WAMI images is divided into several splits. Each split has a small subset of WAMI images. The feature extractions of WAMI images in each split are distributed to slave nodes in the Hadoop system. Feature extraction of each image is performed individually in the assigned slave node. Finally, the feature extraction results are sent to the Hadoop File System (HDFS) to aggregate the feature information over the collected imagery. Experiments of feature extraction with and without MapReduce are conducted to illustrate the effectiveness of our proposed Cloud-Enabled WAMI Exploitation (CAWE) approach.

  3. Framework for Parallel Preprocessing of Microarray Data Using Hadoop

    PubMed Central

    2018-01-01

    Nowadays, microarray technology has become one of the popular ways to study gene expression and diagnosis of disease. National Center for Biology Information (NCBI) hosts public databases containing large volumes of biological data required to be preprocessed, since they carry high levels of noise and bias. Robust Multiarray Average (RMA) is one of the standard and popular methods that is utilized to preprocess the data and remove the noises. Most of the preprocessing algorithms are time-consuming and not able to handle a large number of datasets with thousands of experiments. Parallel processing can be used to address the above-mentioned issues. Hadoop is a well-known and ideal distributed file system framework that provides a parallel environment to run the experiment. In this research, for the first time, the capability of Hadoop and statistical power of R have been leveraged to parallelize the available preprocessing algorithm called RMA to efficiently process microarray data. The experiment has been run on cluster containing 5 nodes, while each node has 16 cores and 16 GB memory. It compares efficiency and the performance of parallelized RMA using Hadoop with parallelized RMA using affyPara package as well as sequential RMA. The result shows the speed-up rate of the proposed approach outperforms the sequential approach and affyPara approach. PMID:29796018

  4. a Hadoop-Based Algorithm of Generating dem Grid from Point Cloud Data

    NASA Astrophysics Data System (ADS)

    Jian, X.; Xiao, X.; Chengfang, H.; Zhizhong, Z.; Zhaohui, W.; Dengzhong, Z.

    2015-04-01

    Airborne LiDAR technology has proven to be the most powerful tools to obtain high-density, high-accuracy and significantly detailed surface information of terrain and surface objects within a short time, and from which the Digital Elevation Model of high quality can be extracted. Point cloud data generated from the pre-processed data should be classified by segmentation algorithms, so as to differ the terrain points from disorganized points, then followed by a procedure of interpolating the selected points to turn points into DEM data. The whole procedure takes a long time and huge computing resource due to high-density, that is concentrated on by a number of researches. Hadoop is a distributed system infrastructure developed by the Apache Foundation, which contains a highly fault-tolerant distributed file system (HDFS) with high transmission rate and a parallel programming model (Map/Reduce). Such a framework is appropriate for DEM generation algorithms to improve efficiency. Point cloud data of Dongting Lake acquired by Riegl LMS-Q680i laser scanner was utilized as the original data to generate DEM by a Hadoop-based algorithms implemented in Linux, then followed by another traditional procedure programmed by C++ as the comparative experiment. Then the algorithm's efficiency, coding complexity, and performance-cost ratio were discussed for the comparison. The results demonstrate that the algorithm's speed depends on size of point set and density of DEM grid, and the non-Hadoop implementation can achieve a high performance when memory is big enough, but the multiple Hadoop implementation can achieve a higher performance-cost ratio, while point set is of vast quantities on the other hand.

  5. Information Retrieval Using Hadoop Big Data Analysis

    NASA Astrophysics Data System (ADS)

    Motwani, Deepak; Madan, Madan Lal

    This paper concern on big data analysis which is the cognitive operation of probing huge amounts of information in an attempt to get uncovers unseen patterns. Through Big Data Analytics Applications such as public and private organization sectors have formed a strategic determination to turn big data into cut throat benefit. The primary occupation of extracting value from big data give rise to a process applied to pull information from multiple different sources; this process is known as extract transforms and lode. This paper approach extract information from log files and Research Paper, awareness reduces the efforts for blueprint finding and summarization of document from several positions. The work is able to understand better Hadoop basic concept and increase the user experience for research. In this paper, we propose an approach for analysis log files for finding concise information which is useful and time saving by using Hadoop. Our proposed approach will be applied on different research papers on a specific domain and applied for getting summarized content for further improvement and make the new content.

  6. Implementation and performance test of cloud platform based on Hadoop

    NASA Astrophysics Data System (ADS)

    Xu, Jingxian; Guo, Jianhong; Ren, Chunlan

    2018-01-01

    Hadoop, as an open source project for the Apache foundation, is a distributed computing framework that deals with large amounts of data and has been widely used in the Internet industry. Therefore, it is meaningful to study the implementation of Hadoop platform and the performance of test platform. The purpose of this subject is to study the method of building Hadoop platform and to study the performance of test platform. This paper presents a method to implement Hadoop platform and a test platform performance method. Experimental results show that the proposed test performance method is effective and it can detect the performance of Hadoop platform.

  7. Big data mining: In-database Oracle data mining over hadoop

    NASA Astrophysics Data System (ADS)

    Kovacheva, Zlatinka; Naydenova, Ina; Kaloyanova, Kalinka; Markov, Krasimir

    2017-07-01

    Big data challenges different aspects of storing, processing and managing data, as well as analyzing and using data for business purposes. Applying Data Mining methods over Big Data is another challenge because of huge data volumes, variety of information, and the dynamic of the sources. Different applications are made in this area, but their successful usage depends on understanding many specific parameters. In this paper we present several opportunities for using Data Mining techniques provided by the analytical engine of RDBMS Oracle over data stored in Hadoop Distributed File System (HDFS). Some experimental results are given and they are discussed.

  8. HEPDOOP: High-Energy Physics Analysis using Hadoop

    NASA Astrophysics Data System (ADS)

    Bhimji, W.; Bristow, T.; Washbrook, A.

    2014-06-01

    We perform a LHC data analysis workflow using tools and data formats that are commonly used in the "Big Data" community outside High Energy Physics (HEP). These include Apache Avro for serialisation to binary files, Pig and Hadoop for mass data processing and Python Scikit-Learn for multi-variate analysis. Comparison is made with the same analysis performed with current HEP tools in ROOT.

  9. YARNsim: Simulating Hadoop YARN

    SciTech Connect

    Liu, Ning; Yang, Xi; Sun, Xian-He

    Despite the popularity of the Apache Hadoop system, its success has been limited by issues such as single points of failure, centralized job/task management, and lack of support for programming models other than MapReduce. The next generation of Hadoop, Apache Hadoop YARN, is designed to address these issues. In this paper, we propose YARNsim, a simulation system for Hadoop YARN. YARNsim is based on parallel discrete event simulation and provides protocol-level accuracy in simulating key components of YARN. YARNsim provides a virtual platform on which system architects can evaluate the design and implementation of Hadoop YARN systems. Also, application developersmore » can tune job performance and understand the tradeoffs between different configurations, and Hadoop YARN system vendors can evaluate system efficiency under limited budgets. To demonstrate the validity of YARNsim, we use it to model two real systems and compare the experimental results from YARNsim and the real systems. The experiments include standard Hadoop benchmarks, synthetic workloads, and a bioinformatics application. The results show that the error rate is within 10% for the majority of test cases. The experiments prove that YARNsim can provide what-if analysis for system designers in a timely manner and at minimal cost compared with testing and evaluating on a real system.« less

  10. Efficient LIDAR Point Cloud Data Managing and Processing in a Hadoop-Based Distributed Framework

    NASA Astrophysics Data System (ADS)

    Wang, C.; Hu, F.; Sha, D.; Han, X.

    2017-10-01

    Light Detection and Ranging (LiDAR) is one of the most promising technologies in surveying and mapping city management, forestry, object recognition, computer vision engineer and others. However, it is challenging to efficiently storage, query and analyze the high-resolution 3D LiDAR data due to its volume and complexity. In order to improve the productivity of Lidar data processing, this study proposes a Hadoop-based framework to efficiently manage and process LiDAR data in a distributed and parallel manner, which takes advantage of Hadoop's storage and computing ability. At the same time, the Point Cloud Library (PCL), an open-source project for 2D/3D image and point cloud processing, is integrated with HDFS and MapReduce to conduct the Lidar data analysis algorithms provided by PCL in a parallel fashion. The experiment results show that the proposed framework can efficiently manage and process big LiDAR data.

  11. MR-Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services.

    PubMed

    Pratt, Brian; Howbert, J Jeffry; Tasman, Natalie I; Nilsson, Erik J

    2012-01-01

    MR-Tandem adapts the popular X!Tandem peptide search engine to work with Hadoop MapReduce for reliable parallel execution of large searches. MR-Tandem runs on any Hadoop cluster but offers special support for Amazon Web Services for creating inexpensive on-demand Hadoop clusters, enabling search volumes that might not otherwise be feasible with the compute resources a researcher has at hand. MR-Tandem is designed to drop in wherever X!Tandem is already in use and requires no modification to existing X!Tandem parameter files, and only minimal modification to X!Tandem-based workflows. MR-Tandem is implemented as a lightly modified X!Tandem C++ executable and a Python script that drives Hadoop clusters including Amazon Web Services (AWS) Elastic Map Reduce (EMR), using the modified X!Tandem program as a Hadoop Streaming mapper and reducer. The modified X!Tandem C++ source code is Artistic licensed, supports pluggable scoring, and is available as part of the Sashimi project at http://sashimi.svn.sourceforge.net/viewvc/sashimi/trunk/trans_proteomic_pipeline/extern/xtandem/. The MR-Tandem Python script is Apache licensed and available as part of the Insilicos Cloud Army project at http://ica.svn.sourceforge.net/viewvc/ica/trunk/mr-tandem/. Full documentation and a windows installer that configures MR-Tandem, Python and all necessary packages are available at this same URL. brian.pratt@insilicos.com

  12. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

    PubMed

    Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo

    2014-01-01

    Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts. Available under the open source MIT license at http://sourceforge.net/projects/seqpig/

  13. MR-Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services

    PubMed Central

    Pratt, Brian; Howbert, J. Jeffry; Tasman, Natalie I.; Nilsson, Erik J.

    2012-01-01

    Summary: MR-Tandem adapts the popular X!Tandem peptide search engine to work with Hadoop MapReduce for reliable parallel execution of large searches. MR-Tandem runs on any Hadoop cluster but offers special support for Amazon Web Services for creating inexpensive on-demand Hadoop clusters, enabling search volumes that might not otherwise be feasible with the compute resources a researcher has at hand. MR-Tandem is designed to drop in wherever X!Tandem is already in use and requires no modification to existing X!Tandem parameter files, and only minimal modification to X!Tandem-based workflows. Availability and implementation: MR-Tandem is implemented as a lightly modified X!Tandem C++ executable and a Python script that drives Hadoop clusters including Amazon Web Services (AWS) Elastic Map Reduce (EMR), using the modified X!Tandem program as a Hadoop Streaming mapper and reducer. The modified X!Tandem C++ source code is Artistic licensed, supports pluggable scoring, and is available as part of the Sashimi project at http://sashimi.svn.sourceforge.net/viewvc/sashimi/trunk/trans_proteomic_pipeline/extern/xtandem/. The MR-Tandem Python script is Apache licensed and available as part of the Insilicos Cloud Army project at http://ica.svn.sourceforge.net/viewvc/ica/trunk/mr-tandem/. Full documentation and a windows installer that configures MR-Tandem, Python and all necessary packages are available at this same URL. Contact: brian.pratt@insilicos.com PMID:22072385

  14. Hadoop distributed batch processing for Gaia: a success story

    NASA Astrophysics Data System (ADS)

    Riello, Marco

    2015-12-01

    The DPAC Cambridge Data Processing Centre (DPCI) is responsible for the photometric calibration of the Gaia data including the low resolution spectra. The large data volume produced by Gaia (~26 billion transits/year), the complexity of its data stream and the self-calibrating approach pose unique challenges for scalability, reliability and robustness of both the software pipelines and the operations infrastructure. DPCI has been the first in DPAC to realise the potential of Hadoop and Map/Reduce and to adopt them as the core technologies for its infrastructure. This has proven a winning choice allowing DPCI unmatched processing throughput and reliability within DPAC to the point that other DPCs have started following our footsteps. In this talk we will present the software infrastructure developed to build the distributed and scalable batch data processing system that is currently used in production at DPCI and the excellent results in terms of performance of the system.

  15. NDSI products system based on Hadoop platform

    NASA Astrophysics Data System (ADS)

    Zhou, Yan; Jiang, He; Yang, Xiaoxia; Geng, Erhui

    2015-12-01

    Snow is solid state of water resources on earth, and plays an important role in human life. Satellite remote sensing is significant in snow extraction with the advantages of cyclical, macro, comprehensiveness, objectivity, timeliness. With the continuous development of remote sensing technology, remote sensing data access to the trend of multiple platforms, multiple sensors and multiple perspectives. At the same time, in view of the remote sensing data of compute-intensive applications demand increase gradually. However, current the producing system of remote sensing products is in a serial mode, and this kind of production system is used for professional remote sensing researchers mostly, and production systems achieving automatic or semi-automatic production are relatively less. Facing massive remote sensing data, the traditional serial mode producing system with its low efficiency has been difficult to meet the requirements of mass data timely and efficient processing. In order to effectively improve the production efficiency of NDSI products, meet the demand of large-scale remote sensing data processed timely and efficiently, this paper build NDSI products production system based on Hadoop platform, and the system mainly includes the remote sensing image management module, NDSI production module, and system service module. Main research contents and results including: (1)The remote sensing image management module: includes image import and image metadata management two parts. Import mass basis IRS images and NDSI product images (the system performing the production task output) into HDFS file system; At the same time, read the corresponding orbit ranks number, maximum/minimum longitude and latitude, product date, HDFS storage path, Hadoop task ID (NDSI products), and other metadata information, and then create thumbnails, and unique ID number for each record distribution, import it into base/product image metadata database. (2)NDSI production module: includes

  16. Satellite Imagery Production and Processing Using Apache Hadoop

    NASA Astrophysics Data System (ADS)

    Hill, D. V.; Werpy, J.

    2011-12-01

    The United States Geological Survey's (USGS) Earth Resources Observation and Science (EROS) Center Land Science Research and Development (LSRD) project has devised a method to fulfill its processing needs for Essential Climate Variable (ECV) production from the Landsat archive using Apache Hadoop. Apache Hadoop is the distributed processing technology at the heart of many large-scale, processing solutions implemented at well-known companies such as Yahoo, Amazon, and Facebook. It is a proven framework and can be used to process petabytes of data on thousands of processors concurrently. It is a natural fit for producing satellite imagery and requires only a few simple modifications to serve the needs of science data processing. This presentation provides an invaluable learning opportunity and should be heard by anyone doing large scale image processing today. The session will cover a description of the problem space, evaluation of alternatives, feature set overview, configuration of Hadoop for satellite image processing, real-world performance results, tuning recommendations and finally challenges and ongoing activities. It will also present how the LSRD project built a 102 core processing cluster with no financial hardware investment and achieved ten times the initial daily throughput requirements with a full time staff of only one engineer. Satellite Imagery Production and Processing Using Apache Hadoop is presented by David V. Hill, Principal Software Architect for USGS LSRD.

  17. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

    PubMed Central

    Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo

    2014-01-01

    Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts. Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/ Contact: andre.schumacher@yahoo.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24149054

  18. Analyzing petabytes of data with Hadoop

    ScienceCinema

    Hammerbacher, Jeff

    2018-05-14

    The open source Apache Hadoop project provides a powerful suite of tools for storing and analyzing petabytes of data using commodity hardware. After several years of production use inside of web companies like Yahoo! and Facebook and nearly a year of commercial support and development by Cloudera, the technology is spreading rapidly through other disciplines, from financial services and government to life sciences and high energy physics. The talk will motivate the design of Hadoop and discuss some key implementation details in depth. It will also cover the major subprojects in the Hadoop ecosystem, go over some example applications, highlight best practices for deploying Hadoop in your environment, discuss plans for the future of the technology, and provide pointers to the many resources available for learning more. In addition to providing more information about the Hadoop platform, a major goal of this talk is to begin a dialogue with the ATLAS research team on how the tools commonly used in their environment compare to Hadoop, and how Hadoop could improve better to serve the high energy physics community. Short Biography: Jeff Hammerbacher is Vice President of Products and Chief Scientist at Cloudera. Jeff was an Entrepreneur in Residence at Accel Partners immediately prior to founding Cloudera. Before Accel, he conceived, built, and led the Data team at Facebook. The Data team was responsible for driving many of the applications of statistics and machine learning at Facebook, as well as building out the infrastructure to support these tasks for massive data sets. The team produced two open source projects: Hive, a system for offline analysis built above Hadoop, and Cassandra, a structured storage system on a P2P network. Before joining Facebook, Jeff was a quantitative analyst on Wall Street. Jeff earned his Bachelor's Degree in Mathematics from Harvard University and recently served as contributing editor to the book "Beautiful Data", published by O'Reilly in

  19. Analyzing petabytes of data with Hadoop

    SciTech Connect

    Hammerbacher, Jeff

    The open source Apache Hadoop project provides a powerful suite of tools for storing and analyzing petabytes of data using commodity hardware. After several years of production use inside of web companies like Yahoo! and Facebook and nearly a year of commercial support and development by Cloudera, the technology is spreading rapidly through other disciplines, from financial services and government to life sciences and high energy physics. The talk will motivate the design of Hadoop and discuss some key implementation details in depth. It will also cover the major subprojects in the Hadoop ecosystem, go over some example applications, highlightmore » best practices for deploying Hadoop in your environment, discuss plans for the future of the technology, and provide pointers to the many resources available for learning more. In addition to providing more information about the Hadoop platform, a major goal of this talk is to begin a dialogue with the ATLAS research team on how the tools commonly used in their environment compare to Hadoop, and how Hadoop could improve better to serve the high energy physics community. Short Biography: Jeff Hammerbacher is Vice President of Products and Chief Scientist at Cloudera. Jeff was an Entrepreneur in Residence at Accel Partners immediately prior to founding Cloudera. Before Accel, he conceived, built, and led the Data team at Facebook. The Data team was responsible for driving many of the applications of statistics and machine learning at Facebook, as well as building out the infrastructure to support these tasks for massive data sets. The team produced two open source projects: Hive, a system for offline analysis built above Hadoop, and Cassandra, a structured storage system on a P2P network. Before joining Facebook, Jeff was a quantitative analyst on Wall Street. Jeff earned his Bachelor's Degree in Mathematics from Harvard University and recently served as contributing editor to the book "Beautiful Data", published by O

  20. Database Objects vs Files: Evaluation of alternative strategies for managing large remote sensing data

    NASA Astrophysics Data System (ADS)

    Baru, Chaitan; Nandigam, Viswanath; Krishnan, Sriram

    2010-05-01

    Increasingly, the geoscience user community expects modern IT capabilities to be available in service of their research and education activities, including the ability to easily access and process large remote sensing datasets via online portals such as GEON (www.geongrid.org) and OpenTopography (opentopography.org). However, serving such datasets via online data portals presents a number of challenges. In this talk, we will evaluate the pros and cons of alternative storage strategies for management and processing of such datasets using binary large object implementations (BLOBs) in database systems versus implementation in Hadoop files using the Hadoop Distributed File System (HDFS). The storage and I/O requirements for providing online access to large datasets dictate the need for declustering data across multiple disks, for capacity as well as bandwidth and response time performance. This requires partitioning larger files into a set of smaller files, and is accompanied by the concomitant requirement for managing large numbers of file. Storing these sub-files as blobs in a shared-nothing database implemented across a cluster provides the advantage that all the distributed storage management is done by the DBMS. Furthermore, subsetting and processing routines can be implemented as user-defined functions (UDFs) on these blobs and would run in parallel across the set of nodes in the cluster. On the other hand, there are both storage overheads and constraints, and software licensing dependencies created by such an implementation. Another approach is to store the files in an external filesystem with pointers to them from within database tables. The filesystem may be a regular UNIX filesystem, a parallel filesystem, or HDFS. In the HDFS case, HDFS would provide the file management capability, while the subsetting and processing routines would be implemented as Hadoop programs using the MapReduce model. Hadoop and its related software libraries are freely available

  1. Mining on Big Data Using Hadoop MapReduce Model

    NASA Astrophysics Data System (ADS)

    Salman Ahmed, G.; Bhattacharya, Sweta

    2017-11-01

    Customary parallel calculations for mining nonstop item create opportunity to adjust stack of similar data among hubs. The paper aims to review this process by analyzing the critical execution downside of the common parallel recurrent item-set mining calculations. Given a larger than average dataset, data apportioning strategies inside the current arrangements endure high correspondence and mining overhead evoked by repetitive exchanges transmitted among registering hubs. We tend to address this downside by building up a learning apportioning approach referred as Hadoop abuse using the map-reduce programming model. All objectives of Hadoop are to zest up the execution of parallel recurrent item-set mining on Hadoop bunches. Fusing the comparability metric and furthermore the locality-sensitive hashing procedure, Hadoop puts to a great degree comparative exchanges into an information segment to lift neighborhood while not making AN exorbitant assortment of excess exchanges. We tend to execute Hadoop on a 34-hub Hadoop bunch, driven by a decent change of datasets made by IBM quest market-basket manufactured data generator. Trial uncovers the fact that Hadoop contributes towards lessening system and processing masses by the uprightness of dispensing with excess exchanges on Hadoop hubs. Hadoop impressively outperforms and enhances the other models considerably.

  2. Design and development of a medical big data processing system based on Hadoop.

    PubMed

    Yao, Qin; Tian, Yu; Li, Peng-Fei; Tian, Li-Li; Qian, Yang-Ming; Li, Jing-Song

    2015-03-01

    Secondary use of medical big data is increasingly popular in healthcare services and clinical research. Understanding the logic behind medical big data demonstrates tendencies in hospital information technology and shows great significance for hospital information systems that are designing and expanding services. Big data has four characteristics--Volume, Variety, Velocity and Value (the 4 Vs)--that make traditional systems incapable of processing these data using standalones. Apache Hadoop MapReduce is a promising software framework for developing applications that process vast amounts of data in parallel with large clusters of commodity hardware in a reliable, fault-tolerant manner. With the Hadoop framework and MapReduce application program interface (API), we can more easily develop our own MapReduce applications to run on a Hadoop framework that can scale up from a single node to thousands of machines. This paper investigates a practical case of a Hadoop-based medical big data processing system. We developed this system to intelligently process medical big data and uncover some features of hospital information system user behaviors. This paper studies user behaviors regarding various data produced by different hospital information systems for daily work. In this paper, we also built a five-node Hadoop cluster to execute distributed MapReduce algorithms. Our distributed algorithms show promise in facilitating efficient data processing with medical big data in healthcare services and clinical research compared with single nodes. Additionally, with medical big data analytics, we can design our hospital information systems to be much more intelligent and easier to use by making personalized recommendations.

  3. Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files.

    PubMed

    Sun, Xiaobo; Gao, Jingjing; Jin, Peng; Eng, Celeste; Burchard, Esteban G; Beaty, Terri H; Ruczinski, Ingo; Mathias, Rasika A; Barnes, Kathleen; Wang, Fusheng; Qin, Zhaohui S

    2018-06-01

    Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.

  4. 'Big data', Hadoop and cloud computing in genomics.

    PubMed

    O'Driscoll, Aisling; Daugelaite, Jurate; Sleator, Roy D

    2013-10-01

    Since the completion of the Human Genome project at the turn of the Century, there has been an unprecedented proliferation of genomic sequence data. A consequence of this is that the medical discoveries of the future will largely depend on our ability to process and analyse large genomic data sets, which continue to expand as the cost of sequencing decreases. Herein, we provide an overview of cloud computing and big data technologies, and discuss how such expertise can be used to deal with biology's big data sets. In particular, big data technologies such as the Apache Hadoop project, which provides distributed and parallelised data processing and analysis of petabyte (PB) scale data sets will be discussed, together with an overview of the current usage of Hadoop within the bioinformatics community. Copyright © 2013 Elsevier Inc. All rights reserved.

  5. Local alignment tool based on Hadoop framework and GPU architecture.

    PubMed

    Hung, Che-Lun; Hua, Guan-Jie

    2014-01-01

    With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data the computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with the big biology data, it is hard to rely on single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results present that the proposed method can improve the performance of BLASTP on single GPU, and also it can achieve high availability and fault tolerance.

  6. Application research of Ganglia in Hadoop monitoring and management

    NASA Astrophysics Data System (ADS)

    Li, Gang; Ding, Jing; Zhou, Lixia; Yang, Yi; Liu, Lei; Wang, Xiaolei

    2017-03-01

    There are many applications of Hadoop System in the field of large data, cloud computing. The test bench of storage and application in seismic network at Earthquake Administration of Tianjin use with Hadoop system, which is used the open source software of Ganglia to operate and monitor. This paper reviews the function, installation and configuration process, application effect of operating and monitoring in Hadoop system of the Ganglia system. It briefly introduces the idea and effect of Nagios software monitoring Hadoop system. It is valuable for the industry in the monitoring system of cloud computing platform.

  7. How Heterogeneity Affects the Design of Hadoop MapReduce Schedulers: A State-of-the-Art Survey and Challenges.

    PubMed

    Pandey, Vaibhav; Saini, Poonam

    2018-06-01

    MapReduce (MR) computing paradigm and its open source implementation Hadoop have become a de facto standard to process big data in a distributed environment. Initially, the Hadoop system was homogeneous in three significant aspects, namely, user, workload, and cluster (hardware). However, with growing variety of MR jobs and inclusion of different configurations of nodes in the existing cluster, heterogeneity has become an essential part of Hadoop systems. The heterogeneity factors adversely affect the performance of a Hadoop scheduler and limit the overall throughput of the system. To overcome this problem, various heterogeneous Hadoop schedulers have been proposed in the literature. Existing survey works in this area mostly cover homogeneous schedulers and classify them on the basis of quality of service parameters they optimize. Hence, there is a need to study the heterogeneous Hadoop schedulers on the basis of various heterogeneity factors considered by them. In this survey article, we first discuss different heterogeneity factors that typically exist in a Hadoop system and then explore various challenges that arise while designing the schedulers in the presence of such heterogeneity. Afterward, we present the comparative study of heterogeneous scheduling algorithms available in the literature and classify them by the previously said heterogeneity factors. Lastly, we investigate different methods and environment used for evaluation of discussed Hadoop schedulers.

  8. Use of the Hadoop structured storage tools for the ATLAS EventIndex event catalogue

    NASA Astrophysics Data System (ADS)

    Favareto, A.

    2016-09-01

    The ATLAS experiment at the LHC collects billions of events each data-taking year, and processes them to make them available for physics analysis in several different formats. An even larger amount of events is in addition simulated according to physics and detector models and then reconstructed and analysed to be compared to real events. The EventIndex is a catalogue of all events in each production stage; it includes for each event a few identification parameters, some basic non-mutable information coming from the online system, and the references to the files that contain the event in each format (plus the internal pointers to the event within each file for quick retrieval). Each EventIndex record is logically simple but the system has to hold many tens of billions of records, all equally important. The Hadoop technology was selected at the start of the EventIndex project development in 2012 and proved to be robust and flexible to accommodate this kind of information; both the insertion and query response times are acceptable for the continuous and automatic operation that started in Spring 2015. This paper describes the EventIndex data input and organisation in Hadoop and explains the operational challenges that were overcome in order to achieve the expected performance.

  9. Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

    PubMed Central

    Gao, Jingjing; Jin, Peng; Eng, Celeste; Burchard, Esteban G; Beaty, Terri H; Ruczinski, Ingo; Mathias, Rasika A; Barnes, Kathleen; Wang, Fusheng

    2018-01-01

    Abstract Background Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)–based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems. PMID:29762754

  10. Local Alignment Tool Based on Hadoop Framework and GPU Architecture

    PubMed Central

    Hung, Che-Lun; Hua, Guan-Jie

    2014-01-01

    With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data the computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with the big biology data, it is hard to rely on single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results present that the proposed method can improve the performance of BLASTP on single GPU, and also it can achieve high availability and fault tolerance. PMID:24955362

  11. Hadoop-Based Distributed System for Online Prediction of Air Pollution Based on Support Vector Machine

    NASA Astrophysics Data System (ADS)

    Ghaemi, Z.; Farnaghi, M.; Alimohammadi, A.

    2015-12-01

    The critical impact of air pollution on human health and environment in one hand and the complexity of pollutant concentration behavior in the other hand lead the scientists to look for advance techniques for monitoring and predicting the urban air quality. Additionally, recent developments in data measurement techniques have led to collection of various types of data about air quality. Such data is extremely voluminous and to be useful it must be processed at high velocity. Due to the complexity of big data analysis especially for dynamic applications, online forecasting of pollutant concentration trends within a reasonable processing time is still an open problem. The purpose of this paper is to present an online forecasting approach based on Support Vector Machine (SVM) to predict the air quality one day in advance. In order to overcome the computational requirements for large-scale data analysis, distributed computing based on the Hadoop platform has been employed to leverage the processing power of multiple processing units. The MapReduce programming model is adopted for massive parallel processing in this study. Based on the online algorithm and Hadoop framework, an online forecasting system is designed to predict the air pollution of Tehran for the next 24 hours. The results have been assessed on the basis of Processing Time and Efficiency. Quite accurate predictions of air pollutant indicator levels within an acceptable processing time prove that the presented approach is very suitable to tackle large scale air pollution prediction problems.

  12. CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce.

    PubMed

    Chung, Wei-Chun; Chen, Chien-Chih; Ho, Jan-Ming; Lin, Chung-Yen; Hsu, Wen-Lian; Wang, Yu-Chun; Lee, D T; Lai, Feipei; Huang, Chih-Wei; Chang, Yu-Jung

    2014-01-01

    Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce. We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, thus liberating scientists from having to perform complicated operational procedures. Users are guided through the user interface to deploy a Hadoop cloud within in-house computing environments and to run applications specifically targeted for bioinformatics, including CloudBurst, CloudBrush, and CloudRS. One may also use CloudDOE on top of a public cloud. CloudDOE consists of three wizards, i.e., Deploy, Operate, and Extend wizards. Deploy wizard is designed to aid the system administrator to deploy a Hadoop cloud. It installs Java runtime environment version 1.6 and Hadoop version 0.20.203, and initiates the service automatically. Operate wizard allows the user to run a MapReduce application on the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using Extend wizard. CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the

  13. CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds and Analyzing High-Throughput Sequencing Data with MapReduce

    PubMed Central

    Chung, Wei-Chun; Chen, Chien-Chih; Ho, Jan-Ming; Lin, Chung-Yen; Hsu, Wen-Lian; Wang, Yu-Chun; Lee, D. T.; Lai, Feipei; Huang, Chih-Wei; Chang, Yu-Jung

    2014-01-01

    Background Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce. Results We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, thus liberating scientists from having to perform complicated operational procedures. Users are guided through the user interface to deploy a Hadoop cloud within in-house computing environments and to run applications specifically targeted for bioinformatics, including CloudBurst, CloudBrush, and CloudRS. One may also use CloudDOE on top of a public cloud. CloudDOE consists of three wizards, i.e., Deploy, Operate, and Extend wizards. Deploy wizard is designed to aid the system administrator to deploy a Hadoop cloud. It installs Java runtime environment version 1.6 and Hadoop version 0.20.203, and initiates the service automatically. Operate wizard allows the user to run a MapReduce application on the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using Extend wizard. Conclusions CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users

  14. DistMap: a toolkit for distributed short read mapping on a Hadoop cluster.

    PubMed

    Pandey, Ram Vinay; Schlötterer, Christian

    2013-01-01

    With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/

  15. The Automatic Recognition of the Abnormal Sky-subtraction Spectra Based on Hadoop

    NASA Astrophysics Data System (ADS)

    An, An; Pan, Jingchang

    2017-10-01

    The skylines, superimposing on the target spectrum as a main noise, If the spectrum still contains a large number of high strength skylight residuals after sky-subtraction processing, it will not be conducive to the follow-up analysis of the target spectrum. At the same time, the LAMOST can observe a quantity of spectroscopic data in every night. We need an efficient platform to proceed the recognition of the larger numbers of abnormal sky-subtraction spectra quickly. Hadoop, as a distributed parallel data computing platform, can deal with large amounts of data effectively. In this paper, we conduct the continuum normalization firstly and then a simple and effective method will be presented to automatic recognize the abnormal sky-subtraction spectra based on Hadoop platform. Obtain through the experiment, the Hadoop platform can implement the recognition with more speed and efficiency, and the simple method can recognize the abnormal sky-subtraction spectra and find the abnormal skyline positions of different residual strength effectively, can be applied to the automatic detection of abnormal sky-subtraction of large number of spectra.

  16. DistMap: A Toolkit for Distributed Short Read Mapping on a Hadoop Cluster

    PubMed Central

    Pandey, Ram Vinay; Schlötterer, Christian

    2013-01-01

    With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/ PMID:24009693

  17. Large-scale seismic signal analysis with Hadoop

    DOE PAGES

    Addair, T. G.; Dodge, D. A.; Walter, W. R.; ...

    2014-02-11

    In seismology, waveform cross correlation has been used for years to produce high-precision hypocenter locations and for sensitive detectors. Because correlated seismograms generally are found only at small hypocenter separation distances, correlation detectors have historically been reserved for spotlight purposes. However, many regions have been found to produce large numbers of correlated seismograms, and there is growing interest in building next-generation pipelines that employ correlation as a core part of their operation. In an effort to better understand the distribution and behavior of correlated seismic events, we have cross correlated a global dataset consisting of over 300 million seismograms. Thismore » was done using a conventional distributed cluster, and required 42 days. In anticipation of processing much larger datasets, we have re-architected the system to run as a series of MapReduce jobs on a Hadoop cluster. In doing so we achieved a factor of 19 performance increase on a test dataset. We found that fundamental algorithmic transformations were required to achieve the maximum performance increase. Whereas in the original IO-bound implementation, we went to great lengths to minimize IO, in the Hadoop implementation where IO is cheap, we were able to greatly increase the parallelism of our algorithms by performing a tiered series of very fine-grained (highly parallelizable) transformations on the data. Each of these MapReduce jobs required reading and writing large amounts of data.« less

  18. Large-scale seismic signal analysis with Hadoop

    SciTech Connect

    Addair, T. G.; Dodge, D. A.; Walter, W. R.

    In seismology, waveform cross correlation has been used for years to produce high-precision hypocenter locations and for sensitive detectors. Because correlated seismograms generally are found only at small hypocenter separation distances, correlation detectors have historically been reserved for spotlight purposes. However, many regions have been found to produce large numbers of correlated seismograms, and there is growing interest in building next-generation pipelines that employ correlation as a core part of their operation. In an effort to better understand the distribution and behavior of correlated seismic events, we have cross correlated a global dataset consisting of over 300 million seismograms. Thismore » was done using a conventional distributed cluster, and required 42 days. In anticipation of processing much larger datasets, we have re-architected the system to run as a series of MapReduce jobs on a Hadoop cluster. In doing so we achieved a factor of 19 performance increase on a test dataset. We found that fundamental algorithmic transformations were required to achieve the maximum performance increase. Whereas in the original IO-bound implementation, we went to great lengths to minimize IO, in the Hadoop implementation where IO is cheap, we were able to greatly increase the parallelism of our algorithms by performing a tiered series of very fine-grained (highly parallelizable) transformations on the data. Each of these MapReduce jobs required reading and writing large amounts of data.« less

  19. Hadoop for High-Performance Climate Analytics: Use Cases and Lessons Learned

    NASA Technical Reports Server (NTRS)

    Tamkin, Glenn

    2013-01-01

    Scientific data services are a critical aspect of the NASA Center for Climate Simulations mission (NCCS). Hadoop, via MapReduce, provides an approach to high-performance analytics that is proving to be useful to data intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. The NCCS is particularly interested in the potential of Hadoop to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we prototyped a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. The initial focus was on averaging operations over arbitrary spatial and temporal extents within Modern Era Retrospective- Analysis for Research and Applications (MERRA) data. After preliminary results suggested that this approach improves efficiencies within data intensive analytic workflows, we invested in building a cyber infrastructure resource for developing a new generation of climate data analysis capabilities using Hadoop. This resource is focused on reducing the time spent in the preparation of reanalysis data used in data-model inter-comparison, a long sought goal of the climate community. This paper summarizes the related use cases and lessons learned.

  20. Integration of Oracle and Hadoop: Hybrid Databases Affordable at Scale

    NASA Astrophysics Data System (ADS)

    Canali, L.; Baranowski, Z.; Kothuri, P.

    2017-10-01

    This work reports on the activities aimed at integrating Oracle and Hadoop technologies for the use cases of CERN database services and in particular on the development of solutions for offloading data and queries from Oracle databases into Hadoop-based systems. The goal and interest of this investigation is to increase the scalability and optimize the cost/performance footprint for some of our largest Oracle databases. These concepts have been applied, among others, to build offline copies of CERN accelerator controls and logging databases. The tested solution allows to run reports on the controls data offloaded in Hadoop without affecting the critical production database, providing both performance benefits and cost reduction for the underlying infrastructure. Other use cases discussed include building hybrid database solutions with Oracle and Hadoop, offering the combined advantages of a mature relational database system with a scalable analytics engine.

  1. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce.

    PubMed

    Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel

    2013-08-01

    Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive.

  2. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce

    PubMed Central

    Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel

    2013-01-01

    Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS – a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive. PMID:24187650

  3. A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex

    NASA Astrophysics Data System (ADS)

    Baranowski, Z.; Canali, L.; Toebbicke, R.; Hrivnac, J.; Barberis, D.

    2017-10-01

    This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each of which consists of ∼100 bytes, all having the same probability to be searched or counted. Data formats represent one important area for optimizing the performance and storage footprint of applications based on Hadoop. This work reports on the production usage and on tests using several data formats including Map Files, Apache Parquet, Avro, and various compression algorithms. The query engine plays also a critical role in the architecture. We report also on the use of HBase for the EventIndex, focussing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface, and the optimizations for data warehouse workloads and reports.

  4. Large-scale seismic waveform quality metric calculation using Hadoop

    DOE PAGES

    Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.; ...

    2016-05-27

    Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/Omore » performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will

  5. Large-scale seismic waveform quality metric calculation using Hadoop

    SciTech Connect

    Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.

    Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/Omore » performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will

  6. Large-scale seismic waveform quality metric calculation using Hadoop

    NASA Astrophysics Data System (ADS)

    Magana-Zook, S.; Gaylord, J. M.; Knapp, D. R.; Dodge, D. A.; Ruppert, S. D.

    2016-09-01

    In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of 0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. These experiments were conducted multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely

  7. Leverage hadoop framework for large scale clinical informatics applications.

    PubMed

    Dong, Xiao; Bahroos, Neil; Sadhu, Eugene; Jackson, Tommie; Chukhman, Morris; Johnson, Robert; Boyd, Andrew; Hynes, Denise

    2013-01-01

    In this manuscript, we present our experiences using the Apache Hadoop framework for high data volume and computationally intensive applications, and discuss some best practice guidelines in a clinical informatics setting. There are three main aspects in our approach: (a) process and integrate diverse, heterogeneous data sources using standard Hadoop programming tools and customized MapReduce programs; (b) after fine-grained aggregate results are obtained, perform data analysis using the Mahout data mining library; (c) leverage the column oriented features in HBase for patient centric modeling and complex temporal reasoning. This framework provides a scalable solution to meet the rapidly increasing, imperative "Big Data" needs of clinical and translational research. The intrinsic advantage of fault tolerance, high availability and scalability of Hadoop platform makes these applications readily deployable at the enterprise level cluster environment.

  8. Design and Verification of Remote Sensing Image Data Center Storage Architecture Based on Hadoop

    NASA Astrophysics Data System (ADS)

    Tang, D.; Zhou, X.; Jing, Y.; Cong, W.; Li, C.

    2018-04-01

    The data center is a new concept of data processing and application proposed in recent years. It is a new method of processing technologies based on data, parallel computing, and compatibility with different hardware clusters. While optimizing the data storage management structure, it fully utilizes cluster resource computing nodes and improves the efficiency of data parallel application. This paper used mature Hadoop technology to build a large-scale distributed image management architecture for remote sensing imagery. Using MapReduce parallel processing technology, it called many computing nodes to process image storage blocks and pyramids in the background to improve the efficiency of image reading and application and sovled the need for concurrent multi-user high-speed access to remotely sensed data. It verified the rationality, reliability and superiority of the system design by testing the storage efficiency of different image data and multi-users and analyzing the distributed storage architecture to improve the application efficiency of remote sensing images through building an actual Hadoop service system.

  9. Distributed Data Collection for the ATLAS EventIndex

    NASA Astrophysics Data System (ADS)

    Sánchez, J.; Fernández Casaní, A.; González de la Hoz, S.

    2015-12-01

    The ATLAS EventIndex contains records of all events processed by ATLAS, in all processing stages. These records include the references to the files containing each event (the GUID of the file) and the internal pointer to each event in the file. This information is collected by all jobs that run at Tier-0 or on the Grid and process ATLAS events. Each job produces a snippet of information for each permanent output file. This information is packed and transferred to a central broker at CERN using an ActiveMQ messaging system, and then is unpacked, sorted and reformatted in order to be stored and catalogued into a central Hadoop server. This contribution describes in detail the Producer/Consumer architecture to convey this information from the running jobs through the messaging system to the Hadoop server.

  10. Hadoop-based implementation of processing medical diagnostic records for visual patient system

    NASA Astrophysics Data System (ADS)

    Yang, Yuanyuan; Shi, Liehang; Xie, Zhe; Zhang, Jianguo

    2018-03-01

    We have innovatively introduced Visual Patient (VP) concept and method visually to represent and index patient imaging diagnostic records (IDR) in last year SPIE Medical Imaging (SPIE MI 2017), which can enable a doctor to review a large amount of IDR of a patient in a limited appointed time slot. In this presentation, we presented a new approach to design data processing architecture of VP system (VPS) to acquire, process and store various kinds of IDR to build VP instance for each patient in hospital environment based on Hadoop distributed processing structure. We designed this system architecture called Medical Information Processing System (MIPS) with a combination of Hadoop batch processing architecture and Storm stream processing architecture. The MIPS implemented parallel processing of various kinds of clinical data with high efficiency, which come from disparate hospital information system such as PACS, RIS LIS and HIS.

  11. Theoretical and Empirical Comparison of Big Data Image Processing with Apache Hadoop and Sun Grid Engine.

    PubMed

    Bao, Shunxing; Weitendorf, Frederick D; Plassard, Andrew J; Huo, Yuankai; Gokhale, Aniruddha; Landman, Bennett A

    2017-02-11

    The field of big data is generally concerned with the scale of processing at which traditional computational paradigms break down. In medical imaging, traditional large scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit that is controlled by a job scheduler. Typically, a shared-storage network file system (NFS) is used to host imaging data. However, data transfer from storage to processing nodes can saturate network bandwidth when data is frequently uploaded/retrieved from the NFS, e.g., "short" processing times and/or "large" datasets. Recently, an alternative approach using Hadoop and HBase was presented for medical imaging to enable co-location of data storage and computation while minimizing data transfer. The benefits of using such a framework must be formally evaluated against a traditional approach to characterize the point at which simply "large scale" processing transitions into "big data" and necessitates alternative computational frameworks. The proposed Hadoop system was implemented on a production lab-cluster alongside a standard Sun Grid Engine (SGE). Theoretical models for wall-clock time and resource time for both approaches are introduced and validated. To provide real example data, three T1 image archives were retrieved from a university secure, shared web database and used to empirically assess computational performance under three configurations of cluster hardware (using 72, 109, or 209 CPU cores) with differing job lengths. Empirical results match the theoretical models. Based on these data, a comparative analysis is presented for when the Hadoop framework will be relevant and non-relevant for medical imaging.

  12. Theoretical and empirical comparison of big data image processing with Apache Hadoop and Sun Grid Engine

    NASA Astrophysics Data System (ADS)

    Bao, Shunxing; Weitendorf, Frederick D.; Plassard, Andrew J.; Huo, Yuankai; Gokhale, Aniruddha; Landman, Bennett A.

    2017-03-01

    The field of big data is generally concerned with the scale of processing at which traditional computational paradigms break down. In medical imaging, traditional large scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit that is controlled by a job scheduler. Typically, a shared-storage network file system (NFS) is used to host imaging data. However, data transfer from storage to processing nodes can saturate network bandwidth when data is frequently uploaded/retrieved from the NFS, e.g., "short" processing times and/or "large" datasets. Recently, an alternative approach using Hadoop and HBase was presented for medical imaging to enable co-location of data storage and computation while minimizing data transfer. The benefits of using such a framework must be formally evaluated against a traditional approach to characterize the point at which simply "large scale" processing transitions into "big data" and necessitates alternative computational frameworks. The proposed Hadoop system was implemented on a production lab-cluster alongside a standard Sun Grid Engine (SGE). Theoretical models for wall-clock time and resource time for both approaches are introduced and validated. To provide real example data, three T1 image archives were retrieved from a university secure, shared web database and used to empirically assess computational performance under three configurations of cluster hardware (using 72, 109, or 209 CPU cores) with differing job lengths. Empirical results match the theoretical models. Based on these data, a comparative analysis is presented for when the Hadoop framework will be relevant and nonrelevant for medical imaging.

  13. Chaos-Based Simultaneous Compression and Encryption for Hadoop.

    PubMed

    Usama, Muhammad; Zakaria, Nordin

    2017-01-01

    Data compression and encryption are key components of commonly deployed platforms such as Hadoop. Numerous data compression and encryption tools are presently available on such platforms and the tools are characteristically applied in sequence, i.e., compression followed by encryption or encryption followed by compression. This paper focuses on the open-source Hadoop framework and proposes a data storage method that efficiently couples data compression with encryption. A simultaneous compression and encryption scheme is introduced that addresses an important implementation issue of source coding based on Tent Map and Piece-wise Linear Chaotic Map (PWLM), which is the infinite precision of real numbers that result from their long products. The approach proposed here solves the implementation issue by removing fractional components that are generated by the long products of real numbers. Moreover, it incorporates a stealth key that performs a cyclic shift in PWLM without compromising compression capabilities. In addition, the proposed approach implements a masking pseudorandom keystream that enhances encryption quality. The proposed algorithm demonstrated a congruent fit within the Hadoop framework, providing robust encryption security and compression.

  14. Chaos-Based Simultaneous Compression and Encryption for Hadoop

    PubMed Central

    Zakaria, Nordin

    2017-01-01

    Data compression and encryption are key components of commonly deployed platforms such as Hadoop. Numerous data compression and encryption tools are presently available on such platforms and the tools are characteristically applied in sequence, i.e., compression followed by encryption or encryption followed by compression. This paper focuses on the open-source Hadoop framework and proposes a data storage method that efficiently couples data compression with encryption. A simultaneous compression and encryption scheme is introduced that addresses an important implementation issue of source coding based on Tent Map and Piece-wise Linear Chaotic Map (PWLM), which is the infinite precision of real numbers that result from their long products. The approach proposed here solves the implementation issue by removing fractional components that are generated by the long products of real numbers. Moreover, it incorporates a stealth key that performs a cyclic shift in PWLM without compromising compression capabilities. In addition, the proposed approach implements a masking pseudorandom keystream that enhances encryption quality. The proposed algorithm demonstrated a congruent fit within the Hadoop framework, providing robust encryption security and compression. PMID:28072850

  15. Hadoop Oriented Smart Cities Architecture.

    PubMed

    Diaconita, Vlad; Bologa, Ana-Ramona; Bologa, Razvan

    2018-04-12

    A smart city implies a consistent use of technology for the benefit of the community. As the city develops over time, components and subsystems such as smart grids, smart water management, smart traffic and transportation systems, smart waste management systems, smart security systems, or e-governance are added. These components ingest and generate a multitude of structured, semi-structured or unstructured data that may be processed using a variety of algorithms in batches, micro batches or in real-time. The ICT architecture must be able to handle the increased storage and processing needs. When vertical scaling is no longer a viable solution, Hadoop can offer efficient linear horizontal scaling, solving storage, processing, and data analyses problems in many ways. This enables architects and developers to choose a stack according to their needs and skill-levels. In this paper, we propose a Hadoop-based architectural stack that can provide the ICT backbone for efficiently managing a smart city. On the one hand, Hadoop, together with Spark and the plethora of NoSQL databases and accompanying Apache projects, is a mature ecosystem. This is one of the reasons why it is an attractive option for a Smart City architecture. On the other hand, it is also very dynamic; things can change very quickly, and many new frameworks, products and options continue to emerge as others decline. To construct an optimized, modern architecture, we discuss and compare various products and engines based on a process that takes into consideration how the products perform and scale, as well as the reusability of the code, innovations, features, and support and interest in online communities.

  16. Hadoop Oriented Smart Cities Architecture

    PubMed Central

    Bologa, Ana-Ramona; Bologa, Razvan

    2018-01-01

    A smart city implies a consistent use of technology for the benefit of the community. As the city develops over time, components and subsystems such as smart grids, smart water management, smart traffic and transportation systems, smart waste management systems, smart security systems, or e-governance are added. These components ingest and generate a multitude of structured, semi-structured or unstructured data that may be processed using a variety of algorithms in batches, micro batches or in real-time. The ICT architecture must be able to handle the increased storage and processing needs. When vertical scaling is no longer a viable solution, Hadoop can offer efficient linear horizontal scaling, solving storage, processing, and data analyses problems in many ways. This enables architects and developers to choose a stack according to their needs and skill-levels. In this paper, we propose a Hadoop-based architectural stack that can provide the ICT backbone for efficiently managing a smart city. On the one hand, Hadoop, together with Spark and the plethora of NoSQL databases and accompanying Apache projects, is a mature ecosystem. This is one of the reasons why it is an attractive option for a Smart City architecture. On the other hand, it is also very dynamic; things can change very quickly, and many new frameworks, products and options continue to emerge as others decline. To construct an optimized, modern architecture, we discuss and compare various products and engines based on a process that takes into consideration how the products perform and scale, as well as the reusability of the code, innovations, features, and support and interest in online communities. PMID:29649172

  17. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.

    PubMed

    Taylor, Ronald C

    2010-12-21

    Bioinformatics researchers are now confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employ Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date. Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many data analysis algorithms.

  18. Scalable Regression Tree Learning on Hadoop using OpenPlanet

    SciTech Connect

    Yin, Wei; Simmhan, Yogesh; Prasanna, Viktor

    As scientific and engineering domains attempt to effectively analyze the deluge of data arriving from sensors and instruments, machine learning is becoming a key data mining tool to build prediction models. Regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables based on a set of input features. Map Reduce is well suited for addressing such data intensive learning applications, and a proprietary regression tree algorithm, PLANET, using MapReduce has been proposed earlier. In this paper, we describe an open source implement of this algorithm, OpenPlanet, on the Hadoop framework usingmore » a hybrid approach. Further, we evaluate the performance of OpenPlanet using realworld datasets from the Smart Power Grid domain to perform energy use forecasting, and propose tuning strategies of Hadoop parameters to improve the performance of the default configuration by 75% for a training dataset of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.« less

  19. Adding Data Management Services to Parallel File Systems

    SciTech Connect

    Brandt, Scott

    2015-03-04

    -based ecosystem; (3) common optimizations, e.g., indexing and caching, are readily supported across several file formats, avoiding effort duplication; and (4) performance improves significantly, as data processing is integrated more tightly with data storage. Our key contributions are: SciHadoop which explores changes to MapReduce assumption by taking advantage of semantics of structured data while preserving MapReduce’s failure and resource management; DataMods which extends common abstractions of parallel file systems so they become programmable such that they can be extended to natively support a variety of data models and can be hooked into emerging distributed runtimes such as Stanford’s Legion; and Miso which combines Hadoop and relational data warehousing to minimize time to insight, taking into account the overhead of ingesting data into data warehousing.« less

  20. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

    PubMed Central

    2010-01-01

    Background Bioinformatics researchers are now confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. Description An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employ Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date. Conclusions Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many data analysis algorithms. PMID:21210976

  1. Theoretical and Empirical Comparison of Big Data Image Processing with Apache Hadoop and Sun Grid Engine

    PubMed Central

    Bao, Shunxing; Weitendorf, Frederick D.; Plassard, Andrew J.; Huo, Yuankai; Gokhale, Aniruddha; Landman, Bennett A.

    2016-01-01

    The field of big data is generally concerned with the scale of processing at which traditional computational paradigms break down. In medical imaging, traditional large scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit that is controlled by a job scheduler. Typically, a shared-storage network file system (NFS) is used to host imaging data. However, data transfer from storage to processing nodes can saturate network bandwidth when data is frequently uploaded/retrieved from the NFS, e.g., “short” processing times and/or “large” datasets. Recently, an alternative approach using Hadoop and HBase was presented for medical imaging to enable co-location of data storage and computation while minimizing data transfer. The benefits of using such a framework must be formally evaluated against a traditional approach to characterize the point at which simply “large scale” processing transitions into “big data” and necessitates alternative computational frameworks. The proposed Hadoop system was implemented on a production lab-cluster alongside a standard Sun Grid Engine (SGE). Theoretical models for wall-clock time and resource time for both approaches are introduced and validated. To provide real example data, three T1 image archives were retrieved from a university secure, shared web database and used to empirically assess computational performance under three configurations of cluster hardware (using 72, 109, or 209 CPU cores) with differing job lengths. Empirical results match the theoretical models. Based on these data, a comparative analysis is presented for when the Hadoop framework will be relevant and non-relevant for medical imaging. PMID:28736473

  2. Large Scale Hierarchical K-Means Based Image Retrieval With MapReduce

    DTIC Science & Technology

    2014-03-27

    hadoop distributed file system: Architecture and design, 2007. [10] G. Bradski. Dr. Dobb’s Journal of Software Tools, 2000. [11] Terry Costlow. Big data ...million images running on 20 virtual machines are shown. 15. SUBJECT TERMS Image Retrieval, MapReduce, Hierarchical K-Means, Big Data , Hadoop U U U UU 87...13 2.1.1.2 HDFS Data Representation . . . . . . . . . . . . . . . . 14 2.1.1.3 Hadoop Engine

  3. Deceit: A flexible distributed file system

    NASA Technical Reports Server (NTRS)

    Siegel, Alex; Birman, Kenneth; Marzullo, Keith

    1989-01-01

    Deceit, a distributed file system (DFS) being developed at Cornell, focuses on flexible file semantics in relation to efficiency, scalability, and reliability. Deceit servers are interchangeable and collectively provide the illusion of a single, large server machine to any clients of the Deceit service. Non-volatile replicas of each file are stored on a subset of the file servers. The user is able to set parameters on a file to achieve different levels of availability, performance, and one-copy serializability. Deceit also supports a file version control mechanism. In contrast with many recent DFS efforts, Deceit can behave like a plain Sun Network File System (NFS) server and can be used by any NFS client without modifying any client software. The current Deceit prototype uses the ISIS Distributed Programming Environment for all communication and process group management, an approach that reduces system complexity and increases system robustness.

  4. Querying Archetype-Based Electronic Health Records Using Hadoop and Dewey Encoding of openEHR Models.

    PubMed

    Sundvall, Erik; Wei-Kleiner, Fang; Freire, Sergio M; Lambrix, Patrick

    2017-01-01

    Archetype-based Electronic Health Record (EHR) systems using generic reference models from e.g. openEHR, ISO 13606 or CIMI should be easy to update and reconfigure with new types (or versions) of data models or entries, ideally with very limited programming or manual database tweaking. Exploratory research (e.g. epidemiology) leading to ad-hoc querying on a population-wide scale can be a challenge in such environments. This publication describes implementation and test of an archetype-aware Dewey encoding optimization that can be used to produce such systems in environments supporting relational operations, e.g. RDBMs and distributed map-reduce frameworks like Hadoop. Initial testing was done using a nine-node 2.2 GHz quad-core Hadoop cluster querying a dataset consisting of targeted extracts from 4+ million real patient EHRs, query results with sub-minute response time were obtained.

  5. A hadoop-based method to predict potential effective drug combination.

    PubMed

    Sun, Yifan; Xiong, Yi; Xu, Qian; Wei, Dongqing

    2014-01-01

    Combination drugs that impact multiple targets simultaneously are promising candidates for combating complex diseases due to their improved efficacy and reduced side effects. However, exhaustive screening of all possible drug combinations is extremely time-consuming and impractical. Here, we present a novel Hadoop-based approach to predict drug combinations by taking advantage of the MapReduce programming model, which leads to an improvement of scalability of the prediction algorithm. By integrating the gene expression data of multiple drugs, we constructed data preprocessing and the support vector machines and naïve Bayesian classifiers on Hadoop for prediction of drug combinations. The experimental results suggest that our Hadoop-based model achieves much higher efficiency in the big data processing steps with satisfactory performance. We believed that our proposed approach can help accelerate the prediction of potential effective drugs with the increasing of the combination number at an exponential rate in future. The source code and datasets are available upon request.

  6. A Hadoop-Based Method to Predict Potential Effective Drug Combination

    PubMed Central

    Xiong, Yi; Xu, Qian; Wei, Dongqing

    2014-01-01

    Combination drugs that impact multiple targets simultaneously are promising candidates for combating complex diseases due to their improved efficacy and reduced side effects. However, exhaustive screening of all possible drug combinations is extremely time-consuming and impractical. Here, we present a novel Hadoop-based approach to predict drug combinations by taking advantage of the MapReduce programming model, which leads to an improvement of scalability of the prediction algorithm. By integrating the gene expression data of multiple drugs, we constructed data preprocessing and the support vector machines and naïve Bayesian classifiers on Hadoop for prediction of drug combinations. The experimental results suggest that our Hadoop-based model achieves much higher efficiency in the big data processing steps with satisfactory performance. We believed that our proposed approach can help accelerate the prediction of potential effective drugs with the increasing of the combination number at an exponential rate in future. The source code and datasets are available upon request. PMID:25147789

  7. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

    SciTech Connect

    Taylor, Ronald C.

    Bioinformatics researchers are increasingly confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBasemore » project are defined, and current bioinformatics software that employ Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date.« less

  8. Cloudwave: distributed processing of "big data" from electrophysiological recordings for epilepsy clinical research using Hadoop.

    PubMed

    Jayapandian, Catherine P; Chen, Chien-Hung; Bozorgi, Alireza; Lhatoo, Samden D; Zhang, Guo-Qiang; Sahoo, Satya S

    2013-01-01

    Epilepsy is the most common serious neurological disorder affecting 50-60 million persons worldwide. Multi-modal electrophysiological data, such as electroencephalography (EEG) and electrocardiography (EKG), are central to effective patient care and clinical research in epilepsy. Electrophysiological data is an example of clinical "big data" consisting of more than 100 multi-channel signals with recordings from each patient generating 5-10GB of data. Current approaches to store and analyze signal data using standalone tools, such as Nihon Kohden neurology software, are inadequate to meet the growing volume of data and the need for supporting multi-center collaborative studies with real time and interactive access. We introduce the Cloudwave platform in this paper that features a Web-based intuitive signal analysis interface integrated with a Hadoop-based data processing module implemented on clinical data stored in a "private cloud". Cloudwave has been developed as part of the National Institute of Neurological Disorders and Strokes (NINDS) funded multi-center Prevention and Risk Identification of SUDEP Mortality (PRISM) project. The Cloudwave visualization interface provides real-time rendering of multi-modal signals with "montages" for EEG feature characterization over 2TB of patient data generated at the Case University Hospital Epilepsy Monitoring Unit. Results from performance evaluation of the Cloudwave Hadoop data processing module demonstrate one order of magnitude improvement in performance over 77GB of patient data. (Cloudwave project: http://prism.case.edu/prism/index.php/Cloudwave).

  9. BESIU Physical Analysis on Hadoop Platform

    NASA Astrophysics Data System (ADS)

    Huo, Jing; Zang, Dongsong; Lei, Xiaofeng; Li, Qiang; Sun, Gongxing

    2014-06-01

    In the past 20 years, computing cluster has been widely used for High Energy Physics data processing. The jobs running on the traditional cluster with a Data-to-Computing structure, have to read large volumes of data via the network to the computing nodes for analysis, thereby making the I/O latency become a bottleneck of the whole system. The new distributed computing technology based on the MapReduce programming model has many advantages, such as high concurrency, high scalability and high fault tolerance, and it can benefit us in dealing with Big Data. This paper brings the idea of using MapReduce model to do BESIII physical analysis, and presents a new data analysis system structure based on Hadoop platform, which not only greatly improve the efficiency of data analysis, but also reduces the cost of system building. Moreover, this paper establishes an event pre-selection system based on the event level metadata(TAGs) database to optimize the data analyzing procedure.

  10. Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce

    PubMed Central

    Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng

    2016-01-01

    The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS – a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing. PMID:27617325

  11. Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce.

    PubMed

    Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng

    2013-11-01

    The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS - a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing.

  12. Next Generation Distributed Computing for Cancer Research

    PubMed Central

    Agarwal, Pankaj; Owzar, Kouros

    2014-01-01

    Advances in next generation sequencing (NGS) and mass spectrometry (MS) technologies have provided many new opportunities and angles for extending the scope of translational cancer research while creating tremendous challenges in data management and analysis. The resulting informatics challenge is invariably not amenable to the use of traditional computing models. Recent advances in scalable computing and associated infrastructure, particularly distributed computing for Big Data, can provide solutions for addressing these challenges. In this review, the next generation of distributed computing technologies that can address these informatics problems is described from the perspective of three key components of a computational platform, namely computing, data storage and management, and networking. A broad overview of scalable computing is provided to set the context for a detailed description of Hadoop, a technology that is being rapidly adopted for large-scale distributed computing. A proof-of-concept Hadoop cluster, set up for performance benchmarking of NGS read alignment, is described as an example of how to work with Hadoop. Finally, Hadoop is compared with a number of other current technologies for distributed computing. PMID:25983539

  13. Next generation distributed computing for cancer research.

    PubMed

    Agarwal, Pankaj; Owzar, Kouros

    2014-01-01

    Advances in next generation sequencing (NGS) and mass spectrometry (MS) technologies have provided many new opportunities and angles for extending the scope of translational cancer research while creating tremendous challenges in data management and analysis. The resulting informatics challenge is invariably not amenable to the use of traditional computing models. Recent advances in scalable computing and associated infrastructure, particularly distributed computing for Big Data, can provide solutions for addressing these challenges. In this review, the next generation of distributed computing technologies that can address these informatics problems is described from the perspective of three key components of a computational platform, namely computing, data storage and management, and networking. A broad overview of scalable computing is provided to set the context for a detailed description of Hadoop, a technology that is being rapidly adopted for large-scale distributed computing. A proof-of-concept Hadoop cluster, set up for performance benchmarking of NGS read alignment, is described as an example of how to work with Hadoop. Finally, Hadoop is compared with a number of other current technologies for distributed computing.

  14. SciTech Connect

    Fadika, Zacharia; Dede, Elif; Govindaraju, Madhusudhan

    MapReduce is increasingly becoming a popular framework, and a potent programming model. The most popular open source implementation of MapReduce, Hadoop, is based on the Hadoop Distributed File System (HDFS). However, as HDFS is not POSIX compliant, it cannot be fully leveraged by applications running on a majority of existing HPC environments such as Teragrid and NERSC. These HPC environments typicallysupport globally shared file systems such as NFS and GPFS. On such resourceful HPC infrastructures, the use of Hadoop not only creates compatibility issues, but also affects overall performance due to the added overhead of the HDFS. This paper notmore » only presents a MapReduce implementation directly suitable for HPC environments, but also exposes the design choices for better performance gains in those settings. By leveraging inherent distributed file systems' functions, and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) not only allows for the use of the model in an expanding number of HPCenvironments, but also allows for better performance in such settings. This paper shows the applicability and high performance of the MapReduce paradigm through MARIANE, an implementation designed for clustered and shared-disk file systems and as such not dedicated to a specific MapReduce solution. The paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains exhibited by our approach in distributed environments over Apache Hadoop in a data intensive setting, on the Magellan testbed at the National Energy Research Scientific Computing Center (NERSC).« less

  15. A novel approach to multiple sequence alignment using hadoop data grids.

    PubMed

    Sudha Sadasivam, G; Baktavatchalam, G

    2010-01-01

    Multiple alignment of protein sequences helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are speed and accuracy of alignment. Although dynamic programming algorithms produce accurate alignments, they are computation intensive. In this paper we propose a time efficient approach to sequence alignment that also produces quality alignment. The dynamic nature of the algorithm coupled with data and computational parallelism of hadoop data grids improves the accuracy and speed of sequence alignment. The principle of block splitting in hadoop coupled with its scalability facilitates alignment of very large sequences.

  16. Cloudwave: Distributed Processing of “Big Data” from Electrophysiological Recordings for Epilepsy Clinical Research Using Hadoop

    PubMed Central

    Jayapandian, Catherine P.; Chen, Chien-Hung; Bozorgi, Alireza; Lhatoo, Samden D.; Zhang, Guo-Qiang; Sahoo, Satya S.

    2013-01-01

    Epilepsy is the most common serious neurological disorder affecting 50–60 million persons worldwide. Multi-modal electrophysiological data, such as electroencephalography (EEG) and electrocardiography (EKG), are central to effective patient care and clinical research in epilepsy. Electrophysiological data is an example of clinical “big data” consisting of more than 100 multi-channel signals with recordings from each patient generating 5–10GB of data. Current approaches to store and analyze signal data using standalone tools, such as Nihon Kohden neurology software, are inadequate to meet the growing volume of data and the need for supporting multi-center collaborative studies with real time and interactive access. We introduce the Cloudwave platform in this paper that features a Web-based intuitive signal analysis interface integrated with a Hadoop-based data processing module implemented on clinical data stored in a “private cloud”. Cloudwave has been developed as part of the National Institute of Neurological Disorders and Strokes (NINDS) funded multi-center Prevention and Risk Identification of SUDEP Mortality (PRISM) project. The Cloudwave visualization interface provides real-time rendering of multi-modal signals with “montages” for EEG feature characterization over 2TB of patient data generated at the Case University Hospital Epilepsy Monitoring Unit. Results from performance evaluation of the Cloudwave Hadoop data processing module demonstrate one order of magnitude improvement in performance over 77GB of patient data. (Cloudwave project: http://prism.case.edu/prism/index.php/Cloudwave) PMID:24551370

  17. Near Real-Time Processing of Proteomics Data Using Hadoop.

    PubMed

    Hillman, Chris; Ahmad, Yasmeen; Whitehorn, Mark; Cobley, Andy

    2014-03-01

    This article presents a near real-time processing solution using MapReduce and Hadoop. The solution is aimed at some of the data management and processing challenges facing the life sciences community. Research into genes and their product proteins generates huge volumes of data that must be extensively preprocessed before any biological insight can be gained. In order to carry out this processing in a timely manner, we have investigated the use of techniques from the big data field. These are applied specifically to process data resulting from mass spectrometers in the course of proteomic experiments. Here we present methods of handling the raw data in Hadoop, and then we investigate a process for preprocessing the data using Java code and the MapReduce framework to identify 2D and 3D peaks.

  18. Distributed PACS using distributed file system with hierarchical meta data servers.

    PubMed

    Hiroyasu, Tomoyuki; Minamitani, Yoshiyuki; Miki, Mitsunori; Yokouchi, Hisatake; Yoshimi, Masato

    2012-01-01

    In this research, we propose a new distributed PACS (Picture Archiving and Communication Systems) which is available to integrate several PACSs that exist in each medical institution. The conventional PACS controls DICOM file into one data-base. On the other hand, in the proposed system, DICOM file is separated into meta data and image data and those are stored individually. Using this mechanism, since file is not always accessed the entire data, some operations such as finding files, changing titles, and so on can be performed in high-speed. At the same time, as distributed file system is utilized, accessing image files can also achieve high-speed access and high fault tolerant. The introduced system has a more significant point. That is the simplicity to integrate several PACSs. In the proposed system, only the meta data servers are integrated and integrated system can be constructed. This system also has the scalability of file access with along to the number of file numbers and file sizes. On the other hand, because meta-data server is integrated, the meta data server is the weakness of this system. To solve this defect, hieratical meta data servers are introduced. Because of this mechanism, not only fault--tolerant ability is increased but scalability of file access is also increased. To discuss the proposed system, the prototype system using Gfarm was implemented. For evaluating the implemented system, file search operating time of Gfarm and NFS were compared.

  19. Power Grid Data Analysis with R and Hadoop

    SciTech Connect

    Hafen, Ryan P.; Gibson, Tara D.; Kleese van Dam, Kerstin

    This book chapter presents an approach to analysis of large-scale time-series sensor information based on our experience with power grid data. We use the R-Hadoop Integrated Programming Environment (RHIPE) to analyze a 2TB data set and present code and results for this analysis.

  20. Reliable file sharing in distributed operating system using web RTC

    NASA Astrophysics Data System (ADS)

    Dukiya, Rajesh

    2017-12-01

    Since, the evolution of distributed operating system, distributed file system is come out to be important part in operating system. P2P is a reliable way in Distributed Operating System for file sharing. It was introduced in 1999, later it became a high research interest topic. Peer to Peer network is a type of network, where peers share network workload and other load related tasks. A P2P network can be a period of time connection, where a bunch of computers connected by a USB (Universal Serial Bus) port to transfer or enable disk sharing i.e. file sharing. Currently P2P requires special network that should be designed in P2P way. Nowadays, there is a big influence of browsers in our life. In this project we are going to study of file sharing mechanism in distributed operating system in web browsers, where we will try to find performance bottlenecks which our research will going to be an improvement in file sharing by performance and scalability in distributed file systems. Additionally, we will discuss the scope of Web Torrent file sharing and free-riding in peer to peer networks.

  1. VisIO: enabling interactive visualization of ultra-scale, time-series data via high-bandwidth distributed I/O systems

    SciTech Connect

    Mitchell, Christopher J; Ahrens, James P; Wang, Jun

    2010-10-15

    Petascale simulations compute at resolutions ranging into billions of cells and write terabytes of data for visualization and analysis. Interactive visuaUzation of this time series is a desired step before starting a new run. The I/O subsystem and associated network often are a significant impediment to interactive visualization of time-varying data; as they are not configured or provisioned to provide necessary I/O read rates. In this paper, we propose a new I/O library for visualization applications: VisIO. Visualization applications commonly use N-to-N reads within their parallel enabled readers which provides an incentive for a shared-nothing approach to I/O, similar tomore » other data-intensive approaches such as Hadoop. However, unlike other data-intensive applications, visualization requires: (1) interactive performance for large data volumes, (2) compatibility with MPI and POSIX file system semantics for compatibility with existing infrastructure, and (3) use of existing file formats and their stipulated data partitioning rules. VisIO, provides a mechanism for using a non-POSIX distributed file system to provide linear scaling of 110 bandwidth. In addition, we introduce a novel scheduling algorithm that helps to co-locate visualization processes on nodes with the requested data. Testing using VisIO integrated into Para View was conducted using the Hadoop Distributed File System (HDFS) on TACC's Longhorn cluster. A representative dataset, VPIC, across 128 nodes showed a 64.4% read performance improvement compared to the provided Lustre installation. Also tested, was a dataset representing a global ocean salinity simulation that showed a 51.4% improvement in read performance over Lustre when using our VisIO system. VisIO, provides powerful high-performance I/O services to visualization applications, allowing for interactive performance with ultra-scale, time-series data.« less

  2. GATE Monte Carlo simulation of dose distribution using MapReduce in a cloud computing environment.

    PubMed

    Liu, Yangchuan; Tang, Yuguo; Gao, Xin

    2017-12-01

    The GATE Monte Carlo simulation platform has good application prospects of treatment planning and quality assurance. However, accurate dose calculation using GATE is time consuming. The purpose of this study is to implement a novel cloud computing method for accurate GATE Monte Carlo simulation of dose distribution using MapReduce. An Amazon Machine Image installed with Hadoop and GATE is created to set up Hadoop clusters on Amazon Elastic Compute Cloud (EC2). Macros, the input files for GATE, are split into a number of self-contained sub-macros. Through Hadoop Streaming, the sub-macros are executed by GATE in Map tasks and the sub-results are aggregated into final outputs in Reduce tasks. As an evaluation, GATE simulations were performed in a cubical water phantom for X-ray photons of 6 and 18 MeV. The parallel simulation on the cloud computing platform is as accurate as the single-threaded simulation on a local server and the simulation correctness is not affected by the failure of some worker nodes. The cloud-based simulation time is approximately inversely proportional to the number of worker nodes. For the simulation of 10 million photons on a cluster with 64 worker nodes, time decreases of 41× and 32× were achieved compared to the single worker node case and the single-threaded case, respectively. The test of Hadoop's fault tolerance showed that the simulation correctness was not affected by the failure of some worker nodes. The results verify that the proposed method provides a feasible cloud computing solution for GATE.

  3. cl-dash: rapid configuration and deployment of Hadoop clusters for bioinformatics research in the cloud

    PubMed Central

    Hodor, Paul; Chawla, Amandeep; Clark, Andrew; Neal, Lauren

    2016-01-01

    Summary: One of the solutions proposed for addressing the challenge of the overwhelming abundance of genomic sequence and other biological data is the use of the Hadoop computing framework. Appropriate tools are needed to set up computational environments that facilitate research of novel bioinformatics methodology using Hadoop. Here, we present cl-dash, a complete starter kit for setting up such an environment. Configuring and deploying new Hadoop clusters can be done in minutes. Use of Amazon Web Services ensures no initial investment and minimal operation costs. Two sample bioinformatics applications help the researcher understand and learn the principles of implementing an algorithm using the MapReduce programming pattern. Availability and implementation: Source code is available at https://bitbucket.org/booz-allen-sci-comp-team/cl-dash.git. Contact: hodor_paul@bah.com PMID:26428290

  4. cl-dash: rapid configuration and deployment of Hadoop clusters for bioinformatics research in the cloud.

    PubMed

    Hodor, Paul; Chawla, Amandeep; Clark, Andrew; Neal, Lauren

    2016-01-15

    : One of the solutions proposed for addressing the challenge of the overwhelming abundance of genomic sequence and other biological data is the use of the Hadoop computing framework. Appropriate tools are needed to set up computational environments that facilitate research of novel bioinformatics methodology using Hadoop. Here, we present cl-dash, a complete starter kit for setting up such an environment. Configuring and deploying new Hadoop clusters can be done in minutes. Use of Amazon Web Services ensures no initial investment and minimal operation costs. Two sample bioinformatics applications help the researcher understand and learn the principles of implementing an algorithm using the MapReduce programming pattern. Source code is available at https://bitbucket.org/booz-allen-sci-comp-team/cl-dash.git. hodor_paul@bah.com. © The Author 2015. Published by Oxford University Press.

  5. A Hadoop-based Molecular Docking System

    NASA Astrophysics Data System (ADS)

    Dong, Yueli; Guo, Quan; Sun, Bin

    2017-10-01

    Molecular docking always faces the challenge of managing tens of TB datasets. It is necessary to improve the efficiency of the storage and docking. We proposed the molecular docking platform based on Hadoop for virtual screening, it provides the preprocessing of ligand datasets and the analysis function of the docking results. A molecular cloud database that supports mass data management is constructed. Through this platform, the docking time is reduced, the data storage is efficient, and the management of the ligand datasets is convenient.

  6. Cost Considerations in Cloud Computing

    DTIC Science & Technology

    2014-01-01

    investments. 2. Database Options The potential promise that “ big data ” analytics holds for many enterprise mission areas makes relevant the question of the...development of a range of new distributed file systems and data - bases that have better scalability properties than traditional SQL databases. Hadoop ... data . Many systems exist that extend or supplement Hadoop —such as Apache Accumulo, which provides a highly granular mechanism for managing security

  7. BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters.

    PubMed

    Huang, Hailiang; Tata, Sandeep; Prill, Robert J

    2013-01-01

    Computational workloads for genome-wide association studies (GWAS) are growing in scale and complexity outpacing the capabilities of single-threaded software designed for personal computers. The BlueSNP R package implements GWAS statistical tests in the R programming language and executes the calculations across computer clusters configured with Apache Hadoop, a de facto standard framework for distributed data processing using the MapReduce formalism. BlueSNP makes computationally intensive analyses, such as estimating empirical p-values via data permutation, and searching for expression quantitative trait loci over thousands of genes, feasible for large genotype-phenotype datasets. http://github.com/ibm-bioinformatics/bluesnp

  8. Hotspot Patterns: The Formal Definition and Automatic Detection of Architecture Smells

    DTIC Science & Technology

    2015-01-15

    serious question for a project manager or architect: how to determine which parts of the code base should be given higher priority for maintenance and...services framework; Hadoop8 is a tool for distributed processing of large data sets; HBase9 is the Hadoop database; Ivy10 is a dependency management tool...answer this question more rigorously, we conducted Pearson Correlation Analysis to test the dependency between the number of issues a file involves

  9. Mining algorithm for association rules in big data based on Hadoop

    NASA Astrophysics Data System (ADS)

    Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying

    2018-04-01

    In order to solve the problem that the traditional association rules mining algorithm has been unable to meet the mining needs of large amount of data in the aspect of efficiency and scalability, take FP-Growth as an example, the algorithm is realized in the parallelization based on Hadoop framework and Map Reduce model. On the basis, it is improved using the transaction reduce method for further enhancement of the algorithm's mining efficiency. The experiment, which consists of verification of parallel mining results, comparison on efficiency between serials and parallel, variable relationship between mining time and node number and between mining time and data amount, is carried out in the mining results and efficiency by Hadoop clustering. Experiments show that the paralleled FP-Growth algorithm implemented is able to accurately mine frequent item sets, with a better performance and scalability. It can be better to meet the requirements of big data mining and efficiently mine frequent item sets and association rules from large dataset.

  10. Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms.

    PubMed

    Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele

    2018-06-01

    Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands to resort to parallel and distributed computing. Indeed, those type of algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, to resort to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH. umberto.ferraro@uniroma1.it. Supplementary data are available at Bioinformatics online.

  11. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.

  12. 75 FR 51032 - National Fuel Gas Distribution Corporation; Notice of Baseline Filing

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-18

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission [Docket No. PR10-79-000] National Fuel Gas Distribution Corporation; Notice of Baseline Filing August 12, 2010. Take notice that on August 10, 2010, National fuel Gas Distribution Corporation submitted a baseline filing of its Statement of...

  13. Optimal File-Distribution in Heterogeneous and Asymmetric Storage Networks

    NASA Astrophysics Data System (ADS)

    Langner, Tobias; Schindelhauer, Christian; Souza, Alexander

    We consider an optimisation problem which is motivated from storage virtualisation in the Internet. While storage networks make use of dedicated hardware to provide homogeneous bandwidth between servers and clients, in the Internet, connections between storage servers and clients are heterogeneous and often asymmetric with respect to upload and download. Thus, for a large file, the question arises how it should be fragmented and distributed among the servers to grant "optimal" access to the contents. We concentrate on the transfer time of a file, which is the time needed for one upload and a sequence of n downloads, using a set of m servers with heterogeneous bandwidths. We assume that fragments of the file can be transferred in parallel to and from multiple servers. This model yields a distribution problem that examines the question of how these fragments should be distributed onto those servers in order to minimise the transfer time. We present an algorithm, called FlowScaling, that finds an optimal solution within running time {O}(m log m). We formulate the distribution problem as a maximum flow problem, which involves a function that states whether a solution with a given transfer time bound exists. This function is then used with a scaling argument to determine an optimal solution within the claimed time complexity.

  14. MRUniNovo: an efficient tool for de novo peptide sequencing utilizing the hadoop distributed computing framework.

    PubMed

    Li, Chuang; Chen, Tao; He, Qiang; Zhu, Yunping; Li, Kenli

    2017-03-15

    Tandem mass spectrometry-based de novo peptide sequencing is a complex and time-consuming process. The current algorithms for de novo peptide sequencing cannot rapidly and thoroughly process large mass spectrometry datasets. In this paper, we propose MRUniNovo, a novel tool for parallel de novo peptide sequencing. MRUniNovo parallelizes UniNovo based on the Hadoop compute platform. Our experimental results demonstrate that MRUniNovo significantly reduces the computation time of de novo peptide sequencing without sacrificing the correctness and accuracy of the results, and thus can process very large datasets that UniNovo cannot. MRUniNovo is an open source software tool implemented in java. The source code and the parameter settings are available at http://bioinfo.hupo.org.cn/MRUniNovo/index.php. s131020002@hnu.edu.cn ; taochen1019@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  15. Cloud Engineering Principles and Technology Enablers for Medical Image Processing-as-a-Service.

    PubMed

    Bao, Shunxing; Plassard, Andrew J; Landman, Bennett A; Gokhale, Aniruddha

    2017-04-01

    Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based "medical image processing-as-a-service" offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop's distributed file system. Despite this promise, HBase's load distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). This paper makes two contributions to address these concerns by describing key cloud engineering principles and technology enhancements we made to the Apache Hadoop ecosystem for medical imaging applications. First, we propose a row-key design for HBase, which is a necessary step that is driven by the hierarchical organization of imaging data. Second, we propose a novel data allocation policy within HBase to strongly enforce collocation of hierarchically related imaging data. The proposed enhancements accelerate data processing by minimizing network usage and localizing processing to machines where the data already exist. Moreover, our approach is amenable to the traditional scan, subject, and project-level analysis procedures, and is compatible with standard command line/scriptable image processing software. Experimental results for an illustrative sample of imaging data reveals that our new HBase policy results in a three-fold time improvement in conversion of classic DICOM to NiFTI file formats when compared with the default HBase region split policy

  16. Understanding Customer Dissatisfaction with Underutilized Distributed File Servers

    NASA Technical Reports Server (NTRS)

    Riedel, Erik; Gibson, Garth

    1996-01-01

    An important trend in the design of storage subsystems is a move toward direct network attachment. Network-attached storage offers the opportunity to off-load distributed file system functionality from dedicated file server machines and execute many requests directly at the storage devices. For this strategy to lead to better performance, as perceived by users, the response time of distributed operations must improve. In this paper we analyze measurements of an Andrew file system (AFS) server that we recently upgraded in an effort to improve client performance in our laboratory. While the original server's overall utilization was only about 3%, we show how burst loads were sufficiently intense to lead to period of poor response time significant enough to trigger customer dissatisfaction. In particular, we show how, after adjusting for network load and traffic to non-project servers, 50% of the variation in client response time was explained by variation in server central processing unit (CPU) use. That is, clients saw long response times in large part because the server was often over-utilized when it was used at all. Using these measures, we see that off-loading file server work in a network-attached storage architecture has to potential to benefit user response time. Computational power in such a system scales directly with storage capacity, so the slowdown during burst period should be reduced.

  17. ESUSA: US endangered species distribution file

    SciTech Connect

    Nagy, J.; Calef, C.E.

    1979-10-01

    This report describes a file containing distribution data on endangered species of the United States of Federal concern pursuant to the Endangered Species Act of 1973. Included for each species are (a) the common name, (b) the scientific name, (c) the family, (d) the group (mammal, bird, etc.), (e) Fish and Wildlife Service (FWS) listing and recovery priorities, (f) the Federal legal status, (g) the geographic distribution by counties or islands, (h) Federal Register citations and (i) the sources of the information on distribution of the species. Status types are endangered, threatened, proposed, formally under review, candidate, deleted, and rejected.more » Distribution is by Federal Information Processing Standard (FIPS) county code and is of four types: designated critical habitat, present range, potential range, and historic range.« less

  18. BioPig: a Hadoop-based analytic toolkit for large-scale sequence data.

    PubMed

    Nordberg, Henrik; Bhatia, Karan; Wang, Kai; Wang, Zhong

    2013-12-01

    The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this 'data deluge', here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation. We built BioPig on the Apache's Hadoop MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb sequences demonstrates that it scales automatically with size of data; and finally, BioPig can be ported without modification on many Hadoop infrastructures, as tested with Magellan system at National Energy Research Scientific Computing Center and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel program framework with the potential to greatly accelerate data-intensive bioinformatics analysis.

  19. Difet: Distributed Feature Extraction Tool for High Spatial Resolution Remote Sensing Images

    NASA Astrophysics Data System (ADS)

    Eken, S.; Aydın, E.; Sayar, A.

    2017-11-01

    In this paper, we propose distributed feature extraction tool from high spatial resolution remote sensing images. Tool is based on Apache Hadoop framework and Hadoop Image Processing Interface. Two corner detection (Harris and Shi-Tomasi) algorithms and five feature descriptors (SIFT, SURF, FAST, BRIEF, and ORB) are considered. Robustness of the tool in the task of feature extraction from LandSat-8 imageries are evaluated in terms of horizontal scalability.

  20. Distributed file management for remote clinical image-viewing stations

    NASA Astrophysics Data System (ADS)

    Ligier, Yves; Ratib, Osman M.; Girard, Christian; Logean, Marianne; Trayser, Gerhard

    1996-05-01

    The Geneva PACS is based on a distributed architecture, with different archive servers used to store all the image files produced by digital imaging modalities. Images can then be visualized on different display stations with the Osiris software. Image visualization require to have the image file physically present on the local station. Thus, images must be transferred from archive servers to local display stations in an acceptable way, which means fast and user friendly where the notion of file must be hidden to users. The transfer of image files is done according to different schemes including prefetching and direct image selection. Prefetching allows the retrieval of previous studies of a patient in advance. A direct image selection is also provided in order to retrieve images on request. When images are transferred locally on the display station, they are stored in Papyrus files, each file containing a set of images. File names are used by the Osiris viewing software to open image sequences. But file names alone are not explicit enough to properly describe the content of the file. A specific utility has been developed to present a list of patients, and for each patient a list of exams which can be selected and automatically displayed. The system has been successfully tested in different clinical environments. It will be soon extended on a hospital wide basis.

  1. LVFS: A Big Data File Storage Bridge for the HPC Community

    NASA Astrophysics Data System (ADS)

    Golpayegani, N.; Halem, M.; Mauoka, E.; Fonseca, L. F.

    2015-12-01

    Merging Big Data capabilities into High Performance Computing architecture starts at the file storage level. Heterogeneous storage systems are emerging which offer enhanced features for dealing with Big Data such as the IBM GPFS storage system's integration into Hadoop Map-Reduce. Taking advantage of these capabilities requires file storage systems to be adaptive and accommodate these new storage technologies. We present the extension of the Lightweight Virtual File System (LVFS) currently running as the production system for the MODIS Level 1 and Atmosphere Archive and Distribution System (LAADS) to incorporate a flexible plugin architecture which allows easy integration of new HPC hardware and/or software storage technologies without disrupting workflows, system architectures and only minimal impact on existing tools. We consider two essential aspects provided by the LVFS plugin architecture needed for the future HPC community. First, it allows for the seamless integration of new and emerging hardware technologies which are significantly different than existing technologies such as Segate's Kinetic disks and Intel's 3DXPoint non-volatile storage. Second is the transparent and instantaneous conversion between new software technologies and various file formats. With most current storage system a switch in file format would require costly reprocessing and nearly doubling of storage requirements. We will install LVFS on UMBC's IBM iDataPlex cluster with a heterogeneous storage architecture utilizing local, remote, and Seagate Kinetic storage as a case study. LVFS merges different kinds of storage architectures to show users a uniform layout and, therefore, prevent any disruption in workflows, architecture design, or tool usage. We will show how LVFS will convert HDF data produced by applying machine learning algorithms to Xco2 Level 2 data from the OCO-2 satellite to produce CO2 surface fluxes into GeoTIFF for visualization.

  2. A data distribution strategy for the 1990s (files are not enough)

    NASA Technical Reports Server (NTRS)

    Tankenson, Mike; Wright, Steven

    1993-01-01

    Virtually all of the data distribution strategies being contemplated for the EOSDIS era revolve around the use of files. Most, if not all, mass storage technologies are based around the file model. However, files may be the wrong primary abstraction for supporting scientific users in the 1990s and beyond. Other abstractions more closely matching the respective scientific discipline of the end user may be more appropriate. JPL has built a unique multimission data distribution system based on a strategy of telemetry stream emulation to match the responsibilities of spacecraft team and ground data system operators supporting our nations suite of planetary probes. The current system, operational since 1989 and the launch of the Magellan spacecraft, is supporting over 200 users at 15 remote sites. This stream-oriented data distribution model can provide important lessons learned to builders of future data systems.

  3. Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services

    PubMed Central

    Zamani, Hamid

    2017-01-01

    Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective to establish an interactive BDA platform with simulated patient data using open-source software technologies was achieved by construction of a platform framework with Hadoop Distributed File System (HDFS) using HBase (key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files revealed sustained availability over hundreds of iterations; however, to complete MapReduce to HBase required a week (for 10 TB) and a month for three billion (30 TB) indexed patient records, respectively. Found inconsistencies of MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. Hospital system based on patient-centric data was challenging in using HBase, whereby not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to achieve secured patient data while querying entire hospital volumes in a simplified clinical event model across clinical services. PMID:29375652

  4. Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services.

    PubMed

    Chrimes, Dillon; Zamani, Hamid

    2017-01-01

    Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective to establish an interactive BDA platform with simulated patient data using open-source software technologies was achieved by construction of a platform framework with Hadoop Distributed File System (HDFS) using HBase (key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files revealed sustained availability over hundreds of iterations; however, to complete MapReduce to HBase required a week (for 10 TB) and a month for three billion (30 TB) indexed patient records, respectively. Found inconsistencies of MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. Hospital system based on patient-centric data was challenging in using HBase, whereby not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to achieve secured patient data while querying entire hospital volumes in a simplified clinical event model across clinical services.

  5. Engineering the CernVM-Filesystem as a High Bandwidth Distributed Filesystem for Auxiliary Physics Data

    NASA Astrophysics Data System (ADS)

    Dykstra, D.; Bockelman, B.; Blomer, J.; Herner, K.; Levshina, T.; Slyz, M.

    2015-12-01

    A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data is auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Relatively speaking, conditions data also tends to be relatively small per job where both event data and auxiliary data are larger per job. Unlike event data, auxiliary data comes from a limited working set of shared files. Since there is spatial locality of the auxiliary data access, the use case appears to be identical to that of the CernVM- Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called "alien cache" to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid and cached locally on the

  6. Engineering the CernVM-Filesystem as a High Bandwidth Distributed Filesystem for Auxiliary Physics Data

    SciTech Connect

    Dykstra, D.; Bockelman, B.; Blomer, J.

    A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data is auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Relatively speaking, conditions data also tends to be relatively small per job where both event data and auxiliary data are larger per job. Unlike event data, auxiliarymore » data comes from a limited working set of shared files. Since there is spatial locality of the auxiliary data access, the use case appears to be identical to that of the CernVM- Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called 'alien cache' to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid and cached locally

  7. Hadoop and friends - first experience at CERN with a new platform for high throughput analysis steps

    NASA Astrophysics Data System (ADS)

    Duellmann, D.; Surdy, K.; Menichetti, L.; Toebbicke, R.

    2017-10-01

    The statistical analysis of infrastructure metrics comes with several specific challenges, including the fairly large volume of unstructured metrics from a large set of independent data sources. Hadoop and Spark provide an ideal environment in particular for the first steps of skimming rapidly through hundreds of TB of low relevance data to find and extract the much smaller data volume that is relevant for statistical analysis and modelling. This presentation will describe the new Hadoop service at CERN and the use of several of its components for high throughput data aggregation and ad-hoc pattern searches. We will describe the hardware setup used, the service structure with a small set of decoupled clusters and the first experience with co-hosting different applications and performing software upgrades. We will further detail the common infrastructure used for data extraction and preparation from continuous monitoring and database input sources.

  8. Transparency in Distributed File Systems

    DTIC Science & Technology

    1989-01-01

    ORGANIZATION NAME AND ADDRESS 10. PROGRAM ELEMENT. PROJECT, TASK Computer Science Department AREA & WORK UNIT NUMBERS 734 Comouter Studies Bldc . University of...sistency control , file and director) placement, and file and directory migration in a way that pro- 3 vides full network transparency. This transparency...areas of naming, replication, con- sistency control , file and directory placement, and file and directory migration in a way that pro- 3 vides full

  9. 10 CFR 61.20 - Filing and distribution of application.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...

  10. 10 CFR 61.20 - Filing and distribution of application.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...

  11. 10 CFR 61.20 - Filing and distribution of application.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...

  12. 10 CFR 61.20 - Filing and distribution of application.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...

  13. 10 CFR 61.20 - Filing and distribution of application.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... license covering the receipt and disposal of radioactive wastes in a land disposal facility are required....20 Energy NUCLEAR REGULATORY COMMISSION (CONTINUED) LICENSING REQUIREMENTS FOR LAND DISPOSAL OF RADIOACTIVE WASTE Licenses § 61.20 Filing and distribution of application. (a) An application for a license...

  14. Maintaining a Distributed File System by Collection and Analysis of Metrics

    NASA Technical Reports Server (NTRS)

    Bromberg, Daniel

    1997-01-01

    AFS(originally, Andrew File System) is a widely-deployed distributed file system product used by companies, universities, and laboratories world-wide. However, it is not trivial to operate: runing an AFS cell is a formidable task. It requires a team of dedicated and experienced system administratores who must manage a user base numbring in the thousands, rather than the smaller range of 10 to 500 faced by the typical system administrator.

  15. 10 CFR 60.22 - Filing and distribution of application.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... GEOLOGIC REPOSITORIES Licenses License Applications § 60.22 Filing and distribution of application. (a) An application for a construction authorization for a high-level radioactive waste repository at a geologic repository operations area, and an application for a license to receive and possess source, special nuclear...

  16. Using Hadoop MapReduce for Parallel Genetic Algorithms: A Comparison of the Global, Grid and Island Models.

    PubMed

    Ferrucci, Filomena; Salza, Pasquale; Sarro, Federica

    2017-06-29

    The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store

  17. Indoor air quality analysis based on Hadoop

    NASA Astrophysics Data System (ADS)

    Tuo, Wang; Yunhua, Sun; Song, Tian; Liang, Yu; Weihong, Cui

    2014-03-01

    The air of the office environment is our research object. The data of temperature, humidity, concentrations of carbon dioxide, carbon monoxide and ammonia are collected peer one to eight seconds by the sensor monitoring system. And all the data are stored in the Hbase database of Hadoop platform. With the help of HBase feature of column-oriented store and versioned (automatically add the time column), the time-series data sets are bulit based on the primary key Row-key and timestamp. The parallel computing programming model MapReduce is used to process millions of data collected by sensors. By analysing the changing trend of parameters' value at different time of the same day and at the same time of various dates, the impact of human factor and other factors on the room microenvironment is achieved according to the liquidity of the office staff. Moreover, the effective way to improve indoor air quality is proposed in the end of this paper.

  18. XRootd, disk-based, caching proxy for optimization of data access, data placement and data replication

    NASA Astrophysics Data System (ADS)

    Bauerdick, L. A. T.; Bloom, K.; Bockelman, B.; Bradley, D. C.; Dasu, S.; Dost, J. M.; Sfiligoi, I.; Tadel, A.; Tadel, M.; Wuerthwein, F.; Yagil, A.; Cms Collaboration

    2014-06-01

    Following the success of the XRootd-based US CMS data federation, the AAA project investigated extensions of the federation architecture by developing two sample implementations of an XRootd, disk-based, caching proxy. The first one simply starts fetching a whole file as soon as a file open request is received and is suitable when completely random file access is expected or it is already known that a whole file be read. The second implementation supports on-demand downloading of partial files. Extensions to the Hadoop Distributed File System have been developed to allow for an immediate fallback to network access when local HDFS storage fails to provide the requested block. Both cache implementations are in pre-production testing at UCSD.

  19. Distributing File-Based Data to Remote Sites Within the BABAR Collaboration

    SciTech Connect

    Gowdy, Stephen J.

    BABAR [1] uses two formats for its data: Objectivity database and root [2] files. This poster concerns the distribution of the latter--for Objectivity data see [3]. The BABAR analysis data is stored in root files--one per physics run and analysis selection channel--maintained in a large directory tree. Currently BABAR has more than 4.5 TBytes in 200,000 root files. This data is (mostly) produced at SLAC, but is required for analysis at universities and research centers throughout the us and Europe. Two basic problems confront us when we seek to import bulk data from slac to an institute's local storage viamore » the network. We must determine which files must be imported (depending on the local site requirements and which files have already been imported), and we must make the optimum use of the network when transferring the data. Basic ftp-like tools (ftp, scp, etc) do not attempt to solve the first problem. More sophisticated tools like rsync [4], the widely-used mirror/synchronization program, compare local and remote file systems, checking for changes (based on file date, size and, if desired, an elaborate checksum) in order to only copy new or modified files. However rsync allows for only limited file selection. Also when, as in BABAR, an extremely large directory structure must be scanned, rsync can take several hours just to determine which files need to be copied. Although rsync (and scp) provides on-the-fly compression, it does not allow us to optimize the network transfer by using multiple streams, adjusting the tcp window size, or separating encrypted authentication from unencrypted data channels.« less

  20. Efficient Retrieval of Massive Ocean Remote Sensing Images via a Cloud-Based Mean-Shift Algorithm.

    PubMed

    Yang, Mengzhao; Song, Wei; Mei, Haibin

    2017-07-23

    The rapid development of remote sensing (RS) technology has resulted in the proliferation of high-resolution images. There are challenges involved in not only storing large volumes of RS images but also in rapidly retrieving the images for ocean disaster analysis such as for storm surges and typhoon warnings. In this paper, we present an efficient retrieval of massive ocean RS images via a Cloud-based mean-shift algorithm. Distributed construction method via the pyramid model is proposed based on the maximum hierarchical layer algorithm and used to realize efficient storage structure of RS images on the Cloud platform. We achieve high-performance processing of massive RS images in the Hadoop system. Based on the pyramid Hadoop distributed file system (HDFS) storage method, an improved mean-shift algorithm for RS image retrieval is presented by fusion with the canopy algorithm via Hadoop MapReduce programming. The results show that the new method can achieve better performance for data storage than HDFS alone and WebGIS-based HDFS. Speedup and scaleup are very close to linear changes with an increase of RS images, which proves that image retrieval using our method is efficient.

  1. Efficient Retrieval of Massive Ocean Remote Sensing Images via a Cloud-Based Mean-Shift Algorithm

    PubMed Central

    Song, Wei; Mei, Haibin

    2017-01-01

    The rapid development of remote sensing (RS) technology has resulted in the proliferation of high-resolution images. There are challenges involved in not only storing large volumes of RS images but also in rapidly retrieving the images for ocean disaster analysis such as for storm surges and typhoon warnings. In this paper, we present an efficient retrieval of massive ocean RS images via a Cloud-based mean-shift algorithm. Distributed construction method via the pyramid model is proposed based on the maximum hierarchical layer algorithm and used to realize efficient storage structure of RS images on the Cloud platform. We achieve high-performance processing of massive RS images in the Hadoop system. Based on the pyramid Hadoop distributed file system (HDFS) storage method, an improved mean-shift algorithm for RS image retrieval is presented by fusion with the canopy algorithm via Hadoop MapReduce programming. The results show that the new method can achieve better performance for data storage than HDFS alone and WebGIS-based HDFS. Speedup and scaleup are very close to linear changes with an increase of RS images, which proves that image retrieval using our method is efficient. PMID:28737699

  2. A New Data Access Mechanism for HDFS

    NASA Astrophysics Data System (ADS)

    Li, Qiang; Sun, Zhenyu; Wei, Zhanchen; Sun, Gongxing

    2017-10-01

    With the era of big data emerging, Hadoop has become the de facto standard of big data processing platform. However, it is still difficult to get legacy applications, such as High Energy Physics (HEP) applications, to run efficiently on Hadoop platform. There are two reasons which lead to the difficulties mentioned above: firstly, random access is not supported on Hadoop File System (HDFS), secondly, it is difficult to make legacy applications adopt to HDFS streaming data processing mode. In order to address the two issues, a new read and write mechanism of HDFS is proposed. With this mechanism, data access is done on the local file system instead of through HDFS streaming interfaces. To enable files modified by users, three attributes including permissions, owner and group are imposed on Block objects. Blocks stored on Datanodes have the same attributes as the file they are owned by. Users can modify blocks when the Map task running locally, and HDFS is responsible to update the rest replicas later after the block modification finished. To further improve the performance of Hadoop system, a complete localization task execution mechanism is implemented for I/O intensive jobs. Test results show that average CPU utilization is improved by 10% with the new task selection strategy, data read and write performances are improved by about 10% and 30% separately.

  3. Enabling Incremental Iterative Development at Scale: Quality Attribute Refinement and Allocation in Practice

    DTIC Science & Technology

    2015-06-01

    abstract constraints along six dimen- sions for expansion: user, actions, data , business rules, interfaces, and quality attributes [Gottesdiener 2010...relevant open source systems. For example, the CONNECT and HADOOP Distributed File System (HDFS) projects have many user stories that deal with...Iteration Zero involves architecture planning before writing any code. An overly long Iteration Zero is equivalent to the dysfunctional “ Big Up-Front

  4. SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data

    NASA Astrophysics Data System (ADS)

    Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.

    2015-12-01

    Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework, that extends ApacheTM Spark for scaling scientific computations. Apache Spark improves the map-reduce implementation in ApacheTM Hadoop for parallel computing on a cluster, by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, and not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries. We evaluate performance of the various matrix libraries in distributed pipelines, such as Nd4jTM and BreezeTM. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, parallel ingest and partitioning (sharding) of A-Train satellite observations from model grids. These

  5. A Columnar Storage Strategy with Spatiotemporal Index for Big Climate Data

    NASA Astrophysics Data System (ADS)

    Hu, F.; Bowen, M. K.; Li, Z.; Schnase, J. L.; Duffy, D.; Lee, T. J.; Yang, C. P.

    2015-12-01

    Large collections of observational, reanalysis, and climate model output data may grow to as large as a 100 PB in the coming years, so climate dataset is in the Big Data domain, and various distributed computing frameworks have been utilized to address the challenges by big climate data analysis. However, due to the binary data format (NetCDF, HDF) with high spatial and temporal dimensions, the computing frameworks in Apache Hadoop ecosystem are not originally suited for big climate data. In order to make the computing frameworks in Hadoop ecosystem directly support big climate data, we propose a columnar storage format with spatiotemporal index to store climate data, which will support any project in the Apache Hadoop ecosystem (e.g. MapReduce, Spark, Hive, Impala). With this approach, the climate data will be transferred into binary Parquet data format, a columnar storage format, and spatial and temporal index will be built and attached into the end of Parquet files to enable real-time data query. Then such climate data in Parquet data format could be available to any computing frameworks in Hadoop ecosystem. The proposed approach is evaluated using the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. Experimental results show that this approach could efficiently overcome the gap between the big climate data and the distributed computing frameworks, and the spatiotemporal index could significantly accelerate data querying and processing.

  6. Sharing lattice QCD data over a widely distributed file system

    NASA Astrophysics Data System (ADS)

    Amagasa, T.; Aoki, S.; Aoki, Y.; Aoyama, T.; Doi, T.; Fukumura, K.; Ishii, N.; Ishikawa, K.-I.; Jitsumoto, H.; Kamano, H.; Konno, Y.; Matsufuru, H.; Mikami, Y.; Miura, K.; Sato, M.; Takeda, S.; Tatebe, O.; Togawa, H.; Ukawa, A.; Ukita, N.; Watanabe, Y.; Yamazaki, T.; Yoshie, T.

    2015-12-01

    JLDG is a data-grid for the lattice QCD (LQCD) community in Japan. Several large research groups in Japan have been working on lattice QCD simulations using supercomputers distributed over distant sites. The JLDG provides such collaborations with an efficient method of data management and sharing. File servers installed on 9 sites are connected to the NII SINET VPN and are bound into a single file system with the GFarm. The file system looks the same from any sites, so that users can do analyses on a supercomputer on a site, using data generated and stored in the JLDG at a different site. We present a brief description of hardware and software of the JLDG, including a recently developed subsystem for cooperating with the HPCI shared storage, and report performance and statistics of the JLDG. As of April 2015, 15 research groups (61 users) store their daily research data of 4.7PB including replica and 68 million files in total. Number of publications for works which used the JLDG is 98. The large number of publications and recent rapid increase of disk usage convince us that the JLDG has grown up into a useful infrastructure for LQCD community in Japan.

  7. Design and Execution of make-like, distributed Analyses based on Spotify’s Pipelining Package Luigi

    NASA Astrophysics Data System (ADS)

    Erdmann, M.; Fischer, B.; Fischer, R.; Rieger, M.

    2017-10-01

    In high-energy particle physics, workflow management systems are primarily used as tailored solutions in dedicated areas such as Monte Carlo production. However, physicists performing data analyses are usually required to steer their individual workflows manually which is time-consuming and often leads to undocumented relations between particular workloads. We present a generic analysis design pattern that copes with the sophisticated demands of end-to-end HEP analyses and provides a make-like execution system. It is based on the open-source pipelining package Luigi which was developed at Spotify and enables the definition of arbitrary workloads, so-called Tasks, and the dependencies between them in a lightweight and scalable structure. Further features are multi-user support, automated dependency resolution and error handling, central scheduling, and status visualization in the web. In addition to already built-in features for remote jobs and file systems like Hadoop and HDFS, we added support for WLCG infrastructure such as LSF and CREAM job submission, as well as remote file access through the Grid File Access Library. Furthermore, we implemented automated resubmission functionality, software sandboxing, and a command line interface with auto-completion for a convenient working environment. For the implementation of a t \\overline{{{t}}} H cross section measurement, we created a generic Python interface that provides programmatic access to all external information such as datasets, physics processes, statistical models, and additional files and values. In summary, the setup enables the execution of the entire analysis in a parallelized and distributed fashion with a single command.

  8. SciTech Connect

    Schaumberg, Andrew

    The Omics Tools package provides several small trivial tools for work in genomics. This single portable package, the “omics.jar” file, is a toolbox that works in any Java-based environment, including PCs, Macs, and supercomputers. The number of tools is expected to grow. One tool (called cmsearch.hadoop or cmsearch.local), calls the external cmsearch program to predict non-coding RNA in a genome. The cmsearch program is part of the third-party Infernal package. Omics Tools does not contain Infernal. Infernal may be installed separately. The cmsearch.hadoop subtool requires Apache Hadoop and runs on a supercomputer, though cmsearch.local does not and runs on amore » server. Omics Tools does not contain Hadoop. Hadoop mat be installed separartely The other tools (cmgbk, cmgff, fastats, pal, randgrp, randgrpr, randsub) do not interface with third-party tools. Omics Tools is written in Java and Scala programming languages. Invoking the “help” command shows currently available tools, as shown below: schaumbe@gpint06:~/proj/omics$ java -jar omics.jar help Known commands are: cmgbk : compare cmsearch and GenBank Infernal hits cmgff : compare hits among two GFF (version 3) files cmsearch.hadoop : find Infernal hits in a genome, on your supercomputer cmsearch.local : find Infernal hits in a genome, on your workstation fastats : FASTA stats, e.g. # bases, GC content pal : stem-loop motif detection by palindromic sequence search (code stub) randgrp : random subsample without replacement, of groups randgrpr : random subsample with replacement, of groups (fast) randsub : random subsample without replacement, of file lines For more help regarding a particular command, use: java -jar omics.jar command help Usage: java -jar omics.jar command args« less

  9. Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends

    PubMed Central

    2014-01-01

    The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so called “big data” challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital those big data solutions are multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data. The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphical processing unit (GPU)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage resulting in reliable data processing by replicating the computing tasks, and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation. In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools. This paper is concluded by

  10. Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends.

    PubMed

    Mohammed, Emad A; Far, Behrouz H; Naugler, Christopher

    2014-01-01

    The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so called "big data" challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital those big data solutions are multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data. THE MAPREDUCE PROGRAMMING FRAMEWORK USES TWO TASKS COMMON IN FUNCTIONAL PROGRAMMING: Map and Reduce. MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphical processing unit (GPU)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage resulting in reliable data processing by replicating the computing tasks, and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation. In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools. This paper is concluded by

  11. Cloud Engineering Principles and Technology Enablers for Medical Image Processing-as-a-Service

    PubMed Central

    Bao, Shunxing; Plassard, Andrew J.; Landman, Bennett A.; Gokhale, Aniruddha

    2017-01-01

    Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based “medical image processing-as-a-service” offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop’s distributed file system. Despite this promise, HBase’s load distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). This paper makes two contributions to address these concerns by describing key cloud engineering principles and technology enhancements we made to the Apache Hadoop ecosystem for medical imaging applications. First, we propose a row-key design for HBase, which is a necessary step that is driven by the hierarchical organization of imaging data. Second, we propose a novel data allocation policy within HBase to strongly enforce collocation of hierarchically related imaging data. The proposed enhancements accelerate data processing by minimizing network usage and localizing processing to machines where the data already exist. Moreover, our approach is amenable to the traditional scan, subject, and project-level analysis procedures, and is compatible with standard command line/scriptable image processing software. Experimental results for an illustrative sample of imaging data reveals that our new HBase policy results in a three-fold time improvement in conversion of classic DICOM to NiFTI file formats when compared with the default HBase region split

  12. Knowledge Visualizations: A Tool to Achieve Optimized Operational Decision Making and Data Integration

    DTIC Science & Technology

    2015-06-01

    Hadoop Distributed File System (HDFS) without any integration with Accumulo-based Knowledge Stores based on OWL/RDF. 4. Cloud Based The Apache Software...BTW, 7(12), pp. 227–241. Godin, A. & Akins, D. (2014). Extending DCGS-N naval tactical clouds from in-storage to in-memory for the integrated fires...VISUALIZATIONS: A TOOL TO ACHIEVE OPTIMIZED OPERATIONAL DECISION MAKING AND DATA INTEGRATION by Paul C. Hudson Jeffrey A. Rzasa June 2015 Thesis

  13. A Distributed Fuzzy Associative Classifier for Big Data.

    PubMed

    Segatori, Armando; Bechini, Alessio; Ducange, Pietro; Marcelloni, Francesco

    2017-09-19

    Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in different real domain applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types of pruning. We implemented our approach on the popular Hadoop framework. Hadoop allows distributing storage and processing of very large data sets on computer clusters built from commodity hardware. We have performed an extensive experimentation and a detailed analysis of the results using six very large datasets with up to 11,000,000 instances. We have also experimented different types of reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs recently proposed in the literature. We highlight that, although the accuracies result to be comparable, the complexity, evaluated in terms of number of rules, of the classifiers generated by the fuzzy distributed approach is lower than the one of the nonfuzzy classifiers.

  14. Implementation of a SOA-Based Service Deployment Platform with Portal

    NASA Astrophysics Data System (ADS)

    Yang, Chao-Tung; Yu, Shih-Chi; Lai, Chung-Che; Liu, Jung-Chun; Chu, William C.

    In this paper we propose a Service Oriented Architecture to provide a flexible and serviceable environment. SOA comes up with commercial requirements; it integrates many techniques over ten years to find the solution in different platforms, programming languages and users. SOA provides the connection with a protocol between service providers and service users. After this, the performance and the reliability problems are reviewed. Finally we apply SOA into our Grid and Hadoop platform. Service acts as an interface in front of the Resource Broker in the Grid, and the Resource Broker is middleware that provides functions for developers. The Hadoop has a file replication feature to ensure file reliability. Services provided on the Grid and Hadoop are centralized. We design a portal, in which users can use services on it directly or register service through the service provider. The portal also offers a service workflow function so that users can customize services according to the need of their jobs.

  15. PoroTomo Subtask 3.2 Data files from the Distributed Acoustic Sensing experiment at Garner Valley, California

    DOE Data Explorer

    Chelsea Lancelle

    2013-09-11

    In September 2013, an experiment using Distributed Acoustic Sensing (DAS) was conducted at Garner Valley, a test site of the University of California Santa Barbara (Lancelle et al., 2014). This submission includes all DAS data recorded during the experiment. The sampling rate for all files is 1000 samples per second. Any files with the same filename but ending in _01, _02, etc. represent sequential files from the same test. Locations of the sources are plotted on the basemap in GDR submission 481, titled: "PoroTomo Subtask 3.2 Sample data from a Distributed Acoustic Sensing experiment at Garner Valley, California (PoroTomo Subtask 3.2)." Lancelle, C., N. Lord, H. Wang, D. Fratta, R. Nigbor, A. Chalari, R. Karaulanov, J. Baldwin, and E. Castongia (2014), Directivity and Sensitivity of Fiber-Optic Cable Measuring Ground Motion using a Distributed Acoustic Sensing Array (abstract # NS31C-3935), AGU Fall Meeting. 
https://agu.confex.com/agu/fm1/meetingapp.cgi#Paper/19828 The e-poster is available at: https://agu.confex.com/data/handout/agu/fm14/Paper_19828_handout_696_0.pdf

  16. Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform.

    PubMed

    Cao, Jianfang; Chen, Lichao; Wang, Min; Tian, Yun

    2018-01-01

    The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach system speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance.

  17. A Real-Time Magnetoencephalography Brain-Computer Interface Using Interactive 3D Visualization and the Hadoop Ecosystem.

    PubMed

    McClay, Wilbert A; Yadav, Nancy; Ozbek, Yusuf; Haas, Andy; Attias, Hagaii T; Nagarajan, Srikantan S

    2015-09-30

    Ecumenically, the fastest growing segment of Big Data is human biology-related data and the annual data creation is on the order of zetabytes. The implications are global across industries, of which the treatment of brain related illnesses and trauma could see the most significant and immediate effects. The next generation of health care IT and sensory devices are acquiring and storing massive amounts of patient related data. An innovative Brain-Computer Interface (BCI) for interactive 3D visualization is presented utilizing the Hadoop Ecosystem for data analysis and storage. The BCI is an implementation of Bayesian factor analysis algorithms that can distinguish distinct thought actions using magneto encephalographic (MEG) brain signals. We have collected data on five subjects yielding 90% positive performance in MEG mid- and post-movement activity. We describe a driver that substitutes the actions of the BCI as mouse button presses for real-time use in visual simulations. This process has been added into a flight visualization demonstration. By thinking left or right, the user experiences the aircraft turning in the chosen direction. The driver components of the BCI can be compiled into any software and substitute a user's intent for specific keyboard strikes or mouse button presses. The BCI's data analytics OPEN ACCESS Brain. Sci. 2015, 5 420 of a subject's MEG brainwaves and flight visualization performance are stored and analyzed using the Hadoop Ecosystem as a quick retrieval data warehouse.

  18. A Real-Time Magnetoencephalography Brain-Computer Interface Using Interactive 3D Visualization and the Hadoop Ecosystem

    PubMed Central

    McClay, Wilbert A.; Yadav, Nancy; Ozbek, Yusuf; Haas, Andy; Attias, Hagaii T.; Nagarajan, Srikantan S.

    2015-01-01

    Ecumenically, the fastest growing segment of Big Data is human biology-related data and the annual data creation is on the order of zetabytes. The implications are global across industries, of which the treatment of brain related illnesses and trauma could see the most significant and immediate effects. The next generation of health care IT and sensory devices are acquiring and storing massive amounts of patient related data. An innovative Brain-Computer Interface (BCI) for interactive 3D visualization is presented utilizing the Hadoop Ecosystem for data analysis and storage. The BCI is an implementation of Bayesian factor analysis algorithms that can distinguish distinct thought actions using magneto encephalographic (MEG) brain signals. We have collected data on five subjects yielding 90% positive performance in MEG mid- and post-movement activity. We describe a driver that substitutes the actions of the BCI as mouse button presses for real-time use in visual simulations. This process has been added into a flight visualization demonstration. By thinking left or right, the user experiences the aircraft turning in the chosen direction. The driver components of the BCI can be compiled into any software and substitute a user’s intent for specific keyboard strikes or mouse button presses. The BCI’s data analytics of a subject’s MEG brainwaves and flight visualization performance are stored and analyzed using the Hadoop Ecosystem as a quick retrieval data warehouse. PMID:26437432

  19. A Spatiotemporal Indexing Approach for Efficient Processing of Big Array-Based Climate Data with MapReduce

    NASA Technical Reports Server (NTRS)

    Li, Zhenlong; Hu, Fei; Schnase, John L.; Duffy, Daniel Q.; Lee, Tsengdar; Bowen, Michael K.; Yang, Chaowei

    2016-01-01

    Climate observations and model simulations are producing vast amounts of array-based spatiotemporal data. Efficient processing of these data is essential for assessing global challenges such as climate change, natural disasters, and diseases. This is challenging not only because of the large data volume, but also because of the intrinsic high-dimensional nature of geoscience data. To tackle this challenge, we propose a spatiotemporal indexing approach to efficiently manage and process big climate data with MapReduce in a highly scalable environment. Using this approach, big climate data are directly stored in a Hadoop Distributed File System in its original, native file format. A spatiotemporal index is built to bridge the logical array-based data model and the physical data layout, which enables fast data retrieval when performing spatiotemporal queries. Based on the index, a data-partitioning algorithm is applied to enable MapReduce to achieve high data locality, as well as balancing the workload. The proposed indexing approach is evaluated using the National Aeronautics and Space Administration (NASA) Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. The experimental results show that the index can significantly accelerate querying and processing (10 speedup compared to the baseline test using the same computing cluster), while keeping the index-to-data ratio small (0.0328). The applicability of the indexing approach is demonstrated by a climate anomaly detection deployed on a NASA Hadoop cluster. This approach is also able to support efficient processing of general array-based spatiotemporal data in various geoscience domains without special configuration on a Hadoop cluster.

  20. The Archive Solution for Distributed Workflow Management Agents of the CMS Experiment at LHC

    SciTech Connect

    Kuznetsov, Valentin; Fischer, Nils Leif; Guo, Yuyi

    The CMS experiment at the CERN LHC developed the Workflow Management Archive system to persistently store unstructured framework job report documents produced by distributed workflow management agents. In this paper we present its architecture, implementation, deployment, and integration with the CMS and CERN computing infrastructures, such as central HDFS and Hadoop Spark cluster. The system leverages modern technologies such as a document oriented database and the Hadoop eco-system to provide the necessary flexibility to reliably process, store, and aggregatemore » $$\\mathcal{O}$$(1M) documents on a daily basis. We describe the data transformation, the short and long term storage layers, the query language, along with the aggregation pipeline developed to visualize various performance metrics to assist CMS data operators in assessing the performance of the CMS computing system.« less

  1. The Archive Solution for Distributed Workflow Management Agents of the CMS Experiment at LHC

    DOE PAGES

    Kuznetsov, Valentin; Fischer, Nils Leif; Guo, Yuyi

    2018-03-19

    The CMS experiment at the CERN LHC developed the Workflow Management Archive system to persistently store unstructured framework job report documents produced by distributed workflow management agents. In this paper we present its architecture, implementation, deployment, and integration with the CMS and CERN computing infrastructures, such as central HDFS and Hadoop Spark cluster. The system leverages modern technologies such as a document oriented database and the Hadoop eco-system to provide the necessary flexibility to reliably process, store, and aggregatemore » $$\\mathcal{O}$$(1M) documents on a daily basis. We describe the data transformation, the short and long term storage layers, the query language, along with the aggregation pipeline developed to visualize various performance metrics to assist CMS data operators in assessing the performance of the CMS computing system.« less

  2. A Discretization Algorithm for Meteorological Data and its Parallelization Based on Hadoop

    NASA Astrophysics Data System (ADS)

    Liu, Chao; Jin, Wen; Yu, Yuting; Qiu, Taorong; Bai, Xiaoming; Zou, Shuilong

    2017-10-01

    In view of the large amount of meteorological observation data, the property is more and the attribute values are continuous values, the correlation between the elements is the need for the application of meteorological data, this paper is devoted to solving the problem of how to better discretize large meteorological data to more effectively dig out the hidden knowledge in meteorological data and research on the improvement of discretization algorithm for large scale data, in order to achieve data in the large meteorological data discretization for the follow-up to better provide knowledge to provide protection, a discretization algorithm based on information entropy and inconsistency of meteorological attributes is proposed and the algorithm is parallelized under Hadoop platform. Finally, the comparison test validates the effectiveness of the proposed algorithm for discretization in the area of meteorological large data.

  3. Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform

    PubMed Central

    Wang, Min; Tian, Yun

    2018-01-01

    The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach system speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance. PMID:29861711

  4. Accessing files in an Internet: The Jade file system

    NASA Technical Reports Server (NTRS)

    Peterson, Larry L.; Rao, Herman C.

    1991-01-01

    Jade is a new distribution file system that provides a uniform way to name and access files in an internet environment. It makes two important contributions. First, Jade is a logical system that integrates a heterogeneous collection of existing file systems, where heterogeneous means that the underlying file systems support different file access protocols. Jade is designed under the restriction that the underlying file system may not be modified. Second, rather than providing a global name space, Jade permits each user to define a private name space. These private name spaces support two novel features: they allow multiple file systems to be mounted under one directory, and they allow one logical name space to mount other logical name spaces. A prototype of the Jade File System was implemented on Sun Workstations running Unix. It consists of interfaces to the Unix file system, the Sun Network File System, the Andrew File System, and FTP. This paper motivates Jade's design, highlights several aspects of its implementation, and illustrates applications that can take advantage of its features.

  5. Accessing files in an internet - The Jade file system

    NASA Technical Reports Server (NTRS)

    Rao, Herman C.; Peterson, Larry L.

    1993-01-01

    Jade is a new distribution file system that provides a uniform way to name and access files in an internet environment. It makes two important contributions. First, Jade is a logical system that integrates a heterogeneous collection of existing file systems, where heterogeneous means that the underlying file systems support different file access protocols. Jade is designed under the restriction that the underlying file system may not be modified. Second, rather than providing a global name space, Jade permits each user to define a private name space. These private name spaces support two novel features: they allow multiple file systems to be mounted under one directory, and they allow one logical name space to mount other logical name spaces. A prototype of the Jade File System was implemented on Sun Workstations running Unix. It consists of interfaces to the Unix file system, the Sun Network File System, the Andrew File System, and FTP. This paper motivates Jade's design, highlights several aspects of its implementation, and illustrates applications that can take advantage of its features.

  6. Security in the CernVM File System and the Frontier Distributed Database Caching System

    NASA Astrophysics Data System (ADS)

    Dykstra, D.; Blomer, J.

    2014-06-01

    Both the CernVM File System (CVMFS) and the Frontier Distributed Database Caching System (Frontier) distribute centrally updated data worldwide for LHC experiments using http proxy caches. Neither system provides privacy or access control on reading the data, but both control access to updates of the data and can guarantee the authenticity and integrity of the data transferred to clients over the internet. CVMFS has since its early days required digital signatures and secure hashes on all distributed data, and recently Frontier has added X.509-based authenticity and integrity checking. In this paper we detail and compare the security models of CVMFS and Frontier.

  7. ATLAS Data Management Accounting with Hadoop Pig and HBase

    NASA Astrophysics Data System (ADS)

    Lassnig, Mario; Garonne, Vincent; Dimitrov, Gancho; Canali, Luca

    2012-12-01

    The ATLAS Distributed Data Management system requires accounting of its contents at the metadata layer. This presents a hard problem due to the large scale of the system, the high dimensionality of attributes, and the high rate of concurrent modifications of data. The system must efficiently account more than 90PB of disk and tape that store upwards of 500 million files across 100 sites globally. In this work a generic accounting system is presented, which is able to scale to the requirements of ATLAS. The design and architecture is presented, and the implementation is discussed. An emphasis is placed on the design choices such that the underlying data models are generally applicable to different kinds of accounting, reporting and monitoring.

  8. Distributed metadata servers for cluster file systems using shared low latency persistent key-value metadata store

    SciTech Connect

    Bent, John M.; Faibish, Sorin; Pedone, Jr., James M.

    A cluster file system is provided having a plurality of distributed metadata servers with shared access to one or more shared low latency persistent key-value metadata stores. A metadata server comprises an abstract storage interface comprising a software interface module that communicates with at least one shared persistent key-value metadata store providing a key-value interface for persistent storage of key-value metadata. The software interface module provides the key-value metadata to the at least one shared persistent key-value metadata store in a key-value format. The shared persistent key-value metadata store is accessed by a plurality of metadata servers. A metadata requestmore » can be processed by a given metadata server independently of other metadata servers in the cluster file system. A distributed metadata storage environment is also disclosed that comprises a plurality of metadata servers having an abstract storage interface to at least one shared persistent key-value metadata store.« less

  9. Distributing an executable job load file to compute nodes in a parallel computer

    SciTech Connect

    Gooding, Thomas M.

    Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications linkmore » over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.« less

  10. Intrex Subject/Title Inverted-File Characteristics.

    ERIC Educational Resources Information Center

    Uemura, Syunsuke

    The characteristics of the Intrex subject/title inverted file are analyzed. Basic statistics of the inverted file are presented including various distributions of the index words and terms from which the file was derived, and statistics on stems, the file growth process, and redundancy measurements. A study of stems both with extremely high and…

  11. Cloud-Enabled Climate Analytics-as-a-Service using Reanalysis data: A case study.

    NASA Astrophysics Data System (ADS)

    Nadeau, D.; Duffy, D.; Schnase, J. L.; McInerney, M.; Tamkin, G.; Potter, G. L.; Thompson, J. H.

    2014-12-01

    The NASA Center for Climate Simulation (NCCS) maintains advanced data capabilities and facilities that allow researchers to access the enormous volume of data generated by weather and climate models. The NASA Climate Model Data Service (CDS) and the NCCS are merging their efforts to provide Climate Analytics-as-a-Service for the comparative study of the major reanalysis projects: ECMWF ERA-Interim, NASA/GMAO MERRA, NOAA/NCEP CFSR, NOAA/ESRL 20CR, JMA JRA25, and JRA55. These reanalyses have been repackaged to netCDF4 file format following the CMIP5 Climate and Forecast (CF) metadata convention prior to be sequenced into the Hadoop Distributed File System ( HDFS ). A small set of operations that represent a common starting point in many analysis workflows was then created: min, max, sum, count, variance and average. In this example, Reanalysis data exploration was performed with the use of Hadoop MapReduce and accessibility was achieved using the Climate Data Service(CDS) application programming interface (API) created at NCCS. This API provides a uniform treatment of large amount of data. In this case study, we have limited our exploration to 2 variables, temperature and precipitation, using 3 operations, min, max and avg and using 30-year of Reanalysis data for 3 regions of the world: global, polar, subtropical.

  12. Derived virtual devices: a secure distributed file system mechanism

    NASA Technical Reports Server (NTRS)

    VanMeter, Rodney; Hotz, Steve; Finn, Gregory

    1996-01-01

    This paper presents the design of derived virtual devices (DVDs). DVDs are the mechanism used by the Netstation Project to provide secure shared access to network-attached peripherals distributed in an untrusted network environment. DVDs improve Input/Output efficiency by allowing user processes to perform I/O operations directly from devices without intermediate transfer through the controlling operating system kernel. The security enforced at the device through the DVD mechanism includes resource boundary checking, user authentication, and restricted operations, e.g., read-only access. To illustrate the application of DVDs, we present the interactions between a network-attached disk and a file system designed to exploit the DVD abstraction. We further discuss third-party transfer as a mechanism intended to provide for efficient data transfer in a typical NAP environment. We show how DVDs facilitate third-party transfer, and provide the security required in a more open network environment.

  13. Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism.

    PubMed

    Yu, Jia; Blom, Jochen; Sczyrba, Alexander; Goesmann, Alexander

    2017-09-10

    The introduction of next generation sequencing has caused a steady increase in the amounts of data that have to be processed in modern life science. Sequence alignment plays a key role in the analysis of sequencing data e.g. within whole genome sequencing or metagenome projects. BLAST is a commonly used alignment tool that was the standard approach for more than two decades, but in the last years faster alternatives have been proposed including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale-out the calculation of alignments. HAMOND is fault tolerant and scalable by utilizing large cloud computing infrastructures like Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results both in efficiency and accuracy. Copyright © 2017 The Author(s). Published by Elsevier B.V. All rights reserved.

  14. Parallel file system with metadata distributed across partitioned key-value store c

    DOEpatents

    Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron

    2017-09-19

    Improved techniques are provided for storing metadata associated with a plurality of sub-files associated with a single shared file in a parallel file system. The shared file is generated by a plurality of applications executing on a plurality of compute nodes. A compute node implements a Parallel Log Structured File System (PLFS) library to store at least one portion of the shared file generated by an application executing on the compute node and metadata for the at least one portion of the shared file on one or more object storage servers. The compute node is also configured to implement a partitioned data store for storing a partition of the metadata for the shared file, wherein the partitioned data store communicates with partitioned data stores on other compute nodes using a message passing interface. The partitioned data store can be implemented, for example, using Multidimensional Data Hashing Indexing Middleware (MDHIM).

  15. The Global File System

    NASA Technical Reports Server (NTRS)

    Soltis, Steven R.; Ruwart, Thomas M.; OKeefe, Matthew T.

    1996-01-01

    The global file system (GFS) is a prototype design for a distributed file system in which cluster nodes physically share storage devices connected via a network-like fiber channel. Networks and network-attached storage devices have advanced to a level of performance and extensibility so that the previous disadvantages of shared disk architectures are no longer valid. This shared storage architecture attempts to exploit the sophistication of storage device technologies whereas a server architecture diminishes a device's role to that of a simple component. GFS distributes the file system responsibilities across processing nodes, storage across the devices, and file system resources across the entire storage pool. GFS caches data on the storage devices instead of the main memories of the machines. Consistency is established by using a locking mechanism maintained by the storage devices to facilitate atomic read-modify-write operations. The locking mechanism is being prototyped in the Silicon Graphics IRIX operating system and is accessed using standard Unix commands and modules.

  16. A Scalable Cloud Library Empowering Big Data Management, Diagnosis, and Visualization of Cloud-Resolving Models

    NASA Astrophysics Data System (ADS)

    Zhou, S.; Tao, W. K.; Li, X.; Matsui, T.; Sun, X. H.; Yang, X.

    2015-12-01

    A cloud-resolving model (CRM) is an atmospheric numerical model that can numerically resolve clouds and cloud systems at 0.25~5km horizontal grid spacings. The main advantage of the CRM is that it can allow explicit interactive processes between microphysics, radiation, turbulence, surface, and aerosols without subgrid cloud fraction, overlapping and convective parameterization. Because of their fine resolution and complex physical processes, it is challenging for the CRM community to i) visualize/inter-compare CRM simulations, ii) diagnose key processes for cloud-precipitation formation and intensity, and iii) evaluate against NASA's field campaign data and L1/L2 satellite data products due to large data volume (~10TB) and complexity of CRM's physical processes. We have been building the Super Cloud Library (SCL) upon a Hadoop framework, capable of CRM database management, distribution, visualization, subsetting, and evaluation in a scalable way. The current SCL capability includes (1) A SCL data model enables various CRM simulation outputs in NetCDF, including the NASA-Unified Weather Research and Forecasting (NU-WRF) and Goddard Cumulus Ensemble (GCE) model, to be accessed and processed by Hadoop, (2) A parallel NetCDF-to-CSV converter supports NU-WRF and GCE model outputs, (3) A technique visualizes Hadoop-resident data with IDL, (4) A technique subsets Hadoop-resident data, compliant to the SCL data model, with HIVE or Impala via HUE's Web interface, (5) A prototype enables a Hadoop MapReduce application to dynamically access and process data residing in a parallel file system, PVFS2 or CephFS, where high performance computing (HPC) simulation outputs such as NU-WRF's and GCE's are located. We are testing Apache Spark to speed up SCL data processing and analysis.With the SCL capabilities, SCL users can conduct large-domain on-demand tasks without downloading voluminous CRM datasets and various observations from NASA Field Campaigns and Satellite data to a

  17. MRPack: Multi-Algorithm Execution Using Compute-Intensive Approach in MapReduce

    PubMed Central

    2015-01-01

    Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity as real time and streaming data in variety of formats. These characteristics give rise to challenges in its modeling, computation, and processing. Hadoop MapReduce (MR) is a well known data-intensive distributed processing framework using the distributed file system (DFS) for Big Data. Current implementations of MR only support execution of a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. It uses the available computing resources by dynamically managing the task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement. PMID:26305223

  18. MRPack: Multi-Algorithm Execution Using Compute-Intensive Approach in MapReduce.

    PubMed

    Idris, Muhammad; Hussain, Shujaat; Siddiqi, Muhammad Hameed; Hassan, Waseem; Syed Muhammad Bilal, Hafiz; Lee, Sungyoung

    2015-01-01

    Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity as real time and streaming data in variety of formats. These characteristics give rise to challenges in its modeling, computation, and processing. Hadoop MapReduce (MR) is a well known data-intensive distributed processing framework using the distributed file system (DFS) for Big Data. Current implementations of MR only support execution of a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. It uses the available computing resources by dynamically managing the task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement.

  19. Large-Scale Exploratory Analysis, Cleaning, and Modeling for Event Detection in Real-World Power Systems Data

    DTIC Science & Technology

    2013-11-01

    big data with R is relatively new. RHadoop is a mature product from Revolution Analytics that uses R with Hadoop Streaming [15] and provides...agnostic all- data summaries or computations, in which case we use MapReduce directly. 2.3 D&R Software Environment In this work, we use the Hadoop ...job scheduling and tracking, data distribu- tion, system architecture, heterogeneity, and fault-tolerance. Hadoop also provides a distributed key-value

  20. Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and Spark.

    PubMed

    Klein, Max; Sharma, Rati; Bohrer, Chris H; Avelis, Cameron M; Roberts, Elijah

    2017-01-15

    Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology. Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark CONTACT: eroberts@jhu.eduSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  1. Enhanced K-means clustering with encryption on cloud

    NASA Astrophysics Data System (ADS)

    Singh, Iqjot; Dwivedi, Prerna; Gupta, Taru; Shynu, P. G.

    2017-11-01

    This paper tries to solve the problem of storing and managing big files over cloud by implementing hashing on Hadoop in big-data and ensure security while uploading and downloading files. Cloud computing is a term that emphasis on sharing data and facilitates to share infrastructure and resources.[10] Hadoop is an open source software that gives us access to store and manage big files according to our needs on cloud. K-means clustering algorithm is an algorithm used to calculate distance between the centroid of the cluster and the data points. Hashing is a algorithm in which we are storing and retrieving data with hash keys. The hashing algorithm is called as hash function which is used to portray the original data and later to fetch the data stored at the specific key. [17] Encryption is a process to transform electronic data into non readable form known as cipher text. Decryption is the opposite process of encryption, it transforms the cipher text into plain text that the end user can read and understand well. For encryption and decryption we are using Symmetric key cryptographic algorithm. In symmetric key cryptography are using DES algorithm for a secure storage of the files. [3

  2. Medical Big Data Warehouse: Architecture and System Design, a Case Study: Improving Healthcare Resources Distribution.

    PubMed

    Sebaa, Abderrazak; Chikh, Fatima; Nouicer, Amina; Tari, AbdelKamel

    2018-02-19

    The huge increases in medical devices and clinical applications which generate enormous data have raised a big issue in managing, processing, and mining this massive amount of data. Indeed, traditional data warehousing frameworks can not be effective when managing the volume, variety, and velocity of current medical applications. As a result, several data warehouses face many issues over medical data and many challenges need to be addressed. New solutions have emerged and Hadoop is one of the best examples, it can be used to process these streams of medical data. However, without an efficient system design and architecture, these performances will not be significant and valuable for medical managers. In this paper, we provide a short review of the literature about research issues of traditional data warehouses and we present some important Hadoop-based data warehouses. In addition, a Hadoop-based architecture and a conceptual data model for designing medical Big Data warehouse are given. In our case study, we provide implementation detail of big data warehouse based on the proposed architecture and data model in the Apache Hadoop platform to ensure an optimal allocation of health resources.

  3. The Jade File System. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Rao, Herman Chung-Hwa

    1991-01-01

    File systems have long been the most important and most widely used form of shared permanent storage. File systems in traditional time-sharing systems, such as Unix, support a coherent sharing model for multiple users. Distributed file systems implement this sharing model in local area networks. However, most distributed file systems fail to scale from local area networks to an internet. Four characteristics of scalability were recognized: size, wide area, autonomy, and heterogeneity. Owing to size and wide area, techniques such as broadcasting, central control, and central resources, which are widely adopted by local area network file systems, are not adequate for an internet file system. An internet file system must also support the notion of autonomy because an internet is made up by a collection of independent organizations. Finally, heterogeneity is the nature of an internet file system, not only because of its size, but also because of the autonomy of the organizations in an internet. The Jade File System, which provides a uniform way to name and access files in the internet environment, is presented. Jade is a logical system that integrates a heterogeneous collection of existing file systems, where heterogeneous means that the underlying file systems support different file access protocols. Because of autonomy, Jade is designed under the restriction that the underlying file systems may not be modified. In order to avoid the complexity of maintaining an internet-wide, global name space, Jade permits each user to define a private name space. In Jade's design, we pay careful attention to avoiding unnecessary network messages between clients and file servers in order to achieve acceptable performance. Jade's name space supports two novel features: (1) it allows multiple file systems to be mounted under one direction; and (2) it permits one logical name space to mount other logical name spaces. A prototype of Jade was implemented to examine and validate its

  4. Securing the AliEn File Catalogue - Enforcing authorization with accountable file operations

    NASA Astrophysics Data System (ADS)

    Schreiner, Steffen; Bagnasco, Stefano; Sankar Banerjee, Subho; Betev, Latchezar; Carminati, Federico; Vladimirovna Datskova, Olga; Furano, Fabrizio; Grigoras, Alina; Grigoras, Costin; Mendez Lorenzo, Patricia; Peters, Andreas Joachim; Saiz, Pablo; Zhu, Jianlin

    2011-12-01

    The AliEn Grid Services, as operated by the ALICE Collaboration in its global physics analysis grid framework, is based on a central File Catalogue together with a distributed set of storage systems and the possibility to register links to external data resources. This paper describes several identified vulnerabilities in the AliEn File Catalogue access protocol regarding fraud and unauthorized file alteration and presents a more secure and revised design: a new mechanism, called LFN Booking Table, is introduced in order to keep track of access authorization in the transient state of files entering or leaving the File Catalogue. Due to a simplification of the original Access Envelope mechanism for xrootd-protocol-based storage systems, fundamental computational improvements of the mechanism were achieved as well as an up to 50% reduction of the credential's size. By extending the access protocol with signed status messages from the underlying storage system, the File Catalogue receives trusted information about a file's size and checksum and the protocol is no longer dependent on client trust. Altogether, the revised design complies with atomic and consistent transactions and allows for accountable, authentic, and traceable file operations. This paper describes these changes as part and beyond the development of AliEn version 2.19.

  5. Spatial Distribution and Secular Variation of Geomagnetic Filed in China Described by the CHAOS-6 Model and its Error Analysis

    NASA Astrophysics Data System (ADS)

    Wang, Z.; Gu, Z.; Chen, B.; Yuan, J.; Wang, C.

    2016-12-01

    The CHAOS-6 geomagnetic field model, presented in 2016 by the Denmark's national space institute (DTU Space), is a model of the near-Earth magnetic field. According the CHAOS-6 model, seven component data of geomagnetic filed at 30 observatories in China in 2015 and at 3 observatories in China spanning the time interval 2008.0-2016.5 were calculated. Also seven component data of geomagnetic filed from the geomagnetic data of practical observations in China was obtained. Based on the model calculated data and the practical data, we have compared and analyzed the spatial distribution and the secular variation of the geomagnetic field in China. There is obvious difference between the two type data. The CHAOS-6 model cannot describe the spatial distribution and the secular variation of the geomagnetic field in China with comparative precision because of the regional and local magnetic anomalies in China.

  6. ClimateSpark: An In-memory Distributed Computing Framework for Big Climate Data Analytics

    NASA Astrophysics Data System (ADS)

    Hu, F.; Yang, C. P.; Duffy, D.; Schnase, J. L.; Li, Z.

    2016-12-01

    Massive array-based climate data is being generated from global surveillance systems and model simulations. They are widely used to analyze the environment problems, such as climate changes, natural hazards, and public health. However, knowing the underlying information from these big climate datasets is challenging due to both data- and computing- intensive issues in data processing and analyzing. To tackle the challenges, this paper proposes ClimateSpark, an in-memory distributed computing framework to support big climate data processing. In ClimateSpark, the spatiotemporal index is developed to enable Apache Spark to treat the array-based climate data (e.g. netCDF4, HDF4) as native formats, which are stored in Hadoop Distributed File System (HDFS) without any preprocessing. Based on the index, the spatiotemporal query services are provided to retrieve dataset according to a defined geospatial and temporal bounding box. The data subsets will be read out, and a data partition strategy will be applied to equally split the queried data to each computing node, and store them in memory as climateRDDs for processing. By leveraging Spark SQL and User Defined Function (UDFs), the climate data analysis operations can be conducted by the intuitive SQL language. ClimateSpark is evaluated by two use cases using the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. One use case is to conduct the spatiotemporal query and visualize the subset results in animation; the other one is to compare different climate model outputs using Taylor-diagram service. Experimental results show that ClimateSpark can significantly accelerate data query and processing, and enable the complex analysis services served in the SQL-style fashion.

  7. Digital Libraries: The Next Generation in File System Technology.

    ERIC Educational Resources Information Center

    Bowman, Mic; Camargo, Bill

    1998-01-01

    Examines file sharing within corporations that use wide-area, distributed file systems. Applications and user interactions strongly suggest that the addition of services typically associated with digital libraries (content-based file location, strongly typed objects, representation of complex relationships between documents, and extrinsic…

  8. Usage analysis of user files in UNIX

    NASA Technical Reports Server (NTRS)

    Devarakonda, Murthy V.; Iyer, Ravishankar K.

    1987-01-01

    Presented is a user-oriented analysis of short term file usage in a 4.2 BSD UNIX environment. The key aspect of this analysis is a characterization of users and files, which is a departure from the traditional approach of analyzing file references. Two characterization measures are employed: accesses-per-byte (combining fraction of a file referenced and number of references) and file size. This new approach is shown to distinguish differences in files as well as users, which cam be used in efficient file system design, and in creating realistic test workloads for simulations. A multi-stage gamma distribution is shown to closely model the file usage measures. Even though overall file sharing is small, some files belonging to a bulletin board system are accessed by many users, simultaneously and otherwise. Over 50% of users referenced files owned by other users, and over 80% of all files were involved in such references. Based on the differences in files and users, suggestions to improve the system performance were also made.

  9. An Ephemeral Burst-Buffer File System for Scientific Applications

    SciTech Connect

    Wang, Teng; Moody, Adam; Yu, Weikuan

    BurstFS is a distributed file system for node-local burst buffers on high performance computing systems. BurstFS presents a shared file system space across the burst buffers so that applications that use shared files can access the highly-scalable burst buffers without changing their applications.

  10. Distributed Fast Self-Organized Maps for Massive Spectrophotometric Data Analysis †.

    PubMed

    Dafonte, Carlos; Garabato, Daniel; Álvarez, Marco A; Manteiga, Minia

    2018-05-03

    Analyzing huge amounts of data becomes essential in the era of Big Data, where databases are populated with hundreds of Gigabytes that must be processed to extract knowledge. Hence, classical algorithms must be adapted towards distributed computing methodologies that leverage the underlying computational power of these platforms. Here, a parallel, scalable, and optimized design for self-organized maps (SOM) is proposed in order to analyze massive data gathered by the spectrophotometric sensor of the European Space Agency (ESA) Gaia spacecraft, although it could be extrapolated to other domains. The performance comparison between the sequential implementation and the distributed ones based on Apache Hadoop and Apache Spark is an important part of the work, as well as the detailed analysis of the proposed optimizations. Finally, a domain-specific visualization tool to explore astronomical SOMs is presented.

  11. Genotyping in the cloud with Crossbow.

    PubMed

    Gurtowski, James; Schatz, Michael C; Langmead, Ben

    2012-09-01

    Crossbow is a scalable, portable, and automatic cloud computing tool for identifying SNPs from high-coverage, short-read resequencing data. It is built on Apache Hadoop, an implementation of the MapReduce software framework. Hadoop allows Crossbow to distribute read alignment and SNP calling subtasks over a cluster of commodity computers. Two robust tools, Bowtie and SOAPsnp, implement the fundamental alignment and variant calling operations respectively, and have demonstrated capabilities within Crossbow of analyzing approximately one billion short reads per hour on a commodity Hadoop cluster with 320 cores. Through protocol examples, this unit will demonstrate the use of Crossbow for identifying variations in three different operating modes: on a Hadoop cluster, on a single computer, and on the Amazon Elastic MapReduce cloud computing service.

  12. Final Report for File System Support for Burst Buffers on HPC Systems

    SciTech Connect

    Yu, W.; Mohror, K.

    Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. As they are being deployed on more supercomputers, a file system that efficiently manages these burst buffers for fast I/O operations carries great consequence. Over the past year, FSU team has undertaken several efforts to design, prototype and evaluate distributed file systems for burst buffers on HPC systems. These include MetaKV: a Key-Value Store for Metadata Management of Distributed Burst Buffers, a user-level file system with multiple backends, and a specialized file system for large datasets of deep neural networks. Our progress for these respectivemore » efforts are elaborated further in this report.« less

  13. Storing files in a parallel computing system based on user-specified parser function

    DOEpatents

    Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Manzanares, Adam; Torres, Aaron

    2014-10-21

    Techniques are provided for storing files in a parallel computing system based on a user-specified parser function. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a parser from the distributed application for processing the plurality of files prior to storage; and storing one or more of the plurality of files in one or more storage nodes of the parallel computing system based on the processing by the parser. The plurality of files comprise one or more of a plurality of complete files and a plurality of sub-files. The parser can optionally store only those files that satisfy one or more semantic requirements of the parser. The parser can also extract metadata from one or more of the files and the extracted metadata can be stored with one or more of the plurality of files and used for searching for files.

  14. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

    PubMed Central

    2013-01-01

    Motivation Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. Results We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity

  15. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes.

    PubMed

    Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni

    2013-01-01

    Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly

  16. Reconstructing evolutionary trees in parallel for massive sequences.

    PubMed

    Zou, Quan; Wan, Shixiang; Zeng, Xiangxiang; Ma, Zhanshan Sam

    2017-12-14

    Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time/space consuming. Hadoop and Spark are developed recently, which bring spring light for the classical computational biology problems. In this paper, we tried to solve the multiple sequence alignment and evolutionary reconstruction in parallel. HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on the >1GB files, and gets better performance than other evolutionary reconstruction tools. Users could use HPTree for reonstructing evolutioanry trees on the computer clusters or cloud platform (eg. Amazon Cloud). HPTree could help on population evolution research and metagenomics analysis. In this paper, we employ the Hadoop and Spark platform and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. Neighbour-joining model was employed for the evolutionary tree building. We opened our software together with source codes via http://lab.malab.cn/soft/HPtree/ .

  17. Distribution of immunodeficiency fact files with XML--from Web to WAP.

    PubMed

    Väliaho, Jouni; Riikonen, Pentti; Vihinen, Mauno

    2005-06-26

    Although biomedical information is growing rapidly, it is difficult to find and retrieve validated data especially for rare hereditary diseases. There is an increased need for services capable of integrating and validating information as well as proving it in a logically organized structure. A XML-based language enables creation of open source databases for storage, maintenance and delivery for different platforms. Here we present a new data model called fact file and an XML-based specification Inherited Disease Markup Language (IDML), that were developed to facilitate disease information integration, storage and exchange. The data model was applied to primary immunodeficiencies, but it can be used for any hereditary disease. Fact files integrate biomedical, genetic and clinical information related to hereditary diseases. IDML and fact files were used to build a comprehensive Web and WAP accessible knowledge base ImmunoDeficiency Resource (IDR) available at http://bioinf.uta.fi/idr/. A fact file is a user oriented user interface, which serves as a starting point to explore information on hereditary diseases. The IDML enables the seamless integration and presentation of genetic and disease information resources in the Internet. IDML can be used to build information services for all kinds of inherited diseases. The open source specification and related programs are available at http://bioinf.uta.fi/idml/.

  18. Storing files in a parallel computing system based on user or application specification

    SciTech Connect

    Faibish, Sorin; Bent, John M.; Nick, Jeffrey M.

    2016-03-29

    Techniques are provided for storing files in a parallel computing system based on a user-specification. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a specification from the distributed application indicating how the plurality of files should be stored; and storing one or more of the plurality of files in one or more storage nodes of a multi-tier storage system based on the specification. The plurality of files comprise a plurality of complete files and/or a plurality of sub-files. The specification can optionally be processed by a daemon executing on onemore » or more nodes in a multi-tier storage system. The specification indicates how the plurality of files should be stored, for example, identifying one or more storage nodes where the plurality of files should be stored.« less

  19. DSSTox chemical-index files for exposure-related ...

    EPA Pesticide Factsheets

    The Distributed Structure-Searchable Toxicity (DSSTox) ARYEXP and GEOGSE files are newly published, structure-annotated files of the chemical-associated and chemical exposure-related summary experimental content contained in the ArrayExpress Repository and Gene Expression Omnibus (GEO) Series (based on data extracted on September 20, 2008). ARYEXP and GEOGSE contain 887 and 1064 unique chemical substances mapped to 1835 and 2381 chemical exposure-related experiment accession IDs, respectively. The standardized files allow one to assess, compare and search the chemical content in each resource, in the context of the larger DSSTox toxicology data network, as well as across large public cheminformatics resources such as PubChem (http://pubchem.ncbi.nlm.nih.gov). The Distributed Structure-Searchable Toxicity (DSSTox) ARYEXP and GEOGSE files are newly published, structure-annotated files of the chemical-associated and chemical exposure-related summary experimental content contained in the ArrayExpress Repository and Gene Expression Omnibus (GEO) Series (based on data extracted on September 20, 2008). ARYEXP and GEOGSE contain 887 and 1064 unique chemical substances mapped to 1835 and 2381 chemical exposure-related experiment accession IDs, respectively. The standardized files allow one to assess, compare and search the chemical content in each resource, in the context of the larger DSSTox toxicology data network, as well as across large public cheminformatics resourc

  20. DIRAC File Replica and Metadata Catalog

    NASA Astrophysics Data System (ADS)

    Tsaregorodtsev, A.; Poss, S.

    2012-12-01

    File replica and metadata catalogs are essential parts of any distributed data management system, which are largely determining its functionality and performance. A new File Catalog (DFC) was developed in the framework of the DIRAC Project that combines both replica and metadata catalog functionality. The DFC design is based on the practical experience with the data management system of the LHCb Collaboration. It is optimized for the most common patterns of the catalog usage in order to achieve maximum performance from the user perspective. The DFC supports bulk operations for replica queries and allows quick analysis of the storage usage globally and for each Storage Element separately. It supports flexible ACL rules with plug-ins for various policies that can be adopted by a particular community. The DFC catalog allows to store various types of metadata associated with files and directories and to perform efficient queries for the data based on complex metadata combinations. Definition of file ancestor-descendent relation chains is also possible. The DFC catalog is implemented in the general DIRAC distributed computing framework following the standard grid security architecture. In this paper we describe the design of the DFC and its implementation details. The performance measurements are compared with other grid file catalog implementations. The experience of the DFC Catalog usage in the CLIC detector project are discussed.

  1. An Approach Using Parallel Architecture to Storage DICOM Images in Distributed File System

    NASA Astrophysics Data System (ADS)

    Soares, Tiago S.; Prado, Thiago C.; Dantas, M. A. R.; de Macedo, Douglas D. J.; Bauer, Michael A.

    2012-02-01

    Telemedicine is a very important area in medical field that is expanding daily motivated by many researchers interested in improving medical applications. In Brazil was started in 2005, in the State of Santa Catarina has a developed server called the CyclopsDCMServer, which the purpose to embrace the HDF for the manipulation of medical images (DICOM) using a distributed file system. Since then, many researches were initiated in order to seek better performance. Our approach for this server represents an additional parallel implementation in I/O operations since HDF version 5 has an essential feature for our work which supports parallel I/O, based upon the MPI paradigm. Early experiments using four parallel nodes, provide good performance when compare to the serial HDF implemented in the CyclopsDCMServer.

  2. 75 FR 46919 - MidAmerican Energy Company; Notice of Filing

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-04

    ... reclassify high voltage assets and accumulated depreciation, from distribution plant accounts to transmission plant accounts. Any person desiring to intervene or to protest this filing must file in accordance with...

  3. Survey of MapReduce frame operation in bioinformatics.

    PubMed

    Zou, Quan; Li, Xu-Bin; Jiang, Wen-Rui; Lin, Zi-Yu; Li, Gui-Lin; Chen, Ke

    2014-07-01

    Bioinformatics is challenged by the fact that traditional analysis tools have difficulty in processing large-scale data from high-throughput sequencing. The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce frame-based applications that can be employed in the next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as the future works on parallel computing in bioinformatics. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  4. 78 FR 49504 - Combined Notice of Filings #2

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-08-14

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings 2 Take notice that the Commission received the following electric rate filings: Docket Numbers: ER13-2065-000. Applicants: Southern California Edison Company. Description: Amended SGIA & Distribution Service Agmt with Lancaster Little Rock C LLC to be effective ...

  5. SU-E-T-142: Automatic Linac Log File: Analysis and Reporting

    SciTech Connect

    Gainey, M; Rothe, T

    Purpose: End to end QA for IMRT/VMAT is time consuming. Automated linac log file analysis and recalculation of daily recorded fluence, and hence dose, distribution bring this closer. Methods: Matlab (R2014b, Mathworks) software was written to read in and analyse IMRT/VMAT trajectory log files (TrueBeam 1.5, Varian Medical Systems) overnight, and are archived on a backed-up network drive (figure). A summary report (PDF) is sent by email to the duty linac physicist. A structured summary report (PDF) for each patient is automatically updated for embedding into the R&V system (Mosaiq 2.5, Elekta AG). The report contains cross-referenced hyperlinks to easemore » navigation between treatment fractions. Gamma analysis can be performed on planned (DICOM RTPlan) and treated (trajectory log) fluence distributions. Trajectory log files can be converted into RTPlan files for dose distribution calculation (Eclipse, AAA10.0.28, VMS). Results: All leaf positions are within +/−0.10mm: 57% within +/−0.01mm; 89% within 0.05mm. Mean leaf position deviation is 0.02mm. Gantry angle variations lie in the range −0.1 to 0.3 degrees, mean 0.04 degrees. Fluence verification shows excellent agreement between planned and treated fluence. Agreement between planned and treated dose distribution, the derived from log files, is very good. Conclusion: Automated log file analysis is a valuable tool for the busy physicist, enabling potential treated fluence distribution errors to be quickly identified. In the near future we will correlate trajectory log analysis with routine IMRT/VMAT QA analysis. This has the potential to reduce, but not eliminate, the QA workload.« less

  6. 78 FR 17650 - Combined Notice of Filings #2

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-03-22

    ...: Spectrum Nevada Solar, LLC. Description: Application and Initial Baseline Tariff Filing to be effective 4... Service Agreement CA PV Energy, LLC at 1670 Champagne Ave to be effective 3/16/2013. Filed Date: 3/15/13...: Southern California Edison Company. Description: GIA and Distribution Service Agreement CA PV Energy, LLC...

  7. 78 FR 38705 - Combined Notice of Filings #2

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-06-27

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings 2 Take notice that the Commission received the following electric rate filings: Docket Numbers: ER13-1188-010. Applicants: Pacific Gas and Electric Company. Description: Pacific Gas and Electric Company. submits Wholesale Distribution Tariff Rate Case 2013 (WDT2...

  8. SeqHBase: a big data toolset for family based sequencing data analysis.

    PubMed

    He, Min; Person, Thomas N; Hebbring, Scott J; Heinzen, Ethan; Ye, Zhan; Schrodi, Steven J; McPherson, Elizabeth W; Lin, Simon M; Peissig, Peggy L; Brilliant, Murray H; O'Rawe, Jason; Robison, Reid J; Lyon, Gholson J; Wang, Kai

    2015-04-01

    Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis. Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with number of data nodes. With 20 data nodes, SeqHBase took about 5 secs to analyse WES familial data and approximately 1 min to analyse WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

  9. File Management In Space

    NASA Technical Reports Server (NTRS)

    Critchfield, Anna R.; Zepp, Robert H.

    2000-01-01

    We propose that the user interact with the spacecraft as if the spacecraft were a file server, so that the user can select and receive data as files in standard formats (e.g., tables or images, such as jpeg) via the Internet. Internet technology will be used end-to-end from the spacecraft to authorized users, such as the flight operation team, and project scientists. The proposed solution includes a ground system and spacecraft architecture, mission operations scenarios, and an implementation roadmap showing migration from current practice to the future, where distributed users request and receive files of spacecraft data from archives or spacecraft with equal ease. This solution will provide ground support personnel and scientists easy, direct, secure access to their authorized data without cumbersome processing, and can be extended to support autonomous communications with the spacecraft.

  10. 21 CFR 720.3 - How and where to file.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...

  11. 21 CFR 720.3 - How and where to file.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...

  12. 21 CFR 720.3 - How and where to file.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...

  13. 21 CFR 720.3 - How and where to file.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...

  14. 21 CFR 720.3 - How and where to file.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN SERVICES (CONTINUED) COSMETICS VOLUNTARY FILING OF COSMETIC PRODUCT INGREDIENT COMPOSITION STATEMENTS § 720.3 How and where to file. Forms FDA 2512 and FDA 2514 (“Discontinuance of Commercial Distribution of Cosmetic Product Formulation...

  15. Adobe acrobat: an alternative electronic teaching file construction methodology independent of HTML restrictions.

    PubMed

    Katzman, G L

    2001-03-01

    The goal of the project was to create a method by which an in-house digital teaching file could be constructed that was simple, inexpensive, independent of hypertext markup language (HTML) restrictions, and appears identical on multiple platforms. To accomplish this, Microsoft PowerPoint and Adobe Acrobat were used in succession to assemble digital teaching files in the Acrobat portable document file format. They were then verified to appear identically on computers running Windows, Macintosh Operating Systems (OS), and the Silicon Graphics Unix-based OS as either a free-standing file using Acrobat Reader software or from within a browser window using the Acrobat browser plug-in. This latter display method yields a file viewed through a browser window, yet remains independent of underlying HTML restrictions, which may confer an advantage over simple HTML teaching file construction. Thus, a hybrid of HTML-distributed Adobe Acrobat generated WWW documents may be a viable alternative for digital teaching file construction and distribution.

  16. "WWW.MDTF.ORG": a World Wide Web forum for developing open-architecture, freely distributed, digital teaching file software by participant consensus.

    PubMed

    Katzman, G L; Morris, D; Lauman, J; Cochella, C; Goede, P; Harnsberger, H R

    2001-06-01

    To foster a community supported evaluation processes for open-source digital teaching file (DTF) development and maintenance. The mechanisms used to support this process will include standard web browsers, web servers, forum software, and custom additions to the forum software to potentially enable a mediated voting protocol. The web server will also serve as a focal point for beta and release software distribution, which is the desired end-goal of this process. We foresee that www.mdtf.org will provide for widespread distribution of open source DTF software that will include function and interface design decisions from community participation on the website forums.

  17. New directions in the CernVM file system

    NASA Astrophysics Data System (ADS)

    Blomer, Jakob; Buncic, Predrag; Ganis, Gerardo; Hardi, Nikola; Meusel, Rene; Popescu, Radu

    2017-10-01

    The CernVM File System today is commonly used to host and distribute application software stacks. In addition to this core task, recent developments expand the scope of the file system into two new areas. Firstly, CernVM-FS emerges as a good match for container engines to distribute the container image contents. Compared to native container image distribution (e.g. through the “Docker registry”), CernVM-FS massively reduces the network traffic for image distribution. This has been shown, for instance, by a prototype integration of CernVM-FS into Mesos developed by Mesosphere, Inc. We present a path for a smooth integration of CernVM-FS and Docker. Secondly, CernVM-FS recently raised new interest as an option for the distribution of experiment conditions data. Here, the focus is on improved versioning capabilities of CernVM-FS that allows to link the conditions data of a run period to the state of a CernVM-FS repository. Lastly, CernVM-FS has been extended to provide a name space for physics data for the LIGO and CMS collaborations. Searching through a data namespace is often done by a central, experiment specific database service. A name space on CernVM-FS can particularly benefit from an existing, scalable infrastructure and from the POSIX file system interface.

  18. Architecture of distributed picture archiving and communication systems for storing and processing high resolution medical images

    NASA Astrophysics Data System (ADS)

    Tokareva, Victoria

    2018-04-01

    New generation medicine demands a better quality of analysis increasing the amount of data collected during checkups, and simultaneously decreasing the invasiveness of a procedure. Thus it becomes urgent not only to develop advanced modern hardware, but also to implement special software infrastructure for using it in everyday clinical practice, so-called Picture Archiving and Communication Systems (PACS). Developing distributed PACS is a challenging task for nowadays medical informatics. The paper discusses the architecture of distributed PACS server for processing large high-quality medical images, with respect to technical specifications of modern medical imaging hardware, as well as international standards in medical imaging software. The MapReduce paradigm is proposed for image reconstruction by server, and the details of utilizing the Hadoop framework for this task are being discussed in order to provide the design of distributed PACS as ergonomic and adapted to the needs of end users as possible.

  19. 75 FR 18201 - Wisconsin Electric Power Company; Notice of Filing

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-04-09

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission [Docket No. ER10-911-001] Wisconsin Electric Power Company; Notice of Filing April 2, 2010. Take notice that on March 26, 2010, Wisconsin Electric Power Company filed counterpart signature pages to the executed Wholesale Distribution Service...

  20. Cloud-based distributed control of unmanned systems

    NASA Astrophysics Data System (ADS)

    Nguyen, Kim B.; Powell, Darren N.; Yetman, Charles; August, Michael; Alderson, Susan L.; Raney, Christopher J.

    2015-05-01

    Enabling warfighters to efficiently and safely execute dangerous missions, unmanned systems have been an increasingly valuable component in modern warfare. The evolving use of unmanned systems leads to vast amounts of data collected from sensors placed on the remote vehicles. As a result, many command and control (C2) systems have been developed to provide the necessary tools to perform one of the following functions: controlling the unmanned vehicle or analyzing and processing the sensory data from unmanned vehicles. These C2 systems are often disparate from one another, limiting the ability to optimally distribute data among different users. The Space and Naval Warfare Systems Center Pacific (SSC Pacific) seeks to address this technology gap through the UxV to the Cloud via Widgets project. The overarching intent of this three year effort is to provide three major capabilities: 1) unmanned vehicle control using an open service oriented architecture; 2) data distribution utilizing cloud technologies; 3) a collection of web-based tools enabling analysts to better view and process data. This paper focuses on how the UxV to the Cloud via Widgets system is designed and implemented by leveraging the following technologies: Data Distribution Service (DDS), Accumulo, Hadoop, and Ozone Widget Framework (OWF).

  1. Performance Analysis of the Unitree Central File

    NASA Technical Reports Server (NTRS)

    Pentakalos, Odysseas I.; Flater, David

    1994-01-01

    This report consists of two parts. The first part briefly comments on the documentation status of two major systems at NASA#s Center for Computational Sciences, specifically the Cray C98 and the Convex C3830. The second part describes the work done on improving the performance of file transfers between the Unitree Mass Storage System running on the Convex file server and the users workstations distributed over a large georgraphic area.

  2. Distributed and parallel approach for handle and perform huge datasets

    NASA Astrophysics Data System (ADS)

    Konopko, Joanna

    2015-12-01

    Big Data refers to the dynamic, large and disparate volumes of data comes from many different sources (tools, machines, sensors, mobile devices) uncorrelated with each others. It requires new, innovative and scalable technology to collect, host and analytically process the vast amount of data. Proper architecture of the system that perform huge data sets is needed. In this paper, the comparison of distributed and parallel system architecture is presented on the example of MapReduce (MR) Hadoop platform and parallel database platform (DBMS). This paper also analyzes the problem of performing and handling valuable information from petabytes of data. The both paradigms: MapReduce and parallel DBMS are described and compared. The hybrid architecture approach is also proposed and could be used to solve the analyzed problem of storing and processing Big Data.

  3. Incorporating uncertainty in RADTRAN 6.0 input files.

    SciTech Connect

    Dennis, Matthew L.; Weiner, Ruth F.; Heames, Terence John

    Uncertainty may be introduced into RADTRAN analyses by distributing input parameters. The MELCOR Uncertainty Engine (Gauntt and Erickson, 2004) has been adapted for use in RADTRAN to determine the parameter shape and minimum and maximum of the distribution, to sample on the distribution, and to create an appropriate RADTRAN batch file. Coupling input parameters is not possible in this initial application. It is recommended that the analyst be very familiar with RADTRAN and able to edit or create a RADTRAN input file using a text editor before implementing the RADTRAN Uncertainty Analysis Module. Installation of the MELCOR Uncertainty Engine ismore » required for incorporation of uncertainty into RADTRAN. Gauntt and Erickson (2004) provides installation instructions as well as a description and user guide for the uncertainty engine.« less

  4. Extending DIRAC File Management with Erasure-Coding for efficient storage.

    NASA Astrophysics Data System (ADS)

    Cadellin Skipsey, Samuel; Todev, Paulin; Britton, David; Crooks, David; Roy, Gareth

    2015-12-01

    The state of the art in Grid style data management is to achieve increased resilience of data via multiple complete replicas of data files across multiple storage endpoints. While this is effective, it is not the most space-efficient approach to resilience, especially when the reliability of individual storage endpoints is sufficiently high that only a few will be inactive at any point in time. We report on work performed as part of GridPP[1], extending the Dirac File Catalogue and file management interface to allow the placement of erasure-coded files: each file distributed as N identically-sized chunks of data striped across a vector of storage endpoints, encoded such that any M chunks can be lost and the original file can be reconstructed. The tools developed are transparent to the user, and, as well as allowing up and downloading of data to Grid storage, also provide the possibility of parallelising access across all of the distributed chunks at once, improving data transfer and IO performance. We expect this approach to be of most interest to smaller VOs, who have tighter bounds on the storage available to them, but larger (WLCG) VOs may be interested as their total data increases during Run 2. We provide an analysis of the costs and benefits of the approach, along with future development and implementation plans in this area. In general, overheads for multiple file transfers provide the largest issue for competitiveness of this approach at present.

  5. Implementation of a Big Data Accessing and Processing Platform for Medical Records in Cloud.

    PubMed

    Yang, Chao-Tung; Liu, Jung-Chun; Chen, Shuo-Tsung; Lu, Hsin-Wen

    2017-08-18

    Big Data analysis has become a key factor of being innovative and competitive. Along with population growth worldwide and the trend aging of population in developed countries, the rate of the national medical care usage has been increasing. Due to the fact that individual medical data are usually scattered in different institutions and their data formats are varied, to integrate those data that continue increasing is challenging. In order to have scalable load capacity for these data platforms, we must build them in good platform architecture. Some issues must be considered in order to use the cloud computing to quickly integrate big medical data into database for easy analyzing, searching, and filtering big data to obtain valuable information.This work builds a cloud storage system with HBase of Hadoop for storing and analyzing big data of medical records and improves the performance of importing data into database. The data of medical records are stored in HBase database platform for big data analysis. This system performs distributed computing on medical records data processing through Hadoop MapReduce programming, and to provide functions, including keyword search, data filtering, and basic statistics for HBase database. This system uses the Put with the single-threaded method and the CompleteBulkload mechanism to import medical data. From the experimental results, we find that when the file size is less than 300MB, the Put with single-threaded method is used and when the file size is larger than 300MB, the CompleteBulkload mechanism is used to improve the performance of data import into database. This system provides a web interface that allows users to search data, filter out meaningful information through the web, and analyze and convert data in suitable forms that will be helpful for medical staff and institutions.

  6. NASA work unit system file maintenance manual

    NASA Technical Reports Server (NTRS)

    1972-01-01

    The NASA Work Unit System is a management information system for research tasks (i.e., work units) performed under NASA grants and contracts. It supplies profiles on research efforts and statistics on fund distribution. The file maintenance operator can add, delete and change records at a remote terminal or can submit punched cards to the computer room for batch update. The system is designed for file maintenance by a person with little or no knowledge of data processing techniques.

  7. A History of the Andrew File System

    SciTech Connect

    Bashear, Derrick

    2011-02-22

    Derrick Brashear and Jeffrey Altman will present a technical history of the evolution of Andrew File System starting with the early days of the Andrew Project at Carnegie Mellon through the commercialization by Transarc Corporation and IBM and a decade of OpenAFS. The talk will be technical with a focus on the various decisions and implementation trade-offs that were made over the course of AFS versions 1 through 4, the development of the Distributed Computing Environment Distributed File System (DCE DFS), and the course of the OpenAFS development community. The speakers will also discuss the various AFS branches developed atmore » the University of Michigan, Massachusetts Institute of Technology and Carnegie Mellon University.« less

  8. Snake River Plain Geothermal Play Fairway Analysis - Phase 1 KMZ files

    DOE Data Explorer

    John Shervais

    2015-10-10

    This dataset contain raw data files in kmz files (Google Earth georeference format). These files include volcanic vent locations and age, the distribution of fine-grained lacustrine sediments (which act as both a seal and an insulating layer for hydrothermal fluids), and post-Miocene faults compiled from the Idaho Geological Survey, the USGS Quaternary Fault database, and unpublished mapping. It also contains the Composite Common Risk Segment Map created during Phase 1 studies, as well as a file with locations of select deep wells used to interrogate the subsurface.

  9. Algorithms for Large-Scale Astronomical Problems

    DTIC Science & Technology

    2013-08-01

    implemented as a succession of Hadoop MapReduce jobs and sequential programs written in Java . The sampling and splitting stages are implemented as...one MapReduce job, the partitioning and clustering phases make up another job. The merging stage is implemented as a stand-alone Java program. The...Merging. The merging stage is implemented as a sequential Java program that reads the files with the shell information, which were generated by

  10. Portable Map-Reduce Utility for MIT SuperCloud Environment

    DTIC Science & Technology

    2015-09-17

    Reuther, A. Rosa, C. Yee, “Driving Big Data With Big Compute,” IEEE HPEC, Sep 10-12, 2012, Waltham, MA. [6] Apache Hadoop 1.2.1 Documentation: HDFS... big data architecture, which is designed to address these challenges, is made of the computing resources, scheduler, central storage file system...databases, analytics software and web interfaces [1]. These components are common to many big data and supercomputing systems. The platform is

  11. Stochastic Petri net analysis of a replicated file system

    NASA Technical Reports Server (NTRS)

    Bechta Dugan, Joanne; Ciardo, Gianfranco

    1989-01-01

    A stochastic Petri-net model of a replicated file system is presented for a distributed environment where replicated files reside on different hosts and a voting algorithm is used to maintain consistency. Witnesses, which simply record the status of the file but contain no data, can be used in addition to or in place of files to reduce overhead. A model sufficiently detailed to include file status (current or out-of-date), as well as failure and repair of hosts where copies or witnesses reside, is presented. The number of copies and witnesses is a parameter of the model. Two different majority protocols are examined, one where a majority of all copies and witnesses is necessary to form a quorum, and the other where only a majority of the copies and witnesses on operational hosts is needed. The latter, known as adaptive voting, is shown to increase file availability in most cases.

  12. LVFS: A Scalable Petabye/Exabyte Data Storage System

    NASA Astrophysics Data System (ADS)

    Golpayegani, N.; Halem, M.; Masuoka, E. J.; Ye, G.; Devine, N. K.

    2013-12-01

    Managing petabytes of data with hundreds of millions of files is the first step necessary towards an effective big data computing and collaboration environment in a distributed system. We describe here the MODAPS LAADS Virtual File System (LVFS), a new storage architecture which replaces the previous MODAPS operational Level 1 Land Atmosphere Archive Distribution System (LAADS) NFS based approach to storing and distributing datasets from several instruments, such as MODIS, MERIS, and VIIRS. LAADS is responsible for the distribution of over 4 petabytes of data and over 300 million files across more than 500 disks. We present here the first LVFS big data comparative performance results and new capabilities not previously possible with the LAADS system. We consider two aspects in addressing inefficiencies of massive scales of data. First, is dealing in a reliable and resilient manner with the volume and quantity of files in such a dataset, and, second, minimizing the discovery and lookup times for accessing files in such large datasets. There are several popular file systems that successfully deal with the first aspect of the problem. Their solution, in general, is through distribution, replication, and parallelism of the storage architecture. The Hadoop Distributed File System (HDFS), Parallel Virtual File System (PVFS), and Lustre are examples of such file systems that deal with petabyte data volumes. The second aspect deals with data discovery among billions of files, the largest bottleneck in reducing access time. The metadata of a file, generally represented in a directory layout, is stored in ways that are not readily scalable. This is true for HDFS, PVFS, and Lustre as well. Recent experimental file systems, such as Spyglass or Pantheon, have attempted to address this problem through redesign of the metadata directory architecture. LVFS takes a radically different architectural approach by eliminating the need for a separate directory within the file system

  13. A distributed pipeline for DIDSON data processing

    USGS Publications Warehouse

    Li, Liling; Danner, Tyler; Eickholt, Jesse; McCann, Erin L.; Pangle, Kevin; Johnson, Nicholas

    2018-01-01

    Technological advances in the field of ecology allow data on ecological systems to be collected at high resolution, both temporally and spatially. Devices such as Dual-frequency Identification Sonar (DIDSON) can be deployed in aquatic environments for extended periods and easily generate several terabytes of underwater surveillance data which may need to be processed multiple times. Due to the large amount of data generated and need for flexibility in processing, a distributed pipeline was constructed for DIDSON data making use of the Hadoop ecosystem. The pipeline is capable of ingesting raw DIDSON data, transforming the acoustic data to images, filtering the images, detecting and extracting motion, and generating feature data for machine learning and classification. All of the tasks in the pipeline can be run in parallel and the framework allows for custom processing. Applications of the pipeline include monitoring migration times, determining the presence of a particular species, estimating population size and other fishery management tasks.

  14. Monte Carlo based, patient-specific RapidArc QA using Linac log files.

    PubMed

    Teke, Tony; Bergman, Alanah M; Kwa, William; Gill, Bradford; Duzenli, Cheryl; Popescu, I Antoniu

    2010-01-01

    A Monte Carlo (MC) based QA process to validate the dynamic beam delivery accuracy for Varian RapidArc (Varian Medical Systems, Palo Alto, CA) using Linac delivery log files (DynaLog) is presented. Using DynaLog file analysis and MC simulations, the goal of this article is to (a) confirm that adequate sampling is used in the RapidArc optimization algorithm (177 static gantry angles) and (b) to assess the physical machine performance [gantry angle and monitor unit (MU) delivery accuracy]. Ten clinically acceptable RapidArc treatment plans were generated for various tumor sites and delivered to a water-equivalent cylindrical phantom on the treatment unit. Three Monte Carlo simulations were performed to calculate dose to the CT phantom image set: (a) One using a series of static gantry angles defined by 177 control points with treatment planning system (TPS) MLC control files (planning files), (b) one using continuous gantry rotation with TPS generated MLC control files, and (c) one using continuous gantry rotation with actual Linac delivery log files. Monte Carlo simulated dose distributions are compared to both ionization chamber point measurements and with RapidArc TPS calculated doses. The 3D dose distributions were compared using a 3D gamma-factor analysis, employing a 3%/3 mm distance-to-agreement criterion. The dose difference between MC simulations, TPS, and ionization chamber point measurements was less than 2.1%. For all plans, the MC calculated 3D dose distributions agreed well with the TPS calculated doses (gamma-factor values were less than 1 for more than 95% of the points considered). Machine performance QA was supplemented with an extensive DynaLog file analysis. A DynaLog file analysis showed that leaf position errors were less than 1 mm for 94% of the time and there were no leaf errors greater than 2.5 mm. The mean standard deviation in MU and gantry angle were 0.052 MU and 0.355 degrees, respectively, for the ten cases analyzed. The accuracy and

  15. Leveraging the Cloud for Robust and Efficient Lunar Image Processing

    NASA Technical Reports Server (NTRS)

    Chang, George; Malhotra, Shan; Wolgast, Paul

    2011-01-01

    The Lunar Mapping and Modeling Project (LMMP) is tasked to aggregate lunar data, from the Apollo era to the latest instruments on the LRO spacecraft, into a central repository accessible by scientists and the general public. A critical function of this task is to provide users with the best solution for browsing the vast amounts of imagery available. The image files LMMP manages range from a few gigabytes to hundreds of gigabytes in size with new data arriving every day. Despite this ever-increasing amount of data, LMMP must make the data readily available in a timely manner for users to view and analyze. This is accomplished by tiling large images into smaller images using Hadoop, a distributed computing software platform implementation of the MapReduce framework, running on a small cluster of machines locally. Additionally, the software is implemented to use Amazon's Elastic Compute Cloud (EC2) facility. We also developed a hybrid solution to serve images to users by leveraging cloud storage using Amazon's Simple Storage Service (S3) for public data while keeping private information on our own data servers. By using Cloud Computing, we improve upon our local solution by reducing the need to manage our own hardware and computing infrastructure, thereby reducing costs. Further, by using a hybrid of local and cloud storage, we are able to provide data to our users more efficiently and securely. 12 This paper examines the use of a distributed approach with Hadoop to tile images, an approach that provides significant improvements in image processing time, from hours to minutes. This paper describes the constraints imposed on the solution and the resulting techniques developed for the hybrid solution of a customized Hadoop infrastructure over local and cloud resources in managing this ever-growing data set. It examines the performance trade-offs of using the more plentiful resources of the cloud, such as those provided by S3, against the bandwidth limitations such use

  16. The Future of the Andrew File System

    ScienceCinema

    Brashear, Drrick; Altman, Jeffry

    2018-05-25

    The talk will discuss the ten operational capabilities that have made AFS unique in the distributed file system space and how these capabilities are being expanded upon to meet the needs of the 21st century. Derrick Brashear and Jeffrey Altman will present a technical road map of new features and technical innovations that are under development by the OpenAFS community and Your File System, Inc. funded by a U.S. Department of Energy Small Business Innovative Research grant. The talk will end with a comparison of AFS to its modern days competitors.

  17. The Future of the Andrew File System

    SciTech Connect

    Brashear, Drrick; Altman, Jeffry

    2011-02-23

    The talk will discuss the ten operational capabilities that have made AFS unique in the distributed file system space and how these capabilities are being expanded upon to meet the needs of the 21st century. Derrick Brashear and Jeffrey Altman will present a technical road map of new features and technical innovations that are under development by the OpenAFS community and Your File System, Inc. funded by a U.S. Department of Energy Small Business Innovative Research grant. The talk will end with a comparison of AFS to its modern days competitors.

  18. The version control service for the ATLAS data acquisition configuration files

    NASA Astrophysics Data System (ADS)

    Soloviev, Igor

    2012-12-01

    The ATLAS experiment at the LHC in Geneva uses a complex and highly distributed Trigger and Data Acquisition system, involving a very large number of computing nodes and custom modules. The configuration of the system is specified by schema and data in more than 1000 XML files, with various experts responsible for updating the files associated with their components. Maintaining an error free and consistent set of XML files proved a major challenge. Therefore a special service was implemented; to validate any modifications; to check the authorization of anyone trying to modify a file; to record who had made changes, plus when and why; and to provide tools to compare different versions of files and to go back to earlier versions if required. This paper provides details of the implementation and exploitation experience, that may be interesting for other applications using many human-readable files maintained by different people, where consistency of the files and traceability of modifications are key requirements.

  19. A Distributed Problem-Solving Approach to Rule Induction: Learning in Distributed Artificial Intelligence Systems

    DTIC Science & Technology

    1990-11-01

    Intelligence Systems," in Distributed Artifcial Intelligence , vol. II, L. Gasser and M. Huhns (eds), Pitman, London, 1989, pp. 413-430. Shaw, M. Harrow, B...IDTIC FILE COPY A Distributed Problem-Solving Approach to Rule Induction: Learning in Distributed Artificial Intelligence Systems N Michael I. Shaw...SUBTITLE 5. FUNDING NUMBERS A Distributed Problem-Solving Approach to Rule Induction: Learning in Distributed Artificial Intelligence Systems 6

  20. 37 CFR 360.11 - Time of filing.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... OF ROYALTY CLAIMS FILING OF CLAIMS TO ROYALTY FEES COLLECTED UNDER COMPULSORY LICENSE Satellite... compulsory license royalty fees for secondary transmissions by satellite carriers during the previous... Copyright Royalty Board. No royalty fees shall be distributed to any party during the specified period...

  1. Automated quality control in a file-based broadcasting workflow

    NASA Astrophysics Data System (ADS)

    Zhang, Lina

    2014-04-01

    Benefit from the development of information and internet technologies, television broadcasting is transforming from inefficient tape-based production and distribution to integrated file-based workflows. However, no matter how many changes have took place, successful broadcasting still depends on the ability to deliver a consistent high quality signal to the audiences. After the transition from tape to file, traditional methods of manual quality control (QC) become inadequate, subjective, and inefficient. Based on China Central Television's full file-based workflow in the new site, this paper introduces an automated quality control test system for accurate detection of hidden troubles in media contents. It discusses the system framework and workflow control when the automated QC is added. It puts forward a QC criterion and brings forth a QC software followed this criterion. It also does some experiments on QC speed by adopting parallel processing and distributed computing. The performance of the test system shows that the adoption of automated QC can make the production effective and efficient, and help the station to achieve a competitive advantage in the media market.

  2. Distributed File System Utilities to Manage Large DatasetsVersion 0.5

    SciTech Connect

    2014-05-21

    FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/Ocalls. The current suite consists of tools to copy, compare, remove, and list. The tools provide dramatic speedup over existing Linux tools, which often run as a single process.

  3. Methods and apparatus for multi-resolution replication of files in a parallel computing system using semantic information

    DOEpatents

    Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Torres, Aaron

    2015-10-20

    Techniques are provided for storing files in a parallel computing system using different resolutions. A method is provided for storing at least one file generated by a distributed application in a parallel computing system. The file comprises one or more of a complete file and a sub-file. The method comprises the steps of obtaining semantic information related to the file; generating a plurality of replicas of the file with different resolutions based on the semantic information; and storing the file and the plurality of replicas of the file in one or more storage nodes of the parallel computing system. The different resolutions comprise, for example, a variable number of bits and/or a different sub-set of data elements from the file. A plurality of the sub-files can be merged to reproduce the file.

  4. The Use of Binary Search Trees in External Distribution Sorting.

    ERIC Educational Resources Information Center

    Cooper, David; Lynch, Michael F.

    1984-01-01

    Suggests new method of external distribution called tree partitioning that involves use of binary tree to split incoming file into successively smaller partitions for internal sorting. Number of disc accesses during a tree-partitioning sort were calculated in simulation using files extracted from British National Bibliography catalog files. (19…

  5. 29 CFR 4007.3 - Filing requirement; method of filing.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 29 Labor 9 2010-07-01 2010-07-01 false Filing requirement; method of filing. 4007.3 Section 4007.3... PREMIUMS § 4007.3 Filing requirement; method of filing. (a) In general. The estimation, determination... Web site (http://www.pbgc.gov). Subject to the provisions of § 4007.13, the plan administrator of each...

  6. DSSTOX MASTER STRUCTURE-INDEX FILE: SDF FILE AND ...

    EPA Pesticide Factsheets

    The DSSTox Master Structure-Index File serves to consolidate, manage, and ensure quality and uniformity of the chemical and substance information spanning all DSSTox Structure Data Files, including those in development but not yet published separately on this website. The DSSTox Master Structure-Index File serves to consolidate, manage, and ensure quality and uniformity of the chemical and substance information spanning all DSSTox Structure Data Files, including those in development but not yet published separately on this website.

  7. 26 CFR 1.305-3 - Disproportionate distributions.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 26 Internal Revenue 4 2010-04-01 2010-04-01 false Disproportionate distributions. 1.305-3 Section 1.305-3 Internal Revenue INTERNAL REVENUE SERVICE, DEPARTMENT OF THE TREASURY (CONTINUED) INCOME TAX... filed with the Internal Revenue Service Center with which the income tax return was filed. (4) See § 1...

  8. Please Move Inactive Files Off the /projects File System | High-Performance

    Science.gov Websites

    Computing | NREL Please Move Inactive Files Off the /projects File System Please Move Inactive Files Off the /projects File System January 11, 2018 The /projects file system is a shared resource . This year this has created a space crunch - the file system is now about 90% full and we need your help

  9. A General-purpose Framework for Parallel Processing of Large-scale LiDAR Data

    NASA Astrophysics Data System (ADS)

    Li, Z.; Hodgson, M.; Li, W.

    2016-12-01

    Light detection and ranging (LiDAR) technologies have proven efficiency to quickly obtain very detailed Earth surface data for a large spatial extent. Such data is important for scientific discoveries such as Earth and ecological sciences and natural disasters and environmental applications. However, handling LiDAR data poses grand geoprocessing challenges due to data intensity and computational intensity. Previous studies received notable success on parallel processing of LiDAR data to these challenges. However, these studies either relied on high performance computers and specialized hardware (GPUs) or focused mostly on finding customized solutions for some specific algorithms. We developed a general-purpose scalable framework coupled with sophisticated data decomposition and parallelization strategy to efficiently handle big LiDAR data. Specifically, 1) a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerable Hadoop distributed file system, 2) two spatial decomposition techniques are developed to enable efficient parallelization of different types of LiDAR processing tasks, and 3) by coupling existing LiDAR processing tools with Hadoop, this framework is able to conduct a variety of LiDAR data processing tasks in parallel in a highly scalable distributed computing environment. The performance and scalability of the framework is evaluated with a series of experiments conducted on a real LiDAR dataset using a proof-of-concept prototype system. The results show that the proposed framework 1) is able to handle massive LiDAR data more efficiently than standalone tools; and 2) provides almost linear scalability in terms of either increased workload (data volume) or increased computing nodes with both spatial decomposition strategies. We believe that the proposed framework provides valuable references on developing a collaborative cyberinfrastructure for processing big earth science data in a highly scalable environment.

  10. Storage of sparse files using parallel log-structured file system

    SciTech Connect

    Bent, John M.; Faibish, Sorin; Grider, Gary

    A sparse file is stored without holes by storing a data portion of the sparse file using a parallel log-structured file system; and generating an index entry for the data portion, the index entry comprising a logical offset, physical offset and length of the data portion. The holes can be restored to the sparse file upon a reading of the sparse file. The data portion can be stored at a logical end of the sparse file. Additional storage efficiency can optionally be achieved by (i) detecting a write pattern for a plurality of the data portions and generating a singlemore » patterned index entry for the plurality of the patterned data portions; and/or (ii) storing the patterned index entries for a plurality of the sparse files in a single directory, wherein each entry in the single directory comprises an identifier of a corresponding sparse file.« less

  11. 26 CFR 1.355-5 - Records to be kept and information to be filed.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... Records to be kept and information to be filed. (a) Distributing corporation—(1) In general. Every corporation that makes a distribution (the distributing corporation) of stock or securities of a controlled... (IF ANY) OF TAXPAYER], A DISTRIBUTING CORPORATION,” on or with its return for the year of the...

  12. Heterogeneous distributed query processing: The DAVID system

    NASA Technical Reports Server (NTRS)

    Jacobs, Barry E.

    1985-01-01

    The objective of the Distributed Access View Integrated Database (DAVID) project is the development of an easy to use computer system with which NASA scientists, engineers and administrators can uniformly access distributed heterogeneous databases. Basically, DAVID will be a database management system that sits alongside already existing database and file management systems. Its function is to enable users to access the data in other languages and file systems without having to learn the data manipulation languages. Given here is an outline of a talk on the DAVID project and several charts.

  13. 49 CFR 564.5 - Information filing; agency processing of filings.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 49 Transportation 6 2010-10-01 2010-10-01 false Information filing; agency processing of filings... HIGHWAY TRAFFIC SAFETY ADMINISTRATION, DEPARTMENT OF TRANSPORTATION REPLACEABLE LIGHT SOURCE INFORMATION (Eff. until 12-01-12) § 564.5 Information filing; agency processing of filings. (a) Each manufacturer...

  14. Estimation Accuracy on Execution Time of Run-Time Tasks in a Heterogeneous Distributed Environment

    PubMed Central

    Liu, Qi; Cai, Weidong; Jin, Dandan; Shen, Jian; Fu, Zhangjie; Liu, Xiaodong; Linge, Nigel

    2016-01-01

    Distributed Computing has achieved tremendous development since cloud computing was proposed in 2006, and played a vital role promoting rapid growth of data collecting and analysis models, e.g., Internet of things, Cyber-Physical Systems, Big Data Analytics, etc. Hadoop has become a data convergence platform for sensor networks. As one of the core components, MapReduce facilitates allocating, processing and mining of collected large-scale data, where speculative execution strategies help solve straggler problems. However, there is still no efficient solution for accurate estimation on execution time of run-time tasks, which can affect task allocation and distribution in MapReduce. In this paper, task execution data have been collected and employed for the estimation. A two-phase regression (TPR) method is proposed to predict the finishing time of each task accurately. Detailed data of each task have drawn interests with detailed analysis report being made. According to the results, the prediction accuracy of concurrent tasks’ execution time can be improved, in particular for some regular jobs. PMID:27589753

  15. Estimation Accuracy on Execution Time of Run-Time Tasks in a Heterogeneous Distributed Environment.

    PubMed

    Liu, Qi; Cai, Weidong; Jin, Dandan; Shen, Jian; Fu, Zhangjie; Liu, Xiaodong; Linge, Nigel

    2016-08-30

    Distributed Computing has achieved tremendous development since cloud computing was proposed in 2006, and played a vital role promoting rapid growth of data collecting and analysis models, e.g., Internet of things, Cyber-Physical Systems, Big Data Analytics, etc. Hadoop has become a data convergence platform for sensor networks. As one of the core components, MapReduce facilitates allocating, processing and mining of collected large-scale data, where speculative execution strategies help solve straggler problems. However, there is still no efficient solution for accurate estimation on execution time of run-time tasks, which can affect task allocation and distribution in MapReduce. In this paper, task execution data have been collected and employed for the estimation. A two-phase regression (TPR) method is proposed to predict the finishing time of each task accurately. Detailed data of each task have drawn interests with detailed analysis report being made. According to the results, the prediction accuracy of concurrent tasks' execution time can be improved, in particular for some regular jobs.

  16. VLBA Archive &Distribution Architecture

    NASA Astrophysics Data System (ADS)

    Wells, D. C.

    1994-01-01

    Signals from the 10 antennas of NRAO's VLBA [Very Long Baseline Array] are processed by a Correlator. The complex fringe visibilities produced by the Correlator are archived on magnetic cartridges using a low-cost architecture which is capable of scaling and evolving. Archive files are copied to magnetic media to be distributed to users in FITS format, using the BINTABLE extension. Archive files are labelled using SQL INSERT statements, in order to bind the DBMS-based archive catalog to the archive media.

  17. Methods and apparatus for capture and storage of semantic information with sub-files in a parallel computing system

    DOEpatents

    Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Torres, Aaron

    2015-02-03

    Techniques are provided for storing files in a parallel computing system using sub-files with semantically meaningful boundaries. A method is provided for storing at least one file generated by a distributed application in a parallel computing system. The file comprises one or more of a complete file and a plurality of sub-files. The method comprises the steps of obtaining a user specification of semantic information related to the file; providing the semantic information as a data structure description to a data formatting library write function; and storing the semantic information related to the file with one or more of the sub-files in one or more storage nodes of the parallel computing system. The semantic information provides a description of data in the file. The sub-files can be replicated based on semantically meaningful boundaries.

  18. Simple Automatic File Exchange (SAFE) to Support Low-Cost Spacecraft Operation via the Internet

    NASA Technical Reports Server (NTRS)

    Baker, Paul; Repaci, Max; Sames, David

    1998-01-01

    Various issues associated with Simple Automatic File Exchange (SAFE) are presented in viewgraph form. Specific topics include: 1) Packet telemetry, Internet IP networks and cost reduction; 2) Basic functions and technical features of SAFE; 3) Project goals, including low-cost satellite transmission to data centers to be distributed via an Internet; 4) Operations with a replicated file protocol; 5) File exchange operation; 6) Ground stations as gateways; 7) Lessons learned from demonstrations and tests with SAFE; and 8) Feedback and future initiatives.

  19. A cloud platform for remote diagnosis of breast cancer in mammography by fusion of machine and human intelligence

    NASA Astrophysics Data System (ADS)

    Jiang, Guodong; Fan, Ming; Li, Lihua

    2016-03-01

    Mammography is the gold standard for breast cancer screening, reducing mortality by about 30%. The application of a computer-aided detection (CAD) system to assist a single radiologist is important to further improve mammographic sensitivity for breast cancer detection. In this study, a design and realization of the prototype for remote diagnosis system in mammography based on cloud platform were proposed. To build this system, technologies were utilized including medical image information construction, cloud infrastructure and human-machine diagnosis model. Specifically, on one hand, web platform for remote diagnosis was established by J2EE web technology. Moreover, background design was realized through Hadoop open-source framework. On the other hand, storage system was built up with Hadoop distributed file system (HDFS) technology which enables users to easily develop and run on massive data application, and give full play to the advantages of cloud computing which is characterized by high efficiency, scalability and low cost. In addition, the CAD system was realized through MapReduce frame. The diagnosis module in this system implemented the algorithms of fusion of machine and human intelligence. Specifically, we combined results of diagnoses from doctors' experience and traditional CAD by using the man-machine intelligent fusion model based on Alpha-Integration and multi-agent algorithm. Finally, the applications on different levels of this system in the platform were also discussed. This diagnosis system will have great importance for the balanced health resource, lower medical expense and improvement of accuracy of diagnosis in basic medical institutes.

  20. Solving data-at-rest for the storage and retrieval of files in ad hoc networks

    NASA Astrophysics Data System (ADS)

    Knobler, Ron; Scheffel, Peter; Williams, Jonathan; Gaj, Kris; Kaps, Jens-Peter

    2013-05-01

    Based on current trends for both military and commercial applications, the use of mobile devices (e.g. smartphones and tablets) is greatly increasing. Several military applications consist of secure peer to peer file sharing without a centralized authority. For these military applications, if one or more of these mobile devices are lost or compromised, sensitive files can be compromised by adversaries, since COTS devices and operating systems are used. Complete system files cannot be stored on a device, since after compromising a device, an adversary can attack the data at rest, and eventually obtain the original file. Also after a device is compromised, the existing peer to peer system devices must still be able to access all system files. McQ has teamed with the Cryptographic Engineering Research Group at George Mason University to develop a custom distributed file sharing system to provide a complete solution to the data at rest problem for resource constrained embedded systems and mobile devices. This innovative approach scales very well to a large number of network devices, without a single point of failure. We have implemented the approach on representative mobile devices as well as developed an extensive system simulator to benchmark expected system performance based on detailed modeling of the network/radio characteristics, CONOPS, and secure distributed file system functionality. The simulator is highly customizable for the purpose of determining expected system performance for other network topologies and CONOPS.

  1. EVALUATED NUCLEAR STRUCTURE DATA FILE AND RELATED PRODUCTS.

    SciTech Connect

    TULI,J.K.

    The Evaluated Nuclear Structure Data File (ENSDF) is a leading resource for the experimental nuclear data. It is maintained and distributed by the National Nuclear Data Center, Brookhaven National Laboratory. The file is mainly contributed to by an international network of evaluators under the auspice of the International Atomic Energy Agency. The ENSDF is updated, generally by mass number, i.e., evaluating together all isobars for a given mass number. If, however, experimental activity in an isobaric chain is limited to a particular nuclide then only that nuclide is updated. The evaluations are published in the journal Nuclear Data Sheets, Academicmore » Press, a division of Elsevier.« less

  2. Replication in the Harp File System

    DTIC Science & Technology

    1981-07-01

    Shrira Michael Williams iadly 1991 © Massachusetts Institute of Technology (To appear In the Proceedings of the Thirteenth ACM Symposium on Operating...S., Spector, A. Z., and Thompson, D. S. Distributed Logging for Transaction Processing. ACM Special Interest Group on Management of Data 1987 Annual ...System. USENIX Conference Proceedings , June, 1990, pp. 63-71. 15. Hagmann, R. Reimplementing the Cedar File System Using Logging and Group Commit

  3. 76 FR 45517 - Endangered Species; File No. 13330-01

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-07-29

    ... Species; File No. 13330-01 AGENCY: National Marine Fisheries Service (NMFS), National Oceanic and... permit: to collect data on the biology, distribution and abundance of the endangered smalltooth sawfish... and Education Division, Office of Protected Resources, National Marine Fisheries Service. [FR Doc...

  4. NASA thesaurus combined file postings statistics

    NASA Technical Reports Server (NTRS)

    1993-01-01

    The NASA Thesaurus Combined File Postings Statistics is published semiannually (January and July). This alphabetical listing of postable subject terms contained in the NASA Thesaurus is used to display the number of postings (documents) indexed by each subject term from 1968 to date. The postings totals per item are separated by announcement of other media into STAR, IAA, COSMIC, and OTHER, columnar entries covering the NASA document collection (1968 to date). This is a cumulative publication, and except for special cases, no reference is needed to previous issuances. Retention of the January 1992 issue could be helpful for book information. With the July 1992 issue, NALNET book statistics have been replaced by COSMIC statistics for NASA funded software. File postings statistics for the Alternate Data Base covering NASA collection from 1962 through 1967 were published on a one-time basis in September 1975. Subject terms for the Alternate Data Base are derived from the subject Authority List, reprinted 1985, which is available upon request. The distribution of 19,697,748 postings among the 17,446 NASA Thesaurus terms is tabulated on the last page of the NASA Thesaurus Combined File Postings Statistics.

  5. 78 FR 17394 - Filing via the Internet; Electronic Tariff Filings; Revisions to Electric Quarterly Report Filing...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-03-21

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission [Docket Nos. RM07-16-000; RM01-5-000; RM12-3-000] Filing via the Internet; Electronic Tariff Filings; Revisions to Electric Quarterly Report Filing Process; Notice of Technical Conference Take notice that on April 16, 2013, the staff of the...

  6. Registered File Support for Critical Operations Files at (Space Infrared Telescope Facility) SIRTF

    NASA Technical Reports Server (NTRS)

    Turek, G.; Handley, Tom; Jacobson, J.; Rector, J.

    2001-01-01

    The SIRTF Science Center's (SSC) Science Operations System (SOS) has to contend with nearly one hundred critical operations files via comprehensive file management services. The management is accomplished via the registered file system (otherwise known as TFS) which manages these files in a registered file repository composed of a virtual file system accessible via a TFS server and a file registration database. The TFS server provides controlled, reliable, and secure file transfer and storage by registering all file transactions and meta-data in the file registration database. An API is provided for application programs to communicate with TFS servers and the repository. A command line client implementing this API has been developed as a client tool. This paper describes the architecture, current implementation, but more importantly, the evolution of these services based on evolving community use cases and emerging information system technology.

  7. Highway Safety Information System guidebook for the Minnesota state data files. Volume 1 : SAS file formats

    DOT National Transportation Integrated Search

    2001-02-01

    The Minnesota data system includes the following basic files: Accident data (Accident File, Vehicle File, Occupant File); Roadlog File; Reference Post File; Traffic File; Intersection File; Bridge (Structures) File; and RR Grade Crossing File. For ea...

  8. Oscar — Using Byte Pairs to Find File Type and Camera Make of Data Fragments

    NASA Astrophysics Data System (ADS)

    Karresand, Martin; Shahmehri, Nahid

    Mapping out the contents of fragmented storage media is hard if the file system has been corrupted, especially as the current forensic tools rely on meta information to do their job. If it was possible to find all fragments belonging to a certain file type, it would also be possible to recover a lost file. Such a tool could for example be used in the hunt for child pornography. The Oscar method identifies the file type of data fragments based solely on statistics calculated from their structure. The method does not need any meta data to work. We have previously used the byte frequency distribution and the rate of change between consecutive bytes as basis for the statistics, as well as calculating the 2-gram frequency distribution to create a model of different file types. This paper present a variant of the 2-gram method, in that it uses a dynamic smoothing factor. In this way we take the amount of data used to create the centroid into consideration. A previous experiment on file type identification is extended with .mp3 files reaching a detection rate of 76% with a false positives rate of 0.4%. We also use the method to identify the camera make used to capture a .jpg picture from a fragment of the picture. The result shows that we can clearly separate a picture fragment coming from a Fuji or Olympus cameras from a fragment of a picture of the other camera makes used in our test.

  9. Distributed Kernelized Locality-Sensitive Hashing for Faster Image Based Navigation

    DTIC Science & Technology

    2015-03-26

    Facebook, Google, and Yahoo !. Current methods for image retrieval become problematic when implemented on image datasets that can easily reach billions of...correlations. Tech industry leaders like Facebook, Google, and Yahoo ! sort and index even larger volumes of “big data” daily. When attempting to process...open source implementation of Google’s MapReduce programming paradigm [13] which has been used for many different things. Using Apache Hadoop, Yahoo

  10. Storing files in a parallel computing system using list-based index to identify replica files

    SciTech Connect

    Faibish, Sorin; Bent, John M.; Tzelnic, Percy

    Improved techniques are provided for storing files in a parallel computing system using a list-based index to identify file replicas. A file and at least one replica of the file are stored in one or more storage nodes of the parallel computing system. An index for the file comprises at least one list comprising a pointer to a storage location of the file and a storage location of the at least one replica of the file. The file comprises one or more of a complete file and one or more sub-files. The index may also comprise a checksum value formore » one or more of the file and the replica(s) of the file. The checksum value can be evaluated to validate the file and/or the file replica(s). A query can be processed using the list.« less

  11. 43 CFR 4.1362 - Where to file; when to file.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... APPEALS PROCEDURES Special Rules Applicable to Surface Coal Mining Hearings and Appeals Request for Review... Transfer, Assignment Or Sale of Rights Granted Under Permit (federal Program; Federal Lands Program... file; when to file. (a) The request for review shall be filed with the Hearings Division, Office of...

  12. Digital Stratigraphy: Contextual Analysis of File System Traces in Forensic Science.

    PubMed

    Casey, Eoghan

    2017-12-28

    This work introduces novel methods for conducting forensic analysis of file allocation traces, collectively called digital stratigraphy. These in-depth forensic analysis methods can provide insight into the origin, composition, distribution, and time frame of strata within storage media. Using case examples and empirical studies, this paper illuminates the successes, challenges, and limitations of digital stratigraphy. This study also shows how understanding file allocation methods can provide insight into concealment activities and how real-world computer usage can complicate digital stratigraphy. Furthermore, this work explains how forensic analysts have misinterpreted traces of normal file system behavior as indications of concealment activities. This work raises awareness of the value of taking the overall context into account when analyzing file system traces. This work calls for further research in this area and for forensic tools to provide necessary information for such contextual analysis, such as highlighting mass deletion, mass copying, and potential backdating. © 2017 American Academy of Forensic Sciences.

  13. PCF File Format.

    SciTech Connect

    Thoreson, Gregory G

    PCF files are binary files designed to contain gamma spectra and neutron count rates from radiation sensors. It is the native format for the GAmma Detector Response and Analysis Software (GADRAS) package [1]. It can contain multiple spectra and information about each spectrum such as energy calibration. This document outlines the format of the file that would allow one to write a computer program to parse and write such files.

  14. 26 CFR 1.562-3 - Distributions by a member of an affiliated group.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... Distributions by a member of an affiliated group. A personal holding company which files or is required to file a consolidated return with other members of an affiliated group may be required to file a separate personal holding company schedule by reason of the limitations and exceptions provided in section 542(b...

  15. 26 CFR 1.562-3 - Distributions by a member of an affiliated group.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... Distributions by a member of an affiliated group. A personal holding company which files or is required to file a consolidated return with other members of an affiliated group may be required to file a separate personal holding company schedule by reason of the limitations and exceptions provided in section 542(b...

  16. PDBToSDF: Create ligand structure files from PDB file.

    PubMed

    Muppalaneni, Naresh Babu; Rao, Allam Appa

    2011-01-01

    Protein Data Bank (PDB) file contains atomic data for protein and ligand in protein-ligand complexes. Structure data file (SDF) contains data for atoms, bonds, connectivity and coordinates of molecule for ligands. We describe PDBToSDF as a tool to separate the ligand data from pdb file for the calculation of ligand properties like molecular weight, number of hydrogen bond acceptors, hydrogen bond receptors easily.

  17. Analysis of the access patterns at GSFC distributed active archive center

    NASA Technical Reports Server (NTRS)

    Johnson, Theodore; Bedet, Jean-Jacques

    1996-01-01

    The Goddard Space Flight Center (GSFC) Distributed Active Archive Center (DAAC) has been operational for more than two years. Its mission is to support existing and pre Earth Observing System (EOS) Earth science datasets, facilitate the scientific research, and test Earth Observing System Data and Information System (EOSDIS) concepts. Over 550,000 files and documents have been archived, and more than six Terabytes have been distributed to the scientific community. Information about user request and file access patterns, and their impact on system loading, is needed to optimize current operations and to plan for future archives. To facilitate the management of daily activities, the GSFC DAAC has developed a data base system to track correspondence, requests, ingestion and distribution. In addition, several log files which record transactions on Unitree are maintained and periodically examined. This study identifies some of the users' requests and file access patterns at the GSFC DAAC during 1995. The analysis is limited to the subset of orders for which the data files are under the control of the Hierarchical Storage Management (HSM) Unitree. The results show that most of the data volume ordered was for two data products. The volume was also mostly made up of level 3 and 4 data and most of the volume was distributed on 8 mm and 4 mm tapes. In addition, most of the volume ordered was for deliveries in North America although there was a significant world-wide use. There was a wide range of request sizes in terms of volume and number of files ordered. On an average 78.6 files were ordered per request. Using the data managed by Unitree, several caching algorithms have been evaluated for both hit rate and the overhead ('cost') associated with the movement of data from near-line devices to disks. The algorithm called LRU/2 bin was found to be the best for this workload, but the STbin algorithm also worked well.

  18. 7 CFR 283.22 - Form; filing; service; proof of service; computation of time; and extensions of time.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... Agriculture (Continued) FOOD AND NUTRITION SERVICE, DEPARTMENT OF AGRICULTURE FOOD STAMP AND FOOD DISTRIBUTION...) Filing. Papers are considered filed when they are postmarked, or, received, if hand delivered. Date of... serving the document by personal delivery or by mail, setting forth the date, time and manner of service...

  19. Distributed Structure-Searchable Toxicity (DSSTox) Database

    EPA Pesticide Factsheets

    The Distributed Structure-Searchable Toxicity network provides a public forum for publishing downloadable, structure-searchable, standardized chemical structure files associated with chemical inventories or toxicity data sets of environmental relevance.

  20. Files in /noaa/dhs

    Science.gov Websites

    noaa_20110510_wbg.gif 10-May-2011 20:58 31K generic file noaa_20110510_wbg.pdf 10-May-2011 20:58 128K generic file noaa_20110513_wbg.gif 13-May-2011 20:10 27K generic file noaa_20110513_wbg.pdf 13-May-2011 20:10 122K generic file noaa_20110518_wbg.gif 18-May-2011 21:10 33K generic file noaa_20110518_wbg.pdf 18-May-2011 21:10 128K generic file

  1. DSSTox chemical-index files for exposure-related experiments in ArrayExpress and Gene Expression Omnibus: enabling toxico-chemogenomics data linkages

    EPA Science Inventory

    The Distributed Structure-Searchable Toxicity (DSSTox) ARYEXP and GEOGSE files are newly published, structure-annotated files of the chemical-associated and chemical exposure-related summary experimental content contained in the ArrayExpress Repository and Gene Expression Omnibus...

  2. Text File Comparator

    NASA Technical Reports Server (NTRS)

    Kotler, R. S.

    1983-01-01

    File Comparator program IFCOMP, is text file comparator for IBM OS/VScompatable systems. IFCOMP accepts as input two text files and produces listing of differences in pseudo-update form. IFCOMP is very useful in monitoring changes made to software at the source code level.

  3. Comparative Evaluation of Stress Distribution in Experimentally Designed Nickel-titanium Rotary Files with Varying Cross Sections: A Finite Element Analysis.

    PubMed

    Basheer Ahamed, Shadir Bughari; Vanajassun, Purushothaman Pranav; Rajkumar, Kothandaraman; Mahalaxmi, Sekar

    2018-04-01

    Single cross-sectional nickel-titanium (NiTi) rotary instruments during continuous rotations are subjected to constant and variable stresses depending on the canal anatomy. This study was intended to create 2 new experimental, theoretic single-file designs with combinations of triple U (TU), triangle (TR), and convex triangle (CT) cross sections and to compare their bending stresses in simulated root canals with a single cross-sectional instrument using finite element analysis. A 3-dimensional model of the simulated root canal with 45° curvature and NiTi files with 5 cross-sectional designs were created using Pro/ENGINEER Wildfire 4.0 software (PTC Inc, Needham, MA) and ANSYS software (version 17; ANSYS, Inc, Canonsburg, PA) for finite element analysis. The NiTi files of 3 groups had single cross-sectional shapes of CT, TR, and TU designs, and 2 experimental groups had a CT, TR, and TU (CTU) design and a TU, TR, and CT (UTC) design. The file was rotated in simulated root canals to analyze the bending stress, and the von Mises stress value for every file was recorded in MPa. Statistical analysis was performed using the Kruskal-Wallis test and the Bonferroni-adjusted Mann-Whitney test for multiple pair-wise comparison with a P value <.05 (95 %). The maximum bending stress of the rotary file was observed in the apical third of the CT design, whereas comparatively less stress was recorded in the CTU design. The TU and TR designs showed a similar stress pattern at the curvature, whereas the UTC design showed greater stress in the apical and middle thirds of the file in curved canals. All the file designs showed a statistically significant difference. The CTU designed instruments showed the least bending stress on a 45° angulated simulated root canal when compared with all the other tested designs. Copyright © 2017 American Association of Endodontists. Published by Elsevier Inc. All rights reserved.

  4. Data Analysis & Statistical Methods for Command File Errors

    NASA Technical Reports Server (NTRS)

    Meshkat, Leila; Waggoner, Bruce; Bryant, Larry

    2014-01-01

    This paper explains current work on modeling for managing the risk of command file errors. It is focused on analyzing actual data from a JPL spaceflight mission to build models for evaluating and predicting error rates as a function of several key variables. We constructed a rich dataset by considering the number of errors, the number of files radiated, including the number commands and blocks in each file, as well as subjective estimates of workload and operational novelty. We have assessed these data using different curve fitting and distribution fitting techniques, such as multiple regression analysis, and maximum likelihood estimation to see how much of the variability in the error rates can be explained with these. We have also used goodness of fit testing strategies and principal component analysis to further assess our data. Finally, we constructed a model of expected error rates based on the what these statistics bore out as critical drivers to the error rate. This model allows project management to evaluate the error rate against a theoretically expected rate as well as anticipate future error rates.

  5. File-System Workload on a Scientific Multiprocessor

    NASA Technical Reports Server (NTRS)

    Kotz, David; Nieuwejaar, Nils

    1995-01-01

    Many scientific applications have intense computational and I/O requirements. Although multiprocessors have permitted astounding increases in computational performance, the formidable I/O needs of these applications cannot be met by current multiprocessors a their I/O subsystems. To prevent I/O subsystems from forever bottlenecking multiprocessors and limiting the range of feasible applications, new I/O subsystems must be designed. The successful design of computer systems (both hardware and software) depends on a thorough understanding of their intended use. A system designer optimizes the policies and mechanisms for the cases expected to most common in the user's workload. In the case of multiprocessor file systems, however, designers have been forced to build file systems based only on speculation about how they would be used, extrapolating from file-system characterizations of general-purpose workloads on uniprocessor and distributed systems or scientific workloads on vector supercomputers (see sidebar on related work). To help these system designers, in June 1993 we began the Charisma Project, so named because the project sought to characterize 1/0 in scientific multiprocessor applications from a variety of production parallel computing platforms and sites. The Charisma project is unique in recording individual read and write requests-in live, multiprogramming, parallel workloads (rather than from selected or nonparallel applications). In this article, we present the first results from the project: a characterization of the file-system workload an iPSC/860 multiprocessor running production, parallel scientific applications at NASA's Ames Research Center.

  6. Distributed Virtual System (DIVIRS) project

    NASA Technical Reports Server (NTRS)

    Schorr, Herbert; Neuman, B. Clifford

    1993-01-01

    As outlined in the continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC 2-539, the investigators are developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; developing communications routines that support the abstractions implemented; continuing the development of file and information systems based on the Virtual System Model; and incorporating appropriate security measures to allow the mechanisms developed to be used on an open network. The goal throughout the work is to provide a uniform model that can be applied to both parallel and distributed systems. The authors believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. The work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.

  7. Archive Inventory Management System (AIMS) — A Fast, Metrics Gathering Framework for Validating and Gaining Insight from Large File-Based Data Archives

    NASA Astrophysics Data System (ADS)

    Verma, R. V.

    2018-04-01

    The Archive Inventory Management System (AIMS) is a software package for understanding the distribution, characteristics, integrity, and nuances of files and directories in large file-based data archives on a continuous basis.

  8. 25 CFR 580.5 - What happens if I file late or fail to file?

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 25 Indians 2 2013-04-01 2013-04-01 false What happens if I file late or fail to file? 580.5 Section 580.5 Indians NATIONAL INDIAN GAMING COMMISSION, DEPARTMENT OF THE INTERIOR APPEAL PROCEEDINGS... What happens if I file late or fail to file? (a) Failure to file an appeal within the time provided...

  9. 25 CFR 580.5 - What happens if I file late or fail to file?

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 25 Indians 2 2014-04-01 2014-04-01 false What happens if I file late or fail to file? 580.5 Section 580.5 Indians NATIONAL INDIAN GAMING COMMISSION, DEPARTMENT OF THE INTERIOR APPEAL PROCEEDINGS... What happens if I file late or fail to file? (a) Failure to file an appeal within the time provided...

  10. Register file soft error recovery

    DOEpatents

    Fleischer, Bruce M.; Fox, Thomas W.; Wait, Charles D.; Muff, Adam J.; Watson, III, Alfred T.

    2013-10-15

    Register file soft error recovery including a system that includes a first register file and a second register file that mirrors the first register file. The system also includes an arithmetic pipeline for receiving data read from the first register file, and error detection circuitry to detect whether the data read from the first register file includes corrupted data. The system further includes error recovery circuitry to insert an error recovery instruction into the arithmetic pipeline in response to detecting the corrupted data. The inserted error recovery instruction replaces the corrupted data in the first register file with a copy of the data from the second register file.

  11. Report filing in histopathology.

    PubMed Central

    Blenkinsopp, W K

    1977-01-01

    An assessment of alternative methods of filing histopathology report forms in alphabetical order showed that orthodox card index filing is satisfactory up to about 100000 reports but, because of the need for long-term retrieval, when the reports filed exceed this number they should be copied on jacketed microfilm and a new card index file begun. PMID:591645

  12. 11 CFR 100.19 - File, filed or filing (2 U.S.C. 434(a)).

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... Time on the filing date, except that pre-election reports must have a postmark dated no later than 11:59 p.m. Eastern Standard/Daylight Time on the fifteenth day before the date of the election. (2... Standard/Daylight Time on the filing date. (d) 48-hour and 24-hour reports of independent expenditures—(1...

  13. DSSTOX MASTER STRUCTURE-INDEX FILE: SDF FILE AND DOCUMENTATION

    EPA Science Inventory

    The DSSTox Master Structure-Index File serves to consolidate, manage, and ensure quality and uniformity of the chemical and substance information spanning all DSSTox Structure Data Files, including those in development but not yet published separately on this website.

  14. Astronomy In The Cloud: Using Mapreduce For Image Coaddition

    NASA Astrophysics Data System (ADS)

    Wiley, Keith; Connolly, A.; Gardner, J.; Krughoff, S.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.

    2011-01-01

    In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computational challenges such as anomaly detection, classification, and moving object tracking. Since such studies require the highest quality data, methods such as image coaddition, i.e., registration, stacking, and mosaicing, will be critical to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources, e.g., asteroids, or transient objects, e.g., supernovae, these datastreams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this paper we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data is partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources, i.e., platforms where Hadoop is offered as a service. We report on our experience implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image coaddition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results compring their performance. This work is funded by

  15. Fail-over file transfer process

    NASA Technical Reports Server (NTRS)

    Semancik, Susan K. (Inventor); Conger, Annette M. (Inventor)

    2005-01-01

    The present invention provides a fail-over file transfer process to handle data file transfer when the transfer is unsuccessful in order to avoid unnecessary network congestion and enhance reliability in an automated data file transfer system. If a file cannot be delivered after attempting to send the file to a receiver up to a preset number of times, and the receiver has indicated the availability of other backup receiving locations, then the file delivery is automatically attempted to one of the backup receiving locations up to the preset number of times. Failure of the file transfer to one of the backup receiving locations results in a failure notification being sent to the receiver, and the receiver may retrieve the file from the location indicated in the failure notification when ready.

  16. 33 CFR 148.246 - When is a document considered filed and where should I file it?

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... filed and where should I file it? 148.246 Section 148.246 Navigation and Navigable Waters COAST GUARD... Formal Hearings § 148.246 When is a document considered filed and where should I file it? (a) If a document to be filed is submitted by mail, it is considered filed on the date it is postmarked. If a...

  17. Program to convert SUDS2ASC files to a single binary SEGY file

    USGS Publications Warehouse

    Goldman, Mark

    2000-01-01

    This program, SUDS2SEGY, converts and combines ASCII files created using SUDS2ASC Version 2.60, to a single SEGY file. SUDS2ASC has been used previously to create an ASCII file of three-component seismic data for an individual recording station. However, many seismic processing packages have difficulty reading in ASCII data. In addition, it may be cumbersome to process a separate file for each recording station, particularly if traces from different recording stations contain a different number of data samples and/or a different start time. This new program - SUDS2SEGY - combines these recording station files into a single SEGY file. In addition, SUDS2SEGY normalizes the trace times so that each trace starts at a given time and consists of a fixed number of samples. This normalization allows seismic data from many different stations to be read in as a single "data gather". SUDS2SEGY also produces a report summarizing the offset and maximum absolute amplitude for each component in a station file. These data are output separately to an ASCII file and can be subsequently input to a plotting package.

  18. Converting CSV Files to RKSML Files

    NASA Technical Reports Server (NTRS)

    Trebi-Ollennu, Ashitey; Liebersbach, Robert

    2009-01-01

    A computer program converts, into a format suitable for processing on Earth, files of downlinked telemetric data pertaining to the operation of the Instrument Deployment Device (IDD), which is a robot arm on either of the Mars Explorer Rovers (MERs). The raw downlinked data files are in comma-separated- value (CSV) format. The present program converts the files into Rover Kinematics State Markup Language (RKSML), which is an Extensible Markup Language (XML) format that facilitates representation of operations of the IDD and enables analysis of the operations by means of the Rover Sequencing Validation Program (RSVP), which is used to build sequences of commanded operations for the MERs. After conversion by means of the present program, the downlinked data can be processed by RSVP, enabling the MER downlink operations team to play back the actual IDD activity represented by the telemetric data against the planned IDD activity. Thus, the present program enhances the diagnosis of anomalies that manifest themselves as differences between actual and planned IDD activities.

  19. 29 CFR 2200.8 - Filing.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... document shall be filed. (e) Filing date. (1) Except for the documents listed in paragraph (e)(2) of this... 29 Labor 9 2010-07-01 2010-07-01 false Filing. 2200.8 Section 2200.8 Labor Regulations Relating to... § 2200.8 Filing. (a) What to file. All papers required to be served on a party or intervenor, except for...

  20. 76 FR 43679 - Filing via the Internet; Notice of Additional File Formats for efiling

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-07-21

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission [Docket No. RM07-16-000] Filing via the Internet; Notice of Additional File Formats for efiling Take notice that the Commission has added to its list of acceptable file formats the four-character file extensions for Microsoft Office 2007/2010...

  1. Zebra: A striped network file system

    NASA Technical Reports Server (NTRS)

    Hartman, John H.; Ousterhout, John K.

    1992-01-01

    The design of Zebra, a striped network file system, is presented. Zebra applies ideas from log-structured file system (LFS) and RAID research to network file systems, resulting in a network file system that has scalable performance, uses its servers efficiently even when its applications are using small files, and provides high availability. Zebra stripes file data across multiple servers, so that the file transfer rate is not limited by the performance of a single server. High availability is achieved by maintaining parity information for the file system. If a server fails its contents can be reconstructed using the contents of the remaining servers and the parity information. Zebra differs from existing striped file systems in the way it stripes file data: Zebra does not stripe on a per-file basis; instead it stripes the stream of bytes written by each client. Clients write to the servers in units called stripe fragments, which are analogous to segments in an LFS. Stripe fragments contain file blocks that were written recently, without regard to which file they belong. This method of striping has numerous advantages over per-file striping, including increased server efficiency, efficient parity computation, and elimination of parity update.

  2. Permanent-File-Validation Utility Computer Program

    NASA Technical Reports Server (NTRS)

    Derry, Stephen D.

    1988-01-01

    Errors in files detected and corrected during operation. Permanent File Validation (PFVAL) utility computer program provides CDC CYBER NOS sites with mechanism to verify integrity of permanent file base. Locates and identifies permanent file errors in Mass Storage Table (MST) and Track Reservation Table (TRT), in permanent file catalog entries (PFC's) in permit sectors, and in disk sector linkage. All detected errors written to listing file and system and job day files. Program operates by reading system tables , catalog track, permit sectors, and disk linkage bytes to vaidate expected and actual file linkages. Used extensively to identify and locate errors in permanent files and enable online correction, reducing computer-system downtime.

  3. 77 FR 71587 - Wisconsin Public Service Corporation; Notices of Intent To File License Applications, Filing of...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-12-03

    ...] Wisconsin Public Service Corporation; Notices of Intent To File License Applications, Filing of Pre-Application Documents (PAD), Commencement of Pre-Filing Processes and Scoping, Request for Comments on the...: Notices of Intent to File License Applications for Two New Licenses and Commencing the Pre-filing Process...

  4. Distributed Virtual System (DIVIRS) Project

    NASA Technical Reports Server (NTRS)

    Schorr, Herbert; Neuman, B. Clifford

    1993-01-01

    As outlined in our continuation proposal 92-ISI-50R (revised) on contract NCC 2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to program parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the virtual system model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.

  5. DIstributed VIRtual System (DIVIRS) project

    NASA Technical Reports Server (NTRS)

    Schorr, Herbert; Neuman, B. Clifford

    1994-01-01

    As outlined in our continuation proposal 92-ISI-. OR (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.

  6. DIstributed VIRtual System (DIVIRS) project

    NASA Technical Reports Server (NTRS)

    Schorr, Herbert; Neuman, Clifford B.

    1995-01-01

    As outlined in our continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.

  7. FeynArts model file for MSSM transition counterterms from DREG to DRED

    NASA Astrophysics Data System (ADS)

    Stöckinger, Dominik; Varšo, Philipp

    2012-02-01

    The FeynArts model file MSSMdreg2dred implements MSSM transition counterterms which can convert one-loop Green functions from dimensional regularization to dimensional reduction. They correspond to a slight extension of the well-known Martin/Vaughn counterterms, specialized to the MSSM, and can serve also as supersymmetry-restoring counterterms. The paper provides full analytic results for the counterterms and gives one- and two-loop usage examples. The model file can simplify combining MS¯-parton distribution functions with supersymmetric renormalization or avoiding the renormalization of ɛ-scalars in dimensional reduction. Program summaryProgram title:MSSMdreg2dred.mod Catalogue identifier: AEKR_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEKR_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: LGPL-License [1] No. of lines in distributed program, including test data, etc.: 7600 No. of bytes in distributed program, including test data, etc.: 197 629 Distribution format: tar.gz Programming language: Mathematica, FeynArts Computer: Any, capable of running Mathematica and FeynArts Operating system: Any, with running Mathematica, FeynArts installation Classification: 4.4, 5, 11.1 Subprograms used: Cat Id Title Reference ADOW_v1_0 FeynArts CPC 140 (2001) 418 Nature of problem: The computation of one-loop Feynman diagrams in the minimal supersymmetric standard model (MSSM) requires regularization. Two schemes, dimensional regularization and dimensional reduction are both common but have different pros and cons. In order to combine the advantages of both schemes one would like to easily convert existing results from one scheme into the other. Solution method: Finite counterterms are constructed which correspond precisely to the one-loop scheme differences for the MSSM. They are provided as a FeynArts [2] model file. Using this model file together with FeynArts, the (ultra

  8. GEWEX-RFA Data File Format and File Naming Convention

    Atmospheric Science Data Center

    2016-05-20

    ... documentation, will be stored for each data product. Each time data is added to, removed from, or modified in the file set for a product, ... including 29 days in leap-year Februaries. Time series files containing 15-minute data should start at the top of an hour to ...

  9. SU-E-T-184: Clinical VMAT QA Practice Using LINAC Delivery Log Files

    SciTech Connect

    Johnston, H; Jacobson, T; Gu, X

    2015-06-15

    Purpose: To evaluate the accuracy of volumetric modulated arc therapy (VMAT) treatment delivery dose clouds by comparing linac log data to doses measured using an ionization chamber and film. Methods: A commercial IMRT quality assurance (QA) process utilizing a DICOM-RT framework was tested for clinical practice using 30 prostate and 30 head and neck VMAT plans. Delivered 3D VMAT dose distributions were independently checked using a PinPoint ionization chamber and radiographic film in a solid water phantom. DICOM RT coordinates were used to extract the corresponding point and planar doses from 3D log file dose distributions. Point doses were evaluatedmore » by computing the percent error between log file and chamber measured values. A planar dose evaluation was performed for each plan using a 2D gamma analysis with 3% global dose difference and 3 mm isodose point distance criteria. The same analysis was performed to compare treatment planning system (TPS) doses to measured values to establish a baseline assessment of agreement. Results: The mean percent error between log file and ionization chamber dose was 1.0%±2.1% for prostate VMAT plans and −0.2%±1.4% for head and neck plans. The corresponding TPS calculated and measured ionization chamber values agree within 1.7%±1.6%. The average 2D gamma passing rates for the log file comparison to film are 98.8%±1.0% and 96.2%±4.2% for the prostate and head and neck plans, respectively. The corresponding passing rates for the TPS comparison to film are 99.4%±0.5% and 93.9%±5.1%. Overall, the point dose and film data indicate that log file determined doses are in excellent agreement with measured values. Conclusion: Clinical VMAT QA practice using LINAC treatment log files is a fast and reliable method for patient-specific plan evaluation.« less

  10. A File Archival System

    NASA Technical Reports Server (NTRS)

    Fanselow, J. L.; Vavrus, J. L.

    1984-01-01

    ARCH, file archival system for DEC VAX, provides for easy offline storage and retrieval of arbitrary files on DEC VAX system. System designed to eliminate situations that tie up disk space and lead to confusion when different programers develop different versions of same programs and associated files.

  11. Demographic Profile of U.S. Children: National File [Machine-Readable Data File].

    ERIC Educational Resources Information Center

    Peterson, J. L.; White, R. N.

    These two computer files contain social and demographic data about U.S. children and their families taken from the March 1985 Current Population Survey of the U.S. Census. One file is for all children; the second file is for black children. The following column variables are included: (1) family structure; (2) parent educational attainment; (3)…

  12. Distributed Storage Algorithm for Geospatial Image Data Based on Data Access Patterns.

    PubMed

    Pan, Shaoming; Li, Yongkai; Xu, Zhengquan; Chong, Yanwen

    2015-01-01

    Declustering techniques are widely used in distributed environments to reduce query response time through parallel I/O by splitting large files into several small blocks and then distributing those blocks among multiple storage nodes. Unfortunately, however, many small geospatial image data files cannot be further split for distributed storage. In this paper, we propose a complete theoretical system for the distributed storage of small geospatial image data files based on mining the access patterns of geospatial image data using their historical access log information. First, an algorithm is developed to construct an access correlation matrix based on the analysis of the log information, which reveals the patterns of access to the geospatial image data. Then, a practical heuristic algorithm is developed to determine a reasonable solution based on the access correlation matrix. Finally, a number of comparative experiments are presented, demonstrating that our algorithm displays a higher total parallel access probability than those of other algorithms by approximately 10-15% and that the performance can be further improved by more than 20% by simultaneously applying a copy storage strategy. These experiments show that the algorithm can be applied in distributed environments to help realize parallel I/O and thereby improve system performance.

  13. Unstructured medical image query using big data - An epilepsy case study.

    PubMed

    Istephan, Sarmad; Siadat, Mohammad-Reza

    2016-02-01

    Big data technologies are critical to the medical field which requires new frameworks to leverage them. Such frameworks would benefit medical experts to test hypotheses by querying huge volumes of unstructured medical data to provide better patient care. The objective of this work is to implement and examine the feasibility of having such a framework to provide efficient querying of unstructured data in unlimited ways. The feasibility study was conducted specifically in the epilepsy field. The proposed framework evaluates a query in two phases. In phase 1, structured data is used to filter the clinical data warehouse. In phase 2, feature extraction modules are executed on the unstructured data in a distributed manner via Hadoop to complete the query. Three modules have been created, volume comparer, surface to volume conversion and average intensity. The framework allows for user-defined modules to be imported to provide unlimited ways to process the unstructured data hence potentially extending the application of this framework beyond epilepsy field. Two types of criteria were used to validate the feasibility of the proposed framework - the ability/accuracy of fulfilling an advanced medical query and the efficiency that Hadoop provides. For the first criterion, the framework executed an advanced medical query that spanned both structured and unstructured data with accurate results. For the second criterion, different architectures were explored to evaluate the performance of various Hadoop configurations and were compared to a traditional Single Server Architecture (SSA). The surface to volume conversion module performed up to 40 times faster than the SSA (using a 20 node Hadoop cluster) and the average intensity module performed up to 85 times faster than the SSA (using a 40 node Hadoop cluster). Furthermore, the 40 node Hadoop cluster executed the average intensity module on 10,000 models in 3h which was not even practical for the SSA. The current study is

  14. Converting Inhouse Subject Card Files to Electronic Keyword Files.

    ERIC Educational Resources Information Center

    Culmer, Carita M.

    The library at Phoenix College developed the Controversial Issues Files (CIF), a "home made" card file containing references pertinent to specific ongoing assignments. Although the CIF had proven itself to be an excellent resource tool for beginning researchers, it was cumbersome to maintain in the card format, and was limited to very…

  15. Interoperability format translation and transformation between IFC architectural design file and simulation file formats

    DOEpatents

    Chao, Tian-Jy; Kim, Younghun

    2015-02-03

    Automatically translating a building architecture file format (Industry Foundation Class) to a simulation file, in one aspect, may extract data and metadata used by a target simulation tool from a building architecture file. Interoperability data objects may be created and the extracted data is stored in the interoperability data objects. A model translation procedure may be prepared to identify a mapping from a Model View Definition to a translation and transformation function. The extracted data may be transformed using the data stored in the interoperability data objects, an input Model View Definition template, and the translation and transformation function to convert the extracted data to correct geometric values needed for a target simulation file format used by the target simulation tool. The simulation file in the target simulation file format may be generated.

  16. Torsional and cyclic fatigue resistances of glide path preparation instruments: G-file and PathFile.

    PubMed

    Sung, Sang Yup; Ha, Jung-Hong; Kwak, Sang-Won; Abed, Rashid El; Byeon, Kyeongmin; Kim, Hyeon-Cheol

    2014-01-01

    This study aimed to compare cyclic fatigue and torsional resistances of glide path creating instruments with different tapers and tip sizes. Two sizes (G1 and G2) from G-File system and three sizes (PathFile #1, #2, and #3) from PathFile system were used for torsional resistance and cyclic fatigue resistance tests (n = 10). The torsional resistance was evaluated at 2-, 3-, 4-, 5-, and 6-mm from the file tip by plotting the torsional load changes until fracture by rotational loading of 2 rpm. The cyclic fatigue resistance was compared by measuring the number of cycles to failure. Data were analyzed statistically using one-way ANOVA and Duncan's post-hoc comparison. The length of the fractured file fragment was also measured. All fractured fragments were observed under a scanning electron microscope (SEM). Although G-2 file showed a lower torsional strength than PathFile #3 at 2- and 3-mm levels (p < 0.05), they had similar ultimate strengths at 4-, 5-, and 6-mm levels (p > 0.05). The smaller files of each brand had a significantly higher cyclic fatigue resistance than the bigger ones (p < 0.05). PathFile #1 and #2 had higher fatigue resistances than G-files (p < 0.05). While G-1 had a similar fatigue resistance as PathFile #3, G-2 showed the lowest and PathFile #1 showed the highest resistances among the tested groups (p < 0.05). The SEM examination showed typical appearances of cyclic fatigue and torsional fractures, regardless of the tested levels. Clinicians may consider the instruments' sizes for each clinical case in order to get efficient glide path with minimal risk of fracture. © 2014 Wiley Periodicals, Inc.

  17. Apically extruded dentin debris by reciprocating single-file and multi-file rotary system.

    PubMed

    De-Deus, Gustavo; Neves, Aline; Silva, Emmanuel João; Mendonça, Thais Accorsi; Lourenço, Caroline; Calixto, Camila; Lima, Edson Jorge Moreira

    2015-03-01

    This study aims to evaluate the apical extrusion of debris by the two reciprocating single-file systems: WaveOne and Reciproc. Conventional multi-file rotary system was used as a reference for comparison. The hypotheses tested were (i) the reciprocating single-file systems extrude more than conventional multi-file rotary system and (ii) the reciprocating single-file systems extrude similar amounts of dentin debris. After solid selection criteria, 80 mesial roots of lower molars were included in the present study. The use of four different instrumentation techniques resulted in four groups (n = 20): G1 (hand-file technique), G2 (ProTaper), G3 (WaveOne), and G4 (Reciproc). The apparatus used to evaluate the collection of apically extruded debris was typical double-chamber collector. Statistical analysis was performed for multiple comparisons. No significant difference was found in the amount of the debris extruded between the two reciprocating systems. In contrast, conventional multi-file rotary system group extruded significantly more debris than both reciprocating groups. Hand instrumentation group extruded significantly more debris than all other groups. The present results yielded favorable input for both reciprocation single-file systems, inasmuch as they showed an improved control of apically extruded debris. Apical extrusion of debris has been studied extensively because of its clinical relevance, particularly since it may cause flare-ups, originated by the introduction of bacteria, pulpal tissue, and irrigating solutions into the periapical tissues.

  18. 76 FR 52323 - Combined Notice of Filings; Filings Instituting Proceedings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-08-22

    .... Applicants: Young Gas Storage Company, Ltd. Description: Young Gas Storage Company, Ltd. submits tariff..., but intervention is necessary to become a party to the proceeding. The filings are accessible in the.... More detailed information relating to filing requirements, interventions, protests, and service can be...

  19. Interoperability format translation and transformation between IFC architectural design file and simulation file formats

    SciTech Connect

    Chao, Tian-Jy; Kim, Younghun

    Automatically translating a building architecture file format (Industry Foundation Class) to a simulation file, in one aspect, may extract data and metadata used by a target simulation tool from a building architecture file. Interoperability data objects may be created and the extracted data is stored in the interoperability data objects. A model translation procedure may be prepared to identify a mapping from a Model View Definition to a translation and transformation function. The extracted data may be transformed using the data stored in the interoperability data objects, an input Model View Definition template, and the translation and transformation function tomore » convert the extracted data to correct geometric values needed for a target simulation file format used by the target simulation tool. The simulation file in the target simulation file format may be generated.« less

  20. ECFS: A decentralized, distributed and fault-tolerant FUSE filesystem for the LHCb online farm

    NASA Astrophysics Data System (ADS)

    Rybczynski, Tomasz; Bonaccorsi, Enrico; Neufeld, Niko

    2014-06-01

    The LHCb experiment records millions of proton collisions every second, but only a fraction of them are useful for LHCb physics. In order to filter out the "bad events" a large farm of x86-servers (~2000 nodes) has been put in place. These servers boot from and run from NFS, however they use their local disk to temporarily store data, which cannot be processed in real-time ("data-deferring"). These events are subsequently processed, when there are no live-data coming in. The effective CPU power is thus greatly increased. This gain in CPU power depends critically on the availability of the local disks. For cost and power-reasons, mirroring (RAID-1) is not used, leading to a lot of operational headache with failing disks and disk-errors or server failures induced by faulty disks. To mitigate these problems and increase the reliability of the LHCb farm, while at same time keeping cost and power-consumption low, an extensive research and study of existing highly available and distributed file systems has been done. While many distributed file systems are providing reliability by "file replication", none of the evaluated ones supports erasure algorithms. A decentralised, distributed and fault-tolerant "write once read many" file system has been designed and implemented as a proof of concept providing fault tolerance without using expensive - in terms of disk space - file replication techniques and providing a unique namespace as a main goals. This paper describes the design and the implementation of the Erasure Codes File System (ECFS) and presents the specialised FUSE interface for Linux. Depending on the encoding algorithm ECFS will use a certain number of target directories as a backend to store the segments that compose the encoded data. When target directories are mounted via nfs/autofs - ECFS will act as a file-system over network/block-level raid over multiple servers.

  1. Demonstration of application-driven network slicing and orchestration in optical/packet domains: on-demand vDC expansion for Hadoop MapReduce optimization.

    PubMed

    Kong, Bingxin; Liu, Siqi; Yin, Jie; Li, Shengru; Zhu, Zuqing

    2018-05-28

    Nowadays, it is common for service providers (SPs) to leverage hybrid clouds to improve the quality-of-service (QoS) of their Big Data applications. However, for achieving guaranteed latency and/or bandwidth in its hybrid cloud, an SP might desire to have a virtual datacenter (vDC) network, in which it can manage and manipulate the network connections freely. To address this requirement, we design and implement a network slicing and orchestration (NSO) system that can create and expand vDCs across optical/packet domains on-demand. Considering Hadoop MapReduce (M/R) as the use-case, we describe the proposed architectures of the system's data, control and management planes, and present the operation procedures for creating, expanding, monitoring and managing a vDC for M/R optimization. The proposed NSO system is then realized in a small-scale network testbed that includes four optical/packet domains, and we conduct experiments in it to demonstrate the whole operations of the data, control and management planes. Our experimental results verify that application-driven on-demand vDC expansion across optical/packet domains can be achieved for M/R optimization, and after being provisioned with a vDC, the SP using the NSO system can fully control the vDC network and further optimize the M/R jobs in it with network orchestration.

  2. NASA Uniform Files Index

    NASA Technical Reports Server (NTRS)

    1987-01-01

    This handbook is a guide for the use of all personnel engaged in handling NASA files. It is issued in accordance with the regulations of the National Archives and Records Administration, in the Code of Federal Regulations Title 36, Part 1224, Files Management; and the Federal Information Resources Management Regulation, Subpart 201-45.108, Files Management. It is intended to provide a standardized classification and filing scheme to achieve maximum uniformity and ease in maintaining and using agency records. It is a framework for consistent organization of information in an arrangement that will be useful to current and future researchers. The NASA Uniform Files Index coding structure is composed of the subject classification table used for NASA management directives and the subject groups in the NASA scientific and technical information system. It is designed to correlate files throughout NASA and it is anticipated that it may be useful with automated filing systems. It is expected that in the conversion of current files to this arrangement it will be necessary to add tertiary subjects and make further subdivisions under the existing categories. Established primary and secondary subject categories may not be changed arbitrarily. Proposals for additional subject categories of NASA-wide applicability, and suggestions for improvement in this handbook, should be addressed to the Records Program Manager at the pertinent installation who will forward it to the NASA Records Management Office, Code NTR, for approval. This handbook is issued in loose-leaf form and will be revised by page changes.

  3. How Object-Specific Are Object Files? Evidence for Integration by Location

    ERIC Educational Resources Information Center

    van Dam, Wessel O.; Hommel, Bernhard

    2010-01-01

    Given the distributed representation of visual features in the human brain, binding mechanisms are necessary to integrate visual information about the same perceptual event. It has been assumed that feature codes are bound into object files--pointers to the neural codes of the features of a given event. The present study investigated the…

  4. The European Southern Observatory-MIDAS table file system

    NASA Technical Reports Server (NTRS)

    Peron, M.; Grosbol, P.

    1992-01-01

    The new and substantially upgraded version of the Table File System in MIDAS is presented as a scientific database system. MIDAS applications for performing database operations on tables are discussed, for instance, the exchange of the data to and from the TFS, the selection of objects, the uncertainty joins across tables, and the graphical representation of data. This upgraded version of the TFS is a full implementation of the binary table extension of the FITS format; in addition, it also supports arrays of strings. Different storage strategies for optimal access of very large data sets are implemented and are addressed in detail. As a simple relational database, the TFS may be used for the management of personal data files. This opens the way to intelligent pipeline processing of large amounts of data. One of the key features of the Table File System is to provide also an extensive set of tools for the analysis of the final results of a reduction process. Column operations using standard and special mathematical functions as well as statistical distributions can be carried out; commands for linear regression and model fitting using nonlinear least square methods and user-defined functions are available. Finally, statistical tests of hypothesis and multivariate methods can also operate on tables.

  5. 20 CFR 217.16 - Filing date.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 20 Employees' Benefits 1 2010-04-01 2010-04-01 false Filing date. 217.16 Section 217.16 Employees... LUMP SUM Filing An Application § 217.16 Filing date. An application filed in a manner and form acceptable to the Board is officially filed with the Board on the earliest of the following dates: (a) On the...

  6. Sensitivity Data File Formats

    SciTech Connect

    Rearden, Bradley T.

    2016-04-01

    The format of the TSUNAMI-A sensitivity data file produced by SAMS for cases with deterministic transport solutions is given in Table 6.3.A.1. The occurrence of each entry in the data file is followed by an identification of the data contained on each line of the file and the FORTRAN edit descriptor denoting the format of each line. A brief description of each line is also presented. A sample of the TSUNAMI-A data file for the Flattop-25 sample problem is provided in Figure 6.3.A.1. Here, only two profiles out of the 130 computed are shown.

  7. 75 FR 38805 - Filing Via the Internet; Electronic Tariff Filings Notice of Display of Time on Commission's...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-07-06

    ... system the time used by the Commission to mark officially the time that eFilings and eTariff submissions... timely. The time display will assist users in ensuring that their filings are timely filed, i.e., are... electronic submissions in lieu of paper using the eFiling link at http://www.ferc.gov . Also, Filing...

  8. Cyclic fatigue resistance of a novel rotary file manufactured using controlled memory Ni-Ti technology compared to a file made from M-wire file.

    PubMed

    AlShwaimi, E

    2018-01-01

    To compare the cyclic fatigue properties of a novel file made using controlled memory Ni-Ti technology with those of files made from M-wire. Twelve files with similar cross-sectional geometry and tip size from each of the following groups were tested: Proflexendo made from CMT (PE; size 30 0.04; Nexden, Houston, Tx, USA), ProFile Vortex made from M-wire (PV; size 30 0.04; Dentsply Tulsa Dental, Tulsa, OK, USA) and ProTaper Universal made from regular alloy (PU; F3; Dentsply Tulsa Dental). A custom-made cyclic fatigue device was made to evaluate the total number of cycles to failure for each system. A scanning electron microscope (SEM) was used to examine the fractured surfaces of the fragments. The arithmetic means and standard deviations were calculated for the total number of cycles to failure. One-way analysis of variance was used to compare the mean cyclic failure amongst the three groups. Post hoc Tukey's test was performed to compare the difference of the means between the groups at a significance level of P < 0.05. Proflexendo had a significantly greater resistance to cyclic fatigue compared to other systems (P < 0.001). Proflexendo files were able to withstand 500% more cycles to fracture when compared to ProFile Vortex files. Manufacturing technique had a significant impact on the resistance to cyclic fatigue. Proflexendo files made from controlled memory Ni-Ti technology had the highest number of cycles to failure compared to ProFile Vortex made from M-wire files with similar taper and tip size. © 2017 International Endodontic Journal. Published by John Wiley & Sons Ltd.

  9. 78 FR 6772 - Failure To File Gain Recognition Agreements and Other Required Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-01-31

    ... regulations that would amend the existing rules governing the consequences to U.S. persons for failing to file... current law, if a U.S. transferor fails to timely file an initial GRA, or fails to comply in any material... fails to timely file an annual certification), the U.S. transferor is subject to full gain recognition...

  10. NCEP BUFR File Structure

    Science.gov Websites

    . These tables may be defined within a separate ASCII text file (see Description and Format of BUFR Tables time, the BUFR tables are usually read from an external ASCII text file (although it is also possible reports. Click here to view the ASCII text file (called /nwprod/fix/bufrtab.002 on the NCEP CCS machines

  11. Small Aircraft Data Distribution System

    NASA Technical Reports Server (NTRS)

    Chazanoff, Seth L.; Dinardo, Steven J.

    2012-01-01

    The CARVE Small Aircraft Data Distribution System acquires the aircraft location and attitude data that is required by the various programs running on a distributed network. This system distributes the data it acquires to the data acquisition programs for inclusion in their data files. It uses UDP (User Datagram Protocol) to broadcast data over a LAN (Local Area Network) to any programs that might have a use for the data. The program is easily adaptable to acquire additional data and log that data to disk. The current version also drives displays using precision pitch and roll information to aid the pilot in maintaining a level-level attitude for radar/radiometer mapping beyond the degree available by flying visually or using a standard gyro-driven attitude indicator. The software is designed to acquire an array of data to help the mission manager make real-time decisions as to the effectiveness of the flight. This data is displayed for the mission manager and broadcast to the other experiments on the aircraft for inclusion in their data files. The program also drives real-time precision pitch and roll displays for the pilot and copilot to aid them in maintaining the desired attitude, when required, during data acquisition on mapping lines.

  12. 22 CFR 123.22 - Filing, retention, and return of export licenses and filing of export information.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 22 Foreign Relations 1 2014-04-01 2014-04-01 false Filing, retention, and return of export licenses and filing of export information. 123.22 Section 123.22 Foreign Relations DEPARTMENT OF STATE....22 Filing, retention, and return of export licenses and filing of export information. (a) Any export...

  13. 48 CFR 6101.2 - Filing cases; time limits for filing; notice of docketing; consolidation [Rule 2].

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... 48 Federal Acquisition Regulations System 7 2012-10-01 2012-10-01 false Filing cases; time limits... 6101.2 Filing cases; time limits for filing; notice of docketing; consolidation [Rule 2]. (a) Filing... name, address, telephone number, facsimile machine number, and e-mail address, if available, of the...

  14. Cleaning of endodontic files, Part I: The effect of bioburden on the sterilization of endodontic files.

    PubMed

    Johnson, M A; Primack, P D; Loushine, R J; Craft, D W

    1997-01-01

    Ninety-two new endodontic files were randomly assigned to five groups with varying parameters of contamination, cleaning method, and sterilization (steam or chemical). Files were instrumented in bovine teeth to accumulate debris and a known contaminant, Bacillus stearothermophilus. Positive controls produced growth on both T-soy agar plates and in T-soy broth. Negative controls and experimental files (some with heavy debris) failed to produce growth. The results showed that there was no significant difference between contaminated files that were not cleaned before sterilization and contaminated files that were cleaned before sterilization. Bioburden present on endodontic files does not appear to affect the sterilization process.

  15. The Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. The interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.

  16. 77 FR 66601 - Electronic Tariff Filings; Notice of Change to eTariff Type of Filing Codes

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-11-06

    ... Tariff Filings; Notice of Change to eTariff Type of Filing Codes Take notice that, effective November 18, 2012, the list of available eTariff Type of Filing Codes (TOFC) will be modified to include a new TOFC... Energy's regulations. Tariff records included in such filings will be automatically accepted to be...

  17. Issues in ATM Support of High-Performance, Geographically Distributed Computing

    NASA Technical Reports Server (NTRS)

    Claus, Russell W.; Dowd, Patrick W.; Srinidhi, Saragur M.; Blade, Eric D.G

    1995-01-01

    This report experimentally assesses the effect of the underlying network in a cluster-based computing environment. The assessment is quantified by application-level benchmarking, process-level communication, and network file input/output. Two testbeds were considered, one small cluster of Sun workstations and another large cluster composed of 32 high-end IBM RS/6000 platforms. The clusters had Ethernet, fiber distributed data interface (FDDI), Fibre Channel, and asynchronous transfer mode (ATM) network interface cards installed, providing the same processors and operating system for the entire suite of experiments. The primary goal of this report is to assess the suitability of an ATM-based, local-area network to support interprocess communication and remote file input/output systems for distributed computing.

  18. 76 FR 62092 - Filing Procedures

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-10-06

    ... INTERNATIONAL TRADE COMMISSION Filing Procedures AGENCY: International Trade Commission. ACTION: Notice of issuance of Handbook on Filing Procedures. SUMMARY: The United States International Trade Commission (``Commission'') is issuing a Handbook on Filing Procedures to replace its Handbook on Electronic...

  19. Folksonomical P2P File Sharing Networks Using Vectorized KANSEI Information as Search Tags

    NASA Astrophysics Data System (ADS)

    Ohnishi, Kei; Yoshida, Kaori; Oie, Yuji

    We present the concept of folksonomical peer-to-peer (P2P) file sharing networks that allow participants (peers) to freely assign structured search tags to files. These networks are similar to folksonomies in the present Web from the point of view that users assign search tags to information distributed over a network. As a concrete example, we consider an unstructured P2P network using vectorized Kansei (human sensitivity) information as structured search tags for file search. Vectorized Kansei information as search tags indicates what participants feel about their files and is assigned by the participant to each of their files. A search query also has the same form of search tags and indicates what participants want to feel about files that they will eventually obtain. A method that enables file search using vectorized Kansei information is the Kansei query-forwarding method, which probabilistically propagates a search query to peers that are likely to hold more files having search tags that are similar to the query. The similarity between the search query and the search tags is measured in terms of their dot product. The simulation experiments examine if the Kansei query-forwarding method can provide equal search performance for all peers in a network in which only the Kansei information and the tendency with respect to file collection are different among all of the peers. The simulation results show that the Kansei query forwarding method and a random-walk-based query forwarding method, for comparison, work effectively in different situations and are complementary. Furthermore, the Kansei query forwarding method is shown, through simulations, to be superior to or equal to the random-walk based one in terms of search speed.

  20. 12 CFR 1780.9 - Filing of papers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 7 2010-01-01 2010-01-01 false Filing of papers. 1780.9 Section 1780.9 Banks... papers. (a) Filing. Any papers required to be filed shall be addressed to the presiding officer and filed... Director or the presiding officer. All papers filed by electronic media shall also concurrently be filed in...

  1. 12 CFR 1780.9 - Filing of papers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 12 Banks and Banking 7 2011-01-01 2011-01-01 false Filing of papers. 1780.9 Section 1780.9 Banks... papers. (a) Filing. Any papers required to be filed shall be addressed to the presiding officer and filed... Director or the presiding officer. All papers filed by electronic media shall also concurrently be filed in...

  2. The Standard Autonomous File Server, a Customized, Off-the-Shelf Success Story

    NASA Technical Reports Server (NTRS)

    Semancik, Susan K.; Conger, Annette M.; Obenschain, Arthur F. (Technical Monitor)

    2001-01-01

    The Standard Autonomous File Server (SAFS), which includes both off-the-shelf hardware and software, uses an improved automated file transfer process to provide a quicker, more reliable, prioritized file distribution for customers of near real-time data without interfering with the assets involved in the acquisition and processing of the data. It operates as a stand-alone solution, monitoring itself, and providing an automated fail-over process to enhance reliability. This paper will describe the unique problems and lessons learned both during the COTS selection and integration into SAFS, and the system's first year of operation in support of NASA's satellite ground network. COTS was the key factor in allowing the two-person development team to deploy systems in less than a year, meeting the required launch schedule. The SAFS system his been so successful, it is becoming a NASA standard resource, leading to its nomination for NASA's Software or the Year Award in 1999.

  3. Local File Disclosure Vulnerability: A Case Study of Public-Sector Web Applications

    NASA Astrophysics Data System (ADS)

    Ahmed, M. Imran; Maruf Hassan, Md; Bhuyian, Touhid

    2018-01-01

    Almost all public-sector organisations in Bangladesh now offer online services through web applications, along with the existing channels, in their endeavour to realise the dream of a ‘Digital Bangladesh’. Nations across the world have joined the online environment thanks to training and awareness initiatives by their government. File sharing and downloading activities using web applications have now become very common, not only ensuring the easy distribution of different types of files and documents but also enormously reducing the time and effort of users. Although the online services that are being used frequently have made users’ life easier, it has increased the risk of exploitation of local file disclosure (LFD) vulnerability in the web applications of different public-sector organisations due to unsecure design and careless coding. This paper analyses the root cause of LFD vulnerability, its exploitation techniques, and its impact on 129 public-sector websites in Bangladesh by examining the use of manual black box testing approach.

  4. The Standard Autonomous File Server, A Customized, Off-the-Shelf Success Story

    NASA Technical Reports Server (NTRS)

    Semancik, Susan K.; Conger, Annette M.; Obenschain, Arthur F. (Technical Monitor)

    2001-01-01

    The Standard Autonomous File Server (SAFS), which includes both off-the-shelf hardware and software, uses an improved automated file transfer process to provide a quicker, more reliable, prioritized file distribution for customers of near real-time data without interfering with the assets involved in the acquisition and processing of the data. It operates as a stand-alone solution, monitoring itself, and providing an automated fail-over process to enhance reliability. This paper describes the unique problems and lessons learned both during the COTS selection and integration into SAFS, and the system's first year of operation in support of NASA's satellite ground network. COTS was the key factor in allowing the two-person development team to deploy systems in less than a year, meeting the required launch schedule. The SAFS system has been so successful; it is becoming a NASA standard resource, leading to its nomination for NASA's Software of the Year Award in 1999.

  5. 78 FR 35618 - Three Valleys Municipal Water District; Notice of Application Accepted for Filing and Soliciting...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-06-13

    ... generation of 600,000 kilowatt-hours. m. This filing is available for review and reproduction at the..., and distributing and consulting on a draft exemption application. Dated: June 6, 2013. Kimberly D...

  6. 76 FR 50210 - Combined Notice of Filings #

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-08-12

    ..., Inc. submits tariff filing per 35.13(a)(2)(iii: Filing of Notice of Succession to be effective 10/5..., Inc. submits tariff filing per 35.13(a)(2)(iii: Filing of Notice of Succession to be effective 10/5..., Inc. submits tariff filing per 35.13(a)(2)(iii: Filing of Notice of Succession to be effective 10/5...

  7. 76 FR 12950 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-03-09

    ....204: EnCana Marketing Negotiated Rate Agreement Amendment to be effective 2/24/2011. Filed Date: 03/01... submits tariff filing per 154.403(d)(2): Fuel Filing 2011 to be effective 4/1/2011. Filed Date: 03/01/2011... submits tariff filing per 154.403(d)(2): Fuel Tracker 2011 to be effective 4/1/2011. Filed Date: 03/01...

  8. The Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/0 requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

  9. 46 CFR 380.2 - Filing applications.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 46 Shipping 8 2010-10-01 2010-10-01 false Filing applications. 380.2 Section 380.2 Shipping MARITIME ADMINISTRATION, DEPARTMENT OF TRANSPORTATION MISCELLANEOUS PROCEDURES Filing of Applications Under Section 805(a), 1936 Act § 380.2 Filing applications. (a) An applicant under section 805(a) shall file his...

  10. 48 CFR 1404.802 - Contract files.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 48 Federal Acquisition Regulations System 5 2011-10-01 2011-10-01 false Contract files. 1404.802 Section 1404.802 Federal Acquisition Regulations System DEPARTMENT OF THE INTERIOR GENERAL ADMINISTRATIVE MATTERS Contract Files 1404.802 Contract files. In addition to the requirements in FAR 4.802, files shall...

  11. 48 CFR 1404.802 - Contract files.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... 48 Federal Acquisition Regulations System 5 2013-10-01 2013-10-01 false Contract files. 1404.802 Section 1404.802 Federal Acquisition Regulations System DEPARTMENT OF THE INTERIOR GENERAL ADMINISTRATIVE MATTERS Contract Files 1404.802 Contract files. In addition to the requirements in FAR 4.802, files shall...

  12. 48 CFR 1404.802 - Contract files.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... 48 Federal Acquisition Regulations System 5 2012-10-01 2012-10-01 false Contract files. 1404.802 Section 1404.802 Federal Acquisition Regulations System DEPARTMENT OF THE INTERIOR GENERAL ADMINISTRATIVE MATTERS Contract Files 1404.802 Contract files. In addition to the requirements in FAR 4.802, files shall...

  13. 48 CFR 1404.802 - Contract files.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... 48 Federal Acquisition Regulations System 5 2014-10-01 2014-10-01 false Contract files. 1404.802 Section 1404.802 Federal Acquisition Regulations System DEPARTMENT OF THE INTERIOR GENERAL ADMINISTRATIVE MATTERS Contract Files 1404.802 Contract files. In addition to the requirements in FAR 4.802, files shall...

  14. 48 CFR 1404.802 - Contract files.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 48 Federal Acquisition Regulations System 5 2010-10-01 2010-10-01 false Contract files. 1404.802 Section 1404.802 Federal Acquisition Regulations System DEPARTMENT OF THE INTERIOR GENERAL ADMINISTRATIVE MATTERS Contract Files 1404.802 Contract files. In addition to the requirements in FAR 4.802, files shall...

  15. Adding Big Data Analytics to GCSS-MC

    DTIC Science & Technology

    2014-09-30

    TERMS Big Data , Hadoop , MapReduce, GCSS-MC 15. NUMBER OF PAGES 93 16. PRICE CODE 17. SECURITY CLASSIFICATION OF REPORT Unclassified 18. SECURITY...10 2.5 Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 The Experiment Design 23 3.1 Why Add a Big Data Element...23 3.2 Adding a Big Data Element to GCSS-MC . . . . . . . . . . . . . . 24 3.3 Building a Hadoop Cluster

  16. Astronomy in the Cloud: Using MapReduce for Image Co-Addition

    NASA Astrophysics Data System (ADS)

    Wiley, K.; Connolly, A.; Gardner, J.; Krughoff, S.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.

    2011-03-01

    In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computation challenges such as anomaly detection and classification and moving-object tracking. Since such studies benefit from the highest-quality data, methods such as image co-addition, i.e., astrometric registration followed by per-pixel summation, will be a critical preprocessing step prior to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this article we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data are partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources: i.e., platforms where Hadoop is offered as a service. We report on our experience of implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multiterabyte imaging data set provides a good testbed for algorithm development, since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image co-addition to the MapReduce framework. Then we describe a number of optimizations to our basic approach

  17. Log-less metadata management on metadata server for parallel file systems.

    PubMed

    Liao, Jianwei; Xiao, Guoqiang; Peng, Xiaoning

    2014-01-01

    This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally.

  18. Log-Less Metadata Management on Metadata Server for Parallel File Systems

    PubMed Central

    Xiao, Guoqiang; Peng, Xiaoning

    2014-01-01

    This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally. PMID:24892093

  19. 78 FR 21930 - Aquenergy Systems, Inc.; Notice of Intent To File License Application, Filing of Pre-Application...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-04-12

    ... Systems, Inc.; Notice of Intent To File License Application, Filing of Pre-Application Document, and Approving Use of the Traditional Licensing Process a. Type of Filing: Notice of Intent to File License...: November 11, 2012. d. Submitted by: Aquenergy Systems, Inc., a fully owned subsidiaries of Enel Green Power...

  20. 5 CFR 2422.5 - Filing petitions.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 5 Administrative Personnel 3 2010-01-01 2010-01-01 false Filing petitions. 2422.5 Section 2422.5... FEDERAL LABOR RELATIONS AUTHORITY REPRESENTATION PROCEEDINGS § 2422.5 Filing petitions. (a) Where to file. Petitions must be filed with the Regional Director for the region in which the unit or employee(s) affected...

  1. 12 CFR 5.4 - Filing required.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 1 2010-01-01 2010-01-01 false Filing required. 5.4 Section 5.4 Banks and... CORPORATE ACTIVITIES Rules of General Applicability § 5.4 Filing required. (a) Filing. A depository... filings are available in the Manual and from each district office. (c) Other applications accepted. At the...

  2. 78 FR 46936 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-08-02

    ...: Filings Instituting Proceedings Docket Numbers: RP13-1103-000. Applicants: Northern Border Pipeline Company. Description: ACA Filing 2013 to be effective 10/1/2013. Filed Date: 7/25/13. Accession Number... Gas Transmission System. Description: ACA Filing 2013 to be effective 10/1/2013. Filed Date: 7/25/13...

  3. Monitoring WLCG with lambda-architecture: a new scalable data store and analytics platform for monitoring at petabyte scale.

    NASA Astrophysics Data System (ADS)

    Magnoni, L.; Suthakar, U.; Cordeiro, C.; Georgiou, M.; Andreeva, J.; Khan, A.; Smith, D. R.

    2015-12-01

    Monitoring the WLCG infrastructure requires the gathering and analysis of a high volume of heterogeneous data (e.g. data transfers, job monitoring, site tests) coming from different services and experiment-specific frameworks to provide a uniform and flexible interface for scientists and sites. The current architecture, where relational database systems are used to store, to process and to serve monitoring data, has limitations in coping with the foreseen increase in the volume (e.g. higher LHC luminosity) and the variety (e.g. new data-transfer protocols and new resource-types, as cloud-computing) of WLCG monitoring events. This paper presents a new scalable data store and analytics platform designed by the Support for Distributed Computing (SDC) group, at the CERN IT department, which uses a variety of technologies each one targeting specific aspects of big-scale distributed data-processing (commonly referred as lambda-architecture approach). Results of data processing on Hadoop for WLCG data activities monitoring are presented, showing how the new architecture can easily analyze hundreds of millions of transfer logs in a few minutes. Moreover, a comparison of data partitioning, compression and file format (e.g. CSV, Avro) is presented, with particular attention given to how the file structure impacts the overall MapReduce performance. In conclusion, the evolution of the current implementation, which focuses on data storage and batch processing, towards a complete lambda-architecture is discussed, with consideration of candidate technology for the serving layer (e.g. Elasticsearch) and a description of a proof of concept implementation, based on Apache Spark and Esper, for the real-time part which compensates for batch-processing latency and automates problem detection and failures.

  4. Accessing and distributing EMBL data using CORBA (common object request broker architecture).

    PubMed

    Wang, L; Rodriguez-Tomé, P; Redaschi, N; McNeil, P; Robinson, A; Lijnzaad, P

    2000-01-01

    The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences and related information traditionally made available in flat-file format. Queries through tools such as SRS (Sequence Retrieval System) also return data in flat-file format. Flat files have a number of shortcomings, however, and the resources therefore currently lack a flexible environment to meet individual researchers' needs. The Object Management Group's common object request broker architecture (CORBA) is an industry standard that provides platform-independent programming interfaces and models for portable distributed object-oriented computing applications. Its independence from programming languages, computing platforms and network protocols makes it attractive for developing new applications for querying and distributing biological data. A CORBA infrastructure developed by EMBL-EBI provides an efficient means of accessing and distributing EMBL data. The EMBL object model is defined such that it provides a basis for specifying interfaces in interface definition language (IDL) and thus for developing the CORBA servers. The mapping from the object model to the relational schema in the underlying Oracle database uses the facilities provided by PersistenceTM, an object/relational tool. The techniques of developing loaders and 'live object caching' with persistent objects achieve a smart live object cache where objects are created on demand. The objects are managed by an evictor pattern mechanism. The CORBA interfaces to the EMBL database address some of the problems of traditional flat-file formats and provide an efficient means for accessing and distributing EMBL data. CORBA also provides a flexible environment for users to develop their applications by building clients to our CORBA servers, which can be integrated into existing systems.

  5. Accessing and distributing EMBL data using CORBA (common object request broker architecture)

    PubMed Central

    Wang, Lichun; Rodriguez-Tomé, Patricia; Redaschi, Nicole; McNeil, Phil; Robinson, Alan; Lijnzaad, Philip

    2000-01-01

    Background: The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences and related information traditionally made available in flat-file format. Queries through tools such as SRS (Sequence Retrieval System) also return data in flat-file format. Flat files have a number of shortcomings, however, and the resources therefore currently lack a flexible environment to meet individual researchers' needs. The Object Management Group's common object request broker architecture (CORBA) is an industry standard that provides platform-independent programming interfaces and models for portable distributed object-oriented computing applications. Its independence from programming languages, computing platforms and network protocols makes it attractive for developing new applications for querying and distributing biological data. Results: A CORBA infrastructure developed by EMBL-EBI provides an efficient means of accessing and distributing EMBL data. The EMBL object model is defined such that it provides a basis for specifying interfaces in interface definition language (IDL) and thus for developing the CORBA servers. The mapping from the object model to the relational schema in the underlying Oracle database uses the facilities provided by PersistenceTM, an object/relational tool. The techniques of developing loaders and 'live object caching' with persistent objects achieve a smart live object cache where objects are created on demand. The objects are managed by an evictor pattern mechanism. Conclusions: The CORBA interfaces to the EMBL database address some of the problems of traditional flat-file formats and provide an efficient means for accessing and distributing EMBL data. CORBA also provides a flexible environment for users to develop their applications by building clients to our CORBA servers, which can be integrated into existing systems. PMID:11178259

  6. Small file aggregation in a parallel computing system

    DOEpatents

    Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Zhang, Jingwang

    2014-09-02

    Techniques are provided for small file aggregation in a parallel computing system. An exemplary method for storing a plurality of files generated by a plurality of processes in a parallel computing system comprises aggregating the plurality of files into a single aggregated file; and generating metadata for the single aggregated file. The metadata comprises an offset and a length of each of the plurality of files in the single aggregated file. The metadata can be used to unpack one or more of the files from the single aggregated file.

  7. 5 CFR 1203.13 - Filing pleadings.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... delivery, by facsimile, or by e-filing in accordance with § 1201.14 of this chapter. If the document was... submitted by e-filing, it is considered to have been filed on the date of electronic submission. (e... 5 Administrative Personnel 3 2010-01-01 2010-01-01 false Filing pleadings. 1203.13 Section 1203.13...

  8. 18 CFR 385.2001 - Filings (Rule 2001).

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 18 Conservation of Power and Water Resources 1 2010-04-01 2010-04-01 false Filings (Rule 2001... Filings in Proceedings Before the Commission § 385.2001 Filings (Rule 2001). (a) Filings with the... filing via the Internet pursuant to Rule 2003 through the links provided at http://www.ferc.gov. Note to...

  9. How Hedstrom files fail during clinical use? A retrieval study based on SEM, optical microscopy and micro-XCT analysis.

    PubMed

    Zinelis, Spiros; Al Jabbari, Youssef S

    2018-05-01

    This study was conducted to evaluate the failure mechanism of clinically failed Hedstrom (H)-files. Discarded H-files (n=160) from #8 to #40 ISO sizes were collected from different dental clinics. Retrieved files were classified according to their macroscopic appearance and they were investigated under scanning electron microscopy (SEM) and X-ray micro-computed tomography (mXCT). Then the files were embedded in resin along their longitudinal axis and after metallographic grinding and polishing, studied under an incident light microscope. The macroscopic evaluation showed that small ISO sizes (#08-#15) failed by extensive plastic deformation, while larger sizes (≥#20) tended to fracture. Light microscopy and mXCT results coincided showing that unused and plastically deformed files were free of internal defects, while fractured files demonstrate the presence of intense cracking in the flute region. SEM analysis revealed the presence of striations attributed to the fatigue mechanism. Secondary cracks were also identified by optical microscopy and their distribution was correlated to fatigue under bending loading. Experimental results demonstrated that while overloading of cutting instruments is the predominating failure mechanism of small file sizes (#08-#15), fatigue should be considered the fracture mechanism for larger sizes (≥#20).

  10. Cytoscape file of chemical networks

    EPA Pesticide Factsheets

    The maximum connectivity scores of pairwise chemical conditions summarized from Cmap results in a file with Cytoscape format (http://www.cytoscape.org/). The figures in the publication were generated from this file. The Cytoscape file is formed from importing the eight text file therein.This dataset is associated with the following publication:Wang , R., A. Biales , N. Garcia-Reyero, E. Perkins, D. Villeneuve, G. Ankley, and D. Bencic. Fish Connectivity Mapping: Linking Chemical Stressors by Their MOA-Driven Transcriptomic Profiles. BMC Genomics. BioMed Central Ltd, London, UK, 17(84): 1-20, (2016).

  11. 17 CFR 232.13 - Date of filing; adjustment of filing date.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed filed on.... Eastern Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed... Daylight Savings Time, whichever is currently in effect, shall be deemed filed on the same business day. (4...

  12. 17 CFR 232.13 - Date of filing; adjustment of filing date.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed filed on.... Eastern Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed... Daylight Savings Time, whichever is currently in effect, shall be deemed filed on the same business day. (4...

  13. 17 CFR 232.13 - Date of filing; adjustment of filing date.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed filed on.... Eastern Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed... Daylight Savings Time, whichever is currently in effect, shall be deemed filed on the same business day. (4...

  14. 17 CFR 232.13 - Date of filing; adjustment of filing date.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed filed on.... Eastern Standard Time or Eastern Daylight Saving Time, whichever is currently in effect, shall be deemed... Daylight Savings Time, whichever is currently in effect, shall be deemed filed on the same business day. (4...

  15. 49 CFR 1104.6 - Timely filing required.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... offers next day delivery to Washington, DC. If the e-filing option is chosen (for those pleadings and documents that are appropriate for e-filing, as determined by reference to the information on the Board's Web site), then the e-filed pleading or document is timely filed if the e-filing process is completed...

  16. 10 CFR 110.89 - Filing and service.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ...: Rulemakings and Adjudications Staff or via the E-Filing system, following the procedure set forth in 10 CFR 2.302. Filing by mail is complete upon deposit in the mail. Filing via the E-Filing system is completed... residence with some occupant of suitable age and discretion; (2) Following the requirements for E-Filing in...

  17. 7 CFR 989.23 - File.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 7 Agriculture 8 2010-01-01 2010-01-01 false File. 989.23 Section 989.23 Agriculture Regulations of... CALIFORNIA Order Regulating Handling Definitions § 989.23 File. File means transmit or deliver to the... time: (a) Of actual receipt by the Secretary or committee in the event of personal delivery; (b) Of...

  18. 7 CFR 989.23 - File.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 7 Agriculture 8 2011-01-01 2011-01-01 false File. 989.23 Section 989.23 Agriculture Regulations of... CALIFORNIA Order Regulating Handling Definitions § 989.23 File. File means transmit or deliver to the... time: (a) Of actual receipt by the Secretary or committee in the event of personal delivery; (b) Of...

  19. The "U.S. Monthly Catalog" and the "Publications Reference File" as Collection Development Tools.

    ERIC Educational Resources Information Center

    Reno, Ramona L.

    1994-01-01

    Reports on an analysis of the availability of publications contained in the 1991 "Monthly Catalog of United States Government Publications" through depository libraries, the Government Printing Office (GPO) Sales Program and its Publications Reference File (PRF), and federal agency distribution centers. Implications for collection…

  20. Distributed Systems Technology Survey.

    DTIC Science & Technology

    1987-03-01

    and prolocols. 2. Hardware Technology Ecnomic factor we a majo reonm for the prolierat of dlstbted systoe. Processors, memory, an magne tc ndoptical...destined messages and pertorn the a pro te forwarding. There gImsno agreement that a ightweight process mechanism is essential to support com- monly used...Xerox PARC environment [311. Shared file servers, discussed below, are essential to the success of such a scheme. 11. ecurlity A distributed

  1. 12 CFR 908.25 - Filing of papers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 12 Banks and Banking 7 2011-01-01 2011-01-01 false Filing of papers. 908.25 Section 908.25 Banks... RULES OF PRACTICE AND PROCEDURE IN HEARINGS ON THE RECORD General Rules § 908.25 Filing of papers. (a) Filing. Any papers required to be filed shall be addressed to the presiding officer and filed with the...

  2. 12 CFR 908.25 - Filing of papers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 7 2010-01-01 2010-01-01 false Filing of papers. 908.25 Section 908.25 Banks... RULES OF PRACTICE AND PROCEDURE IN HEARINGS ON THE RECORD General Rules § 908.25 Filing of papers. (a) Filing. Any papers required to be filed shall be addressed to the presiding officer and filed with the...

  3. TH-AB-201-12: Using Machine Log-Files for Treatment Planning and Delivery QA

    SciTech Connect

    Stanhope, C; Liang, J; Drake, D

    2016-06-15

    Purpose: To determine the segment reduction and dose resolution necessary for machine log-files to effectively replace current phantom-based patient-specific quality assurance, while minimizing computational cost. Methods: Elekta’s Log File Convertor R3.2 records linac delivery parameters (dose rate, gantry angle, leaf position) every 40ms. Five VMAT plans [4 H&N, 1 Pulsed Brain] comprised of 2 arcs each were delivered on the ArcCHECK phantom. Log-files were reconstructed in Pinnacle on the phantom geometry using 1/2/3/4° control point spacing and 2/3/4mm dose grid resolution. Reconstruction effectiveness was quantified by comparing 2%/2mm gamma passing rates of the original and log-file plans. Modulation complexity scoresmore » (MCS) were calculated for each beam to correlate reconstruction accuracy and beam modulation. Percent error in absolute dose for each plan-pair combination (log-file vs. ArcCHECK, original vs. ArcCHECK, log-file vs. original) was calculated for each arc and every diode greater than 10% of the maximum measured dose (per beam). Comparing standard deviations of the three plan-pair distributions, relative noise of the ArcCHECK and log-file systems was elucidated. Results: The original plans exhibit a mean passing rate of 95.1±1.3%. The eight more modulated H&N arcs [MCS=0.088±0.014] and two less modulated brain arcs [MCS=0.291±0.004] yielded log-file pass rates most similar to the original plan when using 1°/2mm [0.05%±1.3% lower] and 2°/3mm [0.35±0.64% higher] log-file reconstructions respectively. Log-file and original plans displayed percent diode dose errors 4.29±6.27% and 3.61±6.57% higher than measurement. Excluding the phantom eliminates diode miscalibration and setup errors; log-file dose errors were 0.72±3.06% higher than the original plans – significantly less noisy. Conclusion: For log-file reconstructed VMAT arcs, 1° control point spacing and 2mm dose resolution is recommended, however, less modulated arcs may

  4. 77 FR 38279 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-06-27

    .... Description: CO2 Gas Quality Settlement Filing of Wyoming Interstate Company, LLC. Filed Date: 6/11/12.... Description: Fuel Filing to be effective 7/1/2012. Filed Date: 6/20/12. Accession Number: 20120620-5118...

  5. 40 CFR 22.5 - Filing, service, and form of all filed documents; business confidentiality claims.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... 40 Protection of Environment 1 2013-07-01 2013-07-01 false Filing, service, and form of all filed... PENALTIES AND THE REVOCATION/TERMINATION OR SUSPENSION OF PERMITS General § 22.5 Filing, service, and form... association which is subject to suit under a common name, complainant shall serve an officer, partner, a...

  6. 77 FR 74839 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-12-18

    ..., LP. Description: National Grid LNG, LP submits tariff filing per 154.203: Adoption of NAESB Version 2... with Order to Amend NAESB Version 2.0 Filing to be effective 12/1/2012. Filed Date: 12/11/12. Accession...: Refile to comply with Order on NAESB Version 2.0 Filing to be effective 12/1/2012. Filed Date: 12/11/12...

  7. 77 FR 103 - JD Products, LLC; Notice of Intent To File License Application, Filing of Pre-Application...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-01-03

    ... Commission strongly encourages electronic filing, documents may also be paper-filed. To paper-file, mail an... needed please contact Mr. David Pryor, Senior Environmental Scientist--California State Parks, at dpryor...

  8. 40 CFR 78.4 - Filings.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... § 78.4 Filings. (a) All original filings made under this part shall be signed by the person making the... behalf of persons with an interest in allowances in a general account shall be signed by the authorized account representative. Any filings on behalf of owners and operators of a NOX Budget unit or source shall...

  9. Merged Federal Files [Academic Year] 1978-79 [machine-readable data file].

    ERIC Educational Resources Information Center

    National Center for Education Statistics (ED), Washington, DC.

    The Merged Federal File for 1978-79 contains school district level data from the following six source files: (1) the Census of Governments' Survey of Local Government Finances--School Systems (F-33) (with 16,343 records merged); (2) the National Center for Education Statistics Survey of School Systems (School District Universe) (with 16,743…

  10. 12 CFR 303.8 - Public access to filing.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... portions of a filing (the public file) until 180 days following final disposition of a filing. Following the 180-day period, non-confidential portions of an application file will be made available in accordance with ' 303.8(c). The public file generally consists of portions of the filing, supporting data...

  11. 12 CFR 303.8 - Public access to filing.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... portions of a filing (the public file) until 180 days following final disposition of a filing. Following the 180-day period, non-confidential portions of an application file will be made available in accordance with ' 303.8(c). The public file generally consists of portions of the filing, supporting data...

  12. 77 FR 51985 - Archon Energy 1, Inc.; Notice of Intent To File License Application, Filing of Pre-Application...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-08-28

    ..., Inc.; Notice of Intent To File License Application, Filing of Pre-Application Document, and Approving... Application and Request to Use the Traditional Licensing Process. b. Project No.: 14432-000. c. Date Filed... Endangered Species Act. m. Archon filed a Pre-Application Document (PAD) with the Commission, pursuant to 18...

  13. 77 FR 61585 - FPL Energy Maine Hydro LLC; Notice of Intent To File License Application, Filing of Pre...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-10

    ... Hydro LLC; Notice of Intent To File License Application, Filing of Pre-Application Document (PAD... Application for a New License and Commencing Pre-filing Process. b. Project No.: 2531-067. c. Dated Filed... Commission a Pre-Application Document (PAD; including a proposed process plan and schedule), pursuant to 18...

  14. 77 FR 61584 - FFP Missouri 12, LLC; Notice of Intent To File License Application, Filing of Pre-Application...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-10

    ..., LLC; Notice of Intent To File License Application, Filing of Pre-Application Document, and Approving... Application and Request to Use the Traditional Licensing Process. b. Project No.: 13755-001. c. Date Filed.... m. Free Flow Power filed a Pre-Application Document (PAD; including a proposed process plan and...

  15. DISTRIBUTED STRUCTURE-SEARCHABLE TOXICITY ...

    EPA Pesticide Factsheets

    The ability to assess the potential genotoxicity, carcinogenicity, or other toxicity of pharmaceutical or industrial chemicals based on chemical structure information is a highly coveted and shared goal of varied academic, commercial, and government regulatory groups. These diverse interests often employ different approaches and have different criteria and use for toxicity assessments, but they share a need for unrestricted access to existing public toxicity data linked with chemical structure information. Currently, there exists no central repository of toxicity information, commercial or public, that adequately meets the data requirements for flexible analogue searching, SAR model development, or building of chemical relational databases (CRD). The Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network is being proposed as a community-supported, web-based effort to address these shared needs of the SAR and toxicology communities. The DSSTox project has the following major elements: 1) to adopt and encourage the use of a common standard file format (SDF) for public toxicity databases that includes chemical structure, text and property information, and that can easily be imported into available CRD applications; 2) to implement a distributed source approach, managed by a DSSTox Central Website, that will enable decentralized, free public access to structure-toxicity data files, and that will effectively link knowledgeable toxicity data s

  16. Virtual file system for PSDS

    NASA Technical Reports Server (NTRS)

    Runnels, Tyson D.

    1993-01-01

    This is a case study. It deals with the use of a 'virtual file system' (VFS) for Boeing's UNIX-based Product Standards Data System (PSDS). One of the objectives of PSDS is to store digital standards documents. The file-storage requirements are that the files must be rapidly accessible, stored for long periods of time - as though they were paper, protected from disaster, and accumulative to about 80 billion characters (80 gigabytes). This volume of data will be approached in the first two years of the project's operation. The approach chosen is to install a hierarchical file migration system using optical disk cartridges. Files are migrated from high-performance media to lower performance optical media based on a least-frequency-used algorithm. The optical media are less expensive per character stored and are removable. Vital statistics about the removable optical disk cartridges are maintained in a database. The assembly of hardware and software acts as a single virtual file system transparent to the PSDS user. The files are copied to 'backup-and-recover' media whose vital statistics are also stored in the database. Seventeen months into operation, PSDS is storing 49 gigabytes. A number of operational and performance problems were overcome. Costs are under control. New and/or alternative uses for the VFS are being considered.

  17. 75 FR 81594 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-12-28

    ... Pipeline Company submits tariff filing per 154.204: RP11-20 TOC Update to be effective 10/1/2010. Filed... filing per 154.204: RP11-1474 TOC Update to be effective 11/1/2010. Filed Date: 12/16/2010. Accession...

  18. [Master files: less paper, more substance. Special rules for special medicines: Plasma Master File and Vaccine Antigen Master File].

    PubMed

    Seitz, Rainer; Haase, M

    2008-07-01

    The process of reviewing the European pharmaceutical legislation resulted in a codex, which contains two new instruments related to marketing authorisation of biological medicines: Plasma Master File (PMF) and Vaccine Antigen Master File (VAMF). In the manufacture of plasma derivatives (e. g. coagulation factors, albumin, immunoglobulins), usually the same starting material, i. e. a plasma pool, is used for several products. In the case of vaccines, the same active substance, i.e. vaccine antigen, may be included in several combination vaccine products. The intention behind the introduction of PMF and VAMF was to avoid unnecessary and redundant documentation, and to improve and harmonise assessment by means of procedures for certification of master files on the community level.

  19. 76 FR 49761 - Combined Notice of Filings #1

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-08-11

    ... Operator, Inc. submits tariff filing per 35.13(a)(2)(iii: Filing of Notice of Succession to Interconnection.... submits tariff filing per 35.13(a)(2)(iii: Filing of Notice of Succession of ITC Midwest to be effective.... submits tariff filing per 35.13(a)(2)(iii: Notice of Succession to be effective 10/4/2011. Filed Date: 08...

  20. 77 FR 23708 - Combined Notice of Filings #2

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-04-20

    ... Company submits tariff filing per 35.13(a)(2)(iii: 7--20120413 OPCo OATT Conc to be effective 1/1/2012... tariff filing per 35: ER12-247 Compliance Filing to be effective 4/20/2011. Filed Date: 4/13/12... Compliance Filing to be effective 4/20/2011. Filed Date: 4/13/12. Accession Number: 20120413-5145. Comments...

  1. Root canal shaping with manual stainless steel files and rotary Ni-Ti files performed by students.

    PubMed

    Sonntag, D; Guntermann, A; Kim, S K; Stachniss, V

    2003-04-01

    To investigate root canal shaping with manual stainless steel files and rotary Ni-Ti files by students. Two hundred and ten simulated root canals with the same geometrical shape and size in acrylic resin blocks were prepared by 21 undergraduate dental students with manual stainless steel files using a stepback technique or with rotary Ni-Ti files in crown-down technique. Preparation length, canal shape, incidence of fracture and preparation time were investigated. Zips and elbows occurred significantly (P < 0.001) less frequently with rotary than with manual preparation. The correct preparation length was achieved significantly (P < 0.05) more often with rotary Ni-Ti files than with manual stainless steel files. Fractures occurred significantly (P < 0.05) less frequently with hand instrumentation. The mean time required for manual preparation was significantly (P < 0.001) longer than that required for rotary preparation. Prior experience with a hand preparation technique was not reflected in an improved quality of the subsequent engine-driven preparation. Inexperienced operators achieved better canal preparations with rotary Ni-Ti instruments than with manual stainless steel files. However, rotary preparation was associated with significantly more fractures.

  2. Professors Join the Fray as Supreme Court Hears Arguments in File-Sharing Case

    ERIC Educational Resources Information Center

    Foster, Andrea L.

    2005-01-01

    U.S. Supreme Court justices struggled in a lively debate with how to balance the competing interests of the entertainment industry and developers of file-sharing technology. Some justices sharply questioned whether it was fair to hold inventors of a distribution technology liable for copyright infringement, while others suggested that it was wrong…

  3. DMFS: A Data Migration File System for NetBSD

    NASA Technical Reports Server (NTRS)

    Studenmund, William

    1999-01-01

    I have recently developed dmfs, a Data Migration File System, for NetBSD. This file system is based on the overlay file system, which is discussed in a separate paper, and provides kernel support for the data migration system being developed by my research group here at NASA/Ames. The file system utilizes an underlying file store to provide the file backing, and coordinates user and system access to the files. It stores its internal meta data in a flat file, which resides on a separate file system. Our data migration system provides archiving and file migration services. System utilities scan the dmfs file system for recently modified files, and archive them to two separate tape stores. Once a file has been doubly archived, files larger than a specified size will be truncated to that size, potentially freeing up large amounts of the underlying file store. Some sites will choose to retain none of the file (deleting its contents entirely from the file system) while others may choose to retain a portion, for instance a preamble describing the remainder of the file. The dmfs layer coordinates access to the file, retaining user-perceived access and modification times, file size, and restricting access to partially migrated files to the portion actually resident. When a user process attempts to read from the non-resident portion of a file, it is blocked and the dmfs layer sends a request to a system daemon to restore the file. As more of the file becomes resident, the user process is permitted to begin accessing the now-resident portions of the file. For simplicity, our data migration system divides a file into two portions, a resident portion followed by an optional non-resident portion. Also, a file is in one of three states: fully resident, fully resident and archived, and (partially) non-resident and archived. For a file which is only partially resident, any attempt to write or truncate the file, or to read a non-resident portion, will trigger a file restoration

  4. 18 CFR 35.7 - Electronic filing requirements.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... 18 Conservation of Power and Water Resources 1 2011-04-01 2011-04-01 false Electronic filing... § 35.7 Electronic filing requirements. (a) General rule. All filings made in proceedings initiated... declarations or statements and electronic signatures. (c) Format requirements for electronic filing. The...

  5. 18 CFR 35.7 - Electronic filing requirements.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 18 Conservation of Power and Water Resources 1 2012-04-01 2012-04-01 false Electronic filing... § 35.7 Electronic filing requirements. (a) General rule. All filings made in proceedings initiated... declarations or statements and electronic signatures. (c) Format requirements for electronic filing. The...

  6. 18 CFR 35.7 - Electronic filing requirements.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 18 Conservation of Power and Water Resources 1 2013-04-01 2013-04-01 false Electronic filing... § 35.7 Electronic filing requirements. (a) General rule. All filings made in proceedings initiated... declarations or statements and electronic signatures. (c) Format requirements for electronic filing. The...

  7. 18 CFR 35.7 - Electronic filing requirements.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 18 Conservation of Power and Water Resources 1 2014-04-01 2014-04-01 false Electronic filing... § 35.7 Electronic filing requirements. (a) General rule. All filings made in proceedings initiated... declarations or statements and electronic signatures. (c) Format requirements for electronic filing. The...

  8. 78 FR 21925 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-04-12

    ... comment date. The filings are accessible in the Commission's eLibrary system by clicking on the links or querying the docket number. eFiling is encouraged. More detailed information relating to filing... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings Take notice...

  9. File concepts for parallel I/O

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1989-01-01

    The subject of input/output (I/O) was often neglected in the design of parallel computer systems, although for many problems I/O rates will limit the speedup attainable. The I/O problem is addressed by considering the role of files in parallel systems. The notion of parallel files is introduced. Parallel files provide for concurrent access by multiple processes, and utilize parallelism in the I/O system to improve performance. Parallel files can also be used conventionally by sequential programs. A set of standard parallel file organizations is proposed, organizations are suggested, using multiple storage devices. Problem areas are also identified and discussed.

  10. Preservation of root canal anatomy using self-adjusting file instrumentation with glide path prepared by 20/0.02 hand files versus 20/0.04 rotary files

    PubMed Central

    Jain, Niharika; Pawar, Ajinkya M.; Ukey, Piyush D.; Jain, Prashant K.; Thakur, Bhagyashree; Gupta, Abhishek

    2017-01-01

    Objectives: To compare the relative axis modification and canal concentricity after glide path preparation with 20/0.02 hand K-file (NITIFLEX®) and 20/0.04 rotary file (HyFlex™ CM) with subsequent instrumentation with 1.5 mm self-adjusting file (SAF). Materials and Methods: One hundred and twenty ISO 15, 0.02 taper, Endo Training Blocks (Dentsply Maillefer, Ballaigues, Switzerland) were acquired and randomly divided into following two groups (n = 60): group 1, establishing glide path till 20/0.02 hand K-file (NITIFLEX®) followed by instrumentation with 1.5 mm SAF; and Group 2, establishing glide path till 20/0.04 rotary file (HyFlex™ CM) followed by instrumentation with 1.5 mm SAF. Pre- and post-instrumentation digital images were processed with MATLAB R 2013 software to identify the central axis, and then superimposed using digital imaging software (Picasa 3.0 software, Google Inc., California, USA) taking five landmarks as reference points. Student's t-test for pairwise comparisons was applied with the level of significance set at 0.05. Results: Training blocks instrumented with 20/0.04 rotary file and SAF were associated less deviation in canal axis (at all the five marked points), representing better canal concentricity compared to those, in which glide path was established by 20/0.02 hand K-files followed by SAF instrumentation. Conclusion: Canal geometry is better maintained after SAF instrumentation with a prior glide path established with 20/0.04 rotary file. PMID:28855752

  11. 76 FR 41331 - Application Filing Requirements

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-07-13

    ... DEPARTMENT OF THE TREASURY Office of Thrift Supervision Application Filing Requirements AGENCY... following information collection. Title of Proposal: Application Filing Requirements. OMB Number: 1550-0056. Form Number: N/A. Description: OTS regulations require that applications, notices, or other filings...

  12. Sector and Sphere: the design and implementation of a high-performance data cloud

    PubMed Central

    Gu, Yunhong; Grossman, Robert L.

    2009-01-01

    Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply, given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. By contrast with the existing storage and compute clouds, Sector can manage data not only within a data centre, but also across geographically distributed data centres. Similarly, the Sphere compute cloud supports user-defined functions (UDFs) over data both within and across data centres. As a special case, MapReduce-style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe some experimental studies comparing Sector/Sphere and Hadoop using the Terasort benchmark. In these studies, Sector is approximately twice as fast as Hadoop. Sector/Sphere is open source. PMID:19451100

  13. Distributed Leadership in Drainage Basin Management: A Critical Analysis of ‘River Chief Policy’ from a Distributed Leadership Perspective

    NASA Astrophysics Data System (ADS)

    Zhang, Liuyi

    2018-02-01

    Water resources management has been more significant than ever since the official file stipulated ‘three red lines’ to scrupulously control water usage and water pollution, accelerating the promotion of ‘River Chief Policy’ throughout China. The policy launches creative approaches to include people from different administrative levels to participate and distributes power to increase drainage basin management efficiency. Its execution resembles features of distributed leadership theory, a vastly acknowledged western leadership theory with innovative perspective and visions to suit the modern world. This paper intends to analyse the policy from a distributed leadership perspective using Taylor’s critical policy analysis framework.

  14. Compression of Index Term Dictionary in an Inverted-File-Oriented Database: Some Effective Algorithms.

    ERIC Educational Resources Information Center

    Wisniewski, Janusz L.

    1986-01-01

    Discussion of a new method of index term dictionary compression in an inverted-file-oriented database highlights a technique of word coding, which generates short fixed-length codes obtained from the index terms themselves by analysis of monogram and bigram statistical distributions. Substantial savings in communication channel utilization are…

  15. 78 FR 57374 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-09-18

    ...: Kinetica Energy Express, LLC. Description: Kinetica Energy Express, LLC submits tariff filing per 154.203: Kinetica Energy Express LLC--FERC Gas Tariff--Volume 1 A Baseline Filing to be effective 9/1/2013. Filed... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings Take notice...

  16. Evaluation of canal transportation after preparation with Reciproc single-file systems with or without glide path files.

    PubMed

    Aydin, Ugur; Karataslioglu, Emrah

    2017-01-01

    Canal transportation is a common sequel caused by rotary instruments. The purpose of the present study is to evaluate the degree of transportation after the use of Reciproc single-file instruments with or without glide path files. Thirty resin blocks with L-shaped canals were divided into three groups ( n = 10). Group 1 - canals were prepared with Reciproc-25 file. Group 2 - glide path file-G1 was used before Reciproc. Group 3 - glide path files-G1 and G2 were used before Reciproc. Pre- and post-instrumentation images were superimposed under microscope, and resin removed from the inner and outer surfaces of the root canal was calculated throughout 10 points. Statistical analysis was performed with Kruskal-Wallis test and post hoc Dunn test. For coronal and middle one-thirds, there was no significant difference among groups ( P > 0.05). For apical section, transportation of Group 1 was significantly higher than other groups ( P < 0.05). Using glide path files before Reciproc single-file system reduced the degree of apical canal transportation.

  17. 22 CFR 123.22 - Filing, retention, and return of export licenses and filing of export information.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 22 Foreign Relations 1 2012-04-01 2012-04-01 false Filing, retention, and return of export licenses and filing of export information. 123.22 Section 123.22 Foreign Relations DEPARTMENT OF STATE INTERNATIONAL TRAFFIC IN ARMS REGULATIONS LICENSES FOR THE EXPORT OF DEFENSE ARTICLES § 123.22 Filing, retention...

  18. 22 CFR 123.22 - Filing, retention, and return of export licenses and filing of export information.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 22 Foreign Relations 1 2010-04-01 2010-04-01 false Filing, retention, and return of export licenses and filing of export information. 123.22 Section 123.22 Foreign Relations DEPARTMENT OF STATE INTERNATIONAL TRAFFIC IN ARMS REGULATIONS LICENSES FOR THE EXPORT OF DEFENSE ARTICLES § 123.22 Filing, retention...

  19. 22 CFR 123.22 - Filing, retention, and return of export licenses and filing of export information.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... 22 Foreign Relations 1 2011-04-01 2011-04-01 false Filing, retention, and return of export licenses and filing of export information. 123.22 Section 123.22 Foreign Relations DEPARTMENT OF STATE INTERNATIONAL TRAFFIC IN ARMS REGULATIONS LICENSES FOR THE EXPORT OF DEFENSE ARTICLES § 123.22 Filing, retention...

  20. 22 CFR 123.22 - Filing, retention, and return of export licenses and filing of export information.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 22 Foreign Relations 1 2013-04-01 2013-04-01 false Filing, retention, and return of export licenses and filing of export information. 123.22 Section 123.22 Foreign Relations DEPARTMENT OF STATE INTERNATIONAL TRAFFIC IN ARMS REGULATIONS LICENSES FOR THE EXPORT OF DEFENSE ARTICLES § 123.22 Filing, retention...

  1. 10 CFR 2.302 - Filing of documents.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... this part shall be electronically transmitted through the E-Filing system, unless the Commission or... all methods of filing have been completed. (e) For filings by electronic transmission, the filer must... digital ID certificates, the NRC permits participants in the proceeding to access the E-Filing system to...

  2. 77 FR 1481 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-01-10

    ...: Filings Instituting Proceedings Docket Numbers: RP12-276-000. Applicants: Gulf Crossing Pipeline Company LLC. Description: Gulf Crossing Pipeline Company LLC submits tariff filing per 154.204: Antero 2 to Tenaska 243 Capacity Release Negotiated Rate Agreement Filing to be effective 1/1/2012. Filed Date: 1/3/12...

  3. 78 FR 55246 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-09-10

    ... that the Commission has received the following Natural Gas Pipeline Rate and Refund Report filings: Filings Instituting Proceedings Docket Numbers: RP13-1290-000. Applicants: MoGas Pipeline LLC. Description: Annual Fuel and Gas Loss Retention Percentage Adjustment Filing to be effective 10/1/2013. Filed Date: 8...

  4. 48 CFR 204.802 - Contract files.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... 48 Federal Acquisition Regulations System 3 2014-10-01 2014-10-01 false Contract files. 204.802 Section 204.802 Federal Acquisition Regulations System DEFENSE ACQUISITION REGULATIONS SYSTEM, DEPARTMENT OF DEFENSE GENERAL ADMINISTRATIVE MATTERS Contract Files 204.802 Contract files. Official contract...

  5. 48 CFR 204.802 - Contract files.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 48 Federal Acquisition Regulations System 3 2011-10-01 2011-10-01 false Contract files. 204.802 Section 204.802 Federal Acquisition Regulations System DEFENSE ACQUISITION REGULATIONS SYSTEM, DEPARTMENT OF DEFENSE GENERAL ADMINISTRATIVE MATTERS Contract Files 204.802 Contract files. Official contract...

  6. 48 CFR 204.802 - Contract files.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 48 Federal Acquisition Regulations System 3 2010-10-01 2010-10-01 false Contract files. 204.802 Section 204.802 Federal Acquisition Regulations System DEFENSE ACQUISITION REGULATIONS SYSTEM, DEPARTMENT OF DEFENSE GENERAL ADMINISTRATIVE MATTERS Contract Files 204.802 Contract files. Official contract...

  7. 48 CFR 204.802 - Contract files.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... 48 Federal Acquisition Regulations System 3 2012-10-01 2012-10-01 false Contract files. 204.802 Section 204.802 Federal Acquisition Regulations System DEFENSE ACQUISITION REGULATIONS SYSTEM, DEPARTMENT OF DEFENSE GENERAL ADMINISTRATIVE MATTERS Contract Files 204.802 Contract files. Official contract...

  8. 48 CFR 204.802 - Contract files.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... 48 Federal Acquisition Regulations System 3 2013-10-01 2013-10-01 false Contract files. 204.802 Section 204.802 Federal Acquisition Regulations System DEFENSE ACQUISITION REGULATIONS SYSTEM, DEPARTMENT OF DEFENSE GENERAL ADMINISTRATIVE MATTERS Contract Files 204.802 Contract files. Official contract...

  9. Snake River Plain Geothermal Play Fairway Analysis - Phase 1 Raster Files

    DOE Data Explorer

    John Shervais

    2015-10-09

    Snake River Plain Play Fairway Analysis - Phase 1 CRS Raster Files. This dataset contains raster files created in ArcGIS. These raster images depict Common Risk Segment (CRS) maps for HEAT, PERMEABILITY, AND SEAL, as well as selected maps of Evidence Layers. These evidence layers consist of either Bayesian krige functions or kernel density functions, and include: (1) HEAT: Heat flow (Bayesian krige map), Heat flow standard error on the krige function (data confidence), volcanic vent distribution as function of age and size, groundwater temperature (equivalue interval and natural breaks bins), and groundwater T standard error. (2) PERMEABILTY: Fault and lineament maps, both as mapped and as kernel density functions, processed for both dilational tendency (TD) and slip tendency (ST), along with data confidence maps for each data type. Data types include mapped surface faults from USGS and Idaho Geological Survey data bases, as well as unpublished mapping; lineations derived from maximum gradients in magnetic, deep gravity, and intermediate depth gravity anomalies. (3) SEAL: Seal maps based on presence and thickness of lacustrine sediments and base of SRP aquifer. Raster size is 2 km. All files generated in ArcGIS.

  10. Jefferson Lab Mass Storage and File Replication Services

    SciTech Connect

    Ian Bird; Ying Chen; Bryan Hess

    Jefferson Lab has implemented a scalable, distributed, high performance mass storage system - JASMine. The system is entirely implemented in Java, provides access to robotic tape storage and includes disk cache and stage manager components. The disk manager subsystem may be used independently to manage stand-alone disk pools. The system includes a scheduler to provide policy-based access to the storage systems. Security is provided by pluggable authentication modules and is implemented at the network socket level. The tape and disk cache systems have well defined interfaces in order to provide integration with grid-based services. The system is in production andmore » being used to archive 1 TB per day from the experiments, and currently moves over 2 TB per day total. This paper will describe the architecture of JASMine; discuss the rationale for building the system, and present a transparent 3rd party file replication service to move data to collaborating institutes using JASMine, XM L, and servlet technology interfacing to grid-based file transfer mechanisms.« less

  11. High School and Beyond: Twins and Siblings' File Users' Manual, User's Manual for Teacher Comment File, Friends File Users' Manual.

    ERIC Educational Resources Information Center

    National Center for Education Statistics (ED), Washington, DC.

    These three users' manuals are for specific files of the High School and Beyond Study, a national longitudinal study of high school sophomores and seniors in 1980. The three files are computerized databases that are available on magnetic tape. As one component of base year data collection, information identifying twins, triplets, and some non-twin…

  12. 32 CFR 727.14 - Files and records.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... 32 National Defense 5 2012-07-01 2012-07-01 false Files and records. 727.14 Section 727.14 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL LEGAL ASSISTANCE § 727.14 Files and records. (a) Case files. The material contained in legal assistance case files is...

  13. 19 CFR 201.8 - Filing of documents.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 19 Customs Duties 3 2013-04-01 2013-04-01 false Filing of documents. 201.8 Section 201.8 Customs Duties UNITED STATES INTERNATIONAL TRADE COMMISSION GENERAL RULES OF GENERAL APPLICATION Initiation and Conduct of Investigations § 201.8 Filing of documents. (a) Applicability; where to file; date of filing...

  14. 19 CFR 201.8 - Filing of documents.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 19 Customs Duties 3 2014-04-01 2014-04-01 false Filing of documents. 201.8 Section 201.8 Customs Duties UNITED STATES INTERNATIONAL TRADE COMMISSION GENERAL RULES OF GENERAL APPLICATION Initiation and Conduct of Investigations § 201.8 Filing of documents. (a) Applicability; where to file; date of filing...

  15. 77 FR 20812 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-04-06

    ...: Trailblazer Pipeline Company LLC. Description: Negotiated Rate Filing--United Energy to be effective 4/1/2012... Company, LP. Description: Sequent 39412 Negotiated Rate Agreement Filing to be effective 4/1/2012. Filed... Energy Company FA0845 to be effective 4/1/ 2012. Filed Date: 3/29/12. Accession Number: 20120329-5032...

  16. 77 FR 35371 - Combined Notice of Filings #1

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-06-13

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings 1 Take notice.... Applicants: Duke Energy Miami Fort, LLC. Description: MBR Filing to be effective 10/1/2012. Filed Date: 6/5...-000. Applicants: Duke Energy Piketon, LLC. Description: MBR Filing to be effective 10/1/2012. Filed...

  17. 47 CFR 2.1205 - Filing of required declaration.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... Interference § 2.1205 Filing of required declaration. (a) For points of entry where electronic filing with... electronic filing with Customs is available, submit the following information to Customs when filing the... Customs for electronic filing. (i) The terms under which the device is being imported, as indicated by...

  18. Reporting Differences Between Spacecraft Sequence Files

    NASA Technical Reports Server (NTRS)

    Khanampompan, Teerapat; Gladden, Roy E.; Fisher, Forest W.

    2010-01-01

    A suite of computer programs, called seq diff suite, reports differences between the products of other computer programs involved in the generation of sequences of commands for spacecraft. These products consist of files of several types: replacement sequence of events (RSOE), DSN keyword file [DKF (wherein DSN signifies Deep Space Network)], spacecraft activities sequence file (SASF), spacecraft sequence file (SSF), and station allocation file (SAF). These products can include line numbers, request identifications, and other pieces of information that are not relevant when generating command sequence products, though these fields can result in the appearance of many changes to the files, particularly when using the UNIX diff command to inspect file differences. The outputs of prior software tools for reporting differences between such products include differences in these non-relevant pieces of information. In contrast, seq diff suite removes the fields containing the irrelevant pieces of information before processing to extract differences, so that only relevant differences are reported. Thus, seq diff suite is especially useful for reporting changes between successive versions of the various products and in particular flagging difference in fields relevant to the sequence command generation and review process.

  19. Personal File Management for the Health Sciences.

    ERIC Educational Resources Information Center

    Apostle, Lynne

    Written as an introduction to the concepts of creating a personal or reprint file, this workbook discusses both manual and computerized systems, with emphasis on the preliminary groundwork that needs to be done before starting any filing system. A file assessment worksheet is provided; considerations in developing a personal filing system are…

  20. 12 CFR 509.10 - Filing of papers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 5 2010-01-01 2010-01-01 false Filing of papers. 509.10 Section 509.10 Banks... IN ADJUDICATORY PROCEEDINGS Uniform Rules of Practice and Procedure § 509.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding documents produced in response to a discovery request...

  1. 12 CFR 509.10 - Filing of papers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 12 Banks and Banking 5 2011-01-01 2011-01-01 false Filing of papers. 509.10 Section 509.10 Banks... IN ADJUDICATORY PROCEEDINGS Uniform Rules of Practice and Procedure § 509.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding documents produced in response to a discovery request...

  2. 10 CFR 1003.51 - What to file.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 10 Energy 4 2014-01-01 2014-01-01 false What to file. 1003.51 Section 1003.51 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) OFFICE OF HEARINGS AND APPEALS PROCEDURAL REGULATIONS Modification or Rescission § 1003.51 What to file. A person filing under this subpart shall file an “Application for...

  3. 77 FR 56833 - Combined Notice of Filings #1

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-09-14

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings 1 Take notice... revisions compliance filing to be effective 9/1/2012. Filed Date: 8/29/12. Accession Number: 20120830-5007... Company. Description: SEGCO 2012 PBOP Filing to be effective 1/1/2012. Filed Date: 8/29/12. Accession...

  4. 47 CFR 1.10006 - Is electronic filing mandatory?

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 47 Telecommunication 1 2011-10-01 2011-10-01 false Is electronic filing mandatory? 1.10006 Section... International Bureau Filing System § 1.10006 Is electronic filing mandatory? Electronic filing is mandatory for... System (IBFS) form is available. Applications for which an electronic form is not available must be filed...

  5. 47 CFR 61.14 - Method of filing publications.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... 47 Telecommunication 3 2014-10-01 2014-10-01 false Method of filing publications. 61.14 Section 61...) TARIFFS Rules for Electronic Filing § 61.14 Method of filing publications. (a) Publications filed... date of a publication received by the Electronic Tariff Filing System will be determined by the date...

  6. 77 FR 60419 - Lock + Hydro Friends Fund XIX, LLC; Notice of Intent To File License Application, Filing of Pre...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-03

    ... Friends Fund XIX, LLC; Notice of Intent To File License Application, Filing of Pre-Application Document.... Date Filed: August 7, 2012. d. Submitted By: Lock + Hydro Friends Fund XIX, LLC. e. Name of Project.... Lock + Hydro Friends Fund XIX, LLC filed its request to use the Traditional Licensing Process on August...

  7. 46 CFR 530.5 - Duty to file.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 46 Shipping 9 2010-10-01 2010-10-01 false Duty to file. 530.5 Section 530.5 Shipping FEDERAL... Provisions § 530.5 Duty to file. (a) The duty under this part to file service contracts, amendments and... conditions as the parties may agree. (c) Registration—(1) Application. Authority to file or delegate the...

  8. 46 CFR 530.5 - Duty to file.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 46 Shipping 9 2011-10-01 2011-10-01 false Duty to file. 530.5 Section 530.5 Shipping FEDERAL... Provisions § 530.5 Duty to file. (a) The duty under this part to file service contracts, amendments and... conditions as the parties may agree. (c) Registration—(1) Application. Authority to file or delegate the...

  9. 77 FR 56197 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-09-12

    ... Transmission Northwest LLC. Description: Revising Creditworthiness Language to be effective 10/ 1/2012. Filed...-12 to be effective 10/1/2012. Filed Date: 8/31/12. Accession Number: 20120831-5051. Comments Due: 5 p.... Description: Fuel Filing on 8-31-12 to be effective 10/1/2012. Filed Date: 8/31/12. Accession Number: 20120831...

  10. 14 CFR 16.13 - Filing of documents.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... previously agreed with the complainant in writing to participate in electronic filing. Documents may be filed... 14 Aeronautics and Space 1 2014-01-01 2014-01-01 false Filing of documents. 16.13 Section 16.13..., Proceedings Initiated by the FAA, and Appeals § 16.13 Filing of documents. Except as otherwise provided in...

  11. 12 CFR 263.10 - Filing of papers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 3 2010-01-01 2010-01-01 false Filing of papers. 263.10 Section 263.10 Banks... OF PRACTICE FOR HEARINGS Uniform Rules of Practice and Procedure § 263.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding documents produced in response to a discovery request...

  12. 12 CFR 263.10 - Filing of papers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 12 Banks and Banking 3 2011-01-01 2011-01-01 false Filing of papers. 263.10 Section 263.10 Banks... OF PRACTICE FOR HEARINGS Uniform Rules of Practice and Procedure § 263.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding documents produced in response to a discovery request...

  13. 12 CFR 308.10 - Filing of papers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 4 2010-01-01 2010-01-01 false Filing of papers. 308.10 Section 308.10 Banks... AND PROCEDURE Uniform Rules of Practice and Procedure § 308.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding documents produced in response to a discovery request pursuant to...

  14. 12 CFR 308.10 - Filing of papers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 12 Banks and Banking 4 2011-01-01 2011-01-01 false Filing of papers. 308.10 Section 308.10 Banks... AND PROCEDURE Uniform Rules of Practice and Procedure § 308.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding documents produced in response to a discovery request pursuant to...

  15. 17 CFR 20.5 - Series S filings.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 17 Commodity and Securities Exchanges 1 2012-04-01 2012-04-01 false Series S filings. 20.5 Section... FOR PHYSICAL COMMODITY SWAPS § 20.5 Series S filings. (a) 102S filing. (1) When a counterparty... 102S filing only once for each counterparty, even if such persons at various times have multiple...

  16. 10 CFR 205.81 - What to file.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 10 Energy 3 2010-01-01 2010-01-01 false What to file. 205.81 Section 205.81 Energy DEPARTMENT OF ENERGY OIL ADMINISTRATIVE PROCEDURES AND SANCTIONS Interpretation § 205.81 What to file. (a) A person filing under this subpart shall file a “Request for Interpretation,” which should be clearly labeled as...

  17. 10 CFR 1003.41 - What to file.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 10 Energy 4 2014-01-01 2014-01-01 false What to file. 1003.41 Section 1003.41 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) OFFICE OF HEARINGS AND APPEALS PROCEDURAL REGULATIONS Stays § 1003.41 What to file. A person filing under this subpart shall file an “Application for Stay” which should be clearly...

  18. 10 CFR 1003.32 - What to file.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 10 Energy 4 2014-01-01 2014-01-01 false What to file. 1003.32 Section 1003.32 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) OFFICE OF HEARINGS AND APPEALS PROCEDURAL REGULATIONS Appeals § 1003.32 What to file. A person filing under this subpart shall file an “Appeal of Order” which should be clearly...

  19. 76 FR 41790 - Natural Currents Energy Services, LLC; Notice of Intent To File License Application, Filing of...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-07-15

    ... Energy Services, LLC; Notice of Intent To File License Application, Filing of Draft Application, Request for Waivers of Integrated Licensing Process Regulations Necessary for Expedited Processing of a.... Project No.: 12718-002. c. Date Filed: June 28, 2011. d. Submitted By: Natural Currents Energy Services...

  20. High Performance Data Transfer for Distributed Data Intensive Sciences

    SciTech Connect

    Fang, Chin; Cottrell, R 'Les' A.; Hanushevsky, Andrew B.

    We report on the development of ZX software providing high performance data transfer and encryption. The design scales in: computation power, network interfaces, and IOPS while carefully balancing the available resources. Two U.S. patent-pending algorithms help tackle data sets containing lots of small files and very large files, and provide insensitivity to network latency. It has a cluster-oriented architecture, using peer-to-peer technologies to ease deployment, operation, usage, and resource discovery. Its unique optimizations enable effective use of flash memory. Using a pair of existing data transfer nodes at SLAC and NERSC, we compared its performance to that of bbcp andmore » GridFTP and determined that they were comparable. With a proof of concept created using two four-node clusters with multiple distributed multi-core CPUs, network interfaces and flash memory, we achieved 155Gbps memory-to-memory over a 2x100Gbps link aggregated channel and 70Gbps file-to-file with encryption over a 5000 mile 100Gbps link.« less

  1. 78 FR 33138 - Self-Regulatory Organizations; The Options Clearing Corporation; Notice of Filing of Proposed...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-06-03

    ... distributions or other corporate actions, that affect the underlying security or other underlying interest... SECURITIES AND EXCHANGE COMMISSION [Release No. 34-69642; File No. SR-OCC-2013-05] Self-Regulatory..., Rather Than an Adjustment Panel of the Securities Committee, Will Determine Adjustments to the Terms of...

  2. A Big-Data-based platform of workers' behavior: Observations from the field.

    PubMed

    Guo, S Y; Ding, L Y; Luo, H B; Jiang, X Y

    2016-08-01

    Behavior-Based Safety (BBS) has been used in construction to observe, analyze and modify workers' behavior. However, studies have identified that BBS has several limitations, which have hindered its effective implementation. To mitigate the negative impact of BBS, this paper uses a case study approach to develop a Big-Data-based platform to classify, collect and store data about workers' unsafe behavior that is derived from a metro construction project. In developing the platform, three processes were undertaken: (1) a behavioral risk knowledge base was established; (2) images reflecting workers' unsafe behavior were collected from intelligent video surveillance and mobile application; and (3) images with semantic information were stored via a Hadoop Distributed File System (HDFS). The platform was implemented during the construction of the metro-system and it is demonstrated that it can effectively analyze semantic information contained in images, automatically extract workers' unsafe behavior and quickly retrieve on HDFS as well. The research presented in this paper can enable construction organizations with the ability to visualize unsafe acts in real-time and further identify patterns of behavior that can jeopardize safety outcomes. Copyright © 2015 Elsevier Ltd. All rights reserved.

  3. 11 CFR 9006.2 - Filing dates.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 11 Federal Elections 1 2010-01-01 2010-01-01 false Filing dates. 9006.2 Section 9006.2 Federal Elections FEDERAL ELECTION COMMISSION PRESIDENTIAL ELECTION CAMPAIGN FUND: GENERAL ELECTION FINANCING REPORTS AND RECORDKEEPING § 9006.2 Filing dates. The reports required to be filed under 11 CFR 9006.1...

  4. 49 CFR 604.30 - Filing complaints.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ..., DEPARTMENT OF TRANSPORTATION CHARTER SERVICE Complaints § 604.30 Filing complaints. (a) Filing address. Unless provided otherwise, the complainant shall file the complaint with the Office of the Chief Counsel... Service Complaint docket number FTA-2007-0025 at http://www.regulations.gov or mail it to the docket by...

  5. 21 CFR 314.420 - Drug master files.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 21 Food and Drugs 5 2014-04-01 2014-04-01 false Drug master files. 314.420 Section 314.420 Food... master files. (a) A drug master file is a submission of information to the Food and Drug Administration by a person (the drug master file holder) who intends it to be used for one of the following purposes...

  6. 78 FR 20908 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-04-08

    .... Description: UGI Storage Compliance Filing TL-96 to be effective 4/ 1/2013. Filed Date: 3/28/13. Accession... Eastern Pipe Line Company, LP. Description: Flow Through of Cash-Out Revenues filed on 3-28-13. Filed Date: 3/28/13. Accession Number: 20130328-5022. Comments Due: 5 p.m. ET 4/9/13. Docket Numbers: RP13-726...

  7. 12 CFR 19.10 - Filing of papers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 12 Banks and Banking 1 2011-01-01 2011-01-01 false Filing of papers. 19.10 Section 19.10 Banks and... Rules of Practice and Procedure § 19.10 Filing of papers. (a) Filing. Any papers required to be filed...) Delivering the papers to a reliable commercial courier service, overnight delivery service, or to the U.S...

  8. 12 CFR 19.10 - Filing of papers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 1 2010-01-01 2010-01-01 false Filing of papers. 19.10 Section 19.10 Banks and... Rules of Practice and Procedure § 19.10 Filing of papers. (a) Filing. Any papers required to be filed...) Delivering the papers to a reliable commercial courier service, overnight delivery service, or to the U.S...

  9. 12 CFR 747.10 - Filing of papers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 12 Banks and Banking 6 2011-01-01 2011-01-01 false Filing of papers. 747.10 Section 747.10 Banks... Practice and Procedure § 747.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding...) Delivering the papers to a reliable commercial courier service, overnight delivery service, or to the U.S...

  10. 12 CFR 747.10 - Filing of papers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 6 2010-01-01 2010-01-01 false Filing of papers. 747.10 Section 747.10 Banks... Practice and Procedure § 747.10 Filing of papers. (a) Filing. Any papers required to be filed, excluding...) Delivering the papers to a reliable commercial courier service, overnight delivery service, or to the U.S...

  11. 29 CFR 24.103 - Filing of retaliation complaint.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... be reduced to writing by OSHA. If a complainant is not able to file the complaint in English, the complaint may be filed in any language. (c) Place of Filing. The complaint should be filed with the OSHA... resides or was employed, but may be filed with any OSHA officer or employee. Addresses and telephone...

  12. 75 FR 70733 - Combined Notice of Filings #1

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-11-18

    ... Corporation submits tariff filing per 35: 2010-10-29 CAISO MSG Transition Cost Compliance to be effective 11.... Description: Alcan Power Marketing, Inc. submits tariff filing per 35.12: Baseline Filing to be effective 11/1... Compliance Filing to be effective 9/30/2010. Filed Date: 10/29/2010. Accession Number: 20101029-5211. Comment...

  13. 76 FR 87 - Grant of Authority for Subzone Status; Skechers USA, LLC (Distribution of Footwear); Moreno...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-01-03

    ... Status; Skechers USA, LLC (Distribution of Footwear); Moreno Valley, California Pursuant to its authority... distribution facility of Skechers USA, LLC, located in Moreno Valley, California, (FTZ Docket 5- 2008, filed 2... activity related to footwear warehousing and distribution at the facility of Skechers USA, LLC, located in...

  14. Indiva: a middleware for managing distributed media environment

    NASA Astrophysics Data System (ADS)

    Ooi, Wei-Tsang; Pletcher, Peter; Rowe, Lawrence A.

    2003-12-01

    This paper presents a unified set of abstractions and operations for hardware devices, software processes, and media data in a distributed audio and video environment. These abstractions, which are provided through a middleware layer called Indiva, use a file system metaphor to access resources and high-level commands to simplify the development of Internet webcast and distributed collaboration control applications. The design and implementation of Indiva are described and examples are presented to illustrate the usefulness of the abstractions.

  15. The distributed production system of the SuperB project: description and results

    NASA Astrophysics Data System (ADS)

    Brown, D.; Corvo, M.; Di Simone, A.; Fella, A.; Luppi, E.; Paoloni, E.; Stroili, R.; Tomassetti, L.

    2011-12-01

    The SuperB experiment needs large samples of MonteCarlo simulated events in order to finalize the detector design and to estimate the data analysis performances. The requirements are beyond the capabilities of a single computing farm, so a distributed production model capable of exploiting the existing HEP worldwide distributed computing infrastructure is needed. In this paper we describe the set of tools that have been developed to manage the production of the required simulated events. The production of events follows three main phases: distribution of input data files to the remote site Storage Elements (SE); job submission, via SuperB GANGA interface, to all available remote sites; output files transfer to CNAF repository. The job workflow includes procedures for consistency checking, monitoring, data handling and bookkeeping. A replication mechanism allows storing the job output on the local site SE. Results from 2010 official productions are reported.

  16. Developing CORBA-Based Distributed Scientific Applications From Legacy Fortran Programs

    NASA Technical Reports Server (NTRS)

    Sang, Janche; Kim, Chan; Lopez, Isaac

    2000-01-01

    An efficient methodology is presented for integrating legacy applications written in Fortran into a distributed object framework. Issues and strategies regarding the conversion and decomposition of Fortran codes into Common Object Request Broker Architecture (CORBA) objects are discussed. Fortran codes are modified as little as possible as they are decomposed into modules and wrapped as objects. A new conversion tool takes the Fortran application as input and generates the C/C++ header file and Interface Definition Language (IDL) file. In addition, the performance of the client server computing is evaluated.

  17. 77 FR 34943 - Combined Notice of Filings #2

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-06-12

    ... DEPARTMENT OF ENERGY Federal Energy Regulatory Commission Combined Notice of Filings 2 Take notice... Texas North Company submits tariff filing per 35.13(a)(2)(iii: TNC-Texas New Mexico Power Amd. 3 to IA... filing per 35.13(a)(2)(iii: Reactive Filing to be effective 12/31/9998. Filed Date: 6/5/12. Accession...

  18. 17 CFR 12.25 - Filing fees.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 17 Commodity and Securities Exchanges 1 2010-04-01 2010-04-01 false Filing fees. 12.25 Section 12... REPARATIONS General Information and Preliminary Consideration of Pleadings § 12.25 Filing fees. (a) Fees payable upon filing a complaint. (1) A complainant who, in the complaint, has elected the voluntary...

  19. 10 CFR 430.42 - Filing requirements.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... the right to refuse to accept, and not to consider, untimely submissions. (e) Filing of petitions. (1... 10 Energy 3 2010-01-01 2010-01-01 false Filing requirements. 430.42 Section 430.42 Energy... Filing requirements. (a) Service. All documents required to be served under this subpart shall, if mailed...

  20. 12 CFR 303.161 - Filing procedures.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 12 Banks and Banking 4 2010-01-01 2010-01-01 false Filing procedures. 303.161 Section 303.161 Banks and Banking FEDERAL DEPOSIT INSURANCE CORPORATION PROCEDURE AND RULES OF PRACTICE FILING... include all materials that have been filed with any state or federal banking regulator and any state or...

  1. 22 CFR 911.2 - Filing complaint.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 22 Foreign Relations 2 2012-04-01 2009-04-01 true Filing complaint. 911.2 Section 911.2 Foreign Relations FOREIGN SERVICE GRIEVANCE BOARD IMPLEMENTATION DISPUTES § 911.2 Filing complaint. If the dispute is not satisfactorily resolved at the agency level, the moving party may file a complaint within 45...

  2. 22 CFR 911.2 - Filing complaint.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... 22 Foreign Relations 2 2011-04-01 2009-04-01 true Filing complaint. 911.2 Section 911.2 Foreign Relations FOREIGN SERVICE GRIEVANCE BOARD IMPLEMENTATION DISPUTES § 911.2 Filing complaint. If the dispute is not satisfactorily resolved at the agency level, the moving party may file a complaint within 45...

  3. 22 CFR 911.2 - Filing complaint.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 22 Foreign Relations 2 2014-04-01 2014-04-01 false Filing complaint. 911.2 Section 911.2 Foreign Relations FOREIGN SERVICE GRIEVANCE BOARD IMPLEMENTATION DISPUTES § 911.2 Filing complaint. If the dispute is not satisfactorily resolved at the agency level, the moving party may file a complaint within 45...

  4. 19 CFR 10.911 - Filing procedures.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 19 Customs Duties 1 2012-04-01 2012-04-01 false Filing procedures. 10.911 Section 10.911 Customs Duties U.S. CUSTOMS AND BORDER PROTECTION, DEPARTMENT OF HOMELAND SECURITY; DEPARTMENT OF THE TREASURY... Post-Importation Duty Refund Claims § 10.911 Filing procedures. (a) Place of filing. A post-importation...

  5. 22 CFR 911.2 - Filing complaint.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 22 Foreign Relations 2 2010-04-01 2010-04-01 true Filing complaint. 911.2 Section 911.2 Foreign Relations FOREIGN SERVICE GRIEVANCE BOARD IMPLEMENTATION DISPUTES § 911.2 Filing complaint. If the dispute is not satisfactorily resolved at the agency level, the moving party may file a complaint within 45...

  6. 19 CFR 10.911 - Filing procedures.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 19 Customs Duties 1 2014-04-01 2014-04-01 false Filing procedures. 10.911 Section 10.911 Customs Duties U.S. CUSTOMS AND BORDER PROTECTION, DEPARTMENT OF HOMELAND SECURITY; DEPARTMENT OF THE TREASURY... Post-Importation Duty Refund Claims § 10.911 Filing procedures. (a) Place of filing. A post-importation...

  7. 19 CFR 10.911 - Filing procedures.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 19 Customs Duties 1 2013-04-01 2013-04-01 false Filing procedures. 10.911 Section 10.911 Customs Duties U.S. CUSTOMS AND BORDER PROTECTION, DEPARTMENT OF HOMELAND SECURITY; DEPARTMENT OF THE TREASURY... Post-Importation Duty Refund Claims § 10.911 Filing procedures. (a) Place of filing. A post-importation...

  8. 22 CFR 911.2 - Filing complaint.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 22 Foreign Relations 2 2013-04-01 2009-04-01 true Filing complaint. 911.2 Section 911.2 Foreign Relations FOREIGN SERVICE GRIEVANCE BOARD IMPLEMENTATION DISPUTES § 911.2 Filing complaint. If the dispute is not satisfactorily resolved at the agency level, the moving party may file a complaint within 45...

  9. 47 CFR 1.1405 - File numbers.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... 47 Telecommunication 1 2012-10-01 2012-10-01 false File numbers. 1.1405 Section 1.1405... Attachment Complaint Procedures § 1.1405 File numbers. Each complaint which appears to be essentially complete under § 1.1404 will be accepted and assigned a file number. Such assignment is for administrative...

  10. 47 CFR 1.1405 - File numbers.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... 47 Telecommunication 1 2014-10-01 2014-10-01 false File numbers. 1.1405 Section 1.1405... Attachment Complaint Procedures § 1.1405 File numbers. Each complaint which appears to be essentially complete under § 1.1404 will be accepted and assigned a file number. Such assignment is for administrative...

  11. 47 CFR 1.1405 - File numbers.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... 47 Telecommunication 1 2013-10-01 2013-10-01 false File numbers. 1.1405 Section 1.1405... Attachment Complaint Procedures § 1.1405 File numbers. Each complaint which appears to be essentially complete under § 1.1404 will be accepted and assigned a file number. Such assignment is for administrative...

  12. 77 FR 56833 - Combined Notice of Filings

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-09-14

    ...: Filings Instituting Proceedings Docket Numbers: RP12-957-000. Applicants: Chandeleur Pipe Line Company. Description: Chandeleur ACA filing withdrawal. Filed Date: 8/29/12. Accession Number: 20120829-5131. Comments...

  13. 75 FR 61474 - Juneau Hydropower, Inc.; Notice of Intent To File License Application, Filing of Pre-Application...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-10-05

    ... Hydropower, Inc.; Notice of Intent To File License Application, Filing of Pre-Application Document, and....: 13563-001. c. Dated Filed: July 28, 2010. d. Submitted By: Juneau Hydropower, Inc. e. Name of Project... Commission's regulations. h. Potential Applicant Contact: Duff W. Mitchell, Juneau Hydropower, Inc., P.O. Box...

  14. Smartfiles: An OO approach to data file interoperability

    NASA Technical Reports Server (NTRS)

    Haines, Matthew; Mehrotra, Piyush; Vanrosendale, John

    1995-01-01

    Data files for scientific and engineering codes typically consist of a series of raw data values whose descriptions are buried in the programs that interact with these files. In this situation, making even minor changes in the file structure or sharing files between programs (interoperability) can only be done after careful examination of the data file and the I/O statement of the programs interacting with this file. In short, scientific data files lack self-description, and other self-describing data techniques are not always appropriate or useful for scientific data files. By applying an object-oriented methodology to data files, we can add the intelligence required to improve data interoperability and provide an elegant mechanism for supporting complex, evolving, or multidisciplinary applications, while still supporting legacy codes. As a result, scientists and engineers should be able to share datasets with far greater ease, simplifying multidisciplinary applications and greatly facilitating remote collaboration between scientists.

  15. The sterilization of endodontic hand files.

    PubMed

    Hurtt, C A; Rossman, L E

    1996-06-01

    Several different methods of file sterilization were analyzed to determine the best method of providing complete file sterility, including the metal shaft and plastic handle. Six test groups of 15 files were studied using Bacillus stearothermophilus as the test organism. Groups were "sterilized" by glutaraldehyde immersion, steam autoclaving, and various techniques of salt sterilization. Only proper steam autoclaving reliably produced completely sterile instruments. Salt sterilization and glutaraldehyde solutions may not be adequate sterilization methods for endodontic hand files and should not be relied on to provide completely sterile instruments.

  16. 32 CFR 842.103 - Filing a claim.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... 32 National Defense 6 2011-07-01 2011-07-01 false Filing a claim. 842.103 Section 842.103 National... CLAIMS Claims Under the National Guard Claims Act (32 U.S.C. 715) § 842.103 Filing a claim. This paragraph explains how to file a claim under the National Guard Claims Act. (a) How and when filed. A claim...

  17. 32 CFR 842.103 - Filing a claim.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... 32 National Defense 6 2013-07-01 2013-07-01 false Filing a claim. 842.103 Section 842.103 National... CLAIMS Claims Under the National Guard Claims Act (32 U.S.C. 715) § 842.103 Filing a claim. This paragraph explains how to file a claim under the National Guard Claims Act. (a) How and when filed. A claim...

  18. 32 CFR 842.103 - Filing a claim.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... 32 National Defense 6 2012-07-01 2012-07-01 false Filing a claim. 842.103 Section 842.103 National... CLAIMS Claims Under the National Guard Claims Act (32 U.S.C. 715) § 842.103 Filing a claim. This paragraph explains how to file a claim under the National Guard Claims Act. (a) How and when filed. A claim...

  19. 32 CFR 842.103 - Filing a claim.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... 32 National Defense 6 2014-07-01 2014-07-01 false Filing a claim. 842.103 Section 842.103 National... CLAIMS Claims Under the National Guard Claims Act (32 U.S.C. 715) § 842.103 Filing a claim. This paragraph explains how to file a claim under the National Guard Claims Act. (a) How and when filed. A claim...

  20. 47 CFR 1.10008 - What are IBFS file numbers?

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... Random Selection International Bureau Filing System § 1.10008 What are IBFS file numbers? (a) We assign...) For a description of file number information, see The International Bureau Filing System File Number... 47 Telecommunication 1 2013-10-01 2013-10-01 false What are IBFS file numbers? 1.10008 Section 1...