Knowledge Query Language (KQL)
2016-02-12
Lexington Massachusetts ... EXECUTIVE SUMMARY: Currently, queries for data ... retrieval from non-Structured Query Language (NoSQL) data stores are tightly coupled to the specific implementation of the data store implementation ... independent of the storage content and format for querying NoSQL or relational data stores. This approach uses address expressions (or A-Expressions
Knowledge Query Language (KQL)
2016-02-01
EXECUTIVE SUMMARY: Currently, queries for data ... retrieval from non-Structured Query Language (NoSQL) data stores are tightly coupled to the specific implementation of the data store implementation, making ... of the storage content and format for querying NoSQL or relational data stores. This approach uses address expressions (or A-Expressions) embedded in
Recommender System for Learning SQL Using Hints
ERIC Educational Resources Information Center
Lavbic, Dejan; Matek, Tadej; Zrnec, Aljaž
2017-01-01
Today's software industry requires individuals who are proficient in as many programming languages as possible. Structured query language (SQL), as an adopted standard, is no exception, as it is the most widely used query language to retrieve and manipulate data. However, the process of learning SQL turns out to be challenging. The need for a…
An Experimental Investigation of Complexity in Database Query Formulation Tasks
ERIC Educational Resources Information Center
Casterella, Gretchen Irwin; Vijayasarathy, Leo
2013-01-01
Information Technology professionals and other knowledge workers rely on their ability to extract data from organizational databases to respond to business questions and support decision making. Structured query language (SQL) is the standard programming language for querying data in relational databases, and SQL skills are in high demand and are…
ERIC Educational Resources Information Center
Piyayodilokchai, Hongsiri; Panjaburee, Patcharin; Laosinchai, Parames; Ketpichainarong, Watcharee; Ruenwongsa, Pintip
2013-01-01
With the benefit of multimedia and the learning cycle approach in promoting effective active learning, this paper proposed a learning cycle approach-based, multimedia-supplemented instructional unit for Structured Query Language (SQL) for second-year undergraduate students with the aim of enhancing their basic knowledge of SQL and ability to apply…
Sánchez-de-Madariaga, Ricardo; Muñoz, Adolfo; Castro, Antonio L; Moreno, Oscar; Pascual, Mario
2018-01-01
This research shows a protocol to assess the computational complexity of querying relational and non-relational (NoSQL (not only Structured Query Language)) standardized electronic health record (EHR) medical information database systems (DBMS). It uses a set of three doubling-sized databases, i.e. databases storing 5000, 10,000 and 20,000 realistic standardized EHR extracts, in three different database management systems (DBMS): relational MySQL object-relational mapping (ORM), document-based NoSQL MongoDB, and native extensible markup language (XML) NoSQL eXist. The average response times to six complexity-increasing queries were computed, and the results showed a linear behavior in the NoSQL cases. In the NoSQL field, MongoDB presents a much flatter linear slope than eXist. NoSQL systems may also be more appropriate to maintain standardized medical information systems due to the special nature of the updating policies of medical information, which should not affect the consistency and efficiency of the data stored in NoSQL databases. One limitation of this protocol is the lack of direct results of improved relational systems such as archetype relational mapping (ARM) with the same data. However, the interpolation of doubling-size database results to those presented in the literature and other published results suggests that NoSQL systems might be more appropriate in many specific scenarios and problems to be solved. For example, NoSQL may be appropriate for document-based tasks such as EHR extracts used in clinical practice, or edition and visualization, or situations where the aim is not only to query medical information, but also to restore the EHR in exactly its original form. PMID:29608174
Relational Algebra and SQL: Better Together
ERIC Educational Resources Information Center
McMaster, Kirby; Sambasivam, Samuel; Hadfield, Steven; Wolthuis, Stuart
2013-01-01
In this paper, we describe how database instructors can teach Relational Algebra and Structured Query Language together through programming. Students write query programs consisting of sequences of Relational Algebra operations vs. Structured Query Language SELECT statements. The query programs can then be run interactively, allowing students to…
2006-06-01
SPARQL: SPARQL Protocol and RDF Query Language; SQL: Structured Query Language; SUMO: Suggested Upper Merged Ontology; SW... Query optimization algorithms are implemented in the Pellet reasoner in order to ensure querying a knowledge base is efficient. These algorithms...memory as a treelike structure in order for the data to be queried. XML Query (XQuery) is the standard language used when querying XML
Improved Information Retrieval Performance on SQL Database Using Data Adapter
NASA Astrophysics Data System (ADS)
Husni, M.; Djanali, S.; Ciptaningtyas, H. T.; Wicaksana, I. G. N. A.
2018-02-01
NoSQL (Not Only SQL) databases are increasingly being used as the number of big data applications increases. Most systems still use relational databases (RDBs), but as the volume of data grows each year, systems turn to NoSQL databases to analyze and access big data more quickly. NoSQL emerged as a result of the exponential growth of the internet and the development of web applications. The query syntax of NoSQL databases differs from that of SQL databases and therefore requires code changes in the application. A data adapter allows applications to keep their SQL query syntax unchanged. Data adapters provide methods that synchronize SQL databases with NoSQL databases. In addition, the data adapter provides an interface through which applications can run SQL queries. Hence, this research applied a data adapter system to synchronize data between a MySQL database and Apache HBase using a direct access query approach, in which the system allows the application to accept queries while the synchronization process is in progress. Tests showed that the data adapter can synchronize between an SQL database (MySQL) and a NoSQL database (Apache HBase). The system uses between 40% and 60% of memory resources, and processor utilization ranges from 10% to 90%. In addition, the NoSQL database performed better than the SQL database in this system.
Flexible network reconstruction from relational databases with Cytoscape and CytoSQL
2010-01-01
Background: Molecular interaction networks can be efficiently studied using network visualization software such as Cytoscape. The relevant nodes, edges and their attributes can be imported in Cytoscape in various file formats, or directly from external databases through specialized third party plugins. However, molecular data are often stored in relational databases with their own specific structure, for which dedicated plugins do not exist. Therefore, a more generic solution is presented. Results: A new Cytoscape plugin 'CytoSQL' is developed to connect Cytoscape to any relational database. It allows to launch SQL ('Structured Query Language') queries from within Cytoscape, with the option to inject node or edge features of an existing network as SQL arguments, and to convert the retrieved data to Cytoscape network components. Supported by a set of case studies we demonstrate the flexibility and the power of the CytoSQL plugin in converting specific data subsets into meaningful network representations. Conclusions: CytoSQL offers a unified approach to let Cytoscape interact with relational databases. Thanks to the power of the SQL syntax, this tool can rapidly generate and enrich networks according to very complex criteria. The plugin is available at http://www.ptools.ua.ac.be/CytoSQL. PMID:20594316
Flexible network reconstruction from relational databases with Cytoscape and CytoSQL.
Laukens, Kris; Hollunder, Jens; Dang, Thanh Hai; De Jaeger, Geert; Kuiper, Martin; Witters, Erwin; Verschoren, Alain; Van Leemput, Koenraad
2010-07-01
Molecular interaction networks can be efficiently studied using network visualization software such as Cytoscape. The relevant nodes, edges and their attributes can be imported in Cytoscape in various file formats, or directly from external databases through specialized third party plugins. However, molecular data are often stored in relational databases with their own specific structure, for which dedicated plugins do not exist. Therefore, a more generic solution is presented. A new Cytoscape plugin 'CytoSQL' is developed to connect Cytoscape to any relational database. It allows to launch SQL ('Structured Query Language') queries from within Cytoscape, with the option to inject node or edge features of an existing network as SQL arguments, and to convert the retrieved data to Cytoscape network components. Supported by a set of case studies we demonstrate the flexibility and the power of the CytoSQL plugin in converting specific data subsets into meaningful network representations. CytoSQL offers a unified approach to let Cytoscape interact with relational databases. Thanks to the power of the SQL syntax, this tool can rapidly generate and enrich networks according to very complex criteria. The plugin is available at http://www.ptools.ua.ac.be/CytoSQL.
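The idea of injecting node features of an existing network as SQL arguments, then converting the returned rows into network components, can be illustrated with a minimal Python sketch. This is not the CytoSQL plugin's code: the `interactions` table, its columns, and the function names are hypothetical, and SQLite stands in for whatever relational database is actually used.

```python
import sqlite3


def expand_network(conn, node_names):
    """Return edges whose source is one of the given nodes.

    The node names are bound as SQL parameters; only the number of
    placeholders is formatted into the SQL text.
    """
    if not node_names:
        return []
    placeholders = ",".join("?" for _ in node_names)
    sql = (
        "SELECT source, target, score FROM interactions "
        f"WHERE source IN ({placeholders}) ORDER BY source, target"
    )
    rows = conn.execute(sql, list(node_names)).fetchall()
    # Convert each row into an edge record (a stand-in for a network component).
    return [{"source": s, "target": t, "score": sc} for s, t, sc in rows]


def demo():
    # Hypothetical table and data, created in memory for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE interactions (source TEXT, target TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO interactions VALUES (?, ?, ?)",
        [("P1", "P2", 0.9), ("P2", "P3", 0.4), ("P4", "P5", 0.7)],
    )
    return expand_network(conn, ["P1", "P2"])
```

Calling `demo()` returns the two edges that start at `P1` or `P2`; the edge from `P4` is left out because `P4` is not among the injected node names.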
Blind Seer: A Scalable Private DBMS
2014-05-01
searchable index terms per DB row, in time comparable to (insecure) MySQL (many practical queries can be privately executed with work 1.2-3 times slower...than MySQL, although some queries are costlier). We support a rich query set, including searching on arbitrary boolean formulas on keywords and ranges...index terms per DB row, in time comparable to (insecure) MySQL (many practical queries can be privately executed with work 1.2-3 times slower than MySQL
Supporting temporal queries on clinical relational databases: the S-WATCH-QL language.
Combi, C.; Missora, L.; Pinciroli, F.
1996-01-01
Due to the ubiquitous and special nature of time, especially in clinical databases, there is a need for particular temporal data and operators. In this paper we describe S-WATCH-QL (Structured Watch Query Language), a temporal extension of SQL, the widespread query language based on the relational model. S-WATCH-QL extends the well-known SQL by the addition of: a) temporal data types that allow the storage of information with different levels of granularity; b) historical relations that can store together both instantaneous valid times and intervals; c) some temporal clauses, functions and predicates that allow complex temporal queries to be defined. PMID:8947722
PiCO QL: A software library for runtime interactive queries on program data
NASA Astrophysics Data System (ADS)
Fragkoulis, Marios; Spinellis, Diomidis; Louridas, Panos
PiCO QL is an open source C/C++ software whose scientific scope is real-time interactive analysis of in-memory data through SQL queries. It exposes a relational view of a system's or application's data structures, which is queryable through SQL. While the application or system is executing, users can input queries through a web-based interface or issue web service requests. Queries execute on the live data structures through the respective relational views. PiCO QL makes a good candidate for ad-hoc data analysis in applications and for diagnostics in systems settings. Applications of PiCO QL include the Linux kernel, the Valgrind instrumentation framework, a GIS application, a virtual real-time observatory of stellar objects, and a source code analyser.
NVST Data Archiving System Based On FastBit NoSQL Database
NASA Astrophysics Data System (ADS)
Liu, Ying-bo; Wang, Feng; Ji, Kai-fan; Deng, Hui; Dai, Wei; Liang, Bo
2014-06-01
The New Vacuum Solar Telescope (NVST) is a 1-meter vacuum solar telescope that aims to observe the fine structures of active regions on the Sun. The main tasks of the NVST are high resolution imaging and spectral observations, including the measurements of the solar magnetic field. The NVST has collected more than 20 million FITS files since it began routine observations in 2012 and produces up to 120 thousand files in a day. Given the large number of files, their effective archiving and retrieval becomes a critical and urgent problem. In this study, we implement a new data archiving system for the NVST based on the Fastbit Not Only Structured Query Language (NoSQL) database. Compared with a relational database (i.e., MySQL; My Structured Query Language), the Fastbit database shows distinctive advantages in indexing and querying performance. In a large scale database of 40 million records, the multi-field combined query response time of the Fastbit database is about 15 times faster and fully meets the requirements of the NVST. Our study brings a new idea for massive astronomical data archiving and would contribute to the design of data management systems for other astronomical telescopes.
Flexible Decision Support in Device-Saturated Environments
2003-10-01
also output tuples to a remote MySQL or Postgres database. 3.3 GUI The GUI allows the user to pose queries using SQL and to display query...DatabaseConnection.java: handles connections to an external database (such as MySQL or Postgres). • Debug.java: contains the code for printing out Debug messages...also provided. It is possible to output the results of queries to a MySQL or Postgres database for archival and the GUI can query those results
Analysis and Development of a Web-Enabled Planning and Scheduling Database Application
2013-09-01
establishes an entity-relationship diagram for the desired process, constructs an operable database using MySQL, and provides a web-enabled interface for...development, develop, design, process, re-engineering, reengineering, MySQL, structured query language, SQL, myPHPadmin. 15. NUMBER OF PAGES 107 16...relationship diagram for the desired process, constructs an operable database using MySQL, and provides a web-enabled interface for the population of
Lee, Ken Ka-Yin; Tang, Wai-Choi; Choi, Kup-Sze
2013-04-01
Clinical data are dynamic in nature, often arranged hierarchically and stored as free text and numbers. Effective management of clinical data and the transformation of the data into structured format for data analysis are therefore challenging issues in electronic health records development. Despite the popularity of relational databases, the scalability of the NoSQL database model and the document-centric data structure of XML databases appear to be promising features for effective clinical data management. In this paper, three database approaches--NoSQL, XML-enabled and native XML--are investigated to evaluate their suitability for structured clinical data. The database query performance is reported, together with our experience in the databases development. The results show that NoSQL database is the best choice for query speed, whereas XML databases are advantageous in terms of scalability, flexibility and extensibility, which are essential to cope with the characteristics of clinical data. While NoSQL and XML technologies are relatively new compared to the conventional relational database, both of them demonstrate potential to become a key database technology for clinical data management as the technology further advances. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
SQL/NF Translator for the Triton Nested Relational Database System
1990-12-01
...Ohio... AFIT/GCE/ENG/90D-05 SQL/NF TRANSLATOR FOR THE TRITON NESTED RELATIONAL DATABASE SYSTEM THESIS Craig William Schnepf Captain...FOR THE TRITON NESTED RELATIONAL DATABASE SYSTEM THESIS Presented to the Faculty of the School of Engineering of the Air Force Institute of Technology... systems. The SQL/NF query language used for the nested relational model is an extension of the popular relational model query language SQL. The query
Shuttle-Data-Tape XML Translator
NASA Technical Reports Server (NTRS)
Barry, Matthew R.; Osborne, Richard N.
2005-01-01
JSDTImport is a computer program for translating native Shuttle Data Tape (SDT) files from American Standard Code for Information Interchange (ASCII) format into databases in other formats. JSDTImport solves the problem of organizing the SDT content, affording flexibility to enable users to choose how to store the information in a database to better support client and server applications. JSDTImport can be dynamically configured by use of a simple Extensible Markup Language (XML) file. JSDTImport uses this XML file to define how each record and field will be parsed, its layout and definition, and how the resulting database will be structured. JSDTImport also includes a client application programming interface (API) layer that provides abstraction for the data-querying process. The API enables a user to specify the search criteria to apply in gathering all the data relevant to a query. The API can be used to organize the SDT content and translate into a native XML database. The XML format is structured into efficient sections, enabling excellent query performance by use of the XPath query language. Optionally, the content can be translated into a Structured Query Language (SQL) database for fast, reliable SQL queries on standard database server computers.
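As a rough illustration of querying XML content with XPath, the Python sketch below uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree`. The element and attribute names (`tape`, `record`, `field`, `kind`, `msid`) are invented for the example and are not taken from the SDT format or from JSDTImport.

```python
import xml.etree.ElementTree as ET

# Hypothetical document layout, used only to demonstrate the query.
SAMPLE = """
<tape>
  <record id="r1" kind="measurement">
    <field name="msid">V01</field>
    <field name="units">volts</field>
  </record>
  <record id="r2" kind="command">
    <field name="msid">C07</field>
  </record>
</tape>
"""


def msids_of_kind(xml_text, kind):
    """Return the text of every 'msid' field inside records of the given kind."""
    root = ET.fromstring(xml_text)
    # Attribute predicates select the matching records, then their msid fields.
    # 'kind' is assumed to be a trusted value; it is formatted into the path.
    path = f".//record[@kind='{kind}']/field[@name='msid']"
    return [node.text for node in root.findall(path)]
```

For instance, `msids_of_kind(SAMPLE, "measurement")` selects only the `msid` field of record `r1`.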
Automatically Preparing Safe SQL Queries
NASA Astrophysics Data System (ADS)
Bisht, Prithvi; Sistla, A. Prasad; Venkatakrishnan, V. N.
We present the first sound program source transformation approach for automatically transforming the code of a legacy web application to employ PREPARE statements in place of unsafe SQL queries. Our approach therefore opens the way for eradicating the SQL injection threat vector from legacy web applications.
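The difference between an unsafe, string-built query and a prepared (parameterized) statement can be sketched in Python with the standard `sqlite3` module. This only illustrates the target of such a transformation, not the authors' transformation tool; the `users` table and the function names are hypothetical.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Unsafe: user input is concatenated into the SQL text, so crafted
    # input can change the structure of the query.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_prepared(conn, name):
    # Safer: the SQL text is fixed and the input is bound as a parameter,
    # so it is always treated as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def demo():
    # Hypothetical table and data, created in memory for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    payload = "x' OR '1'='1"  # classic injection string
    return find_user_unsafe(conn, payload), find_user_prepared(conn, payload)
```

With the injection payload, the unsafe version returns every row in the table, while the parameterized version returns no rows because no user has that literal name.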
Petaminer: Using ROOT for efficient data storage in MySQL database
NASA Astrophysics Data System (ADS)
Cranshaw, J.; Malon, D.; Vaniachine, A.; Fine, V.; Lauret, J.; Hamill, P.
2010-04-01
High Energy and Nuclear Physics (HENP) experiments store Petabytes of event data and Terabytes of calibration data in ROOT files. The Petaminer project is developing a custom MySQL storage engine to enable the MySQL query processor to directly access experimental data stored in ROOT files. Our project is addressing the problem of efficient navigation to PetaBytes of HENP experimental data described with event-level TAG metadata, which is required by data intensive physics communities such as the LHC and RHIC experiments. Physicists need to be able to compose a metadata query and rapidly retrieve the set of matching events, where improved efficiency will facilitate the discovery process by permitting rapid iterations of data evaluation and retrieval. Our custom MySQL storage engine enables the MySQL query processor to directly access TAG data stored in ROOT TTrees. As ROOT TTrees are column-oriented, reading them directly provides improved performance over traditional row-oriented TAG databases. Leveraging the flexible and powerful SQL query language to access data stored in ROOT TTrees, the Petaminer approach enables rich MySQL index-building capabilities for further performance optimization.
TabSQL: a MySQL tool to facilitate mapping user data to public databases.
Xia, Xiao-Qin; McClelland, Michael; Wang, Yipeng
2010-06-23
With advances in high-throughput genomics and proteomics, it is challenging for biologists to deal with large data files and to map their data to annotations in public databases. We developed TabSQL, a MySQL-based application tool, for viewing, filtering and querying data files with large numbers of rows. TabSQL provides functions for downloading and installing table files from public databases including the Gene Ontology database (GO), the Ensembl databases, and genome databases from the UCSC genome bioinformatics site. Any other database that provides tab-delimited flat files can also be imported. The downloaded gene annotation tables can be queried together with users' data in TabSQL using either a graphic interface or command line. TabSQL allows queries across the user's data and public databases without programming. It is a convenient tool for biologists to annotate and enrich their data.
TabSQL: a MySQL tool to facilitate mapping user data to public databases
2010-01-01
Background: With advances in high-throughput genomics and proteomics, it is challenging for biologists to deal with large data files and to map their data to annotations in public databases. Results: We developed TabSQL, a MySQL-based application tool, for viewing, filtering and querying data files with large numbers of rows. TabSQL provides functions for downloading and installing table files from public databases including the Gene Ontology database (GO), the Ensembl databases, and genome databases from the UCSC genome bioinformatics site. Any other database that provides tab-delimited flat files can also be imported. The downloaded gene annotation tables can be queried together with users' data in TabSQL using either a graphic interface or command line. Conclusions: TabSQL allows queries across the user's data and public databases without programming. It is a convenient tool for biologists to annotate and enrich their data. PMID:20573251
Standard Port-Visit Cost Forecasting Model for U.S. Navy Husbanding Contracts
2009-12-01
Protocol (HTTP) server.35 2. MySQL. An open-source database.36 3. PHP. A common scripting language used for Web development.37 E. IMPLEMENTATION OF...Inc. (2009). MySQL Community Server (Version 5.1) [Software]. Available from http://dev.mysql.com/downloads/ 37 The PHP Group (2009). PHP (Version...Logistics Services MySQL My Structured Query Language NAVSUP Navy Supply Systems Command NC Non-Contract Items NPS Naval Postgraduate
Social media based NPL system to find and retrieve ARM data: Concept paper
DOE Office of Scientific and Technical Information (OSTI.GOV)
Devarakonda, Ranjeet; Giansiracusa, Michael T.; Kumar, Jitendra
Information connectivity and retrieval has a role in our daily lives. The most pervasive source of online information is databases. The amount of data is growing at a rapid rate and database technology is improving and having a profound effect. Almost all online applications are storing and retrieving information from databases. One challenge in supplying the public with wider access to informational databases is the need for knowledge of database languages like Structured Query Language (SQL). Although the SQL language has been published in many forms, not everybody is able to write SQL queries. Another challenge is that it may not be practical to make the public aware of the structure of the database. There is a need for novice users to query relational databases using their natural language. To solve this problem, many natural language interfaces to structured databases have been developed. The goal is to provide a more intuitive method for generating database queries and delivering responses. Social media makes it possible to interact with a wide section of the population. Through this medium, and with the help of Natural Language Processing (NLP), we can make the data of the Atmospheric Radiation Measurement Data Center (ADC) more accessible to the public. We propose an architecture for using Apache Lucene/Solr [1], OpenML [2,3], and Kafka [4] to generate an automated query/response system with inputs from Twitter [5], our Cassandra DB, and our log database. Using the Twitter API and NLP we can give the public the ability to ask questions of our database and get automated responses.
ERIC Educational Resources Information Center
Mills, Robert J.; Dupin-Bryant, Pamela A.; Johnson, John D.; Beaulieu, Tanya Y.
2015-01-01
The demand for Information Systems (IS) graduates with expertise in Structured Query Language (SQL) and database management is vast and projected to increase as "big data" becomes ubiquitous. To prepare students to solve complex problems in a data-driven world, educators must explore instructional strategies to help link prior knowledge…
Epstein, Richard H; Dexter, Franklin
2017-07-01
Comorbidity adjustment is often performed during outcomes and health care resource utilization research. Our goal was to develop an efficient algorithm in structured query language (SQL) to determine the Elixhauser comorbidity index. We wrote an SQL algorithm to calculate the Elixhauser comorbidities from Diagnosis Related Group and International Classification of Diseases (ICD) codes. Validation was by comparison to expected comorbidities from combinations of these codes and to the 2013 Nationwide Readmissions Database (NRD). The SQL algorithm matched perfectly with expected comorbidities for all combinations of ICD-9 or ICD-10, and Diagnosis Related Groups. Of 13 585 859 evaluable NRD records, the algorithm matched 100% of the listed comorbidities. Processing time was ∼0.05 ms/record. The SQL Elixhauser code was efficient and computationally identical to the SAS algorithm used for the NRD. This algorithm may be useful where preprocessing of large datasets in a relational database environment and comorbidity determination is desired before statistical analysis. A validated SQL procedure to calculate Elixhauser comorbidities and the van Walraven index from ICD-9 or ICD-10 discharge diagnosis codes has been published. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com
An XML-Based Knowledge Management System of Port Information for U.S. Coast Guard Cutters
2003-03-01
using DTDs was not chosen. XML Schema performs many of the same functions as SQL type schemas, but differ by the unique structure of XML documents...to access data from content files within the developed system. XPath is not equivalent to SQL. While XPath is very powerful at reaching into an XML...document and finding nodes or node sets, it is not a complete query language. For operations like joins, unions, intersections, etc., SQL is far
Evaluation of relational and NoSQL database architectures to manage genomic annotations.
Schulz, Wade L; Nelson, Brent G; Felker, Donn K; Durant, Thomas J S; Torres, Richard
2016-12-01
While the adoption of next generation sequencing has rapidly expanded, the informatics infrastructure used to manage the data generated by this technology has not kept pace. Historically, relational databases have provided much of the framework for data storage and retrieval. Newer technologies based on NoSQL architectures may provide significant advantages in storage and query efficiency, thereby reducing the cost of data management. But their relative advantage when applied to biomedical data sets, such as genetic data, has not been characterized. To this end, we compared the storage, indexing, and query efficiency of a common relational database (MySQL), a document-oriented NoSQL database (MongoDB), and a relational database with NoSQL support (PostgreSQL). When used to store genomic annotations from the dbSNP database, we found the NoSQL architectures to outperform traditional, relational models for speed of data storage, indexing, and query retrieval in nearly every operation. These findings strongly support the use of novel database technologies to improve the efficiency of data management within the biological sciences. Copyright © 2016 Elsevier Inc. All rights reserved.
An advanced web query interface for biological databases
Latendresse, Mario; Karp, Peter D.
2010-01-01
Although most web-based biological databases (DBs) offer some type of web-based form to allow users to author DB queries, these query forms are quite restricted in the complexity of DB queries that they can formulate. They can typically query only one DB, and can query only a single type of object at a time (e.g. genes) with no possible interaction between the objects—that is, in SQL parlance, no joins are allowed between DB objects. Writing precise queries against biological DBs is usually left to a programmer skillful enough in complex DB query languages like SQL. We present a web interface for building precise queries for biological DBs that can construct much more precise queries than most web-based query forms, yet that is user friendly enough to be used by biologists. It supports queries containing multiple conditions, and connecting multiple object types without using the join concept, which is unintuitive to biologists. This interactive web interface is called the Structured Advanced Query Page (SAQP). Users interactively build up a wide range of query constructs. Interactive documentation within the SAQP describes the schema of the queried DBs. The SAQP is based on BioVelo, a query language based on list comprehension. The SAQP is part of the Pathway Tools software and is available as part of several bioinformatics web sites powered by Pathway Tools, including the BioCyc.org site that contains more than 500 Pathway/Genome DBs. PMID:20624715
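To illustrate how a list comprehension can connect two object types without an explicit join clause, here is a small Python analogue. It is not BioVelo syntax, and the `Gene` and `Protein` classes, their fields, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    product: str  # name of the protein this gene encodes


@dataclass(frozen=True)
class Protein:
    name: str
    mass_kda: float


# Hypothetical sample objects, for illustration only.
GENES = [Gene("geneA", "protA"), Gene("geneB", "protB"), Gene("geneC", "protC")]
PROTEINS = [Protein("protA", 51.0), Protein("protB", 12.5), Protein("protC", 88.2)]


def genes_with_heavy_products(genes, proteins, min_mass_kda):
    """Names of genes whose product has at least the given mass.

    The two 'for' clauses range over both object types and the 'if'
    clause connects them, which SQL would express as a join.
    """
    return [
        g.name
        for g in genes
        for p in proteins
        if g.product == p.name and p.mass_kda >= min_mass_kda
    ]
```

Here `genes_with_heavy_products(GENES, PROTEINS, 50.0)` yields `geneA` and `geneC`: the condition linking gene to product sits alongside the other conditions, so the query reads as a set of filters, with no separate join step.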
NASA Technical Reports Server (NTRS)
McGlynn, T.; Santisteban, M.
2007-01-01
This chapter provides a very brief introduction to the Structured Query Language (SQL) for getting information from relational databases. We make no pretense that this is a complete or comprehensive discussion of SQL. There are many aspects of the language that will be completely ignored in the presentation. The goal here is to provide enough background so that users understand the basic concepts involved in building and using relational databases. We also go through the steps involved in building a particular astronomical database used in some of the other presentations in this volume.
A Framework for Building and Reasoning with Adaptive and Interoperable PMESII Models
2007-11-01
Description Logic SOA Service Oriented Architecture SPARQL Simple Protocol And RDF Query Language SQL Standard Query Language SROM Stability and...another by providing a more expressive ontological structure for one of the models, e.g., semantic networks can be mapped to first- order logical...Pellet is an open-source reasoner that works with OWL-DL. It accepts the SPARQL protocol and RDF query language ( SPARQL ) and provides a Java API to
NoSQL: collection document and cloud by using a dynamic web query form
NASA Astrophysics Data System (ADS)
Abdalla, Hemn B.; Lin, Jinzhao; Li, Guoquan
2015-07-01
MongoDB (from "humongous") is an open-source document database and the leading NoSQL database. NoSQL (Not Only SQL) databases are next-generation databases that are non-relational, distributed, open-source and horizontally scalable, and they provide a mechanism for the storage and retrieval of documents. Previously, we stored and retrieved data using SQL queries. Here, we use MongoDB, which means that we do not use MySQL or SQL queries. Documents are imported directly into our drives and retrieved from them without SQL queries, using the IO BufferReader and Writer: BufferReader for importing our document files into a folder (drive), and BufferWriter for retrieving the document files from that folder or drive. We also provide security for the stored files: if documents are stored in a local folder, anyone can view and modify them, so we protect the files to prevent this. The original document files are converted to another format; in this paper, a binary format is used. Documents are converted to binary format and then stored directly in one of our folders, at which point the storage space provides a private key for accessing the file. Any user who tries to open the document files finds the data in binary format, while the owner of a document can view the original format using the personal key, having received the secret key from the cloud.
Analyzing Enron Data: Bitmap Indexing Outperforms MySQL Queries by Several Orders of Magnitude
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stockinger, Kurt; Rotem, Doron; Shoshani, Arie
2006-01-28
FastBit is an efficient, compressed bitmap indexing technology that was developed in our group. In this report we evaluate the performance of MySQL and FastBit for analyzing the email traffic of the Enron dataset. The first finding shows that materializing the join results of several tables significantly improves the query performance. The second finding shows that FastBit outperforms MySQL by several orders of magnitude.
Mining the SDSS SkyServer SQL queries log
NASA Astrophysics Data System (ADS)
Hirota, Vitor M.; Santos, Rafael; Raddick, Jordan; Thakar, Ani
2016-05-01
SkyServer, the Internet portal for the Sloan Digital Sky Survey (SDSS) astronomic catalog, provides a set of tools that allows data access for astronomers and scientific education. One of SkyServer's data access interfaces allows users to enter ad-hoc SQL statements to query the catalog. SkyServer also presents some template queries that can be used as a basis for more complex queries. This interface has logged over 330 million queries submitted since 2001. It is expected that analysis of this data can be used to investigate usage patterns, identify potential new classes of queries, find similar queries, etc. and to shed some light on how users interact with the Sloan Digital Sky Survey data and how scientists have adopted the new paradigm of e-Science, which could in turn lead to enhancements on the user interfaces and experience in general. In this paper we review some approaches to SQL query mining, apply the traditional techniques used in the literature and present lessons learned, namely, that the general text mining approach for feature extraction and clustering does not seem to be adequate for this type of data, and, most importantly, we find that this type of analysis can result in very different queries being clustered together.
Evaluation of Sub Query Performance in SQL Server
NASA Astrophysics Data System (ADS)
Oktavia, Tanty; Sujarwo, Surya
2014-03-01
The paper explores several sub query methods used in a query and their impact on query performance. The study uses an experimental approach to evaluate the performance of each sub query method combined with an indexing strategy. The sub query methods consist of in, exists, relational operator, and relational operator combined with the top operator. The experiment shows that using a relational operator combined with an indexing strategy in a sub query gives better performance than using the same method without an indexing strategy, and better performance than the other methods. In summary, for applications that emphasize the performance of retrieving data from a database, it is better to use a relational operator combined with an indexing strategy. This study was done on Microsoft SQL Server 2012.
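To make the compared sub query forms concrete, the sketch below writes the same question ("which customers have at least one order?") with IN, with EXISTS, and with a relational operator on a single-row sub query. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than SQL Server 2012, SQLite's LIMIT 1 stands in for the TOP operator, the tables and data are hypothetical, and no timing conclusions should be drawn from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    -- Index on the column used inside every sub query below.
    CREATE INDEX idx_orders_customer ON orders (customer_id);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Budi'), (3, 'Citra');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 75.0), (12, 3, 20.0);
""")

QUERIES = {
    # Sub query with IN
    "in": """SELECT name FROM customers
             WHERE id IN (SELECT customer_id FROM orders)
             ORDER BY name""",
    # Correlated sub query with EXISTS
    "exists": """SELECT name FROM customers c
                 WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
                 ORDER BY name""",
    # Relational operator (=) on a single-row sub query; LIMIT 1 plays the
    # role that TOP 1 would play in SQL Server.
    "relational": """SELECT name FROM customers c
                     WHERE c.id = (SELECT o.customer_id FROM orders o
                                   WHERE o.customer_id = c.id LIMIT 1)
                     ORDER BY name""",
}

results = {label: [row[0] for row in conn.execute(sql)]
           for label, sql in QUERIES.items()}
for label, names in results.items():
    print(label, names)
```

All three forms return the same rows here; the paper's question is how their execution cost differs once an index is present.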
Sánchez-de-Madariaga, Ricardo; Muñoz, Adolfo; Lozano-Rubí, Raimundo; Serrano-Balazote, Pablo; Castro, Antonio L; Moreno, Oscar; Pascual, Mario
2017-08-18
The objective of this research is to compare the relational and non-relational (NoSQL) database systems approaches in order to store, recover, query and persist standardized medical information in the form of ISO/EN 13606 normalized Electronic Health Record XML extracts, both in isolation and concurrently. NoSQL database systems have recently attracted much attention, but few studies in the literature address their direct comparison with relational databases when applied to build the persistence layer of a standardized medical information system. One relational and two NoSQL databases (one document-based and one native XML database) of three different sizes have been created in order to evaluate and compare the response times (algorithmic complexity) of six different complexity growing queries, which have been performed on them. Similar appropriate results available in the literature have also been considered. Relational and non-relational NoSQL database systems show almost linear algorithmic complexity query execution. However, they show very different linear slopes, the former being much steeper than the two latter. Document-based NoSQL databases perform better in concurrency than in isolation, and also better than relational databases in concurrency. Non-relational NoSQL databases seem to be more appropriate than standard relational SQL databases when database size is extremely high (secondary use, research applications). Document-based NoSQL databases perform in general better than native XML NoSQL databases. EHR extracts visualization and edition are also document-based tasks more appropriate to NoSQL database systems. However, the appropriate database solution much depends on each particular situation and specific problem.
Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick
2016-01-01
This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. 
Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest. PMID:26958859
Providing R-Tree Support for Mongodb
NASA Astrophysics Data System (ADS)
Xiang, Longgang; Shao, Xiaotian; Wang, Dehao
2016-06-01
Supporting large amounts of spatial data is a significant characteristic of modern databases. However, unlike some mature relational databases, such as Oracle and PostgreSQL, most of current burgeoning NoSQL databases are not well designed for storing geospatial data, which is becoming increasingly important in various fields. In this paper, we propose a novel method to provide R-tree index, as well as corresponding spatial range query and nearest neighbour query functions, for MongoDB, one of the most prevalent NoSQL databases. First, after in-depth analysis of MongoDB's features, we devise an efficient tabular document structure which flattens R-tree index into MongoDB collections. Further, relevant mechanisms of R-tree operations are issued, and then we discuss in detail how to integrate R-tree into MongoDB. Finally, we present the experimental results which show that our proposed method out-performs the built-in spatial index of MongoDB. Our research will greatly facilitate big data management issues with MongoDB in a variety of geospatial information applications.
Nadkarni, P M
1997-08-01
Concept Locator (CL) is a client-server application that accesses a Sybase relational database server containing a subset of the UMLS Metathesaurus for the purpose of retrieval of concepts corresponding to one or more query expressions supplied to it. CL's query grammar permits complex Boolean expressions, wildcard patterns, and parenthesized (nested) subexpressions. CL translates the query expressions supplied to it into one or more SQL statements that actually perform the retrieval. The generated SQL is optimized by the client to take advantage of the strengths of the server's query optimizer, and sidesteps its weaknesses, so that execution is reasonably efficient.
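The translation step described above, from a nested Boolean query with wildcards to SQL, can be sketched roughly as follows. This is not CL's grammar or its generated SQL: the expression is given as an already-parsed nested tuple, the `concepts` table and its `name` column are hypothetical, and SQLite (via Python's sqlite3) stands in for the Sybase server.

```python
import sqlite3


def to_sql(expr):
    """Translate a nested Boolean expression into (where_clause, params).

    expr is ("term", pattern), ("not", sub), or ("and"/"or", sub1, sub2, ...).
    A '*' in a pattern is mapped to the SQL wildcard '%'. Literal '%' or '_'
    characters in a pattern are not escaped in this sketch.
    """
    op = expr[0]
    if op == "term":
        return "name LIKE ?", [expr[1].replace("*", "%")]
    if op == "not":
        clause, params = to_sql(expr[1])
        return f"NOT ({clause})", params
    if op in ("and", "or"):
        parts = [to_sql(sub) for sub in expr[1:]]
        # Parenthesize each sub-clause so nesting survives translation.
        clause = f" {op.upper()} ".join(f"({c})" for c, _ in parts)
        params = [p for _, ps in parts for p in ps]
        return clause, params
    raise ValueError(f"unknown operator: {op!r}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concepts (name TEXT)")
conn.executemany(
    "INSERT INTO concepts VALUES (?)",
    [("heart attack",), ("heart failure",),
     ("myocardial infarction",), ("cardiac arrest",)],
)

# (heart* OR *infarction) AND NOT *failure
query = ("and",
         ("or", ("term", "heart*"), ("term", "*infarction")),
         ("not", ("term", "*failure")))
where, params = to_sql(query)
matches = [row[0] for row in conn.execute(
    f"SELECT name FROM concepts WHERE {where} ORDER BY name", params)]
print(where)
print(matches)
```

Passing the patterns as bound parameters keeps user-supplied text out of the SQL string; only the fixed operator keywords and the column name are interpolated.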
2008-03-01
Fortunately, built into Excel is the capability to use ActiveX Data Objects (ADO), a software feature which uses VBA to interface with external...part of Excel’s ActiveX Data Objects (ADO) functionality, Excel can execute SQL queries in Access with VBA. An SQL query statement can be written
An SQL query generator for CLIPS
NASA Technical Reports Server (NTRS)
Snyder, James; Chirica, Laurian
1990-01-01
As expert systems become more widely used, their access to large amounts of external information becomes increasingly important. This information exists in several forms such as statistical, tabular data, knowledge gained by experts and large databases of information maintained by companies. Because many expert systems, including CLIPS, do not provide access to this external information, much of the usefulness of expert systems is left untapped. The scope of this paper is to describe a database extension for the CLIPS expert system shell. The current industry standard database language is SQL. Due to SQL standardization, large amounts of information stored on various computers, potentially at different locations, will be more easily accessible. Expert systems should be able to directly access these existing databases rather than requiring information to be re-entered into the expert system environment. The ORACLE relational database management system (RDBMS) was used to provide a database connection within the CLIPS environment. To facilitate relational database access, a query generation system was developed as a CLIPS user function. The queries are entered in a CLIPS-like syntax and are passed to the query generator, which constructs an SQL query and submits it to the ORACLE RDBMS for execution. The query results are asserted as CLIPS facts. The query generator was developed primarily for use within the ICADS project (Intelligent Computer Aided Design System) currently being developed by the CAD Research Unit in the California Polytechnic State University (Cal Poly). In ICADS, there are several parallel or distributed expert systems accessing a common knowledge base of facts. Each expert system has a narrow domain of interest and therefore needs only certain portions of the information. The query generator provides a common method of accessing this information and allows the expert system to specify what data is needed without specifying how to retrieve it.
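The general pattern described here, in which a caller states what data it needs and a generator builds the SQL and returns the rows as facts, might look something like the sketch below. It is an analogy only: Python stands in for CLIPS, SQLite stands in for ORACLE, and the `rooms` table, the `ALLOWED` schema list and the function names are hypothetical rather than taken from the paper.

```python
import sqlite3

# Hypothetical schema description used to validate identifiers, since table
# and column names cannot be passed as bound parameters.
ALLOWED = {"rooms": {"name", "floor", "area"}}


def build_query(table, fields, conditions):
    """Build a parameterized SELECT from a declarative request."""
    if table not in ALLOWED:
        raise ValueError(f"unknown table: {table!r}")
    if not (set(fields) | set(conditions)) <= ALLOWED[table]:
        raise ValueError("unknown column in request")
    sql = f"SELECT {', '.join(fields)} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in conditions)
    return sql, list(conditions.values())


def query_as_facts(conn, table, fields, conditions):
    """Run the generated query and return each row as a fact-like tuple."""
    sql, params = build_query(table, fields, conditions)
    return [(table, *row) for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (name TEXT, floor INTEGER, area REAL)")
conn.executemany(
    "INSERT INTO rooms VALUES (?, ?, ?)",
    [("lab", 2, 40.0), ("office", 2, 15.0), ("lobby", 1, 60.0)],
)

facts = query_as_facts(conn, "rooms", ["name", "area"], {"floor": 2})
print(facts)
```

The caller never writes SQL; it names a table, the fields it wants and the conditions, which mirrors the "what, not how" separation the abstract attributes to the query generator.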
Agile Datacube Analytics (not just) for the Earth Sciences
NASA Astrophysics Data System (ADS)
Misev, Dimitar; Merticariu, Vlad; Baumann, Peter
2017-04-01
Metadata are considered small, smart, and queryable; data, on the other hand, are known as big, clumsy, hard to analyze. Consequently, gridded data - such as images, image timeseries, and climate datacubes - are managed separately from the metadata, and with different, restricted retrieval capabilities. One reason for this silo approach is that databases, while good at tables, XML hierarchies, RDF graphs, etc., traditionally do not support multi-dimensional arrays well. This gap is being closed by Array Databases which extend the SQL paradigm of "any query, anytime" to NoSQL arrays. They introduce semantically rich modelling combined with declarative, high-level query languages on n-D arrays. On Server side, such queries can be optimized, parallelized, and distributed based on partitioned array storage. This way, they offer new vistas in flexibility, scalability, performance, and data integration. In this respect, the forthcoming ISO SQL extension MDA ("Multi-dimensional Arrays") will be a game changer in Big Data Analytics. We introduce concepts and opportunities through the example of rasdaman ("raster data manager") which in fact has pioneered the field of Array Databases and forms the blueprint for ISO SQL/MDA and further Big Data standards, such as OGC WCPS for querying spatio-temporal Earth datacubes. With operational installations exceeding 140 TB queries have been split across more than one thousand cloud nodes, using CPUs as well as GPUs. Installations can easily be mashed up securely, enabling large-scale location-transparent query processing in federations. Federation queries have been demonstrated live at EGU 2016 spanning Europe and Australia in the context of the intercontinental EarthServer initiative, visualized through NASA WorldWind.
Agile Datacube Analytics (not just) for the Earth Sciences
NASA Astrophysics Data System (ADS)
Baumann, P.
2016-12-01
Metadata are considered small, smart, and queryable; data, on the other hand, are known as big, clumsy, hard to analyze. Consequently, gridded data - such as images, image timeseries, and climate datacubes - are managed separately from the metadata, and with different, restricted retrieval capabilities. One reason for this silo approach is that databases, while good at tables, XML hierarchies, RDF graphs, etc., traditionally do not support multi-dimensional arrays well. This gap is being closed by Array Databases which extend the SQL paradigm of "any query, anytime" to NoSQL arrays. They introduce semantically rich modelling combined with declarative, high-level query languages on n-D arrays. On Server side, such queries can be optimized, parallelized, and distributed based on partitioned array storage. This way, they offer new vistas in flexibility, scalability, performance, and data integration. In this respect, the forthcoming ISO SQL extension MDA ("Multi-dimensional Arrays") will be a game changer in Big Data Analytics. We introduce concepts and opportunities through the example of rasdaman ("raster data manager") which in fact has pioneered the field of Array Databases and forms the blueprint for ISO SQL/MDA and further Big Data standards, such as OGC WCPS for querying spatio-temporal Earth datacubes. With operational installations exceeding 140 TB, queries have been split across more than one thousand cloud nodes, using CPUs as well as GPUs. Installations can easily be mashed up securely, enabling large-scale location-transparent query processing in federations. Federation queries have been demonstrated live at EGU 2016 spanning Europe and Australia in the context of the intercontinental EarthServer initiative, visualized through NASA WorldWind.
Network Configuration of Oracle and Database Programming Using SQL
NASA Technical Reports Server (NTRS)
Davis, Melton; Abdurrashid, Jibril; Diaz, Philip; Harris, W. C.
2000-01-01
A database can be defined as a collection of information organized in such a way that it can be retrieved and used. A database management system (DBMS) can further be defined as the tool that enables us to manage and interact with the database. The Oracle 8 Server is a state-of-the-art information management environment. It is a repository for very large amounts of data, and gives users rapid access to that data. The Oracle 8 Server allows for sharing of data between applications; the information is stored in one place and used by many systems. My research will focus primarily on SQL (Structured Query Language) programming. SQL is the way you define and manipulate data in Oracle's relational database. SQL is the industry standard adopted by all database vendors. When programming with SQL, you work on sets of data (i.e., information is not processed one record at a time).
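The point about working on sets of data rather than one record at a time can be illustrated with a small sketch. It uses Python's built-in sqlite3 module instead of Oracle 8, and the `employees` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES (1, 'ENG', 100), (2, 'ENG', 120), (3, 'OPS', 90);
""")

# One statement describes the whole set of rows to change; there is no
# explicit loop that fetches and updates records one by one.
cur = conn.execute(
    "UPDATE employees SET salary = salary + 10 WHERE dept = ?", ("ENG",)
)
rows_changed = cur.rowcount

salaries = conn.execute(
    "SELECT id, salary FROM employees ORDER BY id"
).fetchall()
print(rows_changed, salaries)
```

The WHERE clause selects the set and the SET clause applies to every member of it, which is what distinguishes this style from record-at-a-time processing.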
Ontological Approach to Military Knowledge Modeling and Management
2004-03-01
federated search mechanism has to reformulate user queries (expressed using the ontology) in the query languages of the different sources (e.g. SQL...ontologies as a common terminology – Unified query to perform federated search • Query processing – Ontology mapping to sources reformulate queries
Use of Graph Database for the Integration of Heterogeneous Biological Data.
Yoon, Byoung-Ha; Kim, Seon-Kyu; Kim, Seon-Young
2017-03-01
Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
Use of Graph Database for the Integration of Heterogeneous Biological Data
Yoon, Byoung-Ha; Kim, Seon-Kyu
2017-01-01
Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data. PMID:28416946
High dimensional biological data retrieval optimization with NoSQL technology.
Wang, Shicai; Pandis, Ioannis; Wu, Chao; He, Sijin; Johnson, David; Emam, Ibrahim; Guitton, Florian; Guo, Yike
2014-01-01
High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. 
We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
High dimensional biological data retrieval optimization with NoSQL technology
2014-01-01
Background High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. Conclusions The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. 
We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data. PMID:25435347
Hewitt, Robin; Gobbi, Alberto; Lee, Man-Ling
2005-01-01
Relational databases are the current standard for storing and retrieving data in the pharmaceutical and biotech industries. However, retrieving data from a relational database requires specialized knowledge of the database schema and of the SQL query language. At Anadys, we have developed an easy-to-use system for searching and reporting data in a relational database to support our drug discovery project teams. This system is fast and flexible and allows users to access all data without having to write SQL queries. This paper presents the hierarchical, graph-based metadata representation and SQL-construction methods that, together, are the basis of this system's capabilities.
Integrating Scientific Array Processing into Standard SQL
NASA Astrophysics Data System (ADS)
Misev, Dimitar; Bachhuber, Johannes; Baumann, Peter
2014-05-01
We live in a time that is dominated by data. Data storage is cheap and more applications than ever accrue vast amounts of data. Storing the emerging multidimensional data sets efficiently, however, and allowing them to be queried by their inherent structure, is a challenge many databases have to face today. Despite the fact that multidimensional array data is almost always linked to additional, non-array information, array databases have mostly developed separately from relational systems, resulting in a disparity between the two database categories. The current SQL standard and SQL DBMSs support arrays - and in an extension also multidimensional arrays - but do so in a very rudimentary and inefficient way. This poster demonstrates the practicality of an SQL extension for array processing, implemented in a proof-of-concept multi-faceted system that manages a federation of array and relational database systems, providing transparent, efficient and scalable access to the heterogeneous data in them.
Geographic Video 3d Data Model And Retrieval
NASA Astrophysics Data System (ADS)
Han, Z.; Cui, C.; Kong, Y.; Wu, H.
2014-04-01
Geographic video includes both spatial and temporal geographic features acquired through ground-based or non-ground-based cameras. With the popularity of video capture devices such as smartphones, the volume of user-generated geographic video clips has grown significantly, and this growth is accelerating. Such a massive and increasing volume poses a major challenge to efficient video management and query. Most of today's video management and query techniques are based on signal-level content extraction and cannot fully utilize the geographic information of the videos. This paper introduces a geographic video 3D data model based on spatial information. The main idea of the model is to utilize the location, trajectory and azimuth information acquired by sensors such as GPS receivers and 3D electronic compasses in conjunction with video contents. The raw spatial information is synthesized into point, line, polygon and solid geometries according to camcorder parameters such as focal length and angle of view. For video segments and video frames, we define three categories of geometry objects using the geometry model of the OGC Simple Features Specification for SQL. Videos can be queried by computing the spatial relation between query objects and these geometry objects, such as VFLocation, VSTrajectory, VSFOView and VFFovCone. We designed the query methods in detail using the structured query language (SQL). The experiments indicate that the model is a multi-objective, integrated, loosely coupled, flexible and extensible data model for the management of geographic stereo video.
Incremental Query Rewriting with Resolution
NASA Astrophysics Data System (ADS)
Riazanov, Alexandre; Aragão, Marcelo A. T.
We address the problem of semantic querying of relational databases (RDB) modulo knowledge bases using very expressive knowledge representation formalisms, such as full first-order logic or its various fragments. We propose to use a resolution-based first-order logic (FOL) reasoner for computing schematic answers to deductive queries, with the subsequent translation of these schematic answers to SQL queries which are evaluated using a conventional relational DBMS. We call our method incremental query rewriting, because an original semantic query is rewritten into a (potentially infinite) series of SQL queries. In this chapter, we outline the main idea of our technique - using abstractions of databases and constrained clauses for deriving schematic answers, and provide completeness and soundness proofs to justify the applicability of this technique to the case of resolution for FOL without equality. The proposed method can be directly used with regular RDBs, including legacy databases. Moreover, we propose it as a potential basis for an efficient Web-scale semantic search technology.
Shark: SQL and Analytics with Cost-Based Query Optimization on Coarse-Grained Distributed Memory
2014-01-13
RDBMS and contains a database (often MySQL or Derby) with a namespace for tables, table metadata and partition information. Table data is stored in an...serialization/deserialization) Java interface implementations with corresponding object inspectors. The Hive driver controls the processing of queries, coordinat...native API, RDD operations are invoked through a functional interface similar to DryadLINQ [32] in Scala, Java or Python. For example, the Scala code for
An effective model for store and retrieve big health data in cloud computing.
Goli-Malekabadi, Zohreh; Sargolzaei-Javan, Morteza; Akbari, Mohammad Kazem
2016-08-01
The volume of healthcare data, including different and variable text types, sounds, and images, is increasing day by day. Therefore, the storage and processing of these data is a necessary and challenging issue. Generally, relational databases are used for storing health data, but they are not able to handle the massive and diverse nature of these data. This study aimed at presenting a model based on NoSQL databases for the storage of healthcare data. Among the different types of NoSQL databases, document-based DBs were selected following a survey of the nature of health data. The presented model was implemented in a Cloud environment to gain access to its distribution properties. Then, the data were distributed across the database by applying sharding. The efficiency of the model was evaluated in comparison with the previous data model, a relational database, considering query time, data preparation, flexibility, and extensibility parameters. The results showed that the presented model performed approximately the same as SQL Server for "read" queries while it acted more efficiently than SQL Server for "write" queries. Also, the performance of the presented model was better than SQL Server in terms of flexibility, data preparation and extensibility. Based on these observations, the proposed model was more effective than relational databases for handling health data.
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce.
Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng
2013-11-01
The proliferation of GPS-enabled devices and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes of data has become increasingly important in numerous fields. This requires a scalable and efficient spatial data warehousing solution, as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS - a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing.
Data Processing on Database Management Systems with Fuzzy Query
NASA Astrophysics Data System (ADS)
Şimşek, Irfan; Topuz, Vedat
In this study, a fuzzy query tool (SQLf) for non-fuzzy database management systems was developed. In addition, sample fuzzy queries were run on real data with the tool. The performance of SQLf was tested with data on Marmara University students' food grants. The food grant data were collected in a MySQL database through a web form that students filled in to describe their social and economic conditions for the food grant request. The form consists of questions that have fuzzy and crisp answers. The main purpose of this fuzzy query is to determine the students who deserve the grant. SQLf easily found the students eligible for the grant through predefined fuzzy values. The fuzzy query tool (SQLf) could easily be used with other database systems such as Oracle and SQL Server.
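The idea of running a fuzzy predicate over a crisp (non-fuzzy) table can be sketched as follows. This is only an illustration, not the SQLf tool described above: the `applicants` table, the `low_income` membership function and its breakpoints (500 and 1500) are all hypothetical, and SQLite stands in for MySQL so that the sketch is self-contained.

```python
import sqlite3

def low_income(income, full=500.0, zero=1500.0):
    """Degree (0..1) to which an income counts as 'low'.

    Hypothetical linear ramp: fully 'low' at or below `full`,
    not 'low' at all at or above `zero`.
    """
    if income <= full:
        return 1.0
    if income >= zero:
        return 0.0
    return (zero - income) / (zero - full)

conn = sqlite3.connect(":memory:")
# Expose the membership function to SQL as a one-argument scalar function.
conn.create_function("low_income", 1, low_income)

conn.execute("CREATE TABLE applicants (name TEXT, income REAL)")
conn.executemany(
    "INSERT INTO applicants VALUES (?, ?)",
    [("A", 400.0), ("B", 1000.0), ("C", 2000.0)],
)

# Crisp data, fuzzy condition: keep rows whose membership degree
# reaches the requested threshold, best matches first.
result = conn.execute(
    "SELECT name, low_income(income) AS mu FROM applicants "
    "WHERE low_income(income) >= ? ORDER BY mu DESC",
    (0.5,),
).fetchall()
print(result)
```

The stored data stay crisp; only the query condition is fuzzy, which is why such a layer can sit on top of an ordinary relational database.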
Efficient hemodynamic event detection utilizing relational databases and wavelet analysis
NASA Technical Reports Server (NTRS)
Saeed, M.; Mark, R. G.
2001-01-01
Development of a temporal query framework for time-oriented medical databases has hitherto been a challenging problem. We describe a novel method for the detection of hemodynamic events in multiparameter trends utilizing wavelet coefficients in a MySQL relational database. Storage of the wavelet coefficients allowed for a compact representation of the trends, and provided robust descriptors for the dynamics of the parameter time series. A data model was developed to allow for simplified queries along several dimensions and time scales. Of particular importance, the data model and wavelet framework allowed for queries to be processed with minimal table-join operations. A web-based search engine was developed to allow for user-defined queries. Typical queries required between 0.01 and 0.02 seconds, with at least two orders of magnitude improvement in speed over conventional queries. This powerful and innovative structure will facilitate research on large-scale time-oriented medical databases.
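Storing wavelet coefficients in a single relational table so that events can be found without joins can be sketched as below. The abstract does not say which wavelet family or schema was used, so the unnormalized Haar transform, the `coeffs` table and the sample trend are assumptions made purely for illustration; SQLite is used in place of MySQL to keep the sketch self-contained.

```python
import sqlite3

def haar_step(signal):
    """One level of the unnormalized Haar transform:
    pairwise averages (approximation) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal):
    """Return (level, position, detail) rows for every scale.
    The signal length must be a power of two."""
    rows, level, current = [], 1, list(signal)
    while len(current) > 1:
        current, detail = haar_step(current)
        rows.extend((level, pos, d) for pos, d in enumerate(detail))
        level += 1
    return rows

# Hypothetical parameter trend with a sudden drop after the fifth sample.
trend = [80, 80, 80, 80, 80, 60, 60, 60]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coeffs (level INTEGER, pos INTEGER, value REAL)")
conn.executemany("INSERT INTO coeffs VALUES (?, ?, ?)", haar_decompose(trend))

# A large detail coefficient marks an abrupt change at that scale and
# position; the query touches one table only, so no joins are needed.
events = conn.execute(
    "SELECT level, pos, value FROM coeffs WHERE ABS(value) >= ? "
    "ORDER BY level, pos",
    (5.0,),
).fetchall()
print(events)
```

Each row carries its own scale and position, so queries "along several dimensions and time scales" reduce to simple filters on `level` and `pos`.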
A Tutorial in Creating Web-Enabled Databases with Inmagic DB/TextWorks through ODBC.
ERIC Educational Resources Information Center
Breeding, Marshall
2000-01-01
Explains how to create Web-enabled databases. Highlights include Inmagic's DB/Text WebPublisher product called DB/TextWorks; ODBC (Open Database Connectivity) drivers; Perl programming language; HTML coding; Structured Query Language (SQL); Common Gateway Interface (CGI) programming; and examples of HTML pages and Perl scripts. (LRW)
Datacube Services in Action, Using Open Source and Open Standards
NASA Astrophysics Data System (ADS)
Baumann, P.; Misev, D.
2016-12-01
Array Databases comprise novel, promising technology for massive spatio-temporal datacubes, extending the SQL paradigm of "any query, anytime" to n-D arrays. On the server side, such queries can be optimized, parallelized, and distributed based on partitioned array storage. The rasdaman ("raster data manager") system, which has pioneered Array Databases, is available in open source on www.rasdaman.org. Its declarative query language extends SQL with array operators which are optimized and parallelized on the server side. The rasdaman engine, which is part of OSGeo Live, is mature and in operational use on databases individually holding dozens of terabytes. Further, the rasdaman concepts have strongly impacted international Big Data standards in the field, including the forthcoming MDA ("Multi-Dimensional Array") extension to ISO SQL, the OGC Web Coverage Service (WCS) and Web Coverage Processing Service (WCPS) standards, and the forthcoming INSPIRE WCS/WCPS; in both OGC and INSPIRE, rasdaman is the WCS Core Reference Implementation. In our talk we present concepts, architecture, operational services, and standardization impact of open-source rasdaman, as well as experiences made.
ERIC Educational Resources Information Center
Bosc, P.; Lietard, L.; Pivert, O.
2003-01-01
Considers flexible querying of relational databases. Highlights include SQL languages and basic aggregate operators; Sugeno's fuzzy integral; evaluation examples; and how and under what conditions other aggregate functions could be applied to fuzzy sets in a flexible query. (Author/LRW)
NASA Astrophysics Data System (ADS)
Kuznetsov, Valentin; Riley, Daniel; Afaq, Anzar; Sekhri, Vijay; Guo, Yuyi; Lueking, Lee
2010-04-01
The CMS experiment has implemented a flexible and powerful system enabling users to find data within the CMS physics data catalog. The Dataset Bookkeeping Service (DBS) comprises a database and the services used to store and access metadata related to CMS physics data. To this, we have added a generalized query system alongside the existing web and programmatic interfaces to the DBS. This query system is based on a query language that hides the complexity of the underlying database structure by discovering the join conditions between database tables. This provides a way of querying the system that is simple and straightforward for CMS data managers and physicists to use without requiring knowledge of the database tables or keys. The DBS Query Language uses the ANTLR tool to build the input query parser and tokenizer, followed by a query builder that uses a graph representation of the DBS schema to construct the SQL query sent to the underlying database. We will describe the design of the query system, provide details of the language components and give an overview of how this component fits into the overall data discovery system architecture.
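Discovering join conditions from a graph representation of a schema can be sketched as a shortest-path search between the tables a query mentions. The tables, keys and helper names below are hypothetical and do not reproduce the actual DBS schema or its query builder; the sketch only shows the general technique.

```python
from collections import deque

# Hypothetical schema graph: each edge carries the join condition
# between two tables.
SCHEMA = {
    ("dataset", "block"): "block.dataset_id = dataset.id",
    ("block", "file"): "file.block_id = block.id",
    ("file", "lumi"): "lumi.file_id = file.id",
}

def neighbors(table):
    """Yield (adjacent table, join condition) pairs for `table`."""
    for (a, b), cond in SCHEMA.items():
        if a == table:
            yield b, cond
        elif b == table:
            yield a, cond

def join_path(start, goal):
    """Breadth-first search for the shortest chain of joins."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for nxt, cond in neighbors(table):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(nxt, cond)]))
    raise ValueError(f"no join path from {start} to {goal}")

def build_sql(select_col, where_col):
    """Turn 'find <select_col> where <where_col> = ?' into SQL,
    filling in the joins the user did not have to know about."""
    start, goal = select_col.split(".")[0], where_col.split(".")[0]
    joins = "".join(f" JOIN {t} ON {c}" for t, c in join_path(start, goal))
    return f"SELECT {select_col} FROM {start}{joins} WHERE {where_col} = ?"

sql = build_sql("file.name", "dataset.name")
print(sql)
```

The user names only the columns of interest; the intermediate `block` table and both join conditions are supplied by the graph search.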
A SQL-Database Based Meta-CASE System and its Query Subsystem
NASA Astrophysics Data System (ADS)
Eessaar, Erki; Sgirka, Rünno
Meta-CASE systems simplify the creation of CASE (Computer Aided System Engineering) systems. In this paper, we present a meta-CASE system that provides a web-based user interface and uses an object-relational database system (ORDBMS) as its basis. The use of ORDBMSs allows us to integrate different parts of the system and simplify the creation of meta-CASE and CASE systems. ORDBMSs provide a powerful query mechanism. The proposed system allows developers to use queries to evaluate and gradually improve artifacts and calculate values of software measures. We illustrate the use of the system with the SimpleM modeling language and discuss the use of SQL in the context of queries about artifacts. We have created a prototype of the meta-CASE system using the PostgreSQL™ ORDBMS and the PHP scripting language.
2015-09-01
Detectability ...............................................................................................37 Figure 20. Excel VBA Codes for Checker...National Vulnerability Database OS Operating System SQL Structured Query Language VC Verification Condition VBA Visual Basic for Applications...checks each of these assertions for detectability by Daikon. The checker is an Excel Visual Basic for Applications ( VBA ) script that checks the
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce
Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng
2016-01-01
The proliferation of GPS-enabled devices and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes of data has become increasingly important in numerous fields. This requires a scalable and efficient spatial data warehousing solution, as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS - a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing. PMID:27617325
NoSQL Based 3D City Model Management System
NASA Astrophysics Data System (ADS)
Mao, B.; Harrie, L.; Cao, J.; Wu, Z.; Shen, J.
2014-04-01
To manage increasingly complicated 3D city models, a framework based on NoSQL database is proposed in this paper. The framework supports import and export of 3D city model according to international standards such as CityGML, KML/COLLADA and X3D. We also suggest and implement 3D model analysis and visualization in the framework. For city model analysis, 3D geometry data and semantic information (such as name, height, area, price and so on) are stored and processed separately. We use a Map-Reduce method to deal with the 3D geometry data since it is more complex, while the semantic analysis is mainly based on database query operation. For visualization, a multiple 3D city representation structure CityTree is implemented within the framework to support dynamic LODs based on user viewpoint. Also, the proposed framework is easily extensible and supports geoindexes to speed up the querying. Our experimental results show that the proposed 3D city management system can efficiently fulfil the analysis and visualization requirements.
Evaluating a NoSQL Alternative for Chilean Virtual Observatory Services
NASA Astrophysics Data System (ADS)
Antognini, J.; Araya, M.; Solar, M.; Valenzuela, C.; Lira, F.
2015-09-01
Currently, the standards and protocols for data access in the Virtual Observatory architecture (DAL) are generally implemented with relational databases based on SQL. In particular, the Astronomical Data Query Language (ADQL), the language used by IVOA to represent queries to VO services, was created to satisfy the different data access protocols, such as Simple Cone Search. ADQL is based on SQL92, and has extra functionality implemented using PgSphere. An emerging alternative to SQL is the family of so-called NoSQL databases, which can be classified in several categories such as Column, Document, Key-Value, Graph, Object, etc., each one recommended for different scenarios. Their notable characteristics include: schema-free, easy replication support, simple API, Big Data, etc. The Chilean Virtual Observatory (ChiVO) is developing a functional prototype based on the IVOA architecture, with the following relevant factors: Performance, Scalability, Flexibility, Complexity, and Functionality. Currently, it is very difficult to compare these factors, due to a lack of alternatives. The objective of this paper is to compare NoSQL alternatives with SQL through the implementation of a RESTful Web API that satisfies ChiVO's needs: a SESAME-style name resolver for the data from ALMA. Therefore, we propose a test scenario by configuring a NoSQL database with data from different sources and evaluating the feasibility of creating a Simple Cone Search service and its performance. This comparison will pave the way for the application of Big Data databases in the Virtual Observatory.
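The core of a Simple Cone Search, selecting catalog entries within an angular radius of a sky position, can be sketched independently of the storage back end. The snippet below is plain Python rather than ADQL, PgSphere or any specific NoSQL API, and the three-entry catalog is hypothetical; it only illustrates the filter that such a service has to evaluate.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions
    given in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return names of entries whose position lies inside the cone."""
    return [name for name, src_ra, src_dec in catalog
            if angular_separation_deg(ra, dec, src_ra, src_dec) <= radius_deg]

# Hypothetical catalog rows: (name, right ascension, declination) in degrees.
catalog = [
    ("src_a", 10.0, -30.0),
    ("src_b", 10.5, -30.0),
    ("src_c", 50.0, 20.0),
]

narrow = cone_search(catalog, 10.0, -30.0, 0.1)
wide = cone_search(catalog, 10.0, -30.0, 0.5)
print(narrow, wide)
```

A linear scan like this does not scale; in practice either store would need a spatial index, and comparing how each back end supports one is part of what such an evaluation has to address.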
NASA Technical Reports Server (NTRS)
Alfaro, Victor O.; Casey, Nancy J.
2005-01-01
SQL-RAMS (where "SQL" signifies Structured Query Language and "RAMS" signifies Rocketdyne Automated Management System) is a successor to the legacy version of RAMS -- a computer program used to manage all work, nonconformance, corrective action, and configuration management on rocket engines and ground support equipment at Stennis Space Center. The legacy version resided in the FileMaker Pro software system and was constructed in modules that could act as standalone programs. There was little or no integration among modules. Because of limitations on file-management capabilities in FileMaker Pro, and because of difficulty of integration of FileMaker Pro with other software systems for exchange of data using such industry standards as SQL, the legacy version of RAMS proved to be limited, and working to circumvent its limitations too time-consuming. In contrast, SQL-RAMS is an integrated SQL-server-based program that supports all data-exchange software industry standards. Whereas in the legacy version, it was necessary to access individual modules to gain insight into a particular work-status document, SQL-RAMS provides access through a single-screen presentation of core modules. In addition, SQL-RAMS enables rapid and efficient filtering of displayed statuses by predefined categories and test numbers. SQL-RAMS is rich in functionality and encompasses significant improvements over the legacy system. It provides users the ability to perform many tasks that in the past required administrator intervention. Additionally, many of the design limitations have been corrected, allowing for a robust application that is user centric.
NASA Technical Reports Server (NTRS)
Alfaro, Victor O.; Casey, Nancy J.
2005-01-01
SQL-RAMS (where "SQL" signifies Structured Query Language and "RAMS" signifies Rocketdyne Automated Management System) is a successor to the legacy version of RAMS -- a computer program used to manage all work, nonconformance, corrective action, and configuration management on rocket engines and ground support equipment at Stennis Space Center. The legacy version resided in the FileMaker Pro software system and was constructed in modules that could act as stand-alone programs. There was little or no integration among modules. Because of limitations on file-management capabilities in FileMaker Pro, and because of difficulty of integration of FileMaker Pro with other software systems for exchange of data using such industry standards as SQL, the legacy version of RAMS proved to be limited, and working to circumvent its limitations too time-consuming. In contrast, SQL-RAMS is an integrated SQL-server-based program that supports all data-exchange software industry standards. Whereas in the legacy version, it was necessary to access individual modules to gain insight into a particular work-status document, SQL-RAMS provides access through a single-screen presentation of core modules. In addition, SQL-RAMS enables rapid and efficient filtering of displayed statuses by predefined categories and test numbers. SQL-RAMS is rich in functionality and encompasses significant improvements over the legacy system. It provides users the ability to perform many tasks that in the past required administrator intervention. Additionally, many of the design limitations have been corrected, allowing for a robust application that is user centric.
Methods to Secure Databases Against Vulnerabilities
2015-12-01
for several languages such as C, C++, PHP, Java and Python [16]. MySQL will work well with very large databases. The documentation references...using Eclipse and connected to each database management system using Python and Java drivers provided by MySQL , MongoDB, and Datastax (for Cassandra...tiers in Python and Java . Problem MySQL MongoDB Cassandra 1. Injection a. Tautologies Vulnerable Vulnerable Not Vulnerable b. Illegal query
PACSY, a relational database management system for protein structure and chemical shift analysis.
Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo; Lee, Weontae; Markley, John L
2012-10-01
PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu.
Zhu, Xinjie; Zhang, Qiang; Ho, Eric Dun; Yu, Ken Hung-On; Liu, Chris; Huang, Tim H; Cheng, Alfred Sze-Lok; Kao, Ben; Lo, Eric; Yip, Kevin Y
2017-09-22
A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions. Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/ ), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples. Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily.
The Armed Forces Casualty Assistance Readiness Enhancement System (CARES): Design for Flexibility
2006-06-01
Special Form SQL Structured Query Language SSA Social Security Administration U USMA United States Military Academy V VB Visual Basic VBA Visual Basic for...of Abbreviations ................................................................... 26 Appendix B: Key VBA Macros and MS Excel Coding...internet portal, CARES Version 1.0 is a MS Excel spreadsheet application that contains a considerable number of Visual Basic for Applications ( VBA
CGDM: collaborative genomic data model for molecular profiling data using NoSQL.
Wang, Shicai; Mares, Mihaela A; Guo, Yi-Ke
2016-12-01
High-throughput molecular profiling has greatly improved patient stratification and mechanistic understanding of diseases. With the increasing amount of data used in translational medicine studies in recent years, there is a need to improve the performance of data warehouses in terms of data retrieval and statistical processing. Both relational and Key Value models have been used for managing molecular profiling data. Key Value models such as SeqWare have been shown to be particularly advantageous in terms of query processing speed for large datasets. However, more improvement can be achieved, particularly through better indexing techniques of the Key Value models, taking advantage of the types of queries which are specific for the high-throughput molecular profiling data. In this article, we introduce a Collaborative Genomic Data Model (CGDM), aimed at significantly increasing the query processing speed for the main classes of queries on genomic databases. CGDM creates three Collaborative Global Clustering Index Tables (CGCITs) to solve the velocity and variety issues at the cost of limited extra volume. Several benchmarking experiments were carried out, comparing CGDM implemented on HBase to the traditional SQL data model (TDM) implemented on both HBase and MySQL Cluster, using large publicly available molecular profiling datasets taken from NCBI and HapMap. In the microarray case, CGDM on HBase performed up to 246 times faster than TDM on HBase and 7 times faster than TDM on MySQL Cluster. In the single nucleotide polymorphism case, CGDM on HBase outperformed TDM on HBase by up to 351 times and TDM on MySQL Cluster by up to 9 times. The CGDM source code is available at https://github.com/evanswang/CGDM.
ExplorEnz: a MySQL database of the IUBMB enzyme nomenclature
McDonald, Andrew G; Boyce, Sinéad; Moss, Gerard P; Dixon, Henry BF; Tipton, Keith F
2007-01-01
Background: We describe the database ExplorEnz, which is the primary repository for EC numbers and enzyme data that are being curated on behalf of the IUBMB. The enzyme nomenclature is incorporated into many other resources, including the ExPASy-ENZYME, BRENDA and KEGG bioinformatics databases. Description: The data, which are stored in a MySQL database, preserve the formatting of chemical and enzyme names. A simple, easy to use, web-based query interface is provided, along with an advanced search engine for more complex queries. The database is publicly available at http://www.enzyme-database.org. The data are available for download as SQL and XML files via FTP. Conclusion: ExplorEnz has powerful and flexible search capabilities and provides the scientific community with the most up-to-date version of the IUBMB Enzyme List. PMID:17662133
ExplorEnz: a MySQL database of the IUBMB enzyme nomenclature.
McDonald, Andrew G; Boyce, Sinéad; Moss, Gerard P; Dixon, Henry B F; Tipton, Keith F
2007-07-27
We describe the database ExplorEnz, which is the primary repository for EC numbers and enzyme data that are being curated on behalf of the IUBMB. The enzyme nomenclature is incorporated into many other resources, including the ExPASy-ENZYME, BRENDA and KEGG bioinformatics databases. The data, which are stored in a MySQL database, preserve the formatting of chemical and enzyme names. A simple, easy to use, web-based query interface is provided, along with an advanced search engine for more complex queries. The database is publicly available at http://www.enzyme-database.org. The data are available for download as SQL and XML files via FTP. ExplorEnz has powerful and flexible search capabilities and provides the scientific community with the most up-to-date version of the IUBMB Enzyme List.
Harris, Daniel R.; Henderson, Darren W.; Kavuluru, Ramakanth; Stromberg, Arnold J.; Johnson, Todd R.
2015-01-01
We present a custom, Boolean query generator utilizing common-table expressions (CTEs) that is capable of scaling with big datasets. The generator maps user-defined Boolean queries, such as those interactively created in clinical-research and general-purpose healthcare tools, into SQL. We demonstrate the effectiveness of this generator by integrating our work into the Informatics for Integrating Biology and the Bedside (i2b2) query tool and show that it is capable of scaling. Our custom generator replaces and outperforms the default query generator found within the Clinical Research Chart (CRC) cell of i2b2. In our experiments, sixteen different types of i2b2 queries were identified by varying four constraints: date, frequency, exclusion criteria, and whether selected concepts occurred in the same encounter. We generated non-trivial, random Boolean queries based on these 16 types; the corresponding SQL queries produced by both generators were compared by execution times. The CTE-based solution significantly outperformed the default query generator and provided a much more consistent response time across all query types (M=2.03, SD=6.64 vs. M=75.82, SD=238.88 seconds). Without costly hardware upgrades, we provide a scalable solution based on CTEs with very promising empirical results centered on performance gains. The evaluation methodology used for this provides a means of profiling clinical data warehouse performance. PMID:25192572
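Mapping a user-defined Boolean query onto common-table expressions can be sketched as below. This is not the generator described above nor the i2b2 schema: the `facts` table, the concept codes and the `build_cte_query` helper are hypothetical, the date, frequency and same-encounter constraints are left out, and SQLite is used only because it supports `WITH`, `INTERSECT` and `EXCEPT` out of the box.

```python
import sqlite3

def build_cte_query(groups, excluded=()):
    """Build SQL for: (OR within each group) AND across groups, minus
    patients having any excluded concept. `groups` must be non-empty.
    Returns (sql, params)."""
    ctes, params, names = [], [], []
    for i, codes in enumerate(groups):
        marks = ", ".join("?" for _ in codes)
        ctes.append(f"g{i} AS (SELECT DISTINCT patient_id FROM facts "
                    f"WHERE concept IN ({marks}))")
        params.extend(codes)
        names.append(f"g{i}")
    # One CTE per group; intersecting them implements the AND.
    body = " INTERSECT ".join(f"SELECT patient_id FROM {n}" for n in names)
    if excluded:
        marks = ", ".join("?" for _ in excluded)
        ctes.append("excl AS (SELECT DISTINCT patient_id FROM facts "
                    f"WHERE concept IN ({marks}))")
        params.extend(excluded)
        body += " EXCEPT SELECT patient_id FROM excl"
    return f"WITH {', '.join(ctes)} {body} ORDER BY patient_id", params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (patient_id INTEGER, concept TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?)", [
    (1, "diabetes"), (1, "hypertension"),
    (2, "diabetes"),
    (3, "diabetes"), (3, "hypertension"), (3, "cancer"),
    (4, "prediabetes"), (4, "hypertension"),
])

# (diabetes OR prediabetes) AND hypertension, excluding cancer.
sql, params = build_cte_query(
    [["diabetes", "prediabetes"], ["hypertension"]], excluded=["cancer"])
patients = [row[0] for row in conn.execute(sql, params)]
print(patients)
```

Each Boolean group becomes one named CTE, so the generated statement stays flat and readable however many groups the user adds.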
NoSQL for Storage and Retrieval of Large LIDAR Data Collections
NASA Astrophysics Data System (ADS)
Boehm, J.; Liu, K.
2015-08-01
Developments in LiDAR technology over the past decades have made LiDAR a mature and widely accepted source of geospatial information. This in turn has led to an enormous growth in data volume. The central idea for a file-centric storage of LiDAR point clouds is the observation that large collections of LiDAR data are typically delivered as large collections of files, rather than single files of terabyte size. This split of the dataset, commonly referred to as tiling, was usually done to accommodate a specific processing pipeline. It therefore makes sense to preserve this split. A document-oriented NoSQL database can easily emulate this data partitioning by representing each tile (file) in a separate document. The document stores the metadata of the tile. The actual files are stored in a distributed file system emulated by the NoSQL database. We demonstrate the use of MongoDB, a highly scalable document-oriented NoSQL database, for storing large LiDAR files. MongoDB, like any NoSQL database, allows for queries on the attributes of the document. As a specialty, MongoDB also allows spatial queries. Hence we can perform spatial queries on the bounding boxes of the LiDAR tiles. Inserting and retrieving files on a cloud-based database is compared to native file system and cloud storage transfer speed.
PACSY, a relational database management system for protein structure and chemical shift analysis
Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo
2012-01-01
PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu. PMID:22903636
Comparing IndexedHBase and Riak for Serving Truthy: Performance of Data Loading and Query Evaluation
2013-08-01
Research Triangle Park, NC 27709-2211 15. SUBJECT TERMS performance evaluation, distributed database, noSQL , HBase, indexing Xiaoming Gao, Judy Qiu...common hashtags created during a given time window. With the purpose of finding a solution for these challenges, we evaluate NoSQL databases such as
Access Based Cost Estimation for Beddown Analysis
2006-03-23
logic. This research expands upon the existing research by using Visual Basic for Applications ( VBA ) to further customize and streamline the...methods with the use of VBA . Calculations are completed in either underlying Form VBA code or through global modules accessible throughout the...query and SQL referencing. Attempts were made where possible to align data structures with possible external sources to minimize import errors and
An Ada/SQL (Structured Query Language) Application Scanner.
1988-03-01
Digital ...8217 (" DIGITS "), 46 new STRING’ ("DO"), new STRING’ ("ELSE"), new STRING’ ("ELSIF"), new STRING’ ("END"), new STRING’ ("ENTRY"), new STRING’ ("EXCEPTION...INTEGERPRINT; generic type NUM is digits <>; package FLOATPRINT is package txtprts.ada 18 prcdr PR (FL inFL %YE LINE n LINTYPE UNCLASSIFIED procedure
ERIC Educational Resources Information Center
Sanchez, Pablo; Zorrilla, Marta; Duque, Rafael; Nieto-Reyes, Alicia
2011-01-01
Models in Software Engineering are considered as abstract representations of software systems. Models highlight relevant details for a certain purpose, whereas irrelevant ones are hidden. Models are supposed to make system comprehension easier by reducing complexity. Therefore, models should play a key role in education, since they would ease the…
Digitizing Consumption Across the Operational Spectrum
2014-09-01
Figure 14. Java -implemented Dictionary and Query: Result ............................................22 Figure 15. Global Database Architecture...format. Figure 14 is an illustration of the query submitted in Java and the result which would be shown using the data shown in Figure 13. Figure...13. NoSQL (key, value) Dictionary Example 22 Figure 14. Java -implemented Dictionary and Query: Result While a
Database Reports Over the Internet
NASA Technical Reports Server (NTRS)
Smith, Dean Lance
2002-01-01
Most of the summer was spent developing software that would permit existing test report forms to be printed over the web on a printer that is supported by Adobe Acrobat Reader. The data is stored in a DBMS (Data Base Management System). The client asks for the information from the database using an HTML (Hyper Text Markup Language) form in a web browser. JavaScript is used with the forms to assist the user and verify the integrity of the entered data. Queries to a database are made in SQL (Structured Query Language), a widely supported standard for making queries to databases. Java servlets, programs written in the Java programming language running under the control of network server software, interrogate the database and complete a PDF form template kept in a file. The completed report is sent to the browser requesting the report. Some errors are sent to the browser in an HTML web page, others are reported to the server. Access to the databases was restricted since the data are being transported to new DBMS software that will run on new hardware. However, the SQL queries were made to Microsoft Access, a DBMS that is available on most PCs (Personal Computers). Access does support the SQL commands that were used, and a database was created with Access that contained typical data for the report forms. Some of the problems and features are discussed below.
Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data
NASA Astrophysics Data System (ADS)
Zhang, Y.; Scheers, B.; Kersten, M.; Ivanova, M.; Nes, N.
2012-09-01
SciQL (pronounced as ‘cycle’) is a novel SQL-based array query language for scientific applications with both tables and arrays as first class citizens. SciQL lowers the entrance fee of adopting relational DBMS (RDBMS) in scientific domains, because it includes functionality often only found in mathematics software packages. In this paper, we demonstrate the usefulness of SciQL for astronomical data processing using examples from the Transient Key Project of the LOFAR radio telescope. In particular, we show how the LOFAR light-curve database of all detected sources can be constructed by correlating sources across the spatial, frequency, time and polarisation domains.
Systematic Assessment of the Impact of User Roles on Network Flow Patterns
2017-09-01
Protocol SNMP Simple Network Management Protocol SQL Structured Query Language SSH Secure Shell SYN TCP Sync Flag SVDD Support Vector Data Description SVM...and evaluating users based on roles provide the best approach for defining normal digital behaviors? People are individuals, with different interests...activities on the network. We evaluate the assumption that users sharing similar roles exhibit similar network behaviors, and contrast the level of similarity
2017-01-01
Reusing the data from healthcare information systems can effectively facilitate clinical trials (CTs). How to select candidate patients eligible for CT recruitment criteria is a central task. Related work either depends on DBA (database administrator) to convert the recruitment criteria to native SQL queries or involves the data mapping between a standard ontology/information model and individual data source schema. This paper proposes an alternative computer-aided CT recruitment paradigm, based on syntax translation between different DSLs (domain-specific languages). In this paradigm, the CT recruitment criteria are first formally represented as production rules. The referenced rule variables are all from the underlying database schema. Then the production rule is translated to an intermediate query-oriented DSL (e.g., LINQ). Finally, the intermediate DSL is directly mapped to native database queries (e.g., SQL) automated by ORM (object-relational mapping). PMID:29065644
GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus
Zhu, Yuelin; Davis, Sean; Stephens, Robert; Meltzer, Paul S.; Chen, Yidong
2008-01-01
The NCBI Gene Expression Omnibus (GEO) represents the largest public repository of microarray data. However, finding data in GEO can be challenging. We have developed GEOmetadb in an attempt to make querying the GEO metadata both easier and more powerful. All GEO metadata records as well as the relationships between them are parsed and stored in a local MySQL database. A powerful, flexible web search interface with several convenient utilities provides query capabilities not available via NCBI tools. In addition, a Bioconductor package, GEOmetadb, that utilizes a SQLite export of the entire GEOmetadb database is also available, rendering the entire GEO database accessible with the full power of SQL-based queries from within R. Availability: The web interface and SQLite databases are available at http://gbnci.abcc.ncifcrf.gov/geo/. The Bioconductor package is available via the Bioconductor project. The corresponding MATLAB implementation is also available at the same website. Contact: yidong@mail.nih.gov PMID:18842599
Zhang, Yinsheng; Zhang, Guoming; Shang, Qian
2017-01-01
Reusing the data from healthcare information systems can effectively facilitate clinical trials (CTs). How to select candidate patients eligible for CT recruitment criteria is a central task. Related work either depends on DBA (database administrator) to convert the recruitment criteria to native SQL queries or involves the data mapping between a standard ontology/information model and individual data source schema. This paper proposes an alternative computer-aided CT recruitment paradigm, based on syntax translation between different DSLs (domain-specific languages). In this paradigm, the CT recruitment criteria are first formally represented as production rules. The referenced rule variables are all from the underlying database schema. Then the production rule is translated to an intermediate query-oriented DSL (e.g., LINQ). Finally, the intermediate DSL is directly mapped to native database queries (e.g., SQL) automated by ORM (object-relational mapping).
SiC: An Agent Based Architecture for Preventing and Detecting Attacks to Ubiquitous Databases
NASA Astrophysics Data System (ADS)
Pinzón, Cristian; de Paz, Yanira; Bajo, Javier; Abraham, Ajith; Corchado, Juan M.
One of the main attacks on ubiquitous databases is the structured query language (SQL) injection attack, which causes severe damage both commercially and to the user’s confidence. This chapter proposes the SiC architecture as a solution to the SQL injection attack problem. This is a hierarchical distributed multiagent architecture, which involves an entirely new approach with respect to existing architectures for the prevention and detection of SQL injections. SiC incorporates a kind of intelligent agent, which integrates a case-based reasoning system. This agent, which is the core of the architecture, allows the application of detection techniques based on anomalies as well as those based on patterns, providing a great degree of autonomy, flexibility, robustness and dynamic scalability. The characteristics of the multiagent system allow the architecture to detect attacks from different types of devices, regardless of the physical location. The architecture has been tested on a medical database, guaranteeing safe access from various devices such as PDAs and notebook computers.
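To make the attack class concrete, the Python sketch below shows a query built by string concatenation being subverted by a classic tautology payload, together with a deliberately naive pattern-based check. It is not the SiC architecture and says nothing about its agents or case-based reasoning; the table, the data, and the regular expression are illustrative assumptions, and a single regex is far weaker than the anomaly- and pattern-based detection the chapter describes.

```python
import re
import sqlite3

# In-memory database with made-up records (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, record TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("ana", "record-1"), ("ben", "record-2")],
)


def lookup_unsafe(name):
    """Vulnerable on purpose: user input is concatenated into the SQL text."""
    sql = "SELECT name FROM patients WHERE name = '" + name + "'"
    return conn.execute(sql).fetchall()


# Naive pattern for quote-delimited tautologies such as ' OR '1'='1
TAUTOLOGY = re.compile(r"'\s*or\s*'[^']*'\s*=\s*'[^']*", re.IGNORECASE)


def looks_like_injection(text):
    """Return True if the input matches the single illustrative pattern."""
    return bool(TAUTOLOGY.search(text))


attack = "nobody' OR '1'='1"
leaked = lookup_unsafe(attack)  # the WHERE clause becomes always true
```

Here `leaked` contains every row in the table even though no patient is named "nobody", while `looks_like_injection(attack)` flags the payload before it reaches the database.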
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Wu, Kesheng
The Resource Description Framework (RDF) is a popular data model for representing linked data sets arising from the web, as well as large scientific data repositories such as UniProt. RDF data intrinsically represents a labeled and directed multi-graph. SPARQL is a query language for RDF that expresses subgraph pattern-finding queries on this implicit multigraph in a SQL-like syntax. SPARQL queries generate complex intermediate join queries; to compute these joins efficiently, we propose a new strategy based on bitmap indexes. We store the RDF data in column-oriented structures as compressed bitmaps along with two dictionaries. This paper makes three new contributions. (i) We present an efficient parallel strategy for parsing the raw RDF data, building dictionaries of unique entities, and creating compressed bitmap indexes of the data. (ii) We utilize the constructed bitmap indexes to efficiently answer SPARQL queries, simplifying the join evaluations. (iii) To quantify the performance impact of using bitmap indexes, we compare our approach to the state-of-the-art triple-store RDF-3X. We find that our bitmap index-based approach to answering queries is up to an order of magnitude faster for a variety of SPARQL queries, on gigascale RDF data sets.
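The core idea of answering a join with bitmaps can be sketched in Python. This is a toy illustration, not the authors' implementation: it uses uncompressed Python integers as bitmaps, a single subject dictionary rather than the two dictionaries the abstract mentions, and made-up triples. Each (predicate, object) pair gets a bitmap with one bit per subject, so a join of triple patterns that share the subject variable reduces to a bitwise AND.

```python
from collections import defaultdict


def build_index(triples):
    """Dictionary-encode subjects and build one bitmap per (predicate, object)."""
    subject_ids = {}
    subjects = []
    bitmaps = defaultdict(int)
    for s, p, o in triples:
        if s not in subject_ids:
            subject_ids[s] = len(subjects)
            subjects.append(s)
        bitmaps[(p, o)] |= 1 << subject_ids[s]  # set the bit for this subject
    return subjects, bitmaps


def subjects_matching(subjects, bitmaps, patterns):
    """Subjects satisfying every (predicate, object) pattern, via bitwise AND."""
    if not patterns:
        return []
    result = (1 << len(subjects)) - 1  # start with all bits set
    for pattern in patterns:
        result &= bitmaps.get(pattern, 0)
    return [s for i, s in enumerate(subjects) if (result >> i) & 1]


# Made-up data for illustration.
triples = [
    ("alice", "type", "Person"), ("alice", "worksAt", "lab_a"),
    ("bob", "type", "Person"), ("bob", "worksAt", "lab_b"),
    ("carol", "type", "Person"), ("carol", "worksAt", "lab_a"),
]
subjects, bitmaps = build_index(triples)
matches = subjects_matching(
    subjects, bitmaps, [("type", "Person"), ("worksAt", "lab_a")]
)
```

The two patterns correspond roughly to a SPARQL query of the form `?x type Person . ?x worksAt lab_a`, and `matches` lists the subjects that satisfy both.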
NASA Astrophysics Data System (ADS)
Vacca, G.; Pili, D.; Fiorino, D. R.; Pintus, V.
2017-05-01
The presented work is part of the research project, titled "Tecniche murarie tradizionali: conoscenza per la conservazione ed il miglioramento prestazionale" (Traditional building techniques: from knowledge to conservation and performance improvement), with the purpose of studying the building techniques of the 13th-18th centuries in the Sardinia Region (Italy) for their knowledge, conservation, and promotion. The end purpose of the entire study is to improve the performance of the examined structures. In particular, the task of the authors within the research project was to build a WebGIS to manage the data collected during the examination and study phases. This infrastructure was entirely built using Open Source software. The work consisted of designing a database built in PostgreSQL and its spatial extension PostGIS, which allows feature geometries and spatial data to be stored and managed. The data input is performed via a form built in HTML and PHP. The HTML part is based on Bootstrap, an open tools library for websites and web applications. The implementation of this template used both PHP and JavaScript code. The PHP code manages the reading and writing of data to the database, using embedded SQL queries. As of today, we have surveyed and archived more than 300 buildings, belonging to three main macro categories: fortification architectures, religious architectures, residential architectures. More than 150 masonry samples have been investigated in relation to the construction techniques. The database is published on the Internet as a WebGIS built using the Leaflet JavaScript open libraries, which allow the creation of map sites with background maps and navigation, input and query tools. This too uses an interaction of HTML, JavaScript, PHP and SQL code.
Software Engineering Laboratory (SEL) database organization and user's guide, revision 2
NASA Technical Reports Server (NTRS)
Morusiewicz, Linda; Bristow, John
1992-01-01
The organization of the Software Engineering Laboratory (SEL) database is presented. Included are definitions and detailed descriptions of the database tables and views, the SEL data, and system support data. The mapping from the SEL and system support data to the base table is described. In addition, techniques for accessing the database through the Database Access Manager for the SEL (DAMSEL) system and via the ORACLE structured query language (SQL) are discussed.
Software Engineering Laboratory (SEL) database organization and user's guide
NASA Technical Reports Server (NTRS)
So, Maria; Heller, Gerard; Steinberg, Sandra; Spiegel, Douglas
1989-01-01
The organization of the Software Engineering Laboratory (SEL) database is presented. Included are definitions and detailed descriptions of the database tables and views, the SEL data, and system support data. The mapping from the SEL and system support data to the base tables is described. In addition, techniques for accessing the database, through the Database Access Manager for the SEL (DAMSEL) system and via the ORACLE structured query language (SQL), are discussed.
Chen, R S; Nadkarni, P; Marenco, L; Levin, F; Erdos, J; Miller, P L
2000-01-01
The entity-attribute-value representation with classes and relationships (EAV/CR) provides a flexible and simple database schema to store heterogeneous biomedical data. In certain circumstances, however, the EAV/CR model is known to retrieve data less efficiently than conventionally based database schemas. To perform a pilot study that systematically quantifies performance differences for database queries directed at real-world microbiology data modeled with EAV/CR and conventional representations, and to explore the relative merits of different EAV/CR query implementation strategies. Clinical microbiology data obtained over a ten-year period were stored using both database models. Query execution times were compared for four clinically oriented attribute-centered and entity-centered queries operating under varying conditions of database size and system memory. The performance characteristics of three different EAV/CR query strategies were also examined. Performance was similar for entity-centered queries in the two database models. Performance in the EAV/CR model was approximately three to five times less efficient than its conventional counterpart for attribute-centered queries. The differences in query efficiency became slightly greater as database size increased, although they were reduced with the addition of system memory. The authors found that EAV/CR queries formulated using multiple, simple SQL statements executed in batch were more efficient than single, large SQL statements. This paper describes a pilot project to explore issues in and compare query performance for EAV/CR and conventional database representations. Although attribute-centered queries were less efficient in the EAV/CR model, these inefficiencies may be addressable, at least in part, by the use of more powerful hardware or more memory, or both.
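The difference between the two schema styles, and between one large statement and several simple ones, can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration under stated assumptions: it uses a plain EAV table without the classes and relationships of EAV/CR, the table and column names and the three sample rows are made up, and combining the simple statements by intersecting result sets in the client is only one possible reading of "multiple, simple SQL statements executed in batch". It shows query shape, not the timing differences the study measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE culture (id INTEGER PRIMARY KEY, organism TEXT, specimen TEXT);
CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT);
""")

# The same made-up facts stored both ways.
rows = [(1, "E. coli", "urine"), (2, "S. aureus", "blood"), (3, "E. coli", "blood")]
conn.executemany("INSERT INTO culture VALUES (?, ?, ?)", rows)
for entity_id, organism, specimen in rows:
    conn.executemany(
        "INSERT INTO eav VALUES (?, ?, ?)",
        [(entity_id, "organism", organism), (entity_id, "specimen", specimen)],
    )

# Conventional schema: one row per entity, attributes are columns.
conventional = conn.execute(
    "SELECT id FROM culture WHERE organism = ? AND specimen = ? ORDER BY id",
    ("E. coli", "blood"),
).fetchall()

# EAV, single large statement: one self-join per attribute tested.
eav_single = conn.execute(
    """SELECT a.entity_id FROM eav a
       JOIN eav b ON a.entity_id = b.entity_id
       WHERE a.attribute = 'organism' AND a.value = ?
         AND b.attribute = 'specimen' AND b.value = ?
       ORDER BY a.entity_id""",
    ("E. coli", "blood"),
).fetchall()

# EAV, several simple statements whose results are combined afterwards.
with_organism = {r[0] for r in conn.execute(
    "SELECT entity_id FROM eav WHERE attribute = 'organism' AND value = ?",
    ("E. coli",))}
with_specimen = {r[0] for r in conn.execute(
    "SELECT entity_id FROM eav WHERE attribute = 'specimen' AND value = ?",
    ("blood",))}
eav_batched = sorted(with_organism & with_specimen)
```

All three approaches identify the same entity; what changes is how much join work the database has to do for each attribute that the query tests.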
ClimateSpark: An in-memory distributed computing framework for big climate data analytics
NASA Astrophysics Data System (ADS)
Hu, Fei; Yang, Chaowei; Schnase, John L.; Duffy, Daniel Q.; Xu, Mengchao; Bowen, Michael K.; Lee, Tsengdar; Song, Weiwei
2018-06-01
The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. Chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multiple-dimensional, array-based datasets in various geoscience domains.
An integrated information retrieval and document management system
NASA Technical Reports Server (NTRS)
Coles, L. Stephen; Alvarez, J. Fernando; Chen, James; Chen, William; Cheung, Lai-Mei; Clancy, Susan; Wong, Alexis
1993-01-01
This paper describes the requirements and prototype development for an intelligent document management and information retrieval system that will be capable of handling millions of pages of text or other data. Technologies for scanning, Optical Character Recognition (OCR), magneto-optical storage, and multiplatform retrieval using Structured Query Language (SQL) will be discussed. The semantic ambiguity inherent in the English language is somewhat compensated for through the use of coefficients or weighting factors for partial synonyms. Such coefficients are used both for defining structured query trees for routine queries and for establishing long-term interest profiles that can be used on a regular basis to alert individual users to the presence of relevant documents that may have just arrived from an external source, such as a news wire service. Although this attempt at evidential reasoning is limited in comparison with the latest developments in AI Expert Systems technology, it has the advantage of being commercially available.
Development of a replicated database of DHCP data for evaluation of drug use.
Graber, S E; Seneker, J A; Stahl, A A; Franklin, K O; Neel, T E; Miller, R A
1996-01-01
This case report describes development and testing of a method to extract clinical information stored in the Veterans Affairs (VA) Decentralized Hospital Computer System (DHCP) for the purpose of analyzing data about groups of patients. The authors used a microcomputer-based, structured query language (SQL)-compatible, relational database system to replicate a subset of the Nashville VA Hospital's DHCP patient database. This replicated database contained the complete current Nashville DHCP prescription, provider, patient, and drug data sets, and a subset of the laboratory data. A pilot project employed this replicated database to answer questions that might arise in drug-use evaluation, such as identification of cases of polypharmacy, suboptimal drug regimens, and inadequate laboratory monitoring of drug therapy. These database queries included as candidates for review all prescriptions for all outpatients. The queries demonstrated that specific drug-use events could be identified for any time interval represented in the replicated database. PMID:8653451
WE-E-BRB-11: Riview a Web-Based Viewer for Radiotherapy.
Apte, A; Wang, Y; Deasy, J
2012-06-01
Collaborations involving radiotherapy data collection, such as the recently proposed international radiogenomics consortium, require robust, web-based tools to facilitate reviewing treatment planning information. We present the architecture and prototype characteristics for a web-based radiotherapy viewer. The web-based environment developed in this work consists of the following components: 1) Import of DICOM/RTOG data: CERR was leveraged to import DICOM/RTOG data and to convert to database friendly RT objects. 2) Extraction and Storage of RT objects: The scan and dose distributions were stored as .png files per slice and view plane. The file locations were written to the MySQL database. Structure contours and DVH curves were written to the database as numeric data. 3) Web interfaces to query, retrieve and visualize the RT objects: The Web application was developed using HTML 5 and Ruby on Rails (RoR) technology following the MVC philosophy. The open source ImageMagick library was utilized to overlay scan, dose and structures. The application allows users to (i) QA the treatment plans associated with a study, (ii) Query and Retrieve patients matching anonymized ID and study, (iii) Review up to 4 plans simultaneously in 4 window panes (iv) Plot DVH curves for the selected structures and dose distributions. A subset of data for lung cancer patients was used to prototype the system. Five user accounts were created to have access to this study. The scans, doses, structures and DVHs for 10 patients were made available via the web application. A web-based system to facilitate QA, and support Query, Retrieve and the Visualization of RT data was prototyped. The RIVIEW system was developed using open source and free technology like MySQL and RoR. We plan to extend the RIVIEW system further to be useful in clinical trial data collection, outcomes research, cohort plan review and evaluation. © 2012 American Association of Physicists in Medicine.
Enhancing SAMOS Data Access in DOMS via a Neo4j Property Graph Database.
NASA Astrophysics Data System (ADS)
Stallard, A. P.; Smith, S. R.; Elya, J. L.
2016-12-01
The Shipboard Automated Meteorological and Oceanographic System (SAMOS) initiative provides routine access to high-quality marine meteorological and near-surface oceanographic observations from research vessels. The Distributed Oceanographic Match-Up Service (DOMS) under development is a centralized service that allows researchers to easily match in situ and satellite oceanographic data from distributed sources to facilitate satellite calibration, validation, and retrieval algorithm development. The service currently uses Apache Solr as a backend search engine on each node in the distributed network. While Solr is a high-performance solution that facilitates creation and maintenance of indexed data, it is limited in the sense that its schema is fixed. The property graph model escapes this limitation by creating relationships between data objects. The authors will present the development of the SAMOS Neo4j property graph database including new search possibilities that take advantage of the property graph model, performance comparisons with Apache Solr, and a vision for graph databases as a storage tool for oceanographic data. The integration of the SAMOS Neo4j graph into DOMS will also be described. Currently, Neo4j contains spatial and temporal records from SAMOS which are modeled into a time tree and r-tree using Graph Aware and Spatial plugin tools for Neo4j. These extensions provide callable Java procedures within CYPHER (Neo4j's query language) that generate in-graph structures. Once generated, these structures can be queried using procedures from these libraries, or directly via CYPHER statements. Neo4j excels at performing relationship and path-based queries, which challenge relational-SQL databases because they require memory intensive joins due to the limitation of their design. Consider a user who wants to find records over several years, but only for specific months. 
If a traditional database only stores timestamps, this type of query would be complex and likely prohibitively slow. Using the time tree model, one can specify a path from the root to the data which restricts resolutions to certain timeframes (e.g., months). This query can be executed without joins, unions, or other compute-intensive operations, putting Neo4j at a computational advantage to the SQL database alternative.
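The "specific months across several years" query described above can be sketched with a small in-memory time tree in Python. This is a plain nested-dictionary stand-in, not Neo4j, Cypher, or the GraphAware and Spatial plugin procedures; the record payloads are made up. The point is only that once records hang off year and month nodes, the query becomes a walk along selected paths instead of a scan over every timestamp.

```python
from collections import defaultdict
from datetime import date


def build_time_tree(records):
    """Index (date, payload) pairs under root -> year -> month."""
    tree = defaultdict(lambda: defaultdict(list))
    for day, payload in records:
        tree[day.year][day.month].append(payload)
    return tree


def records_for_months(tree, years, months):
    """Follow only the year/month paths of interest; other branches are never visited."""
    found = []
    for year in years:
        months_of_year = tree.get(year, {})
        for month in months:
            found.extend(months_of_year.get(month, []))
    return found


# Made-up observations for illustration.
observations = [
    (date(2014, 1, 5), "obs-a"),
    (date(2014, 7, 1), "obs-b"),
    (date(2015, 1, 9), "obs-c"),
    (date(2016, 2, 2), "obs-d"),
]
tree = build_time_tree(observations)
januaries = records_for_months(tree, range(2014, 2017), [1])
```

In this sketch `januaries` holds the January records from 2014 through 2016, found without inspecting the July or February branches.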
2003-06-01
delivery Data Access (1980s) "What were unit sales in New England last March?" Relational databases (RDBMS), Structured Query Language ( SQL ...macros written in Visual Basic for Applications ( VBA ). 32 Iteration Two: Class Diagram Tech OASIS Export ScriptImport Filter Data ProcessingMethod 1...MS Excel * 1 VBA Macro*1 contains sends data to co nt ai ns executes * * 1 1 contains contains Figure 20. Iteration two class diagram The
Ada (Trade Name)/SQL (Structured Query Language) Binding Specification
1988-06-01
TYPES iS package ADA-SOL Is type DWPLOYEEyNAME Is new STRING ( 1 .. 30 ); type BOSSNAME is new EMPLOYEENAME; type EMPLOYEE SALARY is digits 7 range 0.00...minimum number of significant decimal digits . All real numbers between the lower and upper bounds, inclusive, belong to the subtype, and are...and the elements of strings. Format <character> -:- < digit > I <letter> ! <special character> < digit > ::- 0111213141516171819 <letter> ::- <upper case
DCMS: A data analytics and management system for molecular simulation.
Kumar, Anand; Grupcev, Vladimir; Berrada, Meryem; Fogarty, Joseph C; Tu, Yi-Cheng; Zhu, Xingquan; Pandit, Sagar A; Xia, Yuni
Molecular Simulation (MS) is a powerful tool for studying physical/chemical features of large systems and has seen applications in many scientific and engineering domains. During the simulation process, the experiments generate a very large number of atoms and intend to observe their spatial and temporal relationships for scientific analysis. The sheer data volumes and their intensive interactions impose significant challenges for data access, management, and analysis. To date, existing MS software systems fall short on storage and handling of MS data, mainly because they lack a platform to support applications that involve intensive data access and analytical processing. In this paper, we present the database-centric molecular simulation (DCMS) system our team developed in the past few years. The main idea behind DCMS is to store MS data in a relational database management system (DBMS) to take advantage of the declarative query interface (i.e., SQL), data access methods, query processing, and optimization mechanisms of modern DBMSs. A unique challenge is to handle the analytical queries that are often compute-intensive. For that, we developed novel indexing and query processing strategies (including algorithms running on modern co-processors) as integrated components of the DBMS. As a result, researchers can upload and analyze their data using efficient functions implemented inside the DBMS. Index structures are generated to store analysis results that may be interesting to other users, so that the results are readily available without duplicating the analysis. We have developed a prototype of DCMS based on the PostgreSQL system, and experiments using real MS data and workloads show that DCMS significantly outperforms existing MS software systems. We also used it as a platform to test other data management issues such as security and compression.
HBVPathDB: a database of HBV infection-related molecular interaction network.
Zhang, Yi; Bo, Xiao-Chen; Yang, Jing; Wang, Sheng-Qi
2005-03-21
The aim was to describe molecular and gene interactions between hepatitis B virus (HBV) and its host, in order to understand how viral and host genes and molecules are networked to form a biological system and to clarify the mechanism of HBV infection. The knowledge of HBV infection-related reactions was organized into various kinds of pathways with carefully drawn graphs in HBVPathDB. Pathway information is stored in a relational database management system (DBMS), which is currently the most efficient way to manage large amounts of data, and queries are implemented in Structured Query Language (SQL). The search engine is written in PHP (Personal Home Page) with embedded SQL, and a web retrieval interface for searching is developed in Hypertext Markup Language (HTML). We present the first version of HBVPathDB, which is an HBV infection-related molecular interaction network database composed of 306 pathways with 1 050 molecules involved. With carefully drawn graphs, pathway information stored in HBVPathDB can be browsed in an intuitive way. We develop an easy-to-use interface for flexible access to the details of the database. Convenient software is implemented to query and browse the pathway information of HBVPathDB. Four search page layout options (category search, gene search, description search, and unitized search) are supported by the search engine of the database. The database is freely available at http://www.bio-inf.net/HBVPathDB/HBV/. HBVPathDB already contains a considerable amount of HBV infection-related pathway information, which is suitable for in-depth analysis of the molecular interaction network of virus and host. HBVPathDB integrates pathway data sets with convenient software for query, browsing, and visualization, giving users more opportunity to identify key regulatory molecules as potential drug targets and to explore the possible mechanism of HBV infection based on gene expression datasets.
A Data Warehouse to Support Condition Based Maintenance (CBM)
2005-05-01
Application ( VBA ) code sequence to import the original MAST-generated CSV and then create a single output table in DBASE IV format. The DBASE IV format...database architecture (Oracle, Sybase, MS- SQL , etc). This design includes table definitions, comments, specification of table attributes, primary and foreign...built queries and applications. Needs the application developers to construct data views. No SQL programming experience. b. Power Database User - knows
Image query and indexing for digital x rays
NASA Astrophysics Data System (ADS)
Long, L. Rodney; Thoma, George R.
1998-12-01
The web-based medical information retrieval system (WebMIRS) allows Internet access to databases containing 17,000 digitized x-ray spine images and associated text data from National Health and Nutrition Examination Surveys (NHANES). WebMIRS allows SQL query of the text, and viewing of the returned text records and images using a standard browser. We are now working (1) to determine the utility of data directly derived from the images in our databases, and (2) to investigate the feasibility of computer-assisted or automated indexing of the images to support retrieval of images of interest to biomedical researchers in the field of osteoarthritis. To build an initial database based on image data, we are manually segmenting a subset of the vertebrae, using techniques from vertebral morphometry. From this, we will derive vertebral features and add them to the database. This image-derived data will enhance the user's data access capability by enabling the creation of combined SQL/image-content queries.
An approach for heterogeneous and loosely coupled geospatial data distributed computing
NASA Astrophysics Data System (ADS)
Chen, Bin; Huang, Fengru; Fang, Yu; Huang, Zhou; Lin, Hui
2010-07-01
Most GIS (Geographic Information System) applications tend to have heterogeneous and autonomous geospatial information resources, and the availability of these local resources is unpredictable and dynamic under a distributed computing environment. In order to make use of these local resources together to solve larger geospatial information processing problems that are related to an overall situation, in this paper, with the support of peer-to-peer computing technologies, we propose a geospatial data distributed computing mechanism that involves loosely coupled geospatial resource directories and a term named as Equivalent Distributed Program of global geospatial queries to solve geospatial distributed computing problems under heterogeneous GIS environments. First, a geospatial query process schema for distributed computing as well as a method for equivalent transformation from a global geospatial query to distributed local queries at SQL (Structured Query Language) level to solve the coordinating problem among heterogeneous resources are presented. Second, peer-to-peer technologies are used to maintain a loosely coupled network environment that consists of autonomous geospatial information resources, thus to achieve decentralized and consistent synchronization among global geospatial resource directories, and to carry out distributed transaction management of local queries. Finally, based on the developed prototype system, example applications of simple and complex geospatial data distributed queries are presented to illustrate the procedure of global geospatial information processing.
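The idea of transforming one global query into equivalent local SQL queries can be illustrated with a minimal sketch. It is not the authors' Equivalent Distributed Program: the `features` table, its columns, and the function names are hypothetical, every local store is assumed to share one schema, and the peer-to-peer directory and transaction management are omitted. Two in-memory SQLite databases stand in for autonomous local resources.

```python
import sqlite3


def make_local_store(rows):
    """Create one autonomous local store (hypothetical schema: name, lon, lat)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE features (name TEXT, lon REAL, lat REAL)")
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", rows)
    return conn


def global_bbox_query(stores, min_lon, min_lat, max_lon, max_lat):
    """Answer a global bounding-box query by running the same local SQL
    query on every store and merging the partial results."""
    local_sql = (
        "SELECT name, lon, lat FROM features "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ?"
    )
    merged = []
    for conn in stores:
        # Each store evaluates its own share of the global query.
        merged.extend(conn.execute(local_sql, (min_lon, max_lon, min_lat, max_lat)))
    return sorted(merged)


if __name__ == "__main__":
    stores = [
        make_local_store([("a", 1.0, 1.0), ("b", 50.0, 50.0)]),
        make_local_store([("c", 2.0, 2.0)]),
    ]
    print(global_bbox_query(stores, 0, 0, 10, 10))
```

In a heterogeneous setting each store would need its own translation of the local query; the sketch shows only the decompose-and-merge step.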
Using PHP/MySQL to Manage Potential Mass Impacts
NASA Technical Reports Server (NTRS)
Hager, Benjamin I.
2010-01-01
This paper presents a new application using commercially available software to manage mass properties for spaceflight vehicles. PHP/MySQL (PHP: Hypertext Preprocessor and My Structured Query Language) are a web scripting language and a database language commonly used in concert with each other. They open up new opportunities to develop cutting edge mass properties tools, and in particular, tools for the management of potential mass impacts (threats and opportunities). The paper begins by providing an overview of the functions and capabilities of PHP/MySQL. The focus of this paper is on how PHP/MySQL are being used to develop an advanced "web accessible" database system for identifying and managing mass impacts on NASA's Ares I Upper Stage program, managed by the Marshall Space Flight Center. To fully describe this application, examples of the data, search functions, and views are provided to promote not only the function, but also the security, ease of use, simplicity, and eye-appeal of this new application. This paper concludes with an overview of the other potential mass properties applications and tools that could be developed using PHP/MySQL. The premise behind this paper is that PHP/MySQL are software tools that are easy to use and readily available for the development of cutting edge mass properties applications. These tools are capable of providing "real-time" searching and status of an active database, automated report generation, and other capabilities to streamline and enhance mass properties management applications. By using PHP/MySQL, proven existing methods for managing mass properties can be adapted to present-day information technology to accelerate mass properties data gathering, analysis, and reporting, allowing mass property management to keep pace with today's fast-paced design and development processes.
STARS 2.0: 2nd-generation open-source archiving and query software
NASA Astrophysics Data System (ADS)
Winegar, Tom
2008-07-01
The Subaru Telescope is in process of developing an open-source alternative to the 1st-generation software and databases (STARS 1) used for archiving and query. For STARS 2, we have chosen PHP and Python for scripting and MySQL as the database software. We have collected feedback from staff and observers, and used this feedback to significantly improve the design and functionality of our future archiving and query software. Archiving - We identified two weaknesses in 1st-generation STARS archiving software: a complex and inflexible table structure and uncoordinated system administration for our business model: taking pictures from the summit and archiving them in both Hawaii and Japan. We adopted a simplified and normalized table structure with passive keyword collection, and we are designing an archive-to-archive file transfer system that automatically reports real-time status and error conditions and permits error recovery. Query - We identified several weaknesses in 1st-generation STARS query software: inflexible query tools, poor sharing of calibration data, and no automatic file transfer mechanisms to observers. We are developing improved query tools and sharing of calibration data, and multi-protocol unassisted file transfer mechanisms for observers. In the process, we have redefined a 'query': from an invisible search result that can only transfer once in-house right now, with little status and error reporting and no error recovery - to a stored search result that can be monitored, transferred to different locations with multiple protocols, reporting status and error conditions and permitting recovery from errors.
Time series patterns and language support in DBMS
NASA Astrophysics Data System (ADS)
Telnarova, Zdenka
2017-07-01
This contribution focuses on the Time Series pattern type as a semantically rich representation of data. An example implementation of this pattern type in traditional database management systems is briefly presented. There are many approaches to manipulating and querying patterns. The crucial issue is a systematic approach to pattern management and a specific pattern query language that takes the semantics of patterns into consideration. The SQL-TS query language for manipulating patterns is demonstrated on Time Series data.
The Comparison of SQL, QBE, and DFQL as Query Languages for Relational Databases
1994-03-01
is: Dname Fname Lname Headquarters James Borg b. Query 7: Retrieval involving explicit sets Retrieve the Social Security Numbers of employees who worked...10. Ka Dispullahta MABES TNI-AL Cilangkap-Jakarta Timur Indonesia 11. Parunmungan Girsang 3 Jl. Cawang Baru 34-36 Jakarta
TreeQ-VISTA: An Interactive Tree Visualization Tool with Functional Annotation Query Capabilities
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gu, Shengyin; Anderson, Iain; Kunin, Victor
2007-05-07
Summary: We describe a general multiplatform exploratory tool called TreeQ-Vista, designed for presenting functional annotations in a phylogenetic context. Traits, such as phenotypic and genomic properties, are interactively queried from a relational database with a user-friendly interface which provides a set of tools for users with or without SQL knowledge. The query results are projected onto a phylogenetic tree and can be displayed in multiple color groups. A rich set of browsing, grouping and query tools are provided to facilitate trait exploration, comparison and analysis. Availability: The program, detailed tutorial and examples are available online at http://genome-test.lbl.gov/vista/TreeQVista.
Ultrabroadband photonic internet: safety aspects
NASA Astrophysics Data System (ADS)
Kalicki, Arkadiusz; Romaniuk, Ryszard
2008-11-01
Web applications have become the most popular medium on the Internet. The popularity and ease of use of web application frameworks, together with careless development, result in a high number of vulnerabilities and attacks. Several types of attack are possible because of improper input validation. SQL injection is the ability to execute arbitrary SQL queries in a database through an existing application. Cross-site scripting is a vulnerability that allows malicious web users to inject code into web pages viewed by other users. Cross-Site Request Forgery (CSRF) is an attack that tricks the victim into loading a page that contains a malicious request. Web spam in blogs is a further problem. There are several techniques to mitigate attacks. The most important are robust web application design, correct input validation, defined data types for each field, and parameterized statements in SQL queries. Server hardening with a firewall, modern security policy systems, and a safe configuration of the web framework interpreter are essential. On the client side, it is advisable to maintain a proper security level, keep software updated, and install personal web firewalls or IDS/IPS systems. Good habits include logging out of services immediately after finishing work and using a separate web browser for the most important sites, such as e-banking.
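The difference between string-built SQL and parameterized statements can be shown with a short sketch. It is illustrative only and does not come from the paper: the `users` table and the function names are hypothetical, and Python's standard `sqlite3` module stands in for whatever database a web application would use.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated into the SQL text,
    # so quotes in `name` can change the structure of the query.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized: `name` is bound as a value and is never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "x' OR '1'='1"
    print(len(find_user_unsafe(conn, payload)))  # injected condition matches every row
    print(len(find_user_safe(conn, payload)))    # payload is treated as a literal name
```

With the payload shown, the concatenated query becomes `... WHERE name = 'x' OR '1'='1'` and returns every row, while the parameterized version simply looks for a user with that odd name and finds none.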
A new relational database structure and online interface for the HITRAN database
NASA Astrophysics Data System (ADS)
Hill, Christian; Gordon, Iouli E.; Rothman, Laurence S.; Tennyson, Jonathan
2013-11-01
A new format for the HITRAN database is proposed. By storing the line-transition data in a number of linked tables described by a relational database schema, it is possible to overcome the limitations of the existing format, which have become increasingly apparent over the last few years as new and more varied data are being used by radiative-transfer models. Although the database in the new format can be searched using the well-established Structured Query Language (SQL), a web service, HITRANonline, has been deployed to allow users to make the most common queries of the database using a graphical user interface in a web page. The advantages of the relational form of the database for ensuring data integrity and consistency are explored, and the compatibility of the online interface with the emerging standards of the Virtual Atomic and Molecular Data Centre (VAMDC) project is discussed. In particular, the ability to access HITRAN data using a standard query language from other websites, command line tools and from within computer programs is described.
ExplorEnz: the primary source of the IUBMB enzyme list
McDonald, Andrew G.; Boyce, Sinéad; Tipton, Keith F.
2009-01-01
ExplorEnz is the MySQL database that is used for the curation and dissemination of the International Union of Biochemistry and Molecular Biology (IUBMB) Enzyme Nomenclature. A simple web-based query interface is provided, along with an advanced search engine for more complex Boolean queries. The WWW front-end is accessible at http://www.enzyme-database.org, from where downloads of the database as SQL and XML are also available. An associated form-based curatorial application has been developed to facilitate the curation of enzyme data as well as the internal and public review processes that occur before an enzyme entry is made official. Suggestions for new enzyme entries, or modifications to existing ones, can be made using the forms provided at http://www.enzyme-database.org/forms.php. PMID:18776214
Intelligent search in Big Data
NASA Astrophysics Data System (ADS)
Birialtsev, E.; Bukharaev, N.; Gusenkov, A.
2017-10-01
An approach to data integration aimed at ontology-based intelligent search in Big Data is considered for the case in which information objects are represented as relational databases (RDBs), structurally marked up by their schemas. The source of information for constructing an ontology and, later, for organizing the search is natural-language text, treated as semi-structured data. For the RDBs, these texts are the comments on the names of tables and their attributes. A formal definition of the RDB integration model in terms of ontologies is given. Within the framework of the model, a universal RDB representation ontology, an oil production subject domain ontology, and a linguistic thesaurus of the subject domain language are built. A technique for the automatic generation of SQL queries for subject domain specialists is proposed. On this basis, an information system for the RDBs of the TATNEFT oil-producing company was implemented. Operation of the system showed good relevance for the majority of queries.
A distributed query execution engine of big attributed graphs.
Batarfi, Omar; Elshawi, Radwa; Fayoumi, Ayman; Barnawi, Ahmed; Sakr, Sherif
2016-01-01
A graph is a popular data model that has become pervasively used for modeling structural relationships between objects. In practice, in many real-world graphs, the graph vertices and edges need to be associated with descriptive attributes. Such graphs are referred to as attributed graphs. G-SPARQL has been proposed as an expressive language, with a centralized execution engine, for querying attributed graphs. G-SPARQL supports various types of graph querying operations including reachability, pattern matching and shortest path, where any G-SPARQL query may include value-based predicates on the descriptive information (attributes) of the graph edges/vertices in addition to the structural predicates. In general, a main limitation of centralized systems is that their vertical scalability is always restricted by the physical limits of computer systems. This article describes the design, implementation, and performance evaluation of DG-SPARQL, a distributed, hybrid and adaptive parallel execution engine for G-SPARQL queries. In this engine, the topology of the graph is distributed over the main memory of the underlying nodes while the graph data are maintained in a relational store which is replicated on the disk of each of the underlying nodes. DG-SPARQL evaluates parts of the query plan via SQL queries which are pushed to the underlying relational stores while other parts of the query plan, as necessary, are evaluated via indexless memory-based graph traversal algorithms. Our experimental evaluation shows the efficiency and the scalability of DG-SPARQL on querying massive attributed graph datasets in addition to its ability to outperform Apache Giraph, a popular distributed graph processing system, by orders of magnitude.
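The hybrid split described in this abstract, with value-based predicates pushed to a relational store as SQL and structural predicates answered by in-memory traversal, can be sketched on a single machine. This is not DG-SPARQL itself: the `vertex` table, its `age` attribute, and all function names are hypothetical, and distribution across nodes is left out.

```python
import sqlite3
from collections import deque


def build_attribute_store(vertices):
    """Relational store for vertex attributes (hypothetical schema: id, age)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, age INTEGER)")
    conn.executemany("INSERT INTO vertex VALUES (?, ?)", vertices)
    return conn


def reachable(adjacency, source):
    """Breadth-first traversal over the in-memory topology (no index used)."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(source)  # report only vertices other than the start
    return seen


def reachable_with_min_age(conn, adjacency, source, min_age):
    # Value-based predicate: pushed to the relational store as SQL.
    rows = conn.execute("SELECT id FROM vertex WHERE age >= ?", (min_age,))
    matching = {row[0] for row in rows}
    # Structural predicate: evaluated by graph traversal in memory.
    return sorted(matching & reachable(adjacency, source))


if __name__ == "__main__":
    conn = build_attribute_store([(1, 30), (2, 17), (3, 45), (4, 60)])
    adjacency = {1: [2], 2: [3], 4: [1]}
    print(reachable_with_min_age(conn, adjacency, 1, 18))
```

A real engine would choose, per query, which side to evaluate first; the sketch always evaluates both and intersects the results.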
Example Level 1 Ada/SQL (Structured Query Language) System Software
1987-09-01
PUTLINE ("EMPNAME JOB SALARY COMMISSION"); loop FETCH ( CURSOR ); INTO ( VEMP NAME , STR LAST ); T LEN INTEGER (STR LAST - V EMP NAME’FIRST + 1); for I in 1...begin PUT_LINE ("EMPNAME JOB SALARY DEPT"); loop FETCH (CURSOR); INTO ( VEMP NAME , STRLAST ); T_LEN := INTEGER (STRLAST - V_EMPNAME’FIRST + 1); for I in...NUMBERS OPEN ( CURSOR ); begin PUT_LINE ("EMP_NAME SALARY JOB"); loop FETCH ( CURSOR ); INTO ( VEMP NAME , STRLAST ); T_LEN := INTEGER (STR_LAST
Abstracting data warehousing issues in scientific research.
Tews, Cody; Bracio, Boris R
2002-01-01
This paper presents the design and implementation of the Idaho Biomedical Data Management System (IBDMS). This system preprocesses biomedical data from the IMPROVE (Improving Control of Patient Status in Critical Care) library via an Open Database Connectivity (ODBC) connection. The ODBC connection allows for local and remote simulations to access filtered, joined, and sorted data using the Structured Query Language (SQL). The tool is capable of providing an overview of available data in addition to user defined data subset for verification of models of the human respiratory system.
ClimateSpark: An In-memory Distributed Computing Framework for Big Climate Data Analytics
NASA Astrophysics Data System (ADS)
Hu, F.; Yang, C. P.; Duffy, D.; Schnase, J. L.; Li, Z.
2016-12-01
Massive array-based climate data are being generated by global surveillance systems and model simulations. They are widely used to analyze environmental problems such as climate change, natural hazards, and public health. However, extracting the underlying information from these big climate datasets is challenging because data processing and analysis are both data- and computing-intensive. To tackle these challenges, this paper proposes ClimateSpark, an in-memory distributed computing framework to support big climate data processing. In ClimateSpark, a spatiotemporal index is developed to enable Apache Spark to treat array-based climate data (e.g. netCDF4, HDF4) as native formats, which are stored in the Hadoop Distributed File System (HDFS) without any preprocessing. Based on the index, spatiotemporal query services are provided to retrieve datasets according to a defined geospatial and temporal bounding box. The data subsets are read out, and a data partition strategy is applied to split the queried data equally across the computing nodes and store them in memory as climateRDDs for processing. By leveraging Spark SQL and User Defined Functions (UDFs), climate data analysis operations can be expressed in intuitive SQL. ClimateSpark is evaluated with two use cases using the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. One use case conducts a spatiotemporal query and visualizes the subset results as an animation; the other compares different climate model outputs using a Taylor-diagram service. Experimental results show that ClimateSpark can significantly accelerate data query and processing, and enables complex analysis services to be delivered in an SQL-style fashion.
Tao, Shiqiang; Cui, Licong; Wu, Xi; Zhang, Guo-Qiang
2017-01-01
To help researchers better access clinical data, we developed a prototype query engine called DataSphere for exploring large-scale integrated clinical data repositories. DataSphere expedites data importing using a NoSQL data management system and dynamically renders its user interface for concept-based querying tasks. DataSphere provides an interactive query-building interface together with query translation and optimization strategies, which enable users to build and execute queries effectively and efficiently. We successfully loaded a dataset of one million patients for University of Kentucky (UK) Healthcare into DataSphere with more than 300 million clinical data records. We evaluated DataSphere by comparing it with an instance of i2b2 deployed at UK Healthcare, demonstrating that DataSphere provides enhanced user experience for both query building and execution. PMID:29854239
Tamm, E P; Kawashima, A; Silverman, P
2001-06-01
Current commercial radiology information systems (RIS) are designed for scheduling, billing, charge collection, and report dissemination. Academic institutions have additional requirements for their missions for teaching, research and clinical care. The newest versions of commercial RIS offer greater flexibility than prior systems. We sent questionnaires to Cerner Corporation, ADAC Health Care Information Systems, IDX Systems, Per-Se' Technologies, and Siemens Health Services regarding features of their products. All of the products we surveyed offer user customizable fields. However, most products did not allow the user to expand their product's data table. The search capabilities of the products varied. All of the products supported the Health Level 7 (HL-7) interface and the use of structured query language (SQL). All of the products were offered with an SQL editor for creating customized queries and custom reports. All products included capabilities for collecting data for quality assurance and included capabilities for tracking "interesting cases," though they varied in the functionality offered. No product offered dedicated functions for research. Alternatively, radiology departments can create their own client-server Windows-based database systems to supplement the capabilities of commercial systems. Such systems can be developed with "web-enabled" database products like Microsoft Access or Apple Filemaker Pro.
Shark: SQL and Rich Analytics at Scale
2012-11-26
learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a...Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark...so using a runtime that is optimized for such workloads and a programming model that is designed to express machine learning algorithms.
"Science SQL" as a Building Block for Flexible, Standards-based Data Infrastructures
NASA Astrophysics Data System (ADS)
Baumann, Peter
2016-04-01
We have learnt to live with the pain of separating data and metadata into non-interoperable silos. For metadata, we enjoy the flexibility of databases, be they relational, graph, or some other NoSQL. In contrast, users still "drown in files" as an unstructured, low-level archiving paradigm. It is time to bridge this chasm, which once was technologically induced but today can be overcome. One building block towards a common re-integrated information space is to support massive multi-dimensional spatio-temporal arrays. These "datacubes" appear as sensor, image, simulation, and statistics data in all science and engineering domains, and beyond. For example, 2-D satellite imagery, 3-D x/y/t image timeseries and x/y/z geophysical voxel data, and 4-D x/y/z/t climate data contribute to today's data deluge in the Earth sciences. Virtual observatories in the Space sciences routinely generate Petabytes of such data. Life sciences deal with microarray data, confocal microscopy, and human brain data, which all fall into the same category. The ISO SQL/MDA (Multi-Dimensional Arrays) candidate standard is extending SQL with modelling and query support for n-D arrays ("datacubes") in a flexible, domain-neutral way. This heralds a new generation of services with new quality parameters, such as flexibility, ease of access, embedding into well-known user tools, and scalability mechanisms that remain completely transparent to users. Technology like the EU rasdaman ("raster data manager") Array Database system can support all of the above examples simultaneously, with one technology. This is practically proven: As of today, rasdaman is in operational use on hundreds of Terabytes of satellite image timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Therefore, Array Databases offering SQL/MDA constitute a natural common building block for next-generation data infrastructures.
As initiator and editor of the standard, we present principles, implementation facets, and application examples as a basis for further discussion. Further, we highlight recent implementation progress in parallelization, data distribution, and query optimization, showing their effects on real-life use cases.
Specialized microbial databases for inductive exploration of microbial genome sequences
Fang, Gang; Ho, Christine; Qiu, Yaowu; Cubas, Virginie; Yu, Zhou; Cabau, Cédric; Cheung, Frankie; Moszer, Ivan; Danchin, Antoine
2005-01-01
Background: The enormous amount of genome sequence data calls for user-oriented databases to manage sequences and annotations. Queries must include search tools permitting function identification through exploration of related objects. Methods: The GenoList package for collecting and mining microbial genome databases has been rewritten using MySQL as the database management system. Functions that were not available in MySQL, such as nested subquery, have been implemented. Results: Inductive reasoning in the study of genomes starts from "islands of knowledge", centered around genes with some known background. With this concept of "neighborhood" in mind, a modified version of the GenoList structure has been used for organizing sequence data from prokaryotic genomes of particular interest in China. GenoChore, a set of 17 specialized end-user-oriented microbial databases (including one instance of Microsporidia, Encephalitozoon cuniculi, a member of Eukarya) has been made publicly available. These databases allow the user to browse genome sequence and annotation data using standard queries. In addition they provide a weekly update of searches against the world-wide protein sequences data libraries, allowing one to monitor annotation updates on genes of interest. Finally, they allow users to search for patterns in DNA or protein sequences, taking into account a clustering of genes into formal operons, as well as providing extra facilities to query sequences using predefined sequence patterns. Conclusion: This growing set of specialized microbial databases organizes data created by the first Chinese bacterial genome programs (ThermaList, Thermoanaerobacter tencongensis, LeptoList, with two different genomes of Leptospira interrogans and SepiList, Staphylococcus epidermidis) associated to related organisms for comparison. PMID:15698474
JBioWH: an open-source Java framework for bioinformatics data integration
Vera, Roberto; Perez-Riverol, Yasset; Perez, Sonia; Ligeti, Balázs; Kertész-Farkas, Attila; Pongor, Sándor
2013-01-01
The Java BioWareHouse (JBioWH) project is an open-source platform-independent programming framework that allows a user to build his/her own integrated database from the most popular data sources. JBioWH can be used for intensive querying of multiple data sources and the creation of streamlined task-specific data sets on local PCs. JBioWH is based on a MySQL relational database scheme and includes JAVA API parser functions for retrieving data from 20 public databases (e.g. NCBI, KEGG, etc.). It also includes a client desktop application for (non-programmer) users to query data. In addition, JBioWH can be tailored for use in specific circumstances, including the handling of massive queries for high-throughput analyses or CPU intensive calculations. The framework is provided with complete documentation and application examples and it can be downloaded from the Project Web site at http://code.google.com/p/jbiowh. A MySQL server is available for demonstration purposes at hydrax.icgeb.trieste.it:3307. Database URL: http://code.google.com/p/jbiowh PMID:23846595
Web-based Hyper Suprime-Cam Data Providing System
NASA Astrophysics Data System (ADS)
Koike, M.; Furusawa, H.; Takata, T.; Price, P.; Okura, Y.; Yamada, Y.; Yamanoi, H.; Yasuda, N.; Bickerton, S.; Katayama, N.; Mineo, S.; Lupton, R.; Bosch, J.; Loomis, C.
2014-05-01
We describe a web-based user interface to retrieve Hyper Suprime-Cam data products, including images and catalogs. Users can access data directly from a graphical user interface or by writing a database SQL query. The system provides raw images, reduced images and stacked images (from multiple individual exposures), with previews available. Catalog queries can be executed in preview or queue mode, allowing for both exploratory and comprehensive investigations.
An adaptable architecture for patient cohort identification from diverse data sources.
Bache, Richard; Miles, Simon; Taweel, Adel
2013-12-01
We define and validate an architecture for systems that identify patient cohorts for clinical trials from multiple heterogeneous data sources. This architecture has an explicit query model capable of supporting temporal reasoning and expressing eligibility criteria independently of the representation of the data used to evaluate them. The architecture has the key feature that queries defined according to the query model are both pre- and post-processed, and this is used to address both structural and semantic heterogeneity. The process of extracting the relevant clinical facts is separated from the process of reasoning about them. A specific instance of the query model is then defined and implemented. We show that the specific instance of the query model has wide applicability. We then describe how it is used to access three diverse data warehouses to determine patient counts. Although the proposed architecture requires greater effort to implement the query model than would be the case for using just SQL and accessing a database management system directly, this effort is justified because it supports both temporal reasoning and heterogeneous data sources. The query model only needs to be implemented once no matter how many data sources are accessed. Each additional source requires only the implementation of a lightweight adaptor. The architecture has been used to implement a specific query model that can express complex eligibility criteria and access three diverse data warehouses, thus demonstrating the feasibility of this approach in dealing with temporal reasoning and data heterogeneity.
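The separation this abstract describes, with one query model implemented once and a lightweight adaptor per data source, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: `Fact`, `Criterion`, both adaptor classes, the row layouts, and the single "recorded on or after a date" temporal rule are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Fact:
    """A clinical fact in the source-independent form used for reasoning."""
    patient_id: str
    code: str
    on: date


@dataclass(frozen=True)
class Criterion:
    """Eligibility criterion: `code` recorded on or after `since`."""
    code: str
    since: date


class Adaptor(Protocol):
    def facts(self, code: str) -> Iterable[Fact]: ...


class DictAdaptor:
    """Hypothetical adaptor over rows shaped like {'pid', 'dx', 'when'}."""

    def __init__(self, rows):
        self.rows = rows

    def facts(self, code):
        for row in self.rows:
            if row["dx"] == code:
                yield Fact(row["pid"], row["dx"], date.fromisoformat(row["when"]))


class TupleAdaptor:
    """Hypothetical adaptor over (patient, code, year, month, day) tuples."""

    def __init__(self, rows):
        self.rows = rows

    def facts(self, code):
        for patient, row_code, year, month, day in self.rows:
            if row_code == code:
                yield Fact(patient, row_code, date(year, month, day))


def count_cohort(adaptor, criteria):
    """Count patients satisfying every criterion (the reasoning step,
    written once and shared by all data sources)."""
    cohort = None
    for criterion in criteria:
        matching = {
            fact.patient_id
            for fact in adaptor.facts(criterion.code)  # extraction step
            if fact.on >= criterion.since              # temporal check
        }
        cohort = matching if cohort is None else cohort & matching
    return len(cohort or ())
```

Adding a third source in this sketch would mean writing one more class with a `facts` method; `count_cohort` stays unchanged.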
Named Entity Recognition in a Hungarian NL Based QA System
NASA Astrophysics Data System (ADS)
Tikkl, Domonkos; Szidarovszky, P. Ferenc; Kardkovacs, Zsolt T.; Magyar, Gábor
In the WoW project, our purpose is to create a complex search interface with the following features: search in the deep web content of contracted partners' databases; processing of Hungarian natural language (NL) questions and their transformation into SQL queries for database access; and image search supported by a visual thesaurus that describes the visual content of images in a structured form (also in Hungarian). This paper focuses primarily on one particular problem of the question processing task: entity recognition. Before going into details, we give a short overview of the project's aims.
A Customizable Dashboarding System for Watershed Model Interpretation
NASA Astrophysics Data System (ADS)
Easton, Z. M.; Collick, A.; Wagena, M. B.; Sommerlot, A.; Fuka, D.
2017-12-01
Stakeholders, including policymakers, agricultural water managers, and small farm managers, can benefit from the outputs of commonly run watershed models. However, the information that each stakeholder needs is different. While policymakers are often interested in the broader effects that small farm management may have on a watershed during extreme events or over long periods, farmers are often interested in field-specific effects at daily or seasonal time scales. To provide stakeholders with the ability to analyze and interpret data from large-scale watershed models, we have developed a framework that can support custom exploration of the large datasets produced. For the volume of data produced by these models, SQL-based data queries are not efficient; thus, we employ a "Not Only SQL" (NO-SQL) query language, which allows data to scale in both quantity and query volumes. We demonstrate a stakeholder-customizable dashboarding system that allows stakeholders to create custom 'dashboards' to summarize model output specific to their needs. Dashboarding is a dynamic and purpose-based visual interface needed to display one-to-many database linkages so that the information can be presented for a single time period or dynamically monitored over time, and it allows a user to quickly define focus areas of interest for their analysis. We utilize a single watershed model that is run four times daily with a combined set of climate projections, which are then indexed and added to an ElasticSearch datastore. ElasticSearch is a NO-SQL search engine built on top of Apache Lucene, a free and open-source information retrieval software library. Aligned with the ElasticSearch project is the open-source visualization and analysis system Kibana, which we utilize for custom stakeholder dashboarding. The dashboards create a visualization of the stakeholder-selected analysis and can be extended to recommend robust strategies to support decision-making.
Integrated Array/Metadata Analytics
NASA Astrophysics Data System (ADS)
Misev, Dimitar; Baumann, Peter
2015-04-01
Data comes in various forms and types, and integration usually presents a problem that is often simply ignored or solved with ad-hoc solutions. Multidimensional arrays are a ubiquitous data type that we find at the core of virtually all science and engineering domains, as sensor, model, image, and statistics data. Naturally, arrays are richly described by and intertwined with additional metadata (alphanumeric relational data, XML, JSON, etc.). Database systems, however, a fundamental building block of what we call "Big Data", lack adequate support for modelling and expressing these array data/metadata relationships. Array analytics is hence quite primitive or entirely absent in modern relational DBMS. Recognizing this, we extended SQL with a new SQL/MDA part seamlessly integrating multidimensional array analytics into the standard database query language. We demonstrate the benefits of SQL/MDA with real-world examples executed in ASQLDB, an open-source mediator system based on HSQLDB and rasdaman that already implements SQL/MDA.
A Magnetic Petrology Database for Satellite Magnetic Anomaly Interpretations
NASA Astrophysics Data System (ADS)
Nazarova, K.; Wasilewski, P.; Didenko, A.; Genshaft, Y.; Pashkevich, I.
2002-05-01
A Magnetic Petrology Database (MPDB) is now being compiled at NASA/Goddard Space Flight Center in cooperation with Russian and Ukrainian institutions. The purpose of this database is to provide the geomagnetic community with a comprehensive and user-friendly method of accessing magnetic petrology data via the Internet for more realistic interpretation of satellite magnetic anomalies. Magnetic petrology data have been accumulated at NASA/Goddard Space Flight Center, the United Institute of Physics of the Earth (Russia) and the Institute of Geophysics (Ukraine) over several decades, and our archives now hold many thousands of records. The MPDB has been, and continues to be, in high demand, especially since the recent launch into near-Earth orbit of a mini-constellation of three satellites - Oersted (in 1999), Champ (in 2000), and SAC-C (in 2000) - which will provide lithospheric magnetic maps with better spatial and amplitude resolution (about 1 nT). The MPDB is focused on lower crustal and upper mantle rocks and will include data on mantle xenoliths, serpentinized ultramafic rocks, granulites, iron quartzites and rocks from Archean-Proterozoic metamorphic sequences from all around the world. A substantial amount of data comes from the area of the unique Kursk Magnetic Anomaly and from the Kola Deep Borehole (which recovered 12 km of continental crust). A prototype MPDB can be found on the Geodynamics Branch web server of Goddard Space Flight Center at http://core2.gsfc.nasa.gov/terr_mag/magnpetr.html. The MPDB employs a searchable relational design and consists of 7 interrelated tables. The schema of the database is shown at http://core2.gsfc.nasa.gov/terr_mag/doc.html. A MySQL database server was used to implement the MPDB, and SQL (Structured Query Language) is used to query the database. To present query results on the Web and for Web programming we used the PHP scripting language and CGI scripts.
The prototype MPDB is designed to search the database by major satellite magnetic anomaly, tectonic structure, geographical location, rock type, magnetic properties, chemistry and reference; see http://core2.gsfc.nasa.gov/terr_mag/query1.html. The output of the database is an HTML-structured table, a text file, and a downloadable file. This database will be very useful for studies of lithospheric satellite magnetic anomalies on the Earth and other terrestrial planets.
NASA Astrophysics Data System (ADS)
Barouchou, Alexandra; Dendrinos, Markos
2015-02-01
An interesting issue in the history of science and ideas is the concept of similarity between historical personalities. Similar objects of research among philosophers and scientists indicate prospective influences, arising from reading one another's work or from meetings, communication or even cooperation. The keywords extracted from their works, together with the placement of those keywords in a taxonomy of philosophical and scientific terms, play a key methodological role in surfacing these similarities. The case study examined in this paper concerns scientists and philosophers who lived in ancient Greece or during the Renaissance and dealt, in at least one work, with the subject God. All the available data (scientists, studies, recorded relations between scientists, keywords, and thematic hierarchy) have been organized in an RDBMS environment, aiming at the emergence of similarities and influences between scientists through properly created SQL queries based on date and thematic hierarchy criteria.
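A query of the kind described, combining a date criterion with shared keywords, might look like the following sketch. The schema (`scientist`, `keyword`, `scientist_keyword`), the placeholder names, and the sample rows are all hypothetical; the abstract does not give the actual tables. The example uses Python's built-in `sqlite3` module so that it is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scientist (id INTEGER PRIMARY KEY, name TEXT, born INTEGER);
-- parent_id places each keyword in a thematic hierarchy
CREATE TABLE keyword (id INTEGER PRIMARY KEY, term TEXT, parent_id INTEGER);
CREATE TABLE scientist_keyword (scientist_id INTEGER, keyword_id INTEGER);

INSERT INTO scientist VALUES
    (1, 'Thinker A', -400), (2, 'Thinker B', -350), (3, 'Thinker C', 1450);
INSERT INTO keyword VALUES (1, 'God', NULL), (2, 'soul', 1), (3, 'motion', 1);
INSERT INTO scientist_keyword VALUES (1, 2), (2, 3), (3, 2);
""")

# Pairs (earlier, later) that share a keyword: a candidate line of influence.
CANDIDATE_INFLUENCES = """
SELECT a.name, b.name, k.term
FROM scientist_keyword AS ak
JOIN scientist_keyword AS bk ON bk.keyword_id = ak.keyword_id
JOIN scientist AS a ON a.id = ak.scientist_id
JOIN scientist AS b ON b.id = bk.scientist_id
JOIN keyword   AS k ON k.id = ak.keyword_id
WHERE a.born < b.born          -- date criterion: a precedes b
ORDER BY a.name, b.name, k.term
"""

pairs = conn.execute(CANDIDATE_INFLUENCES).fetchall()
print(pairs)
```

Matching on `parent_id` instead of `keyword_id` would loosen the criterion from identical keywords to keywords under the same branch of the hierarchy.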
Advanced Query and Data Mining Capabilities for MaROS
NASA Technical Reports Server (NTRS)
Wang, Paul; Wallick, Michael N.; Allard, Daniel A.; Gladden, Roy E.; Hy, Franklin H.
2013-01-01
The Mars Relay Operational Service (MaROS) comprises a number of tools to coordinate, plan, and visualize various aspects of the Mars Relay network. These levels include a Web-based user interface, a back-end "ReSTlet" built in Java, and databases that store the data as it is received from the network. As part of MaROS, the innovators have developed and implemented a feature set that operates on several levels of the software architecture. This new feature is an advanced querying capability through either the Web-based user interface, or through a back-end REST interface to access all of the data gathered from the network. This software is not meant to replace the REST interface, but to augment and expand the range of available data. The current REST interface provides specific data that is used by the MaROS Web application to display and visualize the information; however, the returned information from the REST interface has typically been pre-processed to return only a subset of the entire information within the repository, particularly only the information that is of interest to the GUI (graphical user interface). The new, advanced query and data mining capabilities allow users to retrieve the raw data and/or to perform their own data processing. The query language used to access the repository is a restricted subset of the structured query language (SQL) that can be built safely from the Web user interface, or entered as freeform SQL by a user. The results are returned in a CSV (Comma Separated Values) format for easy exporting to third party tools and applications that can be used for data mining or user-defined visualization and interpretation. This is the first time that a service is capable of providing access to all cross-project relay data from a single Web resource. 
Because MaROS contains the data for a variety of missions from the Mars network, which span both NASA and ESA, the software also establishes an access control list (ACL) on each data record in the database repository to enforce user access permissions through a multilayered approach.
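The idea of building a restricted SQL query safely from user-interface selections and returning CSV can be sketched as follows. This is not the MaROS implementation: the table name `relay_pass`, its columns, and the whitelist approach are assumptions chosen for illustration. Identifiers are checked against a fixed whitelist and values are passed as bound parameters, so no user text is spliced into the statement.

```python
import csv
import io
import sqlite3

# Hypothetical whitelist: the only table and columns a UI query may touch.
ALLOWED_COLUMNS = {"relay_pass": {"pass_id", "orbiter", "lander", "data_volume_mb"}}


def build_query(table, columns, filters):
    """Return (sql, params) for a SELECT restricted to whitelisted identifiers."""
    if table not in ALLOWED_COLUMNS:
        raise ValueError(f"table not allowed: {table}")
    if not columns:
        raise ValueError("at least one column is required")
    allowed = ALLOWED_COLUMNS[table]
    bad = [c for c in list(columns) + list(filters) if c not in allowed]
    if bad:
        raise ValueError(f"columns not allowed: {bad}")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if filters:
        # Values are never interpolated; they travel as bound parameters.
        sql += " WHERE " + " AND ".join(f"{c} = ?" for c in filters)
    return sql, list(filters.values())


def to_csv(cursor):
    """Serialize a cursor's result set as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([d[0] for d in cursor.description])
    writer.writerows(cursor)
    return buf.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relay_pass (pass_id INTEGER, orbiter TEXT, lander TEXT, data_volume_mb REAL)"
)
conn.executemany(
    "INSERT INTO relay_pass VALUES (?, ?, ?, ?)",
    [(1, "ORB-1", "LANDER-1", 120.5), (2, "ORB-2", "LANDER-1", 80.0)],
)

sql, params = build_query("relay_pass", ["pass_id", "data_volume_mb"], {"orbiter": "ORB-1"})
csv_text = to_csv(conn.execute(sql, params))
print(csv_text)
```

Freeform SQL entered by a user would need a different safeguard, such as parsing the statement and rejecting anything outside the permitted subset; that part is not shown here.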
A comparison of database systems for XML-type data.
Risse, Judith E; Leunissen, Jack A M
2010-01-01
In the field of bioinformatics interchangeable data formats based on XML are widely used. XML-type data is also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. In this paper we analyse the suitability of different database systems for storing and querying large datasets in general and Medline in particular. All reviewed database systems perform well when tested with small to medium sized datasets, however when the full Medline dataset is queried a large variation in query times is observed. There is not one system that is vastly superior to the others in this comparison and, depending on the database size and the query requirements, different systems are most suitable. The best all-round solution is the Oracle 11g database system using the new binary storage option. Alias-i's Lingpipe is a more lightweight, customizable and sufficiently fast solution. It does however require more initial configuration steps. For data with a changing XML structure Sedna and BaseX as native XML database systems or MySQL with an XML-type column are suitable.
BioWarehouse: a bioinformatics database warehouse toolkit
Lee, Thomas J; Pouliot, Yannick; Wagner, Valerie; Gupta, Priyanka; Stringer-Calvert, David WJ; Tenenbaum, Jessica D; Karp, Peter D
2006-01-01
Background: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. Results: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research.
Conclusion: BioWarehouse embodies significant progress on the database integration problem for bioinformatics. PMID:16556315
BioWarehouse: a bioinformatics database warehouse toolkit.
Lee, Thomas J; Pouliot, Yannick; Wagner, Valerie; Gupta, Priyanka; Stringer-Calvert, David W J; Tenenbaum, Jessica D; Karp, Peter D
2006-03-23
This article addresses the problem of interoperation of heterogeneous bioinformatics databases. We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
Brave New World: Data Intensive Science with SDSS and the VO
NASA Astrophysics Data System (ADS)
Thakar, A. R.; Szalay, A. S.; O'Mullane, W.; Nieto-Santisteban, M.; Budavari, T.; Li, N.; Carliles, S.; Haridas, V.; Malik, T.; Gray, J.
2004-12-01
With the advent of digital archives and the VO, astronomy is quickly changing from a data-hungry to a data-intensive science. Local and specialized access to data will remain the most direct and efficient way to get data out of individual archives, especially if you know what you are looking for. However, the enormous sizes of the upcoming archives will preclude this type of access for most institutions, and will not allow researchers to tap the vast potential for discovery in cross-matching and comparing data between different archives. The VO makes this type of interoperability and distributed data access possible by adopting industry standards for data access (SQL) and data interchange (SOAP/XML) with platform independence (Web services). As a sneak preview of this brave new world where astronomers may need to become SQL warriors, we present a look at VO-enabled access to catalog data in the SDSS Catalog Archive Server (CAS): CasJobs - a workbench environment that allows arbitrarily complex SQL queries and your own personal database (MyDB) that you can share with collaborators; OpenSkyQuery - an IVOA (International Virtual Observatory Alliance) compliant federation of multiple archives (OpenSkyNodes) that currently links nearly 20 catalogs and allows cross-match queries (in ADQL - Astronomical Data Query Language) between them; Spectrum and Filter Profile Web services that provide access to an open database of spectra (registered users may add their own spectra); and VO-enabled Mirage - a Java visualization tool developed at Bell Labs and enhanced at JHU that allows side-by-side comparison of SDSS catalog and FITS image data. Anticipating the next generation of Petabyte archives like LSST by the end of the decade, we are developing a parallel cross-match engine for all-sky cross-matches between large surveys, along with a 100-Terabyte data intensive science laboratory with high-speed parallel data access.
Sujansky, Walter V; Faus, Sam A; Stone, Ethan; Brennan, Patricia Flatley
2010-10-01
Online personal health records (PHRs) enable patients to access, manage, and share certain of their own health information electronically. This capability creates the need for precise access-control mechanisms that restrict the sharing of data to that intended by the patient. The authors describe the design and implementation of an access-control mechanism for PHR repositories that is modeled on the eXtensible Access Control Markup Language (XACML) standard, but intended to reduce the cognitive and computational complexity of XACML. The authors implemented the mechanism entirely in a relational database system using ANSI-standard SQL statements. Based on a set of access-control rules encoded as relational table rows, the mechanism determines via a single SQL query whether a user who accesses patient data from a specific application is authorized to perform a requested operation on a specified data object. Testing of this query on a moderately large database has demonstrated execution times consistently below 100 ms. The authors include the details of the implementation, including algorithms, examples, and a test database, as Supplementary materials. Copyright © 2010 Elsevier Inc. All rights reserved.
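To make the idea of rules stored as table rows and evaluated by a single query concrete, here is a minimal sketch. It is not the authors' mechanism: the `access_rule` table, the `'*'` wildcard convention, the default-deny behaviour, and the deny-overrides combination are all assumptions made for illustration (deny-overrides is one of the combining strategies defined in XACML). Python's `sqlite3` module stands in for the relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical rule table; '*' in a column means "matches anything".
CREATE TABLE access_rule (
    patient_id TEXT, user_id TEXT, app_id TEXT,
    data_class TEXT, operation TEXT, effect TEXT
);
INSERT INTO access_rule VALUES
    ('p1', 'dr_lee', '*', '*',             'read', 'permit'),
    ('p1', '*',      '*', 'mental_health', '*',    'deny');
""")

# One query decides the request: no matching rule means deny,
# any matching 'deny' rule wins, otherwise permit.
AUTHORIZE_SQL = """
SELECT CASE
         WHEN COUNT(*) = 0 THEN 0
         WHEN SUM(CASE WHEN effect = 'deny' THEN 1 ELSE 0 END) > 0 THEN 0
         ELSE 1
       END
FROM access_rule
WHERE patient_id = ?
  AND user_id    IN (?, '*')
  AND app_id     IN (?, '*')
  AND data_class IN (?, '*')
  AND operation  IN (?, '*')
"""


def is_authorized(patient, user, app, data_class, operation):
    """Evaluate one access request against the rule table."""
    row = conn.execute(
        AUTHORIZE_SQL, (patient, user, app, data_class, operation)
    ).fetchone()
    return bool(row[0])


print(is_authorized("p1", "dr_lee", "portal", "medications", "read"))
print(is_authorized("p1", "dr_lee", "portal", "mental_health", "read"))
```

Because the decision is a single aggregate over an indexed rule table, its cost depends mainly on how many rules match the patient, which is consistent with the short execution times the abstract reports.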
DOE Office of Scientific and Technical Information (OSTI.GOV)
Roberts, D
Purpose: A unified database system was developed to allow accumulation, review and analysis of quality assurance (QA) data for measurement, treatment, imaging and simulation equipment in our department. Recording these data in a database allows a unified and structured approach to review and analysis of data gathered using commercial database tools. Methods: A clinical database was developed to track records of quality assurance operations on linear accelerators, a computed tomography (CT) scanner, high dose rate (HDR) afterloader and imaging systems such as on-board imaging (OBI) and Calypso in our department. The database was developed using the Microsoft Access database and the Visual Basic for Applications (VBA) programming interface. Separate modules were written for accumulation, review and analysis of daily, monthly and annual QA data. All modules were designed to use structured query language (SQL) as the basis of data accumulation and review. The SQL strings are dynamically re-written at run time. The database also features embedded documentation, storage of documents produced during QA activities and the ability to annotate all data within the database. Tests are defined in a set of tables that define test type, specific value, and schedule. Results: Daily, monthly and annual QA data have been taken in parallel with established procedures to test MQA. The database has been used to aggregate data across machines to examine the consistency of machine parameters and operations within the clinic for several months. Conclusion: The MQA application has been developed as an interface to a commercially available SQL engine (JET 5.0) and a standard database back-end. The MQA system has been used for several months for routine data collection. The system is robust, relatively simple to extend and can be migrated to a commercial SQL server.
An adaptable architecture for patient cohort identification from diverse data sources
Bache, Richard; Miles, Simon; Taweel, Adel
2013-01-01
Objective: We define and validate an architecture for systems that identify patient cohorts for clinical trials from multiple heterogeneous data sources. This architecture has an explicit query model capable of supporting temporal reasoning and expressing eligibility criteria independently of the representation of the data used to evaluate them. Method: The architecture has the key feature that queries defined according to the query model are both pre- and post-processed, and this is used to address both structural and semantic heterogeneity. The process of extracting the relevant clinical facts is separated from the process of reasoning about them. A specific instance of the query model is then defined and implemented. Results: We show that the specific instance of the query model has wide applicability. We then describe how it is used to access three diverse data warehouses to determine patient counts. Discussion: Although the proposed architecture requires greater effort to implement the query model than would be the case for using just SQL and accessing a database management system directly, this effort is justified because it supports both temporal reasoning and heterogeneous data sources. The query model only needs to be implemented once no matter how many data sources are accessed. Each additional source requires only the implementation of a lightweight adaptor. Conclusions: The architecture has been used to implement a specific query model that can express complex eligibility criteria and access three diverse data warehouses, thus demonstrating the feasibility of this approach in dealing with temporal reasoning and data heterogeneity. PMID:24064442
Integrating Radar Image Data with Google Maps
NASA Technical Reports Server (NTRS)
Chapman, Bruce D.; Gibas, Sarah
2010-01-01
A public Web site has been developed as a method for displaying the multitude of radar imagery collected by NASA's Airborne Synthetic Aperture Radar (AIRSAR) instrument during its 16-year mission. Utilizing NASA's internal AIRSAR site, the new Web site features more sophisticated visualization tools that enable the general public to have access to these images. The site was originally maintained at NASA on six computers: one that held the Oracle database, two that took care of the software for the interactive map, and three that were for the Web site itself. Several tasks were involved in moving this complicated setup to just one computer. First, the AIRSAR database was migrated from Oracle to MySQL. Then the back-end of the AIRSAR Web site was updated in order to access the MySQL database. To do this, a few of the scripts needed to be modified; specifically three Perl scripts that query that database. The database connections were then updated from Oracle to MySQL, numerous syntax errors were corrected, and a query was implemented that replaced one of the stored Oracle procedures. Lastly, the interactive map was designed, implemented, and tested so that users could easily browse and access the radar imagery through the Google Maps interface.
Catalogue of HI PArameters (CHIPA)
NASA Astrophysics Data System (ADS)
Saponara, J.; Benaglia, P.; Koribalski, B.; Andruchow, I.
2015-08-01
The catalogue of HI parameters of galaxies (CHIPA) is the natural continuation of the compilation by M.C. Martin in 1998. CHIPA provides the most important parameters of nearby galaxies derived from observations of the neutral hydrogen line. The catalogue contains information on 1400 galaxies across the sky and of different morphological types. Parameters such as the optical diameter of the galaxy, the blue magnitude, the distance, the morphological type and the HI extension are listed, among others. Maps of the HI distribution, velocity and velocity dispersion can also be displayed in some cases. The main objective of this catalogue is to facilitate bibliographic queries through searches in a database accessible from the internet that will be available in 2015 (the website is under construction). The database was built using the open source MySQL (a relational database management system based on SQL, Structured Query Language), while the website was built with HTML (Hypertext Markup Language) and PHP (Hypertext Preprocessor).
Querying and Computing with BioCyc Databases
Krummenacker, Markus; Paley, Suzanne; Mueller, Lukas; Yan, Thomas; Karp, Peter D.
2006-01-01
Summary: We describe multiple methods for accessing and querying the complex and integrated cellular data in the BioCyc family of databases: access through multiple file formats, access through Application Program Interfaces (APIs) for LISP, Perl and Java, and SQL access through the BioWarehouse relational database. Availability: The Pathway Tools software and 20 BioCyc DBs in Tiers 1 and 2 are freely available to academic users; fees apply to some types of commercial use. For download instructions see http://BioCyc.org/download.shtml PMID:15961440
Computerized database management system for breast cancer patients.
Sim, Kok Swee; Chong, Sze Siang; Tso, Chih Ping; Nia, Mohsen Esmaeili; Chong, Aun Kee; Abbas, Siti Fathimah
2014-01-01
Data analysis based on breast cancer risk factors such as age, race, breastfeeding, hormone replacement therapy, family history, and obesity was conducted on breast cancer patients using a new enhanced computerized database management system. MySQL is selected as the database management system to store the patient data collected from hospitals in Malaysia. An automatic calculation tool is embedded in this system to assist the data analysis. The results are plotted automatically and a user-friendly graphical user interface is developed that can control the MySQL database. Case studies show breast cancer incidence rate is highest among Malay women, followed by Chinese and Indian. The peak age for breast cancer incidence is from 50 to 59 years old. Results suggest that the chance of developing breast cancer is increased in older women, and reduced with breastfeeding practice. The weight status might affect the breast cancer risk differently. Additional studies are needed to confirm these findings.
Scale-Independent Relational Query Processing
2013-10-04
source options are also available, including PostgreSQL, MySQL, and SQLite. These modern relational databases are generally very complex software systems...and Their Application to Data Stream Management. IGI Global, 2010. [68] George Reese. Database Programming with JDBC and Java, Second Edition. Ed. by
Find the fish: using PROC SQL to build a relational database
Fabrizio, Mary C.; Nelson, Scott N.
1995-01-01
Reliable estimates of abundance and survival, gained through mark-recapture studies, are necessary to better understand how to manage and restore lake trout populations in the Great Lakes. Working with a 24-year data set from a mark-recapture study conducted in Lake Superior, we attempted to disclose information on tag shedding by examining recaptures of double-tagged fish. The data set consisted of 64,288 observations on fish which had been marked with one or more tags; a subset of these fish had been marked with two tags at initial capture. Although DATA and PROC statements could be used to obtain some of the information we sought, these statements could not be used to extract a complete set of results from the double-tagging experiments. We therefore used SQL processing to create three tables representing the same information but in a fully normalized relational structure. In addition, we created indices to efficiently examine complex relationships among the individual capture records. This approach allowed us to obtain all the information necessary to estimate tag retention through subsequent modeling. We believe that our success with SQL was due in large part to its ability to simultaneously scan the same table more than once and to permit consideration of other tables in sub-queries.
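The ability to scan the same table more than once corresponds to a self-join. The sketch below pairs each double-tagged release record with the recapture records of the same fish. It uses Python's `sqlite3` module rather than SAS PROC SQL, and the `capture` table, its columns, and the sample rows are hypothetical; the abstract does not describe the three normalized tables in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE capture (
    fish_id INTEGER, capture_date TEXT,
    event TEXT,            -- 'release' or 'recapture'
    tags_present INTEGER   -- number of tags on the fish at this event
);
-- An index on the join columns, in the spirit of the indices the authors mention.
CREATE INDEX idx_capture_fish ON capture (fish_id, event);

INSERT INTO capture VALUES
    (1, '1980-05-01', 'release',   2), (1, '1982-06-10', 'recapture', 1),
    (2, '1980-05-01', 'release',   2), (2, '1981-07-15', 'recapture', 2),
    (3, '1980-05-02', 'release',   1), (3, '1983-08-01', 'recapture', 1);
""")

# Self-join: 'rel' and 'rec' are two scans of the same table.
DOUBLE_TAG_RECAPTURES = """
SELECT rel.fish_id,
       rel.tags_present AS tags_at_release,
       rec.tags_present AS tags_at_recapture
FROM capture AS rel
JOIN capture AS rec
  ON rec.fish_id = rel.fish_id AND rec.event = 'recapture'
WHERE rel.event = 'release' AND rel.tags_present = 2
ORDER BY rel.fish_id, rec.capture_date
"""

rows = conn.execute(DOUBLE_TAG_RECAPTURES).fetchall()
# Fraction of double-tagged recaptures that had lost at least one tag.
shed_fraction = sum(1 for _, _, at_recap in rows if at_recap < 2) / len(rows)
print(rows, shed_fraction)
```

The paired rows are the raw input a tag-retention model would need; the modeling itself is outside the scope of this sketch.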
Extending SQL to Support Privacy Policies
NASA Astrophysics Data System (ADS)
Ghazinour, Kambiz; Pun, Sampson; Majedi, Maryam; Chinaci, Amir H.; Barker, Ken
Increasing concerns over Internet applications that violate user privacy by exploiting (back-end) database vulnerabilities must be addressed both to protect customer privacy and to ensure corporate strategic assets remain trustworthy. This chapter describes an extension to database catalogues and Structured Query Language (SQL) for supporting privacy in Internet applications, such as in social networks, e-health, e-government, etc. The idea is to introduce new predicates to SQL commands to capture common privacy requirements, such as purpose, visibility, generalization, and retention for both mandatory and discretionary access control policies. The contribution is that corporations, when creating the underlying databases, will be able to define what their mandatory privacy policies are with which all application users have to comply. Furthermore, each application user, when providing their own data, will be able to define their own privacy policies with which other users have to comply. The extension is supported with underlying catalogues and algorithms. The experiments demonstrate a very reasonable overhead for the extension. The result is a low-cost mechanism to create new systems that are privacy aware and also to transform legacy databases to their privacy-preserving equivalents. Although the examples are from social networks, one can apply the results to data security and user privacy of other enterprises as well.
A Scalable Data Access Layer to Manage Structured Heterogeneous Biomedical Data.
Delussu, Giovanni; Lianas, Luca; Frexia, Francesca; Zanetti, Gianluigi
2016-01-01
This work presents a scalable data access layer, called PyEHR, designed to support the implementation of data management systems for secondary use of structured heterogeneous biomedical and clinical data. PyEHR adopts the openEHR's formalisms to guarantee the decoupling of data descriptions from implementation details and exploits structure indexing to accelerate searches. Data persistence is guaranteed by a driver layer with a common driver interface. Interfaces for two NoSQL Database Management Systems are already implemented: MongoDB and Elasticsearch. We evaluated the scalability of PyEHR experimentally through two types of tests, called "Constant Load" and "Constant Number of Records", with queries of increasing complexity on synthetic datasets of ten million records each, containing very complex openEHR archetype structures, distributed on up to ten computing nodes.
The Binding Database: data management and interface design.
Chen, Xi; Lin, Yuhmei; Liu, Ming; Gilson, Michael K
2002-01-01
The large and growing body of experimental data on biomolecular binding is of enormous value in developing a deeper understanding of molecular biology, in developing new therapeutics, and in various molecular design applications. However, most of these data are found only in the published literature and are therefore difficult to access and use. No existing public database has focused on measured binding affinities and has provided query capabilities that include chemical structure and sequence homology searches. We have created Binding DataBase (BindingDB), a public, web-accessible database of measured binding affinities. BindingDB is based upon a relational data specification for describing binding measurements via Isothermal Titration Calorimetry (ITC) and enzyme inhibition. A corresponding XML Document Type Definition (DTD) is used to create and parse intermediate files during the on-line deposition process and will also be used for data interchange, including collection of data from other sources. The on-line query interface, which is constructed with Java Servlet technology, supports standard SQL queries as well as searches for molecules by chemical structure and sequence homology. The on-line deposition interface uses Java Server Pages and JavaBean objects to generate dynamic HTML and to store intermediate results. The resulting data resource provides a range of functionality with brisk response-times, and lends itself well to continued development and enhancement.
Constructing a Graph Database for Semantic Literature-Based Discovery.
Hristovski, Dimitar; Kastrin, Andrej; Dinevski, Dejan; Rindflesch, Thomas C
2015-01-01
Literature-based discovery (LBD) generates discoveries, or hypotheses, by combining what is already known in the literature. Potential discoveries have the form of relations between biomedical concepts; for example, a drug may be determined to treat a disease other than the one for which it was intended. LBD views the knowledge in a domain as a network; a set of concepts along with the relations between them. As a starting point, we used SemMedDB, a database of semantic relations between biomedical concepts extracted with SemRep from Medline. SemMedDB is distributed as a MySQL relational database, which has some problems when dealing with network data. We transformed and uploaded SemMedDB into the Neo4j graph database, and implemented the basic LBD discovery algorithms with the Cypher query language. We conclude that storing the data needed for semantic LBD is more natural in a graph database. Also, implementing LBD discovery algorithms is conceptually simpler with a graph query language when compared with standard SQL.
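To show why the relational form is more cumbersome, here is a minimal sketch of open discovery in SQL: find concepts C reachable from a starting concept A through an intermediate B, where no direct A-C relation is already recorded. The single `predication` table, its columns, and the placeholder concepts are assumptions for illustration and do not reproduce the SemMedDB schema or the authors' Cypher queries. In a graph query language the same two-hop pattern is typically written as one path expression, whereas SQL needs a self-join plus an anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE predication (subject TEXT, predicate TEXT, object TEXT);
INSERT INTO predication VALUES
    ('drug_X',    'INHIBITS',        'protein_B'),
    ('protein_B', 'CAUSES',          'disease_C'),
    ('protein_B', 'ASSOCIATED_WITH', 'disease_D'),
    ('drug_X',    'TREATS',          'disease_D');
""")

OPEN_DISCOVERY = """
SELECT DISTINCT ab.subject, ab.object AS via, bc.object AS target
FROM predication AS ab
JOIN predication AS bc ON bc.subject = ab.object      -- second hop
WHERE ab.subject = ?
  AND bc.object <> ab.subject
  AND NOT EXISTS (                                     -- keep only novel A-C pairs
        SELECT 1 FROM predication AS known
        WHERE known.subject = ab.subject AND known.object = bc.object)
ORDER BY via, target
"""

hypotheses = conn.execute(OPEN_DISCOVERY, ("drug_X",)).fetchall()
print(hypotheses)
```

Each additional hop in the path adds another self-join to the SQL, which is the conceptual overhead the abstract contrasts with a graph query language.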
NASA Astrophysics Data System (ADS)
Barnsley, R. M.; Steele, Iain A.; Smith, R. J.; Mawson, Neil R.
2014-07-01
The Small Telescopes Installed at the Liverpool Telescope (STILT) project has been in operation since March 2009, collecting data with three wide field unfiltered cameras: SkycamA, SkycamT and SkycamZ. To process the data, a pipeline was developed to automate source extraction, catalogue cross-matching, photometric calibration and database storage. In this paper, modifications and further developments to this pipeline will be discussed, including a complete refactor of the pipeline's codebase into Python, migration of the back-end database technology from MySQL to PostgreSQL, and changing the catalogue used for source cross-matching from USNO-B1 to APASS. In addition to this, details will be given relating to the development of a preliminary front-end to the source extracted database which will allow a user to perform common queries such as cone searches and light curve comparisons of catalogue and non-catalogue matched objects. Some next steps and future ideas for the project will also be presented.
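A cone search returns every catalogued source within a given angular radius of a sky position. The following is a minimal in-memory sketch using the haversine formula for angular separation; the source list and its `ra`/`dec` keys (in degrees) are hypothetical, and a database front-end such as the one described would more likely evaluate the cut inside PostgreSQL than in Python.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # min() guards against rounding pushing the argument slightly above 1
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(sources, ra0, dec0, radius_deg):
    """Return the sources lying within radius_deg of (ra0, dec0)."""
    return [
        s for s in sources
        if angular_separation_deg(ra0, dec0, s["ra"], s["dec"]) <= radius_deg
    ]


# Hypothetical catalogue entries.
sources = [
    {"name": "a", "ra": 10.00, "dec": 20.0},
    {"name": "b", "ra": 10.05, "dec": 20.0},
    {"name": "c", "ra": 12.00, "dec": 20.0},
]
hits = cone_search(sources, ra0=10.0, dec0=20.0, radius_deg=0.1)
print([s["name"] for s in hits])
```

The haversine form stays numerically stable for the very small separations typical of source cross-matching, which is why it is used here instead of the plain spherical law of cosines.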
Footprint Representation of Planetary Remote Sensing Data
NASA Astrophysics Data System (ADS)
Walter, S. H. G.; Gasselt, S. V.; Michael, G.; Neukum, G.
The geometric outline of remote sensing image data, the so-called footprint, can be represented as a number of coordinate tuples. These polygons are associated with corresponding attribute information such as orbit name, ground and image resolution, solar longitude and illumination conditions to generate a powerful base for classification of planetary experiment data. Speed, handling and extended capabilities are the reasons for using geodatabases to store and access these data types. Techniques for such a spatial database of footprint data are demonstrated using the Relational Database Management System (RDBMS) PostgreSQL, spatially enabled by the PostGIS extension. As examples, footprints of the HRSC and OMEGA instruments, both onboard ESA's Mars Express Orbiter, are generated and connected to attribute information. The aim is to provide high-resolution footprints of the OMEGA instrument to the science community for the first time and make them available for web-based mapping applications like the "Planetary Interactive GIS-on-the-Web Analyzable Database" (PIGWAD), produced by the USGS. Map overlays with HRSC or other instruments like MOC and THEMIS (footprint maps are already available for these instruments and can be integrated into the database) allow on-the-fly intersection and comparison as well as extended statistics of the data. Footprint polygons are generated one by one using standard software provided by the instrument teams. Attribute data is calculated and stored together with the geometric information. In the case of HRSC, the coordinates of the footprints are already available in the VICAR label of each image file. Using the VICAR RTL and PostgreSQL's libpq C library, they are loaded into the database using the Well-Known Text (WKT) notation by the Open Geospatial Consortium, Inc. (OGC). For the OMEGA instrument, image data is read using IDL routines developed and distributed by the OMEGA team.
Image outlines are exported together with relevant attribute data to the industry standard Shapefile format. These files are translated to a Structured Query Language (SQL) command sequence suitable for insertion into the PostGIS/PostgreSQL database using the shp2pgsql data loader provided by the PostGIS software. PostgreSQL's advanced features such as geometry types, rules, operators and functions allow complex spatial queries and on-the-fly processing of data at the DBMS level, e.g. generalisation of the outlines. Processing done by the DBMS, visualisation via GIS systems and utilisation for web-based applications like mapservers will be demonstrated.
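To make the footprint representation concrete, the sketch below turns a list of corner coordinates into a WKT polygon string of the kind PostGIS accepts. It is an illustration only: the function name and the sample corner values are assumptions, and the authors' own loader uses the VICAR RTL and libpq in C.

```python
def footprint_to_wkt(corners):
    """Build a WKT POLYGON string from (lon, lat) tuples.

    WKT polygon rings must be closed, so the first vertex is repeated
    at the end when the caller has not already done so.
    """
    if len(corners) < 3:
        raise ValueError("a footprint needs at least three corners")
    ring = list(corners)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON(({coords}))"

# Hypothetical image corners in degrees (longitude, latitude).
wkt = footprint_to_wkt([(10.0, -5.0), (12.0, -5.0), (12.0, -3.0), (10.0, -3.0)])
```

A string like this can be passed to a geometry constructor in the database, after which spatial operators such as intersection work on the stored footprint.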
NASA Astrophysics Data System (ADS)
Knosp, B.; Gangl, M.; Hristova-Veleva, S. M.; Kim, R. M.; Li, P.; Turk, J.; Vu, Q. A.
2015-12-01
The JPL Tropical Cyclone Information System (TCIS) brings together satellite, aircraft, and model forecast data from several NASA, NOAA, and other data centers to assist researchers in comparing and analyzing data and model forecasts related to tropical cyclones. Since 2010, the TCIS has been running a near-real time (NRT) data portal during the North Atlantic hurricane season, which typically runs from June through October each year. Data collected by the TCIS varies by type, format, contents, and frequency and is served to the user in two ways: (1) as image overlays on a virtual globe and (2) as derived output from a suite of analysis tools. In order to support these two functions, the data must be collected and then made searchable by criteria such as date, mission, product, pressure level, and geospatial region. Creating a database architecture that is flexible enough to manage, intelligently interrogate, and ultimately present this disparate data to the user in a meaningful way has been the primary challenge. The database solution for the TCIS has been to use a hybrid MySQL + Solr implementation. After testing other relational database and NoSQL solutions, such as PostgreSQL and MongoDB respectively, this solution has given the TCIS the best offerings in terms of query speed and result reliability. This database solution also supports the challenging (and memory overwhelming) geospatial queries that are necessary to support analysis tools requested by users. Though hardly new technologies on their own, our implementation of MySQL + Solr had to be customized and tuned to be able to accurately store, index, and search the TCIS data holdings. In this presentation, we will discuss how we arrived at our MySQL + Solr database architecture, why it offers us the most consistently fast and reliable results, and how it supports our front end so that we can offer users a look into our "big data" holdings.
NASA Astrophysics Data System (ADS)
Velazquez, Enrique Israel
Improvements in medical and genomic technologies have dramatically increased the production of electronic data over the last decade. As a result, data management is rapidly becoming a major determinant, and urgent challenge, for the development of Precision Medicine. Although successful data management is achievable using Relational Database Management Systems (RDBMS), exponential data growth is a significant contributor to failure scenarios. Growing amounts of data can also be observed in other sectors, such as economics and business, which, together with the previous facts, suggests that alternate database approaches (NoSQL) may soon be required for efficient storage and management of big databases. However, this hypothesis has been difficult to test in the Precision Medicine field since alternate database architectures are complex to assess and means to integrate heterogeneous electronic health records (EHR) with dynamic genomic data are not easily available. In this dissertation, we present a novel set of experiments for identifying NoSQL database approaches that enable effective data storage and management in Precision Medicine using patients' clinical and genomic information from The Cancer Genome Atlas (TCGA). The first experiment draws on performance and scalability from biologically meaningful queries with differing complexity and database sizes. The second experiment measures performance and scalability in database updates without schema changes. The third experiment assesses performance and scalability in database updates with schema modifications due to dynamic data. We have identified two NoSQL approaches, based on Cassandra and Redis, which seem to be the ideal database management systems for our precision medicine queries in terms of performance and scalability. We present NoSQL approaches and show how they can be used to manage clinical and genomic big data.
Our research is relevant to the public health since we are focusing on one of the main challenges to the development of Precision Medicine and, consequently, investigating a potential solution to the progressively increasing demands on health care.
Murphy, SN; Barnett, GO; Chueh, HC
2000-01-01
The patient base of the Partners HealthCare System in Boston exceeds 1.8 million. Many of these patients are desirable for participation in research studies. To facilitate their discovery, we developed a data warehouse to contain clinical characteristics of these patients. The data warehouse contains diagnosis and procedures from administrative databases. The patients are indexed across institutions and their demographics provided by an Enterprise Master Patient Indexing service. Characteristics of the diagnoses and procedures such as associated providers, dates of service, inpatient/outpatient status, and other visit-related characteristics are also fed from the administrative systems. The targeted users of this system are research clinicians interested in finding patient cohorts for research studies. Their data requirements were analyzed and have been reported elsewhere. We did not expect the clinicians to become expert users of the system. Tools for querying healthcare data have traditionally been text based, although graphical interfaces have been pursued. In order to support the simple drag and drop visual model, as well as the identification and distribution of the patient data, a three-tier software architecture was developed. The user interface was developed in Visual Basic and distributed as an ActiveX object embedded in an HTML page. The middle layer was developed in Java and Microsoft COM. The queries are represented throughout their lifetime as XML objects, and the Microsoft SQL7 database is queried and managed in standard SQL. PMID:11080028
Murphy; Barnett; Chueh
2000-01-01
The patient base of the Partners HealthCare System in Boston exceeds 1.8 million. Many of these patients are desirable for participation in research studies. To facilitate their discovery, we developed a data warehouse to contain clinical characteristics of these patients. The data warehouse contains diagnosis and procedures from administrative databases. The patients are indexed across institutions and their demographics provided by an Enterprise Master Patient Indexing service. Characteristics of the diagnoses and procedures such as associated providers, dates of service, inpatient/outpatient status, and other visit-related characteristics are also fed from the administrative systems. The targeted users of this system are research clinicians interested in finding patient cohorts for research studies. Their data requirements were analyzed and have been reported elsewhere. We did not expect the clinicians to become expert users of the system. Tools for querying healthcare data have traditionally been text based, although graphical interfaces have been pursued. In order to support the simple drag and drop visual model, as well as the identification and distribution of the patient data, a three-tier software architecture was developed. The user interface was developed in Visual Basic and distributed as an ActiveX object embedded in an HTML page. The middle layer was developed in Java and Microsoft COM. The queries are represented throughout their lifetime as XML objects, and the Microsoft SQL7 database is queried and managed in standard SQL.
ESTminer: a Web interface for mining EST contig and cluster databases.
Huang, Yecheng; Pumphrey, Janie; Gingle, Alan R
2005-03-01
ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame that allows the selection of contigs/clusters with specific cDNA library makeup or a threshold number of members. The results are displayed as color-coded tree nodes, where the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user configurable links. Also, the interface allows 'queries within queries' where the result set of a query is further filtered by the subsequent query. ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp agingle@uga.edu.
The Protein Disease Database of human body fluids: II. Computer methods and data issues.
Lemkin, P F; Orr, G A; Goldstein, M P; Creed, G J; Myrick, J E; Merril, C R
1995-01-01
The Protein Disease Database (PDD) is a relational database of proteins and diseases. With this database it is possible to screen for quantitative protein abnormalities associated with disease states. These quantitative relationships use data drawn from the peer-reviewed biomedical literature. Assays may also include those observed in high-resolution electrophoretic gels that offer the potential to quantitate many proteins in a single test as well as data gathered by enzymatic or immunologic assays. We are using the Internet World Wide Web (WWW) and the Web browser paradigm as an access method for wide distribution and querying of the Protein Disease Database. The WWW hypertext transfer protocol and its Common Gateway Interface make it possible to build powerful graphical user interfaces that can support easy-to-use data retrieval using query specification forms or images. The details of these interactions are totally transparent to the users of these forms. Using a client-server SQL relational database, user query access, initial data entry and database maintenance are all performed over the Internet with a Web browser. We discuss the underlying design issues, mapping mechanisms and assumptions that we used in constructing the system, data entry, access to the database server, security, and synthesis of derived two-dimensional gel image maps and hypertext documents resulting from SQL database searches.
Microsoft Repository Version 2 and the Open Information Model.
ERIC Educational Resources Information Center
Bernstein, Philip A.; Bergstraesser, Thomas; Carlson, Jason; Pal, Shankar; Sanders, Paul; Shutt, David
1999-01-01
Describes the programming interface and implementation of the repository engine and the Open Information Model for Microsoft Repository, an object-oriented meta-data management facility that ships in Microsoft Visual Studio and Microsoft SQL Server. Discusses Microsoft's component object model, object manipulation, queries, and information…
Informatics in radiology: use of CouchDB for document-based storage of DICOM objects.
Rascovsky, Simón J; Delgado, Jorge A; Sanz, Alexander; Calvo, Víctor D; Castrillón, Gabriel
2012-01-01
Picture archiving and communication systems traditionally have depended on schema-based Structured Query Language (SQL) databases for imaging data management. To optimize database size and performance, many such systems store a reduced set of Digital Imaging and Communications in Medicine (DICOM) metadata, discarding informational content that might be needed in the future. As an alternative to traditional database systems, document-based key-value stores recently have gained popularity. These systems store documents containing key-value pairs that facilitate data searches without predefined schemas. Document-based key-value stores are especially suited to archive DICOM objects because DICOM metadata are highly heterogeneous collections of tag-value pairs conveying specific information about imaging modalities, acquisition protocols, and vendor-supported postprocessing options. The authors used an open-source document-based database management system (Apache CouchDB) to create and test two such databases; CouchDB was selected for its overall ease of use, capability for managing attachments, and reliance on HTTP and Representational State Transfer standards for accessing and retrieving data. A large database was created first in which the DICOM metadata from 5880 anonymized magnetic resonance imaging studies (1,949,753 images) were loaded by using a Ruby script. To provide the usual DICOM query functionality, several predefined "views" (standard queries) were created by using JavaScript. For performance comparison, the same queries were executed in both the CouchDB database and a SQL-based DICOM archive. The capabilities of CouchDB for attachment management and database replication were separately assessed in tests of a similar, smaller database. 
Results showed that CouchDB allowed efficient storage and interrogation of all DICOM objects; with the use of information retrieval algorithms such as map-reduce, all the DICOM metadata stored in the large database were searchable with only a minimal increase in retrieval time over that with the traditional database management system. Results also indicated possible uses for document-based databases in data mining applications such as dose monitoring, quality assurance, and protocol optimization. RSNA, 2012
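The "views" described above pair a map function, which emits key-value pairs from each schema-free document, with a reduce step over the emitted values. The sketch below imitates that pattern in plain Python so it can be run without a CouchDB server. It is an assumption-laden illustration: the authors wrote their views in JavaScript, and the sample documents and the helper `run_view` are hypothetical.

```python
from collections import defaultdict

def map_by_modality(doc):
    """Map step: emit (key, value) pairs for one document.
    Documents lacking the tag simply emit nothing, so no schema is needed."""
    if "Modality" in doc:
        yield doc["Modality"], 1

def run_view(docs, map_fn, reduce_fn):
    """Group every emitted value by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}

# Hypothetical DICOM metadata documents with heterogeneous tag sets.
docs = [
    {"_id": "img1", "Modality": "MR", "SeriesDescription": "T1 axial"},
    {"_id": "img2", "Modality": "MR", "RepetitionTime": 500},
    {"_id": "img3", "Modality": "CT"},
    {"_id": "img4", "PatientPosition": "HFS"},
]
counts = run_view(docs, map_by_modality, sum)
```

Because the map function only inspects the tags it cares about, documents with very different tag sets can share one database, which is the property the abstract highlights for DICOM metadata.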
The establishment and use of the point source catalog database of the 2MASS near infrared survey
NASA Astrophysics Data System (ADS)
Gao, Y. F.; Shan, H. G.; Cheng, D.
2003-02-01
The 2MASS near infrared survey project is introduced briefly. The 2MASS point source catalog (2MASS PSC) database and the network query system are established by using the PHP Hypertext Preprocessor and MySQL database server. By using the system, one can not only query information on sources listed in the catalog, but also draw the related plots. Moreover, after the 2MASS data are diagnosed, some research fields that can benefit from this database are suggested.
ADVICE--Educational System for Teaching Database Courses
ERIC Educational Resources Information Center
Cvetanovic, M.; Radivojevic, Z.; Blagojevic, V.; Bojovic, M.
2011-01-01
This paper presents a Web-based educational system, ADVICE, that helps students to bridge the gap between database management system (DBMS) theory and practice. The usage of ADVICE is presented through a set of laboratory exercises developed to teach students conceptual and logical modeling, SQL, formal query languages, and normalization. While…
A Scalable Data Access Layer to Manage Structured Heterogeneous Biomedical Data
Lianas, Luca; Frexia, Francesca; Zanetti, Gianluigi
2016-01-01
This work presents a scalable data access layer, called PyEHR, designed to support the implementation of data management systems for secondary use of structured heterogeneous biomedical and clinical data. PyEHR adopts the openEHR’s formalisms to guarantee the decoupling of data descriptions from implementation details and exploits structure indexing to accelerate searches. Data persistence is guaranteed by a driver layer with a common driver interface. Interfaces for two NoSQL Database Management Systems are already implemented: MongoDB and Elasticsearch. We evaluated the scalability of PyEHR experimentally through two types of tests, called “Constant Load” and “Constant Number of Records”, with queries of increasing complexity on synthetic datasets of ten million records each, containing very complex openEHR archetype structures, distributed on up to ten computing nodes. PMID:27936191
The MAO NASU Plate Archive Database. Current Status and Perspectives
NASA Astrophysics Data System (ADS)
Pakuliak, L. K.; Sergeeva, T. P.
2006-04-01
The preliminary online version of the database of the MAO NASU plate archive is built on the relational database management system MySQL. It permits easy extension of the database with new collections of astronegatives, and provides high flexibility in constructing SQL queries for data search optimization, access to the administrative interface protected by PHP Basic Authorization, and a wide range of search parameters. The current status of the database will be reported, and a brief description of the search engine and the means of supporting database integrity will be given. Methods and means of data verification and tasks for further development will be discussed.
New tools and methods for direct programmatic access to the dbSNP relational database.
Saccone, Scott F; Quan, Jiaxi; Mehta, Gaurang; Bolze, Raphael; Thomas, Prasanth; Deelman, Ewa; Tischfield, Jay A; Rice, John P
2011-01-01
Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale.
Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen
2014-12-01
Numerous public microarray datasets are valuable resources for the scientific communities. Several online tools have made great steps to use these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and result exhibition still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are exhibited by an interactive tree-structured graphics, which provide an overview for related experiments and might enlighten users on key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. Web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
2MASS Catalog Server Kit Version 2.1
NASA Astrophysics Data System (ADS)
Yamauchi, C.
2013-10-01
The 2MASS Catalog Server Kit is open source software for easily constructing a high performance search server for important astronomical catalogs. This software utilizes the open source RDBMS PostgreSQL; therefore, any user can set up the database on a local computer by following the step-by-step installation guide. The kit provides highly optimized stored functions for positional searches similar to SDSS SkyServer. Together with these, the powerful SQL environment of PostgreSQL will meet various users' demands. We released 2MASS Catalog Server Kit version 2.1 in May 2012, which supports the latest WISE All-Sky catalog (563,921,584 rows) and 9 major all-sky catalogs. Local databases are often indispensable for observatories with unstable or narrow-band networks or for heavy use, such as retrieving large numbers of records within a short period of time. This software is well suited to such purposes, and the additional supported catalogs and improvements in version 2.1 cover a wider range of applications, including advanced calibration systems, scientific studies using complicated SQL queries, etc. Official page: http://www.ir.isas.jaxa.jp/~cyamauch/2masskit/
The Ruby UCSC API: accessing the UCSC genome database using Ruby.
Mishima, Hiroyuki; Aerts, Jan; Katayama, Toshiaki; Bonnal, Raoul J P; Yoshiura, Koh-ichiro
2012-09-21
The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby. The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast. The API uses the bin index, if available, when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby). Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will help biologists query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help is provided via the website at http://rubyucscapi.userecho.com/.
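The bin index mentioned above speeds up interval queries by assigning each feature to the smallest fixed-size bin that fully contains it. The abstract does not spell out the scheme, so the sketch below follows the standard UCSC binning layout as it is commonly described (five levels, 128 kb smallest bins, each level eight times larger). It is written in Python rather than Ruby, the function name is hypothetical, and the constants should be checked against the UCSC source before being relied on.

```python
# Offsets of the first bin at each level, finest level first:
# 512+64+8+1, 64+8+1, 8+1, 1, 0
BIN_OFFSETS = (585, 73, 9, 1, 0)
FIRST_SHIFT = 17   # 2**17 bases = 128 kb, the size of the smallest bins
NEXT_SHIFT = 3     # each coarser level covers 8 times more sequence

def bin_from_range(start, end):
    """Smallest bin containing the interval [start, end).

    `start` is 0-based and `end` is exclusive.
    """
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:           # interval fits in one bin here
            return offset + start_bin
        start_bin >>= NEXT_SHIFT           # otherwise try the coarser level
        end_bin >>= NEXT_SHIFT
    raise ValueError("interval does not fit the 512 Mb standard scheme")

small = bin_from_range(0, 100)             # fits in the first 128 kb bin
spanning = bin_from_range(0, 131073)       # crosses a 128 kb boundary
```

A query for a region then only needs to look at the handful of bins that can overlap it, at every level, instead of scanning all rows for the chromosome.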
The Ruby UCSC API: accessing the UCSC genome database using Ruby
2012-01-01
Background The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby. Results The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast. The API uses the bin index—if available—when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby). Conclusions Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will facilitate biologists to query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help is provided via the website at http://rubyucscapi.userecho.com/. PMID:22994508
NASA Astrophysics Data System (ADS)
Merticariu, Vlad; Misev, Dimitar; Baumann, Peter
2017-04-01
While Python has developed into the lingua franca of data science, there is often a paradigm break when accessing specialized tools. In particular for one of the core data categories in science and engineering, massive multi-dimensional arrays, out-of-memory solutions typically employ their own, different models. We discuss this situation using the example of the scalable open-source array engine rasdaman ("raster data manager"), which offers access to and processing of petascale multi-dimensional arrays through an SQL-style array query language, rasql. Such queries are executed in the server on a storage engine utilizing adaptive array partitioning and based on a processing engine implementing a "tile streaming" paradigm to allow processing of arrays massively larger than server RAM. The rasdaman QL has acted as a blueprint for the forthcoming ISO Array SQL and the Open Geospatial Consortium (OGC) geo analytics language, Web Coverage Processing Service, adopted in 2008. Not surprisingly, rasdaman is the OGC and INSPIRE Reference Implementation for their "Big Earth Data" standards suite. Recently, rasdaman has been augmented with a Python interface which allows transparent interaction with the database (credits go to Siddharth Shukla's Master Thesis at Jacobs University). Programmers do not need to know the rasdaman query language, as the operators are silently transformed, through lazy evaluation, into queries. Arrays delivered are likewise automatically transformed into their Python representation. In the talk, the rasdaman concept will be illustrated with the help of large-scale real-life examples of operational satellite image and weather data services, and sample Python code.
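The lazy-evaluation idea, where ordinary Python operators build up a query instead of computing anything locally, can be sketched as follows. This is not the rasdaman Python interface: the class `LazyArray`, its methods, and the query text it produces are all hypothetical, and the generated string is illustrative pseudo-syntax rather than verified rasql.

```python
class LazyArray:
    """Hypothetical deferred array expression.

    Operators only record what should be computed; no data is touched
    until the expression is turned into a query string.
    """

    def __init__(self, expr):
        self.expr = expr

    def _binary(self, op, other):
        rhs = other.expr if isinstance(other, LazyArray) else repr(other)
        return LazyArray(f"({self.expr} {op} {rhs})")

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __truediv__(self, other):
        return self._binary("/", other)

    def to_query(self, collection):
        """Render the recorded expression as one server-side query."""
        return f"SELECT {self.expr} FROM {collection} AS c"

# Two hypothetical bands of a stored image collection.
nir = LazyArray("c.nir")
red = LazyArray("c.red")
ndvi = (nir - red) / (nir + red)      # builds an expression tree only
query = ndvi.to_query("scenes")
```

Deferring execution this way lets the whole expression be shipped to the server in one request, so arrays larger than client memory never have to be downloaded for intermediate steps.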
NASA Astrophysics Data System (ADS)
Lyapin, Sergey; Kukovyakin, Alexey
Within the framework of the research program "Textaurus", an operational prototype of the multifunctional library T-Libra v.4.1 has been created, which makes it possible to carry out flexible, parametrizable searches within a full-text database. The information system is realized in a Web-browser / Web-server / SQL-server architecture. This achieves an optimal combination of universality and efficiency of text processing on the one hand, and convenience and minimal expense for the end user (a standard Web browser serves as the client application) on the other. The following principles underlie the information system: a) multifunctionality, b) intelligence, c) multilingual primary texts and full-text searching, d) development of the digital library (DL) by a user ("administrative client"), e) multi-platform operation. A "library of concepts", i.e. a block of functional models of semantic (concept-oriented) searching, together with a closely connected subsystem of parametrizable queries to a full-text database, serves as the conceptual basis of the multifunctionality and "intelligence" of the DL T-Libra v.4.1. In the suggested technology, the unit of full-text searching is an author's paragraph. The "logic" of an educational or scientific topic or problem can thus be built into a flexible multilevel query structure and into the "library of concepts", which developers and experts can extend. About 10 queries of various levels of complexity and conceptuality are realized in the suggested version of the information system: from simple terminological searching (taking into account lexical and grammatical paradigms of Russian) to several kinds of explication of terminological fields and adjustable two-parameter thematic searching (the parameters being a [set of terms] and a [distance between terms] within the limits of an author's paragraph).
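The two-parameter thematic search described above (a set of terms and a maximum distance between them, evaluated inside one paragraph) can be sketched as follows. This is a simplified illustration, not the T-Libra implementation: it matches exact word forms only, ignoring the lexical and grammatical paradigms the system handles, and the function names are assumptions.

```python
import re

def tokenize(text):
    """Lower-cased word tokens of a paragraph."""
    return re.findall(r"\w+", text.lower())

def paragraph_matches(paragraph, terms, max_distance):
    """True if all terms occur within a span of at most max_distance
    word positions somewhere in the paragraph."""
    tokens = tokenize(paragraph)
    wanted = {t.lower() for t in terms}
    for i, tok in enumerate(tokens):
        if tok in wanted:
            # Any qualifying span starts at an occurrence of one of the terms.
            window = set(tokens[i:i + max_distance + 1])
            if wanted <= window:
                return True
    return False

def search(paragraphs, terms, max_distance):
    """Indices of the paragraphs that satisfy the proximity query."""
    return [i for i, p in enumerate(paragraphs)
            if paragraph_matches(p, terms, max_distance)]

# Hypothetical paragraphs.
paragraphs = [
    "The query language maps terms to concepts.",
    "A query was issued. Much later, after many other words, a concept appeared.",
    "Nothing relevant here.",
]
hits = search(paragraphs, ["query", "concepts"], max_distance=5)
```

Tightening `max_distance` narrows the result set, which is how the distance parameter acts as an adjustable measure of thematic closeness.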
EasyKSORD: A Platform of Keyword Search Over Relational Databases
NASA Astrophysics Data System (ADS)
Peng, Zhaohui; Li, Jing; Wang, Shan
Keyword Search Over Relational Databases (KSORD) enables casual users to use keyword queries (a set of keywords) to search relational databases just like searching the Web, without any knowledge of the database schema or any need of writing SQL queries. Based on our previous work, we design and implement a novel KSORD platform named EasyKSORD for users and system administrators to use and manage different KSORD systems in a novel and simple manner. EasyKSORD supports advanced queries, efficient data-graph-based search engines, multiform result presentations, and system logging and analysis. Through EasyKSORD, users can search relational databases easily and read search results conveniently, and system administrators can easily monitor and analyze the operations of KSORD and manage KSORD systems much better.
A Hybrid Spatio-Temporal Data Indexing Method for Trajectory Databases
Ke, Shengnan; Gong, Jun; Li, Songnian; Zhu, Qing; Liu, Xintao; Zhang, Yeting
2014-01-01
In recent years, there has been tremendous growth in the field of indoor and outdoor positioning sensors continuously producing huge volumes of trajectory data that has been used in many fields such as location-based services or location intelligence. Trajectory data is massively increased and semantically complicated, which poses a great challenge on spatio-temporal data indexing. This paper proposes a spatio-temporal data indexing method, named HBSTR-tree, which is a hybrid index structure comprising spatio-temporal R-tree, B*-tree and Hash table. To improve the index generation efficiency, rather than directly inserting trajectory points, we group consecutive trajectory points as nodes according to their spatio-temporal semantics and then insert them into spatio-temporal R-tree as leaf nodes. Hash table is used to manage the latest leaf nodes to reduce the frequency of insertion. A new spatio-temporal interval criterion and a new node-choosing sub-algorithm are also proposed to optimize spatio-temporal R-tree structures. In addition, a B*-tree sub-index of leaf nodes is built to query the trajectories of targeted objects efficiently. Furthermore, a database storage scheme based on a NoSQL-type DBMS is also proposed for the purpose of cloud storage. Experimental results prove that HBSTR-tree outperforms TB*-tree in some aspects such as generation efficiency, query performance and query type. PMID:25051028
A hybrid spatio-temporal data indexing method for trajectory databases.
Ke, Shengnan; Gong, Jun; Li, Songnian; Zhu, Qing; Liu, Xintao; Zhang, Yeting
2014-07-21
In recent years, there has been tremendous growth in the field of indoor and outdoor positioning sensors continuously producing huge volumes of trajectory data that has been used in many fields such as location-based services or location intelligence. Trajectory data is massively increased and semantically complicated, which poses a great challenge on spatio-temporal data indexing. This paper proposes a spatio-temporal data indexing method, named HBSTR-tree, which is a hybrid index structure comprising spatio-temporal R-tree, B*-tree and Hash table. To improve the index generation efficiency, rather than directly inserting trajectory points, we group consecutive trajectory points as nodes according to their spatio-temporal semantics and then insert them into spatio-temporal R-tree as leaf nodes. Hash table is used to manage the latest leaf nodes to reduce the frequency of insertion. A new spatio-temporal interval criterion and a new node-choosing sub-algorithm are also proposed to optimize spatio-temporal R-tree structures. In addition, a B*-tree sub-index of leaf nodes is built to query the trajectories of targeted objects efficiently. Furthermore, a database storage scheme based on a NoSQL-type DBMS is also proposed for the purpose of cloud storage. Experimental results prove that HBSTR-tree outperforms TB*-tree in some aspects such as generation efficiency, query performance and query type.
Establishment and Assessment of Plasma Disruption and Warning Databases from EAST
NASA Astrophysics Data System (ADS)
Wang, Bo; Robert, Granetz; Xiao, Bingjia; Li, Jiangang; Yang, Fei; Li, Junjun; Chen, Dalong
2016-12-01
A disruption database and a disruption warning database for the EAST tokamak have been established by a disruption research group. The disruption database, based on Structured Query Language (SQL), comprises 41 disruption parameters, which include current quench characteristics, EFIT equilibrium characteristics, kinetic parameters, halo currents, and vertical motion. Presently most disruption databases are based on plasma experiments of non-superconducting tokamak devices. The purposes of the EAST database are to find disruption characteristics and disruption statistics for the fully superconducting tokamak EAST, to elucidate the physics underlying tokamak disruptions, to explore the influence of disruption on superconducting magnets and to extrapolate toward future burning plasma devices. In order to quantitatively assess the usefulness of various plasma parameters for predicting disruptions, an SQL database for EAST similar to that of Alcator C-Mod has been created by compiling values for a number of proposed disruption-relevant parameters sampled from all plasma discharges in the 2015 campaign. The detailed statistical results and analysis of the two databases on the EAST tokamak are presented. Supported by the National Magnetic Confinement Fusion Science Program of China (No. 2014GB103000).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Enders, Alexander L.; Lousteau, Angela L.
The Desktop Analysis Reporting Tool (DART) is a software package that allows users to easily view and analyze daily files that span long periods. DART gives users the capability to quickly determine the state of health of a radiation portal monitor (RPM), troubleshoot and diagnose problems, and view data in various time frames to perform trend analysis. In short, it converts the data strings written in the daily files into meaningful tables and plots. The standalone version of DART (“soloDART”) utilizes a database engine that is included with the application; no additional installations are necessary. There is also a networked version of DART (“polyDART”) that is designed to maximize the benefit of a centralized data repository while distributing the workload to individual desktop machines. This networked approach requires a more complex database manager, Structured Query Language (SQL) Server; however, SQL Server is not currently provided with DART. Regardless of which version is used, DART will import daily files from RPMs, store the relevant data in its database, and produce reports for status, trend analysis, and reporting purposes.
A Novel Approach of Indexing and Retrieving Spatial Polygons for Efficient Spatial Region Queries
NASA Astrophysics Data System (ADS)
Zhao, J. H.; Wang, X. Z.; Wang, F. Y.; Shen, Z. H.; Zhou, Y. C.; Wang, Y. L.
2017-10-01
Spatial region queries are more and more widely used in web-based applications. Mechanisms to provide efficient query processing over geospatial data are essential. However, due to the massive geospatial data volume, heavy geometric computation, and high access concurrency, it is difficult to get a response in real time. Spatial indexes are usually used in this situation. In this paper, based on the k-d tree, we introduce a distributed KD-Tree (DKD-Tree) suitable for polygon data, and a two-step query algorithm. The spatial index construction is recursive and iterative, and the query is an in-memory process. Both the index and query methods can be processed in parallel, and are implemented based on HDFS, Spark and Redis. Experiments on a large volume of remote sensing image metadata have been carried out, and the advantages of our method are investigated by comparing with spatial region queries executed on PostgreSQL and PostGIS. Results show that our approach not only greatly improves the efficiency of spatial region queries, but also has good scalability. Moreover, the two-step spatial range query algorithm can also save cluster resources to support a large number of concurrent queries. Therefore, this method is very useful when building large geographic information systems.
NASA Astrophysics Data System (ADS)
Indrayana, I. N. E.; P, N. M. Wirasyanti D.; Sudiartha, I. KG
2018-01-01
Mobile applications allow many users to access data without being limited by space and time. Over time, the data population of such an application will increase. Data access time becomes a problem once the data has reached tens of thousands to millions of records. The objective of this research is to maintain the performance of data execution for large data records. One way to maintain data access time performance is to apply a query optimization method. The optimization used in this research is the heuristic query optimization method. The built application is a mobile-based financial application using a MySQL database with stored procedures. This application is used by more than one business entity in one database, thus enabling rapid data growth. The stored procedure contains a query optimized using the heuristic method. Query optimization is performed on a “Select” query that involves more than one table with multiple clauses. Evaluation is done by calculating the average access time using optimized and unoptimized queries. Access time is also measured as the data population in the database increases. The evaluation results show that data execution with heuristic query optimization is relatively faster than data execution without query optimization.
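One classic heuristic rule is to apply selections and projections before joins. The sketch below illustrates that rule only; it is not the authors' stored procedure. It uses SQLite from the Python standard library as a stand-in for MySQL, and the `entity` and `txn` tables are hypothetical. Whether the rewritten form is actually faster depends on the engine, since many optimizers perform this pushdown themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE txn (id INTEGER PRIMARY KEY, entity_id INTEGER,
                      amount REAL, year INTEGER);
    INSERT INTO entity VALUES (1, 'Shop A'), (2, 'Shop B');
    INSERT INTO txn VALUES (1, 1, 10.0, 2017), (2, 1, 25.0, 2018),
                           (3, 2, 40.0, 2018), (4, 2, 5.0, 2016);
""")

# Unoptimized form: join the full tables, then filter the joined rows.
naive = """
    SELECT e.name, t.amount
    FROM entity e JOIN txn t ON t.entity_id = e.id
    WHERE t.year = 2018 AND e.name = 'Shop B'
    ORDER BY t.id
"""

# Heuristic rewrite: restrict and project each table first, then join
# the (smaller) intermediate results.
pushed = """
    SELECT e.name, t.amount
    FROM (SELECT id, name FROM entity WHERE name = 'Shop B') e
    JOIN (SELECT id, entity_id, amount FROM txn WHERE year = 2018) t
      ON t.entity_id = e.id
    ORDER BY t.id
"""

naive_rows = conn.execute(naive).fetchall()
pushed_rows = conn.execute(pushed).fetchall()
```

Both statements return the same rows; the rewrite changes only the order in which the work is expressed, which is the property any heuristic transformation has to preserve.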
Performance Prediction of a MongoDB-Based Traceability System in Smart Factory Supply Chains
Kang, Yong-Shin; Park, Il-Ha; Youm, Sekyoung
2016-01-01
In the future, with the advent of the smart factory era, manufacturing and logistics processes will become more complex, and the complexity and criticality of traceability will further increase. This research aims at developing a performance assessment method to verify scalability when implementing traceability systems based on key technologies for smart factories, such as Internet of Things (IoT) and BigData. To this end, based on existing research, we analyzed traceability requirements and an event schema for storing traceability data in MongoDB, a document-based Not Only SQL (NoSQL) database. Next, we analyzed the algorithm of the most representative traceability query and defined a query-level performance model, which is composed of response times for the components of the traceability query algorithm. Next, this performance model was solidified as a linear regression model because the response times increase linearly by a benchmark test. Finally, for a case analysis, we applied the performance model to a virtual automobile parts logistics. As a result of the case study, we verified the scalability of a MongoDB-based traceability system and predicted the point when data node servers should be expanded in this case. The traceability system performance assessment method proposed in this research can be used as a decision-making tool for hardware capacity planning during the initial stage of construction of traceability systems and during their operational phase. PMID:27983654
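The idea of turning benchmark response times into a linear model and then reading off a capacity limit might look roughly like the sketch below. The benchmark numbers, the 100 ms limit, and the function names are made up for illustration; the paper's model is composed of several per-component response times, which this single-variable fit does not reproduce.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical benchmark: stored events (millions) vs. query response time (ms).
events_millions = [1, 2, 3, 4, 5]
response_ms = [12.0, 22.0, 32.0, 42.0, 52.0]

slope, intercept = fit_line(events_millions, response_ms)


def events_at_limit(limit_ms):
    """Data volume at which the fitted response time reaches `limit_ms`."""
    return (limit_ms - intercept) / slope


# Volume (in millions of events) at which an assumed 100 ms limit is reached.
expansion_point = events_at_limit(100.0)
```

With a fit like this, the expansion point is the data volume at which the predicted response time crosses the agreed service limit, which is the kind of figure capacity planning needs.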
Rasdaman for Big Spatial Raster Data
NASA Astrophysics Data System (ADS)
Hu, F.; Huang, Q.; Scheele, C. J.; Yang, C. P.; Yu, M.; Liu, K.
2015-12-01
Spatial raster data have grown exponentially over the past decade. Recent advancements on data acquisition technology, such as remote sensing, have allowed us to collect massive observation data of various spatial resolution and domain coverage. The volume, velocity, and variety of such spatial data, along with the computational intensive nature of spatial queries, pose grand challenge to the storage technologies for effective big data management. While high performance computing platforms (e.g., cloud computing) can be used to solve the computing-intensive issues in big data analysis, data has to be managed in a way that is suitable for distributed parallel processing. Recently, rasdaman (raster data manager) has emerged as a scalable and cost-effective database solution to store and retrieve massive multi-dimensional arrays, such as sensor, image, and statistics data. Within this paper, the pros and cons of using rasdaman to manage and query spatial raster data will be examined and compared with other common approaches, including file-based systems, relational databases (e.g., PostgreSQL/PostGIS), and NoSQL databases (e.g., MongoDB and Hive). Earth Observing System (EOS) data collected from NASA's Atmospheric Scientific Data Center (ASDC) will be used and stored in these selected database systems, and a set of spatial and non-spatial queries will be designed to benchmark their performance on retrieving large-scale, multi-dimensional arrays of EOS data. Lessons learnt from using rasdaman will be discussed as well.
Performance Prediction of a MongoDB-Based Traceability System in Smart Factory Supply Chains.
Kang, Yong-Shin; Park, Il-Ha; Youm, Sekyoung
2016-12-14
In the future, with the advent of the smart factory era, manufacturing and logistics processes will become more complex, and the complexity and criticality of traceability will further increase. This research aims at developing a performance assessment method to verify scalability when implementing traceability systems based on key technologies for smart factories, such as Internet of Things (IoT) and BigData. To this end, based on existing research, we analyzed traceability requirements and an event schema for storing traceability data in MongoDB, a document-based Not Only SQL (NoSQL) database. Next, we analyzed the algorithm of the most representative traceability query and defined a query-level performance model, which is composed of response times for the components of the traceability query algorithm. Next, this performance model was solidified as a linear regression model because the response times increase linearly by a benchmark test. Finally, for a case analysis, we applied the performance model to a virtual automobile parts logistics. As a result of the case study, we verified the scalability of a MongoDB-based traceability system and predicted the point when data node servers should be expanded in this case. The traceability system performance assessment method proposed in this research can be used as a decision-making tool for hardware capacity planning during the initial stage of construction of traceability systems and during their operational phase.
An Extensible Schema-less Database Framework for Managing High-throughput Semi-Structured Documents
NASA Technical Reports Server (NTRS)
Maluf, David A.; Tran, Peter B.; La, Tracy; Clancy, Daniel (Technical Monitor)
2002-01-01
An object-relational database management system is an integrated hybrid cooperative approach that combines the best practices of both the relational model utilizing SQL queries and the object-oriented, semantic paradigm for supporting complex data creation. In this paper, a highly scalable, information-on-demand database framework called NETMARK is introduced. NETMARK takes advantage of the Oracle 8i object-relational database, using physical address data types for very efficient keyword searches of records for both context and content. NETMARK was originally developed in early 2000 as a research and development prototype to handle the vast amounts of unstructured and semi-structured documents existing within NASA enterprises. Today, NETMARK is a flexible, high-throughput open database framework for managing, storing, and searching unstructured or semi-structured arbitrary hierarchical models, XML and HTML.
NASA Technical Reports Server (NTRS)
Maluf, David A.; Tran, Peter B.
2003-01-01
An object-relational database management system is an integrated hybrid cooperative approach that combines the best practices of both the relational model utilizing SQL queries and the object-oriented, semantic paradigm for supporting complex data creation. In this paper, a highly scalable, information-on-demand database framework, called NETMARK, is introduced. NETMARK takes advantage of the Oracle 8i object-relational database, using physical address data types for very efficient keyword search of records spanning both context and content. NETMARK was originally developed in early 2000 as a research and development prototype to handle the vast amounts of unstructured and semi-structured documents existing within NASA enterprises. Today, NETMARK is a flexible, high-throughput open database framework for managing, storing, and searching unstructured or semi-structured arbitrary hierarchical models, such as XML and HTML.
Cyclone: java-based querying and computing with Pathway/Genome databases.
Le Fèvre, François; Smidtas, Serge; Schächter, Vincent
2007-05-15
Cyclone aims at facilitating the use of BioCyc, a collection of Pathway/Genome Databases (PGDBs). Cyclone provides a fully extensible Java Object API to analyze and visualize these data. Cyclone can read and write PGDBs, and can write its own data in the CycloneML format. This format is automatically generated from the BioCyc ontology by Cyclone itself, ensuring continued compatibility. Cyclone objects can also be stored in a relational database CycloneDB. Queries can be written in SQL, and in an intuitive and concise object-oriented query language, Hibernate Query Language (HQL). In addition, Cyclone interfaces easily with Java software including the Eclipse IDE for HQL edition, the Jung API for graph algorithms or Cytoscape for graph visualization. Cyclone is freely available under an open source license at: http://sourceforge.net/projects/nemo-cyclone. For download and installation instructions, tutorials, use cases and examples, see http://nemo-cyclone.sourceforge.net.
HippDB: a database of readily targeted helical protein-protein interactions.
Bergey, Christina M; Watkins, Andrew M; Arora, Paramjit S
2013-11-01
HippDB catalogs every protein-protein interaction whose structure is available in the Protein Data Bank and which exhibits one or more helices at the interface. The Web site accepts queries on variables such as helix length and sequence, and it provides computational alanine scanning and change in solvent-accessible surface area values for every interfacial residue. HippDB is intended to serve as a starting point for structure-based small molecule and peptidomimetic drug development. HippDB is freely available on the web at http://www.nyu.edu/projects/arora/hippdb. The Web site is implemented in PHP, MySQL and Apache. Source code, implemented in Perl and supported on Linux, is freely available for download at http://code.google.com/p/helidb. Contact: arora@nyu.edu.
An approach in building a chemical compound search engine in oracle database.
Wang, H; Volarath, P; Harrison, R
2005-01-01
Searching for or identifying chemical compounds is an important process in drug design and in chemistry research. An efficient search engine involves a close coupling of the search algorithm and database implementation. The database must process chemical structures, which demands approaches to represent, store, and retrieve structures in a database system. In this paper, a general database framework for working as a chemical compound search engine in Oracle database is described. The framework is devoted to eliminating data type constraints for potential search algorithms, which is a crucial step toward building a domain-specific query language on top of SQL. A search engine implementation based on the database framework is also demonstrated. The convenience of the implementation emphasizes the efficiency and simplicity of the framework.
FastBit: Interactively Searching Massive Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Kesheng; Ahern, Sean; Bethel, E. Wes
2009-06-23
As scientific instruments and computer simulations produce more and more data, the task of locating the essential information to gain insight becomes increasingly difficult. FastBit is an efficient software tool to address this challenge. In this article, we present a summary of the key underlying technologies, namely bitmap compression, encoding, and binning. Together these techniques enable FastBit to answer structured (SQL) queries orders of magnitude faster than popular database systems. To illustrate how FastBit is used in applications, we present three examples involving a high-energy physics experiment, a combustion simulation, and an accelerator simulation. In each case, FastBit significantly reduces the response time and enables interactive exploration on terabytes of data.
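To make the binning idea concrete, here is a minimal sketch of a binned bitmap index that answers a range query. It is not FastBit's API or implementation: the class name and bin edges are hypothetical, Python integers stand in for bit vectors, and the compression and alternative encodings mentioned in the abstract are left out.

```python
from bisect import bisect_right


class BinnedBitmapIndex:
    """One bitmap per bin; bit r of a bitmap is set if row r falls in that bin."""

    def __init__(self, values, edges):
        # Bin i covers the half-open interval [edges[i], edges[i + 1]).
        self.values = values
        self.edges = edges
        self.bitmaps = [0] * (len(edges) - 1)
        for row, value in enumerate(values):
            self.bitmaps[self._bin(value)] |= 1 << row

    def _bin(self, value):
        i = bisect_right(self.edges, value) - 1
        if not 0 <= i < len(self.bitmaps):
            raise ValueError(f"{value} is outside the binned range")
        return i

    def range_query(self, lo, hi):
        """Return the row numbers with lo <= value < hi."""
        hits = 0
        for i, bitmap in enumerate(self.bitmaps):
            bin_lo, bin_hi = self.edges[i], self.edges[i + 1]
            if lo <= bin_lo and bin_hi <= hi:
                hits |= bitmap  # bin lies fully inside the range: take all rows
            elif bin_lo < hi and lo < bin_hi:
                # Edge bin: its rows are only candidates, so check raw values.
                row, rest = 0, bitmap
                while rest:
                    if rest & 1 and lo <= self.values[row] < hi:
                        hits |= 1 << row
                    rest >>= 1
                    row += 1
        return [r for r in range(len(self.values)) if hits >> r & 1]


values = [3.2, 7.5, 12.1, 18.9, 9.9, 15.0]
index = BinnedBitmapIndex(values, edges=[0, 5, 10, 15, 20])
rows = index.range_query(7.0, 15.0)
```

Only the rows in partially covered bins need their raw values read; fully covered bins are answered with bitwise operations alone, which is where most of the speed of such an index comes from.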
Toward An Unstructured Mesh Database
NASA Astrophysics Data System (ADS)
Rezaei Mahdiraji, Alireza; Baumann, Peter
2014-05-01
Unstructured meshes are used in several application domains such as earth sciences (e.g., seismology), medicine, oceanography, climate modeling, and GIS as approximate representations of physical objects. Meshes subdivide a domain into smaller geometric elements (called cells) which are glued together by incidence relationships. The subdivision of a domain allows computational manipulation of complicated physical structures. For instance, seismologists model earthquakes using elastic wave propagation solvers on hexahedral meshes. Such a hexahedral mesh contains several hundred million grid points and millions of hexahedral cells. Each vertex node in the mesh stores a multitude of data fields. To run a simulation on such meshes, one needs to iterate over all the cells, iterate over the cells incident to a given cell, retrieve coordinates of cells, assign data values to cells, etc. Although meshes are used in many application domains, to the best of our knowledge there is no database vendor that supports unstructured mesh features. Currently, the main tools for querying and manipulating unstructured meshes are mesh libraries, e.g., CGAL and GrAL. Mesh libraries are dedicated libraries which include mesh algorithms and can be run on mesh representations. The libraries do not scale with dataset size, do not have a declarative query language, and need deep C++ knowledge for query implementations. Furthermore, due to the high coupling between the implementations and the input file structure, the implementations are less reusable and costly to maintain. A dedicated mesh database offers the following advantages: 1) declarative querying, 2) ease of maintenance, 3) hiding the mesh storage structure from applications, and 4) transparent query optimization. To design a mesh database, the first challenge is to define a suitable generic data model for unstructured meshes.
We proposed the ImG-Complexes data model as a generic topological mesh data model which extends the incidence graph model to multi-incidence relationships. We equip the ImG model with sets of optional and application-specific constraints which can be used to check the validity of meshes for a specific class of objects such as manifold, pseudo-manifold, and simplicial manifold. We conducted experiments to measure the performance of the graph database solution in processing mesh queries and compared it with the GrAL mesh library and the PostgreSQL database on synthetic and real mesh datasets. The experiments show that each system performs well on specific types of mesh queries, e.g., graph databases perform well on global path-intensive queries. In the future, we will investigate database operations for the ImG model and design a mesh query language.
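The plain incidence graph that the ImG model extends can be sketched as follows: each cell records the cells on its boundary (one dimension lower) and the cells it bounds (one dimension higher). The class and method names are hypothetical, and the multi-incidence extension and validity constraints described in the abstract are not modelled here.

```python
from collections import defaultdict


class IncidenceGraph:
    """Cells of dimension k linked to their boundary cells of dimension k - 1."""

    def __init__(self):
        self.dim = {}
        self.down = defaultdict(set)  # cell -> boundary cells (dimension - 1)
        self.up = defaultdict(set)    # cell -> co-boundary cells (dimension + 1)

    def add_cell(self, cell, dim, boundary=()):
        self.dim[cell] = dim
        for b in boundary:
            if self.dim[b] != dim - 1:
                raise ValueError(f"{b} is not a {dim - 1}-cell")
            self.down[cell].add(b)
            self.up[b].add(cell)

    def adjacent(self, cell):
        """Cells of the same dimension that share a boundary cell with `cell`."""
        return {c for b in self.down[cell] for c in self.up[b]} - {cell}

    def vertices(self, cell):
        """Walk down the incidence links until only 0-cells remain."""
        frontier = {cell}
        while any(self.dim[c] > 0 for c in frontier):
            frontier = {b for c in frontier for b in self.down[c]}
        return frontier


# Two triangles sharing the edge e12.
mesh = IncidenceGraph()
for v in ("v0", "v1", "v2", "v3"):
    mesh.add_cell(v, 0)
for name, ends in {"e01": ("v0", "v1"), "e12": ("v1", "v2"), "e20": ("v2", "v0"),
                   "e13": ("v1", "v3"), "e32": ("v3", "v2")}.items():
    mesh.add_cell(name, 1, ends)
mesh.add_cell("f0", 2, ("e01", "e12", "e20"))
mesh.add_cell("f1", 2, ("e12", "e13", "e32"))

neighbours = mesh.adjacent("f0")
corner_points = mesh.vertices("f1")
```

Queries such as "cells incident to a given cell" then become short traversals of the `up` and `down` links, which is the kind of operation a declarative mesh query language would hide from the user.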
Array Databases: Agile Analytics (not just) for the Earth Sciences
NASA Astrophysics Data System (ADS)
Baumann, P.; Misev, D.
2015-12-01
Gridded data, such as images, image timeseries, and climate datacubes, today are managed separately from the metadata, and with different, restricted retrieval capabilities. While databases are good at metadata modelled in tables, XML hierarchies, or RDF graphs, they traditionally do not support multi-dimensional arrays. This gap is being closed by Array Databases, pioneered by the scalable rasdaman ("raster data manager") array engine. Its declarative query language, rasql, extends SQL with array operators which are optimized and parallelized on server side. Installations can easily be mashed up securely, thereby enabling large-scale location-transparent query processing in federations. Domain experts value the integration with their commonly used tools leading to a quick learning curve. Earth, Space, and Life sciences, but also Social sciences as well as business have massive amounts of data and complex analysis challenges that are answered by rasdaman. As of today, rasdaman is mature and in operational use on hundreds of Terabytes of timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Additionally, its concepts have shaped international Big Data standards in the field, including the forthcoming array extension to ISO SQL, many of which are supported by both open-source and commercial systems meantime. In the geo field, rasdaman is reference implementation for the Open Geospatial Consortium (OGC) Big Data standard, WCS, now also under adoption by ISO. Further, rasdaman is in the final stage of OSGeo incubation. In this contribution we present array queries a la rasdaman, describe the architecture and novel optimization and parallelization techniques introduced in 2015, and put this in context of the intercontinental EarthServer initiative which utilizes rasdaman for enabling agile analytics on Petascale datacubes.
Safari, Leila; Patrick, Jon D
2018-06-01
This paper reports on a generic framework to provide clinicians with the ability to conduct complex analyses on elaborate research topics using cascaded queries to resolve internal time-event dependencies in the research questions, as an extension to the proposed Clinical Data Analytics Language (CliniDAL). A cascaded query model is proposed to resolve internal time-event dependencies in the queries which can have up to five levels of criteria starting with a query to define subjects to be admitted into a study, followed by a query to define the time span of the experiment. Three more cascaded queries can be required to define control groups, control variables and output variables which all together simulate a real scientific experiment. According to the complexity of the research questions, the cascaded query model has the flexibility of merging some lower level queries for simple research questions or adding a nested query to each level to compose more complex queries. Three different scenarios (one of them contains two studies) are described and used for evaluation of the proposed solution. CliniDAL's complex analyses solution enables answering complex queries with time-event dependencies at most in a few hours which manually would take many days. An evaluation of results of the research studies based on the comparison between CliniDAL and SQL solutions reveals high usability and efficiency of CliniDAL's solution. Copyright © 2018 Elsevier Inc. All rights reserved.
Improved oilfield GHG accounting using a global oilfield database
NASA Astrophysics Data System (ADS)
Roberts, S.; Brandt, A. R.; Masnadi, M.
2016-12-01
The definition of oil is shifting in considerable ways. Conventional oil resources are declining as oil sands, heavy oils, and others emerge. Technological advances mean that these unconventional hydrocarbons are now viable resources. Meanwhile, scientific evidence is mounting that climate change is occurring. The oil sector is responsible for 35% of global greenhouse gas (GHG) emissions, but the climate impacts of these new unconventional oils are not well understood. As such, the Oil Climate Index (OCI) project has been an international effort to evaluate the total life-cycle environmental GHG emissions of different oil fields globally. Over the course of the first and second phases of the project, 30 and 75 global oil fields have been investigated, respectively. The 75 fields account for about 25% of global oil production. The third phase of the project aims to expand the OCI to cover close to 100% of global oil production, leading to the analysis of 8000 fields. To accomplish this, a robust database system is required to handle and manipulate the data. Therefore, the data were integrated into a database using SQL (Structured Query Language). The implementation of SQL allows users to process the data more efficiently than would be possible by using the previously established program (Microsoft Excel). Next, a graphical user interface (GUI) was implemented in C# in order to make the data interactive, enabling people to update the database without prior knowledge of SQL.
New tools and methods for direct programmatic access to the dbSNP relational database
Saccone, Scott F.; Quan, Jiaxi; Mehta, Gaurang; Bolze, Raphael; Thomas, Prasanth; Deelman, Ewa; Tischfield, Jay A.; Rice, John P.
2011-01-01
Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale. PMID:21037260
Mining Student Data Captured from a Web-Based Tutoring Tool: Initial Exploration and Results
ERIC Educational Resources Information Center
Merceron, Agathe; Yacef, Kalina
2004-01-01
In this article we describe the initial investigations that we have conducted on student data collected from a web-based tutoring tool. We have used some data mining techniques such as association rule and symbolic data analysis, as well as traditional SQL queries to gain further insight on the students' learning and deduce information to improve…
EmptyHeaded: A Relational Engine for Graph Processing
Aberger, Christopher R.; Tu, Susan; Olukotun, Kunle; Ré, Christopher
2016-01-01
There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded’s design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP. PMID:28077912
Using an image-extended relational database to support content-based image retrieval in a PACS.
Traina, Caetano; Traina, Agma J M; Araújo, Myrian R B; Bueno, Josiane M; Chino, Fabio J T; Razente, Humberto; Azevedo-Marques, Paulo M
2005-12-01
This paper presents a new Picture Archiving and Communication System (PACS), called cbPACS, which has content-based image retrieval capabilities. The cbPACS answers range and k-nearest-neighbor similarity queries, employing a relational database manager extended to support images. The images are compared through their features, which are extracted by an image-processing module and stored in the extended relational database. The database extensions were developed aiming at efficiently answering similarity queries by taking advantage of specialized indexing methods. The main concept supporting the extensions is the definition, inside the relational manager, of distance functions based on features extracted from the images. An extension to the SQL language enables the construction of an interpreter that intercepts the extended commands and translates them to standard SQL, allowing any relational database server to be used. At present, the implemented system works on features based on the color distribution of the images through normalized histograms as well as metric histograms. Metric histograms are invariant to scale, translation and rotation of images and also to brightness transformations. The cbPACS is prepared to integrate new image features, based on texture and shape of the main objects in the image.
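The two similarity query types named in this abstract, range and k-nearest-neighbor, can be illustrated over normalized histograms with a plain linear scan. This is a sketch under assumptions: grey-level histograms with four bins and an L1 distance stand in for the paper's features and distance functions, the function names are hypothetical, and neither metric histograms nor the specialized indexes are shown.

```python
def normalized_histogram(pixels, bins=4, max_value=256):
    """Fraction of pixels per intensity bin (pixel values in 0..max_value-1)."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    return [c / len(pixels) for c in counts]


def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0 means identical distributions."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def knn_query(query, database, k):
    """Names of the k images whose histograms are closest to `query`."""
    ranked = sorted(database.items(), key=lambda item: l1_distance(query, item[1]))
    return [name for name, _ in ranked[:k]]


def range_query(query, database, radius):
    """Names of all images within distance `radius` of `query`."""
    return sorted(name for name, hist in database.items()
                  if l1_distance(query, hist) <= radius)


# Toy "images": short lists of grey values.
database = {
    "dark": normalized_histogram([10, 20, 30, 40]),
    "bright": normalized_histogram([200, 220, 240, 250]),
    "mixed": normalized_histogram([10, 100, 150, 250]),
}
query = normalized_histogram([5, 15, 25, 130])

nearest_two = knn_query(query, database, k=2)
within_half = range_query(query, database, radius=0.5)
```

Because histograms are normalized, images of different sizes are directly comparable; a real system would replace the linear scan with a metric index so that most distance computations are avoided.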
Accessing the public MIMIC-II intensive care relational database for clinical research.
Scott, Daniel J; Lee, Joon; Silva, Ikaro; Park, Shinhyuk; Moody, George B; Celi, Leo A; Mark, Roger G
2013-01-10
The Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database is a free, public resource for intensive care research. The database was officially released in 2006, and has attracted a growing number of researchers in academia and industry. We present the two major software tools that facilitate accessing the relational database: the web-based QueryBuilder and a downloadable virtual machine (VM) image. QueryBuilder and the MIMIC-II VM have been developed successfully and are freely available to MIMIC-II users. Simple example SQL queries and the resulting data are presented. Clinical studies pertaining to acute kidney injury and prediction of fluid requirements in the intensive care unit are shown as typical examples of research performed with MIMIC-II. In addition, MIMIC-II has also provided data for annual PhysioNet/Computing in Cardiology Challenges, including the 2012 Challenge "Predicting mortality of ICU Patients". QueryBuilder is a web-based tool that provides easy access to MIMIC-II. For more computationally intensive queries, one can locally install a complete copy of MIMIC-II in a VM. Both publicly available tools provide the MIMIC-II research community with convenient querying interfaces and complement the value of the MIMIC-II relational database.
TokSearch: A search engine for fusion experimental data
Sammuli, Brian S.; Barr, Jayson L.; Eidietis, Nicholas W.; ...
2018-04-01
At a typical fusion research site, experimental data is stored using archive technologies that deal with each discharge as an independent set of data. These technologies (e.g. MDSplus or HDF5) are typically supplemented with a database that aggregates metadata for multiple shots to allow for efficient querying of certain predefined quantities. Often, however, a researcher will need to extract information from the archives, possibly for many shots, that is not available in the metadata store or otherwise indexed for quick retrieval. To address this need, a new search tool called TokSearch has been added to the General Atomics TokSys control design and analysis suite [1]. This tool provides the ability to rapidly perform arbitrary, parallelized queries of archived tokamak shot data (both raw and analyzed) over large numbers of shots. The TokSearch query API borrows concepts from SQL, and users can choose to implement queries in either Matlab™ or Python.
TokSearch: A search engine for fusion experimental data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sammuli, Brian S.; Barr, Jayson L.; Eidietis, Nicholas W.
At a typical fusion research site, experimental data is stored using archive technologies that deal with each discharge as an independent set of data. These technologies (e.g. MDSplus or HDF5) are typically supplemented with a database that aggregates metadata for multiple shots to allow for efficient querying of certain predefined quantities. Often, however, a researcher will need to extract information from the archives, possibly for many shots, that is not available in the metadata store or otherwise indexed for quick retrieval. To address this need, a new search tool called TokSearch has been added to the General Atomics TokSys control design and analysis suite [1]. This tool provides the ability to rapidly perform arbitrary, parallelized queries of archived tokamak shot data (both raw and analyzed) over large numbers of shots. The TokSearch query API borrows concepts from SQL, and users can choose to implement queries in either Matlab™ or Python.
Reactome graph database: Efficient access to complex pathway data
Korninger, Florian; Viteri, Guilherme; Marin-Garcia, Pablo; Ping, Peipei; Wu, Guanming; Stein, Lincoln; D’Eustachio, Peter
2018-01-01
Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types. PMID:29377902
Reactome graph database: Efficient access to complex pathway data.
Fabregat, Antonio; Korninger, Florian; Viteri, Guilherme; Sidiropoulos, Konstantinos; Marin-Garcia, Pablo; Ping, Peipei; Wu, Guanming; Stein, Lincoln; D'Eustachio, Peter; Hermjakob, Henning
2018-01-01
Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types.
Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L
2015-02-01
Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used for both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
2014-03-27
0.8.0. The virtual machine’s network adapter was set to internal network only to keep any outside traffic from interfering. A MySQL-based query...primary output of Fullstats is the ARFF file format, intended for use with the WEKA Java-based data mining software developed at the University of Waikato
Design and implementation of a database for Brucella melitensis genome annotation.
De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric
2008-03-18
The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, leading to new relationship analysis. Moreover, the automatic genome annotation of the pathogenic bacteria Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start sites predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools of the complete genome sequences of the B. melitensis genome (including BLASTs to 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenic comparisons. To make these results available, alphaPAGe, a functional auto-updatable database of the corrected sequence genome of B. melitensis, has been built, using the entity-relationship (ER) approach and a multi-purpose database structure. A friendly graphical user interface has been designed, and users can retrieve different kinds of information through three levels of queries: (1) the basic search uses the classical keywords or sequence identifiers; (2) the original advanced search engine allows numerous criteria to be combined (by using logical operators): (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight and structural criteria such as the nucleic length or the number of transmembrane helices (TMH); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed by using a SQL field, which allows all queries respecting the database's structure. The database is publicly available through a Web server at the following url: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.
StarView: The object oriented design of the ST DADS user interface
NASA Technical Reports Server (NTRS)
Williams, J. D.; Pollizzi, J. A.
1992-01-01
StarView is the user interface being developed for the Hubble Space Telescope Data Archive and Distribution Service (ST DADS). ST DADS is the data archive for HST observations and a relational database catalog describing the archived data. Users will use StarView to query the catalog and select appropriate datasets for study. StarView sends requests for archived datasets to ST DADS which processes the requests and returns the database to the user. StarView is designed to be a powerful and extensible user interface. Unique features include an internal relational database to navigate query results, a form definition language that will work with both CRT and X interfaces, a data definition language that will allow StarView to work with any relational database, and the ability to generate adhoc queries without requiring the user to understand the structure of the ST DADS catalog. Ultimately, StarView will allow the user to refine queries in the local database for improved performance and merge in data from external sources for correlation with other query results. The user will be able to create a query from single or multiple forms, merging the selected attributes into a single query. Arbitrary selection of attributes for querying is supported. The user will be able to select how query results are viewed. A standard form or table-row format may be used. Navigation capabilities are provided to aid the user in viewing query results. Object oriented analysis and design techniques were used in the design of StarView to support the mechanisms and concepts required to implement these features. One such mechanism is the Model-View-Controller (MVC) paradigm. The MVC allows the user to have multiple views of the underlying database, while providing a consistent mechanism for interaction regardless of the view. This approach supports both CRT and X interfaces while providing a common mode of user interaction. Another powerful abstraction is the concept of a Query Model. 
This concept allows a single query to be built from a single or multiple forms before it is submitted to ST DADS. Supporting this concept is the adhoc query generator, which allows the user to select and qualify an indeterminate number of attributes from the database. The user does not need any knowledge of how the joins across various tables are to be resolved. The adhoc generator calculates the joins automatically and generates the correct SQL query.
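One common way to resolve joins automatically, as the ad hoc generator above is said to do, is a shortest-path search over a graph of table relationships. The sketch below illustrates that idea only; the table names, column names and the breadth-first strategy are hypothetical and are not taken from StarView or the ST DADS catalog.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple

# Hypothetical catalog: each pair of joinable tables maps to its join condition.
JOINS: Dict[Tuple[str, str], str] = {
    ("observation", "dataset"): "observation.obs_id = dataset.obs_id",
    ("observation", "target"): "observation.target_id = target.target_id",
}


def neighbors(table: str) -> Iterator[Tuple[str, str]]:
    """Yield (other_table, condition) for every join that touches `table`."""
    for (a, b), cond in JOINS.items():
        if a == table:
            yield b, cond
        elif b == table:
            yield a, cond


def join_path(start: str, goal: str) -> List[Tuple[str, str]]:
    """Breadth-first search for the shortest chain of joins from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for other, cond in neighbors(table):
            if other not in seen:
                seen.add(other)
                queue.append((other, path + [(other, cond)]))
    raise ValueError(f"no join path from {start} to {goal}")


def build_query(columns: List[str]) -> str:
    """Build a SELECT over qualified columns ("table.column"), adding joins as needed."""
    tables: List[str] = []
    for col in columns:
        table = col.split(".")[0]
        if table not in tables:
            tables.append(table)
    base = tables[0]
    joined = [base]
    clauses: List[str] = []
    for table in tables[1:]:
        for other, cond in join_path(base, table):
            if other not in joined:
                joined.append(other)
                clauses.append(f"JOIN {other} ON {cond}")
    return " ".join([f"SELECT {', '.join(columns)} FROM {base}"] + clauses)


# The user picks attributes from two tables; the linking table is added automatically.
sql = build_query(["dataset.file_name", "target.name"])
```

Here the user never names the `observation` table, yet it appears in the generated SQL because it is the only link between the two selected tables.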
Oliveira, S R M; Almeida, G V; Souza, K R R; Rodrigues, D N; Kuser-Falcão, P R; Yamagishi, M E B; Santos, E H; Vieira, F D; Jardine, J G; Neshich, G
2007-10-05
An effective strategy for managing protein databases is to provide mechanisms to transform raw data into consistent, accurate and reliable information. Such mechanisms will greatly reduce operational inefficiencies and improve one's ability to better handle scientific objectives and interpret the research results. To achieve this challenging goal for the STING project, we introduce Sting_RDB, a relational database of structural parameters for protein analysis with support for data warehousing and data mining. In this article, we highlight the main features of Sting_RDB and show how a user can explore it for efficient and biologically relevant queries. Considering its importance for molecular biologists, effort has been made to advance Sting_RDB toward data quality assessment. To the best of our knowledge, Sting_RDB is one of the most comprehensive data repositories for protein analysis, now also capable of providing its users with a data quality indicator. This paper differs from our previous study in many aspects. First, we introduce Sting_RDB, a relational database with mechanisms for efficient and relevant queries using SQL. Sting_RDB evolved from the earlier, text (flat file)-based database, in which data consistency and integrity were not guaranteed. Second, we provide support for data warehousing and mining. Third, the data quality indicator was introduced. Finally, and probably most importantly, complex queries that could not be posed on a text-based database are now easily implemented. Further details are accessible at the Sting_RDB demo web page: http://www.cbi.cnptia.embrapa.br/StingRDB.
A Study of the Efficiency of Spatial Indexing Methods Applied to Large Astronomical Databases
NASA Astrophysics Data System (ADS)
Donaldson, Tom; Berriman, G. Bruce; Good, John; Shiao, Bernie
2018-01-01
Spatial indexing of astronomical databases generally uses quadrature methods, which partition the sky into cells used to create an index (usually a B-tree) written as a database column. We report the results of a study to compare the performance of two common indexing methods, HTM and HEALPix, on Solaris and Windows database servers installed with a PostgreSQL database, and a Windows Server installed with MS SQL Server. The indexing was applied to the 2MASS All-Sky Catalog and to the Hubble Source catalog. On each server, the study compared indexing performance by submitting 1 million queries at each index level with random sky positions and random cone search radius, which was computed on a logarithmic scale between 1 arcsec and 1 degree, and measuring the time to complete the query and write the output. These simulated queries, intended to model realistic use patterns, were run in a uniform way on many combinations of indexing method and indexing level. The query times in all simulations are strongly I/O-bound and are linear with number of records returned for large numbers of sources. There are, however, considerable differences between simulations, which reveal that hardware I/O throughput is a more important factor in managing the performance of a DBMS than the choice of indexing scheme. The choice of index itself is relatively unimportant: for comparable index levels, the performance is consistent within the scatter of the timings. At small index levels (large cells; e.g. level 4; cell size 3.7 deg), there is large scatter in the timings because of wide variations in the number of sources found in the cells. At larger index levels, performance improves and scatter decreases, but the improvement at level 8 (14 arcmin) and higher is masked to some extent in the timing scatter caused by the range of query sizes. 
At very high levels (20; 0.0004 arcsec), the granularity of the cells becomes so high that a large number of extraneous empty cells begin to degrade performance. Thus, for the use patterns studied here, the database performance is not critically dependent on the exact choices of index or level.
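The simulated workload described above, random sky positions with a cone radius drawn on a logarithmic scale between 1 arcsec and 1 degree, can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and sampling positions uniformly over the sphere is an assumption, since the abstract says only that positions were random.

```python
import math
import random
from typing import Tuple

ARCSEC_DEG = 1.0 / 3600.0  # one arcsecond expressed in degrees


def random_sky_position(rng: random.Random) -> Tuple[float, float]:
    """Draw (ra, dec) in degrees, uniform over the sphere (assumed distribution)."""
    ra = rng.uniform(0.0, 360.0)
    # Uniform in sin(dec) gives equal probability per unit solid angle.
    dec = math.degrees(math.asin(rng.uniform(-1.0, 1.0)))
    return ra, dec


def random_radius(rng: random.Random, lo: float = ARCSEC_DEG, hi: float = 1.0) -> float:
    """Draw a cone radius in degrees, uniform in log10 between lo and hi."""
    return 10.0 ** rng.uniform(math.log10(lo), math.log10(hi))


def random_cone_search(rng: random.Random) -> Tuple[float, float, float]:
    """Return one simulated query as (ra, dec, radius), all in degrees."""
    ra, dec = random_sky_position(rng)
    return ra, dec, random_radius(rng)


rng = random.Random(42)  # fixed seed so a benchmark run can be repeated
queries = [random_cone_search(rng) for _ in range(1000)]
```

Sampling the radius in log space means each decade of radius receives roughly the same number of queries, which is why the returned record counts span such a wide range.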
Migration from relational to NoSQL database
NASA Astrophysics Data System (ADS)
Ghotiya, Sunita; Mandal, Juhi; Kandasamy, Saravanakumar
2017-11-01
Data generated by real-time applications, social networking sites and sensor devices is very large in volume and unstructured, which makes it difficult for relational database management systems to handle. Data is a very precious component of any application and needs to be analysed after being arranged in some structure. Relational databases can deal only with structured data, so there is a need for NoSQL database management systems, which can also deal with semi-structured data. Relational databases provide the easiest way to manage data, but as the use of NoSQL increases, it is becoming necessary to migrate data from relational to NoSQL databases. Various frameworks have been proposed previously that provide mechanisms for migrating data stored in SQL warehouses, along with middle-layer solutions that allow data to be stored in NoSQL databases to handle data which is not structured. This paper provides a literature review of some of the recent approaches proposed by various researchers to migrate data from relational to NoSQL databases. Some researchers proposed mechanisms for the co-existence of NoSQL and relational databases. This paper provides a summary of mechanisms which can be used for mapping data stored in relational databases to NoSQL databases. Various techniques for data transformation and middle-layer solutions are summarised in the paper.
Ontology-based geospatial data query and integration
Zhao, T.; Zhang, C.; Wei, M.; Peng, Z.-R.
2008-01-01
Geospatial data sharing is an increasingly important subject as large amounts of data are produced by a variety of sources, stored in incompatible formats, and accessible through different GIS applications. Past efforts to enable sharing have produced standardized data formats such as GML and data access protocols such as Web Feature Service (WFS). While these standards help enable client applications to gain access to heterogeneous data stored in different formats from diverse sources, the usability of the access is limited due to the lack of data semantics encoded in the WFS feature types. Past research has used ontology languages to describe the semantics of geospatial data, but ontology-based queries cannot be applied directly to legacy data stored in databases or shapefiles, or to feature data in WFS services. This paper presents a method to enable ontology query on spatial data available from WFS services and on data stored in databases. We do not create ontology instances explicitly and thus avoid the problems of data replication. Instead, user queries are rewritten to WFS getFeature requests and SQL queries to the database. The method also has the benefit of being able to utilize existing tools of databases, WFS, and GML while enabling query based on ontology semantics. © 2008 Springer-Verlag Berlin Heidelberg.
Accessing the public MIMIC-II intensive care relational database for clinical research
2013-01-01
Background: The Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database is a free, public resource for intensive care research. The database was officially released in 2006, and has attracted a growing number of researchers in academia and industry. We present the two major software tools that facilitate accessing the relational database: the web-based QueryBuilder and a downloadable virtual machine (VM) image. Results: QueryBuilder and the MIMIC-II VM have been developed successfully and are freely available to MIMIC-II users. Simple example SQL queries and the resulting data are presented. Clinical studies pertaining to acute kidney injury and prediction of fluid requirements in the intensive care unit are shown as typical examples of research performed with MIMIC-II. In addition, MIMIC-II has also provided data for annual PhysioNet/Computing in Cardiology Challenges, including the 2012 Challenge “Predicting mortality of ICU Patients”. Conclusions: QueryBuilder is a web-based tool that provides easy access to MIMIC-II. For more computationally intensive queries, one can locally install a complete copy of MIMIC-II in a VM. Both publicly available tools provide the MIMIC-II research community with convenient querying interfaces and complement the value of the MIMIC-II relational database. PMID:23302652
2006-09-01
Each of these layers will be described in more detail to include relevant technologies (Java, PDA, Hibernate, and PostgreSQL) used to implement...Logic Layer -Object-Relational Mapper (Hibernate) Data 35 capable in order to interface with Java applications. Based on meeting the selection...further discussed. Query List Application Logic Layer HibernateApache - Java Servlet - Hibernate Interface -OR Mapper -RDBMS Interface
Integrated Substrate and Thin Film Design Methods
1999-02-01
Proper Representation Once the required chemical databases had been converted to the Excel format, VBA macros were written to convert chemical...ternary systems databases were imported from MS Excel to MS Access to implement SQL queries. Further, this database was connected via an ODBC model, to the... VBA macro, corresponding to each of the elements A, B, and C, respectively. The B loop began with the next alphabetical choice of element symbols
TCW: Transcriptome Computational Workbench
Soderlund, Carol; Nelson, William; Willer, Mark; Gang, David R.
2013-01-01
Background: The analysis of transcriptome data involves many steps and various programs, along with organization of large amounts of data and results. Without a methodical approach for storage, analysis and query, the resulting ad hoc analysis can lead to human error, loss of data and results, inefficient use of time, and lack of verifiability, repeatability, and extensibility. Methodology: The Transcriptome Computational Workbench (TCW) provides Java graphical interfaces for methodical analysis for both single and comparative transcriptome data without the use of a reference genome (e.g. for non-model organisms). The singleTCW interface steps the user through importing transcript sequences (e.g. Illumina) or assembling long sequences (e.g. Sanger, 454, transcripts), annotating the sequences, and performing differential expression analysis using published statistical programs in R. The data, metadata, and results are stored in a MySQL database. The multiTCW interface builds a comparison database by importing sequence and annotation from one or more single TCW databases, executes the ESTscan program to translate the sequences into proteins, and then incorporates one or more clusterings, where the clustering options are to execute the orthoMCL program, compute transitive closure, or import clusters. Both singleTCW and multiTCW allow extensive query and display of the results, where singleTCW displays the alignment of annotation hits to transcript sequences, and multiTCW displays multiple transcript alignments with MUSCLE or pairwise alignments. The query programs can be executed on the desktop for fastest analysis, or from the web for sharing the results. Conclusion: It is now affordable to buy a multi-processor machine, and easy to install Java and MySQL. By simply downloading the TCW, the user can interactively analyze, query and view their data. The TCW allows in-depth data mining of the results, which can lead to a better understanding of the transcriptome. 
TCW is freely available from www.agcol.arizona.edu/software/tcw. PMID:23874959
TCW: transcriptome computational workbench.
Soderlund, Carol; Nelson, William; Willer, Mark; Gang, David R
2013-01-01
The analysis of transcriptome data involves many steps and various programs, along with organization of large amounts of data and results. Without a methodical approach for storage, analysis and query, the resulting ad hoc analysis can lead to human error, loss of data and results, inefficient use of time, and lack of verifiability, repeatability, and extensibility. The Transcriptome Computational Workbench (TCW) provides Java graphical interfaces for methodical analysis for both single and comparative transcriptome data without the use of a reference genome (e.g. for non-model organisms). The singleTCW interface steps the user through importing transcript sequences (e.g. Illumina) or assembling long sequences (e.g. Sanger, 454, transcripts), annotating the sequences, and performing differential expression analysis using published statistical programs in R. The data, metadata, and results are stored in a MySQL database. The multiTCW interface builds a comparison database by importing sequence and annotation from one or more single TCW databases, executes the ESTscan program to translate the sequences into proteins, and then incorporates one or more clusterings, where the clustering options are to execute the orthoMCL program, compute transitive closure, or import clusters. Both singleTCW and multiTCW allow extensive query and display of the results, where singleTCW displays the alignment of annotation hits to transcript sequences, and multiTCW displays multiple transcript alignments with MUSCLE or pairwise alignments. The query programs can be executed on the desktop for fastest analysis, or from the web for sharing the results. It is now affordable to buy a multi-processor machine, and easy to install Java and MySQL. By simply downloading the TCW, the user can interactively analyze, query and view their data. The TCW allows in-depth data mining of the results, which can lead to a better understanding of the transcriptome. 
TCW is freely available from www.agcol.arizona.edu/software/tcw.
Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets
NASA Astrophysics Data System (ADS)
Juric, Mario
2011-01-01
The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ("column groups"), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping "cells" by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce "kernels" that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 Grows of PanSTARRS+SDSS data (220GB) in less than 15 minutes on a dual CPU machine. In a cluster environment, we achieved bandwidths of 17Gbits/sec (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.
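The horizontal partitioning and per-cell kernels described above can be pictured with a small sketch. It is not the LSD API: the grid sizes, the function names and the row layout are assumptions chosen for illustration, and the partial overlap between neighboring cells mentioned in the abstract is deliberately left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, float]
CellKey = Tuple[int, int, int]


def cell_id(lon: float, lat: float, t: float,
            cell_deg: float = 10.0, cell_days: float = 30.0) -> CellKey:
    """Map a (lon, lat, t) coordinate to a grid cell key (assumed regular grid)."""
    return (
        int((lon % 360.0) // cell_deg),
        int((lat + 90.0) // cell_deg),
        int(t // cell_days),
    )


def partition(rows: Iterable[Row]) -> Dict[CellKey, List[Row]]:
    """Group rows by the cell that contains them."""
    cells: Dict[CellKey, List[Row]] = defaultdict(list)
    for row in rows:
        cells[cell_id(row["lon"], row["lat"], row["t"])].append(row)
    return cells


def sweep(rows: Iterable[Row],
          kernel: Callable[[List[Row]], int],
          reduce_fn: Callable[[Iterable[int]], int]) -> int:
    """Run `kernel` on each cell independently, then combine the per-cell results."""
    return reduce_fn(kernel(cell_rows) for cell_rows in partition(rows).values())


# Hypothetical catalog rows: longitude and latitude in degrees, time in days.
catalog = [
    {"lon": 12.5, "lat": -3.0, "t": 45.0},
    {"lon": 13.0, "lat": -2.0, "t": 50.0},
    {"lon": 200.0, "lat": 40.0, "t": 5.0},
]
n_cells = len(partition(catalog))
n_rows = sweep(catalog, kernel=len, reduce_fn=sum)
```

Because each kernel call sees only one cell, the calls are independent and could be scheduled on separate workers, which is the property a distributed framework exploits.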
SQLGEN: a framework for rapid client-server database application development.
Nadkarni, P M; Cheung, K H
1995-12-01
SQLGEN is a framework for rapid client-server relational database application development. It relies on an active data dictionary on the client machine that stores metadata on one or more database servers to which the client may be connected. The dictionary generates dynamic Structured Query Language (SQL) to perform common database operations; it also stores information about the access rights of the user at log-in time, which is used to partially self-configure the behavior of the client to disable inappropriate user actions. SQLGEN uses a microcomputer database as the client to store metadata in relational form, to transiently capture server data in tables, and to allow rapid application prototyping followed by porting to client-server mode with modest effort. SQLGEN is currently used in several production biomedical databases.
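The idea of a data dictionary that generates SQL dynamically and disables actions a user is not entitled to can be sketched as follows. The dictionary layout, role names and table are hypothetical and are not SQLGEN's actual schema; the standard-library `sqlite3` module stands in for a database server.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

# Hypothetical data dictionary: per-table column metadata and per-role access rights.
DICTIONARY: Dict[str, Dict[str, Any]] = {
    "patient": {
        "columns": ["patient_id", "name", "birth_year"],
        "rights": {"clerk": {"select"}, "admin": {"select", "update"}},
    }
}


def allowed(table: str, role: str, action: str) -> bool:
    """Check the dictionary to decide whether a role may perform an action."""
    return action in DICTIONARY[table]["rights"].get(role, set())


def select_sql(table: str, filters: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Generate a parameterized SELECT from dictionary metadata.

    Table and column names come only from the dictionary; values are passed as
    bound parameters rather than interpolated into the SQL text.
    """
    meta = DICTIONARY[table]
    for col in filters:
        if col not in meta["columns"]:
            raise KeyError(f"unknown column {col!r} for table {table!r}")
    sql = f"SELECT {', '.join(meta['columns'])} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    return sql, list(filters.values())


# In-memory database used purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id INTEGER, name TEXT, birth_year INTEGER)")
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                 [(1, "Ada", 1950), (2, "Lin", 1972)])

sql, params = select_sql("patient", {"birth_year": 1972})
rows = conn.execute(sql, params).fetchall() if allowed("patient", "clerk", "select") else []
```

Keeping the metadata in one place means a schema change is made in the dictionary rather than in every hand-written query, which is the maintenance benefit the abstract points to.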
Adding Hierarchical Objects to Relational Database General-Purpose XML-Based Information Managements
NASA Technical Reports Server (NTRS)
Lin, Shu-Chun; Knight, Chris; La, Tracy; Maluf, David; Bell, David; Tran, Khai Peter; Gawdiak, Yuri
2006-01-01
NETMARK is a flexible, high-throughput software system for managing, storing, and rapid searching of unstructured and semi-structured documents. NETMARK transforms such documents from their original highly complex, constantly changing, heterogeneous data formats into well-structured, common data formats using Hypertext Markup Language (HTML) and/or Extensible Markup Language (XML). The software implements an object-relational database system that combines the best practices of the relational model utilizing Structured Query Language (SQL) with those of the object-oriented, semantic database model for creating complex data. In particular, NETMARK takes advantage of the Oracle 8i object-relational database model using physical-address data types for very efficient keyword searches of records across both context and content. NETMARK also supports multiple international standards such as WEBDAV for drag-and-drop file management and SOAP for integrated information management using Web services. The document-organization and -searching capabilities afforded by NETMARK are likely to make this software attractive for use in disciplines as diverse as science, auditing, and law enforcement.
Meta sequence analysis of human blood peptides and their parent proteins.
Bowden, Peter; Pendrak, Voitek; Zhu, Peihong; Marshall, John G
2010-04-18
Sequence analysis of the blood peptides and their qualities will be key to understanding the mechanisms that contribute to error in LC-ESI-MS/MS. Analysis of peptides and their proteins at the level of sequences is much more direct and informative than the comparison of disparate accession numbers. A portable database of all blood peptide and protein sequences with descriptor fields and gene ontology terms might be useful for designing immunological or MRM assays from human blood. The results of twelve studies of human blood peptides and/or proteins identified by LC-MS/MS and correlated against a disparate array of genetic libraries were parsed and matched to proteins from the human ENSEMBL, SwissProt and RefSeq databases by SQL. The reported peptide and protein sequences were organized into an SQL database with full protein sequences and up to five unique peptides in order of prevalence along with the peptide count for each protein. Structured query language or BLAST was used to acquire descriptive information in current databases. Sampling error at the level of peptides is the largest source of disparity between groups. Chi Square analysis of peptide to protein distributions confirmed the significant agreement between groups on identified proteins. Copyright 2010. Published by Elsevier B.V.
Engineering the ATLAS TAG Browser
NASA Astrophysics Data System (ADS)
Zhang, Qizhi; ATLAS Collaboration
2011-12-01
ELSSI is a web-based event metadata (TAG) browser and event-level selection service for ATLAS. In this paper, we describe some of the challenges encountered in the process of developing ELSSI, and the software engineering strategies adopted to address those challenges. Approaches to management of access to data, browsing, data rendering, query building, query validation, execution, connection management, and communication with auxiliary services are discussed. We also describe strategies for dealing with data that may vary over time, such as run-dependent trigger decision decoding. Along with examples, we illustrate how programming techniques in multiple languages (PHP, JAVASCRIPT, XML, AJAX, and PL/SQL) have been blended to achieve the required results. Finally, we evaluate features of the ELSSI service in terms of functionality, scalability, and performance.
Zaman, Babar; Khandekar, Rajiv; Al Shahwan, Sami; Song, Jonathan; Al Jadaan, Ibrahim; Al Jiasim, Leyla; Owaydha, Ohood; Asghar, Nasira; Hijazi, Amar; Edward, Deepak P.
2014-01-01
In this brief communication, we present the steps used to establish a web-based congenital glaucoma registry at our institution. The contents of a case report form (CRF) were developed by a group of glaucoma subspecialists. Information Technology (IT) specialists used Lime Survey software™ to create an electronic CRF. A MY Structured Query Language (MySQL) server was used as a database with a virtual machine operating system. Two ophthalmologists and 2 IT specialists worked for 7 hours, and a biostatistician and a data registrar worked for 24 hours each to establish the electronic CRF. Using the CRF which was transferred to the Lime survey tool, and the MySQL server application, data could be directly stored in spreadsheet programs that included Microsoft Excel, SPSS, and R-Language and queried in real-time. In a pilot test, clinical data from 80 patients with congenital glaucoma were entered into the registry and successful descriptive analysis and data entry validation were performed. A web-based disease registry was established in a short period of time in a cost-efficient manner using available resources and a team-based approach. PMID:24791112
Zaman, Babar; Khandekar, Rajiv; Al Shahwan, Sami; Song, Jonathan; Al Jadaan, Ibrahim; Al Jiasim, Leyla; Owaydha, Ohood; Asghar, Nasira; Hijazi, Amar; Edward, Deepak P
2014-01-01
In this brief communication, we present the steps used to establish a web-based congenital glaucoma registry at our institution. The contents of a case report form (CRF) were developed by a group of glaucoma subspecialists. Information Technology (IT) specialists used Lime Survey software™ to create an electronic CRF. A MY Structured Query Language (MySQL) server was used as a database with a virtual machine operating system. Two ophthalmologists and 2 IT specialists worked for 7 hours, and a biostatistician and a data registrar worked for 24 hours each to establish the electronic CRF. Using the CRF which was transferred to the Lime survey tool, and the MySQL server application, data could be directly stored in spreadsheet programs that included Microsoft Excel, SPSS, and R-Language and queried in real-time. In a pilot test, clinical data from 80 patients with congenital glaucoma were entered into the registry and successful descriptive analysis and data entry validation were performed. A web-based disease registry was established in a short period of time in a cost-efficient manner using available resources and a team-based approach.
A Split-Path Schema-Based RFID Data Storage Model in Supply Chain Management
Fan, Hua; Wu, Quanyuan; Lin, Yisong; Zhang, Jianfeng
2013-01-01
In modern supply chain management systems, Radio Frequency IDentification (RFID) technology has become an indispensable sensor technology and massive RFID data sets are expected to become commonplace. More and more space and time are needed to store and process such huge amounts of RFID data, and there is an increasing realization that the existing approaches cannot satisfy the requirements of RFID data management. In this paper, we present a split-path schema-based RFID data storage model. With a data separation mechanism, the massive RFID data produced in supply chain management systems can be stored and processed more efficiently. Then a tree structure-based path splitting approach is proposed to intelligently and automatically split the movement paths of products. Furthermore, based on the proposed new storage model, we design the relational schema to store the path information and time information of tags, and some typical query templates and SQL statements are defined. Finally, we conduct various experiments to measure the effect and performance of our model and demonstrate that it performs significantly better than the baseline approach in both the data expression and path-oriented RFID data query performance. PMID:23645112
Motamed, Cyrus; Bourgain, Jean Louis
2016-06-01
Anaesthesia Information Management Systems (AIMS) generate large amounts of data, which might be useful for quality assurance programs. This study was designed to highlight the multiple contributions of our AIMS system in extracting quality indicators over a period of 10 years. The study was conducted from 2002 to 2011. Two methods were used to extract anaesthesia indicators: the manual extraction of individual files for monitoring neuromuscular relaxation and structured query language (SQL) extraction for other indicators which were postoperative nausea and vomiting (PONV), pain, sedation scores, pain-related medications, scores and postoperative hypothermia. For each indicator, a program of information/meetings and adaptation/suggestions for operating room and PACU personnel was initiated to improve quality assurance, while data were extracted each year. The study included 77,573 patients. The mean overall completeness of data for the initial years ranged from 55 to 85% and was indicator-dependent, which then improved to 95% completeness for the last 5 years. The incidence of neuromuscular monitoring was initially 67% and then increased to 95% (P<0.05). The rate of pharmacological reversal remained around 53% throughout the study. Regarding SQL data, an improvement of severe postoperative pain and PONV scores was observed throughout the study, while mild postoperative hypothermia remained a challenge, despite efforts for improvement. The AIMS system permitted the follow-up of certain indicators through manual sampling and many more via SQL extraction in a sustained and non-time-consuming way across years. However, it requires competent and especially dedicated resources to handle the database. Copyright © 2016 Société française d'anesthésie et de réanimation (Sfar). Published by Elsevier Masson SAS. All rights reserved.
A relational database in neurosurgery.
Sicurello, F; Marchetti, M R; Cazzaniga, P
1995-01-01
This paper describes the automatic procedure for clinical record management in a Neurosurgery ward. The automated record allows the storage, querying and effective management of clinical data. This is useful during the patient stay and also for data processing and analysis aiming at clinical research and statistical studies. The clinical record is problem-oriented. It contains a minimum data set regarding every patient and a data set which is defined by a classification nomenclature (using an inner protocol). The main parts of the clinical record are the following tables: PERSONAL DATA: contains the fields relating to personal and admission data of the patient. The compilation of some fields is compulsory because they serve as input for the automated discharge letter. This table is used as an identifier for patient retrieval. composed of five different tables according to the kind of data. They are: familiar anamnesis, physiological anamnesis, past and next pathology anamnesis, and trauma anamnesis. GENERAL OBJECTIVITY: contains the general physical information of a patient. The fields hold default values, which quickens the compilation and assures the recording of normal values. NEUROLOGICAL EXAMINATION: contains information about the neurological status of the patient. Also in this table, there are default values in the fields. COMA: contains standardized data and classifications. The multiple choices are automated and driven and belong to homogeneous classes. SURGICAL OPERATIONS: the information recording is made defining the general kind of operation and then defining the peculiar kind of operation. INSTRUMENTAL EXAMINATIONS: some examination results are recorded in a free structure, while other ones (TAC, etc.) follow codified structure. In order to identify a pathology by means of TAC, it is enough to record three values corresponding to three variables. This classification fully describes a lot of neurosurgical pathologies.
DISCHARGE: contains conclusions, therapies, result, and hospital course. Medical language is closer to the natural one and presents some ambiguities. In order to solve this problem, a classification nomenclature was used for diagnosis definition. DISCHARGE LETTER: the document given to the patient when he is discharged. It extracts data from the previously described modules and contains standard headings. The information stored in the database is structured (e.g., diagnosis, name, surname, etc.) and access to this data takes place when the user wants to search the database, using particular queries where the identifying data of a patient is put as conditions for the research (SELECT age, name WHERE diagnosis="TRAUMA"). Logical operators and relational algebra of the relational DBMS allow more complex queries ((diagnosis="TRAUMA" AND age="19") OR sex="M"). The queries are deterministic, because data management uses a classification nomenclature. Data retrieval takes place through a matching, and the DBMS answers directly to the queries. The information retrieval speed depends upon the kind of system that is used; in our case retrieval time is low because the accesses to disk are few even for big databases. In medicine, clinical records can have a hierarchical structure and/or a relational one. Nevertheless, the hierarchical model presents a disadvantage: it is not very flexible because it is linked to a pre-defined structure; as a matter of fact, the definition of path is established in the beginning and not during the execution. Thus, a better representation of the system at a logical level requires a relational DBMS which exploits the relationships between entities in a vertical and horizontal way. That is why the developers adopted a mixed strategy which exploits the advantages of both models and which is provided by M Technology with SQL language (M/SQL).
For the future, it is important to have at one's disposal multimedia technologies, which integrate different kinds of information (alp
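The two query fragments quoted in the neurosurgery abstract above omit a FROM clause, so they are not complete SQL as printed. A minimal runnable sketch of the same two conditions follows, using Python's built-in sqlite3 module rather than the M/SQL system the authors describe. The table name `patients`, its columns, and the sample rows are assumptions made only for illustration; age is stored as an integer here, whereas the abstract compares it to a quoted string.

```python
import sqlite3

# Hypothetical table and rows; the abstract names only the fields
# diagnosis, name, age and sex, not the schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (name TEXT, age INTEGER, sex TEXT, diagnosis TEXT)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [
        ("Rossi", 19, "F", "TRAUMA"),
        ("Bianchi", 45, "M", "TUMOR"),
        ("Verdi", 19, "F", "TUMOR"),
        ("Neri", 60, "F", "TRAUMA"),
    ],
)

# Simple selection: age and name of every patient with a given diagnosis.
simple = conn.execute(
    "SELECT age, name FROM patients WHERE diagnosis = ? ORDER BY name",
    ("TRAUMA",),
).fetchall()

# Combined condition with logical operators, mirroring
# (diagnosis = 'TRAUMA' AND age = 19) OR sex = 'M'.
combined = conn.execute(
    "SELECT name FROM patients "
    "WHERE (diagnosis = ? AND age = ?) OR sex = ? ORDER BY name",
    ("TRAUMA", 19, "M"),
).fetchall()
```

Because the diagnosis comes from a fixed classification nomenclature, an equality match like this returns the same rows every time, which is the sense in which the abstract calls the queries deterministic.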
Querying Event Sequences by Exact Match or Similarity Search: Design and Empirical Evaluation
Wongsuphasawat, Krist; Plaisant, Catherine; Taieb-Maimon, Meirav; Shneiderman, Ben
2012-01-01
Specifying event sequence queries is challenging even for skilled computer professionals familiar with SQL. Most graphical user interfaces for database search use an exact match approach, which is often effective, but near misses may also be of interest. We describe a new similarity search interface, in which users specify a query by simply placing events on a blank timeline and retrieve a similarity-ranked list of results. Behind this user interface is a new similarity measure for event sequences which the users can customize by four decision criteria, enabling them to adjust the impact of missing, extra, or swapped events or the impact of time shifts. We describe a use case with Electronic Health Records based on our ongoing collaboration with hospital physicians. A controlled experiment with 18 participants compared exact match and similarity search interfaces. We report on the advantages and disadvantages of each interface and suggest a hybrid interface combining the best of both. PMID:22379286
Mynodbcsv: lightweight zero-config database solution for handling very large CSV files.
Adaszewski, Stanisław
2014-01-01
Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, their format is often the first obstacle. Lack of standardized ways of exploring different data layouts requires an effort each time to solve the problem from scratch. Possibility to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL) would offer expressiveness and user-friendliness. Comma-separated values (CSV) are one of the most common data storage formats. Despite its simplicity, with growing file size handling it becomes non-trivial. Importing CSVs into existing databases is time-consuming and troublesome, or even impossible if its horizontal dimension reaches thousands of columns. Most databases are optimized for handling large number of rows rather than columns, therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It's characterized by: "no copy" approach--data stay mostly in the CSV files; "zero configuration"--no need to specify database schema; written in C++, with boost [1], SQLite [2] and Qt [3], doesn't require installation and has very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files ensure efficient plan execution; effortless support for millions of columns; due to per-value typing, using mixed text/numbers data is easy; very simple network protocol provides efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It doesn't need any prerequisites to run, as all of the libraries are included in the distribution package. 
I test it against existing database solutions using a battery of benchmarks and discuss the results.
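The idea of querying a CSV file with SQL without declaring a schema can be sketched in a few lines of Python. This is not the Mynodbcsv implementation: unlike the "no copy" design described above, the sketch copies every row into an in-memory SQLite table. The function name `query_csv` and the default table name `data` are assumptions for illustration only.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run one query.

    Column names come from the CSV header, so no schema is written by
    hand. Columns are created without a declared type, so each value is
    stored as the text read from the file.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    # Quote identifiers so headers with spaces or quotes still work.
    columns = ", ".join('"{}"'.format(h.replace('"', '""')) for h in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )
    return conn.execute(sql).fetchall()


sample = "name,score\na,1.0\nb,2.5\nc,3.0\n"
rows = query_csv(
    sample,
    "SELECT name FROM data WHERE CAST(score AS REAL) > 1.5 ORDER BY name",
)
```

The explicit CAST is needed because the values are loaded as text; a system with per-value typing, as the abstract describes, would not require it.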
Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files
Adaszewski, Stanisław
2014-01-01
Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, their format is often the first obstacle. Lack of standardized ways of exploring different data layouts requires an effort each time to solve the problem from scratch. Possibility to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL) would offer expressiveness and user-friendliness. Comma-separated values (CSV) are one of the most common data storage formats. Despite its simplicity, with growing file size handling it becomes non-trivial. Importing CSVs into existing databases is time-consuming and troublesome, or even impossible if its horizontal dimension reaches thousands of columns. Most databases are optimized for handling large number of rows rather than columns, therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It's characterized by: “no copy” approach – data stay mostly in the CSV files; “zero configuration” – no need to specify database schema; written in C++, with boost [1], SQLite [2] and Qt [3], doesn't require installation and has very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files ensure efficient plan execution; effortless support for millions of columns; due to per-value typing, using mixed text/numbers data is easy; very simple network protocol provides efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It doesn't need any prerequisites to run, as all of the libraries are included in the distribution package. 
I test it against existing database solutions using a battery of benchmarks and discuss the results. PMID:25068261
MO-F-CAMPUS-T-05: SQL Database Queries to Determine Treatment Planning Resource Usage
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fox, C; Gladstone, D
2015-06-15
Purpose: A radiation oncology clinic’s treatment capacity is traditionally thought to be limited by the number of machines in the clinic. As the number of fractions per course decreases and the number of adaptive plans increases, the question of how many treatment plans a clinic can plan becomes increasingly important. This work seeks to lay the groundwork for assessing treatment planning resource usage. Methods: Care path templates were created using the Aria 11 care path interface. Care path tasks included key steps in the treatment planning process from the completion of CT simulation through the first radiation treatment. SQL Server Management Studio was used to run SQL queries to extract task completion time stamps along with care path template information and diagnosis codes from the Aria database. Six months of planning cycles were evaluated. Elapsed time was evaluated in terms of work hours within Monday – Friday, 7am to 5pm. Results: For the 195 validated treatment planning cycles, the average time for planning and MD review was 22.8 hours. Of those cases 33 were categorized as urgent. The average planning time for urgent plans was 5 hours. A strong correlation was observed between diagnosis code and range of elapsed planning time, as well as between elapsed time and select diagnosis codes. It was also observed that tasks were more likely to be completed on the date due than the time that they were due. Follow-up confirmed that most users did not look at the due time. Conclusion: Evaluation of elapsed planning time and other tasks suggests that care paths should be adjusted to allow for different contouring and planning times for certain diagnosis codes and urgent cases. Additional clinic training around task due times vs dates or a structuring of care paths around due dates is also needed.
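The abstract above measures elapsed time in work hours (Monday to Friday, 7am to 5pm) but does not say how that was computed from the extracted time stamps. One possible way is sketched below in Python; the function name `work_hours_between` is hypothetical, holidays are ignored, and the time stamps are assumed to be naive local datetimes.

```python
from datetime import datetime, time, timedelta

WORK_START = time(7, 0)   # 7am
WORK_END = time(17, 0)    # 5pm


def work_hours_between(start, end):
    """Hours between start and end that fall within Mon-Fri, 07:00-17:00."""
    if end <= start:
        return 0.0
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            window_start = datetime.combine(day, WORK_START)
            window_end = datetime.combine(day, WORK_END)
            # Clip the interval to this day's working window.
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600.0


# Friday 15:00 to the following Monday 09:00:
# 2 work hours on Friday plus 2 on Monday; the weekend contributes nothing.
elapsed = work_hours_between(
    datetime(2015, 1, 2, 15, 0), datetime(2015, 1, 5, 9, 0)
)
```

Walking day by day keeps the logic easy to check; for six months of planning cycles the cost of the loop is negligible.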
Evaluation of Potential LSST Spatial Indexing Strategies
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nikolaev, S; Abdulla, G; Matzke, R
2006-10-13
The LSST requirement for producing alerts in near real-time, and the fact that generating an alert depends on knowing the history of light variations for a given sky position, both imply that the clustering information for all detections is available at any time during the survey. Therefore, any data structure describing clustering of detections in LSST needs to be continuously updated, even as new detections are arriving from the pipeline. We call this use case "incremental clustering", to reflect this continuous updating of clustering information. This document describes the evaluation results for several potential LSST incremental clustering strategies, using: (1) Neighbors table and zone optimization to store spatial clusters (a.k.a. Jim Gray's, or SDSS algorithm); (2) MySQL built-in R-tree implementation; (3) an external spatial index library which supports a query interface.
Informatics application provides instant research to practice benefits.
Bowles, K. H.; Peng, T.; Qian, R.; Naylor, M. D.
2001-01-01
A web-based research information system was designed to enable our research team to efficiently measure health related quality of life among frail older adults in a variety of health care settings (home care, nursing homes, assisted living, PACE). The structure, process, and outcome data is collected using laptop computers and downloaded to a SQL database. Unique features of this project are the ability to transfer research to practice by instantly sharing individual and aggregate results with the clinicians caring for these elders and directly impacting the quality of their care. Clinicians can also dial in to the database to access standard queries or receive customized reports about the patients in their facilities. This paper will describe the development and implementation of the information system. The conference presentation will include a demonstration and examples of research to practice benefits. PMID:11825156
Hosting and publishing astronomical data in SQL databases
NASA Astrophysics Data System (ADS)
Galkin, Anastasia; Klar, Jochen; Riebe, Kristin; Matokevic, Gal; Enke, Harry
2017-04-01
In astronomy, terabytes and petabytes of data are produced by ground instruments, satellite missions and simulations. At the Leibniz-Institute for Astrophysics Potsdam (AIP) we host and publish terabytes of cosmological simulation and observational data. The public archive at AIP has now reached a size of 60 TB and is still growing, and it helps to produce numerous scientific papers. The web framework Daiquiri offers a dedicated web interface for each of the hosted scientific databases. Scientists all around the world run SQL queries which include specific astrophysical functions and get their desired data in reasonable time. Daiquiri supports the scientific projects by offering a number of administration tools such as database and user management, contact messages to the staff and support for organization of meetings and workshops. The webpages can be customized and the Wordpress integration supports the participating scientists in maintaining the documentation and the projects' news sections.
Big Data Analytics with Datalog Queries on Spark.
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2016-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.
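To make the recursion problem mentioned above concrete, here is a small single-machine sketch in plain Python of a classic recursive Datalog query, transitive closure, evaluated with the standard semi-naive strategy. It is not BigDatalog code and does not use Spark; the function name `transitive_closure` and the edge data are assumptions for illustration.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the recursive Datalog program

        tc(X, Y) <- edge(X, Y).
        tc(X, Y) <- tc(X, Z), edge(Z, Y).

    Each round joins only the facts derived in the previous round
    (the delta) with edge, instead of re-joining the whole relation.
    """
    edges = set(edges)
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    closure = set(edges)   # all tc facts found so far
    delta = set(edges)     # tc facts that were new in the last round
    while delta:
        derived = {
            (x, y) for (x, z) in delta for y in successors.get(z, ())
        }
        delta = derived - closure  # keep only genuinely new facts
        closure |= delta
    return closure


chain = transitive_closure([(1, 2), (2, 3), (3, 4)])
cycle = transitive_closure([(1, 2), (2, 1)])
```

The loop stops at a fixpoint, when a round derives no new facts, which is also why it terminates on cyclic graphs.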
Big Data Analytics with Datalog Queries on Spark
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2017-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics. PMID:28626296
Photo-z-SQL: Integrated, flexible photometric redshift computation in a database
NASA Astrophysics Data System (ADS)
Beck, R.; Dobos, L.; Budavári, T.; Szalay, A. S.; Csabai, I.
2017-04-01
We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational capabilities of DB hardware. The code is able to perform both maximum likelihood and Bayesian estimation, and can handle inputs of variable photometric filter sets and corresponding broad-band magnitudes. It is possible to take into account the full covariance matrix between filters, and filter zero points can be empirically calibrated using measurements with given redshifts. The list of spectral templates and the prior can be specified flexibly, and the expensive synthetic magnitude computations are done via lazy evaluation, coupled with a caching of results. Parallel execution is fully supported. For large upcoming photometric surveys such as the LSST, the ability to perform in-place photo-z calculation would be a significant advantage. Also, the efficient handling of variable filter sets is a necessity for heterogeneous databases, for example the Hubble Source Catalog, and for cross-match services such as SkyQuery. We illustrate the performance of our code on two reference photo-z estimation testing datasets, and provide an analysis of execution time and scalability with respect to different configurations. The code is available for download at https://github.com/beckrob/Photo-z-SQL.
Efficiently Distributing Component-Based Applications Across Wide-Area Environments
2002-01-01
Oracle 8.1.7 Enterprise Edition), each running on a dedicated 1GHz dual-processor Pentium III workstation. For the RUBiS tests, we used a MySQL 4.0.12...a variety of sophisticated network-accessible services such as e-mail, banking, on-line shopping, entertainment, and serving as a data exchange...Beans Catalog Handles read-only queries to product database Customer Serves as a façade to Order and Account Stateful Session Beans ShoppingCart
Auxiliary Library Explorer (ALEX) Development
2016-02-01
non-empty cells. This is a laborious manual task and could probably have been avoided by using Java code to read the data directly from Excel. In fact...it might be even easier to leave the data as a comma-separated values (CSV) file and read the data in with Java, although this could create other...This is first implemented using the MakeFullDatabaseapp Java project, which performs an SQL query on the DSpace data to return a list of items for which
Benchmarking distributed data warehouse solutions for storing genomic variant information
Wiewiórka, Marek S.; Wysakowicz, Dawid P.; Okoniewski, Michał J.
2017-01-01
Genomic-based personalized medicine encompasses storing, analysing and interpreting genomic variants as its central issues. At a time when thousands of patients' sequenced exomes and genomes are becoming available, there is a growing need for efficient database storage and querying. The answer could be the application of modern distributed storage systems and query engines. However, the application of large genomic variant databases to this problem has not been sufficiently explored so far in the literature. To investigate the effectiveness of modern columnar storage [column-oriented Database Management System (DBMS)] and query engines, we have developed a prototypic genomic variant data warehouse, populated with large generated content of genomic variants and phenotypic data. Next, we have benchmarked performance of a number of combinations of distributed storages and query engines on a set of SQL queries that address biological questions essential for both research and medical applications. In addition, a non-distributed, analytical database (MonetDB) has been used as a baseline. Comparison of query execution times confirms that distributed data warehousing solutions outperform classic relational DBMSs. Moreover, pre-aggregation and further denormalization of data, which reduce the number of distributed join operations, significantly improve query performance by several orders of magnitude. Most of the distributed back-ends offer good performance for complex analytical queries, while the Optimized Row Columnar (ORC) format paired with Presto and Parquet with Spark 2 query engines provide, on average, the lowest execution times. Apache Kudu, on the other hand, is the only solution that guarantees a sub-second performance for simple genome range queries returning a small subset of data, where low-latency response is expected, while still offering decent performance for running analytical queries.
In summary, research and clinical applications that require the storage and analysis of variants from thousands of samples can benefit from the scalability and performance of distributed data warehouse solutions. Database URL: https://github.com/ZSI-Bio/variantsdwh PMID:29220442
NoSQL data model for semi-automatic integration of ethnomedicinal plant data from multiple sources.
Ningthoujam, Sanjoy Singh; Choudhury, Manabendra Dutta; Potsangbam, Kumar Singh; Chetia, Pankaj; Nahar, Lutfun; Sarker, Satyajit D; Basar, Norazah; Das Talukdar, Anupam
2014-01-01
Sharing traditional knowledge with the scientific community could refine scientific approaches to phytochemical investigation and conservation of ethnomedicinal plants. As such, integration of traditional knowledge with scientific data using a single platform for sharing is greatly needed. However, ethnomedicinal data are available in heterogeneous formats, which depend on cultural aspects, survey methodology and focus of the study. Phytochemical and bioassay data are also available from many open sources in various standards and customised formats. To design a flexible data model that could integrate both primary and curated ethnomedicinal plant data from multiple sources. The current model is based on MongoDB, one of the Not only Structured Query Language (NoSQL) databases. Although it does not contain schema, modifications were made so that the model could incorporate both standard and customised ethnomedicinal plant data format from different sources. The model presented can integrate both primary and secondary data related to ethnomedicinal plants. Accommodation of disparate data was accomplished by a feature of this database that supported a different set of fields for each document. It also allowed storage of similar data having different properties. The model presented is scalable to a highly complex level with continuing maturation of the database, and is applicable for storing, retrieving and sharing ethnomedicinal plant data. It can also serve as a flexible alternative to a relational and normalised database. Copyright © 2014 John Wiley & Sons, Ltd.
Assembling proteomics data as a prerequisite for the analysis of large scale experiments
Schmidt, Frank; Schmid, Monika; Thiede, Bernd; Pleißner, Klaus-Peter; Böhme, Martina; Jungblut, Peter R
2009-01-01
Background Despite the complete determination of the genome sequence of a huge number of bacteria, their proteomes remain relatively poorly defined. Besides new methods to increase the number of identified proteins, new database applications are necessary to store and present results of large-scale proteomics experiments. Results In the present study, a database concept has been developed to address these issues and to offer complete information via a web interface. In our concept, the Oracle based data repository system SQL-LIMS plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as 20S proteasome. Technical operations of our proteomics labs were used as the standard for SQL-LIMS template creation. By means of a Java based data parser, post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS. A minimum set of the proteomics data were transferred in our public 2D-PAGE database using a Java based interface (Data Transfer Tool) with the requirements of the PEDRo standardization. Furthermore, the stored proteomics data were extractable out of SQL-LIMS via XML. Conclusion The Oracle based data repository system SQL-LIMS played the central role in the proteomics workflow concept. Technical operations of our proteomics labs were used as standards for SQL-LIMS templates. Using a Java based parser, post-processed data of different approaches such as LC/ESI-MS, MALDI-MS and 1-DE and 2-DE were stored in SQL-LIMS. Thus, unique data formats of different instruments were unified and stored in SQL-LIMS tables. Moreover, a unique submission identifier allowed fast access to all experimental data. This was the main advantage compared to multi software solutions, especially if personnel fluctuations are high.
Moreover, large scale and high-throughput experiments must be managed in a comprehensive repository system such as SQL-LIMS, to query results in a systematic manner. On the other hand, these database systems are expensive and require at least one full time administrator and specialized lab manager. Moreover, the high technical dynamics in proteomics may cause problems to adjust new data formats. To summarize, SQL-LIMS met the requirements of proteomics data handling especially in skilled processes such as gel-electrophoresis or mass spectrometry and fulfilled the PSI standardization criteria. The data transfer into a public domain via DTT facilitated validation of proteomics data. Additionally, evaluation of mass spectra by post-processing using MS-Screener improved the reliability of mass analysis and prevented storage of data junk. PMID:19166578
NASA Astrophysics Data System (ADS)
Mueller, Wolfgang; Mueller, Henning; Marchand-Maillet, Stephane; Pun, Thierry; Squire, David M.; Pecenovic, Zoran; Giess, Christoph; de Vries, Arjen P.
2000-10-01
While in the area of relational databases interoperability is ensured by common communication protocols (e.g. ODBC/JDBC using SQL), Content Based Image Retrieval Systems (CBIRS) and other multimedia retrieval systems lack both a common query language and a common communication protocol. Besides its obvious short term convenience, interoperability of systems is crucial for the exchange and analysis of user data. In this paper, we present and describe an extensible XML-based query markup language, called MRML (Multimedia Retrieval Markup Language). MRML is primarily designed to ensure interoperability between different content-based multimedia retrieval systems. Further, MRML allows researchers to preserve their freedom in extending their system as needed. MRML encapsulates multimedia queries in a way that enables multimedia (MM) query languages, MM content descriptions, MM query engines, and MM user interfaces to grow independently from each other, reaching a maximum of interoperability while ensuring a maximum of freedom for the developer. To benefit from this, only a few simple design principles have to be respected when extending MRML for one's private needs. The design of extensions within the MRML framework will be described in detail in the paper. MRML has been implemented and tested for the CBIRS Viper, using the user interface Snake Charmer. Both are part of the GNU project and can be downloaded at our site.
The NOAO Data Lab PHAT Photometry Database
NASA Astrophysics Data System (ADS)
Olsen, Knut; Williams, Ben; Fitzpatrick, Michael; PHAT Team
2018-01-01
We present a database containing both the combined photometric object catalog and the single epoch measurements from the Panchromatic Hubble Andromeda Treasury (PHAT). This database is hosted by the NOAO Data Lab (http://datalab.noao.edu), and as such exposes a number of data services to the PHAT photometry, including access through a Table Access Protocol (TAP) service, direct PostgreSQL queries, web-based and programmatic query interfaces, remote storage space for personal database tables and files, and a JupyterHub-based Notebook analysis environment, as well as image access through a Simple Image Access (SIA) service. We show how the Data Lab database and Jupyter Notebook environment allow for straightforward and efficient analyses of PHAT catalog data, including maps of object density, depth, and color, extraction of light curves of variable objects, and proper motion exploration.
PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan
Kinjo, Akira R.; Yamashita, Reiko; Nakamura, Haruki
2010-01-01
This article is a tutorial for PDBj Mine, a new database and its interface for Protein Data Bank Japan (PDBj). In PDBj Mine, data are loaded from files in the PDBMLplus format (an extension of PDBML, PDB's canonical XML format, enriched with annotations), which are then served for the user of PDBj via the worldwide web (WWW). We describe the basic design of the relational database (RDB) and web interfaces of PDBj Mine. The contents of PDBMLplus files are first broken into XPath entities, and these paths and data are indexed in the way that reflects the hierarchical structure of the XML files. The data for each XPath type are saved into the corresponding relational table that is named as the XPath itself. The generation of table definitions from the PDBMLplus XML schema is fully automated. For efficient search, frequently queried terms are compiled into a brief summary table. Casual users can perform simple keyword search, and 'Advanced Search' which can specify various conditions on the entries. More experienced users can query the database using SQL statements which can be constructed in a uniform manner. Thus, PDBj Mine achieves a combination of the flexibility of XML documents and the robustness of the RDB. Database URL: http://www.pdbj.org/ PMID:20798081
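The XPath-to-table design described in this abstract can be illustrated with a minimal sketch. It is not PDBj Mine's actual loader: the sample document, the two-column table layout, and the function names are assumptions made for illustration, and SQLite stands in for whichever RDB the authors use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature document; real PDBMLplus files are far richer.
SAMPLE = """<datablock>
  <entry id="1ABC"/>
  <struct><title>Example protein</title></struct>
  <struct_keywords><text>hydrolase</text></struct_keywords>
</datablock>"""


def leaf_paths(elem, prefix=""):
    """Yield (xpath, text) for every element that carries text."""
    path = f"{prefix}/{elem.tag}"
    text = (elem.text or "").strip()
    if text:
        yield path, text
    for child in elem:
        yield from leaf_paths(child, path)


def load(xml_text, conn):
    """Store each XPath type in a table named after the path itself."""
    root = ET.fromstring(xml_text)
    entry_id = root.find("entry").get("id")
    for path, text in leaf_paths(root):
        # Quote the path so it can serve as an SQL identifier.
        table = '"' + path.replace('"', '""') + '"'
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (entry_id TEXT, value TEXT)"
        )
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (entry_id, text))


conn = sqlite3.connect(":memory:")
load(SAMPLE, conn)
# Every table is addressed the same way: by the XPath of the data it holds.
rows = conn.execute(
    'SELECT entry_id, value FROM "/datablock/struct/title"'
).fetchall()
```

Because every table name is derived mechanically from a path, queries over different parts of the document share one shape, which is one plausible reading of the abstract's remark that SQL statements "can be constructed in a uniform manner".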
PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan.
Kinjo, Akira R; Yamashita, Reiko; Nakamura, Haruki
2010-08-25
This article is a tutorial for PDBj Mine, a new database and its interface for Protein Data Bank Japan (PDBj). In PDBj Mine, data are loaded from files in the PDBMLplus format (an extension of PDBML, PDB's canonical XML format, enriched with annotations), which are then served for the user of PDBj via the worldwide web (WWW). We describe the basic design of the relational database (RDB) and web interfaces of PDBj Mine. The contents of PDBMLplus files are first broken into XPath entities, and these paths and data are indexed in the way that reflects the hierarchical structure of the XML files. The data for each XPath type are saved into the corresponding relational table that is named as the XPath itself. The generation of table definitions from the PDBMLplus XML schema is fully automated. For efficient search, frequently queried terms are compiled into a brief summary table. Casual users can perform simple keyword search, and 'Advanced Search' which can specify various conditions on the entries. More experienced users can query the database using SQL statements which can be constructed in a uniform manner. Thus, PDBj Mine achieves a combination of the flexibility of XML documents and the robustness of the RDB. Database URL: http://www.pdbj.org/
A veterinary anatomy tutoring system.
Theodoropoulos, G; Loumos, V; Antonopoulos, J
1994-02-14
A veterinary anatomy tutoring system was developed by using Knowledge Pro, an object-oriented software development tool with hypermedia capabilities, and MS Access, a relational database. Communication between them is facilitated by using the Structured Query Language (SQL). The architecture of the system is based on knowledge sets, each of which covers four different descriptions of an organ, namely gross anatomy (general description), gross anatomy (comparative features), histology, and embryology, which constitute the knowledge units. These knowledge units are linked with three global variables that define the animals, the topographies, and the system to which this organ belongs, creating three data-bases. These three data-bases are interrelated through the organ field in order to establish a relational model. This system allows versatility in the student's navigation through the information space by offering different modes for information location and presentation. These include course mode, review mode, reference mode, dissection mode, and comparison mode. In addition, the system provides a self-evaluation mode.
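The relational model described here, three data sets interrelated through the organ field, might look roughly like the following sketch. The table names, column names, and sample rows are hypothetical and are not taken from the original system, which used MS Access rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Three hypothetical tables, each keyed by the shared organ field.
conn.executescript("""
CREATE TABLE animal      (organ TEXT, animal TEXT);
CREATE TABLE topography  (organ TEXT, region TEXT);
CREATE TABLE body_system (organ TEXT, system TEXT);
INSERT INTO animal      VALUES ('liver', 'dog'), ('liver', 'horse'), ('heart', 'dog');
INSERT INTO topography  VALUES ('liver', 'abdomen'), ('heart', 'thorax');
INSERT INTO body_system VALUES ('liver', 'digestive'), ('heart', 'cardiovascular');
""")


def organs_for(animal, system):
    """Return (organ, region) pairs for one animal and one body system."""
    rows = conn.execute(
        """
        SELECT a.organ, t.region
        FROM animal a
        JOIN topography  t ON t.organ = a.organ
        JOIN body_system s ON s.organ = a.organ
        WHERE a.animal = ? AND s.system = ?
        ORDER BY a.organ
        """,
        (animal, system),
    )
    return rows.fetchall()


result = organs_for("dog", "digestive")
```

Joining on the organ column is what lets a student move between the animal, topography, and system views of the same organ.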
Thriving on Chaos: The Development of a Surgical Information System
Olund, Steven R.
1988-01-01
Hospitals present unique challenges to the computer industry, generating a greater quantity and variety of data than nearly any other enterprise. This is complicated by the fact that a hospital is not one homogeneous organization, but a bundle of semi-independent groups with unique data requirements. Therefore hospital information systems must be fast, flexible, reliable, easy to use and maintain, and cost-effective. The Surgical Information System at Rush Presbyterian-St. Luke's Medical Center, Chicago is such a system. It uses a Sequent Balance 21000 multi-processor superminicomputer, running industry standard tools such as the Unix operating system, a 4th generation programming language (4GL), and Structured Query Language (SQL) relational database management software. This treatise illustrates a comprehensive yet generic approach which can be applied to almost any clinical situation where access to patient data is required by a variety of medical professionals.
Research on sudden environmental pollution public service platform construction based on WebGIS
NASA Astrophysics Data System (ADS)
Bi, T. P.; Gao, D. Y.; Zhong, X. Y.
2016-08-01
In order to enable the social sharing of emergency-response information for sudden pollution accidents, so that the public can use services such as risk source information and dangerous goods control technology, SQL Server and ArcSDE software are used to establish a spatial database that stores all kinds of information, including risk sources, hazardous chemicals and handling methods in case of accidents. Combined with Chinese atmospheric environmental assessment standards, the SCREEN3 atmospheric dispersion model and a one-dimensional liquid diffusion model are established to realize the query of related information and the display of the diffusion effect under a B/S structure. Based on WebGIS technology, the C#.Net language is used to develop the sudden environmental pollution public service platform. As a result, the public service platform can make risk assessments and provide the best emergency processing services.
Big Geo Data Services: From More Bytes to More Barrels
NASA Astrophysics Data System (ADS)
Misev, Dimitar; Baumann, Peter
2016-04-01
The data deluge is affecting the oil and gas industry just as much as many other industries. However, aside from the sheer volume there is the challenge of data variety, such as regular and irregular grids, multi-dimensional space/time grids, point clouds, and TINs and other meshes. A uniform conceptualization for modelling and serving them could save substantial effort, such as the proverbial "department of reformatting". The notion of a coverage actually can accomplish this. Its abstract model in ISO 19123 together with the concrete, interoperable OGC Coverage Implementation Schema (CIS), which is currently under adoption as ISO 19123-2, provides a common platform for representing any n-D grid type, point clouds, and general meshes. This is paired with the OGC Web Coverage Service (WCS) together with its datacube analytics language, the OGC Web Coverage Processing Service (WCPS). The OGC WCS Core Reference Implementation, rasdaman, relies on Array Database technology, i.e. a NewSQL/NoSQL approach. It supports the grid part of coverages, with installations of 100+ TB known and single queries parallelized across 1,000+ cloud nodes. Recent research attempts to address the point cloud and mesh part through a unified query model. The Holy Grail envisioned is that these approaches can be merged into a single service interface at some time. We present both grid and point cloud / mesh approaches and discuss status, implementation, standardization, and research perspectives, including a live demo.
SORTEZ: a relational translator for NCBI's ASN.1 database.
Hart, K W; Searls, D B; Overton, G C
1994-07-01
The National Center for Biotechnology Information (NCBI) has created a database collection that includes several protein and nucleic acid sequence databases, a biosequence-specific subset of MEDLINE, as well as value-added information such as links between similar sequences. Information in the NCBI database is modeled in Abstract Syntax Notation 1 (ASN.1), an Open Systems Interconnection protocol designed for the purpose of exchanging structured data between software applications rather than as a data model for database systems. While the NCBI database is distributed with an easy-to-use information retrieval system, ENTREZ, the ASN.1 data model currently lacks an ad hoc query language for general-purpose data access. For that reason, we have developed a software package, SORTEZ, that transforms the ASN.1 database (or other databases with nested data structures) to a relational data model and subsequently to a relational database management system (Sybase) where information can be accessed through the relational query language, SQL. Because the need to transform data from one data model and schema to another arises naturally in several important contexts, including efficient execution of specific applications, access to multiple databases and adaptation to database evolution, this work also serves as a practical study of the issues involved in the various stages of database transformation. We show that transformation from the ASN.1 data model to a relational data model can be largely automated, but that schema transformation and data conversion require considerable domain expertise and would greatly benefit from additional support tools.
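The core idea of mapping a nested data structure onto relational tables can be sketched as follows. This is not SORTEZ itself: a Python dictionary stands in for an ASN.1 value, SQLite stands in for Sybase, and the record contents, table names, and column names are all hypothetical.

```python
import sqlite3

# Hypothetical nested record standing in for a parsed ASN.1 value.
record = {
    "accession": "X00001",
    "title": "example sequence",
    "features": [
        {"kind": "CDS", "start_pos": 10, "stop_pos": 99},
        {"kind": "gene", "start_pos": 1, "stop_pos": 120},
    ],
}

conn = sqlite3.connect(":memory:")
# The nested list becomes a child table that points back to its parent row.
conn.executescript("""
CREATE TABLE seq_entry (id INTEGER PRIMARY KEY, accession TEXT, title TEXT);
CREATE TABLE feature (
    entry_id  INTEGER REFERENCES seq_entry(id),
    kind      TEXT,
    start_pos INTEGER,
    stop_pos  INTEGER
);
""")


def store(rec):
    """Insert the parent row, then one child row per nested feature."""
    cur = conn.execute(
        "INSERT INTO seq_entry (accession, title) VALUES (?, ?)",
        (rec["accession"], rec["title"]),
    )
    entry_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [(entry_id, f["kind"], f["start_pos"], f["stop_pos"])
         for f in rec["features"]],
    )
    return entry_id


store(record)
# Once flattened, the data can be reached with an ordinary ad hoc SQL join.
cds = conn.execute(
    """
    SELECT e.accession, f.start_pos, f.stop_pos
    FROM seq_entry e JOIN feature f ON f.entry_id = e.id
    WHERE f.kind = 'CDS'
    """
).fetchall()
```

The mechanical part (one child table per nested list) is easy to automate; choosing good table boundaries and keys is the part the abstract says still needs domain expertise.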
PROTICdb: a web-based application to store, track, query, and compare plant proteome data.
Ferry-Dumazet, Hélène; Houel, Gwenn; Montalent, Pierre; Moreau, Luc; Langella, Olivier; Negroni, Luc; Vincent, Delphine; Lalanne, Céline; de Daruvar, Antoine; Plomion, Christophe; Zivy, Michel; Joets, Johann
2005-05-01
PROTICdb is a web-based application, mainly designed to store and analyze plant proteome data obtained by two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) and mass spectrometry (MS). The purposes of PROTICdb are (i) to store, track, and query information related to proteomic experiments, i.e., from tissue sampling to protein identification and quantitative measurements, and (ii) to integrate information from the user's own expertise and other sources into a knowledge base, used to support data interpretation (e.g., for the determination of allelic variants or products of post-translational modifications). Data insertion into the relational database of PROTICdb is achieved either by uploading outputs of image analysis and MS identification software, or by filling web forms. 2-D PAGE annotated maps can be displayed, queried, and compared through a graphical interface. Links to external databases are also available. Quantitative data can be easily exported in a tabulated format for statistical analyses. PROTICdb is based on the Oracle or the PostgreSQL Database Management System and is freely available upon request at the following URL: http://moulon.inra.fr/bioinfo/PROTICdb.
Querying clinical data in HL7 RIM based relational model with morph-RDB.
Priyatna, Freddy; Alonso-Calvo, Raul; Paraiso-Medina, Sergio; Corcho, Oscar
2017-10-05
Semantic interoperability is essential when carrying out post-genomic clinical trials where several institutions collaborate, since researchers and developers need to have an integrated view and access to heterogeneous data sources. One possible approach to accommodate this need is to use RDB2RDF systems that provide RDF datasets as the unified view. These RDF datasets may be materialized and stored in a triple store, or transformed into RDF in real time, as virtual RDF data sources. Our previous efforts involved materialized RDF datasets, hence losing data freshness. In this paper we present a solution that uses an ontology based on the HL7 v3 Reference Information Model and a set of R2RML mappings that relate this ontology to an underlying relational database implementation, and where morph-RDB is used to expose a virtual, non-materialized SPARQL endpoint over the data. By applying a set of optimization techniques on the SPARQL-to-SQL query translation algorithm, we can now issue SPARQL queries to the underlying relational data with generally acceptable performance.
NASA Technical Reports Server (NTRS)
Steeman, Gerald; Connell, Christopher
2000-01-01
Many librarians may feel that dynamic Web pages are out of their reach, financially and technically. Yet we are reminded in library and Web design literature that static home pages are a thing of the past. This paper describes how librarians at the Institute for Defense Analyses (IDA) library developed a database-driven, dynamic intranet site using commercial off-the-shelf applications. Administrative issues include surveying a library users group for interest and needs evaluation; outlining metadata elements; and committing resources, from managing time to populate the database to training in Microsoft FrontPage and Web-to-database design. Technical issues covered include Microsoft Access database fundamentals and lessons learned in the Web-to-database process (including setting up Data Source Names (DSNs), redesigning queries to accommodate the Web interface, and understanding Access 97 query language vs. Structured Query Language (SQL)). This paper also offers tips on editing Active Server Pages (ASP) scripting to create desired results. A how-to annotated resource list closes out the paper.
A Fast Healthcare Interoperability Resources (FHIR) layer implemented over i2b2.
Boussadi, Abdelali; Zapletal, Eric
2017-08-14
Standards and technical specifications have been developed to define how the information contained in Electronic Health Records (EHRs) should be structured, semantically described, and communicated. Current trends rely on differentiating the representation of data instances from the definition of clinical information models. The dual model approach, which combines a reference model (RM) and a clinical information model (CIM), puts this software design pattern into practice. The most recent initiative, proposed by HL7, is called Fast Healthcare Interoperability Resources (FHIR). The aim of our study was to investigate the feasibility of applying the FHIR standard to modeling and exposing EHR data of the Georges Pompidou European Hospital (HEGP) integrating biology and the bedside (i2b2) clinical data warehouse (CDW). We implemented a FHIR server over i2b2 to expose EHR data in relation with five FHIR resources: DiagnosticReport, MedicationOrder, Patient, Encounter, and Medication. The architecture of the server combines a Data Access Object design pattern and FHIR resource providers, implemented using the Java HAPI FHIR API. Two types of queries were tested. Query type #1 requests the server to display DiagnosticReport resources for which the diagnosis code is equal to a given ICD-10 code. A total of 80 DiagnosticReport resources, corresponding to 36 patients, were displayed. Query type #2 requests the server to display MedicationOrder resources for which the FHIR Medication identification code is equal to a given code expressed in a French coding system. A total of 503 MedicationOrder resources, corresponding to 290 patients, were displayed. Results were validated by manually comparing the results of each request to the results displayed by an ad-hoc SQL query. We showed the feasibility of implementing a Java layer over the i2b2 database model to expose data of the CDW as a set of FHIR resources.
An important part of this work was the structural and semantic mapping between the i2b2 model and the FHIR RM. To accomplish this, developers must manually browse the specifications of the FHIR standard. Our source code is freely available and can be adapted for use in other i2b2 sites.
Balaur, Irina; Saqi, Mansoor; Barat, Ana; Lysenko, Artem; Mazein, Alexander; Rawlings, Christopher J; Ruskin, Heather J; Auffray, Charles
2017-10-01
The development of colorectal cancer (CRC), the third most common cancer type, has been associated with deregulations of cellular mechanisms stimulated by both genetic and epigenetic events. StatEpigen is a manually curated and annotated database, containing information on interdependencies between genetic and epigenetic signals, and specialized currently for CRC research. Although StatEpigen provides a well-developed graphical user interface for information retrieval, advanced queries involving associations between multiple concepts can benefit from more detailed graph representation of the integrated data. This can be achieved by using a graph database (NoSQL) approach. Data were extracted from StatEpigen and imported to our newly developed EpiGeNet, a graph database for storage and querying of conditional relationships between molecular (genetic and epigenetic) events observed at different stages of colorectal oncogenesis. We illustrate the enhanced capability of EpiGeNet for exploration of different queries related to colorectal tumor progression; specifically, we demonstrate the query process for (i) stage-specific molecular events, (ii) most frequently observed genetic and epigenetic interdependencies in colon adenoma, and (iii) paths connecting key genes reported in CRC and associated events. The EpiGeNet framework offers improved capability for management and visualization of data on molecular events specific to CRC initiation and progression.
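Query type (iii), finding paths that connect key genes through associated events, is the kind of question a graph representation answers directly. The sketch below uses a plain in-memory breadth-first search rather than a graph database, and the node names are placeholders, not real genes or events from StatEpigen or EpiGeNet.

```python
from collections import deque

# Placeholder edges; "geneA", "event1", etc. are not real biological entities.
EDGES = [
    ("geneA", "event1"),
    ("event1", "geneB"),
    ("geneB", "event2"),
    ("event2", "geneC"),
    ("geneA", "event3"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


graph = build_graph(EDGES)
path = shortest_path(graph, "geneA", "geneC")
```

In a relational schema the same question would need a recursive query or repeated self-joins, which is presumably part of the motivation for the graph approach.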
Omicseq: a web-based search engine for exploring omics datasets
Sun, Xiaobo; Pittard, William S.; Xu, Tianlei; Chen, Li; Zwick, Michael E.; Jiang, Xiaoqian; Wang, Fusheng
2017-01-01
The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve ‘findability’ of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. PMID:28402462
Jefferson, Emily R.; Walsh, Thomas P.; Roberts, Timothy J.; Barton, Geoffrey J.
2007-01-01
SNAPPI-DB, a high performance database of Structures, iNterfaces and Alignments of Protein–Protein Interactions, and its associated Java Application Programming Interface (API) is described. SNAPPI-DB contains structural data, down to the level of atom co-ordinates, for each structure in the Protein Data Bank (PDB) together with associated data including SCOP, CATH, Pfam, SWISSPROT, InterPro, GO terms, Protein Quaternary Structures (PQS) and secondary structure information. Domain–domain interactions are stored for multiple domain definitions and are classified by their Superfamily/Family pair and interaction interface. Each set of classified domain–domain interactions has an associated multiple structure alignment for each partner. The API facilitates data access via PDB entries, domains and domain–domain interactions. Rapid development, fast database access and the ability to perform advanced queries without the requirement for complex SQL statements are provided via an object oriented database and the Java Data Objects (JDO) API. SNAPPI-DB contains many features which are not available in other databases of structural protein–protein interactions. It has been applied in three studies on the properties of protein–protein interactions and is currently being employed to train a protein–protein interaction predictor and a functional residue predictor. The database, API and manual are available for download at: . PMID:17202171
MODBASE, a database of annotated comparative protein structure models
Pieper, Ursula; Eswar, Narayanan; Stuart, Ashley C.; Ilyin, Valentin A.; Sali, Andrej
2002-01-01
MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10^-4) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server. PMID:11752309
A JEE RESTful service to access Conditions Data in ATLAS
NASA Astrophysics Data System (ADS)
Formica, Andrea; Gallas, E. J.
2015-12-01
Usage of condition data in ATLAS is extensive for offline reconstruction and analysis (e.g. alignment, calibration, data quality). The system is based on the LCG Conditions Database infrastructure, with read and write access via an ad hoc C++ API (COOL), a system which was developed before Run 1 data taking began. The infrastructure dictates that the data is organized into separate schemas (assigned to subsystems/groups storing distinct and independent sets of conditions), making it difficult to access information from several schemas at the same time. We have thus created PL/SQL functions containing queries to provide content extraction at multi-schema level. The PL/SQL API has been exposed to external clients by means of a Java application providing DB access via REST services, deployed inside an application server (JBoss WildFly). The services allow navigation over multiple schemas via simple URLs. The data can be retrieved either in XML or JSON formats, via simple clients (like curl or Web browsers).
StreptomycesInforSys: A web-enabled information repository
Jain, Chakresh Kumar; Gupta, Vidhi; Gupta, Ashvarya; Gupta, Sanjay; Wadhwa, Gulshan; Sharma, Sanjeev Kumar; Sarethy, Indira P
2012-01-01
Members of Streptomyces produce 70% of natural bioactive products. A considerable amount of information is available, based on a polyphasic approach, for the classification of Streptomyces. This information, based on phenotypic, genotypic and bioactive component production profiles, is crucial for pharmacological screening programmes. However, it is scattered across various journals, books and other resources, many of which are not freely accessible. The designed database incorporates polyphasic typing information using combinations of search options to aid in efficient screening of new isolates. This will help in the preliminary categorization of appropriate groups. It is a free relational database compatible with existing operating systems. A cross platform technology with XAMPP Web server has been used to develop, manage, and facilitate the user query effectively with database support. Employment of PHP, a platform-independent scripting language, embedded in HTML and the database management software MySQL will facilitate dynamic information storage and retrieval. The user-friendly, open and flexible freeware (PHP, MySQL and Apache) is foreseen to reduce running and maintenance cost. Availability: www.sis.biowaves.org PMID:23275736
A Web-based Tool for SDSS and 2MASS Database Searches
NASA Astrophysics Data System (ADS)
Hendrickson, M. A.; Uomoto, A.; Golimowski, D. A.
We have developed a web site using HTML, PHP, Python, and MySQL that extracts, processes, and displays data from the Sloan Digital Sky Survey (SDSS) and the Two-Micron All-Sky Survey (2MASS). The goal is to locate brown dwarf candidates in the SDSS database by looking at color cuts; however, this site could also be useful for targeted searches of other databases. MySQL databases are created from broad searches of SDSS and 2MASS data. Broad queries on the SDSS and 2MASS database servers are run weekly so that observers have the most up-to-date information from which to select candidates for observation. Observers can look at detailed information about specific objects including finding charts, images, and available spectra. In addition, updates from previous observations can be added by any collaborators; this format makes observational collaboration simple. Observers can also restrict the database search, just before or during an observing run, to select objects of special interest.
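A color cut of the kind mentioned here amounts to a filter on the difference between two magnitudes. The following is a minimal sketch using SQLite in place of MySQL; the table layout, column names, sample magnitudes, and the 1.5 threshold are placeholders chosen for illustration, not the selection criteria used by the authors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical local table built from a broad survey query.
conn.execute("CREATE TABLE candidates (obj_id TEXT, i_mag REAL, z_mag REAL)")
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?)",
    [("obj1", 21.0, 19.0), ("obj2", 18.5, 18.0), ("obj3", 22.5, 20.0)],
)


def color_cut(min_i_minus_z):
    """Return objects whose i - z color is at least the given threshold."""
    rows = conn.execute(
        """
        SELECT obj_id, i_mag - z_mag AS i_z
        FROM candidates
        WHERE i_mag - z_mag >= ?
        ORDER BY i_z DESC
        """,
        (min_i_minus_z,),
    )
    return rows.fetchall()


# 1.5 is an arbitrary illustrative threshold, not a published criterion.
selected = color_cut(1.5)
```

Keeping the threshold as a query parameter lets an observer tighten or loosen the cut just before or during a run without changing the stored data.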
Profile-IQ: Web-based data query system for local health department infrastructure and activities.
Shah, Gulzar H; Leep, Carolyn J; Alexander, Dayna
2014-01-01
To demonstrate the use of National Association of County & City Health Officials' Profile-IQ, a Web-based data query system, and how policy makers, researchers, the general public, and public health professionals can use the system to generate descriptive statistics on local health departments. This article is a descriptive account of an important health informatics tool based on information from the project charter for Profile-IQ and the authors' experience and knowledge in design and use of this query system. Profile-IQ is a Web-based data query system that is based on open-source software: MySQL 5.5, Google Web Toolkit 2.2.0, Apache Commons Math library, Google Chart API, and Tomcat 6.0 Web server deployed on an Amazon EC2 server. It supports dynamic queries of National Profile of Local Health Departments data on local health department finances, workforce, and activities. Profile-IQ's customizable queries provide a variety of statistics not available in published reports and support the growing information needs of users who do not wish to work directly with data files for lack of staff skills or time, or to avoid a data use agreement. Profile-IQ also meets the growing demand of public health practitioners and policy makers for data to support quality improvement, community health assessment, and other processes associated with voluntary public health accreditation. It represents a step forward in the recent health informatics movement of data liberation and use of open source information technology solutions to promote public health.
Performance analysis of different database in new internet mapping system
NASA Astrophysics Data System (ADS)
Yao, Xing; Su, Wei; Gao, Shuai
2017-03-01
In the Mapping System of New Internet, massive numbers of mapping entries between AID and RID need to be stored, added, updated, and deleted. To better handle a large number of mapping-entry update and query requests, the Mapping System of New Internet must use a high-performance database. In this paper, we focus on the performance of three typical databases, Redis, SQLite, and MySQL, and the results show that a Mapping System based on different databases can adapt to different needs according to the actual situation.
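The four operations named in this abstract (store, add, update, delete) can be sketched against one of the three databases it compares. The following is a minimal illustration using SQLite through Python's standard `sqlite3` module; the table name `mapping`, the column names, and the helper functions are hypothetical and are not taken from the paper.

```python
import sqlite3


def open_mapping_store():
    """Create an in-memory SQLite store for AID -> RID mapping entries.

    The schema is an assumption made for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mapping (aid TEXT PRIMARY KEY, rid TEXT NOT NULL)"
    )
    return conn


def put_mapping(conn, aid, rid):
    """Add a new entry, or update the RID if the AID already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO mapping (aid, rid) VALUES (?, ?)", (aid, rid)
    )


def get_rid(conn, aid):
    """Query the RID for an AID; returns None when there is no entry."""
    row = conn.execute(
        "SELECT rid FROM mapping WHERE aid = ?", (aid,)
    ).fetchone()
    return row[0] if row else None


def delete_mapping(conn, aid):
    """Delete the entry for an AID (no error if it is absent)."""
    conn.execute("DELETE FROM mapping WHERE aid = ?", (aid,))
```

A performance comparison like the one described would time many such calls against each backend; this sketch only shows the shape of the workload.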
Zhang, Qingzhou; Yang, Bo; Chen, Xujiao; Xu, Jing; Mei, Changlin; Mao, Zhiguo
2014-01-01
We present a bioinformatics database named Renal Gene Expression Database (RGED), which contains comprehensive gene expression data sets from renal disease research. The web-based interface of RGED allows users to query the gene expression profiles in various kidney-related samples, including renal cell lines, human kidney tissues and murine model kidneys. Researchers can explore particular gene profiles and the relationships between genes of interest, and identify biomarkers or even drug targets in kidney diseases. The aim of this work is to provide a user-friendly utility for the renal disease research community to query expression profiles of genes of their own interest without the requirement of advanced computational skills. Availability and implementation: The website is implemented in PHP, R, MySQL and Nginx and is freely available from http://rged.wall-eva.net. Database URL: http://rged.wall-eva.net PMID:25252782
Zhang, Qingzhou; Yang, Bo; Chen, Xujiao; Xu, Jing; Mei, Changlin; Mao, Zhiguo
2014-01-01
We present a bioinformatics database named Renal Gene Expression Database (RGED), which contains comprehensive gene expression data sets from renal disease research. The web-based interface of RGED allows users to query the gene expression profiles in various kidney-related samples, including renal cell lines, human kidney tissues and murine model kidneys. Researchers can explore particular gene profiles and the relationships between genes of interest, and identify biomarkers or even drug targets in kidney diseases. The aim of this work is to provide a user-friendly utility for the renal disease research community to query expression profiles of genes of their own interest without the requirement of advanced computational skills. The website is implemented in PHP, R, MySQL and Nginx and is freely available from http://rged.wall-eva.net. © The Author(s) 2014. Published by Oxford University Press.
PropBase Query Layer: a single portal to UK subsurface physical property databases
NASA Astrophysics Data System (ADS)
Kingdon, Andrew; Nayembil, Martin L.; Richardson, Anne E.; Smith, A. Graham
2013-04-01
Until recently, the delivery of geological information for industry and public was achieved by geological mapping. Now pervasively available computers mean that 3D geological models can deliver realistic representations of the geometric location of geological units, represented as shells or volumes. The next phase of this process is to populate these with physical properties data that describe subsurface heterogeneity and its associated uncertainty. Achieving this requires capture and serving of physical, hydrological and other property information from diverse sources to populate these models. The British Geological Survey (BGS) holds large volumes of subsurface property data, derived both from their own research data collection and also other, often commercially derived data sources. This can be voxelated to incorporate this data into the models to demonstrate property variation within the subsurface geometry. All property data held by BGS has for many years been stored in relational databases to ensure their long-term continuity. However these have, by necessity, complex structures; each database contains positional reference data and model information, and also metadata such as sample identification information and attributes that define the source and processing. Whilst this is critical to assessing these analyses, it also hugely complicates the understanding of variability of the property under assessment and requires multiple queries to study related datasets making extracting physical properties from these databases difficult. Therefore the PropBase Query Layer has been created to allow simplified aggregation and extraction of all related data and its presentation of complex data in simple, mostly denormalized, tables which combine information from multiple databases into a single system. 
The structure from each relational database is denormalized in a generalised structure, so that each dataset can be viewed together in a common format using a simple interface. Data are re-engineered to facilitate easy loading. The query layer structure comprises tables, procedures, functions, triggers, views and materialised views. The structure contains a main table PRB_DATA which contains all of the data with the following attribution: • a unique identifier • the data source • the unique identifier from the parent database for traceability • the 3D location • the property type • the property value • the units • necessary qualifiers • precision information and an audit trail Data sources, property type and units are constrained by dictionaries, a key component of the structure which defines what properties and inheritance hierarchies are to be coded and also guides the process as to what and how these are extracted from the structure. Data types served by the Query Layer include site investigation derived geotechnical data, hydrogeology datasets, regional geochemistry, geophysical logs as well as lithological and borehole metadata. The size and complexity of the data sets with multiple parent structures requires a technically robust approach to keep the layer synchronised. This is achieved through Oracle procedures written in PL/SQL containing the logic required to carry out the data manipulation (inserts, updates, deletes) to keep the layer synchronised with the underlying databases either as regular scheduled jobs (weekly, monthly etc) or invoked on demand. The PropBase Query Layer's implementation has enabled rapid data discovery, visualisation and interpretation of geological data with greater ease, simplifying the parametrisation of 3D model volumes and facilitating the study of intra-unit heterogeneity.
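The single-table layout and dictionary constraints described above can be sketched as follows. This is only an illustration: it uses SQLite via Python's `sqlite3` module as a stand-in for the Oracle implementation the abstract mentions, and every table and column name other than PRB_DATA (written here in lower case) is a hypothetical choice rather than the actual BGS schema. Qualifiers, precision and audit columns are omitted for brevity.

```python
import sqlite3


def build_layer():
    """Create a toy denormalized property table constrained by a dictionary."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE dic_property (code TEXT PRIMARY KEY);
        CREATE TABLE prb_data (
            id        INTEGER PRIMARY KEY,   -- unique identifier
            source    TEXT NOT NULL,         -- the data source
            parent_id TEXT NOT NULL,         -- identifier in the parent database
            x REAL, y REAL, z REAL,          -- the 3D location
            property  TEXT NOT NULL REFERENCES dic_property(code),
            value     REAL,                  -- the property value
            units     TEXT                   -- the units
        );
        """
    )
    return conn


def add_row(conn, source, parent_id, xyz, prop, value, units):
    """Insert one property value; fails if `prop` is not in the dictionary."""
    conn.execute(
        "INSERT INTO prb_data (source, parent_id, x, y, z, property, value, units) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (source, parent_id, xyz[0], xyz[1], xyz[2], prop, value, units),
    )
```

With this layout, inserting a row whose property code is missing from `dic_property` raises `sqlite3.IntegrityError`, which mirrors the role the abstract assigns to dictionaries in constraining what can be coded.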
National Irrigation Water Quality Program data-synthesis data base
Seiler, Ralph L.; Skorupa, Joseph P.
2001-01-01
Under the National Irrigation Water Quality Program (NIWQP) of the U.S. Department of the Interior, researchers investigated contamination caused by irrigation drainage in 26 areas in the Western United States from 1986 to 1993. From 1992 to 1995, a comprehensive relational data base was built to organize data collected during the 26-area investigations. The data base provided the basis for analysis and synthesis of these data to identify common features of contaminated areas and hence dominant biologic, geologic, climatic, chemical, and physiographic factors that have resulted in contamination of water and biota in irrigated areas in the Western United States. Included in the data base are geologic, hydrologic, climatological, chemical, and cultural data that describe the 26 study areas in 14 Western States. The data base contains information on 1,264 sites from which water and bottom sediment were collected. It also contains chemical data from 6,903 analyses of surface water, 914 analyses of ground water, 707 analyses of inorganic constituents in bottom sediments, 223 analyses of organochlorine pesticides in bottom sediments, 8,217 analyses of inorganic constituents in biota, and 1,088 analyses for organic constituents in biota. The data base is available to the public and can be obtained at the NIWQP homepage http://www.usbr.gov/niwqp as dBase III tables for personal-computer systems or as American Standard Code for Information Interchange (ASCII) structured query language (SQL) command and data files for SQL data bases.
searchSCF: Using MongoDB to Enable Richer Searches of Locally Hosted Science Data Repositories
NASA Astrophysics Data System (ADS)
Knosp, B.
2016-12-01
Science teams today are in the unusual position of almost having too much data available to them. Modern sensors and models are capable of outputting terabytes of data per day, which can make it difficult to find specific subsets of data. The sheer size of files can also make it time-consuming to retrieve this big data from national data archive centers. Thus, many science teams choose to store what data they can on their local systems, but they are not always equipped with tools to help them intelligently organize and search their data. In its local data repository, the Aura Microwave Limb Sounder (MLS) science team at NASA's Jet Propulsion Laboratory has collected over 300 TB of atmospheric science data from 71 missions/models that aid in validation, algorithm development, and research activities. When the project began, the team developed a MySQL database to aid in data queries, but this database was only designed to keep track of MLS and a few ancillary data sets, leaving much of the data uncatalogued. The team has also seen database query time rise over the life of the mission. Even though the MLS science team's data holdings are not the size of a national data center's, team members still need tools to help them discover and utilize the data that they have on hand. Over the past year, members of the science team have been looking for solutions to (1) store information on all the data sets they have collected in a single database, (2) store more metadata about each data file, (3) develop queries that can find relationships among these disparate data types, and (4) plug any new functions developed around this database into existing analysis, visualization, and web tools, transparently to users. In this presentation, I will discuss the searchSCF package that is currently under development. This package includes a NoSQL database management system (MongoDB) and a set of Python tools that both ingest data into the database and support user queries. 
I will also highlight case studies of how this system could be used by the MLS science team, and how it could be implemented by other science teams with local data repositories.
NASA Astrophysics Data System (ADS)
Miles, B.; Chepudira, K.; LaBar, W.
2017-12-01
The Open Geospatial Consortium (OGC) SensorThings API (STA) specification, ratified in 2016, is a next-generation open standard for enabling real-time communication of sensor data. Building on over a decade of OGC Sensor Web Enablement (SWE) Standards, STA offers a rich data model that can represent a range of sensor and phenomena types (e.g. fixed sensors sensing fixed phenomena, fixed sensors sensing moving phenomena, mobile sensors sensing fixed phenomena, and mobile sensors sensing moving phenomena) and is data agnostic. Additionally, and in contrast to previous SWE standards, STA is developer-friendly, as is evident from its convenient JSON serialization and expressive OData-based query language (with support for geospatial queries); with its support for Message Queue Telemetry Transport (MQTT), STA is also well-suited to efficient real-time data publishing and discovery. All these attributes make STA potentially useful in environmental monitoring sensor networks. Here we present Kinota(TM), an Open-Source NoSQL implementation of OGC SensorThings for large-scale high-resolution real-time environmental monitoring. Kinota, which roughly stands for Knowledge from Internet of Things Analyses, relies on Cassandra as its underlying data store, which is a horizontally scalable, fault-tolerant open-source database that is often used to store time-series data for Big Data applications (though integration with other NoSQL or relational databases is possible). With this foundation, Kinota can scale to store data from an arbitrary number of sensors collecting data every 500 milliseconds. Additionally, Kinota's architecture is highly modular, allowing for customization by adopters who can choose to replace parts of the existing implementation when desirable. The architecture is also highly portable, providing the flexibility to choose between cloud providers such as Azure, Amazon, and Google. 
The scalable, flexible and cloud friendly architecture of Kinota makes it ideal for use in next-generation large-scale and high-resolution real-time environmental monitoring networks used in domains such as hydrology, geomorphology, and geophysics, as well as management applications such as flood early warning, and regulatory enforcement.
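To make the OData-based query language mentioned in this abstract concrete, the sketch below builds a SensorThings request URL for recent observations of one datastream. The `$filter`, `$orderby` and `$top` query options and the `Datastreams(id)/Observations` path follow the SensorThings API; the base URL, the datastream id and the helper name `observations_url` are assumptions for illustration and do not refer to a real Kinota deployment.

```python
from urllib.parse import quote, urlencode


def observations_url(base_url, datastream_id, since_iso, top=10):
    """Build a SensorThings URL for observations at or after `since_iso`.

    `base_url` is a hypothetical service root such as
    "https://example.org/v1.0".
    """
    params = {
        # OData comparison: keep observations at or after the given time.
        "$filter": f"phenomenonTime ge {since_iso}",
        # Newest observations first.
        "$orderby": "phenomenonTime desc",
        # Limit the number of returned entities.
        "$top": str(top),
    }
    # Percent-encode spaces and colons but keep the leading "$" readable.
    query = urlencode(params, quote_via=quote, safe="$")
    return f"{base_url}/Datastreams({datastream_id})/Observations?{query}"
```

The response to such a request is JSON, so a client would typically fetch the URL and read the `value` array of the returned document.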
SeqWare Query Engine: storing and searching sequence data in the cloud.
O'Connor, Brian D; Merriman, Barry; Nelson, Stanley F
2010-12-21
Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever increasing demands. In this work, we present the SeqWare Query Engine which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. 
This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.
SeqWare Query Engine: storing and searching sequence data in the cloud
2010-01-01
Background Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever increasing demands. Results In this work, we present the SeqWare Query Engine which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). Conclusions The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. 
This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets. PMID:21210981
Abhyankar, Swapna; Demner-Fushman, Dina; Callaghan, Fiona M; McDonald, Clement J
2014-01-01
Objective To develop a generalizable method for identifying patient cohorts from electronic health record (EHR) data—in this case, patients having dialysis—that uses simple information retrieval (IR) tools. Methods We used the coded data and clinical notes from the 24 506 adult patients in the Multiparameter Intelligent Monitoring in Intensive Care database to identify patients who had dialysis. We used SQL queries to search the procedure, diagnosis, and coded nursing observations tables based on ICD-9 and local codes. We used a domain-specific search engine to find clinical notes containing terms related to dialysis. We manually validated the available records for a 10% random sample of patients who potentially had dialysis and a random sample of 200 patients who were not identified as having dialysis based on any of the sources. Results We identified 1844 patients that potentially had dialysis: 1481 from the three coded sources and 1624 from the clinical notes. Precision for identifying dialysis patients based on available data was estimated to be 78.4% (95% CI 71.9% to 84.2%) and recall was 100% (95% CI 86% to 100%). Conclusions Combining structured EHR data with information from clinical notes using simple queries increases the utility of both types of data for cohort identification. Patients identified by more than one source are more likely to meet the inclusion criteria; however, including patients found in any of the sources increases recall. This method is attractive because it is available to researchers with access to EHR data and off-the-shelf IR tools. PMID:24384230
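The counts reported in this abstract imply how many patients were found by both kinds of source, assuming the 1844 total is exactly the union of the coded-source group (1481) and the clinical-notes group (1624). The sketch below works that out by inclusion-exclusion and also shows the standard precision and recall formulas; the function names are illustrative, and the numbers passed to `precision_recall` in any usage are toy values, not the study's validation counts.

```python
def overlap_size(n_a, n_b, n_union):
    """Size of the intersection of two sets, by inclusion-exclusion."""
    return n_a + n_b - n_union


def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from confusion counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def combine(coded_ids, note_ids):
    """Return (any_source, both_sources) for two sets of patient ids.

    Taking the union favours recall; taking the intersection favours
    precision, as the abstract's conclusions describe.
    """
    coded_ids, note_ids = set(coded_ids), set(note_ids)
    return coded_ids | note_ids, coded_ids & note_ids
```

Under the stated assumption, `overlap_size(1481, 1624, 1844)` gives 1261 patients identified by both the coded data and the clinical notes.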
PhyloExplorer: a web server to validate, explore and query phylogenetic trees
Ranwez, Vincent; Clairon, Nicolas; Delsuc, Frédéric; Pourali, Saeed; Auberval, Nicolas; Diser, Sorel; Berry, Vincent
2009-01-01
Background Many important problems in evolutionary biology require molecular phylogenies to be reconstructed. Phylogenetic trees must then be manipulated for subsequent inclusion in publications or analyses such as supertree inference and tree comparisons. However, no tool is currently available to facilitate the management of tree collections providing, for instance: standardisation of taxon names among trees with respect to a reference taxonomy; selection of relevant subsets of trees or sub-trees according to a taxonomic query; or simply computation of descriptive statistics on the collection. Moreover, although several databases of phylogenetic trees exist, there is currently no easy way to find trees that are both relevant and complementary to a given collection of trees. Results We propose a tool to facilitate assessment and management of phylogenetic tree collections. Given an input collection of rooted trees, PhyloExplorer provides facilities for obtaining statistics describing the collection, correcting invalid taxon names, extracting taxonomically relevant parts of the collection using a dedicated query language, and identifying related trees in the TreeBASE database. Conclusion PhyloExplorer is a simple and interactive website implemented through underlying Python libraries and MySQL databases. It is available at: and the source code can be downloaded from: . PMID:19450253
Magnetic Fields for All: The GPIPS Community Web-Access Portal
NASA Astrophysics Data System (ADS)
Carveth, Carol; Clemens, D. P.; Pinnick, A.; Pavel, M.; Jameson, K.; Taylor, B.
2007-12-01
The new GPIPS website portal provides community users with an intuitive and powerful interface to query the data products of the Galactic Plane Infrared Polarization Survey. The website, which was built using PHP for the front end and MySQL for the database back end, allows users to issue queries based on galactic or equatorial coordinates, GPIPS-specific identifiers, polarization information, magnitude information, and several other attributes. The returns are presented in HTML tables, with the added option of either downloading or being emailed an ASCII file including the same or more information from the database. Other functionalities of the website include providing details of the status of the Survey (which fields have been observed or are planned to be observed), techniques involved in data collection and analysis, and descriptions of the database contents and names. For this initial launch of the website, users may access the GPIPS polarization point source catalog and the deep coadd photometric point source catalog. Future planned developments include a graphics-based method for querying the database, as well as tools to combine neighboring GPIPS images into larger image files for both polarimetry and photometry. This work is partially supported by NSF grant AST-0607500.
Omicseq: a web-based search engine for exploring omics datasets.
Sun, Xiaobo; Pittard, William S; Xu, Tianlei; Chen, Li; Zwick, Michael E; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S
2017-07-03
The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Portal to the GALEX Data Archive
NASA Astrophysics Data System (ADS)
Smith, M. A.; Conti, A.; Shiao, B.; Volpicelli, C. A.
2004-05-01
In early February MAST began hosting the GALEX public "Early Release Observations" images (40,000 objects) and spectra (1000 objects). MAST will host a much larger "first release," the GALEX DR1, in October, 2004. In this poster we describe features of our on-line website at http://galex.stsci.edu for researchers interested in downloading and browsing GALEX UV image and spectral data. The site is based on MS .NET technology, and user queries are entered for classes of objects or sky regions on "MAST-like" query forms or as detailed queries written in SQL. In the latter case examples are provided to tailor a query to a user's specifications. The site provides novel features, such as tooltips that return keyword definitions, "active images" that return object classification and coordinate information in a 2.5 arcmin radius around the selected object, self-documentation of terms and tables, and of course a tutorial for new navigators. The GALEX database employs a Hierarchical Triangular Mesh system for rapid data discovery, neighbor searches, and cross correlations with other catalogs. Our "GMAX" tool allows coplotting of object positions for objects observed by GALEX and other US-NVO compliant mission websites such as Sloan, 2MASS, FIRST.... As a member of the new Skynode network, GALEX has reported its web services to the US-NVO registry. This permits users to generate queries from other sites to cross-correlate, compare, and plot GALEX data using US-NVO protocols. Future plans for limited on-line data analysis and footprint services are described.
Silva, Sara; Gouveia-Oliveira, Rodrigo; Maretzek, António; Carriço, João; Gudnason, Thorolfur; Kristinsson, Karl G; Ekdahl, Karl; Brito-Avô, António; Tomasz, Alexander; Sanches, Ilda Santos; Lencastre, Hermínia de; Almeida, Jonas
2003-01-01
Background EURIS (European Resistance Intervention Study) was launched as a multinational study in September of 2000 to identify the multitude of complex risk factors that contribute to the high carriage rate of drug resistant Streptococcus pneumoniae strains in children attending Day Care Centers in several European countries. Access to the very large number of data required the development of a web-based infrastructure – EURISWEB – that includes a relational online database, coupled with a query system for data retrieval, and allows integrative storage of demographic, clinical and molecular biology data generated in EURIS. Methods All components of the system were developed using open source programming tools: data storage management was supported by PostgreSQL, and the hypertext preprocessor to generate the web pages was implemented using PHP. The query system is based on a software agent running in the background specifically developed for EURIS. Results The website currently contains data related to 13,500 nasopharyngeal samples and over one million measures taken from 5,250 individual children, as well as over one thousand pre-made and user-made queries aggregated into several reports, approximately. It is presently in use by participating researchers from three countries (Iceland, Portugal and Sweden). Conclusion An operational model centered on a PHP engine builds the interface between the user and the database automatically, allowing an easy maintenance of the system. The query system is also sufficiently adaptable to allow the integration of several advanced data analysis procedures far more demanding than simple queries, eventually including artificial intelligence predictive models. PMID:12846930
Advances in Data Management in Remote Sensing and Climate Modeling
NASA Astrophysics Data System (ADS)
Brown, P. G.
2014-12-01
Recent commercial interest in "Big Data" information systems has yielded little more than a sense of deja vu among scientists whose work has always required getting their arms around extremely large databases, and writing programs to explore and analyze them. On the flip side, there are some commercial DBMS startups building "Big Data" platforms using techniques taken from earth science, astronomy, high energy physics and high performance computing. In this talk, we will introduce one such platform: Paradigm4's SciDB, the first DBMS designed from the ground up to combine the kinds of quality-of-service guarantees made by SQL DBMS platforms (high-level data model, query languages, extensibility, transactions) with the kinds of functionality familiar to scientific users (arrays as structural building blocks, integrated linear algebra, and client language interfaces that minimize the learning curve). We will review how SciDB is used to manage and analyze earth science data by several teams of scientific users.
Development of noSQL data storage for the ATLAS PanDA Monitoring System
NASA Astrophysics Data System (ADS)
Ito, H.; Potekhin, M.; Wenaus, T.
2012-12-01
For several years the PanDA Workload Management System has been the basis for distributed production and analysis for the ATLAS experiment at the LHC. Since the start of data taking PanDA usage has ramped up steadily, typically exceeding 500k completed jobs/day by June 2011. The associated monitoring data volume has been rising as well, to levels that present a new set of challenges in the areas of database scalability and monitoring system performance and efficiency. These challenges are being met with an R&D effort aimed at implementing a scalable and efficient monitoring data storage based on a noSQL solution (Cassandra). We present our motivations for using this technology, as well as data design and the techniques used for efficient indexing of the data. We also discuss the hardware requirements as they were determined by testing with actual data and realistic rate of queries. In conclusion, we present our experience with operating a Cassandra cluster over an extended period of time and with data load adequate for planned application.
EarthServer - an FP7 project to enable the web delivery and analysis of 3D/4D models
NASA Astrophysics Data System (ADS)
Laxton, John; Sen, Marcus; Passmore, James
2013-04-01
EarthServer aims at open access and ad-hoc analytics on big Earth Science data, based on the OGC geoservice standards Web Coverage Service (WCS) and Web Coverage Processing Service (WCPS). The WCS model defines "coverages" as a unifying paradigm for multi-dimensional raster data, point clouds, meshes, etc., thereby addressing a wide range of Earth Science data including 3D/4D models. WCPS allows declarative SQL-style queries on coverages. The project is developing a pilot implementing these standards, and will also investigate the use of GeoSciML to describe coverages. Integration of WCPS with XQuery will in turn allow coverages to be queried in combination with their metadata and GeoSciML description. The unified service will support navigation, extraction, aggregation, and ad-hoc analysis on coverage data from SQL. Clients will range from mobile devices to high-end immersive virtual reality, and will enable 3D model visualisation using web browser technology coupled with developing web standards. EarthServer is establishing open-source client and server technology intended to be scalable to Petabyte/Exabyte volumes, based on distributed processing, supercomputing, and cloud virtualization. Implementation will be based on the existing rasdaman server technology developed. Services using rasdaman technology are being installed serving the atmospheric, oceanographic, geological, cryospheric, planetary and general earth observation communities. The geology service (http://earthserver.bgs.ac.uk/) is being provided by BGS and at present includes satellite imagery, superficial thickness data, onshore DTMs and 3D models for the Glasgow area. It is intended to extend the data sets available to include 3D voxel models. Use of the WCPS standard allows queries to be constructed against single or multiple coverages. For example on a single coverage data for a particular area can be selected or data with a particular range of pixel values. 
Queries on multiple surfaces can be constructed to calculate, for example, the thickness between two surfaces in a 3D model or the depth from ground surface to the top of a particular geologic unit. In the first version of the service a simple interface showing some example queries has been implemented in order to show the potential of the technologies. The project aims to develop the services available in light of user feedback, both in terms of the data available, the functionality and the interface. User feedback on the services guides the software and standards development aspects of the project, leading to enhanced versions of the software which will be implemented in upgraded versions of the services during the lifetime of the project.
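The thickness calculation described above can be sketched as a WCPS request assembled in Python. This is only an illustration under stated assumptions: the coverage names `ground_surface` and `unit_top`, the helper `thickness_query`, and the `text/csv` output format are hypothetical, and the query follows the general `for ... return encode(...)` shape of WCPS rather than any query published by the BGS service.

```python
def thickness_query(upper: str, lower: str, fmt: str = "text/csv") -> str:
    """Build a WCPS query that subtracts one surface coverage from another."""
    # Coverage identifiers are restricted to letters, digits and underscores so
    # that unexpected input cannot change the structure of the generated query.
    for name in (upper, lower):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected coverage name: {name!r}")
    # Two coverage iterators are bound, and their cell-wise difference is encoded.
    return (
        f"for $u in ({upper}), $l in ({lower}) "
        f'return encode($u - $l, "{fmt}")'
    )


if __name__ == "__main__":
    # Hypothetical coverage names; a real service lists its own coverages.
    print(thickness_query("ground_surface", "unit_top"))
```

The resulting string would be sent to a WCPS endpoint as the query parameter; how the request is submitted depends on the server and is not shown here.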
Monitoring of IaaS and scientific applications on the Cloud using the Elasticsearch ecosystem
NASA Astrophysics Data System (ADS)
Bagnasco, S.; Berzano, D.; Guarise, A.; Lusso, S.; Masera, M.; Vallero, S.
2015-05-01
The private Cloud at the Torino INFN computing centre offers IaaS services to different scientific computing applications. The infrastructure is managed with the OpenNebula cloud controller. The main stakeholders of the facility are a grid Tier-2 site for the ALICE collaboration at LHC, an interactive analysis facility for the same experiment and a grid Tier-2 site for the BES-III collaboration, plus an increasing number of other small tenants. Besides keeping track of the usage, the automation of dynamic allocation of resources to tenants requires detailed monitoring and accounting of the resource usage. As a first investigation towards this, we set up a monitoring system to inspect the site activities both in terms of IaaS and applications running on the hosted virtual instances. For this purpose we used the Elasticsearch, Logstash and Kibana stack. In the current implementation, the heterogeneous accounting information is fed to different MySQL databases and sent to Elasticsearch via a custom Logstash plugin. For the IaaS metering, we developed sensors for the OpenNebula API. The IaaS level information gathered through the API is sent to the MySQL database through an ad-hoc developed RESTful web service, which is also used for other accounting purposes. Concerning the application level, we used the Root plugin TProofMonSenderSQL to collect accounting data from the interactive analysis facility. The BES-III virtual instances used to be monitored with Zabbix; as a proof of concept we also retrieve the information contained in the Zabbix database. Each of these three cases is indexed separately in Elasticsearch. We are now starting to consider dropping the intermediate level provided by the SQL database and evaluating a NoSQL option as a unique central database for all the monitoring information. We set up a set of Kibana dashboards with pre-defined queries in order to monitor the relevant information in each case.
In this way we have achieved a uniform monitoring interface for both the IaaS and the scientific applications, mostly leveraging off-the-shelf tools.
Ferreira Junior, José Raniery; Oliveira, Marcelo Costa; de Azevedo-Marques, Paulo Mazzoncini
2016-12-01
Lung cancer is the leading cause of cancer-related deaths in the world, and its main manifestation is pulmonary nodules. Detection and classification of pulmonary nodules are challenging tasks that must be done by qualified specialists, but image interpretation errors make those tasks difficult. In order to aid radiologists in those hard tasks, it is important to integrate the computer-based tools with the lesion detection, pathology diagnosis, and image interpretation processes. However, computer-aided diagnosis research faces the problem of not having enough shared medical reference data for the development, testing, and evaluation of computational methods for diagnosis. In order to minimize this problem, this paper presents a public nonrelational document-oriented cloud-based database of pulmonary nodules characterized by 3D texture attributes, identified by experienced radiologists and classified in nine different subjective characteristics by the same specialists. Our goal with the development of this database is to improve computer-aided lung cancer diagnosis and pulmonary nodule detection and classification research through the deployment of this database in a cloud Database as a Service framework. Pulmonary nodule data was provided by the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI), image descriptors were acquired by a volumetric texture analysis, and database schema was developed using a document-oriented Not only Structured Query Language (NoSQL) approach. The proposed database currently holds 379 exams, 838 nodules, and 8237 images, of which 4029 are CT scans and 4208 are manually segmented nodules, and it is hosted in a MongoDB instance on a cloud infrastructure.
FPGA-based prototype storage system with phase change memory
NASA Astrophysics Data System (ADS)
Li, Gezi; Chen, Xiaogang; Chen, Bomy; Li, Shunfen; Zhou, Mi; Han, Wenbing; Song, Zhitang
2016-10-01
With the ever-increasing amount of data being stored via social media, mobile telephony base stations, network devices, etc., database systems face severe bandwidth bottlenecks when moving vast amounts of data from storage to the processing nodes. At the same time, Storage Class Memory (SCM) technologies such as Phase Change Memory (PCM) with unique features like fast read access, high density, non-volatility, byte-addressability, positive response to increasing temperature, superior scalability, and zero standby leakage have changed the landscape of modern computing and storage systems. In such a scenario, we present a storage system called FLEET which can off-load partial or whole SQL queries from the CPU to the storage engine. FLEET uses an FPGA rather than conventional CPUs to implement the off-load engine due to its highly parallel nature. We have implemented an initial prototype of FLEET with PCM-based storage. The results demonstrate that significant performance and CPU utilization gains can be achieved by pushing selected query processing components inside PCM-based storage.
Rosenbaum, Benjamin P; Silkin, Nikolay; Miller, Randolph A
2014-01-01
Real-time alerting systems typically warn providers about abnormal laboratory results or medication interactions. For more complex tasks, institutions create site-wide 'data warehouses' to support quality audits and longitudinal research. Sophisticated systems like i2b2 or Stanford's STRIDE utilize data warehouses to identify cohorts for research and quality monitoring. However, substantial resources are required to install and maintain such systems. For more modest goals, an organization desiring merely to identify patients with 'isolation' orders, or to determine patients' eligibility for clinical trials, may adopt a simpler, limited approach based on processing the output of one clinical system, and not a data warehouse. We describe a limited, order-entry-based, real-time 'pick off' tool, utilizing public domain software (PHP, MySQL). Through a web interface the tool assists users in constructing complex order-related queries and auto-generates corresponding database queries that can be executed at recurring intervals. We describe successful application of the tool for research and quality monitoring.
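The idea of auto-generating a database query from criteria chosen in a web form can be sketched as follows. This is a minimal illustration, not the authors' PHP/MySQL tool: the `orders` table, its columns, and the helpers `build_order_query` and `run_demo` are hypothetical, and Python's built-in `sqlite3` stands in for MySQL.

```python
import sqlite3

# Hypothetical whitelist of columns a user may filter on.
ALLOWED_COLUMNS = {"order_type", "status", "ward"}


def build_order_query(criteria: dict) -> tuple:
    """Turn {column: value} criteria into a parameterized SELECT statement."""
    clauses, params = [], []
    for column, value in sorted(criteria.items()):
        # Column names come from the whitelist; values are bound as parameters.
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported column: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = f"SELECT patient_id FROM orders WHERE {where} ORDER BY patient_id"
    return sql, params


def run_demo() -> list:
    """Run a generated query against a small in-memory example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (patient_id INTEGER, order_type TEXT, status TEXT, ward TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (1, "isolation", "active", "A"),
            (2, "diet", "active", "A"),
            (3, "isolation", "cancelled", "B"),
        ],
    )
    sql, params = build_order_query({"order_type": "isolation", "status": "active"})
    return [row[0] for row in conn.execute(sql, params)]
```

A generated statement of this kind could then be stored and re-executed at recurring intervals, as the abstract describes; the scheduling itself is outside this sketch.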
DAS: A Data Management System for Instrument Tests and Operations
NASA Astrophysics Data System (ADS)
Frailis, M.; Sartor, S.; Zacchei, A.; Lodi, M.; Cirami, R.; Pasian, F.; Trifoglio, M.; Bulgarelli, A.; Gianotti, F.; Franceschi, E.; Nicastro, L.; Conforti, V.; Zoli, A.; Smart, R.; Morbidelli, R.; Dadina, M.
2014-05-01
The Data Access System (DAS) is a data management software system, providing a reusable solution for the storage of data acquired both from telescopes and auxiliary data sources during the instrument development phases and operations. It is part of the Customizable Instrument WorkStation system (CIWS-FW), a framework for the storage, processing and quick-look at the data acquired from scientific instruments. The DAS provides a data access layer mainly targeted to software applications: quick-look displays, pre-processing pipelines and scientific workflows. It is logically organized in three main components: an intuitive and compact Data Definition Language (DAS DDL) in XML format, aimed at user-defined data types; an Application Programming Interface (DAS API), automatically adding classes and methods supporting the DDL data types, and providing an object-oriented query language; a data management component, which maps the metadata of the DDL data types in a relational Data Base Management System (DBMS), and stores the data in a shared (network) file system. With the DAS DDL, developers define the data model for a particular project, specifying for each data type the metadata attributes, the data format and layout (if applicable), and named references to related or aggregated data types. Together with the DDL user-defined data types, the DAS API acts as the only interface to store, query and retrieve the metadata and data in the DAS system, providing both an abstract interface and a data model specific one in C, C++ and Python. The mapping of metadata in the back-end database is automatic and supports several relational DBMSs, including MySQL, Oracle and PostgreSQL.
F. O. Kern, Elizabeth; Beischel, Scott; Stalnaker, Randal; Aron, David C.; Kirsh, Susan R.; Watts, Sharon A.
2008-01-01
Background Little information is available describing how to implement a disease registry from an electronic patient record system. The aim of this report is to describe the technology, methods, and utility of a diabetes registry populated by the Veterans Health Information Systems Architecture (VistA), which underlies the computerized patient record system of the Veterans Health Administration (VHA) in Veteran Affairs Integrated Service Network 10 (VISN 10). Methods VISN 10 data from VistA were mapped to a relational SQL-based data system using KB_SQL software. Operational definitions for diabetes, active clinical management, and responsible providers were used to create views of patient-level data in the diabetes registry. Query Analyzer was used to access the data views directly. Semicustomizable reports were created by linking the diabetes registry to a Web page using Microsoft asp.net2. A retrospective observational study design was used to analyze trends in the process of care and outcomes. Results Since October 2001, 81,227 patients with diabetes have enrolled in VISN 10: approximately 42,000 are currently under active management by VISN 10 providers. By tracking primary care visits, we assigned 91% to a clinic group responsible for diabetes care. In the Cleveland Veterans Affairs Medical Center (VAMC), the frequency of mean annual hemoglobin A1c levels ≥9% has declined significantly over 5 years. Almost 4000 patients have been seen in diabetes intervention programs in the Cleveland VAMC over the past 4 years. Conclusions A diabetes registry can be populated from the database underlying the VHA electronic patient record database system and linked to Web-based and ad hoc queries useful for quality improvement. PMID:19885172
Space environment data storage and access: lessons learned and recommendations for the future
NASA Astrophysics Data System (ADS)
Evans, Hugh; Heynderickx, Daniel
2012-07-01
With the ever increasing volume of space environment data available at present and planned for the near future, the demands on data storage and access methods are increasing as well. In addition, continued access to historical, archived data remains crucial. On the basis of many years of experience, the authors identify the following issues as important for continued and efficient handling of datasets now and in the future: The huge data volumes currently or very soon available from a number of space missions will limit direct Internet download access to even relatively short epoch ranges of data. Therefore, data providers should establish or extend standardised data (post-) processing services so that only data query results need to be downloaded. Although a single standardised data format will in all likelihood remain a utopia, data providers should at least include extensive metadata with their data products, according to established standards and practices (e.g. ISTP, SPASE). Standardisation of (sets of) metadata greatly facilitates data mining and querying. The use of SQL database storage should be considered instead of, or in parallel with, classic storage of data files. The use of SQL does away with having to handle file parsing and processing, while at the same time standard access protocols can be used to (remotely) connect to such data repositories. Many data holdings are still lacking in extensive descriptions of data provenance (e.g. instrument description), content and format. Unfortunately, detailed data information is usually rejected by scientific and technical journals. Re-processing of historical archived datasets into modern formats, making them easily available and usable, is urgently required, as knowledge is being lost. A global data directory has still not been achieved; policy makers should enforce stricter rules for "broadcasting" dataset information.
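The recommendation that only query results should be downloaded, rather than whole data files, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `measurements` table, its `epoch` and `flux` columns, and the helper names are hypothetical, and an in-memory SQLite database stands in for a remote SQL repository.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table of time-tagged measurements."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (epoch TEXT PRIMARY KEY, flux REAL)")
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?)",
        [
            ("2012-01-01T00:00:00", 1.5),
            ("2012-01-02T00:00:00", 2.5),
            ("2012-01-03T00:00:00", 4.0),
        ],
    )
    return conn


def query_epoch_range(conn: sqlite3.Connection, start: str, end: str) -> list:
    """Return only the rows with start <= epoch < end.

    ISO 8601 timestamps of equal length sort correctly as text, so the range
    filter runs inside the database and no file parsing is needed client-side.
    """
    rows = conn.execute(
        "SELECT epoch, flux FROM measurements "
        "WHERE epoch >= ? AND epoch < ? ORDER BY epoch",
        (start, end),
    )
    return rows.fetchall()
```

With a server-based database the same statement would be sent over a standard access protocol, and only the selected rows would cross the network.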
2013-01-01
commercial NoSQL database system. The results show that IndexedHBase provides a data loading speed that is 6 times faster than Riak, and is...compare it with Riak, a widely adopted commercial NoSQL database system. The results show that IndexedHBase provides a data loading speed that is 6...events. This chapter describes our research towards building an efficient and scalable storage platform for Truthy. Many existing NoSQL databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Angers, Crystal Plume; Bottema, Ryan; Buckley, Les
Purpose: Treatment unit uptime statistics are typically used to monitor radiation equipment performance. The Ottawa Hospital Cancer Centre has introduced the use of Quality Control (QC) test success as a quality indicator for equipment performance and overall health of the equipment QC program. Methods: Implemented in 2012, QATrack+ is used to record and monitor over 1100 routine machine QC tests each month for 20 treatment and imaging units (http://qatrackplus.com/). Using an SQL (structured query language) script, automated queries of the QATrack+ database are used to generate program metrics such as the number of QC tests executed and the percentage of tests passing, at tolerance or at action. These metrics are compared against machine uptime statistics already reported within the program. Results: Program metrics for 2015 show good correlation between pass rate of QC tests and uptime for a given machine. For the nine conventional linacs, the QC test success rate was consistently greater than 97%. The corresponding uptimes for these units are better than 98%. Machines that consistently show higher failure or tolerance rates in the QC tests have lower uptimes. This points to either poor machine performance requiring corrective action or to problems with the QC program. Conclusions: QATrack+ significantly improves the organization of QC data but can also aid in overall equipment management. Complementing machine uptime statistics with QC test metrics provides a more complete picture of overall machine performance and can be used to identify areas of improvement in the machine service and QC programs.
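The kind of metric described in the Methods, the percentage of QC tests passing, at tolerance, or at action for a given unit, can be sketched with a grouped SQL query. This is a hedged illustration only: the `qc_results` table, its columns, and the status labels are hypothetical and do not reflect the actual QATrack+ schema, and SQLite stands in for the production database.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a small in-memory table of QC test outcomes (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE qc_results (unit TEXT, test_name TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO qc_results VALUES (?, ?, ?)",
        [
            ("linac1", "output", "pass"),
            ("linac1", "lasers", "pass"),
            ("linac1", "door interlock", "pass"),
            ("linac1", "flatness", "tolerance"),
            ("linac2", "output", "action"),
        ],
    )
    return conn


def qc_status_percentages(conn: sqlite3.Connection, unit: str) -> dict:
    """Return {status: percentage of that unit's tests with this status}."""
    total = conn.execute(
        "SELECT COUNT(*) FROM qc_results WHERE unit = ?", (unit,)
    ).fetchone()[0]
    if total == 0:
        return {}
    # 100.0 forces floating-point division inside SQLite.
    rows = conn.execute(
        "SELECT status, 100.0 * COUNT(*) / ? FROM qc_results "
        "WHERE unit = ? GROUP BY status",
        (total, unit),
    )
    return {status: pct for status, pct in rows}
```

Percentages computed this way per unit could then be set beside the uptime figures that the program already reports.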
Knowledge portal for Six Sigma DMAIC process
NASA Astrophysics Data System (ADS)
ThanhDat, N.; Claudiu, K. V.; Zobia, R.; Lobont, Lucian
2016-08-01
Knowledge plays a crucial role in the success of DMAIC (Define, Measure, Analysis, Improve, and Control) execution. It is therefore necessary to share and renew the knowledge. Yet one problem that arises is how to create a place where knowledge is collected and shared effectively. We believe that a Knowledge Portal (KP) is an important solution to this problem. In this article, existing work on the requirements and functionalities of a KP is first reviewed. Afterwards, a procedure with the necessary tools to develop and implement a KP for DMAIC (KPD) is proposed. In particular, the KPD is built on the basis of free and open-source content and learning management systems, and Ontology Engineering. In order to structure and store knowledge, tools such as Protégé, OWL, and OWL-RDF parsers are used. A Knowledge Reasoner module is developed using PHP, ARC2, MySQL and a SPARQL endpoint for the purpose of querying and inferring knowledge available from ontologies. In order to validate the procedure, a KPD is built with the proposed functionalities and tools. The authors find that the KPD helps an organization construct Web sites by itself, with simple implementation steps and low initial costs. It creates a space for knowledge exchange and effectively supports collecting DMAIC reports as well as sharing the knowledge created. The authors' evaluation shows that DMAIC knowledge is retrieved accurately, with a high success rate and a good query response time.
Extending Climate Analytics-As-A-Service to the Earth System Grid Federation
NASA Astrophysics Data System (ADS)
Tamkin, G.; Schnase, J. L.; Duffy, D.; McInerney, M.; Nadeau, D.; Li, J.; Strong, S.; Thompson, J. H.
2015-12-01
We are building three extensions to prior-funded work on climate analytics-as-a-service that will benefit the Earth System Grid Federation (ESGF) as it addresses the Big Data challenges of future climate research: (1) We are creating a cloud-based, high-performance Virtual Real-Time Analytics Testbed supporting a select set of climate variables from six major reanalysis data sets. This near real-time capability will enable advanced technologies like the Cloudera Impala-based Structured Query Language (SQL) query capabilities and Hadoop-based MapReduce analytics over native NetCDF files while providing a platform for community experimentation with emerging analytic technologies. (2) We are building a full-featured Reanalysis Ensemble Service comprising monthly means data from six reanalysis data sets. The service will provide a basic set of commonly used operations over the reanalysis collections. The operations will be made accessible through NASA's climate data analytics Web services and our client-side Climate Data Services (CDS) API. (3) We are establishing an Open Geospatial Consortium (OGC) WPS-compliant Web service interface to our climate data analytics service that will enable greater interoperability with next-generation ESGF capabilities. The CDS API will be extended to accommodate the new WPS Web service endpoints as well as ESGF's Web service endpoints. These activities address some of the most important technical challenges for server-side analytics and support the research community's requirements for improved interoperability and improved access to reanalysis data.
A journey to Semantic Web query federation in the life sciences.
Cheung, Kei-Hoi; Frost, H Robert; Marshall, M Scott; Prud'hommeaux, Eric; Samwald, Matthias; Zhao, Jun; Paschke, Adrian
2009-10-01
As interest in adopting the Semantic Web in the biomedical domain continues to grow, Semantic Web technology has been evolving and maturing. A variety of technological approaches including triplestore technologies, SPARQL endpoints, Linked Data, and Vocabulary of Interlinked Datasets have emerged in recent years. In addition to the data warehouse construction, these technological approaches can be used to support dynamic query federation. As a community effort, the BioRDF task force, within the Semantic Web for Health Care and Life Sciences Interest Group, is exploring how these emerging approaches can be utilized to execute distributed queries across different neuroscience data sources. We have created two health care and life science knowledge bases. We have explored a variety of Semantic Web approaches to describe, map, and dynamically query multiple datasets. We have demonstrated several federation approaches that integrate diverse types of information about neurons and receptors that play an important role in basic, clinical, and translational neuroscience research. Particularly, we have created a prototype receptor explorer which uses OWL mappings to provide an integrated list of receptors and executes individual queries against different SPARQL endpoints. We have also employed the AIDA Toolkit, which is directed at groups of knowledge workers who cooperatively search, annotate, interpret, and enrich large collections of heterogeneous documents from diverse locations. We have explored a tool called "FeDeRate", which enables a global SPARQL query to be decomposed into subqueries against the remote databases offering either SPARQL or SQL query interfaces. Finally, we have explored how to use the vocabulary of interlinked Datasets (voiD) to create metadata for describing datasets exposed as Linked Data URIs or SPARQL endpoints. 
We have demonstrated the use of a set of novel and state-of-the-art Semantic Web technologies in support of a neuroscience query federation scenario. We have identified both the strengths and weaknesses of these technologies. While Semantic Web offers a global data model including the use of Uniform Resource Identifiers (URI's), the proliferation of semantically-equivalent URI's hinders large scale data integration. Our work helps direct research and tool development, which will be of benefit to this community.
A journey to Semantic Web query federation in the life sciences
Cheung, Kei-Hoi; Frost, H Robert; Marshall, M Scott; Prud'hommeaux, Eric; Samwald, Matthias; Zhao, Jun; Paschke, Adrian
2009-01-01
Background As interest in adopting the Semantic Web in the biomedical domain continues to grow, Semantic Web technology has been evolving and maturing. A variety of technological approaches including triplestore technologies, SPARQL endpoints, Linked Data, and Vocabulary of Interlinked Datasets have emerged in recent years. In addition to the data warehouse construction, these technological approaches can be used to support dynamic query federation. As a community effort, the BioRDF task force, within the Semantic Web for Health Care and Life Sciences Interest Group, is exploring how these emerging approaches can be utilized to execute distributed queries across different neuroscience data sources. Methods and results We have created two health care and life science knowledge bases. We have explored a variety of Semantic Web approaches to describe, map, and dynamically query multiple datasets. We have demonstrated several federation approaches that integrate diverse types of information about neurons and receptors that play an important role in basic, clinical, and translational neuroscience research. Particularly, we have created a prototype receptor explorer which uses OWL mappings to provide an integrated list of receptors and executes individual queries against different SPARQL endpoints. We have also employed the AIDA Toolkit, which is directed at groups of knowledge workers who cooperatively search, annotate, interpret, and enrich large collections of heterogeneous documents from diverse locations. We have explored a tool called "FeDeRate", which enables a global SPARQL query to be decomposed into subqueries against the remote databases offering either SPARQL or SQL query interfaces. Finally, we have explored how to use the vocabulary of interlinked Datasets (voiD) to create metadata for describing datasets exposed as Linked Data URIs or SPARQL endpoints. 
Conclusion We have demonstrated the use of a set of novel and state-of-the-art Semantic Web technologies in support of a neuroscience query federation scenario. We have identified both the strengths and weaknesses of these technologies. While Semantic Web offers a global data model including the use of Uniform Resource Identifiers (URI's), the proliferation of semantically-equivalent URI's hinders large scale data integration. Our work helps direct research and tool development, which will be of benefit to this community. PMID:19796394
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bauer, Travis
2007-01-26
Toba is an extensible personal information retrieval system. It supports various plugins which the user uses to create and store bits of information. It comes configured to store meeting notes, task items, issues, and business development opportunities. Plugins could be written to support almost any kind of digital information. So with the right plugins, Toba could become a full-fledged contact manager, project management application, programmer's toolkit, or almost any other type of data storage/search/retrieval application imaginable. Toba comes with a built-in command line interface and, via a plugin, it has a full scripting language (Jython). The information stored can be searched by keyword or through SQL queries.
Search extension transforms Wiki into a relational system: a case for flavonoid metabolite database.
Arita, Masanori; Suwa, Kazuhiro
2008-09-17
In computer science, database systems are based on the relational model founded by Edgar Codd in 1970. On the other hand, in the area of biology the word 'database' often refers to loosely formatted, very large text files. Although such bio-databases may describe conflicts or ambiguities (e.g. a protein pair do and do not interact, or unknown parameters) in a positive sense, the flexibility of the data format sacrifices a systematic query mechanism equivalent to the widely used SQL. To overcome this disadvantage, we propose embeddable string-search commands on a Wiki-based system and designed a half-formatted database. As proof of principle, a database of flavonoid with 6902 molecular structures from over 1687 plant species was implemented on MediaWiki, the background system of Wikipedia. Registered users can describe any information in an arbitrary format. Structured part is subject to text-string searches to realize relational operations. The system was written in PHP language as the extension of MediaWiki. All modifications are open-source and publicly available. This scheme benefits from both the free-formatted Wiki style and the concise and structured relational-database style. MediaWiki supports multi-user environments for document management, and the cost for database maintenance is alleviated.
Search extension transforms Wiki into a relational system: A case for flavonoid metabolite database
Arita, Masanori; Suwa, Kazuhiro
2008-01-01
Background In computer science, database systems are based on the relational model founded by Edgar Codd in 1970. On the other hand, in the area of biology the word 'database' often refers to loosely formatted, very large text files. Although such bio-databases may describe conflicts or ambiguities (e.g. a protein pair do and do not interact, or unknown parameters) in a positive sense, the flexibility of the data format sacrifices a systematic query mechanism equivalent to the widely used SQL. Results To overcome this disadvantage, we propose embeddable string-search commands on a Wiki-based system and designed a half-formatted database. As proof of principle, a database of flavonoid with 6902 molecular structures from over 1687 plant species was implemented on MediaWiki, the background system of Wikipedia. Registered users can describe any information in an arbitrary format. Structured part is subject to text-string searches to realize relational operations. The system was written in PHP language as the extension of MediaWiki. All modifications are open-source and publicly available. Conclusion This scheme benefits from both the free-formatted Wiki style and the concise and structured relational-database style. MediaWiki supports multi-user environments for document management, and the cost for database maintenance is alleviated. PMID:18822113
DIBS: a repository of disordered binding sites mediating interactions with ordered proteins.
Schad, Eva; Fichó, Erzsébet; Pancsa, Rita; Simon, István; Dosztányi, Zsuzsanna; Mészáros, Bálint
2018-02-01
Intrinsically Disordered Proteins (IDPs) mediate crucial protein-protein interactions, most notably in signaling and regulation. As their importance is increasingly recognized, the detailed analyses of specific IDP interactions opened up new opportunities for therapeutic targeting. Yet, large scale information about IDP-mediated interactions in structural and functional details is lacking, hindering the understanding of the mechanisms underlying this distinct binding mode. Here, we present DIBS, the first comprehensive, curated collection of complexes between IDPs and ordered proteins. DIBS not only describes by far the highest number of cases, it also provides the dissociation constants of their interactions, as well as the description of potential post-translational modifications modulating the binding strength and linear motifs involved in the binding. Together with the wide range of structural and functional annotations, DIBS will provide the cornerstone for structural and functional studies of IDP complexes. DIBS is freely accessible at http://dibs.enzim.ttk.mta.hu/. The DIBS application is hosted by Apache web server and was implemented in PHP. To enrich querying features and to enhance backend performance a MySQL database was also created. Contact: dosztanyi@caesar.elte.hu or bmeszaros@caesar.elte.hu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
Visualization of Vgi Data Through the New NASA Web World Wind Virtual Globe
NASA Astrophysics Data System (ADS)
Brovelli, M. A.; Kilsedar, C. E.; Zamboni, G.
2016-06-01
GeoWeb 2.0, laying the foundations of Volunteered Geographic Information (VGI) systems, has led to platforms where users can contribute to the geographic knowledge that is open to access. Moreover, as a result of the advancements in 3D visualization, virtual globes able to visualize geographic data even in browsers have emerged. However, the integration of VGI systems and virtual globes has not been fully realized. The study presented aims to visualize volunteered data in 3D, considering also ease of use for the general public, using Free and Open Source Software (FOSS). The new Application Programming Interface (API) of NASA, Web World Wind, written in JavaScript and based on the Web Graphics Library (WebGL), is cross-platform and cross-browser, so the virtual globe created using this API is accessible through any WebGL-supporting browser on different operating systems and devices. As a result, it requires no installation or configuration on the client side, which makes the collected data more usable. This is not the case with World Wind for Java, which requires installation and configuration of the Java Virtual Machine (JVM). Furthermore, the data collected through various VGI platforms might be in different formats, stored in a traditional relational database or in a NoSQL database. The project developed aims to visualize and query data collected through the Open Data Kit (ODK) platform and a cross-platform application, where data is stored in a relational PostgreSQL database and a NoSQL CouchDB database, respectively.
Exploring No-SQL alternatives for ALMA monitoring system
NASA Astrophysics Data System (ADS)
Shen, Tzu-Chiang; Soto, Ruben; Merino, Patricio; Peña, Leonel; Bartsch, Marcelo; Aguirre, Alvaro; Ibsen, Jorge
2014-07-01
The Atacama Large Millimeter/submillimeter Array (ALMA) will be a unique research instrument composed of at least 66 reconfigurable high-precision antennas, located at the Chajnantor plain in the Chilean Andes at an elevation of 5000 m. This paper describes the experience gained after several years working with the monitoring system, which has a strong requirement of collecting and storing up to 150K variables with a highest sampling rate of 20.8 kHz. The original design was built on top of a cluster of relational database servers and network attached storage with a fiber channel interface. As the number of monitoring points increased with the number of antennas included in the array, the current monitoring system proved able to handle the increased data rate in the collection and storage area (only one month of data), but the data query interface showed serious performance degradation. A solution based on a NoSQL platform was explored as an alternative to the current long-term storage system. Among several alternatives, MongoDB has been selected. In the data flow, intermediate cache servers based on Redis were introduced to allow faster streaming of the most recently acquired data to web based charts and applications for online data analysis.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vatsavai, Raju; Burk, Thomas E; Lime, Steve
2012-01-01
The components making up an Open Source GIS are explained in this chapter. A map server (Sect. 30.1) can broadly be defined as a software platform for dynamically generating spatially referenced digital map products. The University of Minnesota MapServer (UMN Map Server) is one such system. Its basic features are visualization, overlay, and query. Section 30.2 names and explains many of the geospatial open source libraries, such as GDAL and OGR. The other libraries are FDO, JTS, GEOS, JCS, MetaCRS, and GPSBabel. The application examples include derived GIS software and data format conversions. Quantum GIS, its origin, and its applications are explained in detail in Sect. 30.3. The features include a rich GUI, attribute tables, vector symbols, labeling, editing functions, projections, georeferencing, GPS support, analysis, and Web Map Server functionality. Future developments will address mobile applications, 3-D, and multithreading. The origins of PostgreSQL are outlined and PostGIS is discussed in detail in Sect. 30.4. It extends PostgreSQL by implementing the Simple Feature standard. Section 30.5 details the most important open source licenses such as the GPL, the LGPL, the MIT License, and the BSD License, as well as the role of the Creative Commons.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Poliakov, Alexander; Couronne, Olivier
2002-11-04
Aligning large vertebrate genomes that are structurally complex poses a variety of problems not encountered on smaller scales. Such genomes are rich in repetitive elements and contain multiple segmental duplications, which increases the difficulty of identifying true orthologous DNA segments in alignments. The sizes of the sequences make many alignment algorithms designed for comparing single proteins extremely inefficient when processing large genomic intervals. We integrated both local and global alignment tools and developed a suite of programs for automatically aligning large vertebrate genomes and identifying conserved non-coding regions in the alignments. Our method uses the BLAT local alignment program to find anchors on the base genome to identify regions of possible homology for a query sequence. These regions are postprocessed to find the best candidates, which are then globally aligned using the AVID global alignment program. In the last step, conserved non-coding segments are identified using VISTA. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90% of known coding exons in the human genome. The GenomeVISTA software is a suite of Perl programs that is built on a MySQL database platform. The scheduler gets control data from the database, builds a queue of jobs, and dispatches them to a PC cluster for execution. The main program, running on each node of the cluster, processes individual sequences. A Perl library acts as an interface between the database and the above programs. The use of a separate library allows the programs to function independently of the database schema. The library also improves on the standard Perl MySQL database interface package by providing auto-reconnect functionality and improved error handling.
Preventing SQL Code Injection by Combining Static and Runtime Analysis
2008-05-01
attacker changes the developer’s intended structure of an SQL command by inserting new SQL keywords or operators. (Su and Wassermann provide a...FROM books WHERE author = ’ ’ GROUP BY rating We use symbol as a placeholder for the indeterminate part of the command (in this...dialects of SQL.) In our model, we mark transitions that correspond to externally defined strings with the symbol . To illustrate, Figure 2 shows the SQL
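The kind of structural change this excerpt describes can be illustrated with a minimal Python sketch using the standard-library sqlite3 module. The books table, its columns, and the sample rows are hypothetical and not taken from the report, and the sketch does not implement the report's combined static and runtime analysis. It only contrasts a command built by string concatenation with a parameterized one.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("A", "alice", 5), ("B", "bob", 3)],
)


def find_unsafe(author):
    # Vulnerable: the input is spliced into the command text, so it can
    # add new SQL keywords or operators and change the query structure.
    query = "SELECT title FROM books WHERE author = '" + author + "'"
    return [row[0] for row in conn.execute(query)]


def find_safe(author):
    # Parameterized: the input is bound as a value and cannot alter
    # the structure of the command.
    query = "SELECT title FROM books WHERE author = ?"
    return [row[0] for row in conn.execute(query, (author,))]


payload = "x' OR '1'='1"
unsafe_titles = find_unsafe(payload)  # the injected OR clause matches every row
safe_titles = find_safe(payload)      # no author has this literal name
```

With the concatenated query, the payload turns the WHERE clause into a condition that is true for every row. With the bound parameter, the same payload is compared as an ordinary string and matches nothing.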
NASA Astrophysics Data System (ADS)
Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George
2017-10-01
In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were linked (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter- and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.
Schulz, Erich; Barrett, James W.; Price, Colin
1998-01-01
As controlled clinical vocabularies assume an increasing role in modern clinical information systems, so the issue of their quality demands greater attention. In order to meet the resulting stringent criteria for completeness and correctness, a quality assurance system comprising a database of more than 500 rules is being developed and applied to the Read Thesaurus. The authors discuss the requirement to apply quality assurance processes to their dynamic editing database in order to ensure the quality of exported products. Sources of errors include human, hardware, and software factors as well as new rules and transactions. The overall quality strategy includes prevention, detection, and correction of errors. The quality assurance process encompasses simple data specification, internal consistency, inspection procedures and, eventually, field testing. The quality assurance system is driven by a small number of tables and UNIX scripts, with “business rules” declared explicitly as Structured Query Language (SQL) statements. Concurrent authorship, client-server technology, and an initial failure to implement robust transaction control have all provided valuable lessons. The feedback loop for error management needs to be short. PMID:9670131
[Establishment for regional pelvic trauma database in Hunan Province].
Cheng, Liang; Zhu, Yong; Long, Haitao; Yang, Junxiao; Sun, Buhua; Li, Kanghua
2017-04-28
To establish a database for pelvic trauma in Hunan Province, and to start the work of a multicenter pelvic trauma registry. Methods: To establish the database, literature relevant to pelvic trauma was screened, the experiences from the established trauma databases in China and abroad were learned, and the actual situations for pelvic trauma rescue in Hunan Province were considered. The database for pelvic trauma was established based on PostgreSQL and the advanced programming language Java 1.6. Results: The complex procedure for pelvic trauma rescue was described structurally. The contents of the database included general patient information, injury condition, prehospital rescue, conditions on admission, treatment in hospital, status on discharge, diagnosis, classification, complication, trauma scoring and therapeutic effect. The database can be accessed through the internet in a browser/server architecture. The functions of the database include patient information management, data export, history query, progress report, video-image management and personal information management. Conclusion: The database covering the whole life cycle of pelvic trauma is successfully established for the first time in China. It is scientific, functional, practical, and user-friendly.
Zerbino, Daniel R.; Johnson, Nathan; Juetteman, Thomas; Sheppard, Dan; Wilder, Steven P.; Lavidas, Ilias; Nuhn, Michael; Perry, Emily; Raffaillac-Desfosses, Quentin; Sobral, Daniel; Keefe, Damian; Gräf, Stefan; Ahmed, Ikhlak; Kinsella, Rhoda; Pritchard, Bethan; Brent, Simon; Amode, Ridwan; Parker, Anne; Trevanion, Steven; Birney, Ewan; Dunham, Ian; Flicek, Paul
2016-01-01
New experimental techniques in epigenomics allow researchers to assay a diversity of highly dynamic features such as histone marks, DNA modifications or chromatin structure. The study of their fluctuations should provide insights into gene expression regulation, cell differentiation and disease. The Ensembl project collects and maintains the Ensembl regulation data resources on epigenetic marks, transcription factor binding and DNA methylation for human and mouse, as well as microarray probe mappings and annotations for a variety of chordate genomes. From this data, we produce a functional annotation of the regulatory elements along the human and mouse genomes with plans to expand to other species as data becomes available. Starting from well-studied cell lines, we will progressively expand our library of measurements to a greater variety of samples. Ensembl’s regulation resources provide a central and easy-to-query repository for reference epigenomes. As with all Ensembl data, it is freely available at http://www.ensembl.org, from the Perl and REST APIs and from the public Ensembl MySQL database server at ensembldb.ensembl.org. Database URL: http://www.ensembl.org PMID:26888907
Applications of GIS and database technologies to manage a Karst Feature Database
Gao, Y.; Tipping, R.G.; Alexander, E.C.
2006-01-01
This paper describes the management of a Karst Feature Database (KFD) in Minnesota. Two sets of applications in both GIS and Database Management System (DBMS) have been developed for the KFD of Minnesota. These applications were used to manage and to enhance the usability of the KFD. Structured Query Language (SQL) was used to manipulate transactions of the database and to facilitate the functionality of the user interfaces. The Database Administrator (DBA) authorized users with different access permissions to enhance the security of the database. Database consistency and recovery are accomplished by creating data logs and maintaining backups on a regular basis. The working database provides guidelines and management tools for future studies of karst features in Minnesota. The methodology of designing this DBMS is applicable to develop GIS-based databases to analyze and manage geomorphic and hydrologic datasets at both regional and local scales. The short-term goal of this research is to develop a regional KFD for the Upper Mississippi Valley Karst and the long-term goal is to expand this database to manage and study karst features at national and global scales.
Read Code quality assurance: from simple syntax to semantic stability.
Schulz, E B; Barrett, J W; Price, C
1998-01-01
As controlled clinical vocabularies assume an increasing role in modern clinical information systems, so the issue of their quality demands greater attention. In order to meet the resulting stringent criteria for completeness and correctness, a quality assurance system comprising a database of more than 500 rules is being developed and applied to the Read Thesaurus. The authors discuss the requirement to apply quality assurance processes to their dynamic editing database in order to ensure the quality of exported products. Sources of errors include human, hardware, and software factors as well as new rules and transactions. The overall quality strategy includes prevention, detection, and correction of errors. The quality assurance process encompasses simple data specification, internal consistency, inspection procedures and, eventually, field testing. The quality assurance system is driven by a small number of tables and UNIX scripts, with "business rules" declared explicitly as Structured Query Language (SQL) statements. Concurrent authorship, client-server technology, and an initial failure to implement robust transaction control have all provided valuable lessons. The feedback loop for error management needs to be short.
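The idea of declaring "business rules" explicitly as SQL statements can be sketched as follows. This is a minimal illustration only: the concept and term tables and both rules are hypothetical, since the abstract does not give the actual Read Thesaurus schema or rule set, and Python's sqlite3 module stands in for the authors' database and UNIX scripts. Each rule is written as a SELECT that returns the rows violating it, so an empty result means the rule passes.

```python
import sqlite3

# Hypothetical rules: each SELECT returns the identifiers that violate the rule.
RULES = {
    # Every term must point at an existing concept.
    "term_has_concept": """
        SELECT t.term_id FROM term t
        LEFT JOIN concept c ON c.concept_id = t.concept_id
        WHERE c.concept_id IS NULL
    """,
    # Every concept must have exactly one preferred term.
    "one_preferred_term": """
        SELECT c.concept_id FROM concept c
        LEFT JOIN term t
               ON t.concept_id = c.concept_id AND t.preferred = 1
        GROUP BY c.concept_id
        HAVING COUNT(t.term_id) <> 1
    """,
}


def run_rules(conn):
    """Run every rule and return {rule name: list of violating ids}."""
    return {
        name: [row[0] for row in conn.execute(sql)]
        for name, sql in RULES.items()
    }


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept (concept_id TEXT PRIMARY KEY);
    CREATE TABLE term (term_id TEXT PRIMARY KEY,
                       concept_id TEXT,
                       preferred INTEGER);
    INSERT INTO concept VALUES ('C1'), ('C2');
    INSERT INTO term VALUES ('T1', 'C1', 1), ('T2', 'C1', 0), ('T3', 'C9', 1);
""")
violations = run_rules(conn)
```

In this invented data set, term T3 refers to a missing concept and concept C2 has no preferred term, so each rule reports one violation. Keeping rules as data in this way means a new check can be added without changing the driver code.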
GOSSS-DR1: The First Data Release of the Galactic O-star Spectroscopic Survey
NASA Astrophysics Data System (ADS)
Sota, Alfredo; Maíz Apellániz, Jesús; Barbá, Rodolfo H.; Walborn, Nolan R.; Alfaro, Emilio J.; Gamen, Roberto C.; Morrell, Nidia I.; Arias, Julia I.; Gallego Calvente, A. T.
2013-06-01
Coinciding with this meeting, we are publishing the first data release of GOSSS. This release contains [a] revised spectral classifications and [b] blue-violet R~2500 spectra in FITS format for ~400 Galactic O stars, including all brighter than B=8. DR1 (and future releases) will take place through GOSC, the Galactic O-Star Catalog (http://gosc.iaa.es), which will be updated for the occasion. Since 2011 GOSC runs on a MySQL database and allows for queries based on coordinates, spectral class, photometry, and other parameters. Future data releases will include the rest of the stars observed in GOSSS (currently 1521 with ~1000 more planned in the next two years).
Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes.
Hiscock, D; Upton, C
2000-05-01
The Viral Genome DataBase (VGDB) contains detailed information of the genes and predicted protein sequences from 15 completely sequenced genomes of large (>100 kb) viruses (2847 genes). The data that is stored includes DNA sequence, protein sequence, GenBank and user-entered notes, molecular weight (MW), isoelectric point (pI), amino acid content, A + T%, nucleotide frequency, dinucleotide frequency and codon use. The VGDB is a MySQL database with a user-friendly JAVA GUI. Results of queries can be easily sorted by any of the individual parameters. The software and additional figures and information are available at http://athena.bioc.uvic.ca/genomes/index.html .
Using CLIPS in a distributed system: The Network Control Center (NCC) expert system
NASA Technical Reports Server (NTRS)
Wannemacher, Tom
1990-01-01
This paper describes an intelligent troubleshooting system for the Help Desk domain. It was developed on an IBM-compatible 80286 PC using Microsoft C and CLIPS and an AT&T 3B2 minicomputer using the UNIFY database and a combination of shell scripts, C programs and SQL queries. The two computers are linked by a LAN. The functions of this system are to help non-technical NCC personnel handle trouble calls, to keep a log of problem calls with complete, concise information, and to keep a historical database of problems. The database helps identify hardware and software problem areas and provides a source of new rules for the troubleshooting knowledge base.
ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining.
Huan, Tianxiao; Sivachenko, Andrey Y; Harrison, Scott H; Chen, Jake Y
2008-08-12
New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed. We developed ProteoLens as a JAVA-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Language (DDL) and Data Manipulation Language (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools.
The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions etc; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: human disease association network, drug-target interaction network and protein-peptide mapping network. The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
2012-11-27
with powerful analysis tools and an informatics approach leveraging best-of-breed NoSQL databases, in order to store, search and retrieve relevant...dictionaries, and JavaScript also has good support. The MongoDB project[15] was chosen as a scalable NoSQL data store for the cheminformatics components
ESTree db: a Tool for Peach Functional Genomics
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Stella, Alessandra; Milanesi, Luciano; Pozzi, Carlo
2005-01-01
Background The ESTree db represents a collection of Prunus persica expressed sequenced tags (ESTs) and is intended as a resource for peach functional genomics. A total of 6,155 successful EST sequences were obtained from four in-house prepared cDNA libraries from Prunus persica mesocarps at different developmental stages. Another 12,475 peach EST sequences were downloaded from public databases and added to the ESTree db. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data were collected in a MySQL database. A php-based web interface was developed to query the database. Results The ESTree db version as of April 2005 encompasses 18,630 sequences representing eight libraries. Contig assembly was performed with CAP3. Putative single nucleotide polymorphism (SNP) detection was performed with the AutoSNP program and a search engine was implemented to retrieve results. All the sequences and all the contig consensus sequences were annotated both with blastx against the GenBank nr db and with GOblet against the viridiplantae section of the Gene Ontology db. Links to NiceZyme (Expasy) and to the KEGG metabolic pathways were provided. A local BLAST utility is available. A text search utility allows querying and browsing the database. Statistics were provided on Gene Ontology occurrences to assign sequences to Gene Ontology categories. Conclusion The resulting database is a comprehensive resource of data and links related to peach EST sequences. The Sequence Report and Contig Report pages work as the web interface core structures, giving quick access to data related to each sequence/contig. PMID:16351742
ESTree db: a tool for peach functional genomics.
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Stella, Alessandra; Milanesi, Luciano; Pozzi, Carlo
2005-12-01
The ESTree db http://www.itb.cnr.it/estree/ represents a collection of Prunus persica expressed sequenced tags (ESTs) and is intended as a resource for peach functional genomics. A total of 6,155 successful EST sequences were obtained from four in-house prepared cDNA libraries from Prunus persica mesocarps at different developmental stages. Another 12,475 peach EST sequences were downloaded from public databases and added to the ESTree db. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data were collected in a MySQL database. A php-based web interface was developed to query the database. The ESTree db version as of April 2005 encompasses 18,630 sequences representing eight libraries. Contig assembly was performed with CAP3. Putative single nucleotide polymorphism (SNP) detection was performed with the AutoSNP program and a search engine was implemented to retrieve results. All the sequences and all the contig consensus sequences were annotated both with blastx against the GenBank nr db and with GOblet against the viridiplantae section of the Gene Ontology db. Links to NiceZyme (Expasy) and to the KEGG metabolic pathways were provided. A local BLAST utility is available. A text search utility allows querying and browsing the database. Statistics were provided on Gene Ontology occurrences to assign sequences to Gene Ontology categories. The resulting database is a comprehensive resource of data and links related to peach EST sequences. The Sequence Report and Contig Report pages work as the web interface core structures, giving quick access to data related to each sequence/contig.
DOMe: A deduplication optimization method for the NewSQL database backups
Wang, Longxiang; Zhu, Zhengdong; Zhang, Xingjun; Wang, Yinfeng
2017-01-01
Reducing duplicated data of database backups is an important application scenario for data deduplication technology. NewSQL is an emerging database system and is now being used more and more widely. NewSQL systems need to improve data reliability by periodically backing up in-memory data, resulting in a lot of duplicated data. The traditional deduplication method is not optimized for the NewSQL server system and cannot take full advantage of hardware resources to optimize deduplication performance. Recent research has pointed out that the future NewSQL server will have thousands of CPU cores, large DRAM and huge NVRAM. Therefore, how to utilize these hardware resources to optimize the performance of data deduplication is an important issue. To solve this problem, we propose a deduplication optimization method (DOMe) for NewSQL system backup. To take advantage of the large number of CPU cores in the NewSQL server to optimize deduplication performance, DOMe parallelizes the deduplication method based on the fork-join framework. The fingerprint index, which is the key data structure in the deduplication process, is implemented as a pure in-memory hash table, which makes full use of the large DRAM in the NewSQL system, eliminating the fingerprint index performance bottleneck of the traditional deduplication method. H-Store is used as a typical NewSQL database system to implement the DOMe method. DOMe is experimentally analyzed with two representative backup data sets. The experimental results show that: 1) DOMe can reduce the duplicated NewSQL backup data; 2) DOMe significantly improves deduplication performance by parallelizing CDC algorithms: when the theoretical speedup ratio of the server is 20.8, the speedup ratio of DOMe can reach up to 18; 3) DOMe improved the deduplication throughput by 1.5 times through the pure in-memory index optimization method. PMID:29049307
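The two ideas named in this abstract, a pure in-memory fingerprint index and parallel fingerprinting, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the DOMe implementation: fixed-size chunking is swapped in for the paper's CDC algorithms, SHA-256 is an assumed fingerprint function, a thread pool stands in for the fork-join framework, and the chunk size and sample data are invented.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # assumed chunk size; fixed-size chunking, not CDC


def split_chunks(data, size=CHUNK_SIZE):
    """Cut the backup stream into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk):
    """Fingerprint one chunk (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(chunk).hexdigest()


def deduplicate(data, index, workers=4):
    """Store unseen chunks in the in-memory index (a plain dict).

    Returns the recipe (ordered fingerprints needed to rebuild the data)
    and the number of chunks that were new to the index.
    """
    chunks = split_chunks(data)
    # Fingerprints are computed in parallel; index updates stay sequential.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        prints = list(pool.map(fingerprint, chunks))
    new_chunks = 0
    for fp, chunk in zip(prints, chunks):
        if fp not in index:
            index[fp] = chunk
            new_chunks += 1
    return prints, new_chunks


def restore(recipe, index):
    """Rebuild a backup from its recipe and the shared index."""
    return b"".join(index[fp] for fp in recipe)


index = {}
backup1 = b"A" * 8192 + b"B" * 4096
backup2 = b"A" * 8192 + b"C" * 4096
recipe1, new1 = deduplicate(backup1, index)
recipe2, new2 = deduplicate(backup2, index)
```

Here the second backup shares its first two chunks with the first, so only one new chunk is stored for it. Keeping the index in a dictionary avoids any disk lookup per chunk, which is the bottleneck the abstract says the in-memory hash table removes.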
Recent improvements to Binding MOAD: a resource for protein–ligand binding affinities and structures
Ahmed, Aqeel; Smith, Richard D.; Clark, Jordan J.; Dunbar, James B.; Carlson, Heather A.
2015-01-01
For over 10 years, Binding MOAD (Mother of All Databases; http://www.BindingMOAD.org) has been one of the largest resources for high-quality protein–ligand complexes and associated binding affinity data. Binding MOAD has grown at the rate of 1994 complexes per year, on average. Currently, it contains 23 269 complexes and 8156 binding affinities. Our annual updates curate the data using a semi-automated literature search of the references cited within the PDB file, and we have recently upgraded our website and added new features and functionalities to better serve Binding MOAD users. In order to eliminate the legacy application server of the old platform and to accommodate new changes, the website has been completely rewritten in the LAMP (Linux, Apache, MySQL and PHP) environment. The improved user interface incorporates current third-party plugins for better visualization of protein and ligand molecules, and it provides features like sorting, filtering and filtered downloads. In addition to the field-based searching, Binding MOAD now can be searched by structural queries based on the ligand. In order to remove redundancy, Binding MOAD records are clustered in different families based on 90% sequence identity. The new Binding MOAD, with the upgraded platform, features and functionalities, is now equipped to better serve its users. PMID:25378330
WebDB Component Builder - Lessons Learned
DOE Office of Scientific and Technical Information (OSTI.GOV)
Macedo, C.
2000-02-15
Oracle WebDB is the easiest way to produce web-enabled lightweight and enterprise-centric applications. This concept from Oracle has tantalized our taste for simplistic web development by using a purely web-based tool that lives nowhere else but in the database. The online wizards, templates, and query builders, which produce PL/SQL behind the curtains, can be used straight "out of the box" by both novice and seasoned developers. This presentation will introduce lessons learned by developing and deploying applications built using the WebDB Component Builder in conjunction with custom PL/SQL code to empower a hybrid application. There are two kinds of WebDB components: those that display data to end users via reporting, and those that let end users update data in the database via entry forms. The presentation will also discuss various methods within the Component Builder to enhance the applications pushed to the desktop. The demonstrated example is an application entitled HOME (Helping Other's More Effectively) that was built to manage a yearly United Way Campaign effort. Our task was to build an end-to-end application which could manage approximately 900 non-profit agencies, an average of 4,100 individual contributions, and $1.2 million. Using WebDB, the shell of the application was put together in a matter of a few weeks. However, we did encounter some hurdles that WebDB, in its stage of infancy (v2.0), could not solve for us directly. Together with custom PL/SQL, WebDB's Component Builder became a powerful tool that enabled us to produce a very flexible hybrid application.
Van Neste, Christophe; Vandewoestyne, Mado; Van Criekinge, Wim; Deforce, Dieter; Van Nieuwerburgh, Filip
2014-03-01
Forensic scientists are currently investigating how to transition from capillary electrophoresis (CE) to massive parallel sequencing (MPS) for analysis of forensic DNA profiles. MPS offers several advantages over CE such as virtually unlimited multiplexy of loci, combining both short tandem repeat (STR) and single nucleotide polymorphism (SNP) loci, small amplicons without constraints of size separation, more discrimination power, deep mixture resolution and sample multiplexing. We present our bioinformatic framework My-Forensic-Loci-queries (MyFLq) for analysis of MPS forensic data. For allele calling, the framework uses a MySQL reference allele database with regions of interest (ROIs) determined automatically by a generic maximal flanking algorithm, which makes it possible to use any STR or SNP forensic locus. Python scripts were designed to automatically make allele calls starting from raw MPS data. We also present a method to assess the usefulness and overall performance of a forensic locus with respect to MPS, as well as methods to estimate whether an unknown allele, whose sequence is not present in the MySQL database, is in fact a new allele or a sequencing error. The MyFLq framework was applied to an Illumina MiSeq dataset of a forensic Illumina amplicon library, generated from multilocus STR polymerase chain reaction (PCR) on both single contributor samples and multiple person DNA mixtures. Although the multilocus PCR was not yet optimized for MPS in terms of amplicon length or locus selection, the results are excellent for most loci. The results show a high signal-to-noise ratio, correct allele calls, and a low limit of detection for minor DNA contributors in mixed DNA samples. Technically, forensic MPS affords great promise for routine implementation in forensic genomics. The method is also applicable to adjacent disciplines such as molecular autopsy in legal medicine and in mitochondrial DNA research. Copyright © 2013 The Authors.
Published by Elsevier Ireland Ltd. All rights reserved.
Techniques for Efficiently Managing Large Geosciences Data Sets
NASA Astrophysics Data System (ADS)
Kruger, A.; Krajewski, W. F.; Bradley, A. A.; Smith, J. A.; Baeck, M. L.; Steiner, M.; Lawrence, R. E.; Ramamurthy, M. K.; Weber, J.; Delgreco, S. A.; Domaszczynski, P.; Seo, B.; Gunyon, C. A.
2007-12-01
We have developed techniques and software tools for efficiently managing large geosciences data sets. While the techniques were developed as part of an NSF-Funded ITR project that focuses on making NEXRAD weather data and rainfall products available to hydrologists and other scientists, they are relevant to other geosciences disciplines that deal with large data sets. Metadata, relational databases, data compression, and networking are central to our methodology. Data and derived products are stored on file servers in a compressed format. URLs to, and metadata about the data and derived products are managed in a PostgreSQL database. Virtually all access to the data and products is through this database. Geosciences data normally require a number of processing steps to transform the raw data into useful products: data quality assurance, coordinate transformations and georeferencing, applying calibration information, and many more. We have developed the concept of crawlers that manage this scientific workflow. Crawlers are unattended processes that run indefinitely, and at set intervals query the database for their next assignment. A database table functions as a roster for the crawlers. Crawlers perform well-defined tasks that are, except for perhaps sequencing, largely independent from other crawlers. Once a crawler is done with its current assignment, it updates the database roster table, and gets its next assignment by querying the database. We have developed a library that enables one to quickly add crawlers. The library provides hooks to external (i.e., C-language) compiled codes, so that developers can work and contribute independently. Processes called ingesters inject data into the system. The bulk of the data are from a real-time feed using UCAR/Unidata's IDD/LDM software. An exciting recent development is the establishment of a Unidata HYDRO feed that feeds value-added metadata over the IDD/LDM. 
Ingesters grab the metadata and populate the PostgreSQL tables. These and other concepts we have developed have enabled us to efficiently manage a 70 Tb (and growing) weather radar data set.
Applying Query Structuring in Cross-language Retrieval.
ERIC Educational Resources Information Center
Pirkola, Ari; Puolamaki, Deniz; Jarvelin, Kalervo
2003-01-01
Explores ways to apply query structuring in cross-language information retrieval. Tested were: English queries translated into Finnish using an electronic dictionary, and run in a Finnish newspaper database; effects of compound-based structuring using a proximity operator for translation equivalents of query language compound components; and a…
Querying and Ranking XML Documents.
ERIC Educational Resources Information Center
Schlieder, Torsten; Meuss, Holger
2002-01-01
Discussion of XML, information retrieval, precision, and recall focuses on a retrieval technique that adopts the similarity measure of the vector space model, incorporates the document structure, and supports structured queries. Topics include a query model based on tree matching; structured queries and term-based ranking; and term frequency and…
Spatiotemporal conceptual platform for querying archaeological information systems
NASA Astrophysics Data System (ADS)
Partsinevelos, Panagiotis; Sartzetaki, Mary; Sarris, Apostolos
2015-04-01
The spatial and temporal distribution of archaeological sites has been shown to be associated with several attributes, including marine, water, mineral and food resources, climate conditions, geomorphological features, etc. In this study, archaeological settlement attributes are evaluated under various associations in order to provide a specialized query platform in a geographic information system (GIS). Towards this end, a spatial database is designed to include a series of archaeological findings for a secluded geographic area of Crete in Greece. The key categories of the geodatabase include the archaeological type (palace, burial site, village, etc.), temporal information of the habitation/usage period (pre-Minoan, Minoan, Byzantine, etc.), and the extracted geographical attributes of the sites (distance to sea, altitude, resources, etc.). Most of the related spatial attributes are extracted with readily available GIS tools. Additionally, a series of conceptual data attributes are estimated, including: temporal relation of an era to a future one in terms of alteration of the archaeological type, topologic relations of various types and attributes, and spatial proximity relations between various types. These complex spatiotemporal relational measures reveal new attributes towards better understanding of site selection for prehistoric and/or historic cultures, yet their potential combinations can become numerous. Therefore, after the quantification of the above-mentioned attributes, they are classified according to their importance for archaeological site location modeling. Under this new classification scheme, the user may select a geographic area of interest and extract only the important attributes for a specific archaeological type. These extracted attributes may then be queried against the entire spatial database and provide a location map of possible new archaeological sites. 
This novel type of querying is robust since the user does not have to type a standard SQL query but instead graphically selects an area of interest. In addition, according to the application at hand, novel spatiotemporal attributes and relations can be supported, towards the understanding of historical settlement patterns.
Open Clients for Distributed Databases
NASA Astrophysics Data System (ADS)
Chayes, D. N.; Arko, R. A.
2001-12-01
We are actively developing a collection of open source example clients that demonstrate use of our "back end" data management infrastructure. The data management system is reported elsewhere at this meeting (Arko and Chayes: A Scaleable Database Infrastructure). In addition to their primary goal of being examples for others to build upon, some of these clients may have limited utility in themselves. More information about the clients and the data infrastructure is available online at http://data.ldeo.columbia.edu. The examples to be demonstrated include several web-based clients: those developed for the Community Review System of the Digital Library for Earth System Education, a real-time watch standers log book, an offline interface to use log book entries, and a simple client to search on multibeam metadata. Others are Internet-enabled and generally web-based front ends that support searches against one or more relational databases using industry standard SQL queries. In addition to the web-based clients, simple SQL searches from within Excel and similar applications will be demonstrated. By defining, documenting and publishing a clear interface to the fully searchable databases, it becomes relatively easy to construct client interfaces that are optimized for specific applications in comparison to building a monolithic data and user interface system.
SQL is Dead; Long-live SQL: Relational Database Technology in Science Contexts
NASA Astrophysics Data System (ADS)
Howe, B.; Halperin, D.
2014-12-01
Relational databases are often perceived as a poor fit in science contexts: rigid schemas, poor support for complex analytics, unpredictable performance, significant maintenance and tuning requirements. These idiosyncrasies often make databases unattractive in science contexts characterized by heterogeneous data sources, complex analysis tasks, rapidly changing requirements, and limited IT budgets. In this talk, I'll argue that although the value proposition of typical relational database systems is weak in science, the core ideas that power relational databases have become incredibly prolific in open source science software, and are emerging as a universal abstraction for both big data and small data. In addition, I'll talk about two open source systems we are building to "jailbreak" the core technology of relational databases and adapt them for use in science. The first is SQLShare, a Database-as-a-Service system supporting collaborative data analysis and exchange by reducing database use to an Upload-Query-Share workflow with no installation, schema design, or configuration required. The second is Myria, a service that supports much larger scale data, complex analytics, and multiple back end systems. Finally, I'll describe some of the ways our collaborators in oceanography, astronomy, biology, fisheries science, and more are using these systems to replace script-based workflows for reasons of performance, flexibility, and convenience.
A high-performance spatial database based approach for pathology imaging algorithm evaluation
Wang, Fusheng; Kong, Jun; Gao, Jingjing; Cooper, Lee A.D.; Kurc, Tahsin; Zhou, Zhengwen; Adler, David; Vergara-Niedermayr, Cristobal; Katigbak, Bryan; Brat, Daniel J.; Saltz, Joel H.
2013-01-01
Background: Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison with human annotations, combine results from multiple algorithms for performance improvement, and facilitate algorithm sensitivity studies. The sizes of images and image analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present an efficient parallel spatial database approach to model, normalize, manage, and query large volumes of analytical image result data. This provides an efficient platform for algorithm evaluation. Our experiments with a set of brain tumor images demonstrate the application, scalability, and effectiveness of the platform. Context: The paper describes an approach and platform for evaluation of pathology image analysis algorithms. The platform facilitates algorithm evaluation through a high-performance database built on the Pathology Analytic Imaging Standards (PAIS) data model. Aims: (1) Develop a framework to support algorithm evaluation by modeling and managing analytical results and human annotations from pathology images; (2) Create a robust data normalization tool for converting, validating, and fixing spatial data from algorithm or human annotations; (3) Develop a set of queries to support data sampling and result comparisons; (4) Achieve high performance computation capacity via a parallel data management infrastructure, parallel data loading and spatial indexing optimizations in this infrastructure. Materials and Methods: We have considered two scenarios for algorithm evaluation: (1) algorithm comparison where multiple result sets from different methods are compared and consolidated; and (2) algorithm validation where algorithm results are compared with human annotations. We have developed a spatial normalization toolkit to validate and normalize spatial boundaries produced by image analysis algorithms or human annotations. 
The validated data were formatted based on the PAIS data model and loaded into a spatial database. To support efficient data loading, we have implemented a parallel data loading tool that takes advantage of multi-core CPUs to accelerate data injection. The spatial database manages both geometric shapes and image features or classifications, and enables spatial sampling, result comparison, and result aggregation through expressive structured query language (SQL) queries with spatial extensions. To provide scalable and efficient query support, we have employed a shared-nothing parallel database architecture, which distributes data homogeneously across multiple database partitions to take advantage of parallel computation power and implements spatial indexing to achieve high I/O throughput. Results: Our work proposes a high performance, parallel spatial database platform for algorithm validation and comparison. This platform was evaluated by storing, managing, and comparing analysis results from a set of brain tumor whole slide images. The tools we develop are open source and available to download. Conclusions: Pathology image algorithm validation and comparison are essential to iterative algorithm development and refinement. One critical component is the support for queries involving spatial predicates and comparisons. In our work, we develop an efficient data model and parallel database approach to model, normalize, manage and query large volumes of analytical image result data. Our experiments demonstrate that the data partitioning strategy and the grid-based indexing result in good data distribution across database nodes and reduce I/O overhead in spatial join queries through parallel retrieval of relevant data and quick subsetting of datasets. The set of tools in the framework provides a full pipeline to normalize, load, manage and query analytical results for algorithm evaluation. PMID:23599905
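The idea behind grid-based indexing for spatial joins can be illustrated with a small in-memory sketch. It is not the authors' implementation, which runs as SQL with spatial extensions on a parallel database; here shapes are simplified to axis-aligned bounding boxes (xmin, ymin, xmax, ymax), and the function names and cell size are assumptions chosen for the example.

```python
from collections import defaultdict


def cells(box, size):
    """Yield every grid cell (gx, gy) that a bounding box overlaps."""
    xmin, ymin, xmax, ymax = box
    for gx in range(int(xmin // size), int(xmax // size) + 1):
        for gy in range(int(ymin // size), int(ymax // size) + 1):
            yield (gx, gy)


def intersects(a, b):
    """True if two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def grid_join(left, right, size):
    """Return sorted index pairs (i, j) where left[i] intersects right[j]."""
    # Build the grid index over one input: cell -> indices of boxes touching it.
    index = defaultdict(list)
    for j, box in enumerate(right):
        for cell in cells(box, size):
            index[cell].append(j)
    # Probe with the other input; only boxes sharing a cell are compared.
    pairs = set()  # a set removes duplicates from boxes that share several cells
    for i, box in enumerate(left):
        for cell in cells(box, size):
            for j in index.get(cell, ()):
                if intersects(box, right[j]):
                    pairs.add((i, j))
    return sorted(pairs)


# Toy data: boundaries from an algorithm versus human annotations.
algorithm_a = [(0, 0, 10, 10), (100, 100, 110, 110)]
annotations = [(5, 5, 15, 15), (200, 200, 210, 210), (105, 95, 120, 105)]
pairs = grid_join(algorithm_a, annotations, size=50)
```

Because each grid cell can be processed on its own, the same partitioning also suggests how the work could be spread across database nodes, which is the role the abstract assigns to its data partitioning strategy.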
The HITRAN2016 molecular spectroscopic database
NASA Astrophysics Data System (ADS)
Gordon, I. E.; Rothman, L. S.; Hill, C.; Kochanov, R. V.; Tan, Y.; Bernath, P. F.; Birk, M.; Boudon, V.; Campargue, A.; Chance, K. V.; Drouin, B. J.; Flaud, J.-M.; Gamache, R. R.; Hodges, J. T.; Jacquemart, D.; Perevalov, V. I.; Perrin, A.; Shine, K. P.; Smith, M.-A. H.; Tennyson, J.; Toon, G. C.; Tran, H.; Tyuterev, V. G.; Barbe, A.; Császár, A. G.; Devi, V. M.; Furtenbacher, T.; Harrison, J. J.; Hartmann, J.-M.; Jolly, A.; Johnson, T. J.; Karman, T.; Kleiner, I.; Kyuberis, A. A.; Loos, J.; Lyulin, O. M.; Massie, S. T.; Mikhailenko, S. N.; Moazzen-Ahmadi, N.; Müller, H. S. P.; Naumenko, O. V.; Nikitin, A. V.; Polyansky, O. L.; Rey, M.; Rotger, M.; Sharpe, S. W.; Sung, K.; Starikova, E.; Tashkun, S. A.; Auwera, J. Vander; Wagner, G.; Wilzewski, J.; Wcisło, P.; Yu, S.; Zak, E. J.
2017-12-01
This paper describes the contents of the 2016 edition of the HITRAN molecular spectroscopic compilation. The new edition replaces the previous HITRAN edition of 2012 and its updates during the intervening years. The HITRAN molecular absorption compilation is composed of five major components: the traditional line-by-line spectroscopic parameters required for high-resolution radiative-transfer codes, infrared absorption cross-sections for molecules not yet amenable to representation in a line-by-line form, collision-induced absorption data, aerosol indices of refraction, and general tables such as partition sums that apply globally to the data. The new HITRAN is greatly extended in terms of accuracy, spectral coverage, additional absorption phenomena, added line-shape formalisms, and validity. Moreover, molecules, isotopologues, and perturbing gases have been added that address the issues of atmospheres beyond the Earth. Of considerable note, experimental IR cross-sections for almost 300 additional molecules important in different areas of atmospheric science have been added to the database. The compilation can be accessed through www.hitran.org. Most of the HITRAN data have now been cast into an underlying relational database structure that offers many advantages over the long-standing sequential text-based structure. The new structure empowers the user in many ways. It enables the incorporation of an extended set of fundamental parameters per transition, sophisticated line-shape formalisms, easy user-defined output formats, and very convenient searching, filtering, and plotting of data. A powerful application programming interface making use of structured query language (SQL) features for higher-level applications of HITRAN is also provided.
The BDNYC database of low-mass stars, brown dwarfs, and planetary mass companions
NASA Astrophysics Data System (ADS)
Cruz, Kelle; Rodriguez, David; Filippazzo, Joseph; Gonzales, Eileen; Faherty, Jacqueline K.; Rice, Emily; BDNYC
2018-01-01
We present a web-interface to a database of low-mass stars, brown dwarfs, and planetary mass companions. Users can send SELECT SQL queries to the database, perform searches by coordinates or name, check the database inventory on specified objects, and even plot spectra interactively. The initial version of this database contains information for 198 objects and version 2 will contain over 1000 objects. The database currently includes photometric data from 2MASS, WISE, and Spitzer and version 2 will include a significant portion of the publicly available optical and NIR spectra for brown dwarfs. The database is maintained and curated by the BDNYC research group and we welcome contributions from other researchers via GitHub.
The Data Acquisition System of the Stockholm Educational Air Shower Array
NASA Astrophysics Data System (ADS)
Hofverberg, P.; Johansson, H.; Pearce, M.; Rydstrom, S.; Wikstrom, C.
2005-12-01
The Stockholm Educational Air Shower Array (SEASA) project is deploying an array of plastic scintillator detector stations on school roofs in the Stockholm area. Signals from GPS satellites are used to time synchronise signals from the widely separated detector stations, allowing cosmic ray air showers to be identified and studied. A low-cost and highly scalable data acquisition system has been produced using embedded Linux processors which communicate station data to a central server running a MySQL database. Air shower data can be visualised in real-time using a Java-applet client. It is also possible to query the database and manage detector stations from the client. In this paper, the design and performance of the system are described.
Analysis of web-related threats in ten years of logs from a scientific portal
NASA Astrophysics Data System (ADS)
Santos, Rafael D. C.; Grégio, André R. A.; Raddick, Jordan; Vattki, Vamsi; Szalay, Alex
2012-06-01
SkyServer is an Internet portal to data from the Sloan Digital Sky Survey, the largest online archive of astronomy data in the world. It provides free access to hundreds of millions of celestial objects for science, education and outreach purposes. Logs of accesses to SkyServer comprise around 930 million hits, 140 million web services accesses and 170 million submitted SQL queries, collected over the past 10 years. These logs also contain indications of compromise attempts on the servers. In this paper, we show some threats that were detected in ten years of stored logs, and compare them with known threats in those years. Also, we present an analysis of the evolution of those threats over these years.
Interactive DataBase of Cosmic Ray Anisotropy (DB A10)
NASA Astrophysics Data System (ADS)
Asipenka, A.S.; Belov, A.V.; Eroshenko, E.F.; Klepach, E.G.; Oleneva, V.A.; Yake, V.G.
Data on the hourly means of cosmic ray density and anisotropy derived by the GSM method over 1957-2006 have been loaded into a MySQL database. This format allows access to the data both locally and over the Internet. Using a combination of the PHP scripting language and the MySQL database, an Internet project was created that gives users access to the data on CR anisotropy in different formats (http://cr20.izmiran.ru/AnisotropyCR/main.htm/). Using PHP together with MySQL provides fast retrieval of data even over the Internet, since a request and the subsequent data processing are performed on the project server. Storing the data on cosmic ray variations in MySQL makes it possible to construct requests of different structures, extends the variety of data presentation, and makes it possible to conform the data to other systems and use them in other projects.
NASA Technical Reports Server (NTRS)
Aspinall, David; Denney, Ewen; Lueth, Christoph
2012-01-01
We motivate and introduce a query language PrQL designed for inspecting machine representations of proofs. PrQL natively supports hiproofs which express proof structure using hierarchical nested labelled trees. The core language presented in this paper is locally structured (first-order), with queries built using recursion and patterns over proof structure and rule names. We define the syntax and semantics of locally structured queries, demonstrate their power, and sketch some implementation experiments.
Querying temporal clinical databases on granular trends.
Combi, Carlo; Pozzi, Giuseppe; Rossato, Rosalba
2012-04-01
This paper focuses on the identification of temporal trends involving different granularities in clinical databases, where data are temporal in nature: for example, while follow-up visit data are usually stored at the granularity of working days, queries on these data could require to consider trends either at the granularity of months ("find patients who had an increase of systolic blood pressure within a single month") or at the granularity of weeks ("find patients who had steady states of diastolic blood pressure for more than 3 weeks"). Representing and reasoning properly on temporal clinical data at different granularities are important both to guarantee the efficacy and the quality of care processes and to detect emergency situations. Temporal sequences of data acquired during a care process provide a significant source of information not only to search for a particular value or an event at a specific time, but also to detect some clinically-relevant patterns for temporal data. We propose a general framework for the description and management of temporal trends by considering specific temporal features with respect to the chosen time granularity. Temporal aspects of data are considered within temporal relational databases, first formally by using a temporal extension of the relational calculus, and then by showing how to map these relational expressions to plain SQL queries. Throughout the paper we consider the clinical domain of hemodialysis, where several parameters are periodically sampled during every session. Copyright © 2011 Elsevier Inc. All rights reserved.
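The first example query in this abstract ("find patients who had an increase of systolic blood pressure within a single month") can be approximated in plain SQL. The sketch below is a simplified reading, not the paper's formal mapping: it treats an "increase" as any later visit in the same calendar month with a higher reading, uses sqlite3 for convenience, and assumes a hypothetical visits table with made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per follow-up visit, stored at day granularity.
conn.execute("CREATE TABLE visits (patient_id INTEGER, visit_date TEXT, systolic INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2011-03-02", 120), (1, "2011-03-20", 135),  # rise inside March
        (2, "2011-03-28", 120), (2, "2011-04-03", 140),  # rise spans two months
        (3, "2011-05-01", 140), (3, "2011-05-15", 130),  # fall inside May
    ],
)

# Coarsen the day-level dates to months with strftime, then self-join to find
# an earlier and a later visit of the same patient within the same month.
query = """
SELECT DISTINCT a.patient_id, strftime('%Y-%m', a.visit_date) AS month
FROM visits AS a
JOIN visits AS b
  ON a.patient_id = b.patient_id
 AND strftime('%Y-%m', a.visit_date) = strftime('%Y-%m', b.visit_date)
 AND a.visit_date < b.visit_date
 AND a.systolic < b.systolic
ORDER BY a.patient_id, month
"""
result = conn.execute(query).fetchall()
```

Only patient 1 qualifies in this toy data: patient 2's rise crosses a month boundary, which shows why the choice of granularity changes the answer.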
MP3C - the Minor Planet Physical Properties Catalogue: a New VO Service For Multi-database Query
NASA Astrophysics Data System (ADS)
Tanga, Paolo; Delbo, M.; Gerakis, J.
2013-10-01
In the last few years we witnessed a large growth in the number of asteroids for which we have physical properties. However, these data are dispersed in a multiplicity of catalogs. Extracting data and combining them for further analysis requires custom tools, a situation further complicated by the variety of data sources, some of them standardized (Planetary Data System) others not. With these problems in mind, we created a new Virtual Observatory service named “Minor Planet Physical Properties Catalogue” (abbreviated as MP3C - http://mp3c.oca.eu/). MP3C is not a new database, but rather a portal allowing the user to access selected properties of objects by easy SQL query, even from different sources. At present, such diverse data as orbital parameters, photometric and light curve parameters, sizes and albedos derived by IRAS, AKARI and WISE, SDSS colors, SMASS taxonomy, family membership, satellite data, and stellar occultation results are included. Other data sources will be added in the near future. The physical properties output of the MP3C can be tuned by the users by query criteria based upon ranges of values of the ingested quantities. The resulting list of objects can be used for interactive plots through standard VO tools such as TOPCAT. Also, their ephemerides and visibilities from given sites can be computed. We are targeting full VO compliance for providing a new standardized service to the community.
Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data
Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge
2016-01-01
This paper describes the IES Cities platform conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely, government, citizens, sensor infrastructure and other information data sources. This work pursues the challenge of achieving effective citizen collaboration by empowering them to prosume urban data across time. Particularly, this paper focuses on the query mapper; a key component of the IES Cities platform devised to democratize the development of open data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets with the execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for their applications, publishable as new datasets accessible by other consumers. As multiple users could be contributing and using a dataset, our solution also provides a data level permission mechanism to control how the platform manages the access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers’ perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper. PMID:27376300
Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data.
Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge
2016-07-01
This paper describes the IES Cities platform conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely, government, citizens, sensor infrastructure and other information data sources. This work pursues the challenge of achieving effective citizen collaboration by empowering them to prosume urban data across time. Particularly, this paper focuses on the query mapper; a key component of the IES Cities platform devised to democratize the development of open data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets with the execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for their applications, publishable as new datasets accessible by other consumers. As multiple users could be contributing and using a dataset, our solution also provides a data level permission mechanism to control how the platform manages the access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers' perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper.
New capabilities in the HENP grand challenge storage access system and its application at RHIC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernardo, L.; Gibbard, B.; Malon, D.
2000-04-25
The High Energy and Nuclear Physics Data Access Grand Challenge project has developed an optimizing storage access software system that was prototyped at RHIC. It is currently undergoing integration with the STAR experiment in preparation for data taking that starts in mid-2000. The behavior and lessons learned in the RHIC Mock Data Challenge exercises are described as well as the observed performance under conditions designed to characterize scalability. Up to 250 simultaneous queries were tested and up to 10 million events across 7 event components were involved in these queries. The system coordinates the staging of "bundles" of files from the HPSS tape system, so that all the needed components of each event are in disk cache when accessed by the application software. The caching policy algorithm for the coordinated bundle staging is described in the paper. The initial prototype implementation interfaced to the Objectivity/DB. In this latest version, it evolved to work with arbitrary files and use CORBA interfaces to the tag database and file catalog services. The interface to the tag database and the MySQL-based file catalog services used by STAR are described along with the planned usage scenarios.
rEHR: An R package for manipulating and analysing Electronic Health Record data.
Springate, David A; Parisi, Rosa; Olier, Ivan; Reeves, David; Kontopantelis, Evangelos
2017-01-01
Research with structured Electronic Health Records (EHRs) is expanding as data becomes more accessible, analytic methods advance, and the scientific validity of such studies is increasingly accepted. However, data science methodology to enable the rapid searching/extraction, cleaning and analysis of these large, often complex, datasets is less well developed. In addition, commonly used software is inadequate, resulting in bottlenecks in research workflows and in obstacles to increased transparency and reproducibility of the research. Preparing a research-ready dataset from EHRs is a complex and time consuming task requiring substantial data science skills, even for simple designs. In addition, certain aspects of the workflow are computationally intensive, for example extraction of longitudinal data and matching controls to a large cohort, which may take days or even weeks to run using standard software. The rEHR package simplifies and accelerates the process of extracting ready-for-analysis datasets from EHR databases. It has a simple import function to a database backend that greatly accelerates data access times. A set of generic query functions allow users to extract data efficiently without needing detailed knowledge of SQL queries. Longitudinal data extractions can also be made in a single command, making use of parallel processing. The package also contains functions for cutting data by time-varying covariates, matching controls to cases, unit conversion and construction of clinical code lists. There are also functions to synthesise dummy EHR. The package has been tested with one of the largest primary care EHRs, the Clinical Practice Research Datalink (CPRD), but allows for a common interface to other EHRs. This simplified and accelerated workflow for EHR data extraction results in simpler, cleaner scripts that are more easily debugged, shared and reproduced.
Working with HITRAN Database Using Hapi: HITRAN Application Programming Interface
NASA Astrophysics Data System (ADS)
Kochanov, Roman V.; Hill, Christian; Wcislo, Piotr; Gordon, Iouli E.; Rothman, Laurence S.; Wilzewski, Jonas
2015-06-01
A HITRAN Application Programming Interface (HAPI) has been developed to allow users on their local machines much more flexibility and power. HAPI is a programming interface for the main data-searching capabilities of the new "HITRANonline" web service (http://www.hitran.org). It provides the possibility to query spectroscopic data from the HITRAN database in a flexible manner using either functions or query language. Some of the prominent current features of HAPI are: a) downloading line-by-line data from the HITRANonline site to a local machine; b) filtering and processing the data in SQL-like fashion; c) conventional Python structures (lists, tuples, and dictionaries) for representing spectroscopic data; d) the possibility to use a large set of third-party Python libraries to work with the data; e) a Python implementation of the HT lineshape, which can be reduced to a number of conventional line profiles; f) a Python implementation of total internal partition sums (TIPS-2011) for spectra simulations; g) high-resolution spectra calculation accounting for pressure, temperature and optical path length; h) instrumental functions to simulate experimental spectra; i) the possibility to extend HAPI's functionality with custom line profiles, partition sums and instrumental functions. Currently the API is a module written in Python and uses the Numpy library, which provides fast array operations. The API is designed to deal with data in multiple formats such as ASCII, CSV, HDF5 and XSAMS. This work has been supported by NASA Aura Science Team Grant NNX14AI55G and NASA Planetary Atmospheres Grant NNX13AI59G. L.S. Rothman et al., JQSRT, Volume 130, 2013, Pages 4-50; N.H. Ngo et al., JQSRT, Volume 129, November 2013, Pages 89-100; A. L. Laraia et al., Icarus, Volume 215, Issue 1, September 2011, Pages 391-400.
Haverkamp, Christian; Ganslandt, Thomas; Horki, Petar; Boeker, Martin; Dörfler, Arnd; Schwab, Stefan; Berkefeld, Joachim; Pfeilschifter, Waltraud; Niesen, Wolf-Dirk; Egger, Karl; Kaps, Manfred; Brockmann, Marc A; Neumaier-Probst, Eva; Szabo, Kristina; Skalej, Martin; Bien, Siegfried; Best, Christoph; Prokosch, Hans-Ulrich; Urbach, Horst
2018-01-08
Mechanical thrombectomy, in addition to intravenous (i.v.) thrombolysis, is recommended for treatment of acute stroke in patients with large vessel occlusions (LVO) in the anterior circulation up to 6 h after symptom onset. We compared thrombectomy rates of eight university hospitals of the MIRACUM consortium to analyze the implementation of this guideline in clinical routine. Anonymized billing data in a standardized format were loaded into a local i2b2 data warehouse by applying already existing extract, transform and load (ETL) routines. A locally executed uniform SQL (structured query language) query delivered aggregated site data for all inpatients with a discharge diagnosis of ischemic stroke (ICD-10 I63) containing counts for type of acute treatment, type of admission and age groups, which were centrally analyzed with R. From 2014 to 2016, the thrombectomy rate almost doubled from a mean of 4.7% to 9.6%, although significant differences between centers exist (range in 2016: 5.8-17%). The number of drip-and-ship procedures increased in 3 out of 8 centers. There was no evidence for a decrease in thrombectomy rates during weekends/holidays or among patients older than 80 years, but this age group is more likely to receive i.v. recombinant tissue plasminogen activator (rtPA). The observed increase of thrombectomy rates and drip-and-ship procedures without a significant difference between weekdays and weekends or patients of different ages substantiates a rapid implementation of stroke guidelines within the analyzed neurovascular centers. The prototype of the MIRACUM Data Integration Center already contributes to health services research in Germany.
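A uniform, locally executed aggregation query of the kind described here might look like the following sketch. Everything concrete in it is an assumption for illustration: the flat cases table stands in for the i2b2 warehouse schema, sqlite3 stands in for the sites' databases, and the rows are toy data, so the computed rates have no relation to the study's results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat table; the real i2b2 schema is considerably richer.
conn.execute("CREATE TABLE cases (year INTEGER, icd10 TEXT, treatment TEXT)")
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [
        (2014, "I63.4", "thrombectomy"),
        (2014, "I63.4", "thrombolysis"),
        (2014, "I63.4", "thrombolysis"),
        (2014, "I63.5", "none"),
        (2016, "I63.4", "thrombectomy"),
        (2016, "I63.5", "thrombectomy"),
        (2016, "I63.4", "none"),
        (2016, "I63.5", "none"),
        (2016, "I61.0", "none"),  # not ischemic stroke, excluded by the filter
    ],
)

# Only aggregated counts leave the site: one row per year, no patient-level data.
rows = conn.execute(
    "SELECT year, COUNT(*) AS n_cases,"
    "       SUM(treatment = 'thrombectomy') AS n_thrombectomy"
    " FROM cases WHERE icd10 LIKE 'I63%'"
    " GROUP BY year ORDER BY year"
).fetchall()

# Central analysis step: turn the counts into a percentage per year.
rates = {year: 100.0 * n_thromb / n_cases for year, n_cases, n_thromb in rows}
```

Running the same query text at every site and pooling only the aggregated rows is what makes the per-center results comparable without sharing individual records.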
Fragger: a protein fragment picker for structural queries.
Berenger, Francois; Simoncini, David; Voet, Arnout; Shrestha, Rojan; Zhang, Kam Y J
2017-01-01
Protein modeling and design activities often require querying the Protein Data Bank (PDB) with a structural fragment, possibly containing gaps. For some applications, it is preferable to work on a specific subset of the PDB or with unpublished structures. These requirements, along with specific user needs, motivated the creation of a new software to manage and query 3D protein fragments. Fragger is a protein fragment picker that allows protein fragment databases to be created and queried. All fragment lengths are supported and any set of PDB files can be used to create a database. Fragger can efficiently search a fragment database with a query fragment and a distance threshold. Matching fragments are ranked by distance to the query. The query fragment can have structural gaps and the allowed amino acid sequences matching a query can be constrained via a regular expression of one-letter amino acid codes. Fragger also incorporates a tool to compute the backbone RMSD of one versus many fragments in high throughput. Fragger should be useful for protein design, loop grafting and related structural bioinformatics tasks.
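The search behaviour described in this abstract (a distance threshold, ranking by distance, and a regular-expression constraint on the sequence) can be sketched as follows. This is not Fragger's code or interface: the function names and the tiny fragment database are hypothetical, the RMSD is a plain coordinate RMSD with no superposition step, and gapped queries are not handled.

```python
import math
import re


def rmsd(a, b):
    """Coordinate RMSD between two equal-length lists of (x, y, z) points.

    Assumption: fragments are already in a common frame; no superposition is done.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


def search(database, query, threshold, seq_pattern=None):
    """Return (name, distance) for fragments within threshold, nearest first."""
    pattern = re.compile(seq_pattern) if seq_pattern else None
    hits = []
    for name, sequence, coords in database:
        if len(coords) != len(query):
            continue  # only fragments of the query's length are comparable
        if pattern and not pattern.fullmatch(sequence):
            continue  # one-letter amino acid sequence must match the regex
        distance = rmsd(query, coords)
        if distance <= threshold:
            hits.append((distance, name))
    return [(name, distance) for distance, name in sorted(hits)]


# Hypothetical three-residue fragments: (name, sequence, coordinates).
database = [
    ("f1", "GAG", [(0, 0, 0), (1, 0, 0), (2, 0, 0)]),
    ("f2", "GPG", [(0, 0, 0), (1, 0, 0), (2, 0, 3)]),
    ("f3", "AAA", [(0, 0, 0), (1, 1, 0), (2, 0, 0)]),
]
query = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]

unconstrained = search(database, query, threshold=2.0)
constrained = search(database, query, threshold=2.0, seq_pattern="G[AP]G")
```

The regular expression acts as a cheap pre-filter before any geometry is computed, so constraining the sequence also reduces the number of distance evaluations.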
Naturally Occurring Human Urinary Peptides for Use in Diagnosis of Chronic Kidney Disease*
Good, David M.; Zürbig, Petra; Argilés, Àngel; Bauer, Hartwig W.; Behrens, Georg; Coon, Joshua J.; Dakna, Mohammed; Decramer, Stéphane; Delles, Christian; Dominiczak, Anna F.; Ehrich, Jochen H. H.; Eitner, Frank; Fliser, Danilo; Frommberger, Moritz; Ganser, Arnold; Girolami, Mark A.; Golovko, Igor; Gwinner, Wilfried; Haubitz, Marion; Herget-Rosenthal, Stefan; Jankowski, Joachim; Jahn, Holger; Jerums, George; Julian, Bruce A.; Kellmann, Markus; Kliem, Volker; Kolch, Walter; Krolewski, Andrzej S.; Luppi, Mario; Massy, Ziad; Melter, Michael; Neusüss, Christian; Novak, Jan; Peter, Karlheinz; Rossing, Kasper; Rupprecht, Harald; Schanstra, Joost P.; Schiffer, Eric; Stolzenburg, Jens-Uwe; Tarnow, Lise; Theodorescu, Dan; Thongboonkerd, Visith; Vanholder, Raymond; Weissinger, Eva M.; Mischak, Harald; Schmitt-Kopplin, Philippe
2010-01-01
Because of its availability, ease of collection, and correlation with physiology and pathology, urine is an attractive source for clinical proteomics/peptidomics. However, the lack of comparable data sets from large cohorts has greatly hindered the development of clinical proteomics. Here, we report the establishment of a reproducible, high resolution method for peptidome analysis of naturally occurring human urinary peptides and proteins, ranging from 800 to 17,000 Da, using samples from 3,600 individuals analyzed by capillary electrophoresis coupled to MS. All processed data were deposited in a Structured Query Language (SQL) database. This database currently contains 5,010 relevant unique urinary peptides that serve as a pool of potential classifiers for diagnosis and monitoring of various diseases. As an example, by using this source of information, we were able to define urinary peptide biomarkers for chronic kidney diseases, allowing diagnosis of these diseases with high accuracy. Application of the chronic kidney disease-specific biomarker set to an independent test cohort in the subsequent replication phase resulted in 85.5% sensitivity and 100% specificity. These results indicate the potential usefulness of capillary electrophoresis coupled to MS for clinical applications in the analysis of naturally occurring urinary peptides. PMID:20616184
Development of management information system for land in mine area based on MapInfo
NASA Astrophysics Data System (ADS)
Wang, Shi-Dong; Liu, Chuang-Hua; Wang, Xin-Chuang; Pan, Yan-Yu
2008-10-01
MapInfo is currently a popular GIS software package. This paper introduces the characteristics of MapInfo and the secondary development methods it offers for GIS, which are based on MapBasic, OLE automation, and the MapX control, respectively. Taking the development of a land management information system for a mine area as an example, the paper discusses the method of developing GIS applications based on MapX and describes the system in detail, including the development environment, overall design, design and realization of each function module, and a simple application of the system. The system uses MapX 5.0 and Visual Basic 6.0 as the development platform, SQL Server 2005 as the back-end database, and Matlab 6.5 for back-end numerical calculation. On the basis of an integrated design, the system comprises eight modules: start-up, layer control, spatial query, spatial analysis, data editing, application model, document management, and results output. The system can be used in mine areas for cadastral management, land use structure optimization, land reclamation, land evaluation, analysis and forecasting of land and environmental disruption in the mine area, thematic mapping, and so on.
Domain fusion analysis by applying relational algebra to protein sequence and domain databases
Truong, Kevin; Ikura, Mitsuhiko
2003-01-01
Background Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages these efforts will become increasingly powerful. Results This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypotheses. Results can be viewed at . Conclusion As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time. PMID:12734020
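A domain fusion search can be phrased as a self-join in standard SQL, as in the sketch below. This is one plausible formulation and not the authors' query: the single table `protein_domain(protein, domain)` and the toy rows are assumptions, and organism information is ignored for brevity. Two proteins are reported as putative partners when each carries one domain of a composite protein and lacks the other.

```python
import sqlite3

# Hypothetical schema: one row per (protein, predicted domain) assignment.
FUSION_QUERY = """
    SELECT DISTINCT a.protein AS partner_a,
                    b.protein AS partner_b,
                    c1.protein AS composite
    FROM protein_domain AS c1
    JOIN protein_domain AS c2
      ON c2.protein = c1.protein AND c1.domain < c2.domain
    JOIN protein_domain AS a
      ON a.domain = c1.domain AND a.protein <> c1.protein
    JOIN protein_domain AS b
      ON b.domain = c2.domain AND b.protein <> c1.protein AND b.protein <> a.protein
    WHERE NOT EXISTS (SELECT 1 FROM protein_domain AS x
                      WHERE x.protein = a.protein AND x.domain = c2.domain)
      AND NOT EXISTS (SELECT 1 FROM protein_domain AS y
                      WHERE y.protein = b.protein AND y.domain = c1.domain)
    ORDER BY partner_a, partner_b, composite
"""


def find_fusion_links(assignments):
    """Return (partner_a, partner_b, composite) triples for the given assignments."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein_domain (protein TEXT, domain TEXT)")
    conn.executemany("INSERT INTO protein_domain VALUES (?, ?)", assignments)
    links = conn.execute(FUSION_QUERY).fetchall()
    conn.close()
    return links


# Toy assignments invented for illustration.
toy_assignments = [
    ("fusAB", "A"), ("fusAB", "B"),  # composite protein with both domains
    ("p1", "A"),                     # carries only domain A
    ("p2", "B"),                     # carries only domain B
    ("p3", "C"),                     # unrelated
    ("p4", "A"), ("p4", "B"),        # a second composite
]
links = find_fusion_links(toy_assignments)
```

Since the query uses only joins and `NOT EXISTS`, it can be rerun unchanged whenever the underlying domain assignments are refreshed.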
[Establishment of a comprehensive database for laryngeal cancer related genes and the miRNAs].
Li, Mengjiao; E, Qimin; Liu, Jialin; Huang, Tingting; Liang, Chuanyu
2015-09-01
The aim was to build a comprehensive laryngeal cancer-related gene database by collecting and analyzing laryngeal cancer-related genes and miRNAs. Unlike current biological information databases, which have complex and clumsy structures, this database focuses on genes and miRNAs, and could make research and teaching more convenient and efficient. Based on the B/S (browser/server) architecture, using Apache as the Web server, MySQL as the database management system and PHP as the coding language for web design, a comprehensive database for laryngeal cancer-related genes was established, providing gene tables, protein tables, miRNA tables and clinical information tables of patients with laryngeal cancer. The established database contained 207 laryngeal cancer-related genes, 243 proteins, 26 miRNAs, and their particular information such as mutations, methylations, diversified expressions, and the empirical references of laryngeal cancer-relevant molecules. The database could be accessed and operated via the Internet, through which browsing and retrieval of the information were performed. The database was maintained and updated regularly. The database for laryngeal cancer-related genes is resource-integrated and user-friendly, providing a genetic information query tool for the study of laryngeal cancer.
Version VI of the ESTree db: an improved tool for peach transcriptome analysis
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Merelli, Ivan; Barale, Francesca; Milanesi, Luciano; Stella, Alessandra; Pozzi, Carlo
2008-01-01
Background The ESTree database (db) is a collection of Prunus persica and Prunus dulcis EST sequences that in its current version encompasses 75,404 sequences from 3 almond and 19 peach libraries. Nine peach genotypes and four peach tissues are represented, from four fruit developmental stages. The aim of this work was to implement the already existing ESTree db by adding new sequences and analysis programs. Particular care was given to the implementation of the web interface, that allows querying each of the database features. Results A Perl modular pipeline is the backbone of sequence analysis in the ESTree db project. Outputs obtained during the pipeline steps are automatically arrayed into the fields of a MySQL database. Apart from standard clustering and annotation analyses, version VI of the ESTree db encompasses new tools for tandem repeat identification, annotation against genomic Rosaceae sequences, and positioning on the database of oligomer sequences that were used in a peach microarray study. Furthermore, known protein patterns and motifs were identified by comparison to PROSITE. Based on data retrieved from sequence annotation against the UniProtKB database, a script was prepared to track positions of homologous hits on the GO tree and build statistics on the ontologies distribution in GO functional categories. EST mapping data were also integrated in the database. The PHP-based web interface was upgraded and extended. The aim of the authors was to enable querying the database according to all the biological aspects that can be investigated from the analysis of data available in the ESTree db. This is achieved by allowing multiple searches on logical subsets of sequences that represent different biological situations or features. Conclusions The version VI of ESTree db offers a broad overview on peach gene expression. 
Sequence analyses results contained in the database, extensively linked to external related resources, represent a large amount of information that can be queried via the tools offered in the web interface. Flexibility and modularity of the ESTree analysis pipeline and of the web interface allowed the authors to set up similar structures for different datasets, with limited manual intervention. PMID:18387211
Employing computers for the recruitment into clinical trials: a comprehensive systematic review.
Köpcke, Felix; Prokosch, Hans-Ulrich
2014-07-01
Medical progress depends on the evaluation of new diagnostic and therapeutic interventions within clinical trials. Clinical trial recruitment support systems (CTRSS) aim to improve the recruitment process in terms of effectiveness and efficiency. The goals were to (1) create an overview of all CTRSS reported until the end of 2013, (2) find and describe similarities in design, (3) theorize on the reasons for different approaches, and (4) examine whether projects were able to illustrate the impact of CTRSS. We searched PubMed titles, abstracts, and keywords for terms related to CTRSS research. Query results were classified according to clinical context, workflow integration, knowledge and data sources, reasoning algorithm, and outcome. A total of 101 papers on 79 different systems were found. Most lacked details in one or more categories. There were 3 different CTRSS that dominated: (1) systems for the retrospective identification of trial participants based on existing clinical data, typically through Structured Query Language (SQL) queries on relational databases, (2) systems that monitored the appearance of a key event of an existing health information technology component in which the occurrence of the event caused a comprehensive eligibility test for a patient or was directly communicated to the researcher, and (3) independent systems that required a user to enter patient data into an interface to trigger an eligibility assessment. Although the treating physician was required to act for the patient in older systems, it is now becoming increasingly popular to offer this possibility directly to the patient. Many CTRSS are designed to fit the existing infrastructure of a clinical care provider or the particularities of a trial. We conclude that the success of a CTRSS depends more on its successful workflow integration than on sophisticated reasoning and data processing algorithms. 
Furthermore, some of the most recent literature suggests that an increase in recruited patients and improvements in recruitment efficiency can be expected, although the former will depend on the error rate of the recruitment process being replaced. Finally, to increase the quality of future CTRSS reports, we propose a checklist of items that should be included.
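The first system type named in the review, retrospective identification of candidates through SQL queries on existing clinical data, can be illustrated with a short sketch. The schema (`patient`, `diagnosis`), the eligibility criteria, and the toy records are all hypothetical; real eligibility rules are usually far richer.

```python
import sqlite3


def find_candidates(patients, diagnoses, icd_prefix, min_age, max_age):
    """Return sorted ids of patients with a matching diagnosis in the age range.

    patients: (id, age) tuples; diagnoses: (patient_id, icd10_code) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, age INTEGER)")
    conn.execute("CREATE TABLE diagnosis (patient_id INTEGER, icd10 TEXT)")
    conn.executemany("INSERT INTO patient VALUES (?, ?)", patients)
    conn.executemany("INSERT INTO diagnosis VALUES (?, ?)", diagnoses)
    rows = conn.execute(
        """
        SELECT DISTINCT p.id
        FROM patient AS p
        JOIN diagnosis AS d ON d.patient_id = p.id
        WHERE d.icd10 LIKE ? AND p.age BETWEEN ? AND ?
        ORDER BY p.id
        """,
        (icd_prefix + "%", min_age, max_age),  # bound parameters, no string pasting
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Toy records invented for illustration.
toy_patients = [(1, 45), (2, 70), (3, 82), (4, 55)]
toy_diagnoses = [(1, "E11.9"), (2, "E11.2"), (2, "I10"), (3, "E11.9"), (4, "I10")]
candidates = find_candidates(toy_patients, toy_diagnoses, "E11", 18, 75)
```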
Ahmed, Aqeel; Smith, Richard D; Clark, Jordan J; Dunbar, James B; Carlson, Heather A
2015-01-01
For over 10 years, Binding MOAD (Mother of All Databases; http://www.BindingMOAD.org) has been one of the largest resources for high-quality protein-ligand complexes and associated binding affinity data. Binding MOAD has grown at the rate of 1994 complexes per year, on average. Currently, it contains 23,269 complexes and 8156 binding affinities. Our annual updates curate the data using a semi-automated literature search of the references cited within the PDB file, and we have recently upgraded our website and added new features and functionalities to better serve Binding MOAD users. In order to eliminate the legacy application server of the old platform and to accommodate new changes, the website has been completely rewritten in the LAMP (Linux, Apache, MySQL and PHP) environment. The improved user interface incorporates current third-party plugins for better visualization of protein and ligand molecules, and it provides features like sorting, filtering and filtered downloads. In addition to the field-based searching, Binding MOAD now can be searched by structural queries based on the ligand. In order to remove redundancy, Binding MOAD records are clustered in different families based on 90% sequence identity. The new Binding MOAD, with the upgraded platform, features and functionalities, is now equipped to better serve its users. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Chapter 51: How to Build a Simple Cone Search Service Using a Local Database
NASA Astrophysics Data System (ADS)
Kent, B. R.; Greene, G. R.
The cone search service protocol will be examined from the server side in this chapter. A simple cone search service will be set up and configured locally using MySQL. Data will be read into a table, and the Java JDBC will be used to connect to the database. Readers will understand the VO cone search specification and how to use it to query a database on their local systems and return an XML/VOTable file based on an input of RA/DEC coordinates and a search radius. The cone search in this example will be deployed as a Java servlet. The resulting cone search can be tested with a verification service. This basic setup can be used with other languages and relational databases.
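Since the chapter notes that the setup carries over to other languages and relational databases, the core of a cone search can be sketched in Python with SQLite instead of Java and MySQL. The table `sources`, its column names, and the toy catalogue are assumptions for this example, and VOTable serialization is omitted. A coarse declination window is applied in SQL, followed by an exact great-circle test.

```python
import math
import sqlite3


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(conn, ra, dec, sr):
    """Return [(separation_deg, name)] for sources within sr degrees of (ra, dec)."""
    # Cheap pre-filter on declination in SQL; the exact test happens in Python.
    rows = conn.execute(
        "SELECT name, ra_deg, dec_deg FROM sources WHERE dec_deg BETWEEN ? AND ?",
        (dec - sr, dec + sr),
    ).fetchall()
    hits = []
    for name, ra_deg, dec_deg in rows:
        separation = angular_separation_deg(ra, dec, ra_deg, dec_deg)
        if separation <= sr:
            hits.append((separation, name))
    return sorted(hits)


# Toy catalogue invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sources (name TEXT, ra_deg REAL, dec_deg REAL)")
conn.executemany(
    "INSERT INTO sources VALUES (?, ?, ?)",
    [("A", 10.0, 20.0), ("B", 10.5, 20.0), ("C", 15.0, 20.0), ("D", 10.0, 25.0)],
)
matches = cone_search(conn, ra=10.0, dec=20.0, sr=1.0)
```

The declination window keeps the SQL simple and index-friendly; a right-ascension window would need extra care near the poles and at the 0/360 degree wrap, which is why it is left out of this sketch.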
Integration of gel-based proteome data with pProRep.
Laukens, Kris; Matthiesen, Rune; Lemière, Filip; Esmans, Eddy; Onckelen, Harry Van; Jensen, Ole Nørregaard; Witters, Erwin
2006-11-15
pProRep is a web application integrating electrophoretic and mass spectral data from proteome analyses into a relational database. The graphical web-interface allows users to upload, analyse and share experimental proteome data. It offers researchers the possibility to query all previously analysed datasets and can visualize selected features, such as the presence of a certain set of ions in a peptide mass spectrum, on the level of the two-dimensional gel. The pProRep package and instructions for its use can be downloaded from http://www.ptools.ua.ac.be/pProRep. The application requires a web server that runs PHP 5 (http://www.php.net) and MySQL. Some (non-essential) extensions need additional freely available libraries: details are described in the installation instructions.
Development of Human Face Literature Database Using Text Mining Approach: Phase I.
Kaur, Paramjit; Krishan, Kewal; Sharma, Suresh K
2018-06-01
The face is an important part of the human body through which an individual communicates in society. Its importance is highlighted by the fact that a person deprived of a face cannot sustain themselves in the living world. The number of experiments being performed and research papers being published in the domain of the human face has surged in the past few decades. Scientific disciplines conducting research on the human face include Medical Science, Anthropology, Information Technology (Biometrics, Robotics, Artificial Intelligence, etc.), Psychology, Forensic Science, and Neuroscience. This creates a need to collect and manage data concerning the human face so that public and free access to it can be provided to the scientific community. This can be attained by developing databases and tools on the human face using a bioinformatics approach. The current research focuses on creating a database of literature data on the human face. The database can be accessed on the basis of specific keywords, journal name, date of publication, author's name, etc. The collected research papers will be stored in the form of a database. Hence, the database will benefit the research community, as comprehensive information dedicated to the human face can be found in one place. Information related to facial morphologic features, facial disorders, facial asymmetry, facial abnormalities, and many other parameters can be extracted from this database. The front end has been developed using Hypertext Markup Language (HTML) and Cascading Style Sheets (CSS). The back end has been developed using PHP (Hypertext Preprocessor). JavaScript has been used as the scripting language. MySQL is used for database development, as it is among the most widely used relational database management systems. 
XAMPP (X (cross platform), Apache, MySQL, PHP, Perl) open source web application software has been used as the server. The database is still in the developmental phase, and this paper discusses the initial steps of its creation and the work done to date.
NASA Astrophysics Data System (ADS)
Dabiru, L.; O'Hara, C. G.; Shaw, D.; Katragadda, S.; Anderson, D.; Kim, S.; Shrestha, B.; Aanstoos, J.; Frisbie, T.; Policelli, F.; Keblawi, N.
2006-12-01
The Research Project Knowledge Base (RPKB) is currently being designed and will be implemented in a manner that is fully compatible and interoperable with enterprise architecture tools developed to support NASA's Applied Sciences Program. Through user needs assessment, collaboration with Stennis Space Center, Goddard Space Flight Center, and NASA's DEVELOP Staff personnel insight to information needs for the RPKB were gathered from across NASA scientific communities of practice. To enable efficient, consistent, standard, structured, and managed data entry and research results compilation a prototype RPKB has been designed and fully integrated with the existing NASA Earth Science Systems Components database. The RPKB will compile research project and keyword information of relevance to the six major science focus areas, 12 national applications, and the Global Change Master Directory (GCMD). The RPKB will include information about projects awarded from NASA research solicitations, project investigator information, research publications, NASA data products employed, and model or decision support tools used or developed as well as new data product information. The RPKB will be developed in a multi-tier architecture that will include a SQL Server relational database backend, middleware, and front end client interfaces for data entry. The purpose of this project is to intelligently harvest the results of research sponsored by the NASA Applied Sciences Program and related research program results. We present various approaches for a wide spectrum of knowledge discovery of research results, publications, projects, etc. from the NASA Systems Components database and global information systems and show how this is implemented in SQL Server database. The application of knowledge discovery is useful for intelligent query answering and multiple-layered database construction. 
Using advanced EA tools such as the Earth Science Architecture Tool (ESAT), RPKB will enable NASA and partner agencies to efficiently identify the significant results for new experiment directions and principal investigators to formulate experiment directions for new proposals.
Concept-based query language approach to enterprise information systems
NASA Astrophysics Data System (ADS)
Niemi, Timo; Junkkari, Marko; Järvelin, Kalervo
2014-01-01
In enterprise information systems (EISs) it is necessary to model, integrate and compute very diverse data. In advanced EISs the stored data often are based both on structured (e.g. relational) and semi-structured (e.g. XML) data models. In addition, the ad hoc information needs of end-users may require the manipulation of data-oriented (structural), behavioural and deductive aspects of data. Contemporary languages capable of treating this kind of diversity suit only persons with good programming skills. In this paper we present a concept-oriented query language approach to manipulate this diversity so that the programming skill requirements are considerably reduced. In our query language, the features which need technical knowledge are hidden in application-specific concepts and structures. Therefore, users need not be aware of the underlying technology. Application-specific concepts and structures are represented by the modelling primitives of the extended RDOOM (relational deductive object-oriented modelling) which contains primitives for all crucial real world relationships (is-a relationship, part-of relationship, association), XML documents and views. Our query language also supports intensional and extensional-intensional queries, in addition to conventional extensional queries. In its query formulation, the end-user combines available application-specific concepts and structures through shared variables.
The HITRAN2016 Molecular Spectroscopic Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gordon, I. E.; Rothman, L. S.; Hill, C.
This article describes the contents of the 2016 edition of the HITRAN molecular spectroscopic compilation. The new edition replaces the previous HITRAN edition of 2012 and its updates during the intervening years. The HITRAN molecular absorption compilation is composed of five major components: the traditional line-by-line spectroscopic parameters required for high-resolution radiative-transfer codes, infrared absorption cross-sections for molecules not yet amenable to representation in a line-by-line form, collision-induced absorption data, aerosol indices of refraction, and general tables such as partition sums that apply globally to the data. The new HITRAN is greatly extended in terms of accuracy, spectral coverage, additional absorption phenomena, added line-shape formalisms, and validity. Moreover, molecules, isotopologues, and perturbing gases have been added that address the issues of atmospheres beyond the Earth. Of considerable note, experimental IR cross-sections for almost 300 additional molecules important in different areas of atmospheric science have been added to the database. The compilation can be accessed through www.hitran.org. Most of the HITRAN data have now been cast into an underlying relational database structure that offers many advantages over the long-standing sequential text-based structure. The new structure empowers the user in many ways. It enables the incorporation of an extended set of fundamental parameters per transition, sophisticated line-shape formalisms, easy user-defined output formats, and very convenient searching, filtering, and plotting of data. Finally, a powerful application programming interface making use of structured query language (SQL) features for higher-level applications of HITRAN is also provided.
The HITRAN2016 Molecular Spectroscopic Database
Gordon, I. E.; Rothman, L. S.; Hill, C.; ...
2017-07-05
This article describes the contents of the 2016 edition of the HITRAN molecular spectroscopic compilation. The new edition replaces the previous HITRAN edition of 2012 and its updates during the intervening years. The HITRAN molecular absorption compilation is composed of five major components: the traditional line-by-line spectroscopic parameters required for high-resolution radiative-transfer codes, infrared absorption cross-sections for molecules not yet amenable to representation in a line-by-line form, collision-induced absorption data, aerosol indices of refraction, and general tables such as partition sums that apply globally to the data. The new HITRAN is greatly extended in terms of accuracy, spectral coverage, additional absorption phenomena, added line-shape formalisms, and validity. Moreover, molecules, isotopologues, and perturbing gases have been added that address the issues of atmospheres beyond the Earth. Of considerable note, experimental IR cross-sections for almost 300 additional molecules important in different areas of atmospheric science have been added to the database. The compilation can be accessed through www.hitran.org. Most of the HITRAN data have now been cast into an underlying relational database structure that offers many advantages over the long-standing sequential text-based structure. The new structure empowers the user in many ways. It enables the incorporation of an extended set of fundamental parameters per transition, sophisticated line-shape formalisms, easy user-defined output formats, and very convenient searching, filtering, and plotting of data. Finally, a powerful application programming interface making use of structured query language (SQL) features for higher-level applications of HITRAN is also provided.
Pathogen metadata platform: software for accessing and analyzing pathogen strain information.
Chang, Wenling E; Peterson, Matthew W; Garay, Christopher D; Korves, Tonia
2016-09-15
Pathogen metadata includes information about where and when a pathogen was collected and the type of environment it came from. Along with genomic nucleotide sequence data, this metadata is growing rapidly and becoming a valuable resource not only for research but for biosurveillance and public health. However, current freely available tools for analyzing this data are geared towards bioinformaticians and/or do not provide summaries and visualizations needed to readily interpret results. We designed a platform to easily access and summarize data about pathogen samples. The software includes a PostgreSQL database that captures metadata useful for disease outbreak investigations, and scripts for downloading and parsing data from NCBI BioSample and BioProject into the database. The software provides a user interface to query metadata and obtain standardized results in an exportable, tab-delimited format. To visually summarize results, the user interface provides a 2D histogram for user-selected metadata types and mapping of geolocated entries. The software is built on the LabKey data platform, an open-source data management platform, which enables developers to add functionalities. We demonstrate the use of the software in querying for a pathogen serovar and for genome sequence identifiers. This software enables users to create a local database for pathogen metadata, populate it with data from NCBI, easily query the data, and obtain visual summaries. Some of the components, such as the database, are modular and can be incorporated into other data platforms. The source code is freely available for download at https://github.com/wchangmitre/bioattribution .
BigQ: a NoSQL based framework to handle genomic variants in i2b2.
Gabetta, Matteo; Limongelli, Ivan; Rizzo, Ettore; Riva, Alberto; Segagni, Daniele; Bellazzi, Riccardo
2015-12-29
Precision medicine requires the tight integration of clinical and molecular data. To this end, it is mandatory to define proper technological solutions able to manage the overwhelming amount of high throughput genomic data needed to test associations between genomic signatures and human phenotypes. The i2b2 Center (Informatics for Integrating Biology and the Bedside) has developed a widely internationally adopted framework to use existing clinical data for discovery research that can help the definition of precision medicine interventions when coupled with genetic data. i2b2 can be significantly advanced by designing efficient management solutions of Next Generation Sequencing data. We developed BigQ, an extension of the i2b2 framework, which integrates patient clinical phenotypes with genomic variant profiles generated by Next Generation Sequencing. A visual programming i2b2 plugin allows retrieving variants belonging to the patients in a cohort by applying filters on genomic variant annotations. We report an evaluation of the query performance of our system on more than 11 million variants, showing that the implemented solution scales linearly in terms of query time and disk space with the number of variants. In this paper we describe a new i2b2 web service composed of an efficient and scalable document-based database that manages annotations of genomic variants and of a visual programming plug-in designed to dynamically perform queries on clinical and genetic data. The system therefore allows managing the fast growing volume of genomic variants and can be used to integrate heterogeneous genomic annotations.
Comparative study on the customization of natural language interfaces to databases.
Pazos R, Rodolfo A; Aguirre L, Marco A; González B, Juan J; Martínez F, José A; Pérez O, Joaquín; Verástegui O, Andrés A
2016-01-01
In recent decades the popularity of natural language interfaces to databases (NLIDBs) has increased, because in many cases information obtained from them is used for making important business decisions. Unfortunately, the complexity of their customization by database administrators makes them difficult to use. For an NLIDB to obtain a high percentage of correctly translated queries, it must be correctly customized for the database to be queried. In most cases the performance reported in the NLIDB literature is the highest possible; i.e., the performance obtained when the interfaces were customized by the implementers. However, for end users the more important figure is the performance that the interface can yield when the NLIDB is customized by someone other than the implementers. Unfortunately, very few articles report NLIDB performance when the NLIDBs are not customized by the implementers. This article presents a semantically-enriched data dictionary (which permits solving many of the problems that occur when translating from natural language to SQL) and an experiment in which two groups of undergraduate students customized our NLIDB and English language frontend (ELF), considered one of the best available commercial NLIDBs. The experimental results show that, when customized by the first group, our NLIDB correctly answered 44.69% of queries and ELF 11.83% for the ATIS database, and when customized by the second group, our NLIDB attained 77.05% and ELF 13.48%. The performance attained by our NLIDB when customized by ourselves was 90%.
NASA Astrophysics Data System (ADS)
Clements, O.; Siemen, S.; Wagemann, J.
2017-12-01
The EU-funded EarthServer-2 project aims to offer on-demand access to large volumes of environmental data (Earth Observation, Marine, Climate and Planetary data) via the Web Coverage Service (WCS) interface standard defined by the Open Geospatial Consortium (OGC). Providing access to data via OGC web services (e.g. WCS and WMS) has the potential to open up services to a wider audience, especially to users outside the respective communities. WCS 2.0 in particular, with its processing extension Web Coverage Processing Service (WCPS), is highly beneficial for making large volumes accessible to non-expert communities. Users do not have to deal with custom community data formats, such as GRIB for the meteorological community, but can directly access the data in a format they are more familiar with, such as NetCDF, JSON or CSV. Data requests can further be integrated directly into custom processing routines, and users are no longer required to download gigabytes of data. WCS supports trim (reduction of data extent) and slice (reduction of data dimension) operations on multi-dimensional data, providing users with very flexible on-demand access to the data. WCPS allows the user to craft queries to run on the data using a text-based query language, similar to SQL. These queries can be very powerful, e.g. condensing a three-dimensional data cube into its two-dimensional mean. However, the more processing-intensive the request, the more complex the query. As part of the EarthServer-2 project, we developed a Python library that helps users to generate complex WCPS queries with Python, a programming language they are more familiar with. The interactive presentation aims to give practical examples of how users can benefit from two specific WCS services from the Marine and Climate communities. Use cases from the two communities will show different approaches to taking advantage of a Web Coverage (Processing) Service. 
The entire content is available with Jupyter Notebooks, as they prove to be a highly beneficial tool to generate reproducible workflows for environmental data analysis.
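To make the idea of generating WCPS queries from Python more concrete, here is a minimal sketch. It is not the EarthServer-2 library, whose API is not described in the abstract; the helper names, the coverage name `temp2m`, the axis labels `Lat`, `Long` and `ansi`, and the endpoint URL are all assumptions chosen for illustration. Real coverages may use different axis names.

```python
from urllib.parse import urlencode


def wcps_trim_query(coverage, lat, lon, time_range, fmt="csv"):
    """Build a WCPS query that slices at one lat/lon point and trims along time.

    The axis labels (Lat, Long, ansi) are assumed; they depend on the coverage.
    """
    t0, t1 = time_range
    subset = f'Lat({lat}), Long({lon}), ansi("{t0}":"{t1}")'
    return f'for c in ({coverage}) return encode(c[{subset}], "{fmt}")'


def wcps_mean_query(coverage, lat, lon, time_range):
    """Build a WCPS query that condenses the same subset into its mean value."""
    t0, t1 = time_range
    subset = f'Lat({lat}), Long({lon}), ansi("{t0}":"{t1}")'
    return f"for c in ({coverage}) return avg(c[{subset}])"


def process_coverages_url(endpoint, query):
    """Wrap a WCPS query in a WCS ProcessCoverages key-value request."""
    params = {
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": query,
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical coverage and endpoint, used only to show the call pattern.
query = wcps_trim_query("temp2m", 53.0, 8.8, ("2015-01-01", "2015-01-31"))
url = process_coverages_url("https://example.org/rasdaman/ows", query)
```

Building the query as a string keeps the sketch independent of any client library; the resulting URL could then be fetched with whatever HTTP client is at hand.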
Multidimensional indexing structure for use with linear optimization queries
NASA Technical Reports Server (NTRS)
Bergman, Lawrence David (Inventor); Castelli, Vittorio (Inventor); Chang, Yuan-Chi (Inventor); Li, Chung-Sheng (Inventor); Smith, John Richard (Inventor)
2002-01-01
Linear optimization queries, which usually arise in various decision support and resource planning applications, are queries that retrieve the top N data records (where N is an integer greater than zero) that satisfy a specific optimization criterion. The optimization criterion is to either maximize or minimize a linear equation. The coefficients of the linear equation are given at query time. Methods and apparatus are disclosed for constructing, maintaining and utilizing a multidimensional indexing structure of database records to improve the execution speed of linear optimization queries. Database records with numerical attributes are organized into a number of layers, and each layer represents a geometric structure called a convex hull. Such linear optimization queries are processed by searching from the outermost layer of this multi-layer indexing structure inwards. At least one record per layer will satisfy the query criterion, and the number of layers that need to be searched depends on the spatial distribution of records, the query-issued linear coefficients, and N, the number of records to be returned. When N is small compared to the total size of the database, answering the query typically requires searching only a small fraction of all relevant records, resulting in a tremendous speedup compared to linearly scanning the entire dataset.
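The layered convex-hull index described above can be illustrated with a small two-dimensional sketch. This is a simplified reading of the abstract, not the patented method: it peels hulls with Andrew's monotone chain algorithm and, for a top-N maximization query, simply examines the N outermost layers instead of stopping early. All function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices of a 2D point set."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def build_layers(points):
    """Peel convex hulls repeatedly: layer 0 is the outermost hull."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers


def top_n(layers, weights, n):
    """Return the n records maximizing weights . record.

    Every record on an inner layer is dominated by some record on each
    layer outside it, so the n best records lie in the n outermost layers.
    """
    def score(p):
        return weights[0] * p[0] + weights[1] * p[1]

    candidates = [p for layer in layers[:n] for p in layer]
    return sorted(candidates, key=score, reverse=True)[:n]


records = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
layers = build_layers(records)
best = top_n(layers, weights=(1, 2), n=2)
```

A minimization query can reuse the same index by negating the weights. The layers are built once, while the weights arrive only at query time, which matches the setting the abstract describes.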
A future Outlook: Web based Simulation of Hydrodynamic models
NASA Astrophysics Data System (ADS)
Islam, A. S.; Piasecki, M.
2003-12-01
Despite recent advances in presenting simulation results as 3D graphs or animated contours, the modeling user community still faces some shortcomings when trying to move around and analyze data. Typical problems include the lack of common platforms with a standard vocabulary for exchanging simulation results from different numerical models, insufficient descriptions of data (metadata), a lack of robust search and retrieval tools for data, and difficulties in reusing simulation domain knowledge. This research demonstrates how to create a shared simulation domain on the WWW and run a number of models through multi-user interfaces. Firstly, meta-datasets have been developed to describe hydrodynamic model data based on a geographic metadata standard (ISO 19115) that has been extended to satisfy the needs of the hydrodynamic modeling community. The Extensible Markup Language (XML) is used to publish this metadata via the Resource Description Framework (RDF). A specific domain ontology for Web Based Simulation (WBS) has been developed to explicitly define the vocabulary for the knowledge based simulation system. Subsequently, this knowledge based system is converted into an object model using the Meta Object Facility (MOF). The knowledge based system acts as a meta model for the object oriented system, which aids in reusing the domain knowledge. Specific simulation software has been developed based on the object oriented model. Finally, all model data is stored in an object relational database. Database back-ends help store, retrieve and query information efficiently. This research uses open source software and technology such as Java Servlet and JSP, the Apache web server, the Tomcat Servlet Engine, PostgreSQL databases, the Protégé ontology editor, RDQL and RQL for querying RDF at the semantic level, and the Jena Java API for RDF. Also, we use international standards such as the ISO 19115 metadata standard, and specifications such as XML, RDF, OWL, XMI, and UML.
The final web based simulation product is deployed as Web Archive (WAR) files, which are platform and OS independent and can be used on Windows, UNIX, or Linux. Keywords: Apache, ISO 19115, Java Servlet, Jena, JSP, Metadata, MOF, Linux, Ontology, OWL, PostgreSQL, Protégé, RDF, RDQL, RQL, Tomcat, UML, UNIX, Windows, WAR, XML
An exponentiation method for XML element retrieval.
Wichaiwong, Tanakorn
2014-01-01
XML documents are now widely used for modelling and storing structured documents. The structure is very rich and carries important information about contents and their relationships, for example, in e-Commerce. XML data-centric collections require query terms allowing users to specify constraints on the document structure; mapping structure queries and assigning the weight are significant for the set of possibly relevant documents with respect to structural conditions. In this paper, we present an extension to the MEXIR search system that supports the combination of structural and content queries in the form of content-and-structure queries, which we call the Exponentiation function. It has been shown that structural information improves the effectiveness of the search system by up to 52.60% over the baseline BM25 at MAP.
Yeung, Daniel; Boes, Peter; Ho, Meng Wei; Li, Zuofeng
2015-05-08
Image-guided radiotherapy (IGRT), based on radiopaque markers placed in the prostate gland, was used for proton therapy of prostate patients. Orthogonal X-rays and the IBA Digital Image Positioning System (DIPS) were used for setup correction prior to treatment and were repeated after treatment delivery. Following a rationale for margin estimates similar to that of van Herk (1), the daily post-treatment DIPS data were analyzed to determine if an adaptive radiotherapy plan was necessary. A Web application using ASP.NET MVC5, Entity Framework, and an SQL database was designed to automate this process. The designed features included state-of-the-art Web technologies, a domain model closely matching the workflow, a database supporting concurrency and data mining, access to the DIPS database, secured user access and roles management, and graphing and analysis tools. The Model-View-Controller (MVC) paradigm allowed clean domain logic, unit testing, and extensibility. Client-side technologies, such as jQuery, jQuery Plug-ins, and Ajax, were adopted to achieve a rich user environment and fast response. Data models included patients, staff, treatment fields and records, correction vectors, DIPS images, and association logics. Data entry, analysis, workflow logics, and notifications were implemented. The system effectively modeled the clinical workflow and IGRT process.
A case Study of Applying Object-Relational Persistence in Astronomy Data Archiving
NASA Astrophysics Data System (ADS)
Yao, S. S.; Hiriart, R.; Barg, I.; Warner, P.; Gasson, D.
2005-12-01
The NOAO Science Archive (NSA) team is developing a comprehensive domain model to capture the science data in the archive. Java and an object model derived from the domain model will address the application layer of the archive system. However, since RDBMS is the best proven technology for data management, the challenge is the paradigm mismatch between the object and the relational models. Transparent object-relational mapping (ORM) persistence is a successful solution to this challenge. In the data modeling and persistence implementation of NSA, we are using Hibernate, a well-accepted ORM tool, to bridge the object model in the business tier and the relational model in the database tier. Thus, the database is isolated from the Java application. The application queries directly on objects using a DBMS-independent object-oriented query API, which frees the application developers from low-level JDBC and SQL so that they can focus on the domain logic. We present the detailed design of the NSA R3 (Release 3) data model and object-relational persistence, including mapping, retrieving and caching. Persistence layer optimization and performance tuning will be analyzed. The system is being built on J2EE, so the integration of Hibernate into the EJB container and the transaction management are also explored.
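The object-relational mapping idea in the abstract above can be sketched in a few lines. The NSA system uses Java and Hibernate; the following is only a toy data-mapper in Python with the standard-library `sqlite3` module, meant to show how application code can query objects without writing SQL itself. The `Exposure` class, its fields, and the `Repository` helper are hypothetical.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields


@dataclass
class Exposure:
    """Hypothetical domain object; not taken from the NSA data model."""
    id: int
    telescope: str
    exptime: float


class Repository:
    """Maps one dataclass to one table and hides the SQL from callers."""

    def __init__(self, conn, cls):
        self.conn = conn
        self.cls = cls
        self.table = cls.__name__.lower()
        self.cols = [f.name for f in fields(cls)]
        conn.execute(f"CREATE TABLE IF NOT EXISTS {self.table} ({', '.join(self.cols)})")

    def save(self, obj):
        marks = ", ".join("?" for _ in self.cols)
        self.conn.execute(f"INSERT INTO {self.table} VALUES ({marks})", astuple(obj))

    def find(self, **criteria):
        # Only known field names may appear in the WHERE clause.
        unknown = set(criteria) - set(self.cols)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        where = " AND ".join(f"{name} = ?" for name in criteria) or "1 = 1"
        sql = f"SELECT {', '.join(self.cols)} FROM {self.table} WHERE {where}"
        rows = self.conn.execute(sql, tuple(criteria.values()))
        return [self.cls(*row) for row in rows]


conn = sqlite3.connect(":memory:")
repo = Repository(conn, Exposure)
repo.save(Exposure(1, "Mayall", 30.0))
repo.save(Exposure(2, "Blanco", 120.0))
found = repo.find(telescope="Mayall")
```

A full ORM adds much more, such as relationship mapping, caching, and transaction handling, which are the topics the abstract says the paper covers.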
An Exponentiation Method for XML Element Retrieval
2014-01-01
XML documents are now widely used for modelling and storing structured documents. The structure is very rich and carries important information about contents and their relationships, for example, in e-Commerce. XML data-centric collections require query terms allowing users to specify constraints on the document structure; mapping structure queries and assigning the weight are significant for the set of possibly relevant documents with respect to structural conditions. In this paper, we present an extension to the MEXIR search system that supports the combination of structural and content queries in the form of content-and-structure queries, which we call the Exponentiation function. It has been shown that structural information improves the effectiveness of the search system by up to 52.60% over the baseline BM25 at MAP. PMID:24696643
An Analysis Platform for Mobile Ad Hoc Network (MANET) Scenario Execution Log Data
2016-01-01
these technologies. 4.1 Backend Technologies • Java 1.8 • my-sql-connector-java-5.0.8.jar • Tomcat • VirtualBox • Kali MANET Virtual Machine 4.2...Frontend Technologies • LAMPP 4.3 Database • MySQL Server 5. Database The SEDAP database settings and structure are described in this section...contains all the backend Java functionality including the web services, should be placed in the webapps directory inside the Tomcat installation
Automatic management system for dose parameters in interventional radiology and cardiology.
Ten, J I; Fernandez, J M; Vaño, E
2011-09-01
The purpose of this work was to develop an automatic management system to archive and analyse the major study parameters and patient doses for fluoroscopy guided procedures performed in cardiology and interventional radiology systems. The X-ray systems used for this trial can export, at the end of the procedure and via e-mail, the technical parameters of the study and the patient dose values. An application was developed to query and retrieve from a mail server all study reports sent by the imaging modality and store them in a Microsoft SQL Server database. The results from 3538 interventional study reports generated by 7 interventional systems were processed. For some technical parameters and patient doses, alarms were added to provide malfunction alerts so that appropriate corrective actions can be taken immediately.
Integrating a local database into the StarView distributed user interface
NASA Technical Reports Server (NTRS)
Silberberg, D. P.
1992-01-01
A distributed user interface to the Space Telescope Data Archive and Distribution Service (DADS) known as StarView is being developed. The DADS architecture consists of the data archive as well as a relational database catalog describing the archive. StarView is a client/server system in which the user interface is the front-end client to the DADS catalog and archive servers. Users query the DADS catalog from the StarView interface. Query commands are transmitted via a network and evaluated by the database. The results are returned via the network and are displayed on StarView forms. Based on the results, users decide which data sets to retrieve from the DADS archive. Archive requests are packaged by StarView and sent to DADS, which returns the requested data sets to the users. The advantages of distributed client/server user interfaces over traditional one-machine systems are well known. Since users run software on machines separate from the database, the overall client response time is much faster. Also, since the server is free to process only database requests, the database response time is much faster. Disadvantages inherent in this architecture are slow overall database access time due to network delays, the lack of a 'get previous row' command, and the fact that refinements of a previously issued query must be submitted to the database server, even though the domain of values has already been returned by the previous query. This architecture also does not allow users to cross correlate DADS catalog data with other catalogs. Clearly, a distributed user interface would be more powerful if it overcame these disadvantages. A local database is being integrated into StarView to overcome these disadvantages. When a query is made through a StarView form, which is often composed of fields from multiple tables, it is translated to an SQL query and issued to the DADS catalog. At the same time, a local database table is created to contain the resulting rows of the query.
The returned rows are displayed on the form as well as inserted into the local database table. Identical results are produced by reissuing the query to either the DADS catalog or the local table. Relational databases do not provide a 'get previous row' function because of the inherent complexity of retrieving previous rows of multiple-table joins. However, since this function is easily implemented on a single table, StarView uses the local table to retrieve the previous row. Also, StarView issues subsequent query refinements to the local table instead of the DADS catalog, eliminating the network transmission overhead. Finally, other catalogs can be imported into the local database for cross correlation with local tables. Overall, it is believed that this is a more powerful architecture for distributed database user interfaces.
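The local-table approach described above can be sketched with an in-memory SQLite database standing in for the StarView local database. This is an illustration under assumptions, not StarView code: the table name `result_cache`, its columns, and the sample rows are all invented. Rows returned by the remote catalog are copied into a single local table with a sequence number, which makes both 'get previous row' and query refinement simple local operations.

```python
import sqlite3


def cache_rows(local, rows):
    """Copy rows returned by the remote catalog into one flat local table."""
    local.execute(
        "CREATE TABLE result_cache ("
        "seq INTEGER PRIMARY KEY, dataset TEXT, target TEXT, exposure REAL)"
    )
    local.executemany(
        "INSERT INTO result_cache (dataset, target, exposure) VALUES (?, ?, ?)",
        rows,
    )


def previous_row(local, current_seq):
    """'Get previous row' on the single local table; None at the first row."""
    return local.execute(
        "SELECT seq, dataset, target, exposure FROM result_cache "
        "WHERE seq < ? ORDER BY seq DESC LIMIT 1",
        (current_seq,),
    ).fetchone()


def refine(local, min_exposure):
    """Refine the earlier result locally, with no round trip to the server."""
    return local.execute(
        "SELECT dataset FROM result_cache WHERE exposure >= ? ORDER BY seq",
        (min_exposure,),
    ).fetchall()


# Invented rows standing in for the result of a multi-table catalog query.
remote_result = [("d1", "M31", 100.0), ("d2", "M51", 300.0), ("d3", "M87", 50.0)]
local = sqlite3.connect(":memory:")
cache_rows(local, remote_result)
```

Because the join has already been flattened into one table, the sequence column is enough to step backwards, which is the simplification the abstract points to.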
Mining Longitudinal Web Queries: Trends and Patterns.
ERIC Educational Resources Information Center
Wang, Peiling; Berry, Michael W.; Yang, Yiheng
2003-01-01
Analyzed user queries submitted to an academic Web site during a four-year period, using a relational database, to examine users' query behavior, to identify problems they encounter, and to develop techniques for optimizing query analysis and mining. Linguistic analyses focus on query structures, lexicon, and word associations using statistical…
Research on high availability architecture of SQL and NoSQL
NASA Astrophysics Data System (ADS)
Wang, Zhiguo; Wei, Zhiqiang; Liu, Hao
2017-03-01
With the advent of the big data era, the amount and importance of data have increased dramatically. SQL databases continue to improve in performance and scalability, but more and more companies tend to use NoSQL databases, because they have a simpler data model and stronger extension capacity than SQL databases. Almost all database designers, for both SQL and NoSQL databases, aim to improve performance and ensure availability through a reasonable architecture that reduces the effects of software and hardware failures, so that they can provide a better experience for their customers. In this paper, I mainly discuss the architectures of MySQL, MongoDB, and Redis, which are highly available and have been deployed in practical application environments, and design a hybrid architecture.
NaKnowBaseTM: The EPA Nanomaterials Research ...
The ability to predict the environmental and health implications of engineered nanomaterials is an important research priority due to the exponential rate at which nanotechnology is being incorporated into consumer, industrial and biomedical applications. To address this need and develop predictive capability, we have created the NaKnowBaseTM, which provides a platform for the curation and dissemination of EPA nanomaterials data to support functional assay development, hazard risk models and informatic analyses. To date, we have combined relevant physicochemical parameters from other organizations (e.g., OECD, NIST) with those requested for nanomaterial data submitted to EPA under the Toxic Substances Control Act (TSCA). Physicochemical characterization data were collated from >400 unique nanomaterials including metals, metal oxides, carbon-based and hybrid materials evaluated or synthesized by EPA researchers. We constructed parameter requirements and table structures for encoding research metadata, including experimental factors and measured response variables. As a proof of concept, we illustrate how SQL-based queries facilitate a range of interrogations including, for example, relationships between nanoparticle characteristics and environmental or toxicological endpoints. The views expressed in this poster are those of the authors and may not reflect U.S. EPA policy. The purpose of this submission for clearance is an abstract for submission to a scientific
Liu, Yan-Lin; Shih, Cheng-Ting; Chang, Yuan-Jen; Chang, Shu-Jun; Wu, Jay
2014-01-01
The rapid development of picture archiving and communication systems (PACSs) thoroughly changes the way of medical informatics communication and management. However, as the scale of a hospital's operations increases, the large amount of digital images transferred in the network inevitably decreases system efficiency. In this study, a server cluster consisting of two server nodes was constructed. Network load balancing (NLB), distributed file system (DFS), and structured query language (SQL) duplication services were installed. A total of 1 to 16 workstations were used to transfer computed radiography (CR), computed tomography (CT), and magnetic resonance (MR) images simultaneously to simulate the clinical situation. The average transmission rate (ATR) was analyzed between the cluster and noncluster servers. In the download scenario, the ATRs of CR, CT, and MR images increased by 44.3%, 56.6%, and 100.9%, respectively, when using the server cluster, whereas the ATRs increased by 23.0%, 39.2%, and 24.9% in the upload scenario. In the mix scenario, the transmission performance increased by 45.2% when using eight computer units. The fault tolerance mechanisms of the server cluster maintained the system availability and image integrity. The server cluster can improve the transmission efficiency while maintaining high reliability and continuous availability in a healthcare environment.
Domain fusion analysis by applying relational algebra to protein sequence and domain databases.
Truong, Kevin; Ikura, Mitsuhiko
2003-05-06
Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages on these efforts will become increasingly powerful. This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypothesis. Results can be viewed at http://calcium.uhnres.utoronto.ca/pi. As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time.
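To illustrate how domain fusion detection can be phrased in standard SQL, here is a small sketch using the standard-library `sqlite3` module. It is not the authors' relational algebra expression: the `domain_hit` table, its columns, the protein and domain identifiers, and the requirement that the composite protein come from a different organism are all assumptions made for this example. The query looks for two distinct proteins in a target organism whose domains co-occur in a single composite protein elsewhere, while neither query protein carries the other's domain.

```python
import sqlite3

FUSION_SQL = """
SELECT DISTINCT a.protein, b.protein, c1.protein
FROM domain_hit AS c1
JOIN domain_hit AS c2
  ON c2.protein = c1.protein AND c1.domain < c2.domain
JOIN domain_hit AS a ON a.domain = c1.domain
JOIN domain_hit AS b ON b.domain = c2.domain
WHERE a.organism = :target
  AND b.organism = :target
  AND c1.organism <> :target
  AND a.protein <> b.protein
  AND NOT EXISTS (SELECT 1 FROM domain_hit AS x
                  WHERE x.protein = a.protein AND x.domain = c2.domain)
  AND NOT EXISTS (SELECT 1 FROM domain_hit AS y
                  WHERE y.protein = b.protein AND y.domain = c1.domain)
ORDER BY 1, 2, 3
"""


def find_domain_fusions(conn, target_organism):
    """Return (protein_a, protein_b, composite_protein) candidate linkages."""
    return conn.execute(FUSION_SQL, {"target": target_organism}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domain_hit (protein TEXT, organism TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO domain_hit VALUES (?, ?, ?)",
    [
        # Invented identifiers, for illustration only.
        ("P1", "human", "DOM_A"),
        ("P2", "human", "DOM_B"),
        ("P3", "human", "DOM_C"),
        ("C1", "ecoli", "DOM_A"),
        ("C1", "ecoli", "DOM_B"),
    ],
)
fusions = find_domain_fusions(conn, "human")
```

Because the whole analysis is a single declarative query, it can be rerun whenever the underlying sequence or domain tables are updated, which is the maintenance advantage the abstract highlights.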
Implementation of remote monitoring and managing switches
NASA Astrophysics Data System (ADS)
Leng, Junmin; Fu, Guo
2010-12-01
In order to strengthen network safety and provide greater convenience and efficiency for operators and managers, a system for remotely monitoring and managing switches has been designed and implemented using advanced network technology and existing network resources. A fast-speed Internet Protocol camera (FS IP Camera) is selected, which has a 32-bit RISC embedded processor and supports a number of protocols. An optimal image compression algorithm, Motion-JPEG, is adopted so that high-resolution images can be transmitted over narrow network bandwidth. The architecture of the whole monitoring and managing system is designed and implemented according to the current infrastructure of the network and switches. The control and administrative software is designed. The dynamic web page development platform Java Server Pages (JSP) is utilized in the system. A SQL (Structured Query Language) Server database is used to save and access image information, network messages and users' data. The reliability and security of the system are further strengthened by access control. The software in the system is cross-platform, so that multiple operating systems (UNIX, Linux and Windows) are supported. The application of the system can greatly reduce manpower costs and can quickly find and solve problems.
Zhou, Jindan; Rudd, Kenneth E.
2013-01-01
EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySql relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to Genbank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection. PMID:23197660
Chang, Shu-Jun; Wu, Jay
2014-01-01
The rapid development of picture archiving and communication systems (PACSs) thoroughly changes the way of medical informatics communication and management. However, as the scale of a hospital's operations increases, the large amount of digital images transferred in the network inevitably decreases system efficiency. In this study, a server cluster consisting of two server nodes was constructed. Network load balancing (NLB), distributed file system (DFS), and structured query language (SQL) duplication services were installed. A total of 1 to 16 workstations were used to transfer computed radiography (CR), computed tomography (CT), and magnetic resonance (MR) images simultaneously to simulate the clinical situation. The average transmission rate (ATR) was analyzed between the cluster and noncluster servers. In the download scenario, the ATRs of CR, CT, and MR images increased by 44.3%, 56.6%, and 100.9%, respectively, when using the server cluster, whereas the ATRs increased by 23.0%, 39.2%, and 24.9% in the upload scenario. In the mix scenario, the transmission performance increased by 45.2% when using eight computer units. The fault tolerance mechanisms of the server cluster maintained the system availability and image integrity. The server cluster can improve the transmission efficiency while maintaining high reliability and continuous availability in a healthcare environment. PMID:24701580
Web application for detailed real-time database transaction monitoring for CMS condition data
NASA Astrophysics Data System (ADS)
de Gruttola, Michele; Di Guida, Salvatore; Innocente, Vincenzo; Pierro, Antonio
2012-12-01
In the upcoming LHC era, databases have become an essential part of the experiments collecting data from the LHC, in order to safely store, and consistently retrieve, a large amount of data produced by different sources. In the CMS experiment at CERN, all this information is stored in ORACLE databases, hosted on several servers, both inside and outside the CERN network. In this scenario, the task of monitoring different databases is a crucial database administration issue, since different information may be required depending on different users' tasks such as data transfer, inspection, planning and security issues. We present here a web application based on a Python web framework and Python modules for data mining purposes. To customize the GUI we record traces of user interactions that are used to build use case models. In addition, the application detects errors in database transactions (for example, a mistake made by a user, an application failure, an unexpected network shutdown or a Structured Query Language (SQL) statement error) and provides warning messages from the different users' perspectives. Finally, in order to fulfill the requirements of the CMS experiment community, and to keep up with new developments in many Web client tools, our application was further developed, and new features were deployed.
JASSA: a comprehensive tool for prediction of SUMOylation sites and SIMs.
Beauclair, Guillaume; Bridier-Nahmias, Antoine; Zagury, Jean-François; Saïb, Ali; Zamborlini, Alessia
2015-11-01
Post-translational modification by the Small Ubiquitin-like Modifier (SUMO) proteins, a process termed SUMOylation, is involved in many fundamental cellular processes. SUMO proteins are conjugated to a protein substrate, creating an interface for the recruitment of cofactors harboring SUMO-interacting motifs (SIMs). Mapping both SUMO-conjugation sites and SIMs is required to study the functional consequence of SUMOylation. To define the best candidate sites for experimental validation we designed JASSA, a Joint Analyzer of SUMOylation site and SIMs. JASSA is a predictor that uses a scoring system based on a Position Frequency Matrix derived from the alignment of experimental SUMOylation sites or SIMs. Compared with existing web-tools, JASSA displays on par or better performances. Novel features were implemented towards a better evaluation of the prediction, including identification of database hits matching the query sequence and representation of candidate sites within the secondary structural elements and/or the 3D fold of the protein of interest, retrievable from deposited PDB files. JASSA is freely accessible at http://www.jassa.fr/. Website is implemented in PHP and MySQL, with all major browsers supported.
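A scoring system based on a Position Frequency Matrix, as mentioned above, can be sketched as follows. This is a generic illustration and not JASSA's actual scoring function: the four aligned training sites are invented toy sequences, and summing per-position frequencies is an assumed, simplified score.

```python
from collections import Counter


def build_pfm(aligned_sites):
    """Per-position residue frequencies from equal-length aligned sites."""
    total = len(aligned_sites)
    width = len(aligned_sites[0])
    pfm = []
    for i in range(width):
        counts = Counter(site[i] for site in aligned_sites)
        pfm.append({residue: n / total for residue, n in counts.items()})
    return pfm


def score_window(pfm, window):
    """Sum the frequency of each observed residue at its position."""
    return sum(column.get(residue, 0.0) for column, residue in zip(pfm, window))


def scan(pfm, sequence):
    """Score every window of the matrix width along the sequence."""
    width = len(pfm)
    return [
        (start, sequence[start:start + width],
         score_window(pfm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Toy aligned sites (invented), loosely shaped like a hydrophobic-K-x-E motif.
training_sites = ["VKTE", "IKQE", "LKEE", "VKAE"]
pfm = build_pfm(training_sites)
hits = scan(pfm, "MAVKTEDL")
best_hit = max(hits, key=lambda hit: hit[2])
```

In practice such matrices are usually converted to log-odds scores against background residue frequencies and combined with a cut-off; the plain sum here only keeps the sketch short.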
Zhou, Jindan; Rudd, Kenneth E
2013-01-01
EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySql relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to Genbank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection.
Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services
Zamani, Hamid
2017-01-01
Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective to establish an interactive BDA platform with simulated patient data using open-source software technologies was achieved by construction of a platform framework with Hadoop Distributed File System (HDFS) using HBase (key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files revealed sustained availability over hundreds of iterations; however, to complete MapReduce to HBase required a week (for 10 TB) and a month for three billion (30 TB) indexed patient records, respectively. Found inconsistencies of MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. Hospital system based on patient-centric data was challenging in using HBase, whereby not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to achieve secured patient data while querying entire hospital volumes in a simplified clinical event model across clinical services. PMID:29375652
Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services.
Chrimes, Dillon; Zamani, Hamid
2017-01-01
Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective to establish an interactive BDA platform with simulated patient data using open-source software technologies was achieved by construction of a platform framework with Hadoop Distributed File System (HDFS) using HBase (key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files revealed sustained availability over hundreds of iterations; however, to complete MapReduce to HBase required a week (for 10 TB) and a month for three billion (30 TB) indexed patient records, respectively. Found inconsistencies of MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. Hospital system based on patient-centric data was challenging in using HBase, whereby not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to achieve secured patient data while querying entire hospital volumes in a simplified clinical event model across clinical services.
Shen, Boxuan; Linko, Veikko; Dietz, Hendrik; Toppari, J Jussi
2015-01-01
DNA origami is a widely used method for fabrication of custom-shaped nanostructures. However, to utilize such structures, one needs to controllably position them on the nanoscale. Here we demonstrate how different types of 3D scaffolded multilayer origamis can be accurately anchored to lithographically fabricated nanoelectrodes on a silicon dioxide substrate by dielectrophoresis (DEP). Straight brick-like origami structures, constructed both in square (SQL) and honeycomb lattices, as well as curved "C"-shaped and angular "L"-shaped origamis were trapped with nanoscale precision and single-structure accuracy. We show that the positioning and immobilization of all these structures can be realized with or without thiol-linkers. In general, structural deformations of the origami during the DEP trapping are highly dependent on the shape and the construction of the structure. The SQL brick turned out to be the most robust structure under the high DEP forces, and accordingly, its single-structure trapping yield was also highest. In addition, the electrical conductivity of single immobilized plain brick-like structures was characterized. The electrical measurements revealed that the conductivity is negligible (insulating behavior). However, we observed that the trapping process of the SQL brick equipped with thiol-linkers tended to induce an etched "nanocanyon" in the silicon dioxide substrate. The nanocanyon was formed exactly between the electrodes, that is, at the location of the DEP-trapped origami. The results show that the demonstrated DEP-trapping technique can be readily exploited in assembling and arranging complex multilayered origami geometries. In addition, DNA origamis could be utilized in DEP-assisted deformation of the substrates onto which they are attached.
Development of a platform-independent receiver control system for SISIFOS
NASA Astrophysics Data System (ADS)
Lemke, Roland; Olberg, Michael
1998-05-01
Up to now, receiver control software has been a time-consuming development, usually written by receiver engineers who had mainly the hardware in mind. We present a low-cost and very flexible system which uses a minimal interface to the real hardware, and which is easy to adapt to new receivers. Our system uses Tcl/Tk as a graphical user interface (GUI), SpecTcl as a GUI builder, Pgplot as plotting software, a simple query language (SQL) database for information storage and retrieval, Ethernet socket-to-socket communication and SCPI as a command control language. The complete system is in principle platform independent, but for cost-saving reasons we currently run it on a PC486 running Linux 2.0.30, which is a copylefted Unix. The only hardware-dependent parts are the digital input/output boards and the analog-to-digital and digital-to-analog converters. In the case of the Linux PC we are using a device driver development kit to integrate the boards fully into the kernel of the operating system, which makes them look like ordinary devices. The advantages of this system are, firstly, the low price and, secondly, the clear separation between the different software components, which are available for many operating systems. If it is not possible, due to CPU performance limitations, to run all the software on a single machine, the SQL database or the graphical user interface can be installed on separate computers.
Monitoring performance of a highly distributed and complex computing infrastructure in LHCb
NASA Astrophysics Data System (ADS)
Mathe, Z.; Haen, C.; Stagni, F.
2017-10-01
In order to ensure optimal performance of the LHCb Distributed Computing, based on LHCbDIRAC, it is necessary to be able to inspect the behavior over time of many components: firstly the agents and services on which the infrastructure is built, but also all the computing tasks and data transfers that are managed by this infrastructure. This consists of recording and then analyzing time series of a large number of observables, for which the usage of SQL relational databases is far from optimal. Therefore within DIRAC we have been studying novel possibilities based on NoSQL databases (ElasticSearch, OpenTSDB and InfluxDB). As a result of this study we developed a new monitoring system based on ElasticSearch. It has been deployed on the LHCb Distributed Computing infrastructure, for which it collects data from all the components (agents, services, jobs) and allows creating reports through Kibana and a web user interface, which is based on the DIRAC web framework. In this paper we describe this new implementation of the DIRAC monitoring system. We give details on the ElasticSearch implementation within the DIRAC general framework, as well as an overview of the advantages of the pipeline aggregation used for creating a dynamic bucketing of the time series. We present the advantages of using the ElasticSearch DSL high-level library for creating and running queries. Finally, we present the performance of the system.
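The dynamic bucketing mentioned above can be illustrated with a minimal sketch. It builds an ElasticSearch-style request body (a `date_histogram` aggregation plus an `avg_bucket` pipeline aggregation) whose bucket width is chosen from the queried time span. The field names `timestamp` and `cpu_load`, the target of about 100 buckets, and the interval ladder are assumptions made for illustration; none of them is taken from the DIRAC implementation, and the `fixed_interval` key assumes a recent ElasticSearch version.

```python
# Hypothetical sketch: choose a bucket width from the time span, then build
# an ElasticSearch-style aggregation body as a plain dict (no client needed).

# Candidate bucket widths in seconds, with ElasticSearch interval strings.
INTERVALS = [(60, "1m"), (300, "5m"), (3600, "1h"), (86400, "1d")]


def pick_interval(span_seconds, target_buckets=100):
    """Return the smallest interval that keeps the bucket count <= target."""
    for seconds, label in INTERVALS:
        if span_seconds / seconds <= target_buckets:
            return label
    return INTERVALS[-1][1]  # fall back to the coarsest interval


def build_query(start_ms, end_ms, field="cpu_load"):
    """Build a request body: per-bucket average plus the average of those."""
    interval = pick_interval((end_ms - start_ms) / 1000)
    return {
        "size": 0,
        "query": {"range": {"timestamp": {"gte": start_ms, "lt": end_ms}}},
        "aggs": {
            "over_time": {
                # "fixed_interval" is assumed here; older versions use "interval".
                "date_histogram": {"field": "timestamp", "fixed_interval": interval},
                "aggs": {"mean_value": {"avg": {"field": field}}},
            },
            # Pipeline aggregation: consumes the buckets produced above.
            "overall_mean": {"avg_bucket": {"buckets_path": "over_time>mean_value"}},
        },
    }
```

Widening the time range makes `pick_interval` step up to coarser buckets, so a report over a day and a report over an hour return a comparable number of points.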
Langer, Steve G
2016-06-01
In 2010, the DICOM Data Warehouse (DDW) was launched as a data warehouse for DICOM meta-data. Its chief design goals were to have a flexible database schema that enabled it to index standard patient and study information and modality-specific tags (public and private), and to create a framework to derive computable information (derived tags) from the former items. Furthermore, it was to map the above information to an internally standard lexicon that enables a non-DICOM-savvy programmer to write standard SQL queries and retrieve the equivalent data from a cohort of scanners, regardless of what tag that data element was found in over the changing epochs of DICOM and the ensuing migration of elements from private to public tags. After 5 years, the original design has scaled astonishingly well. Very little has changed in the database schema. The knowledge base is now fluent in over 90 device types. Also, additional stored procedures have been written to compute data that is derivable from standard or mapped tags. Finally, an early concern, that the system would not be able to address the variability of DICOM-SR objects, has been addressed. As of this writing the system is indexing 300 MR, 600 CT, and 2000 other (XA, DR, CR, MG) imaging studies per day. The only remaining issue to be solved is the case of tags that were not prospectively indexed; indeed, this final challenge may lead to a NoSQL, big data approach in a subsequent version.
A spatial database for landslides in northern Bavaria: A methodological approach
NASA Astrophysics Data System (ADS)
Jäger, Daniel; Kreuzer, Thomas; Wilde, Martina; Bemm, Stefan; Terhorst, Birgit
2018-04-01
Landslide databases provide essential information for hazard modeling, damage to buildings and infrastructure, mitigation, and research needs. This study presents the development of a landslide database system named WISL (Würzburg Information System on Landslides), currently storing detailed landslide data for northern Bavaria, Germany, in order to enable scientific queries as well as comparisons with other regional landslide inventories. WISL is based on free open source software solutions (PostgreSQL, PostGIS), which ensures that the various software components work well together and enables further extensions with specific adaptations of self-developed software. Apart from that, WISL was designed to be particularly compatible for easy communication with other databases. As a central prerequisite for standardized, homogeneous data acquisition in the field, a customized data sheet for landslide description was compiled. This sheet also serves as an input mask for all data registration procedures in WISL. A variety of "in-database" solutions for landslide analysis provides the necessary scalability for the database, enabling operations at the local server. In its current state, WISL already enables extensive analysis and queries. This paper presents an example analysis of landslides in Oxfordian Limestones in the northeastern Franconian Alb, northern Bavaria. The results reveal widely differing landslides in terms of geometry and size. Further queries related to landslide activity classify the majority of the landslides as currently inactive; however, they clearly possess a certain potential for remobilization. Along with some active mass movements, a significant percentage of landslides potentially endangers residential areas or infrastructure. The main aspect of future enhancements of the WISL database is related to data extensions in order to increase research possibilities, as well as to transfer the system to other regions and countries.
MFIB: a repository of protein complexes with mutual folding induced by binding.
Fichó, Erzsébet; Reményi, István; Simon, István; Mészáros, Bálint
2017-11-15
It is commonplace that intrinsically disordered proteins (IDPs) are involved in crucial interactions in the living cell. However, the study of protein complexes formed exclusively by IDPs is hindered by the lack of data and such analyses remain sporadic. Systematic studies benefited other types of protein-protein interactions paving a way from basic science to therapeutics; yet these efforts require reliable datasets that are currently lacking for synergistically folding complexes of IDPs. Here we present the Mutual Folding Induced by Binding (MFIB) database, the first systematic collection of complexes formed exclusively by IDPs. MFIB contains an order of magnitude more data than any dataset used in corresponding studies and offers a wide coverage of known IDP complexes in terms of flexibility, oligomeric composition and protein function from all domains of life. The included complexes are grouped using a hierarchical classification and are complemented with structural and functional annotations. MFIB is backed by a firm development team and infrastructure, and together with possible future community collaboration it will provide the cornerstone for structural and functional studies of IDP complexes. MFIB is freely accessible at http://mfib.enzim.ttk.mta.hu/. The MFIB application is hosted by Apache web server and was implemented in PHP. To enrich querying features and to enhance backend performance a MySQL database was also created. Contact: simon.istvan@ttk.mta.hu, meszaros.balint@ttk.mta.hu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
Query Language for Location-Based Services: A Model Checking Approach
NASA Astrophysics Data System (ADS)
Hoareau, Christian; Satoh, Ichiro
We present a model checking approach to the rationale, implementation, and applications of a query language for location-based services. Such query mechanisms are necessary so that users, objects, and/or services can effectively benefit from the location-awareness of their surrounding environment. The underlying data model is founded on a symbolic model of space organized in a tree structure. Once this model is extended to a semantic model for modal logic, we regard location query processing as a model checking problem, and thus define location queries as hybrid logic-based formulas. Our approach is unique among existing research because it explores the connection between location models and query processing in ubiquitous computing systems, relies on a sound theoretical basis, and provides modal logic-based query mechanisms for expressive searches over a decentralized data structure. A prototype implementation is also presented and discussed.
Content-Aware DataGuide with Incremental Index Update using Frequently Used Paths
NASA Astrophysics Data System (ADS)
Sharma, A. K.; Duhan, Neelam; Khattar, Priyanka
2010-11-01
The size of the WWW is increasing day by day. Due to the absence of structured data on the Web, it becomes very difficult for information retrieval tools to fully utilize Web information. As a solution to this problem, XML pages come into play, which provide structural information to the users to some extent. Without efficient indexes, query processing can be quite inefficient because it requires an exhaustive traversal of the XML data. In this paper an improved content-centric approach to the Content-Aware DataGuide, an indexing technique for XML databases, is proposed that uses frequently used paths from historical query logs to improve query performance. The index can be updated incrementally according to changes in the query workload, and thus the overhead of reconstruction can be minimized. Frequently used paths are extracted by applying any sequential pattern mining algorithm to subsequent queries in the query workload. After this, the data structures are incrementally updated. This indexing technique proves to be efficient, as partial matching queries can be executed efficiently and users get more relevant documents in the results.
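The idea of feeding frequently used paths from a query log into an incrementally updated index can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it counts path prefixes with a plain frequency threshold instead of running a sequential pattern mining algorithm, and the class name `FrequentPathIndex`, the `min_support` parameter, and the example paths are all hypothetical.

```python
from collections import Counter


class FrequentPathIndex:
    """Toy index that keeps only paths seen at least `min_support` times."""

    def __init__(self, min_support=2):
        self.min_support = min_support
        self.counts = Counter()   # path -> number of logged queries touching it
        self.indexed = set()      # paths currently materialized in the index

    def update(self, query_paths):
        """Incrementally fold a batch of logged query paths into the index."""
        for path in query_paths:
            steps = path.strip("/").split("/")
            # Count every prefix, since a query on /a/b/c also traverses /a/b.
            for i in range(1, len(steps) + 1):
                self.counts["/" + "/".join(steps[:i])] += 1
        # Only newly frequent paths are added; nothing is rebuilt from scratch.
        newly = {p for p, c in self.counts.items()
                 if c >= self.min_support and p not in self.indexed}
        self.indexed |= newly
        return newly
```

Each call to `update` returns only the paths that crossed the threshold in that batch, which is the part of the index that would need to be extended.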
Virtual file system on NoSQL for processing high volumes of HL7 messages.
Kimura, Eizen; Ishihara, Ken
2015-01-01
The Standardized Structured Medical Information Exchange (SS-MIX) is intended to be the standard repository for HL7 messages that depend on a local file system. However, its scalability is limited. We implemented a virtual file system using NoSQL to incorporate modern computing technology into SS-MIX and allow the system to integrate local patient IDs from different healthcare systems into a universal system. We discuss its implementation using the database MongoDB and describe its performance in a case study.
NASA Astrophysics Data System (ADS)
Hu, Haibin
2017-05-01
Among numerous Web security issues, SQL injection is the most notable and dangerous. In this study, the characteristics and procedures of SQL injection are analyzed, and a method for detecting SQL injection attacks is illustrated. A defense and remedy model for SQL injection attacks is established from the perspective of non-intrusive SQL injection attack and defense. Moreover, the server's ability to resist SQL injection attacks has been comprehensively improved through security strategies for the operating system, IIS, the database, etc. The corresponding code has been implemented, and the method has been applied successfully in actual projects.
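A minimal sketch of why SQL injection works, and of one widely used defense, is given below. It is not the defense model of the study above (which concerns the operating system, IIS and database configuration); it only contrasts string concatenation with a parameterized query, using Python's standard `sqlite3` module and a hypothetical `users` table.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("alice", "a1"), ("bob", "b2")])
    return conn


def lookup_unsafe(conn, name):
    # Vulnerable: user input is spliced directly into the SQL text.
    return conn.execute(
        "SELECT secret FROM users WHERE name = '" + name + "'").fetchall()


def lookup_safe(conn, name):
    # Parameterized: the value is bound separately from the SQL text.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)).fetchall()
```

With the input `x' OR '1'='1`, the concatenated query returns every row, while the parameterized query treats the whole string as a name and returns nothing.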
DOE Office of Scientific and Technical Information (OSTI.GOV)
Robertson, SP; Quon, H; Kiess, AP
Purpose: To develop a framework for automatic extraction of clinically meaningful dosimetric-outcome relationships from an in-house, analytic oncology database. Methods: Dose-volume histograms (DVH) and clinical outcome-related structured data elements have been routinely stored to our database for 513 HN cancer patients treated from 2007 to 2014. SQL queries were developed to extract outcomes that had been assessed for at least 100 patients, as well as DVH curves for organs-at-risk (OAR) that were contoured for at least 100 patients. DVH curves for paired OAR (e.g., left and right parotids) were automatically combined and included as additional structures for analysis. For each OAR-outcome combination, DVH dose points, D(V_t), at a series of normalized volume thresholds, V_t=[0.01,0.99], were stratified into two groups based on outcomes after treatment completion. The probability, P[D(V_t)], of an outcome was modeled at each V_t by logistic regression. Notable combinations, defined as having P[D(V_t)] increase by at least 5% per Gy (p<0.05), were further evaluated for clinical relevance using a custom graphical interface. Results: A total of 57 individual and combined structures and 115 outcomes were queried, resulting in over 6,500 combinations for analysis. Of these, 528 combinations met the 5%/Gy requirement, with further manual inspection revealing a number of reasonable models based on either reported literature or proximity between neighboring OAR. The data mining algorithm confirmed the following well-known toxicity/outcome relationships: dysphagia/larynx, voice changes/larynx, esophagitis/esophagus, xerostomia/combined parotids, and mucositis/oral mucosa. Other notable relationships included dysphagia/pharyngeal constrictors, nausea/brainstem, nausea/spinal cord, weight-loss/mandible, and weight-loss/combined parotids. Conclusion: Our database platform has enabled large-scale analysis of dose-outcome relationships.
The current data-mining framework revealed both known and novel dosimetric and clinical relationships, underscoring the potential utility of this analytic approach. Multivariate models may be necessary to further evaluate the complex relationship between neighboring OARs and observed outcomes. This research was supported through collaborations with Elekta, Philips, and Toshiba.
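The per-threshold model described in the Methods can be written out as below. This is the standard univariate logistic form; the coefficient symbols \beta_0 and \beta_1, and the reading of the "5% per Gy" criterion as a condition on the slope of the fitted probability, are assumptions, since the abstract does not give the exact parameterization.

```latex
P\left[D(V_t)\right] = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \beta_1\, D(V_t)\right)\right)},
\qquad
\frac{\partial P}{\partial D} = \beta_1\, P\,(1 - P)
```

Under this reading, the slope is largest at P = 0.5, where it equals \beta_1/4, so a fitted curve could rise by 5% per Gy somewhere only if \beta_1 is at least 0.2 per Gy.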
NASA Technical Reports Server (NTRS)
Denney, Ewen W.; Naylor, Dwight; Pai, Ganesh
2014-01-01
Querying a safety case to show how the various stakeholders' concerns about system safety are addressed has been put forth as one of the benefits of argument-based assurance (in a recent study by the Health Foundation, UK, which reviewed the use of safety cases in safety-critical industries). However, neither the literature nor current practice offers much guidance on querying mechanisms appropriate for, or available within, a safety case paradigm. This paper presents a preliminary approach that uses a formal basis for querying safety cases, specifically Goal Structuring Notation (GSN) argument structures. Our approach semantically enriches GSN arguments with domain-specific metadata that the query language leverages, along with its inherent structure, to produce views. We have implemented the approach in our toolset AdvoCATE, and illustrate it by application to a fragment of the safety argument for an Unmanned Aircraft System (UAS) being developed at NASA Ames. We also discuss the potential practical utility of our query mechanism within the context of the existing framework for UAS safety assurance.
Schreiweis, Björn; Trinczek, Benjamin; Köpcke, Felix; Leusch, Thomas; Majeed, Raphael W; Wenk, Joachim; Bergh, Björn; Ohmann, Christian; Röhrig, Rainer; Dugas, Martin; Prokosch, Hans-Ulrich
2014-11-01
Reusing data from electronic health records for clinical and translational research, and especially for patient recruitment, has been tackled more broadly for about a decade. Most projects found in the literature, however, focus on standalone systems and proprietary implementations at one particular institution, often for only a single trial, and no generic evaluation of EHR systems for their applicability to support the patient recruitment process exists yet. Thus we sought to assess whether the current generation of EHR systems in Germany provides modules/tools which can readily be applied to IT-supported patient recruitment scenarios. We first analysed the EHR portfolio implemented at German University Hospitals and then selected five sites with five different EHR implementations covering all major commercial systems applied in German University Hospitals. Further, the major functionalities required for patient recruitment support were defined, and the five sample EHRs and their standard tools were compared against these functionalities. In our analysis of the sites' hospital information system environments (with four commercial EHR systems and one self-developed system) we found that, even though no dedicated module for patient recruitment has been provided, most EHR products comprise generic tools such as workflow engines, querying capabilities, report generators and direct SQL-based database access, which can be applied as query modules, screening lists and notification components for patient recruitment support. A major limitation of all current EHR products, however, is that they provide no dedicated data structures and functionalities for implementing and maintaining a local trial registry. At the five sites with standard EHR tools the typical functionalities of the patient recruitment process could be mostly implemented. However, no EHR component is yet directly dedicated to supporting research requirements such as patient recruitment.
We recommend for future developments that EHR customers and vendors focus much more on the provision of dedicated patient recruitment modules. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Russ, Thomas A; Ramakrishnan, Cartic; Hovy, Eduard H; Bota, Mihail; Burns, Gully A P C
2011-08-22
We address the goal of curating observations from published experiments in a generalizable form; reasoning over these observations to generate interpretations and then querying this interpreted knowledge to supply the supporting evidence. We present web-application software as part of the 'BioScholar' project (R01-GM083871) that fully instantiates this process for a well-defined domain: using tract-tracing experiments to study the neural connectivity of the rat brain. The main contribution of this work is to provide the first instantiation of a knowledge representation for experimental observations called 'Knowledge Engineering from Experimental Design' (KEfED) based on experimental variables and their interdependencies. The software has three parts: (a) the KEfED model editor - a design editor for creating KEfED models by drawing a flow diagram of an experimental protocol; (b) the KEfED data interface - a spreadsheet-like tool that permits users to enter experimental data pertaining to a specific model; (c) a 'neural connection matrix' interface that presents neural connectivity as a table of ordinal connection strengths representing the interpretations of tract-tracing data. This tool also allows the user to view experimental evidence pertaining to a specific connection. BioScholar is built in Flex 3.5. It uses Persevere (a noSQL database) as a flexible data store and PowerLoom® (a mature First Order Logic reasoning system) to execute queries using spatial reasoning over the BAMS neuroanatomical ontology. We first introduce the KEfED approach as a general approach and describe its possible role as a way of introducing structured reasoning into models of argumentation within new models of scientific publication. We then describe the design and implementation of our example application: the BioScholar software. 
This is presented as a possible biocuration interface and supplementary reasoning toolkit for a larger, more specialized bioinformatics system: the Brain Architecture Management System (BAMS).
2011-01-01
Background: We address the goal of curating observations from published experiments in a generalizable form; reasoning over these observations to generate interpretations and then querying this interpreted knowledge to supply the supporting evidence. We present web-application software as part of the 'BioScholar' project (R01-GM083871) that fully instantiates this process for a well-defined domain: using tract-tracing experiments to study the neural connectivity of the rat brain. Results: The main contribution of this work is to provide the first instantiation of a knowledge representation for experimental observations called 'Knowledge Engineering from Experimental Design' (KEfED) based on experimental variables and their interdependencies. The software has three parts: (a) the KEfED model editor - a design editor for creating KEfED models by drawing a flow diagram of an experimental protocol; (b) the KEfED data interface - a spreadsheet-like tool that permits users to enter experimental data pertaining to a specific model; (c) a 'neural connection matrix' interface that presents neural connectivity as a table of ordinal connection strengths representing the interpretations of tract-tracing data. This tool also allows the user to view experimental evidence pertaining to a specific connection. BioScholar is built in Flex 3.5. It uses Persevere (a noSQL database) as a flexible data store and PowerLoom® (a mature First Order Logic reasoning system) to execute queries using spatial reasoning over the BAMS neuroanatomical ontology. Conclusions: We first introduce the KEfED approach as a general approach and describe its possible role as a way of introducing structured reasoning into models of argumentation within new models of scientific publication. We then describe the design and implementation of our example application: the BioScholar software.
This is presented as a possible biocuration interface and supplementary reasoning toolkit for a larger, more specialized bioinformatics system: the Brain Architecture Management System (BAMS). PMID:21859449
EarthServer: Cross-Disciplinary Earth Science Through Data Cube Analytics
NASA Astrophysics Data System (ADS)
Baumann, P.; Rossi, A. P.
2016-12-01
The unprecedented increase of imagery, in-situ measurements, and simulation data produced by Earth (and Planetary) Science observation missions bears a rich, yet not leveraged potential for getting insights from integrating such diverse datasets and transforming scientific questions into actual queries to data, formulated in a standardized way. The intercontinental EarthServer [1] initiative is demonstrating new directions for flexible, scalable Earth Science services based on innovative NoSQL technology. Researchers from Europe, the US and Australia have teamed up to rigorously implement the concept of the datacube. Such a datacube may have spatial and temporal dimensions (such as a satellite image time series) and may unite an unlimited number of scenes. Independently from whatever efficient data structuring a server network may perform internally, users (scientists, planners, decision makers) will always see just a few datacubes they can slice and dice. EarthServer has established client [2] and server technology for such spatio-temporal datacubes. The underlying scalable array engine, rasdaman [3,4], enables direct interaction, including 3-D visualization, common EO data processing, and general analytics. Services exclusively rely on the open OGC "Big Geo Data" standards suite, the Web Coverage Service (WCS). Conversely, EarthServer has shaped and advanced WCS based on the experience gained. The first phase of EarthServer has advanced scalable array database technology into 150+ TB services. Currently, Petabyte datacubes are being built for ad-hoc and cross-disciplinary querying, e.g. using climate, Earth observation and ocean data. We will present the EarthServer approach, its impact on OGC / ISO / INSPIRE standardization, and its platform technology, rasdaman. References: [1] Baumann, et al. 
(2015) DOI: 10.1080/17538947.2014.1003106 [2] Hogan, P., (2011) NASA World Wind, Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications ACM. [3] Baumann, Peter, et al. (2014) In Proc. 10th ICDM, 194-201. [4] Dumitru, A. et al. (2014) In Proc ACM SIGMOD Workshop on Data Analytics in the Cloud (DanaC'2014), 1-4.
Enhanced DIII-D Data Management Through a Relational Database
NASA Astrophysics Data System (ADS)
Burruss, J. R.; Peng, Q.; Schachter, J.; Schissel, D. P.; Terpstra, T. B.
2000-10-01
A relational database is being used to serve data about DIII-D experiments. The database is optimized for queries across multiple shots, allowing for rapid data mining by SQL-literate researchers. The relational database relates different experiments and datasets, thus providing a big picture of DIII-D operations. Users are encouraged to add their own tables to the database. Summary physics quantities about DIII-D discharges are collected and stored in the database automatically. Meta-data about code runs, MDSplus usage, and visualization tool usage are collected, stored in the database, and later analyzed to improve computing. Documentation on the database may be accessed through programming languages such as C, Java, and IDL, or through ODBC compliant applications such as Excel and Access. A database-driven web page also provides a convenient means for viewing database quantities through the World Wide Web. Demonstrations will be given at the poster.
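The benefit of a database optimized for queries across multiple shots can be illustrated with a small sketch. The table name `shot_summary`, its columns, and the values are hypothetical stand-ins, not the DIII-D schema, and an in-memory SQLite database replaces the production server.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for a per-shot summary table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE shot_summary (shot INTEGER PRIMARY KEY, "
        "ip_max REAL, beta_n REAL)")
    conn.executemany(
        "INSERT INTO shot_summary VALUES (?, ?, ?)",
        [(1001, 1.2, 2.1), (1002, 1.6, 3.0), (1003, 0.9, 1.4), (1004, 1.7, 3.3)])
    return conn


def shots_above(conn, min_beta_n):
    """One query scans all shots instead of opening each shot's data in turn."""
    rows = conn.execute(
        "SELECT shot FROM shot_summary WHERE beta_n >= ? ORDER BY shot",
        (min_beta_n,))
    return [shot for (shot,) in rows]
```

A single `SELECT` with a `WHERE` clause on a summary quantity returns the matching shot numbers, which can then be used to fetch the detailed data for only those shots.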
New Software for Ensemble Creation in the Spitzer-Space-Telescope Operations Database
NASA Technical Reports Server (NTRS)
Laher, Russ; Rector, John
2004-01-01
Some of the computer pipelines used to process digital astronomical images from NASA's Spitzer Space Telescope require multiple input images in order to generate high-level science and calibration products. The images are grouped into ensembles according to well-documented ensemble-creation rules by making explicit associations in the operations Informix database at the Spitzer Science Center (SSC). The advantage of this approach is that a simple database query can retrieve the required ensemble of pipeline input images. New and improved software for ensemble creation has been developed. The new software is much faster than the existing software because it uses pre-compiled database stored procedures written in Informix SPL (Stored Procedure Language). The new software is also more flexible because the ensemble creation rules are now stored in and read from newly defined database tables. This table-driven approach was implemented so that ensemble rules can be inserted, updated, or deleted without modifying software.
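The table-driven approach can be sketched in a few lines. The schema below (tables `images` and `ensemble_rules`, columns `channel` and `exptype`) is invented for illustration and is not the Spitzer operations schema; SQLite stands in for Informix, and the grouping is done in Python rather than in a stored procedure.

```python
import sqlite3


def setup():
    """Create hypothetical image and rule tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE images (id INTEGER PRIMARY KEY, channel INTEGER, exptype TEXT);
        CREATE TABLE ensemble_rules (rule_name TEXT, channel INTEGER, exptype TEXT);
        INSERT INTO images VALUES (1, 1, 'dark'), (2, 1, 'dark'),
                                  (3, 2, 'dark'), (4, 1, 'flat');
        INSERT INTO ensemble_rules VALUES ('ch1_darks', 1, 'dark');
    """)
    return conn


def build_ensembles(conn):
    """Group image ids by whichever rules are currently in the rules table."""
    ensembles = {}
    query = """
        SELECT r.rule_name, i.id
        FROM ensemble_rules AS r
        JOIN images AS i ON i.channel = r.channel AND i.exptype = r.exptype
        ORDER BY r.rule_name, i.id
    """
    for rule_name, image_id in conn.execute(query):
        ensembles.setdefault(rule_name, []).append(image_id)
    return ensembles
```

Adding a row to `ensemble_rules` changes the ensembles produced by the next call to `build_ensembles`, with no change to the code itself.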
EXP-PAC: providing comparative analysis and storage of next generation gene expression data.
Church, Philip C; Goscinski, Andrzej; Lefèvre, Christophe
2012-07-01
Microarrays and, more recently, RNA sequencing have led to an increase in available gene expression data. How to manage and store this data is becoming a key issue. In response we have developed EXP-PAC, a web-based software package for storage, management and analysis of gene expression and sequence data. Unique to this package is SQL-based querying of gene expression data sets, distributed normalization of raw gene expression data and analysis of gene expression data across experiments and species. This package has been populated with lactation data in the international milk genomic consortium web portal (http://milkgenomics.org/). Source code is also available which can be hosted on a Windows, Linux or Mac APACHE server connected to a private or public network (http://mamsap.it.deakin.edu.au/~pcc/Release/EXP_PAC.html). Copyright © 2012 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Xu, Mingzhu; Gao, Zhiqiang; Ning, Jicai
2014-10-01
To improve the access efficiency of geoscience data, efficient data models and storage solutions should be used. In existing storage solutions, geoscience data is usually classified by format or coordinate system. When the data volume is large, this makes searching for geographic features inefficient. In this study, a geographical information integration system for Shandong province, China was developed based on ArcGIS Engine, .NET, and SQL Server. It uses the Geodatabase spatial data model and ArcSDE to organize and store spatial and attribute data, and establishes a geoscience database of Shandong. Seven function modules were designed: map browse, database and subject management, layer control, map query, spatial analysis and map symbolization. Because its data can be browsed and managed by geoscience subject, the system is convenient for geographic researchers and decision-making departments to use.
UCbase 2.0: ultraconserved sequences database (2014 update)
Lomonaco, Vincenzo; Martoglia, Riccardo; Mandreoli, Federica; Anderlucci, Laura; Emmett, Warren; Bicciato, Silvio; Taccioli, Cristian
2014-01-01
UCbase 2.0 (http://ucbase.unimore.it) is an update, extension and evolution of UCbase, a Web tool dedicated to the analysis of ultraconserved sequences (UCRs). UCRs are 481 sequences >200 bases sharing 100% identity among human, mouse and rat genomes. They are frequently located in genomic regions known to be involved in cancer or differentially expressed in human leukemias and carcinomas. UCbase 2.0 is a platform-independent Web resource that includes the updated version of the human genome annotation (hg19), information linking disorders to chromosomal coordinates based on the Systematized Nomenclature of Medicine classification, a query tool to search for Single Nucleotide Polymorphisms (SNPs) and a new text box to directly interrogate the database using a MySQL interface. To facilitate the interactive visual interpretation of UCR chromosomal positioning, UCbase 2.0 now includes a graph visualization interface directly linked to UCSC genome browser. Database URL: http://ucbase.unimore.it PMID:24951797
The Development and Preliminary Application of Plant Quarantine Remote Teaching System in China
NASA Astrophysics Data System (ADS)
Wu, Zhigang; Li, Zhihong; Yang, Ding; Zhang, Guozhen
With the development of modern information technology, the traditional teaching mode increasingly falls short of the requirements of modern education. Plant Quarantine has been accepted as a common course at agricultural universities in China since the country's entry into the WTO. However, the teaching resources for this course are insufficient, especially at universities with a weak foundation in the subject. E-learning is regarded as one way to solve the shortage of teaching resources. PQRTS (Plant Quarantine Remote Teaching System) was designed and developed with JSP (JavaServer Pages), MySQL and Tomcat in this study. The system includes many kinds of plant quarantine teaching resources, such as an international glossary, regulations and standards, multimedia information on quarantine processes and pests, ppt files for teaching, and training exercises. The system prototype implements the functions of remote learning, querying, management, examination and remote discussion. It could be a tool for teaching, teaching assistance and online learning.
LSST communications middleware implementation
NASA Astrophysics Data System (ADS)
Mills, Dave; Schumacher, German; Lotz, Paul
2016-07-01
The LSST communications middleware is based on a set of software abstractions, which provide standard interfaces for common communications services. The observatory requires communication between diverse subsystems, implemented by different contractors, and comprehensive archiving of subsystem status data. The Service Abstraction Layer (SAL) is implemented using open source packages that implement the open standards DDS (Data Distribution Service) for data communication and SQL (Structured Query Language) for database access. For every subsystem, abstractions for each of the Telemetry datastreams, along with Command/Response and Events, have been agreed with the appropriate component vendor (such as Dome, TMA, Hexapod), and captured in ICDs (Interface Control Documents). The OpenSplice (Prismtech) Community Edition of DDS provides an LGPL licensed distribution which may be freely redistributed. The availability of the full source code provides assurances that the project will be able to maintain it over the full 10-year survey, independent of the fortunes of the original providers.
Model for Semantically Rich Point Cloud Data
NASA Astrophysics Data System (ADS)
Poux, F.; Neuville, R.; Hallot, P.; Billen, R.
2017-10-01
This paper proposes an interoperable model for managing high dimensional point clouds while integrating semantics. Point clouds from sensors are a direct source of information physically describing a 3D state of the recorded environment. As such, they are an exhaustive representation of the real world at every scale: 3D reality-based spatial data. Their generation is increasingly fast, but processing routines and data models lack the knowledge needed to reason from information extraction rather than interpretation. The proposed smart point cloud model brings intelligence to point clouds via three connected meta-models, while linking available knowledge and classification procedures that permit semantic injection. Interoperability drives the adaptation of the model to potentially many applications through specialized domain ontologies. A first prototype is implemented in Python with a PostgreSQL database and combines semantic and spatial concepts for basic hybrid queries on different point clouds.
Development of yarn breakage detection software system based on machine vision
NASA Astrophysics Data System (ADS)
Wang, Wenyuan; Zhou, Ping; Lin, Xiangyu
2017-10-01
To address the problem that yarn breakage in spinning mills cannot be detected in a timely manner, and to reduce costs for textile enterprises, this paper presents a software system based on computer vision for real-time detection of yarn breakage. The system combines a Tablet PC running Windows 8.1 with a cloud server to perform yarn breakage detection and management. The software running on the Tablet PC collects yarn and location information for analysis and processing. The processed information is then sent via Wi-Fi and the HTTP protocol to the cloud server and stored in a Microsoft SQL2008 database, to support subsequent querying and management of yarn break information. Finally, the results are sent to a local display in a timely manner to remind the operator to deal with the broken yarn. The experimental results show that the missed detection rate of the system is not more than 5‰, with no false detections.
A Semantic Graph Query Language
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kaplan, I L
2006-10-16
Semantic graphs can be used to organize large amounts of information from a number of sources into one unified structure. A semantic query language provides a foundation for extracting information from the semantic graph. The graph query language described here provides a simple, powerful method for querying semantic graphs.
EarthServer: Use of Rasdaman as a data store for use in visualisation of complex EO data
NASA Astrophysics Data System (ADS)
Clements, Oliver; Walker, Peter; Grant, Mike
2013-04-01
The European Commission FP7 project EarthServer is establishing open access and ad-hoc analytics on extreme-size Earth Science data, based on and extending cutting-edge Array Database technology. EarthServer is built around the Rasdaman Raster Data Manager which extends standard relational database systems with the ability to store and retrieve multi-dimensional raster data of unlimited size through an SQL style query language. Rasdaman facilitates visualisation of data by providing several Open Geospatial Consortium (OGC) standard interfaces through its web services wrapper, Petascope. These include the well established standards, Web Coverage Service (WCS) and Web Map Service (WMS) as well as the emerging standard, Web Coverage Processing Service (WCPS). The WCPS standard allows the running of ad-hoc queries on the data stored within Rasdaman, creating an infrastructure where users are not restricted by bandwidth when manipulating or querying huge datasets. Here we will show that the use of EarthServer technologies and infrastructure allows access and visualisation of massive scale data through a web client with only marginal bandwidth use as opposed to the current mechanism of copying huge amounts of data to create visualisations locally. For example if a user wanted to generate a plot of global average chlorophyll for a complete decade time series they would only have to download the result instead of Terabytes of data. Firstly we will present a brief overview of the capabilities of Rasdaman and the WCPS query language to introduce the ways in which it is used in a visualisation tool chain. We will show that there are several ways in which WCPS can be utilised to create both standard and novel web based visualisations. An example of a standard visualisation is the production of traditional 2d plots, allowing users the ability to plot data products easily. 
However, the query language allows the creation of novel/custom products, which can then immediately be plotted with the same system. For more complex multi-spectral data, WCPS allows the user to explore novel combinations of bands in standard band-ratio algorithms through a web browser with dynamic updating of the resultant image. To visualise very large datasets Rasdaman has the capability to dynamically scale a dataset or query result so that it can be appraised quickly for use in later unscaled queries. All of these techniques are accessible through a web based GIS interface increasing the number of potential users of the system. Lastly we will show the advances in dynamic web based 3D visualisations being explored within the EarthServer project. By utilising the emerging declarative 3D web standard X3DOM as a tool to visualise the results of WCPS queries we introduce several possible benefits, including quick appraisal of data for outliers or anomalous data points and visualisation of the uncertainty of data alongside the actual data values.
Maintaining Multimedia Data in a Geospatial Database
2012-09-01
at PostgreSQL and MySQL as spatial databases was offered. Given their results, as each database produced result sets from zero to 100,000, it was...excelled given multiple conditions. A different look at PostgreSQL and MySQL as spatial databases was offered. Given their results, as each database... MySQL ................................................................................................14 B. BENCHMARKING DATA RETRIEVED FROM TABLE
Boes, Peter; Ho, Meng Wei; Li, Zuofeng
2015-01-01
Image‐guided radiotherapy (IGRT), based on radiopaque markers placed in the prostate gland, was used for proton therapy of prostate patients. Orthogonal X‐rays and the IBA Digital Image Positioning System (DIPS) were used for setup correction prior to treatment and were repeated after treatment delivery. Following a rationale for margin estimates similar to that of van Herk,(1) the daily post‐treatment DIPS data were analyzed to determine if an adaptive radiotherapy plan was necessary. A Web application using ASP.NET MVC5, Entity Framework, and an SQL database was designed to automate this process. The designed features included state‐of‐the‐art Web technologies, a domain model closely matching the workflow, a database supporting concurrency and data mining, access to the DIPS database, secured user access and roles management, and graphing and analysis tools. The Model‐View‐Controller (MVC) paradigm allowed clean domain logic, unit testing, and extensibility. Client‐side technologies, such as jQuery, jQuery Plug‐ins, and Ajax, were adopted to achieve a rich user environment and fast response. Data models included patients, staff, treatment fields and records, correction vectors, DIPS images, and association logics. Data entry, analysis, workflow logics, and notifications were implemented. The system effectively modeled the clinical workflow and IGRT process. PACS number: 87 PMID:26103504
Information Retrieval Using UMLS-based Structured Queries
Fagan, Lawrence M.; Berrios, Daniel C.; Chan, Albert; Cucina, Russell; Datta, Anupam; Shah, Maulik; Surendran, Sujith
2001-01-01
During the last three years, we have developed and described components of ELBook, a semantically based information-retrieval system [1-4]. Using these components, domain experts can specify a query model, indexers can use the query model to index documents, and end-users can search these documents for instances of indexed queries.
Querying databases of trajectories of differential equations: Data structures for trajectories
NASA Technical Reports Server (NTRS)
Grossman, Robert
1989-01-01
One approach to qualitative reasoning about dynamical systems is to extract qualitative information by searching or making queries on databases containing very large numbers of trajectories. The efficiency of such queries depends crucially upon finding an appropriate data structure for trajectories of dynamical systems. Suppose that a large number of parameterized trajectories $\gamma$ of a dynamical system evolving in $\mathbb{R}^N$ are stored in a database. Let $\eta \subset \mathbb{R}^N$ denote a parameterized path in Euclidean space, and let $\|\cdot\|$ denote a norm on the space of paths. A data structure is defined to represent trajectories of dynamical systems, and an algorithm is sketched which answers queries.
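The kind of query this abstract describes can be illustrated with a minimal sketch. It assumes that trajectories are stored as equally sampled lists of points and that a query asks for every stored trajectory within a tolerance of a given path under a discrete Euclidean norm. The names `path_distance` and `query_trajectories`, the sampling scheme, and the brute-force scan are illustrative assumptions, not the data structure or algorithm defined in the report.

```python
import math

def path_distance(gamma, eta):
    """Discrete Euclidean distance between two equally sampled paths."""
    if len(gamma) != len(eta):
        raise ValueError("paths must have the same number of samples")
    total = 0.0
    for p, q in zip(gamma, eta):
        # Squared distance between corresponding sample points.
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return math.sqrt(total)

def query_trajectories(database, eta, eps):
    """Return the keys of all stored trajectories within eps of the path eta."""
    return [key for key, gamma in database.items()
            if path_distance(gamma, eta) <= eps]

# Hypothetical toy database of two trajectories in the plane.
database = {
    "g1": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "g2": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
}
eta = [(0.0, 0.1), (1.0, 1.1), (2.0, 2.1)]
matches = query_trajectories(database, eta, eps=0.5)
print(matches)
```

A linear scan like this costs time proportional to the number of stored trajectories, which is presumably why the choice of data structure matters for very large databases.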
Lee, HoJoon; Palm, Jennifer; Grimes, Susan M; Ji, Hanlee P
2015-10-27
The Cancer Genome Atlas (TCGA) project has generated genomic data sets covering over 20 malignancies. These data provide valuable insights into the underlying genetic and genomic basis of cancer. However, exploring the relationship among TCGA genomic results and clinical phenotype remains a challenge, particularly for individuals lacking formal bioinformatics training. Overcoming this hurdle is an important step toward the wider clinical translation of cancer genomic/proteomic data and implementation of precision cancer medicine. Several websites such as the cBio portal or University of California Santa Cruz genome browser make TCGA data accessible but lack interactive features for querying clinically relevant phenotypic associations with cancer drivers. To enable exploration of the clinical-genomic driver associations from TCGA data, we developed the Cancer Genome Atlas Clinical Explorer. The Cancer Genome Atlas Clinical Explorer interface provides a straightforward platform to query TCGA data using one of the following methods: (1) searching for clinically relevant genes, micro RNAs, and proteins by name, cancer types, or clinical parameters; (2) searching for genomic/proteomic profile changes by clinical parameters in a cancer type; or (3) testing two-hit hypotheses. SQL queries run in the background and results are displayed on our portal in an easy-to-navigate interface according to user's input. To derive these associations, we relied on elastic-net estimates of optimal multiple linear regularized regression and clinical parameters in the space of multiple genomic/proteomic features provided by TCGA data. Moreover, we identified and ranked gene/micro RNA/protein predictors of each clinical parameter for each cancer. The robustness of the results was estimated by bootstrapping. 
Overall, we identify associations of potential clinical relevance among genes/micro RNAs/proteins using our statistical analysis from 25 cancer types and 18 clinical parameters that include clinical stage or smoking history. The Cancer Genome Atlas Clinical Explorer enables the cancer research community and others to explore clinically relevant associations inferred from TCGA data. With its accessible web and mobile interface, users can examine queries and test hypotheses regarding genomic/proteomic alterations across a broad spectrum of malignancies.
2009-01-01
Oracle 9i, 10g MySQL MS SQL Server MS SQL Server Operating System Supported Windows 2003 Server Windows 2000 Server (32 bit...WebStar (Mac OS X) SunOne Internet Information Services (IIS) Database Server Supported MS SQL Server MS SQL Server Oracle 9i, 10g...challenges of Web-based surveys are: 1) identifying the best Commercial Off the Shelf (COTS) Web-based survey packages to serve the particular
An Optimization of the Basic School Military Occupational Skill Assignment Process
2003-06-01
Corps Intranet (NMCI) supports it. We evaluated the use of Microsoft’s SQL Server, but dismissed this after learning that TBS did not possess a SQL ...Server license or a qualified SQL Server administrator. SQL Server would have provided for additional security measures not available in MS...administrator. Although not as powerful as SQL Server, MS Access can handle the multi-user environment necessary for this system. The training
Fall 2014 Data-Intensive Systems
2014-10-29
Oct 2014 © 2014 Carnegie Mellon University Big Data Systems NoSQL and horizontal scaling are changing architecture principles by creating...University Status LEAP4BD • Ready to pilot QuABase • Prototype is complete – covers 8 NoSQL /NewSQL implementations • Completing validation testing Big...machine learning to automate population of knowledge base • Initial focus on NoSQL /NewSQL technology domain • Extend to create knowledge bases in other
Similarity analysis of spectra obtained via reflectance spectrometry in legal medicine.
Belenki, Liudmila; Sterzik, Vera; Bohnert, Michael
2014-02-01
In the present study, a series of reflectance spectra of postmortem lividity, pallor, and putrefaction-affected skin for 195 investigated cases in the course of cooling down the corpse has been collected. The reflectance spectrometric measurements were stored together with their respective metadata in a MySQL database. The latter has been managed via a scientific information repository. We propose similarity measures and a criterion of similarity that capture similar spectra recorded on corpse skin. We systematically clustered reflectance spectra from the database as well as their metadata, such as case number, age, sex, skin temperature, duration of cooling, and postmortem time, with respect to the given criterion of similarity. Altogether, more than 500 reflectance spectra have been compared pairwise. The measures that have been used to compare a pair of reflectance curve samples include the Euclidean distance between curves and the Euclidean distance between derivatives of the functions represented by the reflectance curves at the same wavelengths in the spectral range of visible light between 380 and 750 nm. For each case, using the recorded reflectance curves and the similarity criterion, the postmortem time interval during which a characteristic change in the shape of reflectance spectrum takes place is estimated. The latter is carried out via a software package composed of Java, Python, and MATLAB scripts that query the MySQL database. We show that in legal medicine, matching and clustering of reflectance curves obtained by means of reflectance spectrometry with respect to a given criterion of similarity can be used to estimate the postmortem interval.
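The two measures named in this abstract, the Euclidean distance between curves and the Euclidean distance between their derivatives at the same wavelengths, can be sketched as follows. The forward finite-difference derivative, the function names, the sample values, and the threshold-based `similar` criterion are assumptions made for illustration; the abstract does not state the authors' actual similarity criterion.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def derivative(values, wavelengths):
    """Forward finite-difference derivative of a sampled curve."""
    return [
        (values[i + 1] - values[i]) / (wavelengths[i + 1] - wavelengths[i])
        for i in range(len(values) - 1)
    ]

def curve_distances(r1, r2, wavelengths):
    """Distance between curves and distance between their derivatives."""
    d_curve = euclidean(r1, r2)
    d_deriv = euclidean(derivative(r1, wavelengths), derivative(r2, wavelengths))
    return d_curve, d_deriv

def similar(r1, r2, wavelengths, tol_curve, tol_deriv):
    """Hypothetical criterion: both distances must stay below their tolerances."""
    d_curve, d_deriv = curve_distances(r1, r2, wavelengths)
    return d_curve <= tol_curve and d_deriv <= tol_deriv

# Made-up reflectance samples at four wavelengths (nm) in the visible range.
wavelengths = [380, 480, 580, 680]
spectrum_a = [0.2, 0.3, 0.5, 0.6]
spectrum_b = [0.3, 0.4, 0.6, 0.7]  # same shape, shifted upward by 0.1
d_curve, d_deriv = curve_distances(spectrum_a, spectrum_b, wavelengths)
print(d_curve, d_deriv)
```

In this toy case the curves differ by a constant offset, so the curve distance is nonzero while the derivative distance is essentially zero, which shows why the two measures carry different information about shape.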
A web-based data visualization tool for the MIMIC-II database.
Lee, Joon; Ribey, Evan; Wallace, James R
2016-02-04
Although MIMIC-II, a public intensive care database, has been recognized as an invaluable resource for many medical researchers worldwide, becoming a proficient MIMIC-II researcher requires knowledge of SQL programming and an understanding of the MIMIC-II database schema. These are challenging requirements especially for health researchers and clinicians who may have limited computer proficiency. In order to overcome this challenge, our objective was to create an interactive, web-based MIMIC-II data visualization tool that first-time MIMIC-II users can easily use to explore the database. The tool offers two main features: Explore and Compare. The Explore feature enables the user to select a patient cohort within MIMIC-II and visualize the distributions of various administrative, demographic, and clinical variables within the selected cohort. The Compare feature enables the user to select two patient cohorts and visually compare them with respect to a variety of variables. The tool is also helpful to experienced MIMIC-II researchers who can use it to substantially accelerate the cumbersome and time-consuming steps of writing SQL queries and manually visualizing extracted data. Any interested researcher can use the MIMIC-II data visualization tool for free to quickly and conveniently conduct a preliminary investigation on MIMIC-II with a few mouse clicks. Researchers can also use the tool to learn the characteristics of the MIMIC-II patients. Since it is still impossible to conduct multivariable regression inside the tool, future work includes adding analytics capabilities. Also, the next version of the tool will aim to utilize MIMIC-III which contains more data.
Building a semi-automatic ontology learning and construction system for geosciences
NASA Astrophysics Data System (ADS)
Babaie, H. A.; Sunderraman, R.; Zhu, Y.
2013-12-01
We are developing an ontology learning and construction framework that allows continuous, semi-automatic knowledge extraction, verification, validation, and maintenance by potentially a very large group of collaborating domain experts in any geosciences field. The system brings geoscientists from the side-lines to the center stage of ontology building, allowing them to collaboratively construct and enrich new ontologies, and merge, align, and integrate existing ontologies and tools. These constantly evolving ontologies can more effectively address the community's interests, purposes, tools, and change. The goal is to minimize the cost and time of building ontologies, and maximize the quality, usability, and adoption of ontologies by the community. Our system will be a domain-independent ontology learning framework that applies natural language processing, allowing users to enter their ontology in a semi-structured form, and a combined Semantic Web and Social Web approach that enables direct participation by geoscientists who have no skill in the design and development of their domain ontologies. A controlled natural language (CNL) interface and an integrated authoring and editing tool automatically convert syntactically correct CNL text into formal OWL constructs. The WebProtege-based system will allow a potentially large group of geoscientists, from multiple domains, to crowd source and participate in the structuring of their knowledge model by sharing their knowledge through critiquing, testing, verifying, adopting, and updating of the concept models (ontologies). We will use cloud storage for all data and knowledge base components of the system, such as users, domain ontologies, discussion forums, and semantic wikis that can be accessed and queried by geoscientists in each domain. We will use NoSQL databases such as MongoDB as a service in the cloud environment.
MongoDB uses the lightweight JSON format, which makes it convenient and easy to build Web applications using just HTML5 and JavaScript, thereby avoiding the cumbersome server-side coding present in traditional approaches. The JSON format used in MongoDB is also suitable for storing and querying RDF data. We will store the domain ontologies and associated linked data in JSON/RDF formats. Our Web interface will be built upon the open source and configurable WebProtege ontology editor. We will develop a simplified mobile version of our user interface which will automatically detect the hosting device and adjust the user interface layout to accommodate different screen sizes. We will also use the Semantic Media Wiki that allows the user to store and query the data within the wiki pages. By using HTML5, JavaScript, and WebGL, we aim to create an interactive, dynamic, and multi-dimensional user interface that presents various geosciences data sets in a natural and intuitive way.
Photo-z-SQL: Photometric redshift estimation framework
NASA Astrophysics Data System (ADS)
Beck, Róbert; Dobos, László; Budavári, Tamás; Szalay, Alexander S.; Csabai, István
2017-04-01
Photo-z-SQL is a flexible template-based photometric redshift estimation framework that can be seamlessly integrated into a SQL database (or DB) server and executed on demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and uses the computational capabilities of DB hardware. Photo-z-SQL performs both maximum likelihood and Bayesian estimation and handles inputs of variable photometric filter sets and corresponding broad-band magnitudes.
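Template-based maximum likelihood estimation of the kind mentioned in this abstract can be sketched in a few lines. The sketch assumes Gaussian errors, so the most likely redshift is the one whose template minimizes chi-square after the template amplitude is fitted analytically. It works on fluxes rather than magnitudes, runs in plain Python rather than inside a database server, and uses a made-up `template_grid`; none of these details are taken from Photo-z-SQL itself.

```python
def chi_square(fluxes, errors, model):
    """Chi-square of a template after fitting its amplitude analytically."""
    num = sum(f * m / e ** 2 for f, m, e in zip(fluxes, model, errors))
    den = sum(m * m / e ** 2 for m, e in zip(model, errors))
    amplitude = num / den  # least-squares scale factor for this template
    return sum(((f - amplitude * m) / e) ** 2
               for f, m, e in zip(fluxes, model, errors))

def best_redshift(fluxes, errors, template_grid):
    """Redshift on the grid whose template gives the lowest chi-square."""
    return min(template_grid,
               key=lambda z: chi_square(fluxes, errors, template_grid[z]))

# Hypothetical grid: redshift -> predicted fluxes in three filters.
template_grid = {
    0.1: [1.0, 2.0, 3.0],
    0.5: [1.0, 1.5, 4.0],
    1.0: [3.0, 2.0, 1.0],
}
observed = [2.0, 3.0, 8.0]   # exactly twice the z = 0.5 template
errors = [0.1, 0.1, 0.2]
z_best = best_redshift(observed, errors, template_grid)
print(z_best)
```

A Bayesian variant would multiply the likelihood exp(-chi2 / 2) by a prior over redshift and template before selecting or marginalizing, which is omitted here.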
Quantifying Uncertainty in Expert Judgment: Initial Results
2013-03-01
lines of source code were added in . ---------- C++ = 32%; JavaScript = 29%; XML = 15%; C = 7%; CSS = 7%; Java = 5%; Other = 5% LOC = 927,266...much total effort in person years has been spent on this project? CMU/SEI-2013-TR-001 | 33 5 MySQL, the most popular Open Source SQL...as MySQL, Oracle, PostgreSQL, MS SQL Server, ODBC, or Interbase. Features include email reminders, iCal/vCal import/export, remote subscriptions
A New Publicly Available Chemical Query Language, CSRML ...
A new XML-based query language, CSRML, has been developed for representing chemical substructures, molecules, reaction rules, and reactions. CSRML queries are capable of integrating additional forms of information beyond the simple substructure (e.g., SMARTS) or reaction transformation (e.g., SMIRKS, reaction SMILES) queries currently in use. Chemotypes, a term used to represent advanced CSRML queries for repeated application, can be encoded not only with connectivity and topology, but also with properties of atoms, bonds, electronic systems, or molecules. The CSRML language has been developed in parallel with a public set of chemotypes, i.e., the ToxPrint chemotypes, which are designed to provide excellent coverage of environmental, regulatory and commercial use chemical space, as well as to represent features and frameworks believed to be especially relevant to toxicity concerns. A software application, ChemoTyper, has also been developed and made publicly available to enable chemotype searching and fingerprinting against a target structure set. The public ChemoTyper houses the ToxPrint chemotype CSRML dictionary, as well as a reference implementation so that the query specifications may be adopted by other chemical structure knowledge systems. The full specifications of the XML standard used in CSRML-based chemotypes are publicly available to facilitate and encourage the exchange of structural knowledge. Paper details specifications for a new XML-based query lan
SPARQL Query Re-writing Using Partonomy Based Transformation Rules
NASA Astrophysics Data System (ADS)
Jain, Prateek; Yeh, Peter Z.; Verma, Kunal; Henson, Cory A.; Sheth, Amit P.
Often the information present in a spatial knowledge base is represented at a different level of granularity and abstraction than the query constraints. For querying ontologies containing spatial information, the precise relationships between spatial entities have to be specified in the basic graph pattern of a SPARQL query, which can result in long and complex queries. We present a novel approach to help users intuitively write SPARQL queries to query spatial data, rather than relying on knowledge of the ontology structure. Our framework re-writes queries, using transformation rules to exploit part-whole relations between geographical entities to address the mismatches between query constraints and knowledge base. Our experiments were performed on completely third-party datasets and queries. Evaluations were performed on the Geonames dataset using questions from the National Geographic Bee serialized into SPARQL, and on the British Administrative Geography Ontology using questions from a popular trivia website. These experiments demonstrate high precision in retrieval of results and ease in writing queries.
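The idea of using part-whole relations to bridge a granularity mismatch can be sketched as a simple query expansion. The example assumes a hypothetical `part_of` list of (part, whole) pairs and rewrites a constraint on a region into a SPARQL 1.1 `VALUES` clause that also covers every transitive part of that region. The predicate `:locatedIn`, the entity names, and the function names are invented for illustration, prefix declarations are omitted, and this is not the set of transformation rules used by the authors.

```python
def transitive_parts(part_of, whole):
    """All entities that are directly or indirectly part of `whole`."""
    found, stack = [], [whole]
    while stack:
        current = stack.pop()
        for part, parent in part_of:
            if parent == current and part not in found:
                found.append(part)
                stack.append(part)
    return found

def rewrite_query(subject_var, predicate, region, part_of):
    """Expand a constraint on `region` to the region and all of its parts."""
    regions = [region] + sorted(transitive_parts(part_of, region))
    values = " ".join(f":{r}" for r in regions)
    return (
        f"SELECT {subject_var} WHERE {{ "
        f"VALUES ?region {{ {values} }} "
        f"{subject_var} {predicate} ?region . }}"
    )

# Hypothetical partonomy: (part, whole) pairs.
part_of = [("England", "UK"), ("London", "England"), ("Scotland", "UK")]
query = rewrite_query("?city", ":locatedIn", "UK", part_of)
print(query)
```

With this expansion, a user who asks for things located in the UK also retrieves entities that the knowledge base records only at the level of England, Scotland, or London.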
Towards computational improvement of DNA database indexing and short DNA query searching.
Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska
2014-09-03
In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach, identifying all query hits in the database, without having to examine all entries in the indexed data structure, limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported, if the database is searched against a query shorter than [Formula: see text] nucleotides, such that [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution of this drawback is also presented.
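The general technique of indexing a DNA database with a mapping function can be sketched as follows. The sketch maps each fixed-length word to a base-4 integer and stores its starting positions, then answers a query by looking up the query's first word and verifying the remainder. The encoding, the word length `k`, and the function names are assumptions; the abstract does not give the authors' indexing equation, and this simple version rejects queries shorter than `k` rather than implementing their fix for that case.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def word_to_int(word):
    """Map a DNA word to an integer by reading it as a base-4 number."""
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value

def build_index(dna, k):
    """Index every k-length word of `dna` by its integer code."""
    index = {}
    for i in range(len(dna) - k + 1):
        index.setdefault(word_to_int(dna[i:i + k]), []).append(i)
    return index

def search(dna, index, k, query):
    """Return the start positions of all exact hits of `query` in `dna`."""
    if len(query) < k:
        raise ValueError("this sketch requires len(query) >= k")
    # Only positions whose first k bases match the query are examined.
    candidates = index.get(word_to_int(query[:k]), [])
    return [i for i in candidates if dna[i:i + len(query)] == query]

dna = "ACGTACGTTACG"
k = 3
index = build_index(dna, k)
hits = search(dna, index, k, "ACGT")
print(hits)
```

The lookup touches only the index bucket for the query's first word, so most of the database is never compared against the query.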
[A web-based integrated clinical database for laryngeal cancer].
E, Qimin; Liu, Jialin; Li, Yong; Liang, Chuanyu
2014-08-01
The objective was to establish an integrated database for laryngeal cancer and to provide an information platform for clinical and fundamental research on laryngeal cancer that also meets the needs of clinical and scientific use. Under the guidance of clinical experts, we constructed a web-based integrated clinical database for laryngeal carcinoma on the basis of clinical data standards, Apache+PHP+MySQL technology, the specialist characteristics of laryngeal cancer, and tumor genetic information. A web-based integrated clinical database for laryngeal carcinoma was developed. The database has a user-friendly interface, and data can be entered and queried conveniently. In addition, the system uses clinical data standards and exchanges information with the existing electronic medical records system to avoid information silos. Furthermore, the database forms integrate the specialist characteristics of laryngeal cancer with tumor genetic information. The web-based integrated clinical database for laryngeal carcinoma offers comprehensive specialist information, strong expandability and high technical feasibility, and conforms to the clinical characteristics of the laryngeal cancer specialty. By using clinical data standards and handling clinical data in a structured way, the database can better meet the needs of scientific research and facilitate information exchange, and the information collected and entered about tumor patients is highly informative. In addition, users can access and manipulate the database conveniently and quickly over the Internet.
NASA Astrophysics Data System (ADS)
Karnatak, H.; Pandey, K.; Oberai, K.; Roy, A.; Joshi, D.; Singh, H.; Raju, P. L. N.; Krishna Murthy, Y. V. N.
2014-11-01
National Biodiversity Characterization at Landscape Level, a project jointly sponsored by the Department of Biotechnology and the Department of Space, was implemented to identify and map the potential biodiversity-rich areas in India. This project has generated spatial information at three levels, viz. satellite-based primary information (vegetation type map, spatial locations of roads and villages, fire occurrence); geospatially derived or modelled information (Disturbance Index, Fragmentation, Biological Richness); and geospatially referenced field sample plots. The study provides information on high-disturbance and high-biological-richness areas, suggesting future management strategies and supporting the formulation of action plans. The study has generated, for the first time, a baseline database in India which will be a valuable input for climate change studies in the Indian Subcontinent. The spatial data generated during the study are organized as a central data repository in a Geo-RDBMS environment using PostgreSQL and PostGIS. The raster and vector data are published as OGC WMS and WFS standard services for development of a web-based geoinformation system using Service Oriented Architecture (SOA). The WMS and WFS based system allows geo-visualization, online query and map output generation based on user request and response. This is a typical mashup-architecture-based geo-information system which allows access to remote web services like ISRO Bhuvan, OpenStreetMap, Google Maps etc., with overlay on biodiversity data for effective study of bio-resources. The spatial queries and analysis with vector data are achieved through SQL queries on PostGIS and WFS-T operations. But the most important challenge is to develop a system for online raster-based geo-spatial analysis and processing based on a user-defined Area of Interest (AOI) for large raster data sets. The map data of this study comprise five data layers of approximately 20 GB each.
An attempt has been made to develop a system using Python, PostGIS and PHP for raster data analysis over the web for biodiversity conservation and prioritization. The developed system takes inputs from users as WKT, OpenLayers-based polygon geometry, or shapefile upload as the AOI to perform raster-based operations using Python and GDAL/OGR. The intermediate products are stored in temporary files and tables which generate XML outputs for web representation. Raster operations such as clip-zip-ship, class-wise area statistics, single to multi-layer operations, diagrammatic representation and other geo-statistical analyses are performed. This is an indigenous geospatial data processing engine developed using open system architecture for spatial analysis of biodiversity data sets in an Internet GIS environment. The performance of this application in a multi-user environment such as the Internet domain is another challenging task, which is addressed by fine-tuning the source code, server hardening, spatial indexing and running the process in load-balanced mode. The developed system is hosted in the Internet domain (http://bis.iirs.gov.in) for user access.
Comprehensive Routing Security Development and Deployment for the Internet
2015-02-01
feature enhancement and bug fixes. • MySQL: MySQL is a widely used and popular open source database package. It was chosen for database support in the...RPSTIR depends on several other open source packages. • MySQL: MySQL is used for the local RPKI database cache. • OpenSSL: OpenSSL is used for...cryptographic libraries for X.509 certificates. • ODBC MySQL Connector: ODBC (Open Database Connectivity) is a standard programming interface (API) for
A Semantic Basis for Proof Queries and Transformations
NASA Technical Reports Server (NTRS)
Aspinall, David; Denney, Ewen W.; Luth, Christoph
2013-01-01
We extend the query language PrQL, designed for inspecting machine representations of proofs, to also allow transformation of proofs. PrQL natively supports hiproofs which express proof structure using hierarchically nested labelled trees, which we claim is a natural way of taming the complexity of huge proofs. Query-driven transformations enable manipulation of this structure, in particular, to transform proofs produced by interactive theorem provers into forms that assist their understanding, or that could be consumed by other tools. In this paper we motivate and define basic transformation operations, using an abstract denotational semantics of hiproofs and queries. This extends our previous semantics for queries based on syntactic tree representations. We define update operations that add and remove sub-proofs, and manipulate the hierarchy to group and ungroup nodes. We show that
Teaching Structured Design of Network Algorithms in Enhanced Versions of SQL
ERIC Educational Resources Information Center
de Brock, Bert
2004-01-01
From time to time developers of (database) applications will encounter, explicitly or implicitly, structures such as trees, graphs, and networks. Such applications can, for instance, relate to bills of material, organization charts, networks of (rail)roads, networks of conduit pipes (e.g., plumbing, electricity), telecom networks, and data…
Structuring Legacy Pathology Reports by openEHR Archetypes to Enable Semantic Querying.
Kropf, Stefan; Krücken, Peter; Mueller, Wolf; Denecke, Kerstin
2017-05-18
Clinical information is often stored as free text, e.g. in discharge summaries or pathology reports. These documents are semi-structured using section headers, numbered lists, items and classification strings. However, it is still challenging to retrieve relevant documents since keyword searches applied on complete unstructured documents result in many false positive retrieval results. We are concentrating on the processing of pathology reports as an example for unstructured clinical documents. The objective is to transform reports semi-automatically into an information structure that enables an improved access and retrieval of relevant data. The data is expected to be stored in a standardized, structured way to make it accessible for queries that are applied to specific sections of a document (section-sensitive queries) and for information reuse. Our processing pipeline comprises information modelling, section boundary detection and section-sensitive queries. For enabling a focused search in unstructured data, documents are automatically structured and transformed into a patient information model specified through openEHR archetypes. The resulting XML-based pathology electronic health records (PEHRs) are queried by XQuery and visualized by XSLT in HTML. Pathology reports (PRs) can be reliably structured into sections by a keyword-based approach. The information modelling using openEHR allows saving time in the modelling process since many archetypes can be reused. The resulting standardized, structured PEHRs allow accessing relevant data by retrieving data matching user queries. Mapping unstructured reports into a standardized information model is a practical solution for a better access to data. Archetype-based XML enables section-sensitive retrieval and visualisation by well-established XML techniques. Focussing the retrieval to particular sections has the potential of saving retrieval time and improving the accuracy of the retrieval.
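The keyword-based section boundary detection and section-sensitive querying described above can be illustrated with a small sketch. It is not the authors' pipeline (which targets openEHR archetypes, XML and XQuery); the header keywords and the sample report are hypothetical, and the result is a plain dictionary rather than a PEHR.

```python
import re

# Hypothetical section headers; real pathology reports would need a curated list.
SECTION_KEYWORDS = ["Clinical information", "Macroscopy", "Microscopy", "Diagnosis"]


def split_into_sections(report_text, keywords=SECTION_KEYWORDS):
    """Split a report into {header: body} using header keywords at line starts."""
    pattern = re.compile(
        r"^\s*(" + "|".join(re.escape(k) for k in keywords) + r")\s*:?\s*",
        re.IGNORECASE,
    )
    sections = {}
    current = None
    for line in report_text.splitlines():
        match = pattern.match(line)
        if match:
            # A header keyword opens a new section; the rest of the line is its body.
            current = match.group(1).lower()
            sections[current] = line[match.end():].strip()
        elif current is not None and line.strip():
            # Continuation lines are appended to the section currently open.
            sections[current] = (sections[current] + " " + line.strip()).strip()
    return sections


def search_section(sections, section, term):
    """Section-sensitive query: look for the term only inside one section."""
    return term.lower() in sections.get(section.lower(), "").lower()


REPORT = """Clinical information: suspected carcinoma
Macroscopy: tissue sample, 2 cm
in largest diameter
Diagnosis: chronic inflammation
"""
sections = split_into_sections(REPORT)
found_in_diagnosis = search_section(sections, "Diagnosis", "carcinoma")
found_in_clinical = search_section(sections, "Clinical information", "carcinoma")
```

A keyword search over the whole report would return this document for "carcinoma" even though the diagnosis does not mention it; restricting the search to the diagnosis section avoids that false positive.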
Information Network Model Query Processing
NASA Astrophysics Data System (ADS)
Song, Xiaopu
Information Networking Model (INM) [31] is a novel database model for managing real-world objects and relationships. It naturally and directly supports various kinds of static and dynamic relationships between objects. In INM, objects are networked through various natural and complex relationships. INM Query Language (INM-QL) [30] is designed to explore such information networks, retrieve information about schemas, instances, their attributes, relationships, and context-dependent information, and present query results in a user-specified form. The INM database management system has been implemented using Berkeley DB, and it supports INM-QL. This thesis focuses mainly on the implementation of the subsystem that processes INM-QL effectively and efficiently. The subsystem provides a lexical and syntactic analyzer for INM-QL, and it chooses appropriate evaluation strategies and index mechanisms to process INM-QL queries without the user's intervention. It also uses an intermediate result structure to hold intermediate query results, along with other auxiliary structures to reduce the complexity of query processing.
2014-09-01
NoSQL Data Store Technologies. John Klein, Software Engineering Institute; Patrick Donohoe, Software Engineering Institute; Neil Ernst...distribute data 4. Data Replication – determines how a NoSQL database facilitates reliable, high performance data replication to build
jSPyDB, an open source database-independent tool for data management
NASA Astrophysics Data System (ADS)
Pierro, Giuseppe Antonio; Cavallari, Francesca; Di Guida, Salvatore; Innocente, Vincenzo
2011-12-01
Nowadays, the number of commercial tools available for accessing databases, built on Java or .NET, is increasing. However, many of these applications have several drawbacks: they are usually not open source, they provide interfaces to only one specific kind of database, they are platform-dependent, and they consume a great deal of CPU and memory. jSPyDB is a free web-based tool written in Python and JavaScript. It relies on jQuery and Python libraries, and is intended to provide a simple handler for different database technologies inside a local web browser. Such a tool, exploiting fast access libraries such as SQLAlchemy, is easy to install and configure. The design of this tool envisages three layers. The front-end client side in the local web browser communicates with a backend server. Only the server is able to connect to the different databases for the purposes of performing data definition and manipulation. The server makes the data available to the client, so that the user can display and handle them safely. Moreover, thanks to jQuery libraries, this tool supports export of data in different formats, such as XML and JSON. Finally, by using a set of pre-defined functions, users can create their own customized views for better data visualization. In this way, we optimize the performance of database servers by avoiding short connections and concurrent sessions. In addition, security is enforced since users are not given the possibility to directly execute any SQL statement.
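The idea of exposing only pre-defined functions, rather than letting clients submit SQL, might look roughly like the sketch below. It is an illustration under assumptions, not jSPyDB code: it uses Python's built-in `sqlite3` module instead of SQLAlchemy, and the table, query names and `run_predefined` helper are hypothetical.

```python
import sqlite3

# Hypothetical whitelist: clients choose a query by name and never send SQL text.
PREDEFINED_QUERIES = {
    "runs_by_status": "SELECT id, status FROM runs WHERE status = ? ORDER BY id",
    "count_runs": "SELECT COUNT(*) FROM runs",
}


def run_predefined(conn, name, params=()):
    """Execute a whitelisted query; reject any name that is not pre-defined."""
    if name not in PREDEFINED_QUERIES:
        raise KeyError(f"unknown query: {name}")
    # Values are bound as parameters, so they cannot alter the statement itself.
    cursor = conn.execute(PREDEFINED_QUERIES[name], params)
    return cursor.fetchall()


# In-memory database with a hypothetical table, standing in for a real backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO runs (id, status) VALUES (?, ?)",
    [(1, "done"), (2, "failed"), (3, "done")],
)
done_runs = run_predefined(conn, "runs_by_status", ("done",))
total = run_predefined(conn, "count_runs")
```

Because the server holds the statements and the single connection, the browser client only ever names a view and supplies values, which is one way to obtain the safety property the abstract describes.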
The VO-Dance web application at the IA2 data center
NASA Astrophysics Data System (ADS)
Molinaro, Marco; Knapic, Cristina; Smareglia, Riccardo
2012-09-01
The Italian center for Astronomical Archives (IA2, http://ia2.oats.inaf.it) is a national infrastructure project of the Italian National Institute for Astrophysics (Istituto Nazionale di AstroFisica, INAF) that provides services for the astronomical community. Besides data hosting for the Large Binocular Telescope (LBT) Corporation, the Galileo National Telescope (Telescopio Nazionale Galileo, TNG) Consortium and other telescopes and instruments, IA2 offers proprietary and public data access through user portals (both developed and mirrored) and deploys resources complying with the Virtual Observatory (VO) standards. Archiving systems and web interfaces are developed to be extremely flexible about adding new instruments from other telescopes. VO resource publishing, along with data access portals, implements the International Virtual Observatory Alliance (IVOA) protocols, providing astronomers with new ways of analyzing data. Given the large variety of data flavours and IVOA standards, the need arises for tools that make data ingestion and data publishing easy. This paper describes the VO-Dance tool, which IA2 started developing to publish VO resources dynamically from existing database tables or views. The tool consists of a Java web application, potentially DBMS and platform independent, that internally stores the services' metadata and information, exposes RESTful endpoints to accept VO queries for these services, and dynamically translates calls to these endpoints into SQL queries consistent with the published table or view. In response to a call, VO-Dance translates the database answer back into a VO-compliant form.
Kaas, Quentin; Ruiz, Manuel; Lefranc, Marie-Paule
2004-01-01
IMGT/3Dstructure-DB and IMGT/Structural-Query are a novel 3D structure database and a new tool for immunological proteins. They are part of IMGT, the international ImMunoGenetics information system®, a high-quality integrated knowledge resource specializing in immunoglobulins (IG), T cell receptors (TR), major histocompatibility complex (MHC) and related proteins of the immune system (RPI) of human and other vertebrate species, which consists of databases, Web resources and interactive on-line tools. IMGT/3Dstructure-DB data are described according to the IMGT Scientific chart rules based on the IMGT-ONTOLOGY concepts. IMGT/3Dstructure-DB provides IMGT gene and allele identification of IG, TR and MHC proteins with known 3D structures, domain delimitations, amino acid positions according to the IMGT unique numbering and renumbered coordinate flat files. Moreover IMGT/3Dstructure-DB provides 2D graphical representations (or Collier de Perles) and results of contact analysis. The IMGT/StructuralQuery tool allows search of this database based on specific structural characteristics. IMGT/3Dstructure-DB and IMGT/StructuralQuery are freely available at http://imgt.cines.fr. PMID:14681396
EXTENSIBLE DATABASE FRAMEWORK FOR MANAGEMENT OF UNSTRUCTURED AND SEMI-STRUCTURED DOCUMENTS
NASA Technical Reports Server (NTRS)
Gawdiak, Yuri O. (Inventor); La, Tracy T. (Inventor); Lin, Shu-Chun Y. (Inventor); Malof, David A. (Inventor); Tran, Khai Peter B. (Inventor)
2005-01-01
Method and system for querying a collection of unstructured or semi-structured documents to identify presence of, and provide context and/or content for, keywords and/or keyphrases. The documents are analyzed and assigned a node structure, including an ordered sequence of mutually exclusive node segments or strings. Each node has an associated set of at least four, five or six attributes with node information and can represent a format marker or text, with the last node in any node segment usually being a text node. A keyword (or keyphrase) is specified, and the last node in each node segment is searched for a match with the keyword. When a match is found at a query node, or at a node determined with reference to a query node, the system displays the context and/or the content of the query node.
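The search step described above (match the keyword against the last node of each node segment, then report context and content) can be sketched as follows. This is only an illustration: the abstract does not name the node attributes, so the two-field `Node` class, the use of preceding format markers as "context", and the sample segments are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str   # "format" or "text" (hypothetical attribute names)
    value: str  # format marker name, or the text itself


def find_keyword(segments, keyword):
    """Return (segment_index, context, content) for segments whose last node matches.

    context: values of the format-marker nodes preceding the text node.
    content: the text of the matching last node.
    """
    hits = []
    needle = keyword.lower()
    for index, segment in enumerate(segments):
        if not segment:
            continue
        last = segment[-1]
        # Only the last node of a segment is searched, and only if it is text.
        if last.kind == "text" and needle in last.value.lower():
            context = [n.value for n in segment[:-1] if n.kind == "format"]
            hits.append((index, context, last.value))
    return hits


# Hypothetical document already broken into ordered node segments.
SEGMENTS = [
    [Node("format", "h1"), Node("text", "Budget summary")],
    [Node("format", "p"), Node("format", "b"), Node("text", "The budget grew by 4%")],
    [Node("format", "p"), Node("text", "Schedule unchanged")],
]
hits = find_keyword(SEGMENTS, "budget")
```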
Connecting Provenance with Semantic Descriptions in the NASA Earth Exchange (NEX)
NASA Astrophysics Data System (ADS)
Votava, P.; Michaelis, A.; Nemani, R. R.
2012-12-01
NASA Earth Exchange (NEX) is a data, modeling and knowledge collaboratory that houses NASA satellite data, climate data and ancillary data, where a focused community may come together to share modeling and analysis codes, scientific results, knowledge and expertise on a centralized platform. Two of the main goals of NEX are transparency and repeatability, and to that end we have been adding components that enable tracking of the provenance of both scientific processes and the datasets produced by these processes. As scientific processes become more complex, they are often developed collaboratively, and it becomes increasingly important for the research team to be able to track the development of the process and the datasets that are produced along the way. Additionally, we want to be able to link the processes and the datasets developed on NEX to existing information and knowledge, so that users can query and compare the provenance of any dataset or process with regard to component-specific attributes such as data quality, geographic location, related publications, user comments and annotations. We have developed several ontologies that describe datasets and workflow components available on NEX using the OWL ontology language, as well as a simple ontology that provides a linking mechanism to the collected provenance information. The provenance is captured in two ways: we utilize the existing provenance infrastructure of VisTrails, which is used as a workflow engine on NEX, and we extend the captured provenance using the PROV data model expressed through the PROV-O ontology. We do this in order to link and query the provenance more easily in the context of the existing NEX information and knowledge. The captured provenance graph is processed and stored using RDFLib with a MySQL backend, and can be queried using either RDFLib or SPARQL. As a concrete example, we show how this information is captured during an anomaly detection process in large satellite datasets.
The Chandra Source Catalog: User Interface
NASA Astrophysics Data System (ADS)
Bonaventura, Nina; Evans, I. N.; Harbo, P. N.; Rots, A. H.; Tibbetts, M. S.; Van Stone, D. W.; Zografou, P.; Anderson, C. S.; Chen, J. C.; Davis, J. E.; Doe, S. M.; Evans, J. D.; Fabbiano, G.; Galle, E.; Gibbs, D. G.; Glotfelty, K. J.; Grier, J. D.; Hain, R.; Hall, D. M.; He, X.; Houck, J. C.; Karovska, M.; Lauer, J.; McCollough, M. L.; McDowell, J. C.; Miller, J. B.; Mitschang, A. W.; Morgan, D. L.; Nichols, J. S.; Nowak, M. A.; Plummer, D. A.; Primini, F. A.; Refsdal, B. L.; Siemiginowska, A. L.; Sundheim, B. A.; Winkelman, S. L.
2009-01-01
The Chandra Source Catalog (CSC) is the definitive catalog of all X-ray sources detected by Chandra. The CSC is presented to the user in two tables: the Master Chandra Source Table and the Table of Individual Source Observations. Each distinct X-ray source identified in the CSC is represented by a single master source entry and one or more individual source entries. If a source is unaffected by confusion and pile-up in multiple observations, the individual source observations are merged to produce a master source. In each table, a row represents a source, and each column a quantity that is officially part of the catalog. The CSC contains positions and multi-band fluxes for the sources, as well as derived spatial, spectral, and temporal source properties. The CSC also includes associated source region and full-field data products for each source, including images, photon event lists, light curves, and spectra. The master source properties represent the best estimates of the properties of a source, and are presented in the following categories: Position and Position Errors, Source Flags, Source Extent and Errors, Source Fluxes, Source Significance, Spectral Properties, and Source Variability. The CSC Data Access GUI provides direct access to the source properties and data products contained in the catalog. The user may query the catalog database via a web-style search or an SQL command-line query. Each query returns a table of source properties, along with the option to browse and download associated data products. The GUI is designed to run in a web browser with Java version 1.5 or higher, and may be accessed via a link on the CSC website homepage (http://cxc.harvard.edu/csc/). As an alternative to the GUI, the contents of the CSC may be accessed directly through a URL, using the command-line tool, cURL. Support: NASA contract NAS8-03060 (CXC).
Ingredients for an Integrated Dinner: Parsley, Sage, Rosemary and Thyme
NASA Astrophysics Data System (ADS)
Baumann, Peter
2013-04-01
In 1966, Simon and Garfunkel combined the English traditional "Scarborough Fair" with a counter melody. This is one of the manifold techniques of the Kontrapunktik described by Bach around 1745 in "The Art of the Fugue": combining completely different and seemingly independent melodies (or motifs) into a coherent piece of music, pleasant for the audience. This achievement, transposed into Computer Science, could be of great benefit for geo services as we look at the currently disparate situation: On the one hand, we have metadata - traditionally, they are understood as being small in volume, but rich in content and semantics, and flexibly queryable through the rich body of technologies established over several decades of database research, centering around query languages like SQL. On the other hand, we have data themselves, such as remote sensing and other measured and observed data sets - they are considered difficult to interpret, semantic-poor, and only for clumsy download, as they are the main constituent of what we today call Big Data. The traditional advantages of databases, such as information integration, query flexibility, and scalability seem to be unavailable. These are the melodies that require a kontrapunctic harmonization, leading to a Holy Grail where different information categories enjoy individually tailored support, while an overall integrating framework allows seamless and convenient access and processing by the user. Most of the data categories to be integrated are well known in fact: ontologies, geospatial meshes, spatiotemporal arrays, and free text constitute major ingredients in this orchestration. For many of them, isolated solutions have been presented, and for some of them (like ontologies and text) integration has been achieved already; a complete harmonic integration, though, is still lacking as of today. 
In our talk, we detail our vision on such integration through query models and languages which merge established concepts and novel paradigms in a harmonic way. We present the EarthServer initiative which has set out to demonstrate flexible ad-hoc processing and filtering on massive Earth data sets.
A Novel Database to Rank and Display Archeomagnetic Intensity Data
NASA Astrophysics Data System (ADS)
Donadini, F.; Korhonen, K.; Riisager, P.; Pesonen, L. J.; Kahma, K.
2005-12-01
To understand the content and the causes of the changes in the Earth's magnetic field beyond the observatory records one has to rely on archeomagnetic and lake sediment paleomagnetic data. The regional archeointensity curves are often of different quality and temporally variable, which hampers the global analysis of the data in terms of dipole vs non-dipole field. We have developed a novel archeointensity database application utilizing MySQL, PHP (PHP Hypertext Preprocessor), and the Generic Mapping Tools (GMT) for ranking and displaying geomagnetic intensity data from the last 12,000 years. Our application has the advantage that no specific software is required to query the database and view the results. Querying the database is performed using any Web browser; a fill-out form is used to enter the site location and a minimum ranking value to select the data points to be displayed. The form also offers the option of plotting the data as an archeointensity curve with error bars, and a Virtual Axial Dipole Moment (VADM) or ancient field value (Ba) curve calculated using the CALS7K model (Continuous Archaeomagnetic and Lake Sediment geomagnetic model) of Korte and Constable (2005). The results of a query are displayed on a Web page containing a table summarizing the query parameters, a table showing the archeointensity values satisfying the query parameters, and a plot of VADM or Ba as a function of sample age. The database consists of eight related tables. The main one, INTENSITIES, stores the 3704 archeointensity measurements collected from 159 publications as VADM (and VDM when available) and Ba values, including their standard deviations and sampling locations. It also contains the number of samples and specimens measured from each site. The REFS table stores the references to a particular study. The names, latitudes, and longitudes of the regions where the samples were collected are stored in the SITES table.
The MATERIALS, METHODS, SPECIMEN_TYPES and DATING_METHODS tables store information about the sample materials, intensity determination methods, specimen types and age determination methods. The SIGMA_COUNT table is used indirectly for ranking data according to the number of samples measured and their standard deviations. Each intensity measurement is assigned four scores, each between 0 and 2, based on (i) the number of specimens measured and their standard deviations, (ii) the intensity determination method, (iii) the type of specimens measured and (iv) the materials. The ranking of each data point is calculated as the sum of the four scores and varies between 0 and 8. Additionally, users can select the parameters that will be included in the ranking.
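The ranking rule described above (four scores of 0 to 2 each, summed to a rank between 0 and 8, with user-selectable criteria and a minimum rank filter) can be written down as a short sketch. The criterion names, the record layout and the sample values are hypothetical; how each individual score is derived from the database tables is not specified in the abstract and is not modelled here.

```python
# Hypothetical names for the four scored criteria.
CRITERIA = ("sigma_count", "method", "specimen_type", "material")


def rank_measurement(scores, included=CRITERIA):
    """Sum the 0-2 scores of the selected criteria (0-8 when all four are used)."""
    total = 0
    for name in included:
        if name not in CRITERIA:
            raise ValueError(f"unknown criterion: {name}")
        score = scores[name]
        if not 0 <= score <= 2:
            raise ValueError(f"score for {name} must be between 0 and 2")
        total += score
    return total


def filter_by_rank(measurements, min_rank, included=CRITERIA):
    """Keep the measurements whose rank reaches the user-chosen minimum."""
    return [
        m for m in measurements
        if rank_measurement(m["scores"], included) >= min_rank
    ]


# Two made-up measurements with already assigned scores.
MEASUREMENTS = [
    {"id": "A", "scores": {"sigma_count": 2, "method": 2, "specimen_type": 1, "material": 2}},
    {"id": "B", "scores": {"sigma_count": 0, "method": 1, "specimen_type": 1, "material": 0}},
]
rank_a = rank_measurement(MEASUREMENTS[0]["scores"])
rank_b = rank_measurement(MEASUREMENTS[1]["scores"])
selected = [m["id"] for m in filter_by_rank(MEASUREMENTS, 5)]
```

Passing a shorter `included` tuple mirrors the option of choosing which parameters enter the ranking; the maximum attainable rank then drops accordingly.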
Jung, HaRim; Song, MoonBae; Youn, Hee Yong; Kim, Ung Mo
2015-09-18
A content-matched (CM) range monitoring query over moving objects continually retrieves the moving objects (i) whose non-spatial attribute values are matched to given non-spatial query values; and (ii) that are currently located within a given spatial query range. In this paper, we propose a new query indexing structure, called the group-aware query region tree (GQR-tree), for efficient evaluation of CM range monitoring queries. The primary role of the GQR-tree is to help the server leverage the computational capabilities of moving objects in order to improve the system performance in terms of the wireless communication cost and server workload. Through a series of comprehensive simulations, we verify the superiority of the GQR-tree method over the existing methods.
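The two conditions that define a CM range monitoring query can be made concrete with a naive sketch. This is a brute-force scan on the server side, not the GQR-tree proposed in the paper, and the class names, attribute keys and coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MovingObject:
    object_id: str
    x: float
    y: float
    attributes: dict  # non-spatial attribute values, e.g. {"type": "taxi"}


@dataclass
class CMRangeQuery:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    required: dict  # non-spatial query values that must all match

    def matches(self, obj):
        """True if the object satisfies both the content and the range condition."""
        content_ok = all(
            obj.attributes.get(key) == value for key, value in self.required.items()
        )
        in_range = (
            self.x_min <= obj.x <= self.x_max and self.y_min <= obj.y <= self.y_max
        )
        return content_ok and in_range


def evaluate(query, objects):
    """Re-evaluate the query over all objects; returns the sorted matching ids."""
    return sorted(obj.object_id for obj in objects if query.matches(obj))


objects = [
    MovingObject("taxi1", 2.0, 3.0, {"type": "taxi", "status": "free"}),
    MovingObject("taxi2", 8.0, 8.0, {"type": "taxi", "status": "free"}),
    MovingObject("bus1", 1.0, 1.0, {"type": "bus", "status": "free"}),
]
query = CMRangeQuery(0.0, 0.0, 5.0, 5.0, {"type": "taxi", "status": "free"})
before = evaluate(query, objects)

# taxi2 moves into the query range, so the monitored result changes.
objects[1].x, objects[1].y = 4.0, 4.0
after = evaluate(query, objects)
```

Rescanning every object on each location update is what drives up communication cost and server workload; the GQR-tree is the paper's answer to that, by letting the moving objects do part of the work.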
Spatial and symbolic queries for 3D image data
NASA Astrophysics Data System (ADS)
Benson, Daniel C.; Zick, Gregory L.
1992-04-01
We present a query system for an object-oriented biomedical imaging database containing 3-D anatomical structures and their corresponding 2-D images. The graphical interface facilitates the formation of spatial queries, nonspatial or symbolic queries, and combined spatial/symbolic queries. A query editor is used for the creation and manipulation of 3-D query objects as volumes, surfaces, lines, and points. Symbolic predicates are formulated through a combination of text fields and multiple choice selections. Query results, which may include images, image contents, composite objects, graphics, and alphanumeric data, are displayed in multiple views. Objects returned by the query may be selected directly within the views for further inspection or modification, or for use as query objects in subsequent queries. Our image database query system provides visual feedback and manipulation of spatial query objects, multiple views of volume data, and the ability to combine spatial and symbolic queries. The system allows for incremental enhancement of existing objects and the addition of new objects and spatial relationships. The query system is designed for databases containing symbolic and spatial data. This paper discusses its application to data acquired in biomedical 3-D image reconstruction, but it is applicable to other areas such as CAD/CAM, geographical information systems, and computer vision.
Query-Time Optimization Techniques for Structured Queries in Information Retrieval
ERIC Educational Resources Information Center
Cartright, Marc-Allen
2013-01-01
The use of information retrieval (IR) systems is evolving towards larger, more complicated queries. Both the IR industrial and research communities have generated significant evidence indicating that in order to continue improving retrieval effectiveness, increases in retrieval model complexity may be unavoidable. From an operational perspective,…
VISAGE: Interactive Visual Graph Querying.
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2016-06-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete, an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with "wildcard" nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE's ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries.
VISAGE: Interactive Visual Graph Querying
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2017-01-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete, an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with “wildcard” nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE’s ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries. PMID:28553670
NASA Astrophysics Data System (ADS)
Baranowski, Z.; Canali, L.; Toebbicke, R.; Hrivnac, J.; Barberis, D.
2017-10-01
This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each of which consists of ∼100 bytes, all having the same probability of being searched or counted. Data formats represent one important area for optimizing the performance and storage footprint of applications based on Hadoop. This work reports on the production usage and on tests using several data formats including Map Files, Apache Parquet, Avro, and various compression algorithms. The query engine also plays a critical role in the architecture. We also report on the use of HBase for the EventIndex, focusing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface, and the optimizations for data warehouse workloads and reports.
CGDSNPdb: a database resource for error-checked and imputed mouse SNPs.
Hutchins, Lucie N; Ding, Yueming; Szatkiewicz, Jin P; Von Smith, Randy; Yang, Hyuna; de Villena, Fernando Pardo-Manuel; Churchill, Gary A; Graber, Joel H
2010-07-06
The Center for Genome Dynamics Single Nucleotide Polymorphism Database (CGDSNPdb) is an open-source value-added database with more than nine million mouse single nucleotide polymorphisms (SNPs), drawn from multiple sources, with genotypes assigned to multiple inbred strains of laboratory mice. All SNPs are checked for accuracy and annotated for properties specific to the SNP as well as those implied by changes to overlapping protein-coding genes. CGDSNPdb serves as the primary interface to two unique data sets, the 'imputed genotype resource' in which a Hidden Markov Model was used to assess local haplotypes and the most probable base assignment at several million genomic loci in tens of strains of mice, and the Affymetrix Mouse Diversity Genotyping Array, a high density microarray with over 600,000 SNPs and over 900,000 invariant genomic probes. CGDSNPdb is accessible online through either a web-based query tool or a MySQL public login. Database URL: http://cgd.jax.org/cgdsnpdb/
Analyzing reflectance spectra of human skin in legal medicine
NASA Astrophysics Data System (ADS)
Belenki, Liudmila; Sterzik, Vera; Schulz, Katharina; Bohnert, Michael
2013-01-01
Our current research in the framework of an interdisciplinary project focuses on modelling the dynamics of the hemoglobin reoxygenation process in post-mortem human skin by reflectance spectrometry. The observations of reoxygenation of hemoglobin in livores after postmortem exposure to a cold environment relate the reoxygenation to the commonly known phenomenon that the color impression of livores changes from livid to pink under low ambient temperatures. We analyze the spectra with respect to a physical model describing the optical properties of human skin, discuss the dynamics of the reoxygenation, and propose a phenomenological model for reoxygenation. For additional characterization of the reflectance spectra, the curvature of the local minimum and maximum in the investigated spectral range is considered. There is a strong correlation between the curvature of spectra at a wavelength of 560 nm and the concentration of O2-Hb. The analysis is carried out via C programs, as well as MySQL database queries in Java EE, JDBC, Matlab, and Python.
Analyzing reflectance spectra of human skin in legal medicine.
Belenki, Liudmila; Sterzik, Vera; Schulz, Katharina; Bohnert, Michael
2013-01-01
Our current research in the framework of an interdisciplinary project focuses on modelling the dynamics of the hemoglobin reoxygenation process in post-mortem human skin by reflectance spectrometry. The observations of reoxygenation of hemoglobin in livores after postmortem exposure to a cold environment relate the reoxygenation to the commonly known phenomenon that the color impression of livores changes from livid to pink under low ambient temperatures. We analyze the spectra with respect to a physical model describing the optical properties of human skin, discuss the dynamics of the reoxygenation, and propose a phenomenological model for reoxygenation. For additional characterization of the reflectance spectra, the curvature of the local minimum and maximum in the investigated spectral range is considered. There is a strong correlation between the curvature of spectra at a wavelength of 560 nm and the concentration of O2-Hb. The analysis is carried out via C programs, as well as MySQL database queries in Java EE, JDBC, Matlab, and Python.
A Flexible Monitoring Infrastructure for the Simulation Requests
NASA Astrophysics Data System (ADS)
Spinoso, V.; Missiato, M.
2014-06-01
Running and monitoring simulations usually involves several different aspects of the entire workflow: the configuration of the job, the site issues, the software deployment at the site, the file catalogue, the transfers of the simulated data. In addition, the final product of the simulation is often the result of several sequential steps. This project tries a different approach to monitoring the simulation requests. All the necessary data are collected from the central services which lead the submission of the requests and the data management, and stored by a backend into a NoSQL-based data cache; those data can be queried through a Web Service interface, which returns JSON responses, and allows users, sites, physics groups to easily create their own web frontend, aggregating only the needed information. As an example, it will be shown how it is possible to monitor the CMS services (ReqMgr, DAS/DBS, PhEDEx) using a central backend and multiple customized cross-language frontends.
UCbase 2.0: ultraconserved sequences database (2014 update).
Lomonaco, Vincenzo; Martoglia, Riccardo; Mandreoli, Federica; Anderlucci, Laura; Emmett, Warren; Bicciato, Silvio; Taccioli, Cristian
2014-01-01
UCbase 2.0 (http://ucbase.unimore.it) is an update, extension and evolution of UCbase, a Web tool dedicated to the analysis of ultraconserved sequences (UCRs). UCRs are 481 sequences >200 bases sharing 100% identity among human, mouse and rat genomes. They are frequently located in genomic regions known to be involved in cancer or differentially expressed in human leukemias and carcinomas. UCbase 2.0 is a platform-independent Web resource that includes the updated version of the human genome annotation (hg19), information linking disorders to chromosomal coordinates based on the Systematized Nomenclature of Medicine classification, a query tool to search for Single Nucleotide Polymorphisms (SNPs) and a new text box to directly interrogate the database using a MySQL interface. To facilitate the interactive visual interpretation of UCR chromosomal positioning, UCbase 2.0 now includes a graph visualization interface directly linked to UCSC genome browser. Database URL: http://ucbase.unimore.it. © The Author(s) 2014. Published by Oxford University Press.
Shao, Wei; Shan, Jigui; Kearney, Mary F; Wu, Xiaolin; Maldarelli, Frank; Mellors, John W; Luke, Brian; Coffin, John M; Hughes, Stephen H
2016-07-04
The NCI Retrovirus Integration Database is a MySQL-based relational database created for storing and retrieving comprehensive information about retroviral integration sites, primarily, but not exclusively, HIV-1. The database is accessible to the public for submission or extraction of data originating from experiments aimed at collecting information related to retroviral integration sites including: the site of integration into the host genome, the virus family and subtype, the origin of the sample, gene exons/introns associated with integration, and proviral orientation. Information about the references from which the data were collected is also stored in the database. Tools are built into the website that can be used to map the integration sites to UCSC genome browser, to plot the integration site patterns on a chromosome, and to display provirus LTRs in their inserted genome sequence. The website is robust, user friendly, and allows users to query the database and analyze the data dynamically. https://rid.ncifcrf.gov or http://home.ncifcrf.gov/hivdrp/resources.htm
MyLabStocks: a web-application to manage molecular biology materials
Chuffart, Florent; Yvert, Gaël
2014-01-01
Laboratory stocks are the hardware of research. They must be stored and managed with minimum loss of material and information. Plasmids, oligonucleotides and strains are regularly exchanged between collaborators within and between laboratories. Managing and sharing information about every item is crucial for retrieval of reagents, for planning experiments and for reproducing past experimental results. We have developed a web-based application to manage stocks commonly used in a molecular biology laboratory. Its functionalities include user-defined privileges, visualization of plasmid maps directly from their sequence and the capacity to search items from fields of annotation or directly from a query sequence using BLAST. It is designed to handle records of plasmids, oligonucleotides, yeast strains, antibodies, pipettes and notebooks. Based on PHP/MySQL, it can easily be extended to handle other types of stocks and it can be installed on any server architecture. MyLabStocks is freely available from: https://forge.cbp.ens-lyon.fr/redmine/projects/mylabstocks under an open source licence. PMID:24643870
BGDB: a database of bivalent genes.
Li, Qingyan; Lian, Shuabin; Dai, Zhiming; Xiang, Qian; Dai, Xianhua
2013-01-01
A bivalent gene is a gene marked with both H3K4me3 and H3K27me3 epigenetic modifications in the same area, and is proposed to play a pivotal role related to pluripotency in embryonic stem (ES) cells. Identification of these bivalent genes and understanding their functions are important for further research on lineage specification and embryo development. So far, a large amount of genome-wide histone modification data has been generated in mouse and human ES cells. These valuable data make it possible to identify bivalent genes, but no comprehensive data repositories or analysis tools are currently available for bivalent genes. In this work, we develop BGDB, the database of bivalent genes. The database contains 6897 bivalent genes in human and mouse ES cells, which were manually collected from the scientific literature. Each entry contains curated information, including genomic context, sequences, gene ontology and other relevant information. The web services of the BGDB database were implemented with PHP + MySQL + JavaScript, and provide diverse query functions. Database URL: http://dailab.sysu.edu.cn/bgdb/
Design of Instant Messaging System of Multi-language E-commerce Platform
NASA Astrophysics Data System (ADS)
Yang, Heng; Chen, Xinyi; Li, Jiajia; Cao, Yaru
2017-09-01
This paper aims at researching the message system in the instant messaging system based on the multi-language e-commerce platform in order to design the instant messaging system in multi-language environment and exhibit the national characteristics based information as well as applying national languages to e-commerce. In order to develop beautiful and friendly system interface for the front end of the message system and reduce the development cost, the mature jQuery framework is adopted in this paper. The high-performance server Tomcat is adopted at the back end to process user requests, and MySQL database is adopted for data storage to persistently store user data, and meanwhile Oracle database is adopted as the message buffer for system optimization. Moreover, AJAX technology is adopted for the client to actively pull the newest data from the server at the specified time. In practical application, the system has strong reliability, good expansibility, short response time, high system throughput capacity and high user concurrency.
2004-03-01
with MySQL . This choice was made because MySQL is open source. Any significant database engine such as Oracle or MS- SQL or even MS Access can be used...10 Figure 6. The DoD vs . Commercial Life Cycle...necessarily be interested in SCADA network security 13. MySQL (Database server) – This station represents a typical data server for a web page
Architecture for biomedical multimedia information delivery on the World Wide Web
NASA Astrophysics Data System (ADS)
Long, L. Rodney; Goh, Gin-Hua; Neve, Leif; Thoma, George R.
1997-10-01
Research engineers at the National Library of Medicine are building a prototype system for the delivery of multimedia biomedical information on the World Wide Web. This paper discusses the architecture and design considerations for the system, which will be used initially to make images and text from the third National Health and Nutrition Examination Survey (NHANES) publicly available. We categorized our analysis as follows: (1) fundamental software tools: we analyzed trade-offs among use of conventional HTML/CGI, X Window Broadway, and Java; (2) image delivery: we examined the use of unconventional TCP transmission methods; (3) database manager and database design: we discuss the capabilities and planned use of the Informix object-relational database manager and the planned schema for the NHANES database; (4) storage requirements for our Sun server; (5) user interface considerations; (6) the compatibility of the system with other standard research and analysis tools; (7) image display: we discuss considerations for consistent image display for end users. Finally, we discuss the scalability of the system in terms of incorporating larger or more databases of similar data, and the extendibility of the system for supporting content-based retrieval of biomedical images. The system prototype is called the Web-based Medical Information Retrieval System. An early version was built as a Java applet and tested on Unix, PC, and Macintosh platforms. This prototype used the MiniSQL database manager to do text queries on a small database of records of participants in the second NHANES survey. The full records and associated x-ray images were retrievable and displayable on a standard Web browser. A second version has now been built, also a Java applet, using the MySQL database manager.
Sun, Yongmei; Li, Xing; Wu, Di; Pan, Qi; Ji, Yuefeng; Ren, Hong; Ding, Keyue
2016-01-01
RNA editing is one of the post- or co-transcriptional processes that can lead to amino acid substitutions in protein sequences, alternative pre-mRNA splicing, and changes in gene expression levels. Although several methods have been suggested to identify RNA editing sites, challenges remain in distinguishing true RNA editing sites from their genomic counterparts and from technical artifacts. In addition, a software framework to identify and visualize potential RNA editing sites is lacking. Here, we present 'RED' (RNA Editing sites Detector), software for the identification of RNA editing sites by integrating multiple rule-based and statistical filters. The potential RNA editing sites can be visualized at the genome and the site levels through a graphical user interface (GUI). To improve performance, we used the MySQL database management system (DBMS) for high-throughput data storage and query. We demonstrated the validity and utility of RED by identifying the presence and absence of experimentally validated C→U RNA editing sites, in comparison with REDItools, a command line tool for high-throughput investigation of RNA editing. In an analysis of a sample data set with 28 experimentally validated C→U RNA editing sites, RED had a sensitivity of 0.64 and a specificity of 0.5. In comparison, REDItools had a better sensitivity (0.75) but the same specificity (0.5). RED is easy-to-use, platform-independent, Java-based software, and can be applied to RNA-seq data with or without DNA sequencing data. The package is freely available under the GPLv3 license at http://github.com/REDetector/RED or https://sourceforge.net/projects/redetector. PMID:26930599
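For readers unfamiliar with the two metrics quoted in the abstract above, the following minimal sketch shows how sensitivity and specificity are conventionally computed from confusion counts. The function names and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the RED evaluation.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly edited sites that were detected: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive cases: sensitivity is undefined")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-edited sites correctly rejected: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no negative cases: specificity is undefined")
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
print(sensitivity(tp=8, fn=2))   # 8 of 10 true sites found
print(specificity(tn=6, fp=2))   # 6 of 8 non-sites rejected
```

A tool can trade one metric against the other by tightening or loosening its filters, which is why the abstract reports both values for each tool.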
Efficient data management tools for the heterogeneous big data warehouse
NASA Astrophysics Data System (ADS)
Alekseev, A. A.; Osipova, V. V.; Ivanov, M. A.; Klimentov, A.; Grigorieva, N. V.; Nalamwar, H. S.
2016-09-01
The traditional RDBMS has been consistent for normalized data structures. RDBMS served well for decades, but the technology is not optimal for data processing and analysis in data-intensive fields like social networks, the oil and gas industry, experiments at the Large Hadron Collider, etc. Several challenges have been raised recently regarding the scalability of data-warehouse-like workloads against the transactional schema, in particular for the analysis of archived data or the aggregation of data for summary and accounting purposes. The paper evaluates new database technologies like HBase, Cassandra, and MongoDB, commonly referred to as NoSQL databases, for handling messy, varied and large amounts of data. The evaluation depends upon the performance, throughput and scalability of the above technologies for several scientific and industrial use-cases. This paper outlines the technologies and architectures needed for processing Big Data, and describes the back-end application that implements data migration from an RDBMS to a NoSQL data warehouse, the NoSQL database organization, and how it could be useful for further data analytics.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Yong-Liang; Department of Chemistry and Chemical Engineering, Shaanxi Key Laboratory of Comprehensive Utilization of Tailings Resources, Shang Luo University, Shang Luo 726000; Wu, Ya-Pan
2015-03-15
Two new interpenetrating Cu(II)/Ni(II) coordination polymers based on an unsymmetrical bifunctional N/O-tectonic 3-(pyrid-4′-yl)-5-(4″-carbonylphenyl)-1,2,4-triazolyl (H2pycz), ([Cu-(Hpycz)2]·2H2O)n (1) and ([Ni(Hpycz)2]·H2O)n (2), have been solvothermally synthesized and structurally characterized. Single crystal X-ray analysis indicates that compound 1 shows 2-fold parallel interpenetrated 4^4-sql layers with the same handedness. The overall structure of 1 is achiral: in each layer of doubly interpenetrating nets, the two individual nets have the opposite handedness to the corresponding nets in the adjoining layers. Compound 2 features a rare 8-fold interpenetrating 6^6-dia network that belongs to class IIIa interpenetration. In addition, compounds 1 and 2 both show similar paramagnetic properties. - Graphical abstract: Two new Cu(II)/Ni(II) coordination polymers present 2D parallel 2-fold interpenetrated 4^4-sql layers and a rare 3D 8-fold interpenetrating 6^6-dia network. In addition, magnetic susceptibility measurements show similar paramagnetic characteristics for the two complexes. - Highlights: • A new unsymmetrical bifunctional N/O-tectonic as 4-connected spacer. • A 2-fold parallel interpenetrated sql layer with the same handedness. • A rare 8-fold interpenetrating dia network (class IIIa).
NASA Astrophysics Data System (ADS)
Ho, Chris M. W.; Marshall, Garland R.
1993-12-01
SPLICE is a program that processes partial query solutions retrieved from 3D, structural databases to generate novel, aggregate ligands. It is designed to interface with the database searching program FOUNDATION, which retrieves fragments containing any combination of a user-specified minimum number of matching query elements. SPLICE eliminates aspects of structures that are physically incapable of binding within the active site. Then, a systematic rule-based procedure is performed upon the remaining fragments to ensure receptor complementarity. All modifications are automated and remain transparent to the user. Ligands are then assembled by linking components into composite structures through overlapping bonds. As a control experiment, FOUNDATION and SPLICE were used to reconstruct a known HIV-1 protease inhibitor after it had been fragmented, reoriented, and added to a sham database of fifty different small molecules. To illustrate the capabilities of this program, a 3D search query containing the pharmacophoric elements of an aspartic proteinase-inhibitor crystal complex was searched using FOUNDATION against a subset of the Cambridge Structural Database. One hundred thirty-one compounds were retrieved, each containing any combination of at least four query elements. Compounds were automatically screened and edited for receptor complementarity. Numerous combinations of fragments were discovered that could be linked to form novel structures, containing a greater number of pharmacophoric elements than any single retrieved fragment.
A high performance, ad-hoc, fuzzy query processing system for relational databases
NASA Technical Reports Server (NTRS)
Mansfield, William H., Jr.; Fleischman, Robert M.
1992-01-01
Database queries involving imprecise or fuzzy predicates are currently an evolving area of academic and industrial research. Such queries place severe stress on the indexing and I/O subsystems of conventional database environments since they involve the search of large numbers of records. The Datacycle architecture and research prototype is a database environment that uses filtering technology to perform an efficient, exhaustive search of an entire database. It has recently been modified to include fuzzy predicates in its query processing. The approach obviates the need for complex index structures, provides unlimited query throughput, permits the use of ad-hoc fuzzy membership functions, and provides a deterministic response time largely independent of query complexity and load. This paper describes the Datacycle prototype implementation of fuzzy queries and some recent performance results.
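The idea of evaluating a fuzzy predicate by scanning every record, rather than consulting an index, can be sketched as follows. This is an illustrative sketch only, not the Datacycle implementation: the trapezoidal membership function, the `fuzzy_scan` helper, and the sample records are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]


def trapezoid(a: float, b: float, c: float, d: float) -> Callable[[float], float]:
    """Build a trapezoidal membership function (requires a < b <= c < d).

    Membership is 0 outside (a, d), 1 on [b, c], and linear in between.
    """
    def mu(x: float) -> float:
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


def fuzzy_scan(records: List[Record], field: str,
               mu: Callable[[float], float],
               threshold: float) -> List[Tuple[float, Record]]:
    """Exhaustively score every record; keep those at or above the threshold,
    ordered by decreasing membership degree."""
    scored = [(mu(r[field]), r) for r in records]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Hypothetical data: "age is roughly between 20 and 30".
people = [{"id": 1, "age": 25}, {"id": 2, "age": 35}, {"id": 3, "age": 45}]
for degree, person in fuzzy_scan(people, "age", trapezoid(15, 20, 30, 40), 0.5):
    print(person["id"], degree)
```

Because every record is visited once regardless of the membership function supplied, the cost of the scan depends on the data size rather than on the shape of the predicate, which is the property the abstract attributes to the filtering approach.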
Jung, HaRim; Song, MoonBae; Youn, Hee Yong; Kim, Ung Mo
2015-01-01
A content-matched (CM) range monitoring query over moving objects continually retrieves the moving objects (i) whose non-spatial attribute values are matched to given non-spatial query values; and (ii) that are currently located within a given spatial query range. In this paper, we propose a new query indexing structure, called the group-aware query region tree (GQR-tree) for efficient evaluation of CM range monitoring queries. The primary role of the GQR-tree is to help the server leverage the computational capabilities of moving objects in order to improve the system performance in terms of the wireless communication cost and server workload. Through a series of comprehensive simulations, we verify the superiority of the GQR-tree method over the existing methods. PMID:26393613
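The two conditions that define a CM range monitoring query can be made concrete with a naive evaluation sketch. This is not the GQR-tree: it simply checks every object against one query, which is the baseline that an index structure aims to improve on. The class names, field names, and sample objects are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MovingObject:
    oid: str
    x: float
    y: float
    attrs: Dict[str, str]  # non-spatial attribute values


@dataclass
class CMRangeQuery:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    required: Dict[str, str]  # non-spatial query values

    def matches(self, obj: MovingObject) -> bool:
        # (i) every non-spatial query value must match the object's attribute
        content_ok = all(obj.attrs.get(k) == v for k, v in self.required.items())
        # (ii) the object must currently lie inside the rectangular query range
        in_range = (self.x_min <= obj.x <= self.x_max
                    and self.y_min <= obj.y <= self.y_max)
        return content_ok and in_range


def evaluate(query: CMRangeQuery, objects: List[MovingObject]) -> List[str]:
    """Return the ids of all objects currently satisfying the query."""
    return sorted(o.oid for o in objects if query.matches(o))


# Hypothetical objects and query, for illustration only.
objects = [
    MovingObject("t1", 2.0, 3.0, {"type": "taxi"}),
    MovingObject("b1", 2.5, 3.5, {"type": "bus"}),
    MovingObject("t2", 9.0, 9.0, {"type": "taxi"}),
]
query = CMRangeQuery(0.0, 0.0, 5.0, 5.0, {"type": "taxi"})
print(evaluate(query, objects))
```

A monitoring query would re-run this check whenever an object moves; the abstract's contribution is to shift part of that work onto the moving objects themselves to reduce communication and server load.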
Systems and methods for an extensible business application framework
NASA Technical Reports Server (NTRS)
Bell, David G. (Inventor); Crawford, Michael (Inventor)
2012-01-01
Method and systems for editing data from a query result include requesting a query result using a unique collection identifier for a collection of individual files and a unique identifier for a configuration file that specifies a data structure for the query result. A query result is generated that contains a plurality of fields as specified by the configuration file, by combining each of the individual files associated with a unique identifier for a collection of individual files. The query result data is displayed with a plurality of labels as specified in the configuration file. Edits can be performed by querying a collection of individual files using the configuration file, editing a portion of the query result, and transmitting only the edited information for storage back into a data repository.
Knowledge Data Base for Amorphous Metals
2007-07-26
not programmatic, updates. Over 100 custom SQL statements that maintain the domain specific data are attached to the workflow entries in a generic...for the form by populating the SQL and run generation tables. Application data may be prepared in different ways for two steps that invoke the same form...run generation mode). There is a single table of SQL commands. Each record has a user-definable ID, the SQL code, and a comment. The run generation
NASA Astrophysics Data System (ADS)
Dziedzic, Adam; Mulawka, Jan
2014-11-01
NoSQL is a new approach to data storage and manipulation. The aim of this paper is to gain more insight into NoSQL databases, as we are still in the early stages of understanding when to use them and how to use them in an appropriate way. In this submission, descriptions of selected NoSQL databases are presented. Each of the databases is analysed with primary focus on its data model, data access, architecture and practical usage in real applications. Furthermore, the NoSQL databases are compared with respect to data references. Relational databases offer foreign keys, whereas NoSQL databases provide only limited references. An intermediate model between graph theory and relational algebra that can address the problem should be created. Finally, a proposal for a new approach to the problem of inconsistent references in Big Data storage systems is introduced.
Jadhav, Ashutosh; Sheth, Amit; Pathak, Jyotishman
2014-01-01
Since the early 2000s, Internet usage for health information searching has increased significantly. Studying search queries can help us to understand users' “information need” and how they formulate search queries (“expression of information need”). Although cardiovascular diseases (CVD) affect a large percentage of the population, few studies have investigated how and what users search for regarding CVD. We address this knowledge gap in the community by analyzing a large corpus of 10 million CVD related search queries from MayoClinic.com. Using UMLS MetaMap and UMLS semantic types/concepts, we developed a rule-based approach to categorize the queries into 14 health categories. We analyzed structural properties, types (keyword-based/Wh-questions/Yes-No questions) and linguistic structure of the queries. Our results show that the most searched health categories are ‘Diseases/Conditions’, ‘Vital-Signs’, ‘Symptoms’ and ‘Living-with’. CVD queries are longer and are predominantly keyword-based. This study extends our knowledge about online health information searching and provides useful insights for Web search engines and health websites. PMID:25954380
NASA Astrophysics Data System (ADS)
Zhang, Xiaowei; Xing, Peiqi; Geng, Xiujuan; Sun, Daofeng; Xiao, Zhenyu; Wang, Lei
2015-09-01
Eight new coordination polymers (CPs), namely, [Zn(1,2-mbix)(tbtpa)]n (1), [Co(1,2-mbix)(tbtpa)]n (2), [CdCl(1,2-mbix)(tbtpa)0.5]n (3), {[Cd(1,2-bix)(tbtpa)]·H2O}n (4), {[Cd0.5(1,2-bix)(tbtpa)0.5]·H2O}n (5), {[Co0.5(1,2-bix)(tbtpa)0.5]·2H2O}n (6), {[Co(1,2-bix)(tbtpa)]·H2O}n (7) and {[Co(1,2-bix)(tbtpa)]·Diox·2H2O}n (8), were synthesized under solvothermal conditions based on a mixed-ligand strategy (H2tbtpa = tetrabromoterephthalic acid, 1,2-mbix = 1,2-bis((2-methyl-1H-imidazol-1-yl)methyl)benzene, 1,2-bix = 1,2-bis(imidazol-1-ylmethyl)benzene). All of the CPs have been structurally characterized by single-crystal X-ray diffraction analyses and further characterized by elemental analyses, IR spectroscopy, powder X-ray diffraction (PXRD), and thermogravimetric analyses (TGA). X-ray diffraction analyses show that 1 and 2 are isotypic and have 2D highly undulated networks with (4,4)-sql topology and C-H⋯Br interactions; 3 has a 2D planar network with (4,4)-sql topology with the occurrence of C-H⋯Cl interactions other than C-H⋯Br interactions; 4 shows 3D 2-fold interpenetrated nets with rare 6^5·8-mok topology, which has a self-catenation property. As with 1 and 2, 5 and 6 are also isostructural, with planar layers of 4^4-sql topology that further assemble into a 3D supramolecular structure through interdigitated stacking and C-Br⋯Cph interactions. 7 has a 2D slightly undulated network with (4,4)-sql topology and a one-dimensional channel, while 8 has 2-fold interpenetrated networks with (3,4)-connected jeb topology and point symbol {6^3}{6^5·8}. Their structures can be tuned by the conformations of the bis(imidazole) ligands and the solvent mixture. In addition, the TGA properties of all compounds and the luminescent properties of 1, 3, 4 and 5 are discussed in detail.
Konc, Janez; Cesnik, Tomo; Konc, Joanna Trykowska; Penca, Matej; Janežič, Dušanka
2012-02-27
ProBiS-Database is a searchable repository of precalculated local structural alignments in proteins detected by the ProBiS algorithm in the Protein Data Bank. Identification of functionally important binding regions of the protein is facilitated by structural similarity scores mapped to the query protein structure. PDB structures that have been aligned with a query protein may be rapidly retrieved from the ProBiS-Database, which is thus able to generate hypotheses concerning the roles of uncharacterized proteins. Presented with uncharacterized protein structure, ProBiS-Database can discern relationships between such a query protein and other better known proteins in the PDB. Fast access and a user-friendly graphical interface promote easy exploration of this database of over 420 million local structural alignments. The ProBiS-Database is updated weekly and is freely available online at http://probis.cmm.ki.si/database.
CSRQ: Communication-Efficient Secure Range Queries in Two-Tiered Sensor Networks
Dai, Hua; Ye, Qingqun; Yang, Geng; Xu, Jia; He, Ruiliang
2016-01-01
In recent years, we have seen many applications of secure query in two-tiered wireless sensor networks. Storage nodes are responsible for storing data from nearby sensor nodes and answering queries from Sink. It is critical to protect data security from a compromised storage node. In this paper, the Communication-efficient Secure Range Query (CSRQ)—a privacy and integrity preserving range query protocol—is proposed to prevent attackers from gaining information of both data collected by sensor nodes and queries issued by Sink. To preserve privacy and integrity, in addition to employing the encoding mechanisms, a novel data structure called encrypted constraint chain is proposed, which embeds the information of integrity verification. Sink can use this encrypted constraint chain to verify the query result. The performance evaluation shows that CSRQ has lower communication cost than the current range query protocols. PMID:26907293
SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases.
Schweiger, Dominik; Trajanoski, Zlatko; Pabinger, Stephan
2014-08-15
Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the just recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. This new graphical way of creating queries for biological Semantic Web databases considerably facilitates usability as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
Mudunuri, Uma S.; Khouja, Mohamad; Repetski, Stephen; Venkataraman, Girish; Che, Anney; Luke, Brian T.; Girard, F. Pascal; Stephens, Robert M.
2013-01-01
As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework. PMID:24312478
A Layered Searchable Encryption Scheme with Functional Components Independent of Encryption Methods
Luo, Guangchun; Qin, Ke
2014-01-01
Searchable encryption technique enables the users to securely store and search their documents over the remote semitrusted server, which is especially suitable for protecting sensitive data in the cloud. However, various settings (based on symmetric or asymmetric encryption) and functionalities (ranked keyword query, range query, phrase query, etc.) are often realized by different methods with different searchable structures that are generally not compatible with each other, which limits the scope of application and hinders the functional extensions. We prove that asymmetric searchable structure could be converted to symmetric structure, and functions could be modeled separately apart from the core searchable structure. Based on this observation, we propose a layered searchable encryption (LSE) scheme, which provides compatibility, flexibility, and security for various settings and functionalities. In this scheme, the outputs of the core searchable component based on either symmetric or asymmetric setting are converted to some uniform mappings, which are then transmitted to loosely coupled functional components to further filter the results. In such a way, all functional components could directly support both symmetric and asymmetric settings. Based on LSE, we propose two representative and novel constructions for ranked keyword query (previously only available in symmetric scheme) and range query (previously only available in asymmetric scheme). PMID:24719565
A natural language interface plug-in for cooperative query answering in biological databases.
Jamil, Hasan M
2012-06-11
One of the many unique features of biological databases is that the mere existence of a ground data item is not always a precondition for a query response. It may be argued that from a biologist's standpoint, queries are not always best posed using a structured language. By this we mean that approximate and flexible responses to natural language like queries are well suited for this domain. This is partly due to biologists' tendency to seek simpler interfaces and partly due to the fact that questions in biology involve high level concepts that are open to interpretations computed using sophisticated tools. In such highly interpretive environments, rigidly structured databases do not always perform well. In this paper, our goal is to propose a semantic correspondence plug-in to aid natural language query processing over arbitrary biological database schema with an aim to providing cooperative responses to queries tailored to users' interpretations. Natural language interfaces for databases are generally effective when they are tuned to the underlying database schema and its semantics. Therefore, changes in database schema become impossible to support, or a substantial reorganization cost must be absorbed to reflect any change. We leverage developments in natural language parsing, rule languages and ontologies, and data integration technologies to assemble a prototype query processor that is able to transform a natural language query into a semantically equivalent structured query over the database. We allow knowledge rules and their frequent modifications as part of the underlying database schema. The approach we adopt in our plug-in overcomes some of the serious limitations of many contemporary natural language interfaces, including support for schema modifications and independence from underlying database schema. 
The plug-in introduced in this paper is generic and facilitates connecting user selected natural language interfaces to arbitrary databases using a semantic description of the intended application. We demonstrate the feasibility of our approach with a practical example.
Jadhav, Ashutosh; Andrews, Donna; Fiksdal, Alexander; Kumbamu, Ashok; McCormick, Jennifer B; Misitano, Andrew; Nelsen, Laurie; Ryu, Euijung; Sheth, Amit; Wu, Stephen
2014-01-01
Background The number of people using the Internet and mobile/smart devices for health information seeking is increasing rapidly. Although the user experience for online health information seeking varies with the device used, for example, smart devices (SDs) like smartphones/tablets versus personal computers (PCs) like desktops/laptops, very few studies have investigated how online health information seeking behavior (OHISB) may differ by device. Objective The objective of this study is to examine differences in OHISB between PCs and SDs through a comparative analysis of large-scale health search queries submitted through Web search engines from both types of devices. Methods Using the Web analytics tool, IBM NetInsight OnDemand, and based on the type of devices used (PCs or SDs), we obtained the most frequent health search queries between June 2011 and May 2013 that were submitted on Web search engines and directed users to the Mayo Clinic’s consumer health information website. We performed analyses on “Queries with considering repetition counts (QwR)” and “Queries without considering repetition counts (QwoR)”. The dataset contains (1) 2.74 million and 3.94 million QwoR, respectively for PCs and SDs, and (2) more than 100 million QwR for both PCs and SDs. We analyzed structural properties of the queries (length of the search queries, usage of query operators and special characters in health queries), types of search queries (keyword-based, wh-questions, yes/no questions), categorization of the queries based on health categories and information mentioned in the queries (gender, age-groups, temporal references), misspellings in the health queries, and the linguistic structure of the health queries. Results Query strings used for health information searching via PCs and SDs differ by almost 50%. The most searched health categories are “Symptoms” (1 in 3 search queries), “Causes”, and “Treatments & Drugs”. 
The distribution of search queries for different health categories differs with the device used for the search. Health queries tend to be longer and more specific than general search queries. Health queries from SDs are longer and have slightly fewer spelling mistakes than those from PCs. Users specify words related to women and children more often than words related to men or any other age group. Most of the health queries are formulated using keywords; the second-most common are wh- and yes/no questions. Users ask more health questions using SDs than PCs. Almost all health queries have at least one noun and health queries from SDs are more descriptive than those from PCs. Conclusions This study is a large-scale comparative analysis of health search queries to understand the effects of device type (PCs vs SDs) used on OHISB. The study indicates that the device used for online health information search plays an important role in shaping how health information searches by consumers and patients are executed. PMID:25000537
Jadhav, Ashutosh; Andrews, Donna; Fiksdal, Alexander; Kumbamu, Ashok; McCormick, Jennifer B; Misitano, Andrew; Nelsen, Laurie; Ryu, Euijung; Sheth, Amit; Wu, Stephen; Pathak, Jyotishman
2014-07-04
The number of people using the Internet and mobile/smart devices for health information seeking is increasing rapidly. Although the user experience for online health information seeking varies with the device used, for example, smart devices (SDs) like smartphones/tablets versus personal computers (PCs) like desktops/laptops, very few studies have investigated how online health information seeking behavior (OHISB) may differ by device. The objective of this study is to examine differences in OHISB between PCs and SDs through a comparative analysis of large-scale health search queries submitted through Web search engines from both types of devices. Using the Web analytics tool, IBM NetInsight OnDemand, and based on the type of devices used (PCs or SDs), we obtained the most frequent health search queries between June 2011 and May 2013 that were submitted on Web search engines and directed users to the Mayo Clinic's consumer health information website. We performed analyses on "Queries with considering repetition counts (QwR)" and "Queries without considering repetition counts (QwoR)". The dataset contains (1) 2.74 million and 3.94 million QwoR, respectively for PCs and SDs, and (2) more than 100 million QwR for both PCs and SDs. We analyzed structural properties of the queries (length of the search queries, usage of query operators and special characters in health queries), types of search queries (keyword-based, wh-questions, yes/no questions), categorization of the queries based on health categories and information mentioned in the queries (gender, age-groups, temporal references), misspellings in the health queries, and the linguistic structure of the health queries. Query strings used for health information searching via PCs and SDs differ by almost 50%. The most searched health categories are "Symptoms" (1 in 3 search queries), "Causes", and "Treatments & Drugs". 
The distribution of search queries for different health categories differs with the device used for the search. Health queries tend to be longer and more specific than general search queries. Health queries from SDs are longer and have slightly fewer spelling mistakes than those from PCs. Users specify words related to women and children more often than words related to men or any other age group. Most of the health queries are formulated using keywords; the second-most common are wh- and yes/no questions. Users ask more health questions using SDs than PCs. Almost all health queries have at least one noun and health queries from SDs are more descriptive than those from PCs. This study is a large-scale comparative analysis of health search queries to understand the effects of device type (PCs vs. SDs) used on OHISB. The study indicates that the device used for online health information search plays an important role in shaping how health information searches by consumers and patients are executed.
Standley, Daron M; Toh, Hiroyuki; Nakamura, Haruki
2008-09-01
A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 were found to align to templates with the same Pfam family as the query or had sequence identities ≥ 30%. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized. 2008 Wiley-Liss, Inc.
NASA Astrophysics Data System (ADS)
Knosp, B.; Gangl, M. E.; Hristova-Veleva, S. M.; Kim, R. M.; Lambrigtsen, B.; Li, P.; Niamsuwan, N.; Shen, T. P. J.; Turk, F. J.; Vu, Q. A.
2014-12-01
The JPL Tropical Cyclone Information System (TCIS) brings together satellite, aircraft, and model forecast data from several NASA, NOAA, and other data centers to assist researchers in comparing and analyzing data related to tropical cyclones. The TCIS has been supporting specific science field campaigns, such as the Genesis and Rapid Intensification Processes (GRIP) campaign and the Hurricane and Severe Storm Sentinel (HS3) campaign, by creating near real-time (NRT) data visualization portals. These portals are intended to assist in mission planning, enhance the understanding of current physical processes, and improve model data by comparing it to satellite and aircraft observations. The TCIS NRT portals allow the user to view plots on a Google Earth interface. To complement these visualizations, the team has been working on developing data analysis tools to let the user actively interrogate areas of Level 2 swath and two-dimensional plots they see on their screen. As expected, these observation and model data are quite voluminous and bottlenecks in the system architecture can occur when the databases try to run geospatial searches for data files that need to be read by the tools. To improve the responsiveness of the data analysis tools, the TCIS team has been conducting studies on how to best store Level 2 swath footprints and run sub-second geospatial searches to discover data. The first objective was to improve the sampling accuracy of the footprints being stored in the TCIS database by comparing the Java-based NASA PO.DAAC Level 2 Swath Generator with a TCIS Python swath generator. The second objective was to compare the performance of four database implementations - MySQL, MySQL+Solr, MongoDB, and PostgreSQL - to see which database management system would yield the best geospatial query and storage performance. 
The final objective was to integrate our chosen technologies with our Joint Probability Density Function (Joint PDF), Wave Number Analysis, and Automated Rotational Center Hurricane Eye Retrieval (ARCHER) tools. In this presentation, we will compare the enabling technologies we tested and discuss which ones we selected for integration into the TCIS' data analysis tool architecture. We will also show how these techniques have been automated to provide access to NRT data through our analysis tools.
Risk Assessment of the Naval Postgraduate School Gigabit Network
2004-09-01
Management Server (1) • Ras Server (1) • Remedy Server (1) • Samba Server(2) • SQL Servers (3) • Web Servers (3) • WINS Server (1) • Library...Server Bob Sharp INCA Windows 2000 Advanced Server NPGS Landesk SQL 2000 Alan Pires eagle Microsoft Windows 2000 Advanced Server EWS NPGS Landesk...Advanced Server Special Projects NPGS SQL Alan Pires MC01BDB Microsoft Windows 2000 Advanced Server Special Projects NPGS SQL 2000 Alan Pires
Performance Evaluation of NoSQL Databases: A Case Study
2015-02-01
a centralized relational database. The customer decided to consider NoSQL technologies for two specific uses, namely: the primary data store for...17 custom specific 6. FU NoSQL availab data mo arking of data g a specific wo sin benchmark f hmark for tran le workload de o publish meas their...The choice of a particular NoSQL database imposes a specific distributed software architecture and data model, and is a major determinant of the
Gai, Xiaowu; Perin, Juan C; Murphy, Kevin; O'Hara, Ryan; D'arcy, Monica; Wenocur, Adam; Xie, Hongbo M; Rappaport, Eric F; Shaikh, Tamim H; White, Peter S
2010-02-04
Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants. However, few informatics tools for accurate and efficient CNV detection and assessment currently exist. We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection, annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects and incorporates a secure user authentication layer and user/admin roles. To assist with determination of pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-based literature references. Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV. To our knowledge, CNV Workshop represents the first cohesive and convenient platform for detection, annotation, and assessment of the biological and clinical significance of structural variants. 
CNV Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease cohorts and is an ideal platform for coordinating multiple associated projects. Available on the web at: http://sourceforge.net/projects/cnv.
In-database processing of a large collection of remote sensing data: applications and implementation
NASA Astrophysics Data System (ADS)
Kikhtenko, Vladimir; Mamash, Elena; Chubarov, Dmitri; Voronina, Polina
2016-04-01
Large archives of remote sensing data are now available to scientists, yet the need to work with individual satellite scenes or product files constrains studies that span a wide temporal range or spatial extent. The resources (storage capacity, computing power and network bandwidth) required for such studies are often beyond the capabilities of individual geoscientists. This problem has been tackled before in remote sensing research and inspired several information systems. Some of them such as NASA Giovanni [1] and Google Earth Engine have already proved their utility for science. Analysis tasks involving large volumes of numerical data are not unique to Earth Sciences. Recent advances in data science are enabled by the development of in-database processing engines that bring processing closer to storage, use declarative query languages to facilitate parallel scalability and provide high-level abstraction of the whole dataset. We build on the idea of bridging the gap between file archives containing remote sensing data and databases by integrating files into relational database as foreign data sources and performing analytical processing inside the database engine. Thereby higher level query language can efficiently address problems of arbitrary size: from accessing the data associated with a specific pixel or a grid cell to complex aggregation over spatial or temporal extents over a large number of individual data files. This approach was implemented using PostgreSQL for a Siberian regional archive of satellite data products holding hundreds of terabytes of measurements from multiple sensors and missions taken over a decade-long span. While preserving the original storage layout and therefore compatibility with existing applications the in-database processing engine provides a toolkit for provisioning remote sensing data in scientific workflows and applications. 
The use of SQL - a widely used higher level declarative query language - simplifies interoperability between desktop GIS, web applications and geographic web services and interactive scientific applications (MATLAB, IPython). The system also automatically ingests direct readout data from meteorological and research satellites in near-real time with distributed acquisition workflows managed by the Taverna workflow engine [2]. The system has demonstrated its utility in performing non-trivial analytic processing such as the computation of the Robust Satellite Technique (RST) indices [3]. It has been useful in different tasks such as studying urban heat islands, analyzing patterns in the distribution of wildfire occurrences, and detecting phenomena related to seismic and earthquake activity. Initial experience has highlighted several limitations of the proposed approach, yet it has demonstrated the ability to facilitate the use of large archives of remote sensing data by geoscientists. 1. J.G. Acker, G. Leptoukh, Online analysis enhances use of NASA Earth science data. EOS Trans. AGU, 2007, 88(2), P. 14-17. 2. D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M.R. Pocock, P. Li and T. Oinn, Taverna: a tool for building and running workflows of services. Nucleic Acids Research. 2006. V. 34. P. W729-W732. 3. V. Tramutoli, G. Di Bello, N. Pergola, S. Piscitelli, Robust satellite techniques for remote sensing of seismically active areas // Annals of Geophysics. 2001. no. 44(2). P. 295-312.
Ibmdbpy-spatial : An Open-source implementation of in-database geospatial analytics in Python
NASA Astrophysics Data System (ADS)
Roy, Avipsa; Fouché, Edouard; Rodriguez Morales, Rafael; Moehler, Gregor
2017-04-01
As the amount of spatial data acquired from several geodetic sources has grown over the years and as data infrastructure has become more powerful, the need for adoption of in-database analytic technology within geosciences has grown rapidly. In-database analytics on spatial data stored in a traditional enterprise data warehouse enables much faster retrieval and analysis for making better predictions about risks and opportunities, identifying trends, and spotting anomalies. Although there are a number of open-source spatial analysis libraries like geopandas and shapely available today, most of them have been restricted to manipulation and analysis of geometric objects with a dependency on GEOS and similar libraries. We present an open-source software package, written in Python, to fill the gap between spatial analysis and in-database analytics. Ibmdbpy-spatial provides a geospatial extension to the ibmdbpy package, implemented in 2015. It provides an interface for spatial data manipulation and access to in-database algorithms in IBM dashDB, a data warehouse platform with a spatial extender that runs as a service on IBM's cloud platform called Bluemix. Working in-database reduces the network overload, as the complete data need not be replicated into the user's local system altogether and only a subset of the entire dataset can be fetched into memory in a single instance. Ibmdbpy-spatial accelerates Python analytics by seamlessly pushing operations written in Python into the underlying database for execution using the dashDB spatial extender, thereby benefiting from in-database performance-enhancing features, such as columnar storage and parallel processing. The package is currently supported on Python versions from 2.7 up to 3.4. 
The basic architecture of the package consists of three main components: 1) a connection to dashDB represented by the instance IdaDataBase, which uses a middleware API, namely pypyodbc or jaydebeapi, to establish the database connection via ODBC or JDBC respectively; 2) an instance to represent the spatial data stored in the database as a dataframe in Python, called the IdaGeoDataFrame, with a specific geometry attribute which recognises a planar geometry column in dashDB; and 3) Python wrappers for spatial functions like within, distance, area, buffer and more which dashDB currently supports, to make the querying process from Python much simpler for the users. The spatial functions translate well-known geopandas-like syntax into SQL queries utilising the database connection to perform spatial operations in-database and can operate on single geometries as well as two different geometries from different IdaGeoDataFrames. The in-database queries strictly follow the standards of OpenGIS Implementation Specification for Geographic information - Simple feature access for SQL. The results of the operations obtained can thereby be accessed dynamically via interactive Jupyter notebooks from any system which supports Python, without any additional dependencies and can also be combined with other open source libraries such as matplotlib and folium in-built within Jupyter notebooks for visualization purposes. We built a use case to analyse crime hotspots in New York City to validate our implementation and visualized the results as a choropleth map for each borough.
Wald, Lisa A.; Wald, David J.; Schwarz, Stan; Presgrave, Bruce; Earle, Paul S.; Martinez, Eric; Oppenheimer, David
2008-01-01
At the beginning of 2006, the U.S. Geological Survey (USGS) Earthquake Hazards Program (EHP) introduced a new automated Earthquake Notification Service (ENS) to take the place of the National Earthquake Information Center (NEIC) "Bigquake" system and the various other individual EHP e-mail list-servers for separate regions in the United States. These included northern California, southern California, and the central and eastern United States. ENS is a "one-stop shopping" system that allows Internet users to subscribe to flexible and customizable notifications for earthquakes anywhere in the world. The customization capability allows users to define the what (magnitude threshold), the when (day and night thresholds), and the where (specific regions) for their notifications. Customization is achieved by employing a per-user based request profile, allowing the notifications to be tailored for each individual's requirements. Such earthquake-parameter-specific custom delivery was not possible with simple e-mail list-servers. Now that event and user profiles are in a structured query language (SQL) database, additional flexibility is possible. At the time of this writing, ENS had more than 114,000 subscribers, with more than 200,000 separate user profiles. On a typical day, more than 188,000 messages get sent to a variety of widely distributed users for a wide range of earthquake locations and magnitudes. The purpose of this article is to describe how ENS works, highlight the features it offers, and summarize plans for future developments.
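The per-user request profile described above (the what, the when, and the where) can be illustrated with a minimal Python sketch. This is not the ENS implementation: the class names, field names, the rectangular region, and the fixed day-hour window are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical notification profile; all field names are assumptions."""
    day_min_magnitude: float    # the "what" during daytime
    night_min_magnitude: float  # the "what" during nighttime
    lat_min: float              # the "where": a simple latitude/longitude box
    lat_max: float
    lon_min: float
    lon_max: float
    day_start_hour: int = 8     # the "when": assumed day window, local time
    day_end_hour: int = 22


@dataclass
class Event:
    """Hypothetical earthquake event record."""
    magnitude: float
    lat: float
    lon: float


def should_notify(profile: Profile, event: Event, subscriber_local_hour: int) -> bool:
    """Return True if the event passes the profile's region and magnitude tests."""
    # Region test: plain bounding box (does not handle boxes crossing the antimeridian).
    in_region = (
        profile.lat_min <= event.lat <= profile.lat_max
        and profile.lon_min <= event.lon <= profile.lon_max
    )
    if not in_region:
        return False
    # Pick the day or night threshold from the subscriber's local hour.
    is_day = profile.day_start_hour <= subscriber_local_hour < profile.day_end_hour
    threshold = profile.day_min_magnitude if is_day else profile.night_min_magnitude
    return event.magnitude >= threshold
```

With profiles and events stored as database rows, the same three tests could presumably be expressed as a single SQL join between an event table and a profile table; the function form is used here only to keep the sketch self-contained.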
Netgram: Visualizing Communities in Evolving Networks
Mall, Raghvendra; Langone, Rocco; Suykens, Johan A. K.
2015-01-01
Real-world complex networks are dynamic in nature and change over time. The change is usually observed in the interactions within the network over time. Complex networks exhibit community-like structures. A key feature of the dynamics of complex networks is the evolution of communities over time. Several methods have been proposed to detect and track the evolution of these groups over time. However, there is no generic tool which visualizes all the aspects of group evolution in dynamic networks including birth, death, splitting, merging, expansion, shrinkage and continuation of groups. In this paper, we propose Netgram: a tool for visualizing evolution of communities in time-evolving graphs. Netgram maintains evolution of communities over 2 consecutive time-stamps in tables which are used to create a query database using the SQL outer-join operation. It uses a line-based visualization technique which adheres to certain design principles and aesthetic guidelines. Netgram uses a greedy solution to order the initial community information provided by the evolutionary clustering technique such that we have fewer line cross-overs in the visualization. This makes it easier to track the progress of individual communities in time evolving graphs. Netgram is a generic toolkit which can be used with any evolutionary community detection algorithm as illustrated in our experiments. We use Netgram for visualization of topic evolution in the NIPS conference over a period of 11 years and observe the emergence and merging of several disciplines in the field of information processing systems. PMID:26356538
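To make the outer-join step concrete, the following Python sketch joins community memberships at two consecutive time-stamps using the standard-library sqlite3 module. It is an illustration under assumptions, not Netgram's code: the table layout (one row per node with its community label), the table names t1 and t2, and the function name are hypothetical. The full outer join is emulated with a LEFT JOIN plus the unmatched rows of the second table, so the sketch does not depend on native FULL OUTER JOIN support.

```python
import sqlite3


def community_transitions(members_t1, members_t2):
    """Return (node, community_at_t1, community_at_t2) rows; None marks absence.

    members_t1 / members_t2 are lists of (node, community) pairs for two
    consecutive time-stamps (a hypothetical layout chosen for this sketch).
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (node TEXT PRIMARY KEY, community TEXT)")
    con.execute("CREATE TABLE t2 (node TEXT PRIMARY KEY, community TEXT)")
    con.executemany("INSERT INTO t1 VALUES (?, ?)", members_t1)
    con.executemany("INSERT INTO t2 VALUES (?, ?)", members_t2)
    # Emulated full outer join: every t1 row with its t2 match (if any),
    # followed by the t2 rows that have no match in t1.
    rows = con.execute(
        """
        SELECT t1.node, t1.community, t2.community
        FROM t1 LEFT JOIN t2 ON t1.node = t2.node
        UNION ALL
        SELECT t2.node, NULL, t2.community
        FROM t2 LEFT JOIN t1 ON t1.node = t2.node
        WHERE t1.node IS NULL
        ORDER BY 1
        """
    ).fetchall()
    con.close()
    return rows
```

In the joined rows, a None in the second position marks a node that appears only at the later time-stamp and a None in the third position marks a node that has disappeared; differing labels mark a node that moved between communities. Aggregating such rows per community is one plausible way to detect the births, deaths, splits, and merges listed in the abstract.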
Kühbeck, Felizian; Engelhardt, Stefan; Sarikas, Antonio
2014-01-01
Audience response (AR) systems are increasingly used in undergraduate medical education. However, high costs and complexity of conventional AR systems often limit their use. Here we present a novel AR system that is platform independent and does not require hardware clickers or additional software to be installed. "OnlineTED" was developed at Technische Universität München (TUM) based on Hypertext Preprocessor (PHP) with a My Structured Query Language (MySQL)-database as server- and Javascript as client-side programming languages. "OnlineTED" enables lecturers to create and manage question sets online and start polls in-class via a web-browser. Students can participate in the polls with any internet-enabled device (smartphones, tablet-PCs or laptops). A paper-based survey was conducted with undergraduate medical students and lecturers at TUM to compare "OnlineTED" with conventional AR systems using clickers. "OnlineTED" received above-average evaluation results by both students and lecturers at TUM and was seen on par or superior to conventional AR systems. The survey results indicated that up to 80% of students at TUM own an internet-enabled device (smartphone or tablet-PC) for participation in web-based AR technologies. "OnlineTED" is a novel web-based and platform-independent AR system for higher education that was well received by students and lecturers. As a non-commercial alternative to conventional AR systems it may foster interactive teaching in undergraduate education, in particular with large audiences.
Correcting ligands, metabolites, and pathways
Ott, Martin A; Vriend, Gert
2006-01-01
Background A wide range of research areas in bioinformatics, molecular biology and medicinal chemistry require precise chemical structure information about molecules and reactions, e.g. drug design, ligand docking, metabolic network reconstruction, and systems biology. Most available databases, however, treat chemical structures more as illustrations than as a datafield in its own right. Lack of chemical accuracy impedes progress in the areas mentioned above. We present a database of metabolites called BioMeta that augments the existing pathway databases by explicitly assessing the validity, correctness, and completeness of chemical structure and reaction information. Description The main bulk of the data in BioMeta were obtained from the KEGG Ligand database. We developed a tool for chemical structure validation which assesses the chemical validity and stereochemical completeness of a molecule description. The validation tool was used to examine the compounds in BioMeta, showing that a relatively small number of compounds had an incorrect constitution (connectivity only, not considering stereochemistry) and that a considerable number (about one third) had incomplete or even incorrect stereochemistry. We made a large effort to correct the errors and to complete the structural descriptions. A total of 1468 structures were corrected and/or completed. We also established the reaction balance of the reactions in BioMeta and corrected 55% of the unbalanced (stoichiometrically incorrect) reactions in an automatic procedure. The BioMeta database was implemented in PostgreSQL and provided with a web-based interface. Conclusion We demonstrate that the validation of metabolite structures and reactions is a feasible and worthwhile undertaking, and that the validation results can be used to trigger corrections and improvements to BioMeta, our metabolite database. BioMeta provides some tools for rational drug design, reaction searches, and visualization. 
It is freely available at provided that the copyright notice of all original data is cited. The database will be useful for querying and browsing biochemical pathways, and to obtain reference information for identifying compounds. However, these applications require that the underlying data be correct, and that is the focus of BioMeta. PMID:17132165
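The notion of an unbalanced (stoichiometrically incorrect) reaction can be illustrated with a small atom-counting check in Python. This is a sketch under stated assumptions rather than the BioMeta procedure: it handles only flat molecular formulas without parentheses, ignores charge, and the function names are hypothetical.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms in a flat formula such as 'C6H12O6' (no parentheses, no charge)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum atom counts over one side of a reaction given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(reactants, products):
    """Return True if every element occurs equally often on both sides."""
    return side_total(reactants) == side_total(products)
```

For example, `is_balanced([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")])` compares the element totals of glucose oxidation on both sides. An automatic correction step would additionally have to propose new coefficients or missing species, which this check does not attempt.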
Queries over Unstructured Data: Probabilistic Methods to the Rescue
NASA Astrophysics Data System (ADS)
Sarawagi, Sunita
Unstructured data like emails, addresses, invoices, call transcripts, reviews, and press releases are now an integral part of any large enterprise. A challenge of modern business intelligence applications is analyzing and querying data seamlessly across structured and unstructured sources. This requires the development of automated techniques for extracting structured records from text sources and resolving entity mentions in data from various sources. The success of any automated method for extraction and integration depends on how effectively it unifies diverse clues in the unstructured source and in existing structured databases. We argue that statistical learning techniques like Conditional Random Fields (CRFs) provide an accurate, elegant and principled framework for tackling these tasks. Given the inherent noise in real-world sources, it is important to capture the uncertainty of the above operations via imprecise data models. CRFs provide a sound probability distribution over extractions but are not easy to represent and query in a relational framework. We present methods of approximating this distribution to query-friendly row and column uncertainty models. Finally, we present models for representing the uncertainty of de-duplication and algorithms for various Top-K count queries on imprecise duplicates.
2012-09-01
relative performance of several conventional SQL and NoSQL databases with a set of one billion file block hashes. Digital Forensics, Sector Hashing, Full... NoSQL databases with a set of one billion file block hashes. v THIS PAGE INTENTIONALLY LEFT BLANK vi Table of Contents List of Acronyms and...Operating System NOOP No Operation assembly instruction NoSQL “Not only SQL” model for non-relational database management NSRL National Software