Active Learning by Querying Informative and Representative Examples.
Huang, Sheng-Jun; Jin, Rong; Zhou, Zhi-Hua
2014-10-01
Active learning reduces the labeling cost by iteratively selecting the most valuable data to query their labels. It has attracted a lot of interests given the abundance of unlabeled data and the high cost of labeling. Most active learning approaches select either informative or representative unlabeled instances to query their labels, which could significantly limit their performance. Although several active learning algorithms were proposed to combine the two query selection criteria, they are usually ad hoc in finding unlabeled instances that are both informative and representative. We address this limitation by developing a principled approach, termed QUIRE, based on the min-max view of active learning. The proposed approach provides a systematic way for measuring and combining the informativeness and representativeness of an unlabeled instance. Further, by incorporating the correlation among labels, we extend the QUIRE approach to multi-label learning by actively querying instance-label pairs. Extensive experimental results show that the proposed QUIRE approach outperforms several state-of-the-art active learning approaches in both single-label and multi-label learning.
Query construction, entropy, and generalization in neural-network models
NASA Astrophysics Data System (ADS)
Sollich, Peter
1994-05-01
We study query construction algorithms, which aim at improving the generalization ability of systems that learn from examples by choosing optimal, nonredundant training sets. We set up a general probabilistic framework for deriving such algorithms from the requirement of optimizing a suitable objective function; specifically, we consider the objective functions entropy (or information gain) and generalization error. For two learning scenarios, the high-low game and the linear perceptron, we evaluate the generalization performance obtained by applying the corresponding query construction algorithms and compare it to training on random examples. We find qualitative differences between the two scenarios due to the different structure of the underlying rules (nonlinear and ``noninvertible'' versus linear); in particular, for the linear perceptron, random examples lead to the same generalization ability as a sequence of queries in the limit of an infinite number of examples. We also investigate learning algorithms which are ill matched to the learning environment and find that, in this case, minimum entropy queries can in fact yield a lower generalization ability than random examples. Finally, we study the efficiency of single queries and its dependence on the learning history, i.e., on whether the previous training examples were generated randomly or by querying, and the difference between globally and locally optimal query construction.
Collaborative Supervised Learning for Sensor Networks
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri L.; Rebbapragada, Umaa; Lane, Terran
2011-01-01
Collaboration methods for distributed machine-learning algorithms involve the specification of communication protocols for the learners, which can query other learners and/or broadcast their findings preemptively. Each learner incorporates information from its neighbors into its own training set, and they are thereby able to bootstrap each other to higher performance. Each learner resides at a different node in the sensor network and makes observations (collects data) independently of the other learners. After being seeded with an initial labeled training set, each learner proceeds to learn in an iterative fashion. New data is collected and classified. The learner can then either broadcast its most confident classifications for use by other learners, or can query neighbors for their classifications of its least confident items. As such, collaborative learning combines elements of both passive (broadcast) and active (query) learning. It also uses ideas from ensemble learning to combine the multiple responses to a given query into a single useful label. This approach has been evaluated against current non-collaborative alternatives, including training a single classifier and deploying it at all nodes with no further learning possible, and permitting learners to learn from their own most confident judgments, absent interaction with their neighbors. On several data sets, it has been consistently found that active collaboration is the best strategy for a distributed learner network. The main advantages include the ability for learning to take place autonomously by collaboration rather than by requiring intervention from an oracle (usually human), and also the ability to learn in a distributed environment, permitting decisions to be made in situ and to yield faster response time.
Fast Nonparametric Machine Learning Algorithms for High-Dimensional Massive Data and Applications
2006-03-01
know the probability of that from Lemma 2. Using the union bound, we know that for any query q, the probability that i-am-feeling-lucky search algorithm...and each point in a d-dimensional space, a naive k-NN search needs to do a linear scan of T for every single query q, and thus the computational time...algorithm based on partition trees with priority search , and give an expected query time O((1/)d log n). But the constant in the O((1/)d log n
The role of economics in the QUERI program: QUERI Series
Smith, Mark W; Barnett, Paul G
2008-01-01
Background The United States (U.S.) Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI) has implemented economic analyses in single-site and multi-site clinical trials. To date, no one has reviewed whether the QUERI Centers are taking an optimal approach to doing so. Consistent with the continuous learning culture of the QUERI Program, this paper provides such a reflection. Methods We present a case study of QUERI as an example of how economic considerations can and should be integrated into implementation research within both single and multi-site studies. We review theoretical and applied cost research in implementation studies outside and within VA. We also present a critique of the use of economic research within the QUERI program. Results Economic evaluation is a key element of implementation research. QUERI has contributed many developments in the field of implementation but has only recently begun multi-site implementation trials across multiple regions within the national VA healthcare system. These trials are unusual in their emphasis on developing detailed costs of implementation, as well as in the use of business case analyses (budget impact analyses). Conclusion Economics appears to play an important role in QUERI implementation studies, only after implementation has reached the stage of multi-site trials. Economic analysis could better inform the choice of which clinical best practices to implement and the choice of implementation interventions to employ. QUERI economics also would benefit from research on costing methods and development of widely accepted international standards for implementation economics. PMID:18430199
The role of economics in the QUERI program: QUERI Series.
Smith, Mark W; Barnett, Paul G
2008-04-22
The United States (U.S.) Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI) has implemented economic analyses in single-site and multi-site clinical trials. To date, no one has reviewed whether the QUERI Centers are taking an optimal approach to doing so. Consistent with the continuous learning culture of the QUERI Program, this paper provides such a reflection. We present a case study of QUERI as an example of how economic considerations can and should be integrated into implementation research within both single and multi-site studies. We review theoretical and applied cost research in implementation studies outside and within VA. We also present a critique of the use of economic research within the QUERI program. Economic evaluation is a key element of implementation research. QUERI has contributed many developments in the field of implementation but has only recently begun multi-site implementation trials across multiple regions within the national VA healthcare system. These trials are unusual in their emphasis on developing detailed costs of implementation, as well as in the use of business case analyses (budget impact analyses). Economics appears to play an important role in QUERI implementation studies, only after implementation has reached the stage of multi-site trials. Economic analysis could better inform the choice of which clinical best practices to implement and the choice of implementation interventions to employ. QUERI economics also would benefit from research on costing methods and development of widely accepted international standards for implementation economics.
Discriminative structural approaches for enzyme active-site prediction.
Kato, Tsuyoshi; Nagano, Nozomi
2011-02-15
Predicting enzyme active-sites in proteins is an important issue not only for protein sciences but also for a variety of practical applications such as drug design. Because enzyme reaction mechanisms are based on the local structures of enzyme active-sites, various template-based methods that compare local structures in proteins have been developed to date. In comparing such local sites, a simple measurement, RMSD, has been used so far. This paper introduces new machine learning algorithms that refine the similarity/deviation for comparison of local structures. The similarity/deviation is applied to two types of applications, single template analysis and multiple template analysis. In the single template analysis, a single template is used as a query to search proteins for active sites, whereas a protein structure is examined as a query to discover the possible active-sites using a set of templates in the multiple template analysis. This paper experimentally illustrates that the machine learning algorithms effectively improve the similarity/deviation measurements for both the analyses.
A "Simple Query Interface" Adapter for the Discovery and Exchange of Learning Resources
ERIC Educational Resources Information Center
Massart, David
2006-01-01
Developed as part of CEN/ISSS Workshop on Learning Technology efforts to improve interoperability between learning resource repositories, the Simple Query Interface (SQI) is an Application Program Interface (API) for querying heterogeneous repositories of learning resource metadata. In the context of the ProLearn Network of Excellence, SQI is used…
Performing private database queries in a real-world environment using a quantum protocol.
Chan, Philip; Lucio-Martinez, Itzel; Mo, Xiaofan; Simon, Christoph; Tittel, Wolfgang
2014-06-10
In the well-studied cryptographic primitive 1-out-of-N oblivious transfer, a user retrieves a single element from a database of size N without the database learning which element was retrieved. While it has previously been shown that a secure implementation of 1-out-of-N oblivious transfer is impossible against arbitrarily powerful adversaries, recent research has revealed an interesting class of private query protocols based on quantum mechanics in a cheat sensitive model. Specifically, a practical protocol does not need to guarantee that the database provider cannot learn what element was retrieved if doing so carries the risk of detection. The latter is sufficient motivation to keep a database provider honest. However, none of the previously proposed protocols could cope with noisy channels. Here we present a fault-tolerant private query protocol, in which the novel error correction procedure is integral to the security of the protocol. Furthermore, we present a proof-of-concept demonstration of the protocol over a deployed fibre.
Performing private database queries in a real-world environment using a quantum protocol
Chan, Philip; Lucio-Martinez, Itzel; Mo, Xiaofan; Simon, Christoph; Tittel, Wolfgang
2014-01-01
In the well-studied cryptographic primitive 1-out-of-N oblivious transfer, a user retrieves a single element from a database of size N without the database learning which element was retrieved. While it has previously been shown that a secure implementation of 1-out-of-N oblivious transfer is impossible against arbitrarily powerful adversaries, recent research has revealed an interesting class of private query protocols based on quantum mechanics in a cheat sensitive model. Specifically, a practical protocol does not need to guarantee that the database provider cannot learn what element was retrieved if doing so carries the risk of detection. The latter is sufficient motivation to keep a database provider honest. However, none of the previously proposed protocols could cope with noisy channels. Here we present a fault-tolerant private query protocol, in which the novel error correction procedure is integral to the security of the protocol. Furthermore, we present a proof-of-concept demonstration of the protocol over a deployed fibre. PMID:24913129
Generalized query-based active learning to identify differentially methylated regions in DNA.
Haque, Md Muksitul; Holder, Lawrence B; Skinner, Michael K; Cook, Diane J
2013-01-01
Active learning is a supervised learning technique that reduces the number of examples required for building a successful classifier, because it can choose the data it learns from. This technique holds promise for many biological domains in which classified examples are expensive and time-consuming to obtain. Most traditional active learning methods ask very specific queries to the Oracle (e.g., a human expert) to label an unlabeled example. The example may consist of numerous features, many of which are irrelevant. Removing such features will create a shorter query with only relevant features, and it will be easier for the Oracle to answer. We propose a generalized query-based active learning (GQAL) approach that constructs generalized queries based on multiple instances. By constructing appropriately generalized queries, we can achieve higher accuracy compared to traditional active learning methods. We apply our active learning method to find differentially DNA methylated regions (DMRs). DMRs are DNA locations in the genome that are known to be involved in tissue differentiation, epigenetic regulation, and disease. We also apply our method on 13 other data sets and show that our method is better than another popular active learning technique.
Learning Extended Finite State Machines
NASA Technical Reports Server (NTRS)
Cassel, Sofia; Howar, Falk; Jonsson, Bengt; Steffen, Bernhard
2014-01-01
We present an active learning algorithm for inferring extended finite state machines (EFSM)s, combining data flow and control behavior. Key to our learning technique is a novel learning model based on so-called tree queries. The learning algorithm uses the tree queries to infer symbolic data constraints on parameters, e.g., sequence numbers, time stamps, identifiers, or even simple arithmetic. We describe sufficient conditions for the properties that the symbolic constraints provided by a tree query in general must have to be usable in our learning model. We have evaluated our algorithm in a black-box scenario, where tree queries are realized through (black-box) testing. Our case studies include connection establishment in TCP and a priority queue from the Java Class Library.
Learning and retention through predictive inference and classification.
Sakamoto, Yasuaki; Love, Bradley C
2010-12-01
Work in category learning addresses how humans acquire knowledge and, thus, should inform classroom practices. In two experiments, we apply and evaluate intuitions garnered from laboratory-based research in category learning to learning tasks situated in an educational context. In Experiment 1, learning through predictive inference and classification were compared for fifth-grade students using class-related materials. Making inferences about properties of category members and receiving feedback led to the acquisition of both queried (i.e., tested) properties and nonqueried properties that were correlated with a queried property (e.g., even if not queried, students learned about a species' habitat because it correlated with a queried property, like the species' size). In contrast, classifying items according to their species and receiving feedback led to knowledge of only the property most diagnostic of category membership. After multiple-day delay, the fifth-graders who learned through inference selectively retained information about the queried properties, and the fifth-graders who learned through classification retained information about the diagnostic property, indicating a role for explicit evaluation in establishing memories. Overall, inference learning resulted in fewer errors, better retention, and more liking of the categories than did classification learning. Experiment 2 revealed that querying a property only a few times was enough to manifest the full benefits of inference learning in undergraduate students. These results suggest that classroom teaching should emphasize reasoning from the category to multiple properties rather than from a set of properties to the category. (PsycINFO Database Record (c) 2010 APA, all rights reserved).
Marenco, Luis; Li, Yuli; Martone, Maryann E; Sternberg, Paul W; Shepherd, Gordon M; Miller, Perry L
2008-09-01
This paper describes a pilot query interface that has been constructed to help us explore a "concept-based" approach for searching the Neuroscience Information Framework (NIF). The query interface is concept-based in the sense that the search terms submitted through the interface are selected from a standardized vocabulary of terms (concepts) that are structured in the form of an ontology. The NIF contains three primary resources: the NIF Resource Registry, the NIF Document Archive, and the NIF Database Mediator. These NIF resources are very different in their nature and therefore pose challenges when designing a single interface from which searches can be automatically launched against all three resources simultaneously. The paper first discusses briefly several background issues involving the use of standardized biomedical vocabularies in biomedical information retrieval, and then presents a detailed example that illustrates how the pilot concept-based query interface operates. The paper concludes by discussing certain lessons learned in the development of the current version of the interface.
Li, Yuli; Martone, Maryann E.; Sternberg, Paul W.; Shepherd, Gordon M.; Miller, Perry L.
2009-01-01
This paper describes a pilot query interface that has been constructed to help us explore a “concept-based” approach for searching the Neuroscience Information Framework (NIF). The query interface is concept-based in the sense that the search terms submitted through the interface are selected from a standardized vocabulary of terms (concepts) that are structured in the form of an ontology. The NIF contains three primary resources: the NIF Resource Registry, the NIF Document Archive, and the NIF Database Mediator. These NIF resources are very different in their nature and therefore pose challenges when designing a single interface from which searches can be automatically launched against all three resources simultaneously. The paper first discusses briefly several background issues involving the use of standardized biomedical vocabularies in biomedical information retrieval, and then presents a detailed example that illustrates how the pilot concept-based query interface operates. The paper concludes by discussing certain lessons learned in the development of the current version of the interface. PMID:18953674
Enhanced Approximate Nearest Neighbor via Local Area Focused Search.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gonzales, Antonio; Blazier, Nicholas Paul
Approximate Nearest Neighbor (ANN) algorithms are increasingly important in machine learning, data mining, and image processing applications. There is a large family of space- partitioning ANN algorithms, such as randomized KD-Trees, that work well in practice but are limited by an exponential increase in similarity comparisons required to optimize recall. Additionally, they only support a small set of similarity metrics. We present Local Area Fo- cused Search (LAFS), a method that enhances the way queries are performed using an existing ANN index. Instead of a single query, LAFS performs a number of smaller (fewer similarity comparisons) queries and focuses onmore » a local neighborhood which is refined as candidates are identified. We show that our technique improves performance on several well known datasets and is easily extended to general similarity metrics using kernel projection techniques.« less
Design notes for the next generation persistent object manager for CAP
DOE Office of Scientific and Technical Information (OSTI.GOV)
Isely, M.; Fischler, M.; Galli, M.
1995-05-01
The CAP query system software at Fermilab has several major components, including SQS (for managing the query), the retrieval system (for fetching auxiliary data), and the query software itself. The central query software in particular is essentially a modified version of the `ptool` product created at UIC (University of Illinois at Chicago) as part of the PASS project under Bob Grossman. The original UIC version was designed for use in a single-user non-distributed Unix environment. The Fermi modifications were an attempt to permit multi-user access to a data set distributed over a set of storage nodes. (The hardware is anmore » IBM SP-x system - a cluster of AIX POWER2 nodes with an IBM-proprietary high speed switch interconnect). Since the implementation work of the Fermi-ized ptool, the CAP members have learned quite a bit about the nature of queries and where the current performance bottlenecks exist. This has lead them to design a persistent object manager that will overcome these problems. For backwards compatibility with ptool, the ptool persistent object API will largely be retained, but the implementation will be entirely different.« less
Multi-Bit Quantum Private Query
NASA Astrophysics Data System (ADS)
Shi, Wei-Xu; Liu, Xing-Tong; Wang, Jian; Tang, Chao-Jing
2015-09-01
Most of the existing Quantum Private Queries (QPQ) protocols provide only single-bit queries service, thus have to be repeated several times when more bits are retrieved. Wei et al.'s scheme for block queries requires a high-dimension quantum key distribution system to sustain, which is still restricted in the laboratory. Here, based on Markus Jakobi et al.'s single-bit QPQ protocol, we propose a multi-bit quantum private query protocol, in which the user can get access to several bits within one single query. We also extend the proposed protocol to block queries, using a binary matrix to guard database security. Analysis in this paper shows that our protocol has better communication complexity, implementability and can achieve a considerable level of security.
ERIC Educational Resources Information Center
Piyayodilokchai, Hongsiri; Panjaburee, Patcharin; Laosinchai, Parames; Ketpichainarong, Watcharee; Ruenwongsa, Pintip
2013-01-01
With the benefit of multimedia and the learning cycle approach in promoting effective active learning, this paper proposed a learning cycle approach-based, multimedia-supplemented instructional unit for Structured Query Language (SQL) for second-year undergraduate students with the aim of enhancing their basic knowledge of SQL and ability to apply…
Del Fiol, Guilherme; Michelson, Matthew; Iorio, Alfonso; Cotoi, Chris; Haynes, R Brian
2018-06-25
A major barrier to the practice of evidence-based medicine is efficiently finding scientifically sound studies on a given clinical topic. To investigate a deep learning approach to retrieve scientifically sound treatment studies from the biomedical literature. We trained a Convolutional Neural Network using a noisy dataset of 403,216 PubMed citations with title and abstract as features. The deep learning model was compared with state-of-the-art search filters, such as PubMed's Clinical Query Broad treatment filter, McMaster's textword search strategy (no Medical Subject Heading, MeSH, terms), and Clinical Query Balanced treatment filter. A previously annotated dataset (Clinical Hedges) was used as the gold standard. The deep learning model obtained significantly lower recall than the Clinical Queries Broad treatment filter (96.9% vs 98.4%; P<.001); and equivalent recall to McMaster's textword search (96.9% vs 97.1%; P=.57) and Clinical Queries Balanced filter (96.9% vs 97.0%; P=.63). Deep learning obtained significantly higher precision than the Clinical Queries Broad filter (34.6% vs 22.4%; P<.001) and McMaster's textword search (34.6% vs 11.8%; P<.001), but was significantly lower than the Clinical Queries Balanced filter (34.6% vs 40.9%; P<.001). Deep learning performed well compared to state-of-the-art search filters, especially when citations were not indexed. Unlike previous machine learning approaches, the proposed deep learning model does not require feature engineering, or time-sensitive or proprietary features, such as MeSH terms and bibliometrics. Deep learning is a promising approach to identifying reports of scientifically rigorous clinical research. Further work is needed to optimize the deep learning model and to assess generalizability to other areas, such as diagnosis, etiology, and prognosis. ©Guilherme Del Fiol, Matthew Michelson, Alfonso Iorio, Chris Cotoi, R Brian Haynes. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 25.06.2018.
Learning of Multimodal Representations With Random Walks on the Click Graph.
Wu, Fei; Lu, Xinyan; Song, Jun; Yan, Shuicheng; Zhang, Zhongfei Mark; Rui, Yong; Zhuang, Yueting
2016-02-01
In multimedia information retrieval, most classic approaches tend to represent different modalities of media in the same feature space. With the click data collected from the users' searching behavior, existing approaches take either one-to-one paired data (text-image pairs) or ranking examples (text-query-image and/or image-query-text ranking lists) as training examples, which do not make full use of the click data, particularly the implicit connections among the data objects. In this paper, we treat the click data as a large click graph, in which vertices are images/text queries and edges indicate the clicks between an image and a query. We consider learning a multimodal representation from the perspective of encoding the explicit/implicit relevance relationship between the vertices in the click graph. By minimizing both the truncated random walk loss as well as the distance between the learned representation of vertices and their corresponding deep neural network output, the proposed model which is named multimodal random walk neural network (MRW-NN) can be applied to not only learn robust representation of the existing multimodal data in the click graph, but also deal with the unseen queries and images to support cross-modal retrieval. We evaluate the latent representation learned by MRW-NN on a public large-scale click log data set Clickture and further show that MRW-NN achieves much better cross-modal retrieval performance on the unseen queries/images than the other state-of-the-art methods.
Query-Biased Preview over Outsourced and Encrypted Data
Luo, Guangchun; Qin, Ke; Chen, Aiguo
2013-01-01
For both convenience and security, more and more users encrypt their sensitive data before outsourcing it to a third party such as cloud storage service. However, searching for the desired documents becomes problematic since it is costly to download and decrypt each possibly needed document to check if it contains the desired content. An informative query-biased preview feature, as applied in modern search engine, could help the users to learn about the content without downloading the entire document. However, when the data are encrypted, securely extracting a keyword-in-context snippet from the data as a preview becomes a challenge. Based on private information retrieval protocol and the core concept of searchable encryption, we propose a single-server and two-round solution to securely obtain a query-biased snippet over the encrypted data from the server. We achieve this novel result by making a document (plaintext) previewable under any cryptosystem and constructing a secure index to support dynamic computation for a best matched snippet when queried by some keywords. For each document, the scheme has O(d) storage complexity and O(log(d/s) + s + d/s) communication complexity, where d is the document size and s is the snippet length. PMID:24078798
Query-biased preview over outsourced and encrypted data.
Peng, Ningduo; Luo, Guangchun; Qin, Ke; Chen, Aiguo
2013-01-01
For both convenience and security, more and more users encrypt their sensitive data before outsourcing it to a third party such as cloud storage service. However, searching for the desired documents becomes problematic since it is costly to download and decrypt each possibly needed document to check if it contains the desired content. An informative query-biased preview feature, as applied in modern search engine, could help the users to learn about the content without downloading the entire document. However, when the data are encrypted, securely extracting a keyword-in-context snippet from the data as a preview becomes a challenge. Based on private information retrieval protocol and the core concept of searchable encryption, we propose a single-server and two-round solution to securely obtain a query-biased snippet over the encrypted data from the server. We achieve this novel result by making a document (plaintext) previewable under any cryptosystem and constructing a secure index to support dynamic computation for a best matched snippet when queried by some keywords. For each document, the scheme has O(d) storage complexity and O(log(d/s) + s + d/s) communication complexity, where d is the document size and s is the snippet length.
Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS.
Yu, Hwanjo; Kim, Taehoon; Oh, Jinoh; Ko, Ilhwan; Kim, Sungchul; Han, Wook-Shin
2010-04-16
Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without integrated with the keyword queries, and the users have to provide a large amount of training documents to get a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and a multi-level relevance feedback in real time on PubMed. RefMed supports a multi-level relevance feedback by using the RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into RDBMS to support both keyword queries and the multi-level relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. RefMed is the first multi-level relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user's feedback and efficiently processes the function to return relevant articles in real time.
Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS
2010-01-01
Background Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without integrated with the keyword queries, and the users have to provide a large amount of training documents to get a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and a multi-level relevance feedback in real time on PubMed. Results RefMed supports a multi-level relevance feedback by using the RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into RDBMS to support both keyword queries and the multi-level relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. Conclusions RefMed is the first multi-level relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user’s feedback and efficiently processes the function to return relevant articles in real time. PMID:20406504
Enhancing Learning Outcomes in Computer-Based Training via Self-Generated Elaboration
ERIC Educational Resources Information Center
Cuevas, Haydee M.; Fiore, Stephen M.
2014-01-01
The present study investigated the utility of an instructional strategy known as the "query method" for enhancing learning outcomes in computer-based training. The query method involves an embedded guided, sentence generation task requiring elaboration of key concepts in the training material that encourages learners to "stop and…
NASA Astrophysics Data System (ADS)
Warren, Z.; Shahriar, M. S.; Tripathi, R.; Pati, G. S.
2018-02-01
A repeated query technique has been demonstrated as a new interrogation method in pulsed coherent population trapping for producing single-peaked Ramsey interference with high contrast. This technique enhances the contrast of the central Ramsey fringe by nearly 1.5 times and significantly suppresses the side fringes by using more query pulses ( >10) in the pulse cycle. Theoretical models have been developed to simulate Ramsey interference and analyze the characteristics of the Ramsey spectrum produced by the repeated query technique. Experiments have also been carried out employing a repeated query technique in a prototype rubidium clock to study its frequency stability performance.
Multiple Query Evaluation Based on an Enhanced Genetic Algorithm.
ERIC Educational Resources Information Center
Tamine, Lynda; Chrisment, Claude; Boughanem, Mohand
2003-01-01
Explains the use of genetic algorithms to combine results from multiple query evaluations to improve relevance in information retrieval. Discusses niching techniques, relevance feedback techniques, and evolution heuristics, and compares retrieval results obtained by both genetic multiple query evaluation and classical single query evaluation…
Recommender System for Learning SQL Using Hints
ERIC Educational Resources Information Center
Lavbic, Dejan; Matek, Tadej; Zrnec, Aljaž
2017-01-01
Today's software industry requires individuals who are proficient in as many programming languages as possible. Structured query language (SQL), as an adopted standard, is no exception, as it is the most widely used query language to retrieve and manipulate data. However, the process of learning SQL turns out to be challenging. The need for a…
Querying Proofs (Work in Progress)
NASA Technical Reports Server (NTRS)
Aspinall, David; Denney, Ewen; Lueth, Christoph
2011-01-01
We motivate and introduce the basis for a query language designed for inspecting electronic representations of proofs. We argue that there is much to learn from large proofs beyond their validity, and that a dedicated query language can provide a principled way of implementing a family of useful operations.
2014-11-01
created to serve as idealized representations of actual medical records, and include information such as medical history , current symptoms, diagnosis...NLM Medical Text Indexer (MTI).3 MeSH, or Medical Subject Headings, are terminology used by the NLM to index articles, catalog books, and searching...MeSH- indexed databases such as PubMed. However, since many medical conditions may be expressed in varying terminology , a single representation of a
Graham, Ian D; Tetroe, Jacqueline
2009-01-01
As the recent collection of papers from the Quality Enhancement Research Initiative (QUERI) Series indicates, knowledge is leading to considerable action in the United States (U.S.) Department of Veterans Affairs (VA). The QUERI Series offers clinical researchers, implementation scientists, health systems, and health research funders from around the globe a unique window into the both the practice and science of implementation or knowledge translation (KT) in the VA. By describing successes and challenges as well as setbacks and disappointments, the QUERI Series is all the more useful. From the vantage point of Canadian KT researchers and officials at a national health research funding agency, we offer a number of observations and lessons that can be learned from QUERI. "Knowledge, if it does not determine action, is dead to us." Plotinus (Roman philosopher 205AD-270AD) PMID:19267920
Cross-modal learning to rank via latent joint representation.
Wu, Fei; Jiang, Xinyang; Li, Xi; Tang, Siliang; Lu, Weiming; Zhang, Zhongfei; Zhuang, Yueting
2015-05-01
Cross-modal ranking is a research topic that is imperative to many applications involving multimodal data. Discovering a joint representation for multimodal data and learning a ranking function are essential in order to boost the cross-media retrieval (i.e., image-query-text or text-query-image). In this paper, we propose an approach to discover the latent joint representation of pairs of multimodal data (e.g., pairs of an image query and a text document) via a conditional random field and structural learning in a listwise ranking manner. We call this approach cross-modal learning to rank via latent joint representation (CML²R). In CML²R, the correlations between multimodal data are captured in terms of their sharing hidden variables (e.g., topics), and a hidden-topic-driven discriminative ranking function is learned in a listwise ranking manner. The experiments show that the proposed approach achieves a good performance in cross-media retrieval and meanwhile has the capability to learn the discriminative representation of multimodal data.
Distributed query plan generation using multiobjective genetic algorithm.
Panicker, Shina; Kumar, T V Vijay
2014-01-01
A distributed query processing strategy, which is a key performance determinant in accessing distributed databases, aims to minimize the total query processing cost. One way to achieve this is by generating efficient distributed query plans that involve fewer sites for processing a query. In the case of distributed relational databases, the number of possible query plans increases exponentially with respect to the number of relations accessed by the query and the number of sites where these relations reside. Consequently, computing optimal distributed query plans becomes a complex problem. This distributed query plan generation (DQPG) problem has already been addressed using single objective genetic algorithm, where the objective is to minimize the total query processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC). In this paper, this DQPG problem is formulated and solved as a biobjective optimization problem with the two objectives being minimize total LPC and minimize total CC. These objectives are simultaneously optimized using a multiobjective genetic algorithm NSGA-II. Experimental comparison of the proposed NSGA-II based DQPG algorithm with the single objective genetic algorithm shows that the former performs comparatively better and converges quickly towards optimal solutions for an observed crossover and mutation probability.
Distributed Query Plan Generation Using Multiobjective Genetic Algorithm
Panicker, Shina; Vijay Kumar, T. V.
2014-01-01
A distributed query processing strategy, which is a key performance determinant in accessing distributed databases, aims to minimize the total query processing cost. One way to achieve this is by generating efficient distributed query plans that involve fewer sites for processing a query. In the case of distributed relational databases, the number of possible query plans increases exponentially with respect to the number of relations accessed by the query and the number of sites where these relations reside. Consequently, computing optimal distributed query plans becomes a complex problem. This distributed query plan generation (DQPG) problem has already been addressed using single objective genetic algorithm, where the objective is to minimize the total query processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC). In this paper, this DQPG problem is formulated and solved as a biobjective optimization problem with the two objectives being minimize total LPC and minimize total CC. These objectives are simultaneously optimized using a multiobjective genetic algorithm NSGA-II. Experimental comparison of the proposed NSGA-II based DQPG algorithm with the single objective genetic algorithm shows that the former performs comparatively better and converges quickly towards optimal solutions for an observed crossover and mutation probability. PMID:24963513
Flexible Querying of Lifelong Learner Metadata
ERIC Educational Resources Information Center
Poulovassilis, A.; Selmer, P.; Wood, P. T.
2012-01-01
This paper discusses the provision of flexible querying facilities over heterogeneous data arising from lifelong learners' educational and work experiences. A key aim of such querying facilities is to allow learners to identify possible choices for their future learning and professional development by seeing what others have done. We motivate and…
Evaluation methodology for query-based scene understanding systems
NASA Astrophysics Data System (ADS)
Huster, Todd P.; Ross, Timothy D.; Culbertson, Jared L.
2015-05-01
In this paper, we are proposing a method for the principled evaluation of scene understanding systems in a query-based framework. We can think of a query-based scene understanding system as a generalization of typical sensor exploitation systems where instead of performing a narrowly defined task (e.g., detect, track, classify, etc.), the system can perform general user-defined tasks specified in a query language. Examples of this type of system have been developed as part of DARPA's Mathematics of Sensing, Exploitation, and Execution (MSEE) program. There is a body of literature on the evaluation of typical sensor exploitation systems, but the open-ended nature of the query interface introduces new aspects to the evaluation problem that have not been widely considered before. In this paper, we state the evaluation problem and propose an approach to efficiently learn about the quality of the system under test. We consider the objective of the evaluation to be to build a performance model of the system under test, and we rely on the principles of Bayesian experiment design to help construct and select optimal queries for learning about the parameters of that model.
Accelerating Research Impact in a Learning Health Care System
Elwy, A. Rani; Sales, Anne E.; Atkins, David
2017-01-01
Background: Since 1998, the Veterans Health Administration (VHA) Quality Enhancement Research Initiative (QUERI) has supported more rapid implementation of research into clinical practice. Objectives: With the passage of the Veterans Access, Choice and Accountability Act of 2014 (Choice Act), QUERI further evolved to support VHA’s transformation into a Learning Health Care System by aligning science with clinical priority goals based on a strategic planning process and alignment of funding priorities with updated VHA priority goals in response to the Choice Act. Design: QUERI updated its strategic goals in response to independent assessments mandated by the Choice Act that recommended VHA reduce variation in care by providing a clear path to implement best practices. Specifically, QUERI updated its application process to ensure its centers (Programs) focus on cross-cutting VHA priorities and specify roadmaps for implementation of research-informed practices across different settings. QUERI also increased funding for scientific evaluations of the Choice Act and other policies in response to Commission on Care recommendations. Results: QUERI’s national network of Programs deploys effective practices using implementation strategies across different settings. QUERI Choice Act evaluations informed the law’s further implementation, setting the stage for additional rigorous national evaluations of other VHA programs and policies including community provider networks. Conclusions: Grounded in implementation science and evidence-based policy, QUERI serves as an example of how to operationalize core components of a Learning Health Care System, notably through rigorous evaluation and scientific testing of implementation strategies to ultimately reduce variation in quality and improve overall population health. PMID:27997456
StarView: The object oriented design of the ST DADS user interface
NASA Technical Reports Server (NTRS)
Williams, J. D.; Pollizzi, J. A.
1992-01-01
StarView is the user interface being developed for the Hubble Space Telescope Data Archive and Distribution Service (ST DADS). ST DADS is the data archive for HST observations and a relational database catalog describing the archived data. Users will use StarView to query the catalog and select appropriate datasets for study. StarView sends requests for archived datasets to ST DADS which processes the requests and returns the database to the user. StarView is designed to be a powerful and extensible user interface. Unique features include an internal relational database to navigate query results, a form definition language that will work with both CRT and X interfaces, a data definition language that will allow StarView to work with any relational database, and the ability to generate adhoc queries without requiring the user to understand the structure of the ST DADS catalog. Ultimately, StarView will allow the user to refine queries in the local database for improved performance and merge in data from external sources for correlation with other query results. The user will be able to create a query from single or multiple forms, merging the selected attributes into a single query. Arbitrary selection of attributes for querying is supported. The user will be able to select how query results are viewed. A standard form or table-row format may be used. Navigation capabilities are provided to aid the user in viewing query results. Object oriented analysis and design techniques were used in the design of StarView to support the mechanisms and concepts required to implement these features. One such mechanism is the Model-View-Controller (MVC) paradigm. The MVC allows the user to have multiple views of the underlying database, while providing a consistent mechanism for interaction regardless of the view. This approach supports both CRT and X interfaces while providing a common mode of user interaction. Another powerful abstraction is the concept of a Query Model. This concept allows a single query to be built form a single or multiple forms before it is submitted to ST DADS. Supporting this concept is the adhoc query generator which allows the user to select and qualify an indeterminate number attributes from the database. The user does not need any knowledge of how the joins across various tables are to be resolved. The adhoc generator calculates the joins automatically and generates the correct SQL query.
Query-based learning for aerospace applications.
Saad, E W; Choi, J J; Vian, J L; Wunsch, D C Ii
2003-01-01
Models of real-world applications often include a large number of parameters with a wide dynamic range, which contributes to the difficulties of neural network training. Creating the training data set for such applications becomes costly, if not impossible. In order to overcome the challenge, one can employ an active learning technique known as query-based learning (QBL) to add performance-critical data to the training set during the learning phase, thereby efficiently improving the overall learning/generalization. The performance-critical data can be obtained using an inverse mapping called network inversion (discrete network inversion and continuous network inversion) followed by oracle query. This paper investigates the use of both inversion techniques for QBL learning, and introduces an original heuristic to select the inversion target values for continuous network inversion method. Efficiency and generalization was further enhanced by employing node decoupled extended Kalman filter (NDEKF) training and a causality index (CI) as a means to reduce the input search dimensionality. The benefits of the overall QBL approach are experimentally demonstrated in two aerospace applications: a classification problem with large input space and a control distribution problem.
Cumulative query method for influenza surveillance using search engine data.
Seo, Dong-Woo; Jo, Min-Woo; Sohn, Chang Hwan; Shin, Soo-Yong; Lee, JaeHo; Yu, Maengsoo; Kim, Won Young; Lim, Kyoung Soo; Lee, Sang-Il
2014-12-16
Internet search queries have become an important data source in syndromic surveillance system. However, there is currently no syndromic surveillance system using Internet search query data in South Korea. The objective of this study was to examine correlations between our cumulative query method and national influenza surveillance data. Our study was based on the local search engine, Daum (approximately 25% market share), and influenza-like illness (ILI) data from the Korea Centers for Disease Control and Prevention. A quota sampling survey was conducted with 200 participants to obtain popular queries. We divided the study period into two sets: Set 1 (the 2009/10 epidemiological year for development set 1 and 2010/11 for validation set 1) and Set 2 (2010/11 for development Set 2 and 2011/12 for validation Set 2). Pearson's correlation coefficients were calculated between the Daum data and the ILI data for the development set. We selected the combined queries for which the correlation coefficients were .7 or higher and listed them in descending order. Then, we created a cumulative query method n representing the number of cumulative combined queries in descending order of the correlation coefficient. In validation set 1, 13 cumulative query methods were applied, and 8 had higher correlation coefficients (min=.916, max=.943) than that of the highest single combined query. Further, 11 of 13 cumulative query methods had an r value of ≥.7, but 4 of 13 combined queries had an r value of ≥.7. In validation set 2, 8 of 15 cumulative query methods showed higher correlation coefficients (min=.975, max=.987) than that of the highest single combined query. All 15 cumulative query methods had an r value of ≥.7, but 6 of 15 combined queries had an r value of ≥.7. Cumulative query method showed relatively higher correlation with national influenza surveillance data than combined queries in the development and validation set.
Sparks, Rachel; Madabhushi, Anant
2016-01-01
Content-based image retrieval (CBIR) retrieves database images most similar to the query image by (1) extracting quantitative image descriptors and (2) calculating similarity between database and query image descriptors. Recently, manifold learning (ML) has been used to perform CBIR in a low dimensional representation of the high dimensional image descriptor space to avoid the curse of dimensionality. ML schemes are computationally expensive, requiring an eigenvalue decomposition (EVD) for every new query image to learn its low dimensional representation. We present out-of-sample extrapolation utilizing semi-supervised ML (OSE-SSL) to learn the low dimensional representation without recomputing the EVD for each query image. OSE-SSL incorporates semantic information, partial class label, into a ML scheme such that the low dimensional representation co-localizes semantically similar images. In the context of prostate histopathology, gland morphology is an integral component of the Gleason score which enables discrimination between prostate cancer aggressiveness. Images are represented by shape features extracted from the prostate gland. CBIR with OSE-SSL for prostate histology obtained from 58 patient studies, yielded an area under the precision recall curve (AUPRC) of 0.53 ± 0.03 comparatively a CBIR with Principal Component Analysis (PCA) to learn a low dimensional space yielded an AUPRC of 0.44 ± 0.01. PMID:27264985
Improving accuracy for identifying related PubMed queries by an integrated approach.
Lu, Zhiyong; Wilbur, W John
2009-10-01
PubMed is the most widely used tool for searching biomedical literature online. As with many other online search tools, a user often types a series of multiple related queries before retrieving satisfactory results to fulfill a single information need. Meanwhile, it is also a common phenomenon to see a user type queries on unrelated topics in a single session. In order to study PubMed users' search strategies, it is necessary to be able to automatically separate unrelated queries and group together related queries. Here, we report a novel approach combining both lexical and contextual analyses for segmenting PubMed query sessions and identifying related queries and compare its performance with the previous approach based solely on concept mapping. We experimented with our integrated approach on sample data consisting of 1539 pairs of consecutive user queries in 351 user sessions. The prediction results of 1396 pairs agreed with the gold-standard annotations, achieving an overall accuracy of 90.7%. This demonstrates that our approach is significantly better than the previously published method. By applying this approach to a one day query log of PubMed, we found that a significant proportion of information needs involved more than one PubMed query, and that most of the consecutive queries for the same information need are lexically related. Finally, the proposed PubMed distance is shown to be an accurate and meaningful measure for determining the contextual similarity between biological terms. The integrated approach can play a critical role in handling real-world PubMed query log data as is demonstrated in our experiments.
Improving accuracy for identifying related PubMed queries by an integrated approach
Lu, Zhiyong; Wilbur, W. John
2009-01-01
PubMed is the most widely used tool for searching biomedical literature online. As with many other online search tools, a user often types a series of multiple related queries before retrieving satisfactory results to fulfill a single information need. Meanwhile, it is also a common phenomenon to see a user type queries on unrelated topics in a single session. In order to study PubMed users’ search strategies, it is necessary to be able to automatically separate unrelated queries and group together related queries. Here, we report a novel approach combining both lexical and contextual analyses for segmenting PubMed query sessions and identifying related queries and compare its performance with the previous approach based solely on concept mapping. We experimented with our integrated approach on sample data consisting of 1,539 pairs of consecutive user queries in 351 user sessions. The prediction results of 1,396 pairs agreed with the gold-standard annotations, achieving an overall accuracy of 90.7%. This demonstrates that our approach is significantly better than the previously published method. By applying this approach to a one day query log of PubMed, we found that a significant proportion of information needs involved more than one PubMed query, and that most of the consecutive queries for the same information need are lexically related. Finally, the proposed PubMed distance is shown to be an accurate and meaningful measure for determining the contextual similarity between biological terms. The integrated approach can play a critical role in handling real-world PubMed query log data as is demonstrated in our experiments. PMID:19162232
A Framework for WWW Query Processing
NASA Technical Reports Server (NTRS)
Wu, Binghui Helen; Wharton, Stephen (Technical Monitor)
2000-01-01
Query processing is the most common operation in a DBMS. Sophisticated query processing has been mainly targeted at a single enterprise environment providing centralized control over data and metadata. Submitting queries by anonymous users on the web is different in such a way that load balancing or DBMS' accessing control becomes the key issue. This paper provides a solution by introducing a framework for WWW query processing. The success of this framework lies in the utilization of query optimization techniques and the ontological approach. This methodology has proved to be cost effective at the NASA Goddard Space Flight Center Distributed Active Archive Center (GDAAC).
Cross-domain active learning for video concept detection
NASA Astrophysics Data System (ADS)
Li, Huan; Li, Chao; Shi, Yuan; Xiong, Zhang; Hauptmann, Alexander G.
2011-08-01
As video data from a variety of different domains (e.g., news, documentaries, entertainment) have distinctive data distributions, cross-domain video concept detection becomes an important task, in which one can reuse the labeled data of one domain to benefit the learning task in another domain with insufficient labeled data. In this paper, we approach this problem by proposing a cross-domain active learning method which iteratively queries labels of the most informative samples in the target domain. Traditional active learning assumes that the training (source domain) and test data (target domain) are from the same distribution. However, it may fail when the two domains have different distributions because querying informative samples according to a base learner that initially learned from source domain may no longer be helpful for the target domain. In our paper, we use the Gaussian random field model as the base learner which has the advantage of exploring the distributions in both domains, and adopt uncertainty sampling as the query strategy. Additionally, we present an instance weighting trick to accelerate the adaptability of the base learner, and develop an efficient model updating method which can significantly speed up the active learning process. Experimental results on TRECVID collections highlight the effectiveness.
muBLASTP: database-indexed protein sequence search on multicore CPUs.
Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun
2016-11-04
The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used into database indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers identical hits returned to NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedups for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for protein database and associated optimizations in BLASTP algorithm, we re-factored BLASTP algorithm for modern multicore processors that achieves much higher throughput with acceptable memory footprint for the database index.
Agent-Based Framework for Discrete Entity Simulations
2006-11-01
Postgres database server for environment queries of neighbors and continuum data. As expected for raw database queries (no database optimizations in...form. Eventually the code was ported to GNU C++ on the same single Intel Pentium 4 CPU running RedHat Linux 9.0 and Postgres database server...Again Postgres was used for environmental queries, and the tool remained relatively slow because of the immense number of queries necessary to assess
Effective Multi-Query Expansions: Collaborative Deep Networks for Robust Landmark Retrieval.
Wang, Yang; Lin, Xuemin; Wu, Lin; Zhang, Wenjie
2017-03-01
Given a query photo issued by a user (q-user), the landmark retrieval is to return a set of photos with their landmarks similar to those of the query, while the existing studies on the landmark retrieval focus on exploiting geometries of landmarks for similarity matches between candidate photos and a query photo. We observe that the same landmarks provided by different users over social media community may convey different geometry information depending on the viewpoints and/or angles, and may, subsequently, yield very different results. In fact, dealing with the landmarks with low quality shapes caused by the photography of q-users is often nontrivial and has seldom been studied. In this paper, we propose a novel framework, namely, multi-query expansions, to retrieve semantically robust landmarks by two steps. First, we identify the top- k photos regarding the latent topics of a query landmark to construct multi-query set so as to remedy its possible low quality shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Then, motivated by the typical collaborative filtering methods, we propose to learn a collaborative deep networks-based semantically, nonlinear, and high-level features over the latent factor for landmark photo as the training set, which is formed by matrix factorization over collaborative user-photo matrix regarding the multi-query set. The learned deep network is further applied to generate the features for all the other photos, meanwhile resulting into a compact multi-query set within such space. Then, the final ranking scores are calculated over the high-level feature space between the multi-query set and all other photos, which are ranked to serve as the final ranking list of landmark retrieval. Extensive experiments are conducted on real-world social media data with both landmark photos together with their user information to show the superior performance over the existing methods, especially our recently proposed multi-query based mid-level pattern representation method [1].
Bidirectional Active Learning: A Two-Way Exploration Into Unlabeled and Labeled Data Set.
Zhang, Xiao-Yu; Wang, Shupeng; Yun, Xiaochun
2015-12-01
In practical machine learning applications, human instruction is indispensable for model construction. To utilize the precious labeling effort effectively, active learning queries the user with selective sampling in an interactive way. Traditional active learning techniques merely focus on the unlabeled data set under a unidirectional exploration framework and suffer from model deterioration in the presence of noise. To address this problem, this paper proposes a novel bidirectional active learning algorithm that explores into both unlabeled and labeled data sets simultaneously in a two-way process. For the acquisition of new knowledge, forward learning queries the most informative instances from unlabeled data set. For the introspection of learned knowledge, backward learning detects the most suspiciously unreliable instances within the labeled data set. Under the two-way exploration framework, the generalization ability of the learning model can be greatly improved, which is demonstrated by the encouraging experimental results.
A proposal of fuzzy connective with learning function and its application to fuzzy retrieval system
NASA Technical Reports Server (NTRS)
Hayashi, Isao; Naito, Eiichi; Ozawa, Jun; Wakami, Noboru
1993-01-01
A new fuzzy connective and a structure of network constructed by fuzzy connectives are proposed to overcome a drawback of conventional fuzzy retrieval systems. This network represents a retrieval query and the fuzzy connectives in networks have a learning function to adjust its parameters by data from a database and outputs of a user. The fuzzy retrieval systems employing this network are also constructed. Users can retrieve results even with a query whose attributes do not exist in a database schema and can get satisfactory results for variety of thinkings by learning function.
Autocorrelation and Regularization of Query-Based Information Retrieval Scores
2008-02-01
of the most general information retrieval models [ Salton , 1968]. By treating a query as a very short document, documents and queries can be rep... Salton , 1971]. In the context of single link hierarchical clustering, Jardine and van Rijsbergen showed that ranking all k clusters and retrieving a...a document about “dogs”, then the system will always miss this document when a user queries “dog”. Salton recognized that a document’s representation
Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval
Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene
2018-01-01
Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379
NASA Astrophysics Data System (ADS)
Liao, S.; Chen, L.; Li, J.; Xiong, W.; Wu, Q.
2015-07-01
Existing spatiotemporal database supports spatiotemporal aggregation query over massive moving objects datasets. Due to the large amounts of data and single-thread processing method, the query speed cannot meet the application requirements. On the other hand, the query efficiency is more sensitive to spatial variation then temporal variation. In this paper, we proposed a spatiotemporal aggregation query method using multi-thread parallel technique based on regional divison and implemented it on the server. Concretely, we divided the spatiotemporal domain into several spatiotemporal cubes, computed spatiotemporal aggregation on all cubes using the technique of multi-thread parallel processing, and then integrated the query results. By testing and analyzing on the real datasets, this method has improved the query speed significantly.
Quantum Private Query Based on Bell State and Single Photons
NASA Astrophysics Data System (ADS)
Gao, Xiang; Chang, Yan; Zhang, Shi-Bin; Yang, Fan; Zhang, Yan
2018-03-01
Quantum private query (QPQ) can protect both user's and database holder's privacy. In this paper, we propose a novel quantum private query protocol based on Bell state and single photons. As far as we know, no one has ever proposed the QPQ based on Bell state. By using the decoherence-free (DF) states, our protocol can resist the collective noise. Besides that, our protocol is a one-way quantum protocol, which can resist the Trojan horse attack and reduce the communication complexity. Our protocol can not only guarantee the participants' privacy but also stand against an external eavesdropper.
Titanbrowse: a new paradigm for access, visualization and analysis of hyperspectral imaging
NASA Astrophysics Data System (ADS)
Penteado, Paulo F.
2016-10-01
Currently there are archives and tools to explore remote sensing imaging, but these lack some functionality needed for hyperspectral imagers: 1) Querying and serving only whole datacubes is not enough, since in each cube there is typically a large variation in observation geometry over the spatial pixels. Thus, often the most useful unit for selecting observations of interest is not a whole cube but rather a single spectrum. 2) Pixel-specific geometric data included in the standard pipelines is calculated at only one point per pixel. Particularly for selections of pixels from many different cubes, or observations near the limb, it is necessary to know the actual extent of each pixel. 3) Database queries need not only metadata, but also by the spectral data. For instance, one query might look for atypical values of some band, or atypical relations between bands, denoting spectral features (such as ratios or differences between bands). 4) There is the need to evaluate arbitrary, dynamically-defined, complex functions of the data (beyond just simple arithmetic operations), both for selection in the queries, and for visualization, to interactively tune the queries to the observations of interest. 5) Making the most useful query for some analysis often requires interactive visualization integrated with data selection and processing, because the user needs to explore how different functions of the data vary over the observations without having to download data and import it into visualization software. 6) Complementary to interactive use, an API allowing programmatic access to the system is needed for systematic data analyses. 7) Direct access to calibrated and georeferenced data, without the need to download data and software and learn to process it.We present titanbrowse, a database, exploration and visualization system for Cassini VIMS observations of Titan, designed to fullfill the aforementioned needs. While it originallly ran on data in the user's computer, we are now developing an online version, so that users do not need to download software and data. The server, which we maintain, processes the queries and communicates the results to the client the user runs. http://ppenteado.net/titanbrowse.
Enhancing user privacy in SARG04-based private database query protocols
NASA Astrophysics Data System (ADS)
Yu, Fang; Qiu, Daowen; Situ, Haozhen; Wang, Xiaoming; Long, Shun
2015-11-01
The well-known SARG04 protocol can be used in a private query application to generate an oblivious key. By usage of the key, the user can retrieve one out of N items from a database without revealing which one he/she is interested in. However, the existing SARG04-based private query protocols are vulnerable to the attacks of faked data from the database since in its canonical form, the SARG04 protocol lacks means for one party to defend attacks from the other. While such attacks can cause significant loss of user privacy, a variant of the SARG04 protocol is proposed in this paper with new mechanisms designed to help the user protect its privacy in private query applications. In the protocol, it is the user who starts the session with the database, trying to learn from it bits of a raw key in an oblivious way. An honesty test is used to detect a cheating database who had transmitted faked data. The whole private query protocol has O( N) communication complexity for conveying at least N encrypted items. Compared with the existing SARG04-based protocols, it is efficient in communication for per-bit learning.
Active Learning with Irrelevant Examples
NASA Technical Reports Server (NTRS)
Mazzoni, Dominic; Wagstaff, Kiri L.; Burl, Michael
2006-01-01
Active learning algorithms attempt to accelerate the learning process by requesting labels for the most informative items first. In real-world problems, however, there may exist unlabeled items that are irrelevant to the user's classification goals. Queries about these points slow down learning because they provide no information about the problem of interest. We have observed that when irrelevant items are present, active learning can perform worse than random selection, requiring more time (queries) to achieve the same level of accuracy. Therefore, we propose a novel approach, Relevance Bias, in which the active learner combines its default selection heuristic with the output of a simultaneously trained relevance classifier to favor items that are likely to be both informative and relevant. In our experiments on a real-world problem and two benchmark datasets, the Relevance Bias approach significantly improved the learning rate of three different active learning approaches.
NASA Astrophysics Data System (ADS)
Yang, YuGuang; Liu, ZhiChao; Chen, XiuBo; Zhou, YiHua; Shi, WeiMin
2017-12-01
Quantum channel noise may cause the user to obtain a wrong answer and thus misunderstand the database holder for existing QKD-based quantum private query (QPQ) protocols. In addition, an outside attacker may conceal his attack by exploiting the channel noise. We propose a new, robust QPQ protocol based on four-qubit decoherence-free (DF) states. In contrast to existing QPQ protocols against channel noise, only an alternative fixed sequence of single-qubit measurements is needed by the user (Alice) to measure the received DF states. This property makes it easy to implement the proposed protocol by exploiting current technologies. Moreover, to retain the advantage of flexible database queries, we reconstruct Alice's measurement operators so that Alice needs only conditioned sequences of single-qubit measurements.
System for Performing Single Query Searches of Heterogeneous and Dispersed Databases
NASA Technical Reports Server (NTRS)
Maluf, David A. (Inventor); Okimura, Takeshi (Inventor); Gurram, Mohana M. (Inventor); Tran, Vu Hoang (Inventor); Knight, Christopher D. (Inventor); Trinh, Anh Ngoc (Inventor)
2017-01-01
The present invention is a distributed computer system of heterogeneous databases joined in an information grid and configured with an Application Programming Interface hardware which includes a search engine component for performing user-structured queries on multiple heterogeneous databases in real time. This invention reduces overhead associated with the impedance mismatch that commonly occurs in heterogeneous database queries.
Supporting ontology-based keyword search over medical databases.
Kementsietsidis, Anastasios; Lim, Lipyeow; Wang, Min
2008-11-06
The proliferation of medical terms poses a number of challenges in the sharing of medical information among different stakeholders. Ontologies are commonly used to establish relationships between different terms, yet their role in querying has not been investigated in detail. In this paper, we study the problem of supporting ontology-based keyword search queries on a database of electronic medical records. We present several approaches to support this type of queries, study the advantages and limitations of each approach, and summarize the lessons learned as best practices.
Mining the SDSS SkyServer SQL queries log
NASA Astrophysics Data System (ADS)
Hirota, Vitor M.; Santos, Rafael; Raddick, Jordan; Thakar, Ani
2016-05-01
SkyServer, the Internet portal for the Sloan Digital Sky Survey (SDSS) astronomic catalog, provides a set of tools that allows data access for astronomers and scientific education. One of SkyServer data access interfaces allows users to enter ad-hoc SQL statements to query the catalog. SkyServer also presents some template queries that can be used as basis for more complex queries. This interface has logged over 330 million queries submitted since 2001. It is expected that analysis of this data can be used to investigate usage patterns, identify potential new classes of queries, find similar queries, etc. and to shed some light on how users interact with the Sloan Digital Sky Survey data and how scientists have adopted the new paradigm of e-Science, which could in turn lead to enhancements on the user interfaces and experience in general. In this paper we review some approaches to SQL query mining, apply the traditional techniques used in the literature and present lessons learned, namely, that the general text mining approach for feature extraction and clustering does not seem to be adequate for this type of data, and, most importantly, we find that this type of analysis can result in very different queries being clustered together.
Coaching the exploration and exploitation in active learning for interactive video retrieval.
Wei, Xiao-Yong; Yang, Zhen-Qun
2013-03-01
Conventional active learning approaches for interactive video/image retrieval usually assume the query distribution is unknown, as it is difficult to estimate with only a limited number of labeled instances available. Thus, it is easy to put the system in a dilemma whether to explore the feature space in uncertain areas for a better understanding of the query distribution or to harvest in certain areas for more relevant instances. In this paper, we propose a novel approach called coached active learning that makes the query distribution predictable through training and, therefore, avoids the risk of searching on a completely unknown space. The estimated distribution, which provides a more global view of the feature space, can be used to schedule not only the timing but also the step sizes of the exploration and the exploitation in a principled way. The results of the experiments on a large-scale data set from TRECVID 2005-2009 validate the efficiency and effectiveness of our approach, which demonstrates an encouraging performance when facing domain-shift, outperforms eight conventional active learning methods, and shows superiority to six state-of-the-art interactive video retrieval systems.
Pentoney, Christopher; Harwell, Jeff; Leroy, Gondy
2014-01-01
Searching for medical information online is a common activity. While it has been shown that forming good queries is difficult, Google's query suggestion tool, a type of query expansion, aims to facilitate query formation. However, it is unknown how this expansion, which is based on what others searched for, affects the information gathering of the online community. To measure the impact of social-based query expansion, this study compared it with content-based expansion, i.e., what is really in the text. We used 138,906 medical queries from the AOL User Session Collection and expanded them using Google's Autocomplete method (social-based) and the content of the Google Web Corpus (content-based). We evaluated the specificity and ambiguity of the expansion terms for trigram queries. We also looked at the impact on the actual results using domain diversity and expansion edit distance. Results showed that the social-based method provided more precise expansion terms as well as terms that were less ambiguous. Expanded queries do not differ significantly in diversity when expanded using the social-based method (6.72 different domains returned in the first ten results, on average) vs. content-based method (6.73 different domains, on average).
SIMS: addressing the problem of heterogeneity in databases
NASA Astrophysics Data System (ADS)
Arens, Yigal
1997-02-01
The heterogeneity of remotely accessible databases -- with respect to contents, query language, semantics, organization, etc. -- presents serious obstacles to convenient querying. The SIMS (single interface to multiple sources) system addresses this global integration problem. It does so by defining a single language for describing the domain about which information is stored in the databases and using this language as the query language. Each database to which SIMS is to provide access is modeled using this language. The model describes a database's contents, organization, and other relevant features. SIMS uses these models, together with a planning system drawing on techniques from artificial intelligence, to decompose a given user's high-level query into a series of queries against the databases and other data manipulation steps. The retrieval plan is constructed so as to minimize data movement over the network and maximize parallelism to increase execution speed. SIMS can recover from network failures during plan execution by obtaining data from alternate sources, when possible. SIMS has been demonstrated in the domains of medical informatics and logistics, using real databases.
Manually Classifying User Search Queries on an Academic Library Web Site
ERIC Educational Resources Information Center
Chapman, Suzanne; Desai, Shevon; Hagedorn, Kat; Varnum, Ken; Mishra, Sonali; Piacentine, Julie
2013-01-01
The University of Michigan Library wanted to learn more about the kinds of searches its users were conducting through the "one search" search box on the Library Web site. Library staff conducted two investigations. A preliminary investigation in 2011 involved the manual review of the 100 most frequently occurring queries conducted…
An advanced web query interface for biological databases
Latendresse, Mario; Karp, Peter D.
2010-01-01
Although most web-based biological databases (DBs) offer some type of web-based form to allow users to author DB queries, these query forms are quite restricted in the complexity of DB queries that they can formulate. They can typically query only one DB, and can query only a single type of object at a time (e.g. genes) with no possible interaction between the objects—that is, in SQL parlance, no joins are allowed between DB objects. Writing precise queries against biological DBs is usually left to a programmer skillful enough in complex DB query languages like SQL. We present a web interface for building precise queries for biological DBs that can construct much more precise queries than most web-based query forms, yet that is user friendly enough to be used by biologists. It supports queries containing multiple conditions, and connecting multiple object types without using the join concept, which is unintuitive to biologists. This interactive web interface is called the Structured Advanced Query Page (SAQP). Users interactively build up a wide range of query constructs. Interactive documentation within the SAQP describes the schema of the queried DBs. The SAQP is based on BioVelo, a query language based on list comprehension. The SAQP is part of the Pathway Tools software and is available as part of several bioinformatics web sites powered by Pathway Tools, including the BioCyc.org site that contains more than 500 Pathway/Genome DBs. PMID:20624715
Collaborative E-Learning Using Semantic Course Blog
ERIC Educational Resources Information Center
Lu, Lai-Chen; Yeh, Ching-Long
2008-01-01
Collaborative e-learning delivers many enhancements to e-learning technology; it enables students to collaborate with each other and improves their learning efficiency. Semantic blog combines semantic Web and blog technology that users can import, export, view, navigate, and query the blog. We developed a semantic course blog for collaborative…
KBGIS-II: A knowledge-based geographic information system
NASA Technical Reports Server (NTRS)
Smith, Terence; Peuquet, Donna; Menon, Sudhakar; Agarwal, Pankaj
1986-01-01
The architecture and working of a recently implemented Knowledge-Based Geographic Information System (KBGIS-II), designed to satisfy several general criteria for the GIS, is described. The system has four major functions including query-answering, learning and editing. The main query finds constrained locations for spatial objects that are describable in a predicate-calculus based spatial object language. The main search procedures include a family of constraint-satisfaction procedures that use a spatial object knowledge base to search efficiently for complex spatial objects in large, multilayered spatial data bases. These data bases are represented in quadtree form. The search strategy is designed to reduce the computational cost of search in the average case. The learning capabilities of the system include the addition of new locations of complex spatial objects to the knowledge base as queries are answered, and the ability to learn inductively definitions of new spatial objects from examples. The new definitions are added to the knowledge base by the system. The system is performing all its designated tasks successfully. Future reports will relate performance characteristics of the system.
Yang, Liu; Jin, Rong; Mummert, Lily; Sukthankar, Rahul; Goode, Adam; Zheng, Bin; Hoi, Steven C H; Satyanarayanan, Mahadev
2010-01-01
Similarity measurement is a critical component in content-based image retrieval systems, and learning a good distance metric can significantly improve retrieval performance. However, despite extensive study, there are several major shortcomings with the existing approaches for distance metric learning that can significantly affect their application to medical image retrieval. In particular, "similarity" can mean very different things in image retrieval: resemblance in visual appearance (e.g., two images that look like one another) or similarity in semantic annotation (e.g., two images of tumors that look quite different yet are both malignant). Current approaches for distance metric learning typically address only one goal without consideration of the other. This is problematic for medical image retrieval where the goal is to assist doctors in decision making. In these applications, given a query image, the goal is to retrieve similar images from a reference library whose semantic annotations could provide the medical professional with greater insight into the possible interpretations of the query image. If the system were to retrieve images that did not look like the query, then users would be less likely to trust the system; on the other hand, retrieving images that appear superficially similar to the query but are semantically unrelated is undesirable because that could lead users toward an incorrect diagnosis. Hence, learning a distance metric that preserves both visual resemblance and semantic similarity is important. We emphasize that, although our study is focused on medical image retrieval, the problem addressed in this work is critical to many image retrieval systems. We present a boosting framework for distance metric learning that aims to preserve both visual and semantic similarities. The boosting framework first learns a binary representation using side information, in the form of labeled pairs, and then computes the distance as a weighted Hamming distance using the learned binary representation. A boosting algorithm is presented to efficiently learn the distance function. We evaluate the proposed algorithm on a mammographic image reference library with an Interactive Search-Assisted Decision Support (ISADS) system and on the medical image data set from ImageCLEF. Our results show that the boosting framework compares favorably to state-of-the-art approaches for distance metric learning in retrieval accuracy, with much lower computational cost. Additional evaluation with the COREL collection shows that our algorithm works well for regular image data sets.
ERIC Educational Resources Information Center
Khan, Abdul Azeez; Khader, Sheik Abdul
2014-01-01
E-learning or electronic learning platforms facilitate delivery of the knowledge spectrum to the learning community through information and communication technologies. The transfer of knowledge takes place from experts to learners, and externalization of the knowledge transfer is significant. In the e-learning environment, the learners seek…
NASA Astrophysics Data System (ADS)
Hornung, Thomas; Simon, Kai; Lausen, Georg
Combining information from different Web sources often results in a tedious and repetitive process, e.g. even simple information requests might require to iterate over a result list of one Web query and use each single result as input for a subsequent query. One approach for this chained queries are data-centric mashups, which allow to visually model the data flow as a graph, where the nodes represent the data source and the edges the data flow.
A Semantic Parsing Method for Mapping Clinical Questions to Logical Forms
Roberts, Kirk; Patra, Braja Gopal
2017-01-01
This paper presents a method for converting natural language questions about structured data in the electronic health record (EHR) into logical forms. The logical forms can then subsequently be converted to EHR-dependent structured queries. The natural language processing task, known as semantic parsing, has the potential to convert questions to logical forms with extremely high precision, resulting in a system that is usable and trusted by clinicians for real-time use in clinical settings. We propose a hybrid semantic parsing method, combining rule-based methods with a machine learning-based classifier. The overall semantic parsing precision on a set of 212 questions is 95.6%. The parser’s rules furthermore allow it to “know what it does not know”, enabling the system to indicate when unknown terms prevent it from understanding the question’s full logical structure. When combined with a module for converting a logical form into an EHR-dependent query, this high-precision approach allows for a question answering system to provide a user with a single, verifiably correct answer. PMID:29854217
Features: Real-Time Adaptive Feature and Document Learning for Web Search.
ERIC Educational Resources Information Center
Chen, Zhixiang; Meng, Xiannong; Fowler, Richard H.; Zhu, Binhai
2001-01-01
Describes Features, an intelligent Web search engine that is able to perform real-time adaptive feature (i.e., keyword) and document learning. Explains how Features learns from users' document relevance feedback and automatically extracts and suggests indexing keywords relevant to a search query, and learns from users' keyword relevance feedback…
Practical private database queries based on a quantum-key-distribution protocol
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jakobi, Markus; Humboldt-Universitaet zu Berlin, D-10117 Berlin; Simon, Christoph
2011-02-15
Private queries allow a user, Alice, to learn an element of a database held by a provider, Bob, without revealing which element she is interested in, while limiting her information about the other elements. We propose to implement private queries based on a quantum-key-distribution protocol, with changes only in the classical postprocessing of the key. This approach makes our scheme both easy to implement and loss tolerant. While unconditionally secure private queries are known to be impossible, we argue that an interesting degree of security can be achieved by relying on fundamental physical principles instead of unverifiable security assumptions inmore » order to protect both the user and the database. We think that the scope exists for such practical private queries to become another remarkable application of quantum information in the footsteps of quantum key distribution.« less
Building a Smart Portal for Astronomy
NASA Astrophysics Data System (ADS)
Derriere, S.; Boch, T.
2011-07-01
The development of a portal for accessing astronomical resources is not an easy task. The ever-increasing complexity of the data products can result in very complex user interfaces, requiring a lot of effort and learning from the user in order to perform searches. This is often a design choice, where the user must explicitly set many constraints, while the portal search logic remains simple. We investigated a different approach, where the query interface is kept as simple as possible (ideally, a simple text field, like for Google search), and the search logic is made much more complex to interpret the query in a relevant manner. We will present the implications of this approach in terms of interpretation and categorization of the query parameters (related to astronomical vocabularies), translation (mapping) of these concepts into the portal components metadata, identification of query schemes and use cases matching the input parameters, and delivery of query results to the user.
Active Learning with Irrelevant Examples
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri; Mazzoni, Dominic
2009-01-01
An improved active learning method has been devised for training data classifiers. One example of a data classifier is the algorithm used by the United States Postal Service since the 1960s to recognize scans of handwritten digits for processing zip codes. Active learning algorithms enable rapid training with minimal investment of time on the part of human experts to provide training examples consisting of correctly classified (labeled) input data. They function by identifying which examples would be most profitable for a human expert to label. The goal is to maximize classifier accuracy while minimizing the number of examples the expert must label. Although there are several well-established methods for active learning, they may not operate well when irrelevant examples are present in the data set. That is, they may select an item for labeling that the expert simply cannot assign to any of the valid classes. In the context of classifying handwritten digits, the irrelevant items may include stray marks, smudges, and mis-scans. Querying the expert about these items results in wasted time or erroneous labels, if the expert is forced to assign the item to one of the valid classes. In contrast, the new algorithm provides a specific mechanism for avoiding querying the irrelevant items. This algorithm has two components: an active learner (which could be a conventional active learning algorithm) and a relevance classifier. The combination of these components yields a method, denoted Relevance Bias, that enables the active learner to avoid querying irrelevant data so as to increase its learning rate and efficiency when irrelevant items are present. The algorithm collects irrelevant data in a set of rejected examples, then trains the relevance classifier to distinguish between labeled (relevant) training examples and the rejected ones. The active learner combines its ranking of the items with the probability that they are relevant to yield a final decision about which item to present to the expert for labeling. Experiments on several data sets have demonstrated that the Relevance Bias approach significantly decreases the number of irrelevant items queried and also accelerates learning speed.
Measuring the development of conceptual understanding in chemistry
NASA Astrophysics Data System (ADS)
Claesgens, Jennifer Marie
The purpose of this dissertation research is to investigate and characterize how students learn chemistry from pre-instruction to deeper understanding of the subject matter in their general chemistry coursework. Based on preliminary work, I believe that students have a general pathway of learning across the "big ideas," or concepts, in chemistry that can be characterized over the course of instruction. My hypothesis is that as students learn chemistry they build from experience and logical reasoning then relate chemistry specific ideas in a pair-wise fashion before making more complete multi-relational links for deeper understanding of the subject matter. This proposed progression of student learning, which starts at Notions, moves to Recognition, and then to Formulation, is described in the ChemQuery Perspectives framework. My research continues the development of ChemQuery, an NSF-funded assessment system that uses a framework of the key ideas in the discipline and criterion-referenced analysis using item response theory (IRT) to map student progress. Specifially, this research investigates the potential for using criterion-referenced analysis to describe and measure how students learn chemistry followed by more detailed task analysis of patterns in student responses found in the data. My research question asks: does IRT work to describe and measure how students learn chemistry and if so, what is discovered about how students learn? Although my findings seem to neither entirely support nor entirely refute the pathway of student understanding proposed in the ChemQuery Perspectives framework. My research does provide an indication of trouble spots. For example, it seems like the pathway from Notions to Recognition is holding but there are difficulties around the transition from Recognition to Formulation that cannot be resolved with this data. Nevertheless, this research has produced the following, which has contributed to the development of the ChemQuery assessment system, (a) 13 new change items with good fits, 3 new change items that need further study, (b) a refined scoring guide and (c) a set of item exemplars that can then be developed further into a computer-adapted model so that more data can be captured.
Query Health: standards-based, cross-platform population health surveillance
Klann, Jeffrey G; Buck, Michael D; Brown, Jeffrey; Hadley, Marc; Elmore, Richard; Weber, Griffin M; Murphy, Shawn N
2014-01-01
Objective Understanding population-level health trends is essential to effectively monitor and improve public health. The Office of the National Coordinator for Health Information Technology (ONC) Query Health initiative is a collaboration to develop a national architecture for distributed, population-level health queries across diverse clinical systems with disparate data models. Here we review Query Health activities, including a standards-based methodology, an open-source reference implementation, and three pilot projects. Materials and methods Query Health defined a standards-based approach for distributed population health queries, using an ontology based on the Quality Data Model and Consolidated Clinical Document Architecture, Health Quality Measures Format (HQMF) as the query language, the Query Envelope as the secure transport layer, and the Quality Reporting Document Architecture as the result language. Results We implemented this approach using Informatics for Integrating Biology and the Bedside (i2b2) and hQuery for data analytics and PopMedNet for access control, secure query distribution, and response. We deployed the reference implementation at three pilot sites: two public health departments (New York City and Massachusetts) and one pilot designed to support Food and Drug Administration post-market safety surveillance activities. The pilots were successful, although improved cross-platform data normalization is needed. Discussions This initiative resulted in a standards-based methodology for population health queries, a reference implementation, and revision of the HQMF standard. It also informed future directions regarding interoperability and data access for ONC's Data Access Framework initiative. Conclusions Query Health was a test of the learning health system that supplied a functional methodology and reference implementation for distributed population health queries that has been validated at three sites. PMID:24699371
Query Health: standards-based, cross-platform population health surveillance.
Klann, Jeffrey G; Buck, Michael D; Brown, Jeffrey; Hadley, Marc; Elmore, Richard; Weber, Griffin M; Murphy, Shawn N
2014-01-01
Understanding population-level health trends is essential to effectively monitor and improve public health. The Office of the National Coordinator for Health Information Technology (ONC) Query Health initiative is a collaboration to develop a national architecture for distributed, population-level health queries across diverse clinical systems with disparate data models. Here we review Query Health activities, including a standards-based methodology, an open-source reference implementation, and three pilot projects. Query Health defined a standards-based approach for distributed population health queries, using an ontology based on the Quality Data Model and Consolidated Clinical Document Architecture, Health Quality Measures Format (HQMF) as the query language, the Query Envelope as the secure transport layer, and the Quality Reporting Document Architecture as the result language. We implemented this approach using Informatics for Integrating Biology and the Bedside (i2b2) and hQuery for data analytics and PopMedNet for access control, secure query distribution, and response. We deployed the reference implementation at three pilot sites: two public health departments (New York City and Massachusetts) and one pilot designed to support Food and Drug Administration post-market safety surveillance activities. The pilots were successful, although improved cross-platform data normalization is needed. This initiative resulted in a standards-based methodology for population health queries, a reference implementation, and revision of the HQMF standard. It also informed future directions regarding interoperability and data access for ONC's Data Access Framework initiative. Query Health was a test of the learning health system that supplied a functional methodology and reference implementation for distributed population health queries that has been validated at three sites. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
ERIC Educational Resources Information Center
Reid, Alan D.; Nikel, Jutta
2003-01-01
Explores the notion that a review communicates a research program and how it might extend and disrupt readings of Rickinson's (2001) review of the evidence base for environmental education learning. Investigates, through a series of notes and queries using Lakatos's ideas, the production and possibilities of the review rather than the findings.…
A unified framework for image retrieval using keyword and visual features.
Jing, Feng; Li, Mingling; Zhang, Hong-Jiang; Zhang, Bo
2005-07-01
In this paper, a unified image retrieval framework based on both keyword annotations and visual features is proposed. In this framework, a set of statistical models are built based on visual features of a small set of manually labeled images to represent semantic concepts and used to propagate keywords to other unlabeled images. These models are updated periodically when more images implicitly labeled by users become available through relevance feedback. In this sense, the keyword models serve the function of accumulation and memorization of knowledge learned from user-provided relevance feedback. Furthermore, two sets of effective and efficient similarity measures and relevance feedback schemes are proposed for query by keyword scenario and query by image example scenario, respectively. Keyword models are combined with visual features in these schemes. In particular, a new, entropy-based active learning strategy is introduced to improve the efficiency of relevance feedback for query by keyword. Furthermore, a new algorithm is proposed to estimate the keyword features of the search concept for query by image example. It is shown to be more appropriate than two existing relevance feedback algorithms. Experimental results demonstrate the effectiveness of the proposed framework.
Learning Object Retrieval and Aggregation Based on Learning Styles
ERIC Educational Resources Information Center
Ramirez-Arellano, Aldo; Bory-Reyes, Juan; Hernández-Simón, Luis Manuel
2017-01-01
The main goal of this article is to develop a Management System for Merging Learning Objects (msMLO), which offers an approach that retrieves learning objects (LOs) based on students' learning styles and term-based queries, which produces a new outcome with a better score. The msMLO faces the task of retrieving LOs via two steps: The first step…
Noise-tolerant parity learning with one quantum bit
NASA Astrophysics Data System (ADS)
Park, Daniel K.; Rhee, June-Koo K.; Lee, Soonchil
2018-03-01
Demonstrating quantum advantage with less powerful but more realistic devices is of great importance in modern quantum information science. Recently, a significant quantum speedup was achieved in the problem of learning a hidden parity function with noise. However, if all data qubits at the query output are completely depolarized, the algorithm fails. In this work, we present a quantum parity learning algorithm that exhibits quantum advantage as long as one qubit is provided with nonzero polarization in each query. In this scenario, the quantum parity learning naturally becomes deterministic quantum computation with one qubit. Then the hidden parity function can be revealed by performing a set of operations that can be interpreted as measuring nonlocal observables on the auxiliary result qubit having nonzero polarization and each data qubit. We also discuss the source of the quantum advantage in our algorithm from the resource-theoretic point of view.
Physical and Psychological Well-Being and University Student Satisfaction with E-Learning
ERIC Educational Resources Information Center
Johnson, Genevieve Marie
2015-01-01
Although research establishes that student characteristics exert considerable influence on learning outcomes, research concerned with e-learning satisfaction most typically focuses of factors associated with instructional design, curriculum and pedagogy. Fifty-eight first-year university e-students completed an online survey that queried their…
Active learning for semi-supervised clustering based on locally linear propagation reconstruction.
Chang, Chin-Chun; Lin, Po-Yi
2015-03-01
The success of semi-supervised clustering relies on the effectiveness of side information. To get effective side information, a new active learner learning pairwise constraints known as must-link and cannot-link constraints is proposed in this paper. Three novel techniques are developed for learning effective pairwise constraints. The first technique is used to identify samples less important to cluster structures. This technique makes use of a kernel version of locally linear embedding for manifold learning. Samples neither important to locally linear propagation reconstructions of other samples nor on flat patches in the learned manifold are regarded as unimportant samples. The second is a novel criterion for query selection. This criterion considers not only the importance of a sample to expanding the space coverage of the learned samples but also the expected number of queries needed to learn the sample. To facilitate semi-supervised clustering, the third technique yields inferred must-links for passing information about flat patches in the learned manifold to semi-supervised clustering algorithms. Experimental results have shown that the learned pairwise constraints can capture the underlying cluster structures and proven the feasibility of the proposed approach. Copyright © 2014 Elsevier Ltd. All rights reserved.
Learning multiple relative attributes with humans in the loop.
Qian, Buyue; Wang, Xiang; Cao, Nan; Jiang, Yu-Gang; Davidson, Ian
2014-12-01
Semantic attributes have been recognized as a more spontaneous manner to describe and annotate image content. It is widely accepted that image annotation using semantic attributes is a significant improvement to the traditional binary or multiclass annotation due to its naturally continuous and relative properties. Though useful, existing approaches rely on an abundant supervision and high-quality training data, which limit their applicability. Two standard methods to overcome small amounts of guidance and low-quality training data are transfer and active learning. In the context of relative attributes, this would entail learning multiple relative attributes simultaneously and actively querying a human for additional information. This paper addresses the two main limitations in existing work: 1) it actively adds humans to the learning loop so that minimal additional guidance can be given and 2) it learns multiple relative attributes simultaneously and thereby leverages dependence amongst them. In this paper, we formulate a joint active learning to rank framework with pairwise supervision to achieve these two aims, which also has other benefits such as the ability to be kernelized. The proposed framework optimizes over a set of ranking functions (measuring the strength of the presence of attributes) simultaneously and dependently on each other. The proposed pairwise queries take the form of which one of these two pictures is more natural? These queries can be easily answered by humans. Extensive empirical study on real image data sets shows that our proposed method, compared with several state-of-the-art methods, achieves superior retrieval performance while requires significantly less human inputs.
XGI: a graphical interface for XQuery creation.
Li, Xiang; Gennari, John H; Brinkley, James F
2007-10-11
XML has become the default standard for data exchange among heterogeneous data sources, and in January 2007 XQuery (XML Query language) was recommended by the World Wide Web Consortium as the query language for XML. However, XQuery is a complex language that is difficult for non-programmers to learn. We have therefore developed XGI (XQuery Graphical Interface), a visual interface for graphically generating XQuery. In this paper we demonstrate the functionality of XGI through its application to a biomedical XML dataset. We describe the system architecture and the features of XGI in relation to several existing querying systems, we demonstrate the system's usability through a sample query construction, and we discuss a preliminary evaluation of XGI. Finally, we describe some limitations of the system, and our plans for future improvements.
Active Learning with Statistical Models.
1995-01-01
Active Learning with Statistical Models ASC-9217041, NSF CDA-9309300 6. AUTHOR(S) David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan 7. PERFORMING...TERMS 15. NUMBER OF PAGES Al, MIT, Artificial Intelligence, active learning , queries, locally weighted 6 regression, LOESS, mixtures of gaussians...COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1522 January 9. 1995 C.B.C.L. Paper No. 110 Active Learning with
Active Learning Using Arbitrary Binary Valued Queries
1990-10-01
active learning in the sense that the learner has complete choice in the information received. Specifically, we allow the learner to ask arbitrary yes...no questions. We consider both active learning under a fixed distribution and distribution-free active learning . In the case of active learning , the...a concept class is actively learnable iff it is finite, so that active learning is in fact less powerful than the usual passive learning model. We
A Practical Ontology Query Expansion Algorithm for Semantic-Aware Learning Objects Retrieval
ERIC Educational Resources Information Center
Lee, Ming-Che; Tsai, Kun Hua; Wang, Tzone I.
2008-01-01
Following the rapid development of Internet, particularly web page interaction technology, distant e-learning has become increasingly realistic and popular. To solve the problems associated with sharing and reusing teaching materials in different e-learning systems, several standard formats, including SCORM, IMS, LOM, and AICC, etc., recently have…
Executing SPARQL Queries over the Web of Linked Data
NASA Astrophysics Data System (ADS)
Hartig, Olaf; Bizer, Christian; Freytag, Johann-Christoph
The Web of Linked Data forms a single, globally distributed dataspace. Due to the openness of this dataspace, it is not possible to know in advance all data sources that might be relevant for query answering. This openness poses a new challenge that is not addressed by traditional research on federated query processing. In this paper we present an approach to execute SPARQL queries over the Web of Linked Data. The main idea of our approach is to discover data that might be relevant for answering a query during the query execution itself. This discovery is driven by following RDF links between data sources based on URIs in the query and in partial results. The URIs are resolved over the HTTP protocol into RDF data which is continuously added to the queried dataset. This paper describes concepts and algorithms to implement our approach using an iterator-based pipeline. We introduce a formalization of the pipelining approach and show that classical iterators may cause blocking due to the latency of HTTP requests. To avoid blocking, we propose an extension of the iterator paradigm. The evaluation of our approach shows its strengths as well as the still existing challenges.
Saying What You're Looking For: Linguistics Meets Video Search.
Barrett, Daniel Paul; Barbu, Andrei; Siddharth, N; Siskind, Jeffrey Mark
2016-10-01
We present an approach to searching large video corpora for clips which depict a natural-language query in the form of a sentence. Compositional semantics is used to encode subtle meaning differences lost in other approaches, such as the difference between two sentences which have identical words but entirely different meaning: The person rode the horse versus The horse rode the person. Given a sentential query and a natural-language parser, we produce a score indicating how well a video clip depicts that sentence for each clip in a corpus and return a ranked list of clips. Two fundamental problems are addressed simultaneously: detecting and tracking objects, and recognizing whether those tracks depict the query. Because both tracking and object detection are unreliable, our approach uses the sentential query to focus the tracker on the relevant participants and ensures that the resulting tracks are described by the sentential query. While most earlier work was limited to single-word queries which correspond to either verbs or nouns, we search for complex queries which contain multiple phrases, such as prepositional phrases, and modifiers, such as adverbs. We demonstrate this approach by searching for 2,627 naturally elicited sentential queries in 10 Hollywood movies.
NASA Astrophysics Data System (ADS)
Giovannetti, Vittorio; Lloyd, Seth; Maccone, Lorenzo
2008-06-01
We propose a cheat sensitive quantum protocol to perform a private search on a classical database which is efficient in terms of communication complexity. It allows a user to retrieve an item from the database provider without revealing which item he or she retrieved: if the provider tries to obtain information on the query, the person querying the database can find it out. The protocol ensures also perfect data privacy of the database: the information that the user can retrieve in a single query is bounded and does not depend on the size of the database. With respect to the known (quantum and classical) strategies for private information retrieval, our protocol displays an exponential reduction in communication complexity and in running-time computational complexity.
Neural networks and logical reasoning systems: a translation table.
Martins, J; Mendes, R V
2001-04-01
A correspondence is established between the basic elements of logic reasoning systems (knowledge bases, rules, inference and queries) and the structure and dynamical evolution laws of neural networks. The correspondence is pictured as a translation dictionary which might allow to go back and forth between symbolic and network formulations, a desirable step in learning-oriented systems and multicomputer networks. In the framework of Horn clause logics, it is found that atomic propositions with n arguments correspond to nodes with nth order synapses, rules to synaptic intensity constraints, forward chaining to synaptic dynamics and queries either to simple node activation or to a query tensor dynamics.
Web document ranking via active learning and kernel principal component analysis
NASA Astrophysics Data System (ADS)
Cai, Fei; Chen, Honghui; Shu, Zhen
2015-09-01
Web document ranking arises in many information retrieval (IR) applications, such as the search engine, recommendation system and online advertising. A challenging issue is how to select the representative query-document pairs and informative features as well for better learning and exploring new ranking models to produce an acceptable ranking list of candidate documents of each query. In this study, we propose an active sampling (AS) plus kernel principal component analysis (KPCA) based ranking model, viz. AS-KPCA Regression, to study the document ranking for a retrieval system, i.e. how to choose the representative query-document pairs and features for learning. More precisely, we fill those documents gradually into the training set by AS such that each of which will incur the highest expected DCG loss if unselected. Then, the KPCA is performed via projecting the selected query-document pairs onto p-principal components in the feature space to complete the regression. Hence, we can cut down the computational overhead and depress the impact incurred by noise simultaneously. To the best of our knowledge, we are the first to perform the document ranking via dimension reductions in two dimensions, namely, the number of documents and features simultaneously. Our experiments demonstrate that the performance of our approach is better than that of the baseline methods on the public LETOR 4.0 datasets. Our approach brings an improvement against RankBoost as well as other baselines near 20% in terms of MAP metric and less improvements using P@K and NDCG@K, respectively. Moreover, our approach is particularly suitable for document ranking on the noisy dataset in practice.
A rank-based Prediction Algorithm of Learning User's Intention
NASA Astrophysics Data System (ADS)
Shen, Jie; Gao, Ying; Chen, Cang; Gong, HaiPing
Internet search has become an important part in people's daily life. People can find many types of information to meet different needs through search engines on the Internet. There are two issues for the current search engines: first, the users should predetermine the types of information they want and then change to the appropriate types of search engine interfaces. Second, most search engines can support multiple kinds of search functions, each function has its own separate search interface. While users need different types of information, they must switch between different interfaces. In practice, most queries are corresponding to various types of information results. These queries can search the relevant results in various search engines, such as query "Palace" contains the websites about the introduction of the National Palace Museum, blog, Wikipedia, some pictures and video information. This paper presents a new aggregative algorithm for all kinds of search results. It can filter and sort the search results by learning three aspects about the query words, search results and search history logs to achieve the purpose of detecting user's intention. Experiments demonstrate that this rank-based method for multi-types of search results is effective. It can meet the user's search needs well, enhance user's satisfaction, provide an effective and rational model for optimizing search engines and improve user's search experience.
Using Bitmap Indexing Technology for Combined Numerical and TextQueries
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stockinger, Kurt; Cieslewicz, John; Wu, Kesheng
2006-10-16
In this paper, we describe a strategy of using compressedbitmap indices to speed up queries on both numerical data and textdocuments. By using an efficient compression algorithm, these compressedbitmap indices are compact even for indices with millions of distinctterms. Moreover, bitmap indices can be used very efficiently to answerBoolean queries over text documents involving multiple query terms.Existing inverted indices for text searches are usually inefficient forcorpora with a very large number of terms as well as for queriesinvolving a large number of hits. We demonstrate that our compressedbitmap index technology overcomes both of those short-comings. In aperformance comparison against amore » commonly used database system, ourindices answer queries 30 times faster on average. To provide full SQLsupport, we integrated our indexing software, called FastBit, withMonetDB. The integrated system MonetDB/FastBit provides not onlyefficient searches on a single table as FastBit does, but also answersjoin queries efficiently. Furthermore, MonetDB/FastBit also provides avery efficient retrieval mechanism of result records.« less
A Novel Quantum Solution to Privacy-Preserving Nearest Neighbor Query in Location-Based Services
NASA Astrophysics Data System (ADS)
Luo, Zhen-yu; Shi, Run-hua; Xu, Min; Zhang, Shun
2018-04-01
We present a cheating-sensitive quantum protocol for Privacy-Preserving Nearest Neighbor Query based on Oblivious Quantum Key Distribution and Quantum Encryption. Compared with the classical related protocols, our proposed protocol has higher security, because the security of our protocol is based on basic physical principles of quantum mechanics, instead of difficulty assumptions. Especially, our protocol takes single photons as quantum resources and only needs to perform single-photon projective measurement. Therefore, it is feasible to implement this protocol with the present technologies.
Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick
2016-01-01
This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest. PMID:26958859
Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick
2016-01-01
This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest.
Survey on Uses of Distance Learning in the U.S.
ERIC Educational Resources Information Center
Downing, Diane E.
A December 1983 survey queried the chief state school officers of the 50 states on the extent to which distance learning techniques are used in public education in their states. Respondents were asked to focus on interactive forms of distance learning, such as audio and video teleconferencing. A total of 28 states (56%) responded, with the…
ERIC Educational Resources Information Center
Dangel, Julie Rainer
2011-01-01
Whither Scholarship in the Work of Enhancing the Quality of Teaching and Learning? This is an important query because it acknowledges, embraces, questions, and challenges the role of scholarship in enhancing teaching and learning. Interestingly, these four verbs help the author categorize her perspectives on the use of scholarship. Drawing from…
Quantum private query based on single-photon interference
NASA Astrophysics Data System (ADS)
Xu, Sheng-Wei; Sun, Ying; Lin, Song
2016-08-01
Quantum private query (QPQ) has become a research hotspot recently. Specially, the quantum key distribution (QKD)-based QPQ attracts lots of attention because of its practicality. Various such kind of QPQ protocols have been proposed based on different technologies of quantum communications. Single-photon interference is one of such technologies, on which the famous QKD protocol GV95 is just based. In this paper, we propose two QPQ protocols based on single-photon interference. The first one is simpler and easier to realize, and the second one is loss tolerant and flexible, and more practical than the first one. Furthermore, we analyze both the user privacy and the database privacy in the proposed protocols.
Producing approximate answers to database queries
NASA Technical Reports Server (NTRS)
Vrbsky, Susan V.; Liu, Jane W. S.
1993-01-01
We have designed and implemented a query processor, called APPROXIMATE, that makes approximate answers available if part of the database is unavailable or if there is not enough time to produce an exact answer. The accuracy of the approximate answers produced improves monotonically with the amount of data retrieved to produce the result. The exact answer is produced if all of the needed data are available and query processing is allowed to continue until completion. The monotone query processing algorithm of APPROXIMATE works within the standard relational algebra framework and can be implemented on a relational database system with little change to the relational architecture. We describe here the approximation semantics of APPROXIMATE that serves as the basis for meaningful approximations of both set-valued and single-valued queries. We show how APPROXIMATE is implemented to make effective use of semantic information, provided by an object-oriented view of the database, and describe the additional overhead required by APPROXIMATE.
Characterizing Listener Engagement with Popular Songs Using Large-Scale Music Discovery Data
Kaneshiro, Blair; Ruan, Feng; Baker, Casey W.; Berger, Jonathan
2017-01-01
Music discovery in everyday situations has been facilitated in recent years by audio content recognition services such as Shazam. The widespread use of such services has produced a wealth of user data, specifying where and when a global audience takes action to learn more about music playing around them. Here, we analyze a large collection of Shazam queries of popular songs to study the relationship between the timing of queries and corresponding musical content. Our results reveal that the distribution of queries varies over the course of a song, and that salient musical events drive an increase in queries during a song. Furthermore, we find that the distribution of queries at the time of a song's release differs from the distribution following a song's peak and subsequent decline in popularity, possibly reflecting an evolution of user intent over the “life cycle” of a song. Finally, we derive insights into the data size needed to achieve consistent query distributions for individual songs. The combined findings of this study suggest that music discovery behavior, and other facets of the human experience of music, can be studied quantitatively using large-scale industrial data. PMID:28386241
Semantic-based surveillance video retrieval.
Hu, Weiming; Xie, Dan; Fu, Zhouyu; Zeng, Wenrong; Maybank, Steve
2007-04-01
Visual surveillance produces large amounts of video data. Effective indexing and retrieval from surveillance video databases are very important. Although there are many ways to represent the content of video clips in current video retrieval algorithms, there still exists a semantic gap between users and retrieval systems. Visual surveillance systems supply a platform for investigating semantic-based video retrieval. In this paper, a semantic-based video retrieval framework for visual surveillance is proposed. A cluster-based tracking algorithm is developed to acquire motion trajectories. The trajectories are then clustered hierarchically using the spatial and temporal information, to learn activity models. A hierarchical structure of semantic indexing and retrieval of object activities, where each individual activity automatically inherits all the semantic descriptions of the activity model to which it belongs, is proposed for accessing video clips and individual objects at the semantic level. The proposed retrieval framework supports various queries including queries by keywords, multiple object queries, and queries by sketch. For multiple object queries, succession and simultaneity restrictions, together with depth and breadth first orders, are considered. For sketch-based queries, a method for matching trajectories drawn by users to spatial trajectories is proposed. The effectiveness and efficiency of our framework are tested in a crowded traffic scene.
Characterizing Listener Engagement with Popular Songs Using Large-Scale Music Discovery Data.
Kaneshiro, Blair; Ruan, Feng; Baker, Casey W; Berger, Jonathan
2017-01-01
Music discovery in everyday situations has been facilitated in recent years by audio content recognition services such as Shazam. The widespread use of such services has produced a wealth of user data, specifying where and when a global audience takes action to learn more about music playing around them. Here, we analyze a large collection of Shazam queries of popular songs to study the relationship between the timing of queries and corresponding musical content. Our results reveal that the distribution of queries varies over the course of a song, and that salient musical events drive an increase in queries during a song. Furthermore, we find that the distribution of queries at the time of a song's release differs from the distribution following a song's peak and subsequent decline in popularity, possibly reflecting an evolution of user intent over the "life cycle" of a song. Finally, we derive insights into the data size needed to achieve consistent query distributions for individual songs. The combined findings of this study suggest that music discovery behavior, and other facets of the human experience of music, can be studied quantitatively using large-scale industrial data.
Active learning based segmentation of Crohns disease from abdominal MRI.
Mahapatra, Dwarikanath; Vos, Franciscus M; Buhmann, Joachim M
2016-05-01
This paper proposes a novel active learning (AL) framework, and combines it with semi supervised learning (SSL) for segmenting Crohns disease (CD) tissues from abdominal magnetic resonance (MR) images. Robust fully supervised learning (FSL) based classifiers require lots of labeled data of different disease severities. Obtaining such data is time consuming and requires considerable expertise. SSL methods use a few labeled samples, and leverage the information from many unlabeled samples to train an accurate classifier. AL queries labels of most informative samples and maximizes gain from the labeling effort. Our primary contribution is in designing a query strategy that combines novel context information with classification uncertainty and feature similarity. Combining SSL and AL gives a robust segmentation method that: (1) optimally uses few labeled samples and many unlabeled samples; and (2) requires lower training time. Experimental results show our method achieves higher segmentation accuracy than FSL methods with fewer samples and reduced training effort. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
KBGIS-2: A knowledge-based geographic information system
NASA Technical Reports Server (NTRS)
Smith, T.; Peuquet, D.; Menon, S.; Agarwal, P.
1986-01-01
The architecture and working of a recently implemented knowledge-based geographic information system (KBGIS-2) that was designed to satisfy several general criteria for the geographic information system are described. The system has four major functions that include query-answering, learning, and editing. The main query finds constrained locations for spatial objects that are describable in a predicate-calculus based spatial objects language. The main search procedures include a family of constraint-satisfaction procedures that use a spatial object knowledge base to search efficiently for complex spatial objects in large, multilayered spatial data bases. These data bases are represented in quadtree form. The search strategy is designed to reduce the computational cost of search in the average case. The learning capabilities of the system include the addition of new locations of complex spatial objects to the knowledge base as queries are answered, and the ability to learn inductively definitions of new spatial objects from examples. The new definitions are added to the knowledge base by the system. The system is currently performing all its designated tasks successfully, although currently implemented on inadequate hardware. Future reports will detail the performance characteristics of the system, and various new extensions are planned in order to enhance the power of KBGIS-2.
LAILAPS-QSM: A RESTful API and JAVA library for semantic query suggestions.
Chen, Jinbo; Scholz, Uwe; Zhou, Ruonan; Lange, Matthias
2018-03-01
In order to access and filter content of life-science databases, full text search is a widely applied query interface. But its high flexibility and intuitiveness is paid for with potentially imprecise and incomplete query results. To reduce this drawback, query assistance systems suggest those combinations of keywords with the highest potential to match most of the relevant data records. Widespread approaches are syntactic query corrections that avoid misspelling and support expansion of words by suffixes and prefixes. Synonym expansion approaches apply thesauri, ontologies, and query logs. All need laborious curation and maintenance. Furthermore, access to query logs is in general restricted. Approaches that infer related queries by their query profile like research field, geographic location, co-authorship, affiliation etc. require user's registration and its public accessibility that contradict privacy concerns. To overcome these drawbacks, we implemented LAILAPS-QSM, a machine learning approach that reconstruct possible linguistic contexts of a given keyword query. The context is referred from the text records that are stored in the databases that are going to be queried or extracted for a general purpose query suggestion from PubMed abstracts and UniProt data. The supplied tool suite enables the pre-processing of these text records and the further computation of customized distributed word vectors. The latter are used to suggest alternative keyword queries. An evaluated of the query suggestion quality was done for plant science use cases. Locally present experts enable a cost-efficient quality assessment in the categories trait, biological entity, taxonomy, affiliation, and metabolic function which has been performed using ontology term similarities. LAILAPS-QSM mean information content similarity for 15 representative queries is 0.70, whereas 34% have a score above 0.80. In comparison, the information content similarity for human expert made query suggestions is 0.90. The software is either available as tool set to build and train dedicated query suggestion services or as already trained general purpose RESTful web service. The service uses open interfaces to be seamless embeddable into database frontends. The JAVA implementation uses highly optimized data structures and streamlined code to provide fast and scalable response for web service calls. The source code of LAILAPS-QSM is available under GNU General Public License version 2 in Bitbucket GIT repository: https://bitbucket.org/ipk_bit_team/bioescorte-suggestion.
Persistent Identifiers for Improved Accessibility for Linked Data Querying
NASA Astrophysics Data System (ADS)
Shepherd, A.; Chandler, C. L.; Arko, R. A.; Fils, D.; Jones, M. B.; Krisnadhi, A.; Mecum, B.
2016-12-01
The adoption of linked open data principles within the geosciences has increased the amount of accessible information available on the Web. However, this data is difficult to consume for those who are unfamiliar with Semantic Web technologies such as Web Ontology Language (OWL), Resource Description Framework (RDF) and SPARQL - the RDF query language. Consumers would need to understand the structure of the data and how to efficiently query it. Furthermore, understanding how to query doesn't solve problems of poor precision and recall in search results. For consumers unfamiliar with the data, full-text searches are most accessible, but not ideal as they arrest the advantages of data disambiguation and co-reference resolution efforts. Conversely, URI searches across linked data can deliver improved search results, but knowledge of these exact URIs may remain difficult to obtain. The increased adoption of Persistent Identifiers (PIDs) can lead to improved linked data querying by a wide variety of consumers. Because PIDs resolve to a single entity, they are an excellent data point for disambiguating content. At the same time, PIDs are more accessible and prominent than a single data provider's linked data URI. When present in linked open datasets, PIDs provide balance between the technical and social hurdles of linked data querying as evidenced by the NSF EarthCube GeoLink project. The GeoLink project, funded by NSF's EarthCube initiative, have brought together data repositories include content from field expeditions, laboratory analyses, journal publications, conference presentations, theses/reports, and funding awards that span scientific studies from marine geology to marine ecosystems and biogeochemistry to paleoclimatology.
Classification of ECG beats using deep belief network and active learning.
G, Sayantan; T, Kien P; V, Kadambari K
2018-04-12
A new semi-supervised approach based on deep learning and active learning for classification of electrocardiogram signals (ECG) is proposed. The objective of the proposed work is to model a scientific method for classification of cardiac irregularities using electrocardiogram beats. The model follows the Association for the Advancement of medical instrumentation (AAMI) standards and consists of three phases. In phase I, feature representation of ECG is learnt using Gaussian-Bernoulli deep belief network followed by a linear support vector machine (SVM) training in the consecutive phase. It yields three deep models which are based on AAMI-defined classes, namely N, V, S, and F. In the last phase, a query generator is introduced to interact with the expert to label few beats to improve accuracy and sensitivity. The proposed approach depicts significant improvement in accuracy with minimal queries posed to the expert and fast online training as tested on the MIT-BIH Arrhythmia Database and the MIT-BIH Supra-ventricular Arrhythmia Database (SVDB). With 100 queries labeled by the expert in phase III, the method achieves an accuracy of 99.5% in "S" versus all classifications (SVEB) and 99.4% accuracy in "V " versus all classifications (VEB) on MIT-BIH Arrhythmia Database. In a similar manner, it is attributed that an accuracy of 97.5% for SVEB and 98.6% for VEB on SVDB database is achieved respectively. Graphical Abstract Reply- Deep belief network augmented by active learning for efficient prediction of arrhythmia.
Hu, Xiangen; Graesser, Arthur C
2004-05-01
The Human Use Regulatory Affairs Advisor (HURAA) is a Web-based facility that provides help and training on the ethical use of human subjects in research, based on documents and regulations in United States federal agencies. HURAA has a number of standard features of conventional Web facilities and computer-based training, such as hypertext, multimedia, help modules, glossaries, archives, links to other sites, and page-turning didactic instruction. HURAA also has these intelligent features: (1) an animated conversational agent that serves as a navigational guide for the Web facility, (2) lessons with case-based and explanation-based reasoning, (3) document retrieval through natural language queries, and (4) a context-sensitive Frequently Asked Questions segment, called Point & Query. This article describes the functional learning components of HURAA, specifies its computational architecture, and summarizes empirical tests of the facility on learners.
Semantic Services in e-Learning: An Argumentation Case Study
ERIC Educational Resources Information Center
Moreale, Emanuela; Vargas-Vera, Maria
2004-01-01
This paper outlines an e-Learning services architecture offering semantic-based services to students and tutors, in particular ways to browse and obtain information through web services. Services could include registration, authentication, tutoring systems, smart question answering for students' queries, automated marking systems and a student…
A Re-Unification of Two Competing Models for Document Retrieval.
ERIC Educational Resources Information Center
Bodoff, David
1999-01-01
Examines query-oriented versus document-oriented information retrieval and feedback learning. Highlights include a reunification of the two approaches for probabilistic document retrieval and for vector space model (VSM) retrieval; learning in VSM and in probabilistic models; multi-dimensional scaling; and ongoing field studies. (LRW)
Large-Scale Image Analytics Using Deep Learning
NASA Astrophysics Data System (ADS)
Ganguly, S.; Nemani, R. R.; Basu, S.; Mukhopadhyay, S.; Michaelis, A.; Votava, P.
2014-12-01
High resolution land cover classification maps are needed to increase the accuracy of current Land ecosystem and climate model outputs. Limited studies are in place that demonstrates the state-of-the-art in deriving very high resolution (VHR) land cover products. In addition, most methods heavily rely on commercial softwares that are difficult to scale given the region of study (e.g. continents to globe). Complexities in present approaches relate to (a) scalability of the algorithm, (b) large image data processing (compute and memory intensive), (c) computational cost, (d) massively parallel architecture, and (e) machine learning automation. In addition, VHR satellite datasets are of the order of terabytes and features extracted from these datasets are of the order of petabytes. In our present study, we have acquired the National Agricultural Imaging Program (NAIP) dataset for the Continental United States at a spatial resolution of 1-m. This data comes as image tiles (a total of quarter million image scenes with ~60 million pixels) and has a total size of ~100 terabytes for a single acquisition. Features extracted from the entire dataset would amount to ~8-10 petabytes. In our proposed approach, we have implemented a novel semi-automated machine learning algorithm rooted on the principles of "deep learning" to delineate the percentage of tree cover. In order to perform image analytics in such a granular system, it is mandatory to devise an intelligent archiving and query system for image retrieval, file structuring, metadata processing and filtering of all available image scenes. Using the Open NASA Earth Exchange (NEX) initiative, which is a partnership with Amazon Web Services (AWS), we have developed an end-to-end architecture for designing the database and the deep belief network (following the distbelief computing model) to solve a grand challenge of scaling this process across quarter million NAIP tiles that cover the entire Continental United States. The AWS core components that we use to solve this problem are DynamoDB along with S3 for database query and storage, ElastiCache shared memory architecture for image segmentation, Elastic Map Reduce (EMR) for image feature extraction, and the memory optimized Elastic Cloud Compute (EC2) for the learning algorithm.
A Research on E - learning Resources Construction Based on Semantic Web
NASA Astrophysics Data System (ADS)
Rui, Liu; Maode, Deng
Traditional e-learning platforms have the flaws that it's usually difficult to query or positioning, and realize the cross platform sharing and interoperability. In the paper, the semantic web and metadata standard is discussed, and a kind of e - learning system framework based on semantic web is put forward to try to solve the flaws of traditional elearning platforms.
ERIC Educational Resources Information Center
Johnson, Genevieve Marie
2016-01-01
First-year university students (n = 199) completed an online questionnaire that queried their purchase of paper books and eBooks for university study and personal interest. The questionnaire also required students to rate their learning characteristics including reading strategies, study self-regulation, learning control beliefs and achievement…
Diverse expected gradient active learning for relative attributes.
You, Xinge; Wang, Ruxin; Tao, Dacheng
2014-07-01
The use of relative attributes for semantic understanding of images and videos is a promising way to improve communication between humans and machines. However, it is extremely labor- and time-consuming to define multiple attributes for each instance in large amount of data. One option is to incorporate active learning, so that the informative samples can be actively discovered and then labeled. However, most existing active-learning methods select samples one at a time (serial mode), and may therefore lose efficiency when learning multiple attributes. In this paper, we propose a batch-mode active-learning method, called diverse expected gradient active learning. This method integrates an informativeness analysis and a diversity analysis to form a diverse batch of queries. Specifically, the informativeness analysis employs the expected pairwise gradient length as a measure of informativeness, while the diversity analysis forces a constraint on the proposed diverse gradient angle. Since simultaneous optimization of these two parts is intractable, we utilize a two-step procedure to obtain the diverse batch of queries. A heuristic method is also introduced to suppress imbalanced multiclass distributions. Empirical evaluations of three different databases demonstrate the effectiveness and efficiency of the proposed approach.
Diverse Expected Gradient Active Learning for Relative Attributes.
You, Xinge; Wang, Ruxin; Tao, Dacheng
2014-06-02
The use of relative attributes for semantic understanding of images and videos is a promising way to improve communication between humans and machines. However, it is extremely labor- and time-consuming to define multiple attributes for each instance in large amount of data. One option is to incorporate active learning, so that the informative samples can be actively discovered and then labeled. However, most existing active-learning methods select samples one at a time (serial mode), and may therefore lose efficiency when learning multiple attributes. In this paper, we propose a batch-mode active-learning method, called Diverse Expected Gradient Active Learning (DEGAL). This method integrates an informativeness analysis and a diversity analysis to form a diverse batch of queries. Specifically, the informativeness analysis employs the expected pairwise gradient length as a measure of informativeness, while the diversity analysis forces a constraint on the proposed diverse gradient angle. Since simultaneous optimization of these two parts is intractable, we utilize a two-step procedure to obtain the diverse batch of queries. A heuristic method is also introduced to suppress imbalanced multi-class distributions. Empirical evaluations of three different databases demonstrate the effectiveness and efficiency of the proposed approach.
Enabling Incremental Query Re-Optimization.
Liu, Mengmeng; Ives, Zachary G; Loo, Boon Thau
2016-01-01
As declarative query processing techniques expand to the Web, data streams, network routers, and cloud platforms, there is an increasing need to re-plan execution in the presence of unanticipated performance changes. New runtime information may affect which query plan we prefer to run. Adaptive techniques require innovation both in terms of the algorithms used to estimate costs , and in terms of the search algorithm that finds the best plan. We investigate how to build a cost-based optimizer that recomputes the optimal plan incrementally given new cost information, much as a stream engine constantly updates its outputs given new data. Our implementation especially shows benefits for stream processing workloads. It lays the foundations upon which a variety of novel adaptive optimization algorithms can be built. We start by leveraging the recently proposed approach of formulating query plan enumeration as a set of recursive datalog queries ; we develop a variety of novel optimization approaches to ensure effective pruning in both static and incremental cases. We further show that the lessons learned in the declarative implementation can be equally applied to more traditional optimizer implementations.
Enabling Incremental Query Re-Optimization
Liu, Mengmeng; Ives, Zachary G.; Loo, Boon Thau
2017-01-01
As declarative query processing techniques expand to the Web, data streams, network routers, and cloud platforms, there is an increasing need to re-plan execution in the presence of unanticipated performance changes. New runtime information may affect which query plan we prefer to run. Adaptive techniques require innovation both in terms of the algorithms used to estimate costs, and in terms of the search algorithm that finds the best plan. We investigate how to build a cost-based optimizer that recomputes the optimal plan incrementally given new cost information, much as a stream engine constantly updates its outputs given new data. Our implementation especially shows benefits for stream processing workloads. It lays the foundations upon which a variety of novel adaptive optimization algorithms can be built. We start by leveraging the recently proposed approach of formulating query plan enumeration as a set of recursive datalog queries; we develop a variety of novel optimization approaches to ensure effective pruning in both static and incremental cases. We further show that the lessons learned in the declarative implementation can be equally applied to more traditional optimizer implementations. PMID:28659658
Classification of Automated Search Traffic
NASA Astrophysics Data System (ADS)
Buehrer, Greg; Stokes, Jack W.; Chellapilla, Kumar; Platt, John C.
As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third party systems interact with search engines for a variety of reasons, such as monitoring a web site’s rank, augmenting online games, or possibly to maliciously alter click-through rates. In this paper, we investigate automated traffic (sometimes referred to as bot traffic) in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by automated means. We then develop many different features that distinguish between queries generated by people searching for information, and those generated by automated processes. We categorize these features into two classes, either an interpretation of the physical model of human interactions, or as behavioral patterns of automated interactions. Using the these detection features, we next classify the query stream using multiple binary classifiers. In addition, a multiclass classifier is then developed to identify subclasses of both normal and automated traffic. An active learning algorithm is used to suggest which user sessions to label to improve the accuracy of the multiclass classifier, while also seeking to discover new classes of automated traffic. Performance analysis are then provided. Finally, the multiclass classifier is used to predict the subclass distribution for the search query stream.
NASA Astrophysics Data System (ADS)
Wei, Chun-Yan; Gao, Fei; Wen, Qiao-Yan; Wang, Tian-Yin
2014-12-01
Until now, the only kind of practical quantum private query (QPQ), quantum-key-distribution (QKD)-based QPQ, focuses on the retrieval of a single bit. In fact, meaningful message is generally composed of multiple adjacent bits (i.e., a multi-bit block). To obtain a message from database, the user Alice has to query l times to get each ai. In this condition, the server Bob could gain Alice's privacy once he obtains the address she queried in any of the l queries, since each ai contributes to the message Alice retrieves. Apparently, the longer the retrieved message is, the worse the user privacy becomes. To solve this problem, via an unbalanced-state technique and based on a variant of multi-level BB84 protocol, we present a protocol for QPQ of blocks, which allows the user to retrieve a multi-bit block from database in one query. Our protocol is somewhat like the high-dimension version of the first QKD-based QPQ protocol proposed by Jacobi et al., but some nontrivial modifications are necessary.
Wei, Chun-Yan; Gao, Fei; Wen, Qiao-Yan; Wang, Tian-Yin
2014-01-01
Until now, the only kind of practical quantum private query (QPQ), quantum-key-distribution (QKD)-based QPQ, focuses on the retrieval of a single bit. In fact, meaningful message is generally composed of multiple adjacent bits (i.e., a multi-bit block). To obtain a message from database, the user Alice has to query l times to get each ai. In this condition, the server Bob could gain Alice's privacy once he obtains the address she queried in any of the l queries, since each ai contributes to the message Alice retrieves. Apparently, the longer the retrieved message is, the worse the user privacy becomes. To solve this problem, via an unbalanced-state technique and based on a variant of multi-level BB84 protocol, we present a protocol for QPQ of blocks, which allows the user to retrieve a multi-bit block from database in one query. Our protocol is somewhat like the high-dimension version of the first QKD-based QPQ protocol proposed by Jacobi et al., but some nontrivial modifications are necessary. PMID:25518810
2007-08-01
In this domain, queries typically show a deeply nested structure, which makes the semantic parsing task rather challenging , e.g.: What states border...only 80% of the GEOQUERY queries are semantically tractable, which shows that GEOQUERY is indeed a more challenging domain than ATIS. Note that none...a particularly challenging task, because of the inherent ambiguity of natural languages on both sides. It has inspired a large body of research. In
Chen, R S; Nadkarni, P; Marenco, L; Levin, F; Erdos, J; Miller, P L
2000-01-01
The entity-attribute-value representation with classes and relationships (EAV/CR) provides a flexible and simple database schema to store heterogeneous biomedical data. In certain circumstances, however, the EAV/CR model is known to retrieve data less efficiently than conventionally based database schemas. To perform a pilot study that systematically quantifies performance differences for database queries directed at real-world microbiology data modeled with EAV/CR and conventional representations, and to explore the relative merits of different EAV/CR query implementation strategies. Clinical microbiology data obtained over a ten-year period were stored using both database models. Query execution times were compared for four clinically oriented attribute-centered and entity-centered queries operating under varying conditions of database size and system memory. The performance characteristics of three different EAV/CR query strategies were also examined. Performance was similar for entity-centered queries in the two database models. Performance in the EAV/CR model was approximately three to five times less efficient than its conventional counterpart for attribute-centered queries. The differences in query efficiency became slightly greater as database size increased, although they were reduced with the addition of system memory. The authors found that EAV/CR queries formulated using multiple, simple SQL statements executed in batch were more efficient than single, large SQL statements. This paper describes a pilot project to explore issues in and compare query performance for EAV/CR and conventional database representations. Although attribute-centered queries were less efficient in the EAV/CR model, these inefficiencies may be addressable, at least in part, by the use of more powerful hardware or more memory, or both.
Automatic classification and detection of clinically relevant images for diabetic retinopathy
NASA Astrophysics Data System (ADS)
Xu, Xinyu; Li, Baoxin
2008-03-01
We proposed a novel approach to automatic classification of Diabetic Retinopathy (DR) images and retrieval of clinically-relevant DR images from a database. Given a query image, our approach first classifies the image into one of the three categories: microaneurysm (MA), neovascularization (NV) and normal, and then it retrieves DR images that are clinically-relevant to the query image from an archival image database. In the classification stage, the query DR images are classified by the Multi-class Multiple-Instance Learning (McMIL) approach, where images are viewed as bags, each of which contains a number of instances corresponding to non-overlapping blocks, and each block is characterized by low-level features including color, texture, histogram of edge directions, and shape. McMIL first learns a collection of instance prototypes for each class that maximizes the Diverse Density function using Expectation- Maximization algorithm. A nonlinear mapping is then defined using the instance prototypes and maps every bag to a point in a new multi-class bag feature space. Finally a multi-class Support Vector Machine is trained in the multi-class bag feature space. In the retrieval stage, we retrieve images from the archival database who bear the same label with the query image, and who are the top K nearest neighbors of the query image in terms of similarity in the multi-class bag feature space. The classification approach achieves high classification accuracy, and the retrieval of clinically-relevant images not only facilitates utilization of the vast amount of hidden diagnostic knowledge in the database, but also improves the efficiency and accuracy of DR lesion diagnosis and assessment.
NASA Technical Reports Server (NTRS)
Gordon, Diana F.
1992-01-01
Selecting a good bias prior to concept learning can be difficult. Therefore, dynamic bias adjustment is becoming increasingly popular. Current dynamic bias adjustment systems, however, are limited in their ability to identify erroneous assumptions about the relationship between the bias and the target concept. Without proper diagnosis, it is difficult to identify and then remedy faulty assumptions. We have developed an approach that makes these assumptions explicit, actively tests them with queries to an oracle, and adjusts the bias based on the test results.
miBLAST: scalable evaluation of a batch of nucleotide sequence queries with BLAST
Kim, You Jung; Boyd, Andrew; Athey, Brian D.; Patel, Jignesh M.
2005-01-01
A common task in many modern bioinformatics applications is to match a set of nucleotide query sequences against a large sequence dataset. Exis-ting tools, such as BLAST, are designed to evaluate a single query at a time and can be unacceptably slow when the number of sequences in the query set is large. In this paper, we present a new algorithm, called miBLAST, that evaluates such batch workloads efficiently. At the core, miBLAST employs a q-gram filtering and an index join for efficiently detecting similarity between the query sequences and database sequences. This set-oriented technique, which indexes both the query and the database sets, results in substantial performance improvements over existing methods. Our results show that miBLAST is significantly faster than BLAST in many cases. For example, miBLAST aligned 247 965 oligonucleotide sequences in the Affymetrix probe set against the Human UniGene in 1.26 days, compared with 27.27 days with BLAST (an improvement by a factor of 22). The relative performance of miBLAST increases for larger word sizes; however, it decreases for longer queries. miBLAST employs the familiar BLAST statistical model and output format, guaranteeing the same accuracy as BLAST and facilitating a seamless transition for existing BLAST users. PMID:16061938
TCW: Transcriptome Computational Workbench
Soderlund, Carol; Nelson, William; Willer, Mark; Gang, David R.
2013-01-01
Background The analysis of transcriptome data involves many steps and various programs, along with organization of large amounts of data and results. Without a methodical approach for storage, analysis and query, the resulting ad hoc analysis can lead to human error, loss of data and results, inefficient use of time, and lack of verifiability, repeatability, and extensibility. Methodology The Transcriptome Computational Workbench (TCW) provides Java graphical interfaces for methodical analysis for both single and comparative transcriptome data without the use of a reference genome (e.g. for non-model organisms). The singleTCW interface steps the user through importing transcript sequences (e.g. Illumina) or assembling long sequences (e.g. Sanger, 454, transcripts), annotating the sequences, and performing differential expression analysis using published statistical programs in R. The data, metadata, and results are stored in a MySQL database. The multiTCW interface builds a comparison database by importing sequence and annotation from one or more single TCW databases, executes the ESTscan program to translate the sequences into proteins, and then incorporates one or more clusterings, where the clustering options are to execute the orthoMCL program, compute transitive closure, or import clusters. Both singleTCW and multiTCW allow extensive query and display of the results, where singleTCW displays the alignment of annotation hits to transcript sequences, and multiTCW displays multiple transcript alignments with MUSCLE or pairwise alignments. The query programs can be executed on the desktop for fastest analysis, or from the web for sharing the results. Conclusion It is now affordable to buy a multi-processor machine, and easy to install Java and MySQL. By simply downloading the TCW, the user can interactively analyze, query and view their data. The TCW allows in-depth data mining of the results, which can lead to a better understanding of the transcriptome. TCW is freely available from www.agcol.arizona.edu/software/tcw. PMID:23874959
TCW: transcriptome computational workbench.
Soderlund, Carol; Nelson, William; Willer, Mark; Gang, David R
2013-01-01
The analysis of transcriptome data involves many steps and various programs, along with organization of large amounts of data and results. Without a methodical approach for storage, analysis and query, the resulting ad hoc analysis can lead to human error, loss of data and results, inefficient use of time, and lack of verifiability, repeatability, and extensibility. The Transcriptome Computational Workbench (TCW) provides Java graphical interfaces for methodical analysis for both single and comparative transcriptome data without the use of a reference genome (e.g. for non-model organisms). The singleTCW interface steps the user through importing transcript sequences (e.g. Illumina) or assembling long sequences (e.g. Sanger, 454, transcripts), annotating the sequences, and performing differential expression analysis using published statistical programs in R. The data, metadata, and results are stored in a MySQL database. The multiTCW interface builds a comparison database by importing sequence and annotation from one or more single TCW databases, executes the ESTscan program to translate the sequences into proteins, and then incorporates one or more clusterings, where the clustering options are to execute the orthoMCL program, compute transitive closure, or import clusters. Both singleTCW and multiTCW allow extensive query and display of the results, where singleTCW displays the alignment of annotation hits to transcript sequences, and multiTCW displays multiple transcript alignments with MUSCLE or pairwise alignments. The query programs can be executed on the desktop for fastest analysis, or from the web for sharing the results. It is now affordable to buy a multi-processor machine, and easy to install Java and MySQL. By simply downloading the TCW, the user can interactively analyze, query and view their data. The TCW allows in-depth data mining of the results, which can lead to a better understanding of the transcriptome. TCW is freely available from www.agcol.arizona.edu/software/tcw.
Embedding strategies for effective use of information from multiple sequence alignments.
Henikoff, S.; Henikoff, J. G.
1997-01-01
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain. PMID:9070452
Wei, Chun-Yan; Gao, Fei; Wen, Qiao-Yan; Wang, Tian-Yin
2014-12-18
Until now, the only kind of practical quantum private query (QPQ), quantum-key-distribution (QKD)-based QPQ, focuses on the retrieval of a single bit. In fact, meaningful message is generally composed of multiple adjacent bits (i.e., a multi-bit block). To obtain a message a1a2···al from database, the user Alice has to query l times to get each ai. In this condition, the server Bob could gain Alice's privacy once he obtains the address she queried in any of the l queries, since each a(i) contributes to the message Alice retrieves. Apparently, the longer the retrieved message is, the worse the user privacy becomes. To solve this problem, via an unbalanced-state technique and based on a variant of multi-level BB84 protocol, we present a protocol for QPQ of blocks, which allows the user to retrieve a multi-bit block from database in one query. Our protocol is somewhat like the high-dimension version of the first QKD-based QPQ protocol proposed by Jacobi et al., but some nontrivial modifications are necessary.
Support patient search on pathology reports with interactive online learning based data extraction.
Zheng, Shuai; Lu, James J; Appin, Christina; Brat, Daniel; Wang, Fusheng
2015-01-01
Structural reporting enables semantic understanding and prompt retrieval of clinical findings about patients. While synoptic pathology reporting provides templates for data entries, information in pathology reports remains primarily in narrative free text form. Extracting data of interest from narrative pathology reports could significantly improve the representation of the information and enable complex structured queries. However, manual extraction is tedious and error-prone, and automated tools are often constructed with a fixed training dataset and not easily adaptable. Our goal is to extract data from pathology reports to support advanced patient search with a highly adaptable semi-automated data extraction system, which can adjust and self-improve by learning from a user's interaction with minimal human effort. We have developed an online machine learning based information extraction system called IDEAL-X. With its graphical user interface, the system's data extraction engine automatically annotates values for users to review upon loading each report text. The system analyzes users' corrections regarding these annotations with online machine learning, and incrementally enhances and refines the learning model as reports are processed. The system also takes advantage of customized controlled vocabularies, which can be adaptively refined during the online learning process to further assist the data extraction. As the accuracy of automatic annotation improves overtime, the effort of human annotation is gradually reduced. After all reports are processed, a built-in query engine can be applied to conveniently define queries based on extracted structured data. We have evaluated the system with a dataset of anatomic pathology reports from 50 patients. Extracted data elements include demographical data, diagnosis, genetic marker, and procedure. The system achieves F-1 scores of around 95% for the majority of tests. Extracting data from pathology reports could enable more accurate knowledge to support biomedical research and clinical diagnosis. IDEAL-X provides a bridge that takes advantage of online machine learning based data extraction and the knowledge from human's feedback. By combining iterative online learning and adaptive controlled vocabularies, IDEAL-X can deliver highly adaptive and accurate data extraction to support patient search.
Optimization of Extended Relational Database Systems
1986-07-23
control functions are integrated into a single system in a homogeneoua way. As a first exam - ple, consider previous work in supporting various semantic...sizes are reduced and, wnk? quently, the number of materializations that will be needed is aba lower. For exam - pie, in the above query tuple...retrieve (EMP.name) where EMP hobbies instrument = ’ violin ’ When the various entries in the hobbies field are materialized, only those queries that
Tan, Chee-Heng; Teh, Ying-Wah
2013-08-01
The main obstacles in mass adoption of cloud computing for database operations in healthcare organization are the data security and privacy issues. In this paper, it is shown that IT services particularly in hardware performance evaluation in virtual machine can be accomplished effectively without IT personnel gaining access to actual data for diagnostic and remediation purposes. The proposed mechanisms utilized the hypothetical data from TPC-H benchmark, to achieve 2 objectives. First, the underlying hardware performance and consistency is monitored via a control system, which is constructed using TPC-H queries. Second, the mechanism to construct stress-testing scenario is envisaged in the host, using a single or combination of TPC-H queries, so that the resource threshold point can be verified, if the virtual machine is still capable of serving critical transactions at this constraining juncture. This threshold point uses server run queue size as input parameter, and it serves 2 purposes: It provides the boundary threshold to the control system, so that periodic learning of the synthetic data sets for performance evaluation does not reach the host's constraint level. Secondly, when the host undergoes hardware change, stress-testing scenarios are simulated in the host by loading up to this resource threshold level, for subsequent response time verification from real and critical transactions.
Queries over Unstructured Data: Probabilistic Methods to the Rescue
NASA Astrophysics Data System (ADS)
Sarawagi, Sunita
Unstructured data like emails, addresses, invoices, call transcripts, reviews, and press releases are now an integral part of any large enterprise. A challenge of modern business intelligence applications is analyzing and querying data seamlessly across structured and unstructured sources. This requires the development of automated techniques for extracting structured records from text sources and resolving entity mentions in data from various sources. The success of any automated method for extraction and integration depends on how effectively it unifies diverse clues in the unstructured source and in existing structured databases. We argue that statistical learning techniques like Conditional Random Fields (CRFs) provide a accurate, elegant and principled framework for tackling these tasks. Given the inherent noise in real-world sources, it is important to capture the uncertainty of the above operations via imprecise data models. CRFs provide a sound probability distribution over extractions but are not easy to represent and query in a relational framework. We present methods of approximating this distribution to query-friendly row and column uncertainty models. Finally, we present models for representing the uncertainty of de-duplication and algorithms for various Top-K count queries on imprecise duplicates.
ERIC Educational Resources Information Center
Banks, Joy
2017-01-01
This investigation employs Disability Critical Race Studies as a theoretical framework to determine the interdependence of racism and ableism in school settings. African American male students with learning disabilities are queried about their interpretations of special education placement and labeling while attempting to secure educational…
Teaching Tip: Active Learning via a Sample Database: The Case of Microsoft's Adventure Works
ERIC Educational Resources Information Center
Mitri, Michel
2015-01-01
This paper describes the use and benefits of Microsoft's Adventure Works (AW) database to teach advanced database skills in a hands-on, realistic environment. Database management and querying skills are a key element of a robust information systems curriculum, and active learning is an important way to develop these skills. To facilitate active…
Designing Adult Learning Strategies: The Case of South Eastern Europe
ERIC Educational Resources Information Center
Gunny, Madeleine; Viertel, Evelyn
2006-01-01
The importance of lifelong learning is generally well understood and few people today would query the need for adults to regularly update their skills in line with labour market needs, and for governments and social partners to provide an environment that supports skills acquisition and updating. However, it is clear when we look at data from the…
Designing a Syntax-Based Retrieval System for Supporting Language Learning
ERIC Educational Resources Information Center
Tsao, Nai-Lung; Kuo, Chin-Hwa; Wible, David; Hung, Tsung-Fu
2009-01-01
In this paper, we propose a syntax-based text retrieval system for on-line language learning and use a fast regular expression search engine as its main component. Regular expression searches provide more scalable querying and search results than keyword-based searches. However, without a well-designed index scheme, the execution time of regular…
Sumner, Walton; Xu, Jin Zhong; Roussel, Guy; Hagen, Michael D
2007-10-11
The American Board of Family Medicine deployed virtual patient simulations in 2004 to evaluate Diplomates' diagnostic and management skills. A previously reported dynamic process generates general symptom histories from time series data representing baseline values and reactions to medications. The simulator also must answer queries about details such as palliation and provocation. These responses often describe some recurring pattern, such as, "this medicine relieves my symptoms in a few minutes." The simulator can provide a detail stored as text, or it can evaluate a reference to a second query object. The second query object can generate details using a single Bayesian network to evaluate the effect of each drug in a virtual patient's medication list. A new medication option may not require redesign of the second query object if its implementation is consistent with related drugs. We expect this mechanism to maintain realistic responses to detail questions in complex simulations.
Andrenucci, Andrea
2016-01-01
Few studies have been performed within cross-language information retrieval (CLIR) in the field of psychology and psychotherapy. The aim of this paper is to to analyze and assess the quality of available query translation methods for CLIR on a health portal for psychology. A test base of 100 user queries, 50 Multi Word Units (WUs) and 50 Single WUs, was used. Swedish was the source language and English the target language. Query translation methods based on machine translation (MT) and dictionary look-up were utilized in order to submit query translations to two search engines: Google Site Search and Quick Ask. Standard IR evaluation measures and a qualitative analysis were utilized to assess the results. The lexicon extracted with word alignment of the portal's parallel corpus provided better statistical results among dictionary look-ups. Google Translate provided more linguistically correct translations overall and also delivered better retrieval results in MT.
Secure searching of biomarkers through hybrid homomorphic encryption scheme.
Kim, Miran; Song, Yongsoo; Cheon, Jung Hee
2017-07-26
As genome sequencing technology develops rapidly, there has lately been an increasing need to keep genomic data secure even when stored in the cloud and still used for research. We are interested in designing a protocol for the secure outsourcing matching problem on encrypted data. We propose an efficient method to securely search a matching position with the query data and extract some information at the position. After decryption, only a small amount of comparisons with the query information should be performed in plaintext state. We apply this method to find a set of biomarkers in encrypted genomes. The important feature of our method is to encode a genomic database as a single element of polynomial ring. Since our method requires a single homomorphic multiplication of hybrid scheme for query computation, it has the advantage over the previous methods in parameter size, computation complexity, and communication cost. In particular, the extraction procedure not only prevents leakage of database information that has not been queried by user but also reduces the communication cost by half. We evaluate the performance of our method and verify that the computation on large-scale personal data can be securely and practically outsourced to a cloud environment during data analysis. It takes about 3.9 s to search-and-extract the reference and alternate sequences at the queried position in a database of size 4M. Our solution for finding a set of biomarkers in DNA sequences shows the progress of cryptographic techniques in terms of their capability can support real-world genome data analysis in a cloud environment.
Active learning reduces annotation time for clinical concept extraction.
Kholghi, Mahnoosh; Sitbon, Laurianne; Zuccon, Guido; Nguyen, Anthony
2017-10-01
To investigate: (1) the annotation time savings by various active learning query strategies compared to supervised learning and a random sampling baseline, and (2) the benefits of active learning-assisted pre-annotations in accelerating the manual annotation process compared to de novo annotation. There are 73 and 120 discharge summary reports provided by Beth Israel institute in the train and test sets of the concept extraction task in the i2b2/VA 2010 challenge, respectively. The 73 reports were used in user study experiments for manual annotation. First, all sequences within the 73 reports were manually annotated from scratch. Next, active learning models were built to generate pre-annotations for the sequences selected by a query strategy. The annotation/reviewing time per sequence was recorded. The 120 test reports were used to measure the effectiveness of the active learning models. When annotating from scratch, active learning reduced the annotation time up to 35% and 28% compared to a fully supervised approach and a random sampling baseline, respectively. Reviewing active learning-assisted pre-annotations resulted in 20% further reduction of the annotation time when compared to de novo annotation. The number of concepts that require manual annotation is a good indicator of the annotation time for various active learning approaches as demonstrated by high correlation between time rate and concept annotation rate. Active learning has a key role in reducing the time required to manually annotate domain concepts from clinical free text, either when annotating from scratch or reviewing active learning-assisted pre-annotations. Copyright © 2017 Elsevier B.V. All rights reserved.
Minimizing Statistical Bias with Queries.
1995-09-14
method for optimally selecting these points would o er enormous savings in time and money. An active learning system will typically attempt to select data...research in active learning assumes that the sec- ond term of Equation 2 is approximately zero, that is, that the learner is unbiased. If this is the case...outperforms the variance- minimizing algorithm and random exploration. and e ective strategy for active learning . I have given empirical evidence that, with
A natural language query system for Hubble Space Telescope proposal selection
NASA Technical Reports Server (NTRS)
Hornick, Thomas; Cohen, William; Miller, Glenn
1987-01-01
The proposal selection process for the Hubble Space Telescope is assisted by a robust and easy to use query program (TACOS). The system parses an English subset language sentence regardless of the order of the keyword phases, allowing the user a greater flexibility than a standard command query language. Capabilities for macro and procedure definition are also integrated. The system was designed for flexibility in both use and maintenance. In addition, TACOS can be applied to any knowledge domain that can be expressed in terms of a single reaction. The system was implemented mostly in Common LISP. The TACOS design is described in detail, with particular attention given to the implementation methods of sentence processing.
A Ruby API to query the Ensembl database for genomic features.
Strozzi, Francesco; Aerts, Jan
2011-04-01
The Ensembl database makes genomic features available via its Genome Browser. It is also possible to access the underlying data through a Perl API for advanced querying. We have developed a full-featured Ruby API to the Ensembl databases, providing the same functionality as the Perl interface with additional features. A single Ruby API is used to access different releases of the Ensembl databases and is also able to query multi-species databases. Most functionality of the API is provided using the ActiveRecord pattern. The library depends on introspection to make it release independent. The API is available through the Rubygem system and can be installed with the command gem install ruby-ensembl-api.
Comprehensive Optimal Manpower and Personnel Analytic Simulation System (COMPASS)
2009-10-01
4 The EDB consists of 4 major components (some of which are re-usable): 1. Metadata Editor ( MDE ): Also considered a leaf node, the metadata...end-user queries via the QB. The EDB supports multiple instances of the MDE , although currently, only a single instance is recommended. 2 Query...the MSB is a central collection of web services, responsible for the authentication and authorization of users, maintenance of the EDB metadata
Web Image Search Re-ranking with Click-based Similarity and Typicality.
Yang, Xiaopeng; Mei, Tao; Zhang, Yong Dong; Liu, Jie; Satoh, Shin'ichi
2016-07-20
In image search re-ranking, besides the well known semantic gap, intent gap, which is the gap between the representation of users' query/demand and the real intent of the users, is becoming a major problem restricting the development of image retrieval. To reduce human effects, in this paper, we use image click-through data, which can be viewed as the "implicit feedback" from users, to help overcome the intention gap, and further improve the image search performance. Generally, the hypothesis visually similar images should be close in a ranking list and the strategy images with higher relevance should be ranked higher than others are widely accepted. To obtain satisfying search results, thus, image similarity and the level of relevance typicality are determinate factors correspondingly. However, when measuring image similarity and typicality, conventional re-ranking approaches only consider visual information and initial ranks of images, while overlooking the influence of click-through data. This paper presents a novel re-ranking approach, named spectral clustering re-ranking with click-based similarity and typicality (SCCST). First, to learn an appropriate similarity measurement, we propose click-based multi-feature similarity learning algorithm (CMSL), which conducts metric learning based on clickbased triplets selection, and integrates multiple features into a unified similarity space via multiple kernel learning. Then based on the learnt click-based image similarity measure, we conduct spectral clustering to group visually and semantically similar images into same clusters, and get the final re-rank list by calculating click-based clusters typicality and withinclusters click-based image typicality in descending order. Our experiments conducted on two real-world query-image datasets with diverse representative queries show that our proposed reranking approach can significantly improve initial search results, and outperform several existing re-ranking approaches.
Quantum algorithms on Walsh transform and Hamming distance for Boolean functions
NASA Astrophysics Data System (ADS)
Xie, Zhengwei; Qiu, Daowen; Cai, Guangya
2018-06-01
Walsh spectrum or Walsh transform is an alternative description of Boolean functions. In this paper, we explore quantum algorithms to approximate the absolute value of Walsh transform W_f at a single point z0 (i.e., |W_f(z0)|) for n-variable Boolean functions with probability at least 8/π 2 using the number of O(1/|W_f(z_{0)|ɛ }) queries, promised that the accuracy is ɛ , while the best known classical algorithm requires O(2n) queries. The Hamming distance between Boolean functions is used to study the linearity testing and other important problems. We take advantage of Walsh transform to calculate the Hamming distance between two n-variable Boolean functions f and g using O(1) queries in some cases. Then, we exploit another quantum algorithm which converts computing Hamming distance between two Boolean functions to quantum amplitude estimation (i.e., approximate counting). If Ham(f,g)=t≠0, we can approximately compute Ham( f, g) with probability at least 2/3 by combining our algorithm and {Approx-Count(f,ɛ ) algorithm} using the expected number of Θ( √{N/(\\lfloor ɛ t\\rfloor +1)}+√{t(N-t)}/\\lfloor ɛ t\\rfloor +1) queries, promised that the accuracy is ɛ . Moreover, our algorithm is optimal, while the exact query complexity for the above problem is Θ(N) and the query complexity with the accuracy ɛ is O(1/ɛ 2N/(t+1)) in classical algorithm, where N=2n. Finally, we present three exact quantum query algorithms for two promise problems on Hamming distance using O(1) queries, while any classical deterministic algorithm solving the problem uses Ω(2n) queries.
Query-based biclustering of gene expression data using Probabilistic Relational Models.
Zhao, Hui; Cloots, Lore; Van den Bulcke, Tim; Wu, Yan; De Smet, Riet; Storms, Valerie; Meysman, Pieter; Engelen, Kristof; Marchal, Kathleen
2011-02-15
With the availability of large scale expression compendia it is now possible to view own findings in the light of what is already available and retrieve genes with an expression profile similar to a set of genes of interest (i.e., a query or seed set) for a subset of conditions. To that end, a query-based strategy is needed that maximally exploits the coexpression behaviour of the seed genes to guide the biclustering, but that at the same time is robust against the presence of noisy genes in the seed set as seed genes are often assumed, but not guaranteed to be coexpressed in the queried compendium. Therefore, we developed ProBic, a query-based biclustering strategy based on Probabilistic Relational Models (PRMs) that exploits the use of prior distributions to extract the information contained within the seed set. We applied ProBic on a large scale Escherichia coli compendium to extend partially described regulons with potentially novel members. We compared ProBic's performance with previously published query-based biclustering algorithms, namely ISA and QDB, from the perspective of bicluster expression quality, robustness of the outcome against noisy seed sets and biological relevance.This comparison learns that ProBic is able to retrieve biologically relevant, high quality biclusters that retain their seed genes and that it is particularly strong in handling noisy seeds. ProBic is a query-based biclustering algorithm developed in a flexible framework, designed to detect biologically relevant, high quality biclusters that retain relevant seed genes even in the presence of noise or when dealing with low quality seed sets.
Supervised learning of tools for content-based search of image databases
NASA Astrophysics Data System (ADS)
Delanoy, Richard L.
1996-03-01
A computer environment, called the Toolkit for Image Mining (TIM), is being developed with the goal of enabling users with diverse interests and varied computer skills to create search tools for content-based image retrieval and other pattern matching tasks. Search tools are generated using a simple paradigm of supervised learning that is based on the user pointing at mistakes of classification made by the current search tool. As mistakes are identified, a learning algorithm uses the identified mistakes to build up a model of the user's intentions, construct a new search tool, apply the search tool to a test image, display the match results as feedback to the user, and accept new inputs from the user. Search tools are constructed in the form of functional templates, which are generalized matched filters capable of knowledge- based image processing. The ability of this system to learn the user's intentions from experience contrasts with other existing approaches to content-based image retrieval that base searches on the characteristics of a single input example or on a predefined and semantically- constrained textual query. Currently, TIM is capable of learning spectral and textural patterns, but should be adaptable to the learning of shapes, as well. Possible applications of TIM include not only content-based image retrieval, but also quantitative image analysis, the generation of metadata for annotating images, data prioritization or data reduction in bandwidth-limited situations, and the construction of components for larger, more complex computer vision algorithms.
Combinatorial Fusion Analysis for Meta Search Information Retrieval
NASA Astrophysics Data System (ADS)
Hsu, D. Frank; Taksa, Isak
Leading commercial search engines are built as single event systems. In response to a particular search query, the search engine returns a single list of ranked search results. To find more relevant results the user must frequently try several other search engines. A meta search engine was developed to enhance the process of multi-engine querying. The meta search engine queries several engines at the same time and fuses individual engine results into a single search results list. The fusion of multiple search results has been shown (mostly experimentally) to be highly effective. However, the question of why and how the fusion should be done still remains largely unanswered. In this chapter, we utilize the combinatorial fusion analysis proposed by Hsu et al. to analyze combination and fusion of multiple sources of information. A rank/score function is used in the design and analysis of our framework. The framework provides a better understanding of the fusion phenomenon in information retrieval. For example, to improve the performance of the combined multiple scoring systems, it is necessary that each of the individual scoring systems has relatively high performance and the individual scoring systems are diverse. Additionally, we illustrate various applications of the framework using two examples from the information retrieval domain.
Federated queries of clinical data repositories: the sum of the parts does not equal the whole
Weber, Griffin M
2013-01-01
Background and objective In 2008 we developed a shared health research information network (SHRINE), which for the first time enabled research queries across the full patient populations of four Boston hospitals. It uses a federated architecture, where each hospital returns only the aggregate count of the number of patients who match a query. This allows hospitals to retain control over their local databases and comply with federal and state privacy laws. However, because patients may receive care from multiple hospitals, the result of a federated query might differ from what the result would be if the query were run against a single central repository. This paper describes the situations when this happens and presents a technique for correcting these errors. Methods We use a one-time process of identifying which patients have data in multiple repositories by comparing one-way hash values of patient demographics. This enables us to partition the local databases such that all patients within a given partition have data at the same subset of hospitals. Federated queries are then run separately on each partition independently, and the combined results are presented to the user. Results Using theoretical bounds and simulated hospital networks, we demonstrate that once the partitions are made, SHRINE can produce more precise estimates of the number of patients matching a query. Conclusions Uncertainty in the overlap of patient populations across hospitals limits the effectiveness of SHRINE and other federated query tools. Our technique reduces this uncertainty while retaining an aggregate federated architecture. PMID:23349080
ERIC Educational Resources Information Center
Ainsa, Patricia
2015-01-01
E-texts have become a main venue but research has not provided much guidance for practical adaptation, yet. This research query started in the spring of 2014 when an e-text was adopted for an undergraduate distance learning class. The change created some unexpected influence in the students' experiences. It was necessary to assess their…
Learning context-sensitive shape similarity by graph transduction.
Bai, Xiang; Yang, Xingwei; Latecki, Longin Jan; Liu, Wenyu; Tu, Zhuowen
2010-05-01
Shape similarity and shape retrieval are very important topics in computer vision. The recent progress in this domain has been mostly driven by designing smart shape descriptors for providing better similarity measure between pairs of shapes. In this paper, we provide a new perspective to this problem by considering the existing shapes as a group, and study their similarity measures to the query shape in a graph structure. Our method is general and can be built on top of any existing shape similarity measure. For a given similarity measure, a new similarity is learned through graph transduction. The new similarity is learned iteratively so that the neighbors of a given shape influence its final similarity to the query. The basic idea here is related to PageRank ranking, which forms a foundation of Google Web search. The presented experimental results demonstrate that the proposed approach yields significant improvements over the state-of-art shape matching algorithms. We obtained a retrieval rate of 91.61 percent on the MPEG-7 data set, which is the highest ever reported in the literature. Moreover, the learned similarity by the proposed method also achieves promising improvements on both shape classification and shape clustering.
Shark: SQL and Rich Analytics at Scale
2012-11-26
learning programs up to 100 faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a...Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark...so using a runtime that is optimized for such workloads and a programming model that is designed to express machine learn - ing algorithms. 4.1
Searching for cancer information on the internet: analyzing natural language search queries.
Bader, Judith L; Theofanos, Mary Frances
2003-12-11
Searching for health information is one of the most-common tasks performed by Internet users. Many users begin searching on popular search engines rather than on prominent health information sites. We know that many visitors to our (National Cancer Institute) Web site, cancer.gov, arrive via links in search engine result. To learn more about the specific needs of our general-public users, we wanted to understand what lay users really wanted to know about cancer, how they phrased their questions, and how much detail they used. The National Cancer Institute partnered with AskJeeves, Inc to develop a methodology to capture, sample, and analyze 3 months of cancer-related queries on the Ask.com Web site, a prominent United States consumer search engine, which receives over 35 million queries per week. Using a benchmark set of 500 terms and word roots supplied by the National Cancer Institute, AskJeeves identified a test sample of cancer queries for 1 week in August 2001. From these 500 terms only 37 appeared >or= 5 times/day over the trial test week in 17208 queries. Using these 37 terms, 204165 instances of cancer queries were found in the Ask.com query logs for the actual test period of June-August 2001. Of these, 7500 individual user questions were randomly selected for detailed analysis and assigned to appropriate categories. The exact language of sample queries is presented. Considering multiples of the same questions, the sample of 7500 individual user queries represented 76077 queries (37% of the total 3-month pool). Overall 78.37% of sampled Cancer queries asked about 14 specific cancer types. Within each cancer type, queries were sorted into appropriate subcategories including at least the following: General Information, Symptoms, Diagnosis and Testing, Treatment, Statistics, Definition, and Cause/Risk/Link. The most-common specific cancer types mentioned in queries were Digestive/Gastrointestinal/Bowel (15.0%), Breast (11.7%), Skin (11.3%), and Genitourinary (10.5%). Additional subcategories of queries about specific cancer types varied, depending on user input. Queries that were not specific to a cancer type were also tracked and categorized. Natural-language searching affords users the opportunity to fully express their information needs and can aid users naïve to the content and vocabulary. The specific queries analyzed for this study reflect news and research studies reported during the study dates and would surely change with different study dates. Analyzing queries from search engines represents one way of knowing what kinds of content to provide to users of a given Web site. Users ask questions using whole sentences and keywords, often misspelling words. Providing the option for natural-language searching does not obviate the need for good information architecture, usability engineering, and user testing in order to optimize user experience.
Searching for Cancer Information on the Internet: Analyzing Natural Language Search Queries
Theofanos, Mary Frances
2003-01-01
Background Searching for health information is one of the most-common tasks performed by Internet users. Many users begin searching on popular search engines rather than on prominent health information sites. We know that many visitors to our (National Cancer Institute) Web site, cancer.gov, arrive via links in search engine result. Objective To learn more about the specific needs of our general-public users, we wanted to understand what lay users really wanted to know about cancer, how they phrased their questions, and how much detail they used. Methods The National Cancer Institute partnered with AskJeeves, Inc to develop a methodology to capture, sample, and analyze 3 months of cancer-related queries on the Ask.com Web site, a prominent United States consumer search engine, which receives over 35 million queries per week. Using a benchmark set of 500 terms and word roots supplied by the National Cancer Institute, AskJeeves identified a test sample of cancer queries for 1 week in August 2001. From these 500 terms only 37 appeared ≥ 5 times/day over the trial test week in 17208 queries. Using these 37 terms, 204165 instances of cancer queries were found in the Ask.com query logs for the actual test period of June-August 2001. Of these, 7500 individual user questions were randomly selected for detailed analysis and assigned to appropriate categories. The exact language of sample queries is presented. Results Considering multiples of the same questions, the sample of 7500 individual user queries represented 76077 queries (37% of the total 3-month pool). Overall 78.37% of sampled Cancer queries asked about 14 specific cancer types. Within each cancer type, queries were sorted into appropriate subcategories including at least the following: General Information, Symptoms, Diagnosis and Testing, Treatment, Statistics, Definition, and Cause/Risk/Link. The most-common specific cancer types mentioned in queries were Digestive/Gastrointestinal/Bowel (15.0%), Breast (11.7%), Skin (11.3%), and Genitourinary (10.5%). Additional subcategories of queries about specific cancer types varied, depending on user input. Queries that were not specific to a cancer type were also tracked and categorized. Conclusions Natural-language searching affords users the opportunity to fully express their information needs and can aid users naïve to the content and vocabulary. The specific queries analyzed for this study reflect news and research studies reported during the study dates and would surely change with different study dates. Analyzing queries from search engines represents one way of knowing what kinds of content to provide to users of a given Web site. Users ask questions using whole sentences and keywords, often misspelling words. Providing the option for natural-language searching does not obviate the need for good information architecture, usability engineering, and user testing in order to optimize user experience. PMID:14713659
A topic clustering approach to finding similar questions from large question and answer archives.
Zhang, Wei-Nan; Liu, Ting; Yang, Yang; Cao, Liujuan; Zhang, Yu; Ji, Rongrong
2014-01-01
With the blooming of Web 2.0, Community Question Answering (CQA) services such as Yahoo! Answers (http://answers.yahoo.com), WikiAnswer (http://wiki.answers.com), and Baidu Zhidao (http://zhidao.baidu.com), etc., have emerged as alternatives for knowledge and information acquisition. Over time, a large number of question and answer (Q&A) pairs with high quality devoted by human intelligence have been accumulated as a comprehensive knowledge base. Unlike the search engines, which return long lists of results, searching in the CQA services can obtain the correct answers to the question queries by automatically finding similar questions that have already been answered by other users. Hence, it greatly improves the efficiency of the online information retrieval. However, given a question query, finding the similar and well-answered questions is a non-trivial task. The main challenge is the word mismatch between question query (query) and candidate question for retrieval (question). To investigate this problem, in this study, we capture the word semantic similarity between query and question by introducing the topic modeling approach. We then propose an unsupervised machine-learning approach to finding similar questions on CQA Q&A archives. The experimental results show that our proposed approach significantly outperforms the state-of-the-art methods.
CellAtlasSearch: a scalable search engine for single cells.
Srivastava, Divyanshu; Iyer, Arvind; Kumar, Vibhor; Sengupta, Debarka
2018-05-21
Owing to the advent of high throughput single cell transcriptomics, past few years have seen exponential growth in production of gene expression data. Recently efforts have been made by various research groups to homogenize and store single cell expression from a large number of studies. The true value of this ever increasing data deluge can be unlocked by making it searchable. To this end, we propose CellAtlasSearch, a novel search architecture for high dimensional expression data, which is massively parallel as well as light-weight, thus infinitely scalable. In CellAtlasSearch, we use a Graphical Processing Unit (GPU) friendly version of Locality Sensitive Hashing (LSH) for unmatched speedup in data processing and query. Currently, CellAtlasSearch features over 300 000 reference expression profiles including both bulk and single-cell data. It enables the user query individual single cell transcriptomes and finds matching samples from the database along with necessary meta information. CellAtlasSearch aims to assist researchers and clinicians in characterizing unannotated single cells. It also facilitates noise free, low dimensional representation of single-cell expression profiles by projecting them on a wide variety of reference samples. The web-server is accessible at: http://www.cellatlassearch.com.
Fast Query-Optimized Kernel-Machine Classification
NASA Technical Reports Server (NTRS)
Mazzoni, Dominic; DeCoste, Dennis
2004-01-01
A recently developed algorithm performs kernel-machine classification via incremental approximate nearest support vectors. The algorithm implements support-vector machines (SVMs) at speeds 10 to 100 times those attainable by use of conventional SVM algorithms. The algorithm offers potential benefits for classification of images, recognition of speech, recognition of handwriting, and diverse other applications in which there are requirements to discern patterns in large sets of data. SVMs constitute a subset of kernel machines (KMs), which have become popular as models for machine learning and, more specifically, for automated classification of input data on the basis of labeled training data. While similar in many ways to k-nearest-neighbors (k-NN) models and artificial neural networks (ANNs), SVMs tend to be more accurate. Using representations that scale only linearly in the numbers of training examples, while exploring nonlinear (kernelized) feature spaces that are exponentially larger than the original input dimensionality, KMs elegantly and practically overcome the classic curse of dimensionality. However, the price that one must pay for the power of KMs is that query-time complexity scales linearly with the number of training examples, making KMs often orders of magnitude more computationally expensive than are ANNs, decision trees, and other popular machine learning alternatives. The present algorithm treats an SVM classifier as a special form of a k-NN. The algorithm is based partly on an empirical observation that one can often achieve the same classification as that of an exact KM by using only small fraction of the nearest support vectors (SVs) of a query. The exact KM output is a weighted sum over the kernel values between the query and the SVs. In this algorithm, the KM output is approximated with a k-NN classifier, the output of which is a weighted sum only over the kernel values involving k selected SVs. Before query time, there are gathered statistics about how misleading the output of the k-NN model can be, relative to the outputs of the exact KM for a representative set of examples, for each possible k from 1 to the total number of SVs. From these statistics, there are derived upper and lower thresholds for each step k. These thresholds identify output levels for which the particular variant of the k-NN model already leans so strongly positively or negatively that a reversal in sign is unlikely, given the weaker SV neighbors still remaining. At query time, the partial output of each query is incrementally updated, stopping as soon as it exceeds the predetermined statistical thresholds of the current step. For an easy query, stopping can occur as early as step k = 1. For more difficult queries, stopping might not occur until nearly all SVs are touched. A key empirical observation is that this approach can tolerate very approximate nearest-neighbor orderings. In experiments, SVs and queries were projected to a subspace comprising the top few principal- component dimensions and neighbor orderings were computed in that subspace. This approach ensured that the overhead of the nearest-neighbor computations was insignificant, relative to that of the exact KM computation.
A Distributed Multi-Agent System for Collaborative Information Management and Learning
NASA Technical Reports Server (NTRS)
Chen, James R.; Wolfe, Shawn R.; Wragg, Stephen D.; Koga, Dennis (Technical Monitor)
2000-01-01
In this paper, we present DIAMS, a system of distributed, collaborative agents to help users access, manage, share and exchange information. A DIAMS personal agent helps its owner find information most relevant to current needs. It provides tools and utilities for users to manage their information repositories with dynamic organization and virtual views. Flexible hierarchical display is integrated with indexed query search-to support effective information access. Automatic indexing methods are employed to support user queries and communication between agents. Contents of a repository are kept in object-oriented storage to facilitate information sharing. Collaboration between users is aided by easy sharing utilities as well as automated information exchange. Matchmaker agents are designed to establish connections between users with similar interests and expertise. DIAMS agents provide needed services for users to share and learn information from one another on the World Wide Web.
Demonstration of quantum advantage in machine learning
NASA Astrophysics Data System (ADS)
Ristè, Diego; da Silva, Marcus P.; Ryan, Colm A.; Cross, Andrew W.; Córcoles, Antonio D.; Smolin, John A.; Gambetta, Jay M.; Chow, Jerry M.; Johnson, Blake R.
2017-04-01
The main promise of quantum computing is to efficiently solve certain problems that are prohibitively expensive for a classical computer. Most problems with a proven quantum advantage involve the repeated use of a black box, or oracle, whose structure encodes the solution. One measure of the algorithmic performance is the query complexity, i.e., the scaling of the number of oracle calls needed to find the solution with a given probability. Few-qubit demonstrations of quantum algorithms, such as Deutsch-Jozsa and Grover, have been implemented across diverse physical systems such as nuclear magnetic resonance, trapped ions, optical systems, and superconducting circuits. However, at the small scale, these problems can already be solved classically with a few oracle queries, limiting the obtained advantage. Here we solve an oracle-based problem, known as learning parity with noise, on a five-qubit superconducting processor. Executing classical and quantum algorithms using the same oracle, we observe a large gap in query count in favor of quantum processing. We find that this gap grows by orders of magnitude as a function of the error rates and the problem size. This result demonstrates that, while complex fault-tolerant architectures will be required for universal quantum computing, a significant quantum advantage already emerges in existing noisy systems.
Sayyah Ensan, Ladan; Faghankhani, Masoomeh; Javanbakht, Anna; Ahmadi, Seyed-Foad; Baradaran, Hamid Reza
2011-01-01
To compare PubMed Clinical Queries and UpToDate regarding the amount and speed of information retrieval and users' satisfaction. A cross-over randomized trial was conducted in February 2009 in Tehran University of Medical Sciences that included 44 year-one or two residents who participated in an information mastery workshop. A one-hour lecture on the principles of information mastery was organized followed by self learning slide shows before using each database. Subsequently, participants were randomly assigned to answer 2 clinical scenarios using either UpToDate or PubMed Clinical Queries then crossed to use the other database to answer 2 different clinical scenarios. The proportion of relevantly answered clinical scenarios, time to answer retrieval, and users' satisfaction were measured in each database. Based on intention-to-treat analysis, participants retrieved the answer of 67 (76%) questions using UpToDate and 38 (43%) questions using PubMed Clinical Queries (P<0.001). The median time to answer retrieval was 17 min (95% CI: 16 to 18) using UpToDate compared to 29 min (95% CI: 26 to 32) using PubMed Clinical Queries (P<0.001). The satisfaction with the accuracy of retrieved answers, interaction with UpToDate and also overall satisfaction were higher among UpToDate users compared to PubMed Clinical Queries users (P<0.001). For first time users, using UpToDate compared to Pubmed Clinical Queries can lead to not only a higher proportion of relevant answer retrieval within a shorter time, but also a higher users' satisfaction. So, addition of tutoring pre-appraised sources such as UpToDate to the information mastery curricula seems to be highly efficient.
NASA Astrophysics Data System (ADS)
Mallepudi, Sri Abhishikth; Calix, Ricardo A.; Knapp, Gerald M.
2011-02-01
In recent years there has been a rapid increase in the size of video and image databases. Effective searching and retrieving of images from these databases is a significant current research area. In particular, there is a growing interest in query capabilities based on semantic image features such as objects, locations, and materials, known as content-based image retrieval. This study investigated mechanisms for identifying materials present in an image. These capabilities provide additional information impacting conditional probabilities about images (e.g. objects made of steel are more likely to be buildings). These capabilities are useful in Building Information Modeling (BIM) and in automatic enrichment of images. I2T methodologies are a way to enrich an image by generating text descriptions based on image analysis. In this work, a learning model is trained to detect certain materials in images. To train the model, an image dataset was constructed containing single material images of bricks, cloth, grass, sand, stones, and wood. For generalization purposes, an additional set of 50 images containing multiple materials (some not used in training) was constructed. Two different supervised learning classification models were investigated: a single multi-class SVM classifier, and multiple binary SVM classifiers (one per material). Image features included Gabor filter parameters for texture, and color histogram data for RGB components. All classification accuracy scores using the SVM-based method were above 85%. The second model helped in gathering more information from the images since it assigned multiple classes to the images. A framework for the I2T methodology is presented.
Lyceum: A Multi-Protocol Digital Library Gateway
NASA Technical Reports Server (NTRS)
Maa, Ming-Hokng; Nelson, Michael L.; Esler, Sandra L.
1997-01-01
Lyceum is a prototype scalable query gateway that provides a logically central interface to multi-protocol and physically distributed, digital libraries of scientific and technical information. Lyceum processes queries to multiple syntactically distinct search engines used by various distributed information servers from a single logically central interface without modification of the remote search engines. A working prototype (http://www.larc.nasa.gov/lyceum/) demonstrates the capabilities, potentials, and advantages of this type of meta-search engine by providing access to over 50 servers covering over 20 disciplines.
Navigation as a New Form of Search for Agricultural Learning Resources in Semantic Repositories
NASA Astrophysics Data System (ADS)
Cano, Ramiro; Abián, Alberto; Mena, Elena
Education is essential when it comes to raise public awareness on the environmental and economic benefits of organic agriculture and agroecology (OA & AE). Organic.Edunet, an EU funded project, aims at providing a freely-available portal where learning contents on OA & AE can be published and accessed through specialized technologies. This paper describes a novel mechanism for providing semantic capabilities (such as semantic navigational queries) to an arbitrary set of agricultural learning resources, in the context of the Organic.Edunet initiative.
Ad-Hoc Queries over Document Collections - A Case Study
NASA Astrophysics Data System (ADS)
Löser, Alexander; Lutter, Steffen; Düssel, Patrick; Markl, Volker
We discuss the novel problem of supporting analytical business intelligence queries over web-based textual content, e.g., BI-style reports based on 100.000's of documents from an ad-hoc web search result. Neither conventional search engines nor conventional Business Intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. "Google Squared" or our system GOOLAP.info, are examples of these kinds of systems. They execute information extraction methods over one or several document collections at query time and integrate extracted records into a common view or tabular structure. Frequent extraction and object resolution failures cause incomplete records which could not be joined into a record answering the query. Our focus is the identification of join-reordering heuristics maximizing the size of complete records answering a structured query. With respect to given costs for document extraction we propose two novel join-operations: The multi-way CJ-operator joins records from multiple relationships extracted from a single document. The two-way join-operator DJ ensures data density by removing incomplete records from results. In a preliminary case study we observe that our join-reordering heuristics positively impact result size, record density and lower execution costs.
Automatic Depth Extraction from 2D Images Using a Cluster-Based Learning Framework.
Herrera, Jose L; Del-Blanco, Carlos R; Garcia, Narciso
2018-07-01
There has been a significant increase in the availability of 3D players and displays in the last years. Nonetheless, the amount of 3D content has not experimented an increment of such magnitude. To alleviate this problem, many algorithms for converting images and videos from 2D to 3D have been proposed. Here, we present an automatic learning-based 2D-3D image conversion approach, based on the key hypothesis that color images with similar structure likely present a similar depth structure. The presented algorithm estimates the depth of a color query image using the prior knowledge provided by a repository of color + depth images. The algorithm clusters this database attending to their structural similarity, and then creates a representative of each color-depth image cluster that will be used as prior depth map. The selection of the appropriate prior depth map corresponding to one given color query image is accomplished by comparing the structural similarity in the color domain between the query image and the database. The comparison is based on a K-Nearest Neighbor framework that uses a learning procedure to build an adaptive combination of image feature descriptors. The best correspondences determine the cluster, and in turn the associated prior depth map. Finally, this prior estimation is enhanced through a segmentation-guided filtering that obtains the final depth map estimation. This approach has been tested using two publicly available databases, and compared with several state-of-the-art algorithms in order to prove its efficiency.
Webb, Samuel J; Hanser, Thierry; Howlin, Brendan; Krause, Paul; Vessey, Jonathan D
2014-03-25
A new algorithm has been developed to enable the interpretation of black box models. The developed algorithm is agnostic to learning algorithm and open to all structural based descriptors such as fragments, keys and hashed fingerprints. The algorithm has provided meaningful interpretation of Ames mutagenicity predictions from both random forest and support vector machine models built on a variety of structural fingerprints.A fragmentation algorithm is utilised to investigate the model's behaviour on specific substructures present in the query. An output is formulated summarising causes of activation and deactivation. The algorithm is able to identify multiple causes of activation or deactivation in addition to identifying localised deactivations where the prediction for the query is active overall. No loss in performance is seen as there is no change in the prediction; the interpretation is produced directly on the model's behaviour for the specific query. Models have been built using multiple learning algorithms including support vector machine and random forest. The models were built on public Ames mutagenicity data and a variety of fingerprint descriptors were used. These models produced a good performance in both internal and external validation with accuracies around 82%. The models were used to evaluate the interpretation algorithm. Interpretation was revealed that links closely with understood mechanisms for Ames mutagenicity. This methodology allows for a greater utilisation of the predictions made by black box models and can expedite further study based on the output for a (quantitative) structure activity model. Additionally the algorithm could be utilised for chemical dataset investigation and knowledge extraction/human SAR development.
Comparing the performance of two CBIRS indexing schemes
NASA Astrophysics Data System (ADS)
Mueller, Wolfgang; Robbert, Guenter; Henrich, Andreas
2003-01-01
Content based image retrieval (CBIR) as it is known today has to deal with a number of challenges. Quickly summarized, the main challenges are firstly, to bridge the semantic gap between high-level concepts and low-level features using feedback, secondly to provide performance under adverse conditions. High-dimensional spaces, as well as a demanding machine learning task make the right way of indexing an important issue. When indexing multimedia data, most groups opt for extraction of high-dimensional feature vectors from the data, followed by dimensionality reduction like PCA (Principal Components Analysis) or LSI (Latent Semantic Indexing). The resulting vectors are indexed using spatial indexing structures such as kd-trees or R-trees, for example. Other projects, such as MARS and Viper propose the adaptation of text indexing techniques, notably the inverted file. Here, the Viper system is the most direct adaptation of text retrieval techniques to quantized vectors. However, while the Viper query engine provides decent performance together with impressive user-feedback behavior, as well as the possibility for easy integration of long-term learning algorithms, and support for potentially infinite feature vectors, there has been no comparison of vector-based methods and inverted-file-based methods under similar conditions. In this publication, we compare a CBIR query engine that uses inverted files (Bothrops, a rewrite of the Viper query engine based on a relational database), and a CBIR query engine based on LSD (Local Split Decision) trees for spatial indexing using the same feature sets. The Benchathlon initiative works on providing a set of images and ground truth for simulating image queries by example and corresponding user feedback. When performing the Benchathlon benchmark on a CBIR system (the System Under Test, SUT), a benchmarking harness connects over internet to the SUT, performing a number of queries using an agreed-upon protocol, the multimedia retrieval markup language (MRML). Using this benchmark one can measure the quality of retrieval, as well as the overall (speed) performance of the benchmarked system. Our Benchmarks will draw on the Benchathlon"s work for documenting the retrieval performance of both inverted file-based and LSD tree based techniques. However in addition to these results, we will present statistics, that can be obtained only inside the system under test. These statistics will include the number of complex mathematical operations, as well as the amount of data that has to be read from disk during operation of a query.
Automatically finding relevant citations for clinical guideline development.
Bui, Duy Duc An; Jonnalagadda, Siddhartha; Del Fiol, Guilherme
2015-10-01
Literature database search is a crucial step in the development of clinical practice guidelines and systematic reviews. In the age of information technology, the process of literature search is still conducted manually, therefore it is costly, slow and subject to human errors. In this research, we sought to improve the traditional search approach using innovative query expansion and citation ranking approaches. We developed a citation retrieval system composed of query expansion and citation ranking methods. The methods are unsupervised and easily integrated over the PubMed search engine. To validate the system, we developed a gold standard consisting of citations that were systematically searched and screened to support the development of cardiovascular clinical practice guidelines. The expansion and ranking methods were evaluated separately and compared with baseline approaches. Compared with the baseline PubMed expansion, the query expansion algorithm improved recall (80.2% vs. 51.5%) with small loss on precision (0.4% vs. 0.6%). The algorithm could find all citations used to support a larger number of guideline recommendations than the baseline approach (64.5% vs. 37.2%, p<0.001). In addition, the citation ranking approach performed better than PubMed's "most recent" ranking (average precision +6.5%, recall@k +21.1%, p<0.001), PubMed's rank by "relevance" (average precision +6.1%, recall@k +14.8%, p<0.001), and the machine learning classifier that identifies scientifically sound studies from MEDLINE citations (average precision +4.9%, recall@k +4.2%, p<0.001). Our unsupervised query expansion and ranking techniques are more flexible and effective than PubMed's default search engine behavior and the machine learning classifier. Automated citation finding is promising to augment the traditional literature search. Copyright © 2015 Elsevier Inc. All rights reserved.
EmptyHeaded: A Relational Engine for Graph Processing
Aberger, Christopher R.; Tu, Susan; Olukotun, Kunle; Ré, Christopher
2016-01-01
There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded’s design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP. PMID:28077912
Binary Multidimensional Scaling for Hashing.
Huang, Yameng; Lin, Zhouchen
2017-10-04
Hashing is a useful technique for fast nearest neighbor search due to its low storage cost and fast query speed. Unsupervised hashing aims at learning binary hash codes for the original features so that the pairwise distances can be best preserved. While several works have targeted on this task, the results are not satisfactory mainly due to the oversimplified model. In this paper, we propose a unified and concise unsupervised hashing framework, called Binary Multidimensional Scaling (BMDS), which is able to learn the hash code for distance preservation in both batch and online mode. In the batch mode, unlike most existing hashing methods, we do not need to simplify the model by predefining the form of hash map. Instead, we learn the binary codes directly based on the pairwise distances among the normalized original features by Alternating Minimization. This enables a stronger expressive power of the hash map. In the online mode, we consider the holistic distance relationship between current query example and those we have already learned, rather than only focusing on current data chunk. It is useful when the data come in a streaming fashion. Empirical results show that while being efficient for training, our algorithm outperforms state-of-the-art methods by a large margin in terms of distance preservation, which is practical for real-world applications.
A similarity learning approach to content-based image retrieval: application to digital mammography.
El-Naqa, Issam; Yang, Yongyi; Galatsanos, Nikolas P; Nishikawa, Robert M; Wernick, Miles N
2004-10-01
In this paper, we describe an approach to content-based retrieval of medical images from a database, and provide a preliminary demonstration of our approach as applied to retrieval of digital mammograms. Content-based image retrieval (CBIR) refers to the retrieval of images from a database using information derived from the images themselves, rather than solely from accompanying text indices. In the medical-imaging context, the ultimate aim of CBIR is to provide radiologists with a diagnostic aid in the form of a display of relevant past cases, along with proven pathology and other suitable information. CBIR may also be useful as a training tool for medical students and residents. The goal of information retrieval is to recall from a database information that is relevant to the user's query. The most challenging aspect of CBIR is the definition of relevance (similarity), which is used to guide the retrieval machine. In this paper, we pursue a new approach, in which similarity is learned from training examples provided by human observers. Specifically, we explore the use of neural networks and support vector machines to predict the user's notion of similarity. Within this framework we propose using a hierarchal learning approach, which consists of a cascade of a binary classifier and a regression module to optimize retrieval effectiveness and efficiency. We also explore how to incorporate online human interaction to achieve relevance feedback in this learning framework. Our experiments are based on a database consisting of 76 mammograms, all of which contain clustered microcalcifications (MCs). Our goal is to retrieve mammogram images containing similar MC clusters to that in a query. The performance of the retrieval system is evaluated using precision-recall curves computed using a cross-validation procedure. Our experimental results demonstrate that: 1) the learning framework can accurately predict the perceptual similarity reported by human observers, thereby serving as a basis for CBIR; 2) the learning-based framework can significantly outperform a simple distance-based similarity metric; 3) the use of the hierarchical two-stage network can improve retrieval performance; and 4) relevance feedback can be effectively incorporated into this learning framework to achieve improvement in retrieval precision based on online interaction with users; and 5) the retrieved images by the network can have predicting value for the disease condition of the query.
Remote file inquiry (RFI) system
NASA Technical Reports Server (NTRS)
1975-01-01
System interrogates and maintains user-definable data files from remote terminals, using English-like, free-form query language easily learned by persons not proficient in computer programming. System operates in asynchronous mode, allowing any number of inquiries within limitation of available core to be active concurrently.
An ontology-based method for secondary use of electronic dental record data.
Schleyer, Titus Kl; Ruttenberg, Alan; Duncan, William; Haendel, Melissa; Torniai, Carlo; Acharya, Amit; Song, Mei; Thyvalikakath, Thankam P; Liu, Kaihong; Hernandez, Pedro
2013-01-01
A key question for healthcare is how to operationalize the vision of the Learning Healthcare System, in which electronic health record data become a continuous information source for quality assurance and research. This project presents an initial, ontology-based, method for secondary use of electronic dental record (EDR) data. We defined a set of dental clinical research questions; constructed the Oral Health and Disease Ontology (OHD); analyzed data from a commercial EDR database; and created a knowledge base, with the OHD used to represent clinical data about 4,500 patients from a single dental practice. Currently, the OHD includes 213 classes and reuses 1,658 classes from other ontologies. We have developed an initial set of SPARQL queries to allow extraction of data about patients, teeth, surfaces, restorations and findings. Further work will establish a complete, open and reproducible workflow for extracting and aggregating data from a variety of EDRs for research and quality assurance.
Robust hashing with local models for approximate similarity search.
Song, Jingkuan; Yang, Yi; Li, Xuelong; Huang, Zi; Yang, Yang
2014-07-01
Similarity search plays an important role in many applications involving high-dimensional data. Due to the known dimensionality curse, the performance of most existing indexing structures degrades quickly as the feature dimensionality increases. Hashing methods, such as locality sensitive hashing (LSH) and its variants, have been widely used to achieve fast approximate similarity search by trading search quality for efficiency. However, most existing hashing methods make use of randomized algorithms to generate hash codes without considering the specific structural information in the data. In this paper, we propose a novel hashing method, namely, robust hashing with local models (RHLM), which learns a set of robust hash functions to map the high-dimensional data points into binary hash codes by effectively utilizing local structural information. In RHLM, for each individual data point in the training dataset, a local hashing model is learned and used to predict the hash codes of its neighboring data points. The local models from all the data points are globally aligned so that an optimal hash code can be assigned to each data point. After obtaining the hash codes of all the training data points, we design a robust method by employing l2,1 -norm minimization on the loss function to learn effective hash functions, which are then used to map each database point into its hash code. Given a query data point, the search process first maps it into the query hash code by the hash functions and then explores the buckets, which have similar hash codes to the query hash code. Extensive experimental results conducted on real-life datasets show that the proposed RHLM outperforms the state-of-the-art methods in terms of search quality and efficiency.
Virtual Solar Observatory Distributed Query Construction
NASA Technical Reports Server (NTRS)
Gurman, J. B.; Dimitoglou, G.; Bogart, R.; Davey, A.; Hill, F.; Martens, P.
2003-01-01
Through a prototype implementation (Tian et al., this meeting) the VSO has already demonstrated the capability of unifying geographically distributed data sources following the Web Services paradigm and utilizing mechanisms such as the Simple Object Access Protocol (SOAP). So far, four participating sites (Stanford, Montana State University, National Solar Observatory and the Solar Data Analysis Center) permit Web-accessible, time-based searches that allow browse access to a number of diverse data sets. Our latest work includes the extension of the simple, time-based queries to include numerous other searchable observation parameters. For VSO users, this extended functionality enables more refined searches. For the VSO, it is a proof of concept that more complex, distributed queries can be effectively constructed and that results from heterogeneous, remote sources can be synthesized and presented to users as a single, virtual data product.
Image Search Reranking With Hierarchical Topic Awareness.
Tian, Xinmei; Yang, Linjun; Lu, Yijuan; Tian, Qi; Tao, Dacheng
2015-10-01
With much attention from both academia and industrial communities, visual search reranking has recently been proposed to refine image search results obtained from text-based image search engines. Most of the traditional reranking methods cannot capture both relevance and diversity of the search results at the same time. Or they ignore the hierarchical topic structure of search result. Each topic is treated equally and independently. However, in real applications, images returned for certain queries are naturally in hierarchical organization, rather than simple parallel relation. In this paper, a new reranking method "topic-aware reranking (TARerank)" is proposed. TARerank describes the hierarchical topic structure of search results in one model, and seamlessly captures both relevance and diversity of the image search results simultaneously. Through a structured learning framework, relevance and diversity are modeled in TARerank by a set of carefully designed features, and then the model is learned from human-labeled training samples. The learned model is expected to predict reranking results with high relevance and diversity for testing queries. To verify the effectiveness of the proposed method, we collect an image search dataset and conduct comparison experiments on it. The experimental results demonstrate that the proposed TARerank outperforms the existing relevance-based and diversified reranking methods.
Association between Search Behaviors and Disease Prevalence Rates at 18 U.S. Children's Hospitals.
Daniel, Dennis; Wolbrink, Traci; Logvinenko, Tanya; Harper, Marvin; Burns, Jeffrey
2017-10-01
Background Usage of online resources by clinicians in training and practice can provide insight into knowledge gaps and inform development of decision support tools. Although online information seeking is often driven by encountered patient problems, the relationship between disease prevalence and search rate has not been previously characterized. Objective This article aimed to (1) identify topics frequently searched by pediatric clinicians using UpToDate (http://www.uptodate.com) and (2) explore the association between disease prevalence rate and search rate using data from the Pediatric Health Information System. Methods We identified the most common search queries and resources most frequently accessed on UpToDate for a cohort of 18 children's hospitals during calendar year 2012. We selected 64 of the most frequently searched diseases and matched ICD-9 data from the PHIS database during the same time period. Using linear regression, we explored the relationship between clinician query rate and disease prevalence rate. Results The hospital cohort submitted 1,228,138 search queries across 592,454 sessions. The majority of search sessions focused on a single search topic. We identified no consistent overall association between disease prevalence and search rates. Diseases where search rate was substantially higher than prevalence rate were often infectious or immune/rheumatologic conditions, involved potentially complex diagnosis or management, and carried risk of significant morbidity or mortality. None of the examined diseases showed a decrease in search rate associated with increased disease prevalence rates. Conclusion This is one of the first medical learning needs assessments to use large-scale, multisite data to identify topics of interest to pediatric clinicians, and to examine the relationship between disease prevalence and search rate for a set of pediatric diseases. Overall, disease search rate did not appear to be associated with hospital disease prevalence rates based on ICD-9 codes. However, some diseases were consistently searched at a higher rate than their prevalence rate; many of these diseases shared common features.
Breaking the Curse of Cardinality on Bitmap Indexes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Kesheng; Wu, Kesheng; Stockinger, Kurt
2008-04-04
Bitmap indexes are known to be efficient for ad-hoc range queries that are common in data warehousing and scientific applications. However, they suffer from the curse of cardinality, that is, their efficiency deteriorates as attribute cardinalities increase. A number of strategies have been proposed, but none of them addresses the problem adequately. In this paper, we propose a novel binned bitmap index that greatly reduces the cost to answer queries, and therefore breaks the curse of cardinality. The key idea is to augment the binned index with an Order-preserving Bin-based Clustering (OrBiC) structure. This data structure significantly reduces the I/Omore » operations needed to resolve records that cannot be resolved with the bitmaps. To further improve the proposed index structure, we also present a strategy to create single-valued bins for frequent values. This strategy reduces index sizes and improves query processing speed. Overall, the binned indexes with OrBiC great improves the query processing speed, and are 3 - 25 times faster than the best available indexes for high-cardinality data.« less
Demonstration of quantum superiority in learning parity with noise with superconducting qubits
NASA Astrophysics Data System (ADS)
Ristè, Diego; da Silva, Marcus; Ryan, Colm; Cross, Andrew; Smolin, John; Gambetta, Jay; Chow, Jerry; Johnson, Blake
A problem in machine learning is to identify the function programmed in an unknown device, or oracle, having only access to its output. In particular, a parity function computes the parity of a subset of a bit register. We implement an oracle executing parity functions in a five-qubit superconducting processor and compare the performance of a classical and a quantum learner. The classical learner reads the output of multiple oracle calls and uses the results to infer the hidden function. In addition to querying the oracle, the quantum learner can apply coherent rotations on the output register before the readout. We show that, given a target success probability, the quantum approach outperforms the classical one in the number of queries needed. Moreover, this gap increases with readout noise and with the size of the qubit register. This result shows that quantum advantage can already emerge in current systems with a few, noisy qubits. We acknowledge support from IARPA under Contract W911NF-10-1-0324.
Real-time community detection in full social networks on a laptop
Chamberlain, Benjamin Paul; Levy-Kramer, Josh; Humby, Clive
2018-01-01
For a broad range of research and practical applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As global social networks (e.g., Facebook and Twitter) are very large, most approaches make use of distributed computing systems for this purpose. Distributing graph processing requires solving many difficult engineering problems, which has lead some researchers to look at single-machine solutions that are faster and easier to maintain. In this article, we present an approach for analyzing full social networks on a standard laptop, allowing for interactive exploration of the communities in the locality of a set of user specified query vertices. The key idea is that the aggregate actions of large numbers of users can be compressed into a data structure that encapsulates the edge weights between vertices in a derived graph. Local communities can be constructed by selecting vertices that are connected to the query vertices with high edge weights in the derived graph. This compression is robust to noise and allows for interactive queries of local communities in real-time, which we define to be less than the average human reaction time of 0.25s. We achieve single-machine real-time performance by compressing the neighborhood of each vertex using minhash signatures and facilitate rapid queries through Locality Sensitive Hashing. These techniques reduce query times from hours using industrial desktop machines operating on the full graph to milliseconds on standard laptops. Our method allows exploration of strongly associated regions (i.e., communities) of large graphs in real-time on a laptop. It has been deployed in software that is actively used by social network analysts and offers another channel for media owners to monetize their data, helping them to continue to provide free services that are valued by billions of people globally. PMID:29342158
Zhu, Xinjie; Zhang, Qiang; Ho, Eric Dun; Yu, Ken Hung-On; Liu, Chris; Huang, Tim H; Cheng, Alfred Sze-Lok; Kao, Ben; Lo, Eric; Yip, Kevin Y
2017-09-22
A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions. Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/ ), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples. Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily.
Efficient privacy-preserving string search and an application in genomics.
Shimizu, Kana; Nuida, Koji; Rätsch, Gunnar
2016-06-01
Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private and the server the database. We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within [Formula: see text] 4.6 s and [Formula: see text] 10.8 s for client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm. https://github.com/iskana/PBWT-sec and https://github.com/ratschlab/PBWT-sec shimizu-kana@aist.go.jp or Gunnar.Ratsch@ratschlab.org Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Efficient privacy-preserving string search and an application in genomics
Shimizu, Kana; Nuida, Koji; Rätsch, Gunnar
2016-01-01
Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private and the server the database. Approach: We propose a novel approach that combines efficient string data structures such as the Burrows–Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows–Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within ≈ 4.6 s and ≈ 10.8 s for client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm. Availability and implementation: https://github.com/iskana/PBWT-sec and https://github.com/ratschlab/PBWT-sec. Contacts: shimizu-kana@aist.go.jp or Gunnar.Ratsch@ratschlab.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27153731
Sayyah Ensan, Ladan; Faghankhani, Masoomeh; Javanbakht, Anna; Ahmadi, Seyed-Foad; Baradaran, Hamid Reza
2011-01-01
Purpose To compare PubMed Clinical Queries and UpToDate regarding the amount and speed of information retrieval and users' satisfaction. Method A cross-over randomized trial was conducted in February 2009 in Tehran University of Medical Sciences that included 44 year-one or two residents who participated in an information mastery workshop. A one-hour lecture on the principles of information mastery was organized followed by self learning slide shows before using each database. Subsequently, participants were randomly assigned to answer 2 clinical scenarios using either UpToDate or PubMed Clinical Queries then crossed to use the other database to answer 2 different clinical scenarios. The proportion of relevantly answered clinical scenarios, time to answer retrieval, and users' satisfaction were measured in each database. Results Based on intention-to-treat analysis, participants retrieved the answer of 67 (76%) questions using UpToDate and 38 (43%) questions using PubMed Clinical Queries (P<0.001). The median time to answer retrieval was 17 min (95% CI: 16 to 18) using UpToDate compared to 29 min (95% CI: 26 to 32) using PubMed Clinical Queries (P<0.001). The satisfaction with the accuracy of retrieved answers, interaction with UpToDate and also overall satisfaction were higher among UpToDate users compared to PubMed Clinical Queries users (P<0.001). Conclusions For first time users, using UpToDate compared to Pubmed Clinical Querries can lead to not only a higher proportion of relevant answer retrieval within a shorter time, but also a higher users' satisfaction. So, addition of tutoring pre-appraised sources such as UpToDate to the information mastery curricula seems to be highly efficient. PMID:21858142
Federated Space-Time Query for Earth Science Data Using OpenSearch Conventions
NASA Astrophysics Data System (ADS)
Lynnes, C.; Beaumont, B.; Duerr, R. E.; Hua, H.
2009-12-01
The past decade has seen a burgeoning of remote sensing and Earth science data providers, as evidenced in the growth of the Earth Science Information Partner (ESIP) federation. At the same time, the need to combine diverse data sets to enable understanding of the Earth as a system has also grown. While the expansion of data providers is in general a boon to such studies, the diversity presents a challenge to finding useful data for a given study. Locating all the data files with aerosol information for a particular volcanic eruption, for example, may involve learning and using several different search tools to execute the requisite space-time queries. To address this issue, the ESIP federation is developing a federated space-time query framework, based on the OpenSearch convention (www.opensearch.org), with Geo and Time extensions. In this framework, data providers publish OpenSearch Description Documents that describe in a machine-readable form how to execute queries against the provider. The novelty of OpenSearch is that the space-time query interface becomes both machine callable and easy enough to integrate into the web browser's search box. This flexibility, together with a simple REST (HTTP-get) interface, should allow a variety of data providers to participate in the federated search framework, from large institutional data centers to individual scientists. The simple interface enables trivial querying of multiple data sources and participation in recursive-like federated searches--all using the same common OpenSearch interface. This simplicity also makes the construction of clients easy, as does existing OpenSearch client libraries in a variety of languages. Moreover, a number of clients and aggregation services already exist and OpenSearch is already supported by a number of web browsers such as Firefox and Internet Explorer.
Dictionary-driven protein annotation.
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-09-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Web-based resources for critical care education.
Kleinpell, Ruth; Ely, E Wesley; Williams, Ged; Liolios, Antonios; Ward, Nicholas; Tisherman, Samuel A
2011-03-01
To identify, catalog, and critically evaluate Web-based resources for critical care education. A multilevel search strategy was utilized. Literature searches were conducted (from 1996 to September 30, 2010) using OVID-MEDLINE, PubMed, and the Cumulative Index to Nursing and Allied Health Literature with the terms "Web-based learning," "computer-assisted instruction," "e-learning," "critical care," "tutorials," "continuing education," "virtual learning," and "Web-based education." The Web sites of relevant critical care organizations (American College of Chest Physicians, American Society of Anesthesiologists, American Thoracic Society, European Society of Intensive Care Medicine, Society of Critical Care Medicine, World Federation of Societies of Intensive and Critical Care Medicine, American Association of Critical Care Nurses, and World Federation of Critical Care Nurses) were reviewed for the availability of e-learning resources. Finally, Internet searches and e-mail queries to critical care medicine fellowship program directors and members of national and international acute/critical care listserves were conducted to 1) identify the use of and 2) review and critique Web-based resources for critical care education. To ensure credibility of Web site information, Web sites were reviewed by three independent reviewers on the basis of the criteria of authority, objectivity, authenticity, accuracy, timeliness, relevance, and efficiency in conjunction with suggested formats for evaluating Web sites in the medical literature. Literature searches using OVID-MEDLINE, PubMed, and the Cumulative Index to Nursing and Allied Health Literature resulted in >250 citations. Those pertinent to critical care provide examples of the integration of e-learning techniques, the development of specific resources, reports of the use of types of e-learning, including interactive tutorials, case studies, and simulation, and reports of student or learner satisfaction, among other general reviews of the benefits of utilizing e-learning. Review of the Web sites of relevant critical care organizations revealed the existence of a number of e-learning resources, including online critical care courses, tutorials, podcasts, webcasts, slide sets, and continuing medical education resources, some requiring membership or a fee to access. Respondents to listserve queries (>100) and critical care medicine fellowship director and advanced practice nursing educator e-mail queries (>50) identified the use of a number of tutorials, self-directed learning modules, and video-enhanced programs for critical care education and practice. In all, >135 Web-based education resources exist, including video Web resources for critical care education in a variety of e-learning formats, such as tutorials, self-directed learning modules, interactive case studies, webcasts, podcasts, and video-enhanced programs. As identified by critical care educators and practitioners, e-learning is actively being integrated into critical care medicine and nursing training programs for continuing medical education and competency training purposes. Knowledge of available Web-based educational resources may enhance critical care practitioners' ongoing learning and clinical competence, although this has not been objectively measured to date.
SCEC UCVM - Unified California Velocity Model
NASA Astrophysics Data System (ADS)
Small, P.; Maechling, P. J.; Jordan, T. H.; Ely, G. P.; Taborda, R.
2011-12-01
The SCEC Unified California Velocity Model (UCVM) is a software framework for a state-wide California velocity model. UCVM provides researchers with two new capabilities: (1) the ability to query Vp, Vs, and density from any standard regional California velocity model through a uniform interface, and (2) the ability to combine multiple velocity models into a single state-wide model. These features are crucial in order to support large-scale ground motion simulations and to facilitate improvements in the underlying velocity models. UCVM provides integrated support for the following standard velocity models: SCEC CVM-H, SCEC CVM-S and the CVM-SI variant, USGS Bay Area (cencalvm), Lin-Thurber Statewide, and other smaller regional models. New models may be easily incorporated as they become available. Two query interfaces are provided: a Linux command line program, and a C application programming interface (API). The C API query interface is simple, fully independent of any specific model, and MPI-friendly. Input coordinates are geographic longitude/latitude and the vertical coordinate may be either depth or elevation. Output parameters include Vp, Vs, and density along with the identity of the model from which these material properties were obtained. In addition to access to the standard models, UCVM also includes a high resolution statewide digital elevation model, Vs30 map, and an optional near-surface geo-technical layer (GTL) based on Ely's Vs30-derived GTL. The elevation and Vs30 information is bundled along with the returned Vp,Vs velocities and density, so that all relevant information is retrieved with a single query. When the GTL is enabled, it is blended with the underlying crustal velocity models along a configurable transition depth range with an interpolation function. Multiple, possibly overlapping, regional velocity models may be combined together into a single state-wide model. This is accomplished by tiling the regional models on top of one another in three dimensions in a researcher-specified order. No reconciliation is performed within overlapping model regions, although a post-processing tool is provided to perform a simple numerical smoothing. Lastly, a 3D region from a combined model may be extracted and exported into a CVM-Etree. This etree may then be queried by UCVM much like a standard velocity model but with less overhead and generally better performance due to the efficiency of the etree data structure.
NASA Technical Reports Server (NTRS)
Ryan, Patrick M.
1994-01-01
The EOSDIS Version 0 system, released in July, 1994, is a working prototype of a distributed data system. One of the purposes of the V0 project is to take several existing data systems and coordinate them into one system while maintaining the independent nature of the original systems. The project is a learning experience and the lessons are being passed on to the architects of the system which will distribute the data received from the planned EOS satellites. In the V0 system, the data resides on heterogeneous systems across the globe but users are presented with a single, integrated interface. This interface allows users to query the participating data centers based on a wide set of criteria. Because this system is a prototype, we used many novel approaches in trying to connect a diverse group of users with the huge amount of available data. Some of these methods worked and others did not. Now that V0 has been released to the public, we can look back at the design and implementation of the system and also consider some possible future directions for the next generation of EOSDIS.
Improving Protein Fold Recognition by Deep Learning Networks.
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-04
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl's benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed the comparable results at the level of family and superfamily, compared to ensemble DN-Fold. Finally, we extended the binary classification problem of fold recognition to real-value regression task, which also show a promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
Quantum Inference on Bayesian Networks
NASA Astrophysics Data System (ADS)
Yoder, Theodore; Low, Guang Hao; Chuang, Isaac
2014-03-01
Because quantum physics is naturally probabilistic, it seems reasonable to expect physical systems to describe probabilities and their evolution in a natural fashion. Here, we use quantum computation to speedup sampling from a graphical probability model, the Bayesian network. A specialization of this sampling problem is approximate Bayesian inference, where the distribution on query variables is sampled given the values e of evidence variables. Inference is a key part of modern machine learning and artificial intelligence tasks, but is known to be NP-hard. Classically, a single unbiased sample is obtained from a Bayesian network on n variables with at most m parents per node in time (nmP(e) - 1 / 2) , depending critically on P(e) , the probability the evidence might occur in the first place. However, by implementing a quantum version of rejection sampling, we obtain a square-root speedup, taking (n2m P(e) -1/2) time per sample. The speedup is the result of amplitude amplification, which is proving to be broadly applicable in sampling and machine learning tasks. In particular, we provide an explicit and efficient circuit construction that implements the algorithm without the need for oracle access.
Active Learning Using Hint Information.
Li, Chun-Liang; Ferng, Chun-Sung; Lin, Hsuan-Tien
2015-08-01
The abundance of real-world data and limited labeling budget calls for active learning, an important learning paradigm for reducing human labeling efforts. Many recently developed active learning algorithms consider both uncertainty and representativeness when making querying decisions. However, exploiting representativeness with uncertainty concurrently usually requires tackling sophisticated and challenging learning tasks, such as clustering. In this letter, we propose a new active learning framework, called hinted sampling, which takes both uncertainty and representativeness into account in a simpler way. We design a novel active learning algorithm within the hinted sampling framework with an extended support vector machine. Experimental results validate that the novel active learning algorithm can result in a better and more stable performance than that achieved by state-of-the-art algorithms. We also show that the hinted sampling framework allows improving another active learning algorithm designed from the transductive support vector machine.
BioFed: federated query processing over life sciences linked open data.
Hasnain, Ali; Mehmood, Qaiser; Sana E Zainab, Syeda; Saleem, Muhammad; Warren, Claude; Zehra, Durre; Decker, Stefan; Rebholz-Schuhmann, Dietrich
2017-03-15
Biomedical data, e.g. from knowledge bases and ontologies, is increasingly made available following open linked data principles, at best as RDF triple data. This is a necessary step towards unified access to biological data sets, but this still requires solutions to query multiple endpoints for their heterogeneous data to eventually retrieve all the meaningful information. Suggested solutions are based on query federation approaches, which require the submission of SPARQL queries to endpoints. Due to the size and complexity of available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. Last but not least, over time, the reliability of data resources in terms of access and quality have to be monitored. Our solution (BioFed) federates data over 130 SPARQL endpoints in life sciences and tailors query submission according to the provenance information. BioFed has been evaluated against the state of the art solution FedX and forms an important benchmark for the life science domain. The efficient cataloguing approach of the federated query processing system 'BioFed', the triple pattern wise source selection and the semantic source normalisation forms the core to our solution. It gathers and integrates data from newly identified public endpoints for federated access. Basic provenance information is linked to the retrieved data. Last but not least, BioFed makes use of the latest SPARQL standard (i.e., 1.1) to leverage the full benefits for query federation. The evaluation is based on 10 simple and 10 complex queries, which address data in 10 major and very popular data sources (e.g., Dugbank, Sider). BioFed is a solution for a single-point-of-access for a large number of SPARQL endpoints providing life science data. It facilitates efficient query generation for data access and provides basic provenance information in combination with the retrieved data. BioFed fully supports SPARQL 1.1 and gives access to the endpoint's availability based on the EndpointData graph. Our evaluation of BioFed against FedX is based on 20 heterogeneous federated SPARQL queries and shows competitive execution performance in comparison to FedX, which can be attributed to the provision of provenance information for the source selection. Developing and testing federated query engines for life sciences data is still a challenging task. According to our findings, it is advantageous to optimise the source selection. The cataloguing of SPARQL endpoints, including type and property indexing, leads to efficient querying of data resources over the Web of Data. This could even be further improved through the use of ontologies, e.g., for abstract normalisation of query terms.
An Intelligent System for Document Retrieval in Distributed Office Environments.
ERIC Educational Resources Information Center
Mukhopadhyay, Uttam; And Others
1986-01-01
MINDS (Multiple Intelligent Node Document Servers) is a distributed system of knowledge-based query engines for efficiently retrieving multimedia documents in an office environment of distributed workstations. By learning document distribution patterns and user interests and preferences during system usage, it customizes document retrievals for…
Large scale study of multiple-molecule queries
2009-01-01
Background In ligand-based screening, as well as in other chemoinformatics applications, one seeks to effectively search large repositories of molecules in order to retrieve molecules that are similar typically to a single molecule lead. However, in some case, multiple molecules from the same family are available to seed the query and search for other members of the same family. Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, the previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large publicly available data sets and background. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking for direct comparison in future studies across several performance metrics. Results Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and six of them fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics. Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule to a family by the maximum similarity, or minimum ranking, obtained across the family. One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data. Conclusion Fourteen methods for multiple-molecule querying of chemical databases, including novel methods, (ETD) and (TPD), are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/. PMID:20298525
Toward accelerating landslide mapping with interactive machine learning techniques
NASA Astrophysics Data System (ADS)
Stumpf, André; Lachiche, Nicolas; Malet, Jean-Philippe; Kerle, Norman; Puissant, Anne
2013-04-01
Despite important advances in the development of more automated methods for landslide mapping from optical remote sensing images, the elaboration of inventory maps after major triggering events still remains a tedious task. Image classification with expert defined rules typically still requires significant manual labour for the elaboration and adaption of rule sets for each particular case. Machine learning algorithm, on the contrary, have the ability to learn and identify complex image patterns from labelled examples but may require relatively large amounts of training data. In order to reduce the amount of required training data active learning has evolved as key concept to guide the sampling for applications such as document classification, genetics and remote sensing. The general underlying idea of most active learning approaches is to initialize a machine learning model with a small training set, and to subsequently exploit the model state and/or the data structure to iteratively select the most valuable samples that should be labelled by the user and added in the training set. With relatively few queries and labelled samples, an active learning strategy should ideally yield at least the same accuracy than an equivalent classifier trained with many randomly selected samples. Our study was dedicated to the development of an active learning approach for landslide mapping from VHR remote sensing images with special consideration of the spatial distribution of the samples. The developed approach is a region-based query heuristic that enables to guide the user attention towards few compact spatial batches rather than distributed points resulting in time savings of 50% and more compared to standard active learning techniques. The approach was tested with multi-temporal and multi-sensor satellite images capturing recent large scale triggering events in Brazil and China and demonstrated balanced user's and producer's accuracies between 74% and 80%. The assessment also included an experimental evaluation of the uncertainties of manual mappings from multiple experts and demonstrated strong relationships between the uncertainty of the experts and the machine learning model.
Loss-tolerant measurement-device-independent quantum private queries
NASA Astrophysics Data System (ADS)
Zhao, Liang-Yuan; Yin, Zhen-Qiang; Chen, Wei; Qian, Yong-Jun; Zhang, Chun-Mei; Guo, Guang-Can; Han, Zheng-Fu
2017-01-01
Quantum private queries (QPQ) is an important cryptography protocol aiming to protect both the user’s and database’s privacy when the database is queried privately. Recently, a variety of practical QPQ protocols based on quantum key distribution (QKD) have been proposed. However, for QKD-based QPQ the user’s imperfect detectors can be subjected to some detector- side-channel attacks launched by the dishonest owner of the database. Here, we present a simple example that shows how the detector-blinding attack can damage the security of QKD-based QPQ completely. To remove all the known and unknown detector side channels, we propose a solution of measurement-device-independent QPQ (MDI-QPQ) with single- photon sources. The security of the proposed protocol has been analyzed under some typical attacks. Moreover, we prove that its security is completely loss independent. The results show that practical QPQ will remain the same degree of privacy as before even with seriously uncharacterized detectors.
Loss-tolerant measurement-device-independent quantum private queries.
Zhao, Liang-Yuan; Yin, Zhen-Qiang; Chen, Wei; Qian, Yong-Jun; Zhang, Chun-Mei; Guo, Guang-Can; Han, Zheng-Fu
2017-01-04
Quantum private queries (QPQ) is an important cryptography protocol aiming to protect both the user's and database's privacy when the database is queried privately. Recently, a variety of practical QPQ protocols based on quantum key distribution (QKD) have been proposed. However, for QKD-based QPQ the user's imperfect detectors can be subjected to some detector- side-channel attacks launched by the dishonest owner of the database. Here, we present a simple example that shows how the detector-blinding attack can damage the security of QKD-based QPQ completely. To remove all the known and unknown detector side channels, we propose a solution of measurement-device-independent QPQ (MDI-QPQ) with single- photon sources. The security of the proposed protocol has been analyzed under some typical attacks. Moreover, we prove that its security is completely loss independent. The results show that practical QPQ will remain the same degree of privacy as before even with seriously uncharacterized detectors.
NASA Astrophysics Data System (ADS)
Ho, Chris M. W.; Marshall, Garland R.
1993-12-01
SPLICE is a program that processes partial query solutions retrieved from 3D, structural databases to generate novel, aggregate ligands. It is designed to interface with the database searching program FOUNDATION, which retrieves fragments containing any combination of a user-specified minimum number of matching query elements. SPLICE eliminates aspects of structures that are physically incapable of binding within the active site. Then, a systematic rule-based procedure is performed upon the remaining fragments to ensure receptor complementarity. All modifications are automated and remain transparent to the user. Ligands are then assembled by linking components into composite structures through overlapping bonds. As a control experiment, FOUNDATION and SPLICE were used to reconstruct a know HIV-1 protease inhibitor after it had been fragmented, reoriented, and added to a sham database of fifty different small molecules. To illustrate the capabilities of this program, a 3D search query containing the pharmacophoric elements of an aspartic proteinase-inhibitor crystal complex was searched using FOUNDATION against a subset of the Cambridge Structural Database. One hundred thirty-one compounds were retrieved, each containing any combination of at least four query elements. Compounds were automatically screened and edited for receptor complementarity. Numerous combinations of fragments were discovered that could be linked to form novel structures, containing a greater number of pharmacophoric elements than any single retrieved fragment.
An Application Programming Interface for Synthetic Snowflake Particle Structure and Scattering Data
NASA Technical Reports Server (NTRS)
Lammers, Matthew; Kuo, Kwo-Sen
2017-01-01
The work by Kuo and colleagues on growing synthetic snowflakes and calculating their single-scattering properties has demonstrated great potential to improve the retrievals of snowfall. To grant colleagues flexible and targeted access to their large collection of sizes and shapes at fifteen (15) microwave frequencies, we have developed a web-based Application Programming Interface (API) integrated with NASA Goddard's Precipitation Processing System (PPS) Group. It is our hope that the API will enable convenient programmatic utilization of the database. To help users better understand the API's capabilities, we have developed an interactive web interface called the OpenSSP API Query Builder, which implements an intuitive system of mechanisms for selecting shapes, sizes, and frequencies to generate queries, with which the API can then extract and return data from the database. The Query Builder also allows for the specification of normalized particle size distributions by setting pertinent parameters, with which the API can also return mean geometric and scattering properties for each size bin. Additionally, the Query Builder interface enables downloading of raw scattering and particle structure data packages. This presentation will describe some of the challenges and successes associated with developing such an API. Examples of its usage will be shown both through downloading output and pulling it into a spreadsheet, as well as querying the API programmatically and working with the output in code.
Using J-Query Mobile Technology to Support a Pedagogical Proficiency Course
ERIC Educational Resources Information Center
Kert, Serhat Bahadir
2013-01-01
Technology-enriched educational environments supported by different technological tools and applications are today's important research areas in the educational literature. During the educational process, different types of technologies are used in order to enhance the learning capabilities of students. Given the popularity of mobile phones, it…
Multi-documents summarization based on clustering of learning object using hierarchical clustering
NASA Astrophysics Data System (ADS)
Mustamiin, M.; Budi, I.; Santoso, H. B.
2018-03-01
The Open Educational Resources (OER) is a portal of teaching, learning and research resources that is available in public domain and freely accessible. Learning contents or Learning Objects (LO) are granular and can be reused for constructing new learning materials. LO ontology-based searching techniques can be used to search for LO in the Indonesia OER. In this research, LO from search results are used as an ingredient to create new learning materials according to the topic searched by users. Summarizing-based grouping of LO use Hierarchical Agglomerative Clustering (HAC) with the dependency context to the user’s query which has an average value F-Measure of 0.487, while summarizing by K-Means F-Measure only has an average value of 0.336.
Zhang, Ai-bing; Feng, Jie; Ward, Robert D; Wan, Ping; Gao, Qiang; Wu, Jun; Zhao, Wei-zhong
2012-01-01
Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.
Characterizing the Discussion of Antibiotics in the Twittersphere: What is the Bigger Picture?
Kendra, Rachel Lynn; Karki, Suman; Eickholt, Jesse Lee; Gandy, Lisa
2015-06-19
User content posted through Twitter has been used for biosurveillance, to characterize public perception of health-related topics, and as a means of distributing information to the general public. Most of the existing work surrounding Twitter and health care has shown Twitter to be an effective medium for these problems but more could be done to provide finer and more efficient access to all pertinent data. Given the diversity of user-generated content, small samples or summary presentations of the data arguably omit a large part of the virtual discussion taking place in the Twittersphere. Still, managing, processing, and querying large amounts of Twitter data is not a trivial task. This work describes tools and techniques capable of handling larger sets of Twitter data and demonstrates their use with the issue of antibiotics. This work has two principle objectives: (1) to provide an open-source means to efficiently explore all collected tweets and query health-related topics on Twitter, specifically, questions such as what users are saying and how messages are spread, and (2) to characterize the larger discourse taking place on Twitter with respect to antibiotics. Open-source software suites Hadoop, Flume, and Hive were used to collect and query a large number of Twitter posts. To classify tweets by topic, a deep network classifier was trained using a limited number of manually classified tweets. The particular machine learning approach used also allowed the use of a large number of unclassified tweets to increase performance. Query-based analysis of the collected tweets revealed that a large number of users contributed to the online discussion and that a frequent topic mentioned was resistance. A number of prominent events related to antibiotics led to a number of spikes in activity but these were short in duration. The category-based classifier developed was able to correctly classify 70% of manually labeled tweets (using a 10-fold cross validation procedure and 9 classes). The classifier also performed well when evaluated on a per category basis. Using existing tools such as Hive, Flume, Hadoop, and machine learning techniques, it is possible to construct tools and workflows to collect and query large amounts of Twitter data to characterize the larger discussion taking place on Twitter with respect to a particular health-related topic. Furthermore, using newer machine learning techniques and a limited number of manually labeled tweets, an entire body of collected tweets can be classified to indicate what topics are driving the virtual, online discussion. The resulting classifier can also be used to efficiently explore collected tweets by category and search for messages of interest or exemplary content.
YAHA: fast and flexible long-read alignment with optimal breakpoint detection.
Faust, Gregory G; Hall, Ira M
2012-10-01
With improved short-read assembly algorithms and the recent development of long-read sequencers, split mapping will soon be the preferred method for structural variant (SV) detection. Yet, current alignment tools are not well suited for this. We present YAHA, a fast and flexible hash-based aligner. YAHA is as fast and accurate as BWA-SW at finding the single best alignment per query and is dramatically faster and more sensitive than both SSAHA2 and MegaBLAST at finding all possible alignments. Unlike other aligners that report all, or one, alignment per query, or that use simple heuristics to select alignments, YAHA uses a directed acyclic graph to find the optimal set of alignments that cover a query using a biologically relevant breakpoint penalty. YAHA can also report multiple mappings per defined segment of the query. We show that YAHA detects more breakpoints in less time than BWA-SW across all SV classes, and especially excels at complex SVs comprising multiple breakpoints. YAHA is currently supported on 64-bit Linux systems. Binaries and sample data are freely available for download from http://faculty.virginia.edu/irahall/YAHA. imh4y@virginia.edu.
Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search
Muhammad, Khan; Baik, Sung Wook
2017-01-01
In recent years, image databases are growing at exponential rates, making their management, indexing, and retrieval, very challenging. Typical image retrieval systems rely on sample images as queries. However, in the absence of sample query images, hand-drawn sketches are also used. The recent adoption of touch screen input devices makes it very convenient to quickly draw shaded sketches of objects to be used for querying image databases. This paper presents a mechanism to provide access to visual information based on users’ hand-drawn partially colored sketches using touch screen devices. A key challenge for sketch-based image retrieval systems is to cope with the inherent ambiguity in sketches due to the lack of colors, textures, shading, and drawing imperfections. To cope with these issues, we propose to fine-tune a deep convolutional neural network (CNN) using augmented dataset to extract features from partially colored hand-drawn sketches for query specification in a sketch-based image retrieval framework. The large augmented dataset contains natural images, edge maps, hand-drawn sketches, de-colorized, and de-texturized images which allow CNN to effectively model visual contents presented to it in a variety of forms. The deep features extracted from CNN allow retrieval of images using both sketches and full color images as queries. We also evaluated the role of partial coloring or shading in sketches to improve the retrieval performance. The proposed method is tested on two large datasets for sketch recognition and sketch-based image retrieval and achieved better classification and retrieval performance than many existing methods. PMID:28859140
Dictionary-driven protein annotation
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-01-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12202776
Solving the problem of Trans-Genomic Query with alignment tables.
Parker, Douglass Stott; Hsiao, Ruey-Lung; Xing, Yi; Resch, Alissa M; Lee, Christopher J
2008-01-01
The trans-genomic query (TGQ) problem--enabling the free query of biological information, even across genomes--is a central challenge facing bioinformatics. Solutions to this problem can alter the nature of the field, moving it beyond the jungle of data integration and expanding the number and scope of questions that can be answered. An alignment table is a binary relationship on locations (sequence segments). An important special case of alignment tables are hit tables ? tables of pairs of highly similar segments produced by alignment tools like BLAST. However, alignment tables also include general binary relationships, and can represent any useful connection between sequence locations. They can be curated, and provide a high-quality queryable backbone of connections between biological information. Alignment tables thus can be a natural foundation for TGQ, as they permit a central part of the TGQ problem to be reduced to purely technical problems involving tables of locations.Key challenges in implementing alignment tables include efficient representation and indexing of sequence locations. We define a location datatype that can be incorporated naturally into common off-the-shelf database systems. We also describe an implementation of alignment tables in BLASTGRES, an extension of the open-source POSTGRESQL database system that provides indexing and operators on locations required for querying alignment tables. This paper also reviews several successful large-scale applications of alignment tables for Trans-Genomic Query. Tables with millions of alignments have been used in queries about alternative splicing, an area of genomic analysis concerning the way in which a single gene can yield multiple transcripts. Comparative genomics is a large potential application area for TGQ and alignment tables.
ERIC Educational Resources Information Center
Reese, Ruth Norrine Ryida Re
2013-01-01
There is a distressing dilemma in higher education concerning the increase in student attrition rates (Angelino, Williams, & Natvig, 2007; Hagedorn, 2006; Kramer, 2007; Lau, 2003, Seidman, 2005; Tinto, 2006). The seminal query of this study asks what higher education institutions are doing to effectively retain these students and secure…
Alterity: Learning Polyvalent Selves, Resisting Disabling Notions of the Self
ERIC Educational Resources Information Center
Walker, Wayland
2011-01-01
This article queries how one type of human difference--alterity, the experience of multiple distinct consciousnesses, or "alters," by one person--is pathologized in American culture. This experience is inscribed as a mental illness, labeled now as dissociative identity disorder (DID) and formerly known as multiple personality disorder (MPD). In…
New Perspectives: TA Preparation for Critical Literacy in First Year Composition.
ERIC Educational Resources Information Center
Duffelmeyer, Barb Blakely
2002-01-01
Notes that new teaching assistants (TAs) and first year composition students similarly grapple with ambiguity, multiplicity, and open-endedness. Contends that new TAs' queries and early classroom experiences can provide a valuable occasion to re-balance the emphasis in a pro-seminar between teaching and learning. Presents strategies for addressing…
Lessons for Australia? Learning from England's Curriculum "Black Box"
ERIC Educational Resources Information Center
Thomson, Pat
2008-01-01
While all comparisons are difficult, a consideration of the English national curriculum may offer some insights about the Australian version. In this paper, I suggest that there are some important similarities between the two approaches. I query the stated purposes of the Australian national curriculum and its adoption of a notion of equity which…
Student Query Trend Assessment with Semantical Annotation and Artificial Intelligent Multi-Agents
ERIC Educational Resources Information Center
Malik, Kaleem Razzaq; Mir, Rizwan Riaz; Farhan, Muhammad; Rafiq, Tariq; Aslam, Muhammad
2017-01-01
Research in era of data representation to contribute and improve key data policy involving the assessment of learning, training and English language competency. Students are required to communicate in English with high level impact using language and influence. The electronic technology works to assess students' questions positively enabling…
Evolution of Protein Lipograms: A Bioinformatics Problem
ERIC Educational Resources Information Center
White, Harold B., III; Dhurjati, Prasad
2006-01-01
A protein lacking one of the 20 common amino acids is a protein lipogram. This open-ended problem-based learning assignment deals with the evolution of proteins with biased amino acid composition. It has students query protein and metabolic databases to test the hypothesis that natural selection has reduced the frequency of each amino acid…
An Efficient Quantum Somewhat Homomorphic Symmetric Searchable Encryption
NASA Astrophysics Data System (ADS)
Sun, Xiaoqiang; Wang, Ting; Sun, Zhiwei; Wang, Ping; Yu, Jianping; Xie, Weixin
2017-04-01
In 2009, Gentry first introduced an ideal lattices fully homomorphic encryption (FHE) scheme. Later, based on the approximate greatest common divisor problem, learning with errors problem or learning with errors over rings problem, FHE has developed rapidly, along with the low efficiency and computational security. Combined with quantum mechanics, Liang proposed a symmetric quantum somewhat homomorphic encryption (QSHE) scheme based on quantum one-time pad, which is unconditional security. And it was converted to a quantum fully homomorphic encryption scheme, whose evaluation algorithm is based on the secret key. Compared with Liang's QSHE scheme, we propose a more efficient QSHE scheme for classical input states with perfect security, which is used to encrypt the classical message, and the secret key is not required in the evaluation algorithm. Furthermore, an efficient symmetric searchable encryption (SSE) scheme is constructed based on our QSHE scheme. SSE is important in the cloud storage, which allows users to offload search queries to the untrusted cloud. Then the cloud is responsible for returning encrypted files that match search queries (also encrypted), which protects users' privacy.
Yom-Tov, Elad; Fernandez-Luque, Luis
2014-01-01
Vaccination campaigns are one of the most important and successful public health programs ever undertaken. People who want to learn about vaccines in order to make an informed decision on whether to vaccinate are faced with a wealth of information on the Internet, both for and against vaccinations. In this paper we develop an automated way to score Internet search queries and web pages as to the likelihood that a person making these queries or reading those pages would decide to vaccinate. We apply this method to data from a major Internet search engine, while people seek information about the Measles, Mumps and Rubella (MMR) vaccine. We show that our method is accurate, and use it to learn about the information acquisition process of people. Our results show that people who are pro-vaccination as well as people who are anti-vaccination seek similar information, but browsing this information has differing effect on their future browsing. These findings demonstrate the need for health authorities to tailor their information according to the current stance of users.
Blair, P G; Templeton, E; Sachdeva, A K
1996-01-01
An adult education model was developed to familiarize third-year medical students with the Physician Data Query (PDQ) system, a computerized, full-text database of state-of-the-art cancer information developed by the National Cancer Institute. The educational model was designed in collaboration with a medical librarian and was implemented within the context of a busy surgery clerkship using only modest resources that were readily available within the medical school. During three years, 275 medical students participated in the exercise and evaluated both the PDQ system and the educational model. Overall, 87% of the students considered the PDQ system to be a valuable source of information, and 84% anticipated using PDQ after completing their surgical rotations. Ninety-six percent of the students agreed that the objectives of the exercise were met. This article provides a description of the educational model and discusses the principles of adult education and andragogy on which it was developed. The importance of emphasizing the process of learning as well as the content is described relative to self-directed and life-long learning.
Fu, Lawrence D.; Aphinyanaphongs, Yindalon; Wang, Lily; Aliferis, Constantin F.
2011-01-01
Evaluating the biomedical literature and health-related websites for quality are challenging information retrieval tasks. Current commonly used methods include impact factor for journals, PubMed’s clinical query filters and machine learning-based filter models for articles, and PageRank for websites. Previous work has focused on the average performance of these methods without considering the topic, and it is unknown how performance varies for specific topics or focused searches. Clinicians, researchers, and users should be aware when expected performance is not achieved for specific topics. The present work analyzes the behavior of these methods for a variety of topics. Impact factor, clinical query filters, and PageRank vary widely across different topics while a topic-specific impact factor and machine learning-based filter models are more stable. The results demonstrate that a method may perform excellently on average but struggle when used on a number of narrower topics. Topic adjusted metrics and other topic robust methods have an advantage in such situations. Users of traditional topic-sensitive metrics should be aware of their limitations. PMID:21419864
Yom-Tov, Elad; Fernandez-Luque, Luis
2014-01-01
Vaccination campaigns are one of the most important and successful public health programs ever undertaken. People who want to learn about vaccines in order to make an informed decision on whether to vaccinate are faced with a wealth of information on the Internet, both for and against vaccinations. In this paper we develop an automated way to score Internet search queries and web pages as to the likelihood that a person making these queries or reading those pages would decide to vaccinate. We apply this method to data from a major Internet search engine, while people seek information about the Measles, Mumps and Rubella (MMR) vaccine. We show that our method is accurate, and use it to learn about the information acquisition process of people. Our results show that people who are pro-vaccination as well as people who are anti-vaccination seek similar information, but browsing this information has differing effect on their future browsing. These findings demonstrate the need for health authorities to tailor their information according to the current stance of users. PMID:25954435
Graphical modeling and query language for hospitals.
Barzdins, Janis; Barzdins, Juris; Rencis, Edgars; Sostaks, Agris
2013-01-01
So far there has been little evidence that implementation of the health information technologies (HIT) is leading to health care cost savings. One of the reasons for this lack of impact by the HIT likely lies in the complexity of the business process ownership in the hospitals. The goal of our research is to develop a business model-based method for hospital use which would allow doctors to retrieve directly the ad-hoc information from various hospital databases. We have developed a special domain-specific process modelling language called the MedMod. Formally, we define the MedMod language as a profile on UML Class diagrams, but we also demonstrate it on examples, where we explain the semantics of all its elements informally. Moreover, we have developed the Process Query Language (PQL) that is based on MedMod process definition language. The purpose of PQL is to allow a doctor querying (filtering) runtime data of hospital's processes described using MedMod. The MedMod language tries to overcome deficiencies in existing process modeling languages, allowing to specify the loosely-defined sequence of the steps to be performed in the clinical process. The main advantages of PQL are in two main areas - usability and efficiency. They are: 1) the view on data through "glasses" of familiar process, 2) the simple and easy-to-perceive means of setting filtering conditions require no more expertise than using spreadsheet applications, 3) the dynamic response to each step in construction of the complete query that shortens the learning curve greatly and reduces the error rate, and 4) the selected means of filtering and data retrieving allows to execute queries in O(n) time regarding the size of the dataset. We are about to continue developing this project with three further steps. First, we are planning to develop user-friendly graphical editors for the MedMod process modeling and query languages. The second step is to do evaluation of usability the proposed language and tool involving the physicians from several hospitals in Latvia and working with real data from these hospitals. Our third step is to develop an efficient implementation of the query language.
Software Helps Retrieve Information Relevant to the User
NASA Technical Reports Server (NTRS)
Mathe, Natalie; Chen, James
2003-01-01
The Adaptive Indexing and Retrieval Agent (ARNIE) is a code library, designed to be used by an application program, that assists human users in retrieving desired information in a hypertext setting. Using ARNIE, the program implements a computational model for interactively learning what information each human user considers relevant in context. The model, called a "relevance network," incrementally adapts retrieved information to users individual profiles on the basis of feedback from the users regarding specific queries. The model also generalizes such knowledge for subsequent derivation of relevant references for similar queries and profiles, thereby, assisting users in filtering information by relevance. ARNIE thus enables users to categorize and share information of interest in various contexts. ARNIE encodes the relevance and structure of information in a neural network dynamically configured with a genetic algorithm. ARNIE maintains an internal database, wherein it saves associations, and from which it returns associated items in response to a query. A C++ compiler for a platform on which ARNIE will be utilized is necessary for creating the ARNIE library but is not necessary for the execution of the software.
Shan, Liran Christine; Panagiotopoulos, Panagiotis; Regan, Áine; De Brún, Aoife; Barnett, Julie; Wall, Patrick; McConnon, Áine
2015-01-01
To examine the use and impact of social media on 2-way communication between consumers and public organizations in the food safety and nutrition area. In-depth qualitative study conducted between October, 2012 and January, 2013, using semi-structured interviews in the United Kingdom and Ireland. Sixteen professionals worked on the public interface within 5 national organizations with a role in communicating on food safety and nutrition issues in this thematic analysis. Five main themes were identified: gradual shift toward social media-based queries and complaints; challenges and limitations of social media to deal with queries and complaints; benefits of using social media in query and complaint services; content redesign driven by social media use; and using social media to learn more about consumers. Social media penetrated and brought new opportunities to food organizations' interactions with the public. Given the increasing use of social media by the public, food organizations need to explore such new opportunities for communication and research. Copyright © 2015 Society for Nutrition Education and Behavior. Published by Elsevier Inc. All rights reserved.
NASA Technical Reports Server (NTRS)
Steeman, Gerald; Connell, Christopher
2000-01-01
Many librarians may feel that dynamic Web pages are out of their reach, financially and technically. Yet we are reminded in library and Web design literature that static home pages are a thing of the past. This paper describes how librarians at the Institute for Defense Analyses (IDA) library developed a database-driven, dynamic intranet site using commercial off-the-shelf applications. Administrative issues include surveying a library users group for interest and needs evaluation; outlining metadata elements; and, committing resources from managing time to populate the database and training in Microsoft FrontPage and Web-to-database design. Technical issues covered include Microsoft Access database fundamentals, lessons learned in the Web-to-database process (including setting up Database Source Names (DSNs), redesigning queries to accommodate the Web interface, and understanding Access 97 query language vs. Standard Query Language (SQL)). This paper also offers tips on editing Active Server Pages (ASP) scripting to create desired results. A how-to annotated resource list closes out the paper.
End-to-End ASR-Free Keyword Search From Speech
NASA Astrophysics Data System (ADS)
Audhkhasi, Kartik; Rosenberg, Andrew; Sethy, Abhinav; Ramabhadran, Bhuvana; Kingsbury, Brian
2017-12-01
End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech trained with minimal supervision. Our E2E KWS system consists of three sub-systems. The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation. The second sub-system is a character-level RNN language model using embeddings learned from a convolutional neural network. Since the acoustic and text query embeddings occupy different representation spaces, they are input to a third feed-forward neural network that predicts whether the query occurs in the acoustic utterance or not. This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.
Introducing glycomics data into the Semantic Web
2013-01-01
Background Glycoscience is a research field focusing on complex carbohydrates (otherwise known as glycans)a, which can, for example, serve as “switches” that toggle between different functions of a glycoprotein or glycolipid. Due to the advancement of glycomics technologies that are used to characterize glycan structures, many glycomics databases are now publicly available and provide useful information for glycoscience research. However, these databases have almost no link to other life science databases. Results In order to implement support for the Semantic Web most efficiently for glycomics research, the developers of major glycomics databases agreed on a minimal standard for representing glycan structure and annotation information using RDF (Resource Description Framework). Moreover, all of the participants implemented this standard prototype and generated preliminary RDF versions of their data. To test the utility of the converted data, all of the data sets were uploaded into a Virtuoso triple store, and several SPARQL queries were tested as “proofs-of-concept” to illustrate the utility of the Semantic Web in querying across databases which were originally difficult to implement. Conclusions We were able to successfully retrieve information by linking UniCarbKB, GlycomeDB and JCGGDB in a single SPARQL query to obtain our target information. We also tested queries linking UniProt with GlycoEpitope as well as lectin data with GlycomeDB through PDB. As a result, we have been able to link proteomics data with glycomics data through the implementation of Semantic Web technologies, allowing for more flexible queries across these domains. PMID:24280648
Introducing glycomics data into the Semantic Web.
Aoki-Kinoshita, Kiyoko F; Bolleman, Jerven; Campbell, Matthew P; Kawano, Shin; Kim, Jin-Dong; Lütteke, Thomas; Matsubara, Masaaki; Okuda, Shujiro; Ranzinger, Rene; Sawaki, Hiromichi; Shikanai, Toshihide; Shinmachi, Daisuke; Suzuki, Yoshinori; Toukach, Philip; Yamada, Issaku; Packer, Nicolle H; Narimatsu, Hisashi
2013-11-26
Glycoscience is a research field focusing on complex carbohydrates (otherwise known as glycans)a, which can, for example, serve as "switches" that toggle between different functions of a glycoprotein or glycolipid. Due to the advancement of glycomics technologies that are used to characterize glycan structures, many glycomics databases are now publicly available and provide useful information for glycoscience research. However, these databases have almost no link to other life science databases. In order to implement support for the Semantic Web most efficiently for glycomics research, the developers of major glycomics databases agreed on a minimal standard for representing glycan structure and annotation information using RDF (Resource Description Framework). Moreover, all of the participants implemented this standard prototype and generated preliminary RDF versions of their data. To test the utility of the converted data, all of the data sets were uploaded into a Virtuoso triple store, and several SPARQL queries were tested as "proofs-of-concept" to illustrate the utility of the Semantic Web in querying across databases which were originally difficult to implement. We were able to successfully retrieve information by linking UniCarbKB, GlycomeDB and JCGGDB in a single SPARQL query to obtain our target information. We also tested queries linking UniProt with GlycoEpitope as well as lectin data with GlycomeDB through PDB. As a result, we have been able to link proteomics data with glycomics data through the implementation of Semantic Web technologies, allowing for more flexible queries across these domains.
Tai, David; Fang, Jianwen
2012-08-27
The large sizes of today's chemical databases require efficient algorithms to perform similarity searches. It can be very time consuming to compare two large chemical databases. This paper seeks to build upon existing research efforts by describing a novel strategy for accelerating existing search algorithms for comparing large chemical collections. The quest for efficiency has focused on developing better indexing algorithms by creating heuristics for searching individual chemical against a chemical library by detecting and eliminating needless similarity calculations. For comparing two chemical collections, these algorithms simply execute searches for each chemical in the query set sequentially. The strategy presented in this paper achieves a speedup upon these algorithms by indexing the set of all query chemicals so redundant calculations that arise in the case of sequential searches are eliminated. We implement this novel algorithm by developing a similarity search program called Symmetric inDexing or SymDex. SymDex shows over a 232% maximum speedup compared to the state-of-the-art single query search algorithm over real data for various fingerprint lengths. Considerable speedup is even seen for batch searches where query set sizes are relatively small compared to typical database sizes. To the best of our knowledge, SymDex is the first search algorithm designed specifically for comparing chemical libraries. It can be adapted to most, if not all, existing indexing algorithms and shows potential for accelerating future similarity search algorithms for comparing chemical databases.
Exposing the cancer genome atlas as a SPARQL endpoint
Deus, Helena F.; Veiga, Diogo F.; Freire, Pablo R.; Weinstein, John N.; Mills, Gordon B.; Almeida, Jonas S.
2011-01-01
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source. PMID:20851208
gmos: Rapid Detection of Genome Mosaicism over Short Evolutionary Distances.
Domazet-Lošo, Mirjana; Domazet-Lošo, Tomislav
2016-01-01
Prokaryotic and viral genomes are often altered by recombination and horizontal gene transfer. The existing methods for detecting recombination are primarily aimed at viral genomes or sets of loci, since the expensive computation of underlying statistical models often hinders the comparison of complete prokaryotic genomes. As an alternative, alignment-free solutions are more efficient, but cannot map (align) a query to subject genomes. To address this problem, we have developed gmos (Genome MOsaic Structure), a new program that determines the mosaic structure of query genomes when compared to a set of closely related subject genomes. The program first computes local alignments between query and subject genomes and then reconstructs the query mosaic structure by choosing the best local alignment for each query region. To accomplish the analysis quickly, the program mostly relies on pairwise alignments and constructs multiple sequence alignments over short overlapping subject regions only when necessary. This fine-tuned implementation achieves an efficiency comparable to an alignment-free tool. The program performs well for simulated and real data sets of closely related genomes and can be used for fast recombination detection; for instance, when a new prokaryotic pathogen is discovered. As an example, gmos was used to detect genome mosaicism in a pathogenic Enterococcus faecium strain compared to seven closely related genomes. The analysis took less than two minutes on a single 2.1 GHz processor. The output is available in fasta format and can be visualized using an accessory program, gmosDraw (freely available with gmos).
gmos: Rapid Detection of Genome Mosaicism over Short Evolutionary Distances
Domazet-Lošo, Mirjana; Domazet-Lošo, Tomislav
2016-01-01
Prokaryotic and viral genomes are often altered by recombination and horizontal gene transfer. The existing methods for detecting recombination are primarily aimed at viral genomes or sets of loci, since the expensive computation of underlying statistical models often hinders the comparison of complete prokaryotic genomes. As an alternative, alignment-free solutions are more efficient, but cannot map (align) a query to subject genomes. To address this problem, we have developed gmos (Genome MOsaic Structure), a new program that determines the mosaic structure of query genomes when compared to a set of closely related subject genomes. The program first computes local alignments between query and subject genomes and then reconstructs the query mosaic structure by choosing the best local alignment for each query region. To accomplish the analysis quickly, the program mostly relies on pairwise alignments and constructs multiple sequence alignments over short overlapping subject regions only when necessary. This fine-tuned implementation achieves an efficiency comparable to an alignment-free tool. The program performs well for simulated and real data sets of closely related genomes and can be used for fast recombination detection; for instance, when a new prokaryotic pathogen is discovered. As an example, gmos was used to detect genome mosaicism in a pathogenic Enterococcus faecium strain compared to seven closely related genomes. The analysis took less than two minutes on a single 2.1 GHz processor. The output is available in fasta format and can be visualized using an accessory program, gmosDraw (freely available with gmos). PMID:27846272
Optimizing Earth Data Search Ranking using Deep Learning and Real-time User Behaviour
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C. P.; Armstrong, E. M.; Huang, T.; Moroni, D. F.; McGibbney, L. J.; Greguska, F. R., III
2017-12-01
Finding Earth science data has been a challenging problem given both the quantity of data available and the heterogeneity of the data across a wide variety of domains. Current search engines in most geospatial data portals tend to induce end users to focus on one single data characteristic dimension (e.g., term frequency-inverse document frequency (TF-IDF) score, popularity, release date, etc.). This approach largely fails to take account of users' multidimensional preferences for geospatial data, and hence may likely result in a less than optimal user experience in discovering the most applicable dataset out of a vast range of available datasets. With users interacting with search engines, sufficient information is already hidden in the log files. Compared with explicit feedback data, information that can be derived/extracted from log files is virtually free and substantially more timely. In this dissertation, I propose an online deep learning framework that can quickly update the learning function based on real-time user clickstream data. The contributions of this framework include 1) a log processor that can ingest, process and create training data from web logs in a real-time manner; 2) a query understanding module to better interpret users' search intent using web log processing results and metadata; 3) a feature extractor that identifies ranking features representing users' multidimensional interests of geospatial data; and 4) a deep learning based ranking algorithm that can be trained incrementally using user behavior data. The search ranking results will be evaluated using precision at K and normalized discounted cumulative gain (NDCG).
Query-by-example surgical activity detection.
Gao, Yixin; Vedula, S Swaroop; Lee, Gyusung I; Lee, Mija R; Khudanpur, Sanjeev; Hager, Gregory D
2016-06-01
Easy acquisition of surgical data opens many opportunities to automate skill evaluation and teaching. Current technology to search tool motion data for surgical activity segments of interest is limited by the need for manual pre-processing, which can be prohibitive at scale. We developed a content-based information retrieval method, query-by-example (QBE), to automatically detect activity segments within surgical data recordings of long duration that match a query. The example segment of interest (query) and the surgical data recording (target trial) are time series of kinematics. Our approach includes an unsupervised feature learning module using a stacked denoising autoencoder (SDAE), two scoring modules based on asymmetric subsequence dynamic time warping (AS-DTW) and template matching, respectively, and a detection module. A distance matrix of the query against the trial is computed using the SDAE features, followed by AS-DTW combined with template scoring, to generate a ranked list of candidate subsequences (substrings). To evaluate the quality of the ranked list against the ground-truth, thresholding conventional DTW distances and bipartite matching are applied. We computed the recall, precision, F1-score, and a Jaccard index-based score on three experimental setups. We evaluated our QBE method using a suture throw maneuver as the query, on two tool motion datasets (JIGSAWS and MISTIC-SL) captured in a training laboratory. We observed a recall of 93, 90 and 87 % and a precision of 93, 91, and 88 % with same surgeon same trial (SSST), same surgeon different trial (SSDT) and different surgeon (DS) experiment setups on JIGSAWS, and a recall of 87, 81 and 75 % and a precision of 72, 61, and 53 % with SSST, SSDT and DS experiment setups on MISTIC-SL, respectively. We developed a novel, content-based information retrieval method to automatically detect multiple instances of an activity within long surgical recordings. Our method demonstrated adequate recall across different complexity datasets and experimental conditions.
Active learning for noisy oracle via density power divergence.
Sogawa, Yasuhiro; Ueno, Tsuyoshi; Kawahara, Yoshinobu; Washio, Takashi
2013-10-01
The accuracy of active learning is critically influenced by the existence of noisy labels given by a noisy oracle. In this paper, we propose a novel pool-based active learning framework through robust measures based on density power divergence. By minimizing density power divergence, such as β-divergence and γ-divergence, one can estimate the model accurately even under the existence of noisy labels within data. Accordingly, we develop query selecting measures for pool-based active learning using these divergences. In addition, we propose an evaluation scheme for these measures based on asymptotic statistical analyses, which enables us to perform active learning by evaluating an estimation error directly. Experiments with benchmark datasets and real-world image datasets show that our active learning scheme performs better than several baseline methods. Copyright © 2013 Elsevier Ltd. All rights reserved.
Predicate Oriented Pattern Analysis for Biomedical Knowledge Discovery
Shen, Feichen; Liu, Hongfang; Sohn, Sunghwan; Larson, David W.; Lee, Yugyung
2017-01-01
In the current biomedical data movement, numerous efforts have been made to convert and normalize a large number of traditional structured and unstructured data (e.g., EHRs, reports) to semi-structured data (e.g., RDF, OWL). With the increasing number of semi-structured data coming into the biomedical community, data integration and knowledge discovery from heterogeneous domains become important research problem. In the application level, detection of related concepts among medical ontologies is an important goal of life science research. It is more crucial to figure out how different concepts are related within a single ontology or across multiple ontologies by analysing predicates in different knowledge bases. However, the world today is one of information explosion, and it is extremely difficult for biomedical researchers to find existing or potential predicates to perform linking among cross domain concepts without any support from schema pattern analysis. Therefore, there is a need for a mechanism to do predicate oriented pattern analysis to partition heterogeneous ontologies into closer small topics and do query generation to discover cross domain knowledge from each topic. In this paper, we present such a model that predicates oriented pattern analysis based on their close relationship and generates a similarity matrix. Based on this similarity matrix, we apply an innovated unsupervised learning algorithm to partition large data sets into smaller and closer topics and generate meaningful queries to fully discover knowledge over a set of interlinked data sources. We have implemented a prototype system named BmQGen and evaluate the proposed model with colorectal surgical cohort from the Mayo Clinic. PMID:28983419
An ontology-based method for secondary use of electronic dental record data
Schleyer, Titus KL; Ruttenberg, Alan; Duncan, William; Haendel, Melissa; Torniai, Carlo; Acharya, Amit; Song, Mei; Thyvalikakath, Thankam P.; Liu, Kaihong; Hernandez, Pedro
A key question for healthcare is how to operationalize the vision of the Learning Healthcare System, in which electronic health record data become a continuous information source for quality assurance and research. This project presents an initial, ontology-based, method for secondary use of electronic dental record (EDR) data. We defined a set of dental clinical research questions; constructed the Oral Health and Disease Ontology (OHD); analyzed data from a commercial EDR database; and created a knowledge base, with the OHD used to represent clinical data about 4,500 patients from a single dental practice. Currently, the OHD includes 213 classes and reuses 1,658 classes from other ontologies. We have developed an initial set of SPARQL queries to allow extraction of data about patients, teeth, surfaces, restorations and findings. Further work will establish a complete, open and reproducible workflow for extracting and aggregating data from a variety of EDRs for research and quality assurance. PMID:24303273
Scalable and responsive event processing in the cloud
Suresh, Visalakshmi; Ezhilchelvan, Paul; Watson, Paul
2013-01-01
Event processing involves continuous evaluation of queries over streams of events. Response-time optimization is traditionally done over a fixed set of nodes and/or by using metrics measured at query-operator levels. Cloud computing makes it easy to acquire and release computing nodes as required. Leveraging this flexibility, we propose a novel, queueing-theory-based approach for meeting specified response-time targets against fluctuating event arrival rates by drawing only the necessary amount of computing resources from a cloud platform. In the proposed approach, the entire processing engine of a distinct query is modelled as an atomic unit for predicting response times. Several such units hosted on a single node are modelled as a multiple class M/G/1 system. These aspects eliminate intrusive, low-level performance measurements at run-time, and also offer portability and scalability. Using model-based predictions, cloud resources are efficiently used to meet response-time targets. The efficacy of the approach is demonstrated through cloud-based experiments. PMID:23230164
Loss-tolerant measurement-device-independent quantum private queries
Zhao, Liang-Yuan; Yin, Zhen-Qiang; Chen, Wei; Qian, Yong-Jun; Zhang, Chun-Mei; Guo, Guang-Can; Han, Zheng-Fu
2017-01-01
Quantum private queries (QPQ) is an important cryptography protocol aiming to protect both the user’s and database’s privacy when the database is queried privately. Recently, a variety of practical QPQ protocols based on quantum key distribution (QKD) have been proposed. However, for QKD-based QPQ the user’s imperfect detectors can be subjected to some detector- side-channel attacks launched by the dishonest owner of the database. Here, we present a simple example that shows how the detector-blinding attack can damage the security of QKD-based QPQ completely. To remove all the known and unknown detector side channels, we propose a solution of measurement-device-independent QPQ (MDI-QPQ) with single- photon sources. The security of the proposed protocol has been analyzed under some typical attacks. Moreover, we prove that its security is completely loss independent. The results show that practical QPQ will remain the same degree of privacy as before even with seriously uncharacterized detectors. PMID:28051101
Active learning in the presence of unlabelable examples
NASA Technical Reports Server (NTRS)
Mazzoni, Dominic; Wagstaff, Kiri
2004-01-01
We propose a new active learning framework where the expert labeler is allowed to decline to label any example. This may be necessary because the true label is unknown or because the example belongs to a class that is not part of the real training problem. We show that within this framework, popular active learning algorithms (such as Simple) may perform worse than random selection because they make so many queries to the unlabelable class. We present a method by which any active learning algorithm can be modified to avoid unlabelable examples by training a second classifier to distinguish between the labelable and unlabelable classes. We also demonstrate the effectiveness of the method on two benchmark data sets and a real-world problem.
Less is more: Sampling chemical space with active learning
NASA Astrophysics Data System (ADS)
Smith, Justin S.; Nebgen, Ben; Lubbers, Nicholas; Isayev, Olexandr; Roitberg, Adrian E.
2018-06-01
The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub) which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original random sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
Improving Protein Fold Recognition by Deep Learning Networks
NASA Astrophysics Data System (ADS)
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-01
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl’s benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed the comparable results at the level of family and superfamily, compared to ensemble DN-Fold. Finally, we extended the binary classification problem of fold recognition to real-value regression task, which also show a promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
Ong, Edison; Xiang, Zuoshuang; Zhao, Bin; Liu, Yue; Lin, Yu; Zheng, Jie; Mungall, Chris; Courtot, Mélanie; Ruttenberg, Alan; He, Yongqun
2017-01-01
Linked Data (LD) aims to achieve interconnected data by representing entities using Unified Resource Identifiers (URIs), and sharing information using Resource Description Frameworks (RDFs) and HTTP. Ontologies, which logically represent entities and relations in specific domains, are the basis of LD. Ontobee (http://www.ontobee.org/) is a linked ontology data server that stores ontology information using RDF triple store technology and supports query, visualization and linkage of ontology terms. Ontobee is also the default linked data server for publishing and browsing biomedical ontologies in the Open Biological Ontology (OBO) Foundry (http://obofoundry.org) library. Ontobee currently hosts more than 180 ontologies (including 131 OBO Foundry Library ontologies) with over four million terms. Ontobee provides a user-friendly web interface for querying and visualizing the details and hierarchy of a specific ontology term. Using the eXtensible Stylesheet Language Transformation (XSLT) technology, Ontobee is able to dereference a single ontology term URI, and then output RDF/eXtensible Markup Language (XML) for computer processing or display the HTML information on a web browser for human users. Statistics and detailed information are generated and displayed for each ontology listed in Ontobee. In addition, a SPARQL web interface is provided for custom advanced SPARQL queries of one or multiple ontologies. PMID:27733503
Sundanese ancient manuscripts search engine using probability approach
NASA Astrophysics Data System (ADS)
Suryani, Mira; Hadi, Setiawan; Paulus, Erick; Nurma Yulita, Intan; Supriatna, Asep K.
2017-10-01
Today, Information and Communication Technology (ICT) has become a regular thing for every aspect of live include cultural and heritage aspect. Sundanese ancient manuscripts as Sundanese heritage are in damage condition and also the information that containing on it. So in order to preserve the information in Sundanese ancient manuscripts and make them easier to search, a search engine has been developed. The search engine must has good computing ability. In order to get the best computation in developed search engine, three types of probabilistic approaches: Bayesian Networks Model, Divergence from Randomness with PL2 distribution, and DFR-PL2F as derivative form DFR-PL2 have been compared in this study. The three probabilistic approaches supported by index of documents and three different weighting methods: term occurrence, term frequency, and TF-IDF. The experiment involved 12 Sundanese ancient manuscripts. From 12 manuscripts there are 474 distinct terms. The developed search engine tested by 50 random queries for three types of query. The experiment results showed that for the single query and multiple query, the best searching performance given by the combination of PL2F approach and TF-IDF weighting method. The performance has been evaluated using average time responds with value about 0.08 second and Mean Average Precision (MAP) about 0.33.
A new reference implementation of the PSICQUIC web service.
del-Toro, Noemi; Dumousseau, Marine; Orchard, Sandra; Jimenez, Rafael C; Galeota, Eugenia; Launay, Guillaume; Goll, Johannes; Breuer, Karin; Ono, Keiichiro; Salwinski, Lukasz; Hermjakob, Henning
2013-07-01
The Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) specification was created by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to enable computational access to molecular-interaction data resources by means of a standard Web Service and query language. Currently providing >150 million binary interaction evidences from 28 servers globally, the PSICQUIC interface allows the concurrent search of multiple molecular-interaction information resources using a single query. Here, we present an extension of the PSICQUIC specification (version 1.3), which has been released to be compliant with the enhanced standards in molecular interactions. The new release also includes a new reference implementation of the PSICQUIC server available to the data providers. It offers augmented web service capabilities and improves the user experience. PSICQUIC has been running for almost 5 years, with a user base growing from only 4 data providers to 28 (April 2013) allowing access to 151 310 109 binary interactions. The power of this web service is shown in PSICQUIC View web application, an example of how to simultaneously query, browse and download results from the different PSICQUIC servers. This application is free and open to all users with no login requirement (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml).
Ashkenazy, Haim; Abadi, Shiran; Martz, Eric; Chay, Ofer; Mayrose, Itay; Pupko, Tal; Ben-Tal, Nir
2016-01-01
The degree of evolutionary conservation of an amino acid in a protein or a nucleic acid in DNA/RNA reflects a balance between its natural tendency to mutate and the overall need to retain the structural integrity and function of the macromolecule. The ConSurf web server (http://consurf.tau.ac.il), established over 15 years ago, analyses the evolutionary pattern of the amino/nucleic acids of the macromolecule to reveal regions that are important for structure and/or function. Starting from a query sequence or structure, the server automatically collects homologues, infers their multiple sequence alignment and reconstructs a phylogenetic tree that reflects their evolutionary relations. These data are then used, within a probabilistic framework, to estimate the evolutionary rates of each sequence position. Here we introduce several new features into ConSurf, including automatic selection of the best evolutionary model used to infer the rates, the ability to homology-model query proteins, prediction of the secondary structure of query RNA molecules from sequence, the ability to view the biological assembly of a query (in addition to the single chain), mapping of the conservation grades onto 2D RNA models and an advanced view of the phylogenetic tree that enables interactively rerunning ConSurf with the taxa of a sub-tree. PMID:27166375
Conflict and Accommodation in Classroom Codeswitching in Taiwan
ERIC Educational Resources Information Center
Tien, Ching-yi
2009-01-01
The concept of "English only" as the best teaching-learning method in English as a foreign language classrooms has been promoted in Taiwan over the last decade. During that time, the concept has been queried and debated. Teachers and learners have come to realise that for beginners and slow language learners, the use of codeswitching in…
Urban Forest Health Needs Assessment Survey: Results and Recommendations
Jill D. Pokorny
1998-01-01
The survey was designed to query urban forestry professionals in the 20 northeastern and Midwest States and the District of Columbia, which are served by the Northeastern Area, to learn about their attitudes toward the general issue of urban forest health, identify specific training and information needs in the area of urban tree health management, and discover...
ERIC Educational Resources Information Center
Javaid, Muhammad Asim; Chakraborty, Shelly; Cryan, John F.; Schellekens, Harriët; Toulouse, André
2018-01-01
Recent studies have highlighted a fear or difficulty with the study and understanding of neuroanatomy among medical and healthcare students. This has been linked with a diminished confidence of clinical practitioners and students to manage patients with neurological conditions. The underlying reasons for this difficulty have been queried among a…
Learning from and Teaching about Disaster: The Case of the April 2011 Tornado Outbreak
ERIC Educational Resources Information Center
Ray, Beverly; Hocutt, Martha
2016-01-01
The purpose of this research was to examine how secondary teachers teach about disaster using the media during periods of declared natural disaster and post-disaster in their local communities. Issues of preparedness, pre and post disaster, were examined as well. Secondary teachers were queried in one public school system in the Southeastern…
ERIC Educational Resources Information Center
Maitaouthong, Therdsak; Tuamsuk, Kulthida; Techamanee, Yupin
2011-01-01
This study was aimed at developing an instructional model by integrating information literacy in the instructional process of general education courses at an undergraduate level. The research query, "What is the teaching methodology that integrates information literacy in the instructional process of general education courses at an undergraduate…
Writing/Thinking in Real Time: Digital Video and Corpus Query Analysis
ERIC Educational Resources Information Center
Park, Kwanghyun; Kinginger, Celeste
2010-01-01
The advance of digital video technology in the past two decades facilitates empirical investigation of learning in real time. The focus of this paper is the combined use of real-time digital video and a networked linguistic corpus for exploring the ways in which these technologies enhance our capability to investigate the cognitive process of…
Performing Cities: Engaging the High-Tech Flaneur
ERIC Educational Resources Information Center
Rodriguez, Karen; Rink, Bradley
2011-01-01
As international educators, the overwhelmingly urban context of study abroad requires not only a more intense querying of "the city" as a concept, but also more attention to how students could engage with and learn from the urban landscape. In this article, the authors begin with a theoretical flanerie through a range of ideas related to the city…
The Teacher "Problem": An Analysis of the NSW Education Policy "Great Teaching, Inspired Learning"
ERIC Educational Resources Information Center
Stacey, Meghan
2017-01-01
This paper seeks to understand the construction of teachers within one New South Wales education policy, querying this construction in relation to both local and international processes and factors. As such, it also looks to contribute to a growing body of international literature which grapples with the role and nature of neoliberal policy…
ERIC Educational Resources Information Center
Mills, Robert J.; Dupin-Bryant, Pamela A.; Johnson, John D.; Beaulieu, Tanya Y.
2015-01-01
The demand for Information Systems (IS) graduates with expertise in Structured Query Language (SQL) and database management is vast and projected to increase as "big data" becomes ubiquitous. To prepare students to solve complex problems in a data-driven world, educators must explore instructional strategies to help link prior knowledge…
Text Messaging for Student Communication and Voting
ERIC Educational Resources Information Center
McClean, Stephen; Hagan, Paul; Morgan, Jason
2010-01-01
Text messaging has gained widespread popularity in higher education as a communication tool and as a means of engaging students in the learning process. In this study we report on the use of text messaging in a large, year-one introductory chemistry module where students were encouraged to send questions and queries to a dedicated text number both…
ERIC Educational Resources Information Center
Adetayo, Janet Oyebola
2016-01-01
The National Teachers' Institute (NTI) has been involved in training teachers for the Nigeria Certificate in Education (NCE) for over three decades through Distance Learning Programme. The quality of these teachers has been queried particularly when compared with those who run full-time NCE Programme. This study, therefore, determined the…
ERIC Educational Resources Information Center
Adetayo, Janet Oyebola
2016-01-01
The National Teachers' Institute (NTI) has been involved in training teachers for the Nigeria Certificate in Education (NCE) for over three decades through Distance Learning Programme. The quality of these teachers has been queried particularly when compared with those who run full-time NCE Programme. This study, therefore, determined the…
Big Data Analytics with Datalog Queries on Spark.
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2016-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.
Big Data Analytics with Datalog Queries on Spark
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2017-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics. PMID:28626296
UniNE at TREC 2008: Fact and Opinion Retrieval in the Blogsphere
2008-11-01
where (1 ) ( / ) j j ij i j t q j ij i n df k tf Score D Q qtf df K tf K k b b l avdl ∈ − + ⋅ = ⋅ ⋅ + = ⋅ − + ⋅ ∑ (1) in which the...the document score was evaluated as: ( , ) j i j ij t q Score D Q qtf w ∈ = ⋅∑ (2) where qtfj denotes the frequency of term tj in query Q, and...query submitted [8], [9]. To combine two or more single runs, we applied the Z- Score operator [10] defined as: jj k k jj j RSV Mean Z Score RSV
Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy.
Tian, Yuling; Zhang, Hongxian
2016-01-01
For the purposes of information retrieval, users must find highly relevant documents from within a system (and often a quite large one comprised of many individual documents) based on input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor, and a hot research topic-there already exist several rank-learning methods based on machine learning techniques which can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others in respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm indeed effectively and rapidly identifies optimal ranking functions.
Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy
Tian, Yuling; Zhang, Hongxian
2016-01-01
For the purposes of information retrieval, users must find highly relevant documents from within a system (and often a quite large one comprised of many individual documents) based on input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor, and a hot research topic–there already exist several rank-learning methods based on machine learning techniques which can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others in respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm indeed effectively and rapidly identifies optimal ranking functions. PMID:27487242
Exposing the cancer genome atlas as a SPARQL endpoint.
Deus, Helena F; Veiga, Diogo F; Freire, Pablo R; Weinstein, John N; Mills, Gordon B; Almeida, Jonas S
2010-12-01
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source. Copyright © 2010 Elsevier Inc. All rights reserved.
Implementation of relational data base management systems on micro-computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huang, C.L.
1982-01-01
This dissertation describes an implementation of a Relational Data Base Management System on a microcomputer. A specific floppy disk based hardward called TERAK is being used, and high level query interface which is similar to a subset of the SEQUEL language is provided. The system contains sub-systems such as I/O, file management, virtual memory management, query system, B-tree management, scanner, command interpreter, expression compiler, garbage collection, linked list manipulation, disk space management, etc. The software has been implemented to fulfill the following goals: (1) it is highly modularized. (2) The system is physically segmented into 16 logically independent, overlayable segments,more » in a way such that a minimal amount of memory is needed at execution time. (3) Virtual memory system is simulated that provides the system with seemingly unlimited memory space. (4) A language translator is applied to recognize user requests in the query language. The code generation of this translator generates compact code for the execution of UPDATE, DELETE, and QUERY commands. (5) A complete set of basic functions needed for on-line data base manipulations is provided through the use of a friendly query interface. (6) To eliminate the dependency on the environment (both software and hardware) as much as possible, so that it would be easy to transplant the system to other computers. (7) To simulate each relation as a sequential file. It is intended to be a highly efficient, single user system suited to be used by small or medium sized organizations for, say, administrative purposes. Experiments show that quite satisfying results have indeed been achieved.« less
G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases.
Wang, Xiaohong; Smalter, Aaron; Huan, Jun; Lushington, Gerald H
2009-01-01
Structured data including sets, sequences, trees and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents, among others.Most of the current graph indexing methods focus on subgraph query processing, i.e. determining the set of database graphs that contains the query graph and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions have (i) high computational complexity and (ii) non-trivial difficulty to be indexed in a graph database.Our objective is to bridge graph kernel function and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our method of similarity measurement builds upon local features extracted from each node and their neighboring nodes in graphs. A hash table is utilized to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and for fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Most importantly, the new similarity measurement and the index structure is scalable to large database with smaller indexing size, faster indexing construction time, and faster query processing time as compared to state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
PIBAS FedSPARQL: a web-based platform for integration and exploration of bioinformatics datasets.
Djokic-Petrovic, Marija; Cvjetkovic, Vladimir; Yang, Jeremy; Zivanovic, Marko; Wild, David J
2017-09-20
There are a huge variety of data sources relevant to chemical, biological and pharmacological research, but these data sources are highly siloed and cannot be queried together in a straightforward way. Semantic technologies offer the ability to create links and mappings across datasets and manage them as a single, linked network so that searching can be carried out across datasets, independently of the source. We have developed an application called PIBAS FedSPARQL that uses semantic technologies to allow researchers to carry out such searching across a vast array of data sources. PIBAS FedSPARQL is a web-based query builder and result set visualizer of bioinformatics data. As an advanced feature, our system can detect similar data items identified by different Uniform Resource Identifiers (URIs), using a text-mining algorithm based on the processing of named entities to be used in Vector Space Model and Cosine Similarity Measures. According to our knowledge, PIBAS FedSPARQL was unique among the systems that we found in that it allows detecting of similar data items. As a query builder, our system allows researchers to intuitively construct and run Federated SPARQL queries across multiple data sources, including global initiatives, such as Bio2RDF, Chem2Bio2RDF, EMBL-EBI, and one local initiative called CPCTAS, as well as additional user-specified data source. From the input topic, subtopic, template and keyword, a corresponding initial Federated SPARQL query is created and executed. Based on the data obtained, end users have the ability to choose the most appropriate data sources in their area of interest and exploit their Resource Description Framework (RDF) structure, which allows users to select certain properties of data to enhance query results. The developed system is flexible and allows intuitive creation and execution of queries for an extensive range of bioinformatics topics. Also, the novel "similar data items detection" algorithm can be particularly useful for suggesting new data sources and cost optimization for new experiments. PIBAS FedSPARQL can be expanded with new topics, subtopics and templates on demand, rendering information retrieval more robust.
ERIC Educational Resources Information Center
Dyehouse, Jeremiah; Manke, Krysten
2017-01-01
Can John Dewey's experiments at the University of Chicago's Laboratory School teach contemporary inquirers about "learning by making?" This article warrants an affirmative answer to this query. Unlike intellectual historians who trace the source of Dewey's and his colleagues' 1890s pedagogies to their cultural biases, we contend that…
Mining Student Data Captured from a Web-Based Tutoring Tool: Initial Exploration and Results
ERIC Educational Resources Information Center
Merceron, Agathe; Yacef, Kalina
2004-01-01
In this article we describe the initial investigations that we have conducted on student data collected from a web-based tutoring tool. We have used some data mining techniques such as association rule and symbolic data analysis, as well as traditional SQL queries to gain further insight on the students' learning and deduce information to improve…
Spanish for Business: A Journey into Employability
ERIC Educational Resources Information Center
Lallana, Amparo; Pastor-González, Victoria
2016-01-01
As language lecturers, we believe that we equip our graduates with a range of key skills that give them an edge in the employment market. But, query final year students of a Business and Languages degree on the value of language learning for employability, and they are likely to mention a small number of functional abilities such as CV writing and…
Coherent Image Layout using an Adaptive Visual Vocabulary
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dillard, Scott E.; Henry, Michael J.; Bohn, Shawn J.
When querying a huge image database containing millions of images, the result of the query may still contain many thousands of images that need to be presented to the user. We consider the problem of arranging such a large set of images into a visually coherent layout, one that places similar images next to each other. Image similarity is determined using a bag-of-features model, and the layout is constructed from a hierarchical clustering of the image set by mapping an in-order traversal of the hierarchy tree into a space-filling curve. This layout method provides strong locality guarantees so we aremore » able to quantitatively evaluate performance using standard image retrieval benchmarks. Performance of the bag-of-features method is best when the vocabulary is learned on the image set being clustered. Because learning a large, discriminative vocabulary is a computationally demanding task, we present a novel method for efficiently adapting a generic visual vocabulary to a particular dataset. We evaluate our clustering and vocabulary adaptation methods on a variety of image datasets and show that adapting a generic vocabulary to a particular set of images improves performance on both hierarchical clustering and image retrieval tasks.« less
Fu, Lawrence D; Aphinyanaphongs, Yindalon; Wang, Lily; Aliferis, Constantin F
2011-08-01
Evaluating the biomedical literature and health-related websites for quality are challenging information retrieval tasks. Current commonly used methods include impact factor for journals, PubMed's clinical query filters and machine learning-based filter models for articles, and PageRank for websites. Previous work has focused on the average performance of these methods without considering the topic, and it is unknown how performance varies for specific topics or focused searches. Clinicians, researchers, and users should be aware when expected performance is not achieved for specific topics. The present work analyzes the behavior of these methods for a variety of topics. Impact factor, clinical query filters, and PageRank vary widely across different topics while a topic-specific impact factor and machine learning-based filter models are more stable. The results demonstrate that a method may perform excellently on average but struggle when used on a number of narrower topics. Topic-adjusted metrics and other topic robust methods have an advantage in such situations. Users of traditional topic-sensitive metrics should be aware of their limitations. Copyright © 2011 Elsevier Inc. All rights reserved.
Data Container Study for Handling Array-based Data Using Rasdaman, Hive, Spark, and MongoDB
NASA Astrophysics Data System (ADS)
Xu, M.; Hu, F.; Yu, M.; Scheele, C.; Liu, K.; Huang, Q.; Yang, C. P.; Little, M. M.
2016-12-01
Geoscience communities have come up with various big data storage solutions, such as Rasdaman and Hive, to address the grand challenges for massive Earth observation data management and processing. To examine the readiness of current solutions in supporting big Earth observation, we propose to investigate and compare four popular data container solutions, including Rasdaman, Hive, Spark, and MongoDB. Using different types of spatial and non-spatial queries, datasets stored in common scientific data formats (e.g., NetCDF and HDF), and two applications (i.e. dust storm simulation data mining and MERRA data analytics), we systematically compare and evaluate the feature and performance of these four data containers in terms of data discover and access. The computing resources (e.g. CPU, memory, hard drive, network) consumed while performing various queries and operations are monitored and recorded for the performance evaluation. The initial results show that 1) Rasdaman has the best performance for queries on statistical and operational functions, and supports NetCDF data format better than HDF; 2) Rasdaman clustering configuration is more complex than the others; 3) Hive performs better on single pixel extraction from multiple images; and 4) Except for the single pixel extractions, Spark performs better than Hive and its performance is close to Rasdaman. A comprehensive report will detail the experimental results, and compare their pros and cons regarding system performance, ease of use, accessibility, scalability, compatibility, and flexibility.
Automatic Prediction of Protein 3D Structures by Probabilistic Multi-template Homology Modeling.
Meier, Armin; Söding, Johannes
2015-10-01
Homology modeling predicts the 3D structure of a query protein based on the sequence alignment with one or more template proteins of known structure. Its great importance for biological research is owed to its speed, simplicity, reliability and wide applicability, covering more than half of the residues in protein sequence space. Although multiple templates have been shown to generally increase model quality over single templates, the information from multiple templates has so far been combined using empirically motivated, heuristic approaches. We present here a rigorous statistical framework for multi-template homology modeling. First, we find that the query proteins' atomic distance restraints can be accurately described by two-component Gaussian mixtures. This insight allowed us to apply the standard laws of probability theory to combine restraints from multiple templates. Second, we derive theoretically optimal weights to correct for the redundancy among related templates. Third, a heuristic template selection strategy is proposed. We improve the average GDT-ha model quality score by 11% over single template modeling and by 6.5% over a conventional multi-template approach on a set of 1000 query proteins. Robustness with respect to wrong constraints is likewise improved. We have integrated our multi-template modeling approach with the popular MODELLER homology modeling software in our free HHpred server http://toolkit.tuebingen.mpg.de/hhpred and also offer open source software for running MODELLER with the new restraints at https://bitbucket.org/soedinglab/hh-suite.
Liu, Zhao; Zhu, Yunhong; Wu, Chenxue
2016-01-01
Spatial-temporal k-anonymity has become a mainstream approach among techniques for protection of users’ privacy in location-based services (LBS) applications, and has been applied to several variants such as LBS snapshot queries and continuous queries. Analyzing large-scale spatial-temporal anonymity sets may benefit several LBS applications. In this paper, we propose two location prediction methods based on transition probability matrices constructing from sequential rules for spatial-temporal k-anonymity dataset. First, we define single-step sequential rules mined from sequential spatial-temporal k-anonymity datasets generated from continuous LBS queries for multiple users. We then construct transition probability matrices from mined single-step sequential rules, and normalize the transition probabilities in the transition matrices. Next, we regard a mobility model for an LBS requester as a stationary stochastic process and compute the n-step transition probability matrices by raising the normalized transition probability matrices to the power n. Furthermore, we propose two location prediction methods: rough prediction and accurate prediction. The former achieves the probabilities of arriving at target locations along simple paths those include only current locations, target locations and transition steps. By iteratively combining the probabilities for simple paths with n steps and the probabilities for detailed paths with n-1 steps, the latter method calculates transition probabilities for detailed paths with n steps from current locations to target locations. Finally, we conduct extensive experiments, and correctness and flexibility of our proposed algorithm have been verified. PMID:27508502
Secure count query on encrypted genomic data.
Hasan, Mohammad Zahidul; Mahdi, Md Safiur Rahman; Sadat, Md Nazmus; Mohammed, Noman
2018-05-01
Human genomic information can yield more effective healthcare by guiding medical decisions. Therefore, genomics research is gaining popularity as it can identify potential correlations between a disease and a certain gene, which improves the safety and efficacy of drug treatment and can also develop more effective prevention strategies [1]. To reduce the sampling error and to increase the statistical accuracy of this type of research projects, data from different sources need to be brought together since a single organization does not necessarily possess required amount of data. In this case, data sharing among multiple organizations must satisfy strict policies (for instance, HIPAA and PIPEDA) that have been enforced to regulate privacy-sensitive data sharing. Storage and computation on the shared data can be outsourced to a third party cloud service provider, equipped with enormous storage and computation resources. However, outsourcing data to a third party is associated with a potential risk of privacy violation of the participants, whose genomic sequence or clinical profile is used in these studies. In this article, we propose a method for secure sharing and computation on genomic data in a semi-honest cloud server. In particular, there are two main contributions. Firstly, the proposed method can handle biomedical data containing both genotype and phenotype. Secondly, our proposed index tree scheme reduces the computational overhead significantly for executing secure count query operation. In our proposed method, the confidentiality of shared data is ensured through encryption, while making the entire computation process efficient and scalable for cutting-edge biomedical applications. We evaluated our proposed method in terms of efficiency on a database of Single-Nucleotide Polymorphism (SNP) sequences, and experimental results demonstrate that the execution time for a query of 50 SNPs in a database of 50,000 records is approximately 5 s, where each record contains 500 SNPs. And, it requires 69.7 s to execute the query on the same database that also includes phenotypes. Copyright © 2018 Elsevier Inc. All rights reserved.
QCS : a system for querying, clustering, and summarizing documents.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dunlavy, Daniel M.
2006-08-01
Information retrieval systems consist of many complicated components. Research and development of such systems is often hampered by the difficulty in evaluating how each particular component would behave across multiple systems. We present a novel hybrid information retrieval system--the Query, Cluster, Summarize (QCS) system--which is portable, modular, and permits experimentation with different instantiations of each of the constituent text analysis components. Most importantly, the combination of the three types of components in the QCS design improves retrievals by providing users more focused information organized by topic. We demonstrate the improved performance by a series of experiments using standard test setsmore » from the Document Understanding Conferences (DUC) along with the best known automatic metric for summarization system evaluation, ROUGE. Although the DUC data and evaluations were originally designed to test multidocument summarization, we developed a framework to extend it to the task of evaluation for each of the three components: query, clustering, and summarization. Under this framework, we then demonstrate that the QCS system (end-to-end) achieves performance as good as or better than the best summarization engines. Given a query, QCS retrieves relevant documents, separates the retrieved documents into topic clusters, and creates a single summary for each cluster. In the current implementation, Latent Semantic Indexing is used for retrieval, generalized spherical k-means is used for the document clustering, and a method coupling sentence ''trimming'', and a hidden Markov model, followed by a pivoted QR decomposition, is used to create a single extract summary for each cluster. The user interface is designed to provide access to detailed information in a compact and useful format. Our system demonstrates the feasibility of assembling an effective IR system from existing software libraries, the usefulness of the modularity of the design, and the value of this particular combination of modules.« less
QCS: a system for querying, clustering and summarizing documents.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dunlavy, Daniel M.; Schlesinger, Judith D.; O'Leary, Dianne P.
2006-10-01
Information retrieval systems consist of many complicated components. Research and development of such systems is often hampered by the difficulty in evaluating how each particular component would behave across multiple systems. We present a novel hybrid information retrieval system--the Query, Cluster, Summarize (QCS) system--which is portable, modular, and permits experimentation with different instantiations of each of the constituent text analysis components. Most importantly, the combination of the three types of components in the QCS design improves retrievals by providing users more focused information organized by topic. We demonstrate the improved performance by a series of experiments using standard test setsmore » from the Document Understanding Conferences (DUC) along with the best known automatic metric for summarization system evaluation, ROUGE. Although the DUC data and evaluations were originally designed to test multidocument summarization, we developed a framework to extend it to the task of evaluation for each of the three components: query, clustering, and summarization. Under this framework, we then demonstrate that the QCS system (end-to-end) achieves performance as good as or better than the best summarization engines. Given a query, QCS retrieves relevant documents, separates the retrieved documents into topic clusters, and creates a single summary for each cluster. In the current implementation, Latent Semantic Indexing is used for retrieval, generalized spherical k-means is used for the document clustering, and a method coupling sentence 'trimming', and a hidden Markov model, followed by a pivoted QR decomposition, is used to create a single extract summary for each cluster. The user interface is designed to provide access to detailed information in a compact and useful format. Our system demonstrates the feasibility of assembling an effective IR system from existing software libraries, the usefulness of the modularity of the design, and the value of this particular combination of modules.« less
Active Exploration of Large 3D Model Repositories.
Gao, Lin; Cao, Yan-Pei; Lai, Yu-Kun; Huang, Hao-Zhi; Kobbelt, Leif; Hu, Shi-Min
2015-12-01
With broader availability of large-scale 3D model repositories, the need for efficient and effective exploration becomes more and more urgent. Existing model retrieval techniques do not scale well with the size of the database since often a large number of very similar objects are returned for a query, and the possibilities to refine the search are quite limited. We propose an interactive approach where the user feeds an active learning procedure by labeling either entire models or parts of them as "like" or "dislike" such that the system can automatically update an active set of recommended models. To provide an intuitive user interface, candidate models are presented based on their estimated relevance for the current query. From the methodological point of view, our main contribution is to exploit not only the similarity between a query and the database models but also the similarities among the database models themselves. We achieve this by an offline pre-processing stage, where global and local shape descriptors are computed for each model and a sparse distance metric is derived that can be evaluated efficiently even for very large databases. We demonstrate the effectiveness of our method by interactively exploring a repository containing over 100 K models.
Miwa, Makoto; Ohta, Tomoko; Rak, Rafal; Rowley, Andrew; Kell, Douglas B.; Pyysalo, Sampo; Ananiadou, Sophia
2013-01-01
Motivation: To create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature. These methods should identify and order documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge. Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches. Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The success of the query extraction and ranking methods are used to update our existing pathway search system, PathText. Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/. Contact: makoto.miwa@manchester.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23813008
Forecasting influenza in Hong Kong with Google search queries and statistical model fusion.
Xu, Qinneng; Gel, Yulia R; Ramirez Ramirez, L Leticia; Nezafati, Kusha; Zhang, Qingpeng; Tsui, Kwok-Leung
2017-01-01
The objective of this study is to investigate predictive utility of online social media and web search queries, particularly, Google search data, to forecast new cases of influenza-like-illness (ILI) in general outpatient clinics (GOPC) in Hong Kong. To mitigate the impact of sensitivity to self-excitement (i.e., fickle media interest) and other artifacts of online social media data, in our approach we fuse multiple offline and online data sources. Four individual models: generalized linear model (GLM), least absolute shrinkage and selection operator (LASSO), autoregressive integrated moving average (ARIMA), and deep learning (DL) with Feedforward Neural Networks (FNN) are employed to forecast ILI-GOPC both one week and two weeks in advance. The covariates include Google search queries, meteorological data, and previously recorded offline ILI. To our knowledge, this is the first study that introduces deep learning methodology into surveillance of infectious diseases and investigates its predictive utility. Furthermore, to exploit the strength from each individual forecasting models, we use statistical model fusion, using Bayesian model averaging (BMA), which allows a systematic integration of multiple forecast scenarios. For each model, an adaptive approach is used to capture the recent relationship between ILI and covariates. DL with FNN appears to deliver the most competitive predictive performance among the four considered individual models. Combing all four models in a comprehensive BMA framework allows to further improve such predictive evaluation metrics as root mean squared error (RMSE) and mean absolute predictive error (MAPE). Nevertheless, DL with FNN remains the preferred method for predicting locations of influenza peaks. The proposed approach can be viewed a feasible alternative to forecast ILI in Hong Kong or other countries where ILI has no constant seasonal trend and influenza data resources are limited. The proposed methodology is easily tractable and computationally efficient.
TopFed: TCGA tailored federated query processing and linking to LOD.
Saleem, Muhammad; Padmanabhuni, Shanmukha S; Ngomo, Axel-Cyrille Ngonga; Iqbal, Aftab; Almeida, Jonas S; Decker, Stefan; Deus, Helena F
2014-01-01
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. Therefore, it is making it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis. We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Format (RDF), link it to relevant datasets in the Linked Open Data (LOD) cloud and further propose an efficient data distribution strategy to host the resulting 20.4 billion triples data via several SPARQL endpoints. Having the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these SPARQL endpoints by proposing a TCGA tailored federated SPARQL query processing engine named TopFed. We compare TopFed with a well established federation engine FedX in terms of source selection and query execution time by using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall) with query execution time equal to one third to that of FedX. With TopFed, we aim to offer biomedical scientists a single-point-of-access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA as the amount and diversity of data exceeds the ability of local resources to handle its retrieval and parsing.
Implicit learning in cotton-top tamarins (Saguinus oedipus) and pigeons (Columba livia).
Locurto, Charles; Fox, Maura; Mazzella, Andrea
2015-06-01
There is considerable interest in the conditions under which human subjects learn patterned information without explicit instructions to learn that information. This form of learning, termed implicit or incidental learning, can be approximated in nonhumans by exposing subjects to patterned information but delivering reinforcement randomly, thereby not requiring the subjects to learn the information in order to be reinforced. Following acquisition, nonhuman subjects are queried as to what they have learned about the patterned information. In the present experiment, we extended the study of implicit learning in nonhumans by comparing two species, cotton-top tamarins (Saguinus oedipus) and pigeons (Columba livia), on an implicit learning task that used an artificial grammar to generate the patterned elements for training. We equated the conditions of training and testing as much as possible between the two species. The results indicated that both species demonstrated approximately the same magnitude of implicit learning, judged both by a random test and by choice tests between pairs of training elements. This finding suggests that the ability to extract patterned information from situations in which such learning is not demanded is of longstanding origin.
CoMetaR: A Collaborative Metadata Repository for Biomedical Research Networks.
Stöhr, Mark R; Helm, Gudrun; Majeed, Raphael W; Günther, Andreas
2017-01-01
The German Center for Lung Research (DZL) is a research network with the aim of researching respiratory diseases. To perform consortium-wide queries through one single interface, it requires a uniform conceptual structure. No single terminology covers all our concepts. To achieve a broadly accepted and complete ontology, we developed a platform for collaborative metadata management "CoMetaR". Anyone can browse and discuss the ontology while editing can be performed by authenticated users.
Ong, Edison; Xiang, Zuoshuang; Zhao, Bin; Liu, Yue; Lin, Yu; Zheng, Jie; Mungall, Chris; Courtot, Mélanie; Ruttenberg, Alan; He, Yongqun
2017-01-04
Linked Data (LD) aims to achieve interconnected data by representing entities using Unified Resource Identifiers (URIs), and sharing information using Resource Description Frameworks (RDFs) and HTTP. Ontologies, which logically represent entities and relations in specific domains, are the basis of LD. Ontobee (http://www.ontobee.org/) is a linked ontology data server that stores ontology information using RDF triple store technology and supports query, visualization and linkage of ontology terms. Ontobee is also the default linked data server for publishing and browsing biomedical ontologies in the Open Biological Ontology (OBO) Foundry (http://obofoundry.org) library. Ontobee currently hosts more than 180 ontologies (including 131 OBO Foundry Library ontologies) with over four million terms. Ontobee provides a user-friendly web interface for querying and visualizing the details and hierarchy of a specific ontology term. Using the eXtensible Stylesheet Language Transformation (XSLT) technology, Ontobee is able to dereference a single ontology term URI, and then output RDF/eXtensible Markup Language (XML) for computer processing or display the HTML information on a web browser for human users. Statistics and detailed information are generated and displayed for each ontology listed in Ontobee. In addition, a SPARQL web interface is provided for custom advanced SPARQL queries of one or multiple ontologies. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Query3d: a new method for high-throughput analysis of functional residues in protein structures.
Ausiello, Gabriele; Via, Allegra; Helmer-Citterich, Manuela
2005-12-01
The identification of local similarities between two protein structures can provide clues of a common function. Many different methods exist for searching for similar subsets of residues in proteins of known structure. However, the lack of functional and structural information on single residues, together with the low level of integration of this information in comparison methods, is a limitation that prevents these methods from being fully exploited in high-throughput analyses. Here we describe Query3d, a program that is both a structural DBMS (Database Management System) and a local comparison method. The method conserves a copy of all the residues of the Protein Data Bank annotated with a variety of functional and structural information. New annotations can be easily added from a variety of methods and known databases. The algorithm makes it possible to create complex queries based on the residues' function and then to compare only subsets of the selected residues. Functional information is also essential to speed up the comparison and the analysis of the results. With Query3d, users can easily obtain statistics on how many and which residues share certain properties in all proteins of known structure. At the same time, the method also finds their structural neighbours in the whole PDB. Programs and data can be accessed through the PdbFun web interface.
Query3d: a new method for high-throughput analysis of functional residues in protein structures
Ausiello, Gabriele; Via, Allegra; Helmer-Citterich, Manuela
2005-01-01
Background The identification of local similarities between two protein structures can provide clues of a common function. Many different methods exist for searching for similar subsets of residues in proteins of known structure. However, the lack of functional and structural information on single residues, together with the low level of integration of this information in comparison methods, is a limitation that prevents these methods from being fully exploited in high-throughput analyses. Results Here we describe Query3d, a program that is both a structural DBMS (Database Management System) and a local comparison method. The method conserves a copy of all the residues of the Protein Data Bank annotated with a variety of functional and structural information. New annotations can be easily added from a variety of methods and known databases. The algorithm makes it possible to create complex queries based on the residues' function and then to compare only subsets of the selected residues. Functional information is also essential to speed up the comparison and the analysis of the results. Conclusion With Query3d, users can easily obtain statistics on how many and which residues share certain properties in all proteins of known structure. At the same time, the method also finds their structural neighbours in the whole PDB. Programs and data can be accessed through the PdbFun web interface. PMID:16351754
Cui, Xuefeng; Lu, Zhiwu; Wang, Sheng; Jing-Yan Wang, Jim; Gao, Xin
2016-06-15
Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods. Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx : xin.gao@kaust.edu.sa Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
FoldMiner and LOCK 2: protein structure comparison and motif discovery on the web.
Shapiro, Jessica; Brutlag, Douglas
2004-07-01
The FoldMiner web server (http://foldminer.stanford.edu/) provides remote access to methods for protein structure alignment and unsupervised motif discovery. FoldMiner is unique among such algorithms in that it improves both the motif definition and the sensitivity of a structural similarity search by combining the search and motif discovery methods and using information from each process to enhance the other. In a typical run, a query structure is aligned to all structures in one of several databases of single domain targets in order to identify its structural neighbors and to discover a motif that is the basis for the similarity among the query and statistically significant targets. This process is fully automated, but options for manual refinement of the results are available as well. The server uses the Chime plugin and customized controls to allow for visualization of the motif and of structural superpositions. In addition, we provide an interface to the LOCK 2 algorithm for rapid alignments of a query structure to smaller numbers of user-specified targets.
Pharmit: interactive exploration of chemical space.
Sunseri, Jocelyn; Koes, David Ryan
2016-07-08
Pharmit (http://pharmit.csb.pitt.edu) provides an online, interactive environment for the virtual screening of large compound databases using pharmacophores, molecular shape and energy minimization. Users can import, create and edit virtual screening queries in an interactive browser-based interface. Queries are specified in terms of a pharmacophore, a spatial arrangement of the essential features of an interaction, and molecular shape. Search results can be further ranked and filtered using energy minimization. In addition to a number of pre-built databases of popular compound libraries, users may submit their own compound libraries for screening. Pharmit uses state-of-the-art sub-linear algorithms to provide interactive screening of millions of compounds. Queries typically take a few seconds to a few minutes depending on their complexity. This allows users to iteratively refine their search during a single session. The easy access to large chemical datasets provided by Pharmit simplifies and accelerates structure-based drug design. Pharmit is available under a dual BSD/GPL open-source license. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Pattern Discovery and Change Detection of Online Music Query Streams
NASA Astrophysics Data System (ADS)
Li, Hua-Fu
In this paper, an efficient stream mining algorithm, called FTP-stream (Frequent Temporal Pattern mining of streams), is proposed to find the frequent temporal patterns over melody sequence streams. In the framework of our proposed algorithm, an effective bit-sequence representation is used to reduce the time and memory needed to slide the windows. The FTP-stream algorithm can calculate the support threshold in only a single pass based on the concept of bit-sequence representation. It takes the advantage of "left" and "and" operations of the representation. Experiments show that the proposed algorithm only scans the music query stream once, and runs significant faster and consumes less memory than existing algorithms, such as SWFI-stream and Moment.
Sarica, Alessia; Cerasa, Antonio; Quattrone, Aldo
2017-01-01
Objective: Machine learning classification has been the most important computational development in the last years to satisfy the primary need of clinicians for automatic early diagnosis and prognosis. Nowadays, Random Forest (RF) algorithm has been successfully applied for reducing high dimensional and multi-source data in many scientific realms. Our aim was to explore the state of the art of the application of RF on single and multi-modal neuroimaging data for the prediction of Alzheimer's disease. Methods: A systematic review following PRISMA guidelines was conducted on this field of study. In particular, we constructed an advanced query using boolean operators as follows: ("random forest" OR "random forests") AND neuroimaging AND ("alzheimer's disease" OR alzheimer's OR alzheimer) AND (prediction OR classification) . The query was then searched in four well-known scientific databases: Pubmed, Scopus, Google Scholar and Web of Science. Results: Twelve articles-published between the 2007 and 2017-have been included in this systematic review after a quantitative and qualitative selection. The lesson learnt from these works suggest that when RF was applied on multi-modal data for prediction of Alzheimer's disease (AD) conversion from the Mild Cognitive Impairment (MCI), it produces one of the best accuracies to date. Moreover, the RF has important advantages in terms of robustness to overfitting, ability to handle highly non-linear data, stability in the presence of outliers and opportunity for efficient parallel processing mainly when applied on multi-modality neuroimaging data, such as, MRI morphometric, diffusion tensor imaging, and PET images. Conclusions: We discussed the strengths of RF, considering also possible limitations and by encouraging further studies on the comparisons of this algorithm with other commonly used classification approaches, particularly in the early prediction of the progression from MCI to AD.
Space environment data storage and access: lessons learned and recommendations for the future
NASA Astrophysics Data System (ADS)
Evans, Hugh; Heynderickx, Daniel
2012-07-01
With the ever increasing volume of space environment data available at present and planned for the near future, the demands on data storage and access methods are increasing as well. In addition, continued access to historical, archived data remains crucial. On the basis of many years of experience, the authors identify the following issues as important for continued and efficient handling of datasets now and in the future: The huge data volumes currently or very soon avaiable from a number of space missions will limi direct Internet download access to even relatively short epoch ranges of data. Therefore, data providers should establish or extend standardised data (post-) processing services so that only data query results should be downloaded. Although a single standardised data format will in all likelihood remain utopia, data providers should at least include extensive metadata with their data products, according to established standards and practices (e.g. ISTP, SPASE). Standardisation of (sets of) metadata greatly facilitates data mining and querying. The use of SQL database storage should be considered instead of, or in parallel with, classic storage of data files. The use of SQL does away with having to handle file parsing and processing, while at the same time standard access protocols can be used to (remotely) connect to such data repositories. Many data holdings are still lacking in extensive descriptions of data provenance (e.g. instrument description), content and format. Unfortunately, detailed data information is usually rejected by scientific and technical journals. Re-processing of historical archived datasets into modern formats, making them easily available and usable, is urgently required, as knowledge is being lost. A global data directory has still not been achieved; policy makers should enforce stricter rules for "broadcasting" dataset information.
Retrieving high-resolution images over the Internet from an anatomical image database
NASA Astrophysics Data System (ADS)
Strupp-Adams, Annette; Henderson, Earl
1999-12-01
The Visible Human Data set is an important contribution to the national collection of anatomical images. To enhance the availability of these images, the National Library of Medicine has supported the design and development of a prototype object-oriented image database which imports, stores, and distributes high resolution anatomical images in both pixel and voxel formats. One of the key database modules is its client-server Internet interface. This Web interface provides a query engine with retrieval access to high-resolution anatomical images that range in size from 100KB for browser viewable rendered images, to 1GB for anatomical structures in voxel file formats. The Web query and retrieval client-server system is composed of applet GUIs, servlets, and RMI application modules which communicate with each other to allow users to query for specific anatomical structures, and retrieve image data as well as associated anatomical images from the database. Selected images can be downloaded individually as single files via HTTP or downloaded in batch-mode over the Internet to the user's machine through an applet that uses Netscape's Object Signing mechanism. The image database uses ObjectDesign's object-oriented DBMS, ObjectStore that has a Java interface. The query and retrieval systems has been tested with a Java-CDE window system, and on the x86 architecture using Windows NT 4.0. This paper describes the Java applet client search engine that queries the database; the Java client module that enables users to view anatomical images online; the Java application server interface to the database which organizes data returned to the user, and its distribution engine that allow users to download image files individually and/or in batch-mode.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Harber, K.S.
1993-05-01
This report contains the following papers: Implications in vivid logic; a self-learning bayesian expert system; a natural language generation system for a heterogeneous distributed database system; competence-switching'' managed by intelligent systems; strategy acquisition by an artificial neural network: Experiments in learning to play a stochastic game; viewpoints and selective inheritance in object-oriented modeling; multivariate discretization of continuous attributes for machine learning; utilization of the case-based reasoning method to resolve dynamic problems; formalization of an ontology of ceramic science in CLASSIC; linguistic tools for intelligent systems; an application of rough sets in knowledge synthesis; and a relational model for imprecise queries.more » These papers have been indexed separately.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Harber, K.S.
1993-05-01
This report contains the following papers: Implications in vivid logic; a self-learning Bayesian Expert System; a natural language generation system for a heterogeneous distributed database system; ``competence-switching`` managed by intelligent systems; strategy acquisition by an artificial neural network: Experiments in learning to play a stochastic game; viewpoints and selective inheritance in object-oriented modeling; multivariate discretization of continuous attributes for machine learning; utilization of the case-based reasoning method to resolve dynamic problems; formalization of an ontology of ceramic science in CLASSIC; linguistic tools for intelligent systems; an application of rough sets in knowledge synthesis; and a relational model for imprecise queries.more » These papers have been indexed separately.« less
A Survey of Kurdish Students' Sound Segment & Syllabic Pattern Errors in the Course of Learning EFL
ERIC Educational Resources Information Center
Mohammadi, Jahangir
2014-01-01
This paper is devoted to finding adequate answers to the following queries: (A) what are the segmental and syllabic pattern errors made by Kurdish students in their pronunciation? (B) Can the problematic areas in pronunciation be predicted by a systematic comparison of the sound systems of both native and target languages? (C) Can there be any…
Griffin: A Tool for Symbolic Inference of Synchronous Boolean Molecular Networks.
Muñoz, Stalin; Carrillo, Miguel; Azpeitia, Eugenio; Rosenblueth, David A
2018-01-01
Boolean networks are important models of biochemical systems, located at the high end of the abstraction spectrum. A number of Boolean gene networks have been inferred following essentially the same method. Such a method first considers experimental data for a typically underdetermined "regulation" graph. Next, Boolean networks are inferred by using biological constraints to narrow the search space, such as a desired set of (fixed-point or cyclic) attractors. We describe Griffin , a computer tool enhancing this method. Griffin incorporates a number of well-established algorithms, such as Dubrova and Teslenko's algorithm for finding attractors in synchronous Boolean networks. In addition, a formal definition of regulation allows Griffin to employ "symbolic" techniques, able to represent both large sets of network states and Boolean constraints. We observe that when the set of attractors is required to be an exact set, prohibiting additional attractors, a naive Boolean coding of this constraint may be unfeasible. Such cases may be intractable even with symbolic methods, as the number of Boolean constraints may be astronomically large. To overcome this problem, we employ an Artificial Intelligence technique known as "clause learning" considerably increasing Griffin 's scalability. Without clause learning only toy examples prohibiting additional attractors are solvable: only one out of seven queries reported here is answered. With clause learning, by contrast, all seven queries are answered. We illustrate Griffin with three case studies drawn from the Arabidopsis thaliana literature. Griffin is available at: http://turing.iimas.unam.mx/griffin.
An investigative, cooperative learning approach to the general microbiology laboratory.
Seifert, Kyle; Fenster, Amy; Dilts, Judith A; Temple, Louise
2009-01-01
Investigative- and cooperative-based learning strategies have been used effectively in a variety of classrooms to enhance student learning and engagement. In the General Microbiology laboratory for juniors and seniors at James Madison University, these strategies were combined to make a semester-long, investigative, cooperative learning experience involving culture and identification of microbial isolates that the students obtained from various environments. To assess whether this strategy was successful, students were asked to complete a survey at the beginning and at the end of the semester regarding their comfort level with a variety of topics. For most of the topics queried, the students reported that their comfort had increased significantly during the semester. Furthermore, this group of students thought that the quality of this investigative lab experience was much better than that of any of their previous lab experiences.
An Investigative, Cooperative Learning Approach to the General Microbiology Laboratory
Seifert, Kyle; Fenster, Amy; Dilts, Judith A.
2009-01-01
Investigative- and cooperative-based learning strategies have been used effectively in a variety of classrooms to enhance student learning and engagement. In the General Microbiology laboratory for juniors and seniors at James Madison University, these strategies were combined to make a semester-long, investigative, cooperative learning experience involving culture and identification of microbial isolates that the students obtained from various environments. To assess whether this strategy was successful, students were asked to complete a survey at the beginning and at the end of the semester regarding their comfort level with a variety of topics. For most of the topics queried, the students reported that their comfort had increased significantly during the semester. Furthermore, this group of students thought that the quality of this investigative lab experience was much better than that of any of their previous lab experiences. PMID:19487504
The iMars web-GIS - spatio-temporal data queries and single image web map services
NASA Astrophysics Data System (ADS)
Walter, S. H. G.; Steikert, R.; Schreiner, B.; Sidiropoulos, P.; Tao, Y.; Muller, J.-P.; Putry, A. R. D.; van Gasselt, S.
2017-09-01
We introduce a new approach for a system dedicated to planetary surface change detection by simultaneous visualisation of single-image time series in a multi-temporal context. In the context of the EU FP-7 iMars project we process and ingest vast amounts of automatically co-registered (ACRO) images. The base of the co-registration are the high precision HRSC multi-orbit quadrangle image mosaics, which are based on bundle-block-adjusted multi-orbit HRSC DTMs.
Deep Multimodal Distance Metric Learning Using Click Constraints for Image Ranking.
Yu, Jun; Yang, Xiaokang; Gao, Fei; Tao, Dacheng
2017-12-01
How do we retrieve images accurately? Also, how do we rank a group of images precisely and efficiently for specific queries? These problems are critical for researchers and engineers to generate a novel image searching engine. First, it is important to obtain an appropriate description that effectively represent the images. In this paper, multimodal features are considered for describing images. The images unique properties are reflected by visual features, which are correlated to each other. However, semantic gaps always exist between images visual features and semantics. Therefore, we utilize click feature to reduce the semantic gap. The second key issue is learning an appropriate distance metric to combine these multimodal features. This paper develops a novel deep multimodal distance metric learning (Deep-MDML) method. A structured ranking model is adopted to utilize both visual and click features in distance metric learning (DML). Specifically, images and their related ranking results are first collected to form the training set. Multimodal features, including click and visual features, are collected with these images. Next, a group of autoencoders is applied to obtain initially a distance metric in different visual spaces, and an MDML method is used to assign optimal weights for different modalities. Next, we conduct alternating optimization to train the ranking model, which is used for the ranking of new queries with click features. Compared with existing image ranking methods, the proposed method adopts a new ranking model to use multimodal features, including click features and visual features in DML. We operated experiments to analyze the proposed Deep-MDML in two benchmark data sets, and the results validate the effects of the method.
NASA Astrophysics Data System (ADS)
Richards, Joseph W.; Starr, Dan L.; Brink, Henrik; Miller, Adam A.; Bloom, Joshua S.; Butler, Nathaniel R.; James, J. Berian; Long, James P.; Rice, John
2012-01-01
Despite the great promise of machine-learning algorithms to classify and predict astrophysical parameters for the vast numbers of astrophysical sources and transients observed in large-scale surveys, the peculiarities of the training data often manifest as strongly biased predictions on the data of interest. Typically, training sets are derived from historical surveys of brighter, more nearby objects than those from more extensive, deeper surveys (testing data). This sample selection bias can cause catastrophic errors in predictions on the testing data because (1) standard assumptions for machine-learned model selection procedures break down and (2) dense regions of testing space might be completely devoid of training data. We explore possible remedies to sample selection bias, including importance weighting, co-training, and active learning (AL). We argue that AL—where the data whose inclusion in the training set would most improve predictions on the testing set are queried for manual follow-up—is an effective approach and is appropriate for many astronomical applications. For a variable star classification problem on a well-studied set of stars from Hipparcos and Optical Gravitational Lensing Experiment, AL is the optimal method in terms of error rate on the testing data, beating the off-the-shelf classifier by 3.4% and the other proposed methods by at least 3.0%. To aid with manual labeling of variable stars, we developed a Web interface which allows for easy light curve visualization and querying of external databases. Finally, we apply AL to classify variable stars in the All Sky Automated Survey, finding dramatic improvement in our agreement with the ASAS Catalog of Variable Stars, from 65.5% to 79.5%, and a significant increase in the classifier's average confidence for the testing set, from 14.6% to 42.9%, after a few AL iterations.
Science.gov: gateway to government science information.
Fitzpatrick, Roberta Bronson
2010-01-01
Science.gov is a portal to more than 40 scientific databases and 200 million pages of science information via a single query. It connects users to science information and research results from the U.S. government. This column will provide readers with an overview of the resource, as well as basic search hints.
Automated Ontology Alignment with Fuselets for Community of Interest (COI) Integration
2008-09-01
Search Example ............................................................................... 22 Figure 8 - Federated Search Example Revisited...integrating information from various sources through a single query. This is the traditional federated search problem, where the sources don’t...Figure 7 - Federated Search Example For the data sources in the graphic above, the ontologies align in a fairly straightforward manner
Indexing and retrieving DICOM data in disperse and unstructured archives.
Costa, Carlos; Freitas, Filipe; Pereira, Marco; Silva, Augusto; Oliveira, José L
2009-01-01
This paper proposes an indexing and retrieval solution to gather information from distributed DICOM documents by allowing searches and access to the virtual data repository using a Google-like process. The medical imaging modalities are becoming more powerful and less expensive. The result is the proliferation of equipment acquisition by imaging centers, including the small ones. With this dispersion of data, it is not easy to take advantage of all the information that can be retrieved from these studies. Furthermore, many of these small centers do not have large enough requirements to justify the acquisition of a traditional PACS. A peer-to-peer PACS platform to index and query DICOM files over a set of distributed repositories that are logically viewed as a single federated unit. The solution is based on a public domain document-indexing engine and extends traditional PACS query and retrieval mechanisms. This proposal deals well with complex searching requirements, from a single desktop environment to distributed scenarios. The solution performance and robustness were demonstrated in trials. The characteristics of presented PACS platform make it particularly important for small institutions, including educational and research groups.
An Information Retrieval and Recommendation System for Astronomical Observatories
NASA Astrophysics Data System (ADS)
Mukund, Nikhil; Thakur, Saurabh; Abraham, Sheelu; Aniyan, A. K.; Mitra, Sanjit; Sajeeth Philip, Ninan; Vaghmare, Kaustubh; Acharjya, D. P.
2018-03-01
We present a machine-learning-based information retrieval system for astronomical observatories that tries to address user-defined queries related to an instrument. In the modern instrumentation scenario where heterogeneous systems and talents are simultaneously at work, the ability to supply people with the right information helps speed up the tasks for detector operation, maintenance, and upgradation. The proposed method analyzes existing documented efforts at the site to intelligently group related information to a query and to present it online to the user. The user in response can probe the suggested content and explore previously developed solutions or probable ways to address the present situation optimally. We demonstrate natural language-processing-backed knowledge rediscovery by making use of the open source logbook data from the Laser Interferometric Gravitational Observatory (LIGO). We implement and test a web application that incorporates the above idea for LIGO Livingston, LIGO Hanford, and Virgo observatories.
An energy-aware routing protocol for query-based applications in wireless sensor networks.
Ahvar, Ehsan; Ahvar, Shohreh; Lee, Gyu Myoung; Crespi, Noel
2014-01-01
Wireless sensor network (WSN) typically has energy consumption restriction. Designing energy-aware routing protocol can significantly reduce energy consumption in WSNs. Energy-aware routing protocols can be classified into two categories, energy savers and energy balancers. Energy saving protocols are used to minimize the overall energy consumed by a WSN, while energy balancing protocols attempt to efficiently distribute the consumption of energy throughout the network. In general terms, energy saving protocols are not necessarily good at balancing energy consumption and energy balancing protocols are not always good at reducing energy consumption. In this paper, we propose an energy-aware routing protocol (ERP) for query-based applications in WSNs, which offers a good trade-off between traditional energy balancing and energy saving objectives and supports a soft real time packet delivery. This is achieved by means of fuzzy sets and learning automata techniques along with zonal broadcasting to decrease total energy consumption.
Diamond Eye: a distributed architecture for image data mining
NASA Astrophysics Data System (ADS)
Burl, Michael C.; Fowlkes, Charless; Roden, Joe; Stechert, Andre; Mukhtar, Saleem
1999-02-01
Diamond Eye is a distributed software architecture, which enables users (scientists) to analyze large image collections by interacting with one or more custom data mining servers via a Java applet interface. Each server is coupled with an object-oriented database and a computational engine, such as a network of high-performance workstations. The database provides persistent storage and supports querying of the 'mined' information. The computational engine provides parallel execution of expensive image processing, object recognition, and query-by-content operations. Key benefits of the Diamond Eye architecture are: (1) the design promotes trial evaluation of advanced data mining and machine learning techniques by potential new users (all that is required is to point a web browser to the appropriate URL), (2) software infrastructure that is common across a range of science mining applications is factored out and reused, and (3) the system facilitates closer collaborations between algorithm developers and domain experts.
An Energy-Aware Routing Protocol for Query-Based Applications in Wireless Sensor Networks
Crespi, Noel
2014-01-01
Wireless sensor network (WSN) typically has energy consumption restriction. Designing energy-aware routing protocol can significantly reduce energy consumption in WSNs. Energy-aware routing protocols can be classified into two categories, energy savers and energy balancers. Energy saving protocols are used to minimize the overall energy consumed by a WSN, while energy balancing protocols attempt to efficiently distribute the consumption of energy throughout the network. In general terms, energy saving protocols are not necessarily good at balancing energy consumption and energy balancing protocols are not always good at reducing energy consumption. In this paper, we propose an energy-aware routing protocol (ERP) for query-based applications in WSNs, which offers a good trade-off between traditional energy balancing and energy saving objectives and supports a soft real time packet delivery. This is achieved by means of fuzzy sets and learning automata techniques along with zonal broadcasting to decrease total energy consumption. PMID:24696640
Array Processing in the Cloud: the rasdaman Approach
NASA Astrophysics Data System (ADS)
Merticariu, Vlad; Dumitru, Alex
2015-04-01
The multi-dimensional array data model is gaining more and more attention when dealing with Big Data challenges in a variety of domains such as climate simulations, geographic information systems, medical imaging or astronomical observations. Solutions provided by classical Big Data tools such as Key-Value Stores and MapReduce, as well as traditional relational databases, proved to be limited in domains associated with multi-dimensional data. This problem has been addressed by the field of array databases, in which systems provide database services for raster data, without imposing limitations on the number of dimensions that a dataset can have. Examples of datasets commonly handled by array databases include 1-dimensional sensor data, 2-D satellite imagery, 3-D x/y/t image time series as well as x/y/z geophysical voxel data, and 4-D x/y/z/t weather data. And this can grow as large as simulations of the whole universe when it comes to astrophysics. rasdaman is a well established array database, which implements many optimizations for dealing with large data volumes and operation complexity. Among those, the latest one is intra-query parallelization support: a network of machines collaborate for answering a single array database query, by dividing it into independent sub-queries sent to different servers. This enables massive processing speed-ups, which promise solutions to research challenges on multi-Petabyte data cubes. There are several correlated factors which influence the speedup that intra-query parallelisation brings: the number of servers, the capabilities of each server, the quality of the network, the availability of the data to the server that needs it in order to compute the result and many more. In the effort of adapting the engine to cloud processing patterns, two main components have been identified: one that handles communication and gathers information about the arrays sitting on every server, and a processing unit responsible with dividing work among available nodes and executing operations on local data. The federation daemon collects and stores statistics from the other network nodes and provides real time updates about local changes. Information exchanged includes available datasets, CPU load and memory usage per host. The processing component is represented by the rasdaman server. Using information from the federation daemon it breaks queries into subqueries to be executed on peer nodes, ships them, and assembles the intermediate results. Thus, we define a rasdaman network node as a pair of a federation daemon and a rasdaman server. Any node can receive a query and will subsequently act as this query's dispatcher, so all peers are at the same level and there is no single point of failure. Should a node become inaccessible then the peers will recognize this and will not any longer consider this peer for distribution. Conversely, a peer at any time can join the network. To assess the feasibility of our approach, we deployed a rasdaman network in the Amazon Elastic Cloud environment on 1001 nodes, and observed that this feature can greatly increase the performance and scalability of the system, offering a large throughput of processed data.
Unstructured medical image query using big data - An epilepsy case study.
Istephan, Sarmad; Siadat, Mohammad-Reza
2016-02-01
Big data technologies are critical to the medical field which requires new frameworks to leverage them. Such frameworks would benefit medical experts to test hypotheses by querying huge volumes of unstructured medical data to provide better patient care. The objective of this work is to implement and examine the feasibility of having such a framework to provide efficient querying of unstructured data in unlimited ways. The feasibility study was conducted specifically in the epilepsy field. The proposed framework evaluates a query in two phases. In phase 1, structured data is used to filter the clinical data warehouse. In phase 2, feature extraction modules are executed on the unstructured data in a distributed manner via Hadoop to complete the query. Three modules have been created, volume comparer, surface to volume conversion and average intensity. The framework allows for user-defined modules to be imported to provide unlimited ways to process the unstructured data hence potentially extending the application of this framework beyond epilepsy field. Two types of criteria were used to validate the feasibility of the proposed framework - the ability/accuracy of fulfilling an advanced medical query and the efficiency that Hadoop provides. For the first criterion, the framework executed an advanced medical query that spanned both structured and unstructured data with accurate results. For the second criterion, different architectures were explored to evaluate the performance of various Hadoop configurations and were compared to a traditional Single Server Architecture (SSA). The surface to volume conversion module performed up to 40 times faster than the SSA (using a 20 node Hadoop cluster) and the average intensity module performed up to 85 times faster than the SSA (using a 40 node Hadoop cluster). Furthermore, the 40 node Hadoop cluster executed the average intensity module on 10,000 models in 3h which was not even practical for the SSA. The current study is limited to epilepsy field and further research and more feature extraction modules are required to show its applicability in other medical domains. The proposed framework advances data-driven medicine by unleashing the content of unstructured medical data in an efficient and unlimited way to be harnessed by medical experts. Copyright © 2015 Elsevier Inc. All rights reserved.
KARL: A Knowledge-Assisted Retrieval Language. M.S. Thesis Final Report, 1 Jul. 1985 - 31 Dec. 1987
NASA Technical Reports Server (NTRS)
Dominick, Wayne D. (Editor); Triantafyllopoulos, Spiros
1985-01-01
Data classification and storage are tasks typically performed by application specialists. In contrast, information users are primarily non-computer specialists who use information in their decision-making and other activities. Interaction efficiency between such users and the computer is often reduced by machine requirements and resulting user reluctance to use the system. This thesis examines the problems associated with information retrieval for non-computer specialist users, and proposes a method for communicating in restricted English that uses knowledge of the entities involved, relationships between entities, and basic English language syntax and semantics to translate the user requests into formal queries. The proposed method includes an intelligent dictionary, syntax and semantic verifiers, and a formal query generator. In addition, the proposed system has a learning capability that can improve portability and performance. With the increasing demand for efficient human-machine communication, the significance of this thesis becomes apparent. As human resources become more valuable, software systems that will assist in improving the human-machine interface will be needed and research addressing new solutions will be of utmost importance. This thesis presents an initial design and implementation as a foundation for further research and development into the emerging field of natural language database query systems.
PIRIA: a general tool for indexing, search, and retrieval of multimedia content
NASA Astrophysics Data System (ADS)
Joint, Magali; Moellic, Pierre-Alain; Hede, P.; Adam, P.
2004-05-01
The Internet is a continuously expanding source of multimedia content and information. There are many products in development to search, retrieve, and understand multimedia content. But most of the current image search/retrieval engines, rely on a image database manually pre-indexed with keywords. Computers are still powerless to understand the semantic meaning of still or animated image content. Piria (Program for the Indexing and Research of Images by Affinity), the search engine we have developed brings this possibility closer to reality. Piria is a novel search engine that uses the query by example method. A user query is submitted to the system, which then returns a list of images ranked by similarity, obtained by a metric distance that operates on every indexed image signature. These indexed images are compared according to several different classifiers, not only Keywords, but also Form, Color and Texture, taking into account geometric transformations and variance like rotation, symmetry, mirroring, etc. Form - Edges extracted by an efficient segmentation algorithm. Color - Histogram, semantic color segmentation and spatial color relationship. Texture - Texture wavelets and local edge patterns. If required, Piria is also able to fuse results from multiple classifiers with a new classification of index categories: Single Indexer Single Call (SISC), Single Indexer Multiple Call (SIMC), Multiple Indexers Single Call (MISC) or Multiple Indexers Multiple Call (MIMC). Commercial and industrial applications will be explored and discussed as well as current and future development.
ERIC Educational Resources Information Center
Springgay, Stephanie; Rotas, Nikki
2015-01-01
This paper engages with Guattari's query about, how to make a classroom operate like a work of art? Guattari's question is not intended to be prescriptive or dogmatic. Rather, his thinking engenders a way of thinking about art as an affective event that has the capacity to invent new relations and new ways of learning. In the first section, we…
2016-10-01
Manpower Data Center (DMDC) for data extracts identifying monthly deployments from September 2001 through December 2014. This data would answer questions... Manpower Data Center (DMDC) databases captured which service members were mobilized and deployed. Government history offices, lessons learned...develop MOEs and MOPs to conduct assessments. 1. Data Extracts Concurrent with engagement efforts, IDA queried the Defense Manpower Data Center (DMDC
Minimization of annotation work: diagnosis of mammographic masses via active learning
NASA Astrophysics Data System (ADS)
Zhao, Yu; Zhang, Jingyang; Xie, Hongzhi; Zhang, Shuyang; Gu, Lixu
2018-06-01
The prerequisite for establishing an effective prediction system for mammographic diagnosis is the annotation of each mammographic image. The manual annotation work is time-consuming and laborious, which becomes a great hindrance for researchers. In this article, we propose a novel active learning algorithm that can adequately address this problem, leading to the minimization of the labeling costs on the premise of guaranteed performance. Our proposed method is different from the existing active learning methods designed for the general problem as it is specifically designed for mammographic images. Through its modified discriminant functions and improved sample query criteria, the proposed method can fully utilize the pairing of mammographic images and select the most valuable images from both the mediolateral and craniocaudal views. Moreover, in order to extend active learning to the ordinal regression problem, which has no precedent in existing studies, but is essential for mammographic diagnosis (mammographic diagnosis is not only a classification task, but also an ordinal regression task for predicting an ordinal variable, viz. the malignancy risk of lesions), multiple sample query criteria need to be taken into consideration simultaneously. We formulate it as a criteria integration problem and further present an algorithm based on self-adaptive weighted rank aggregation to achieve a good solution. The efficacy of the proposed method was demonstrated on thousands of mammographic images from the digital database for screening mammography. The labeling costs of obtaining optimal performance in the classification and ordinal regression task respectively fell to 33.8 and 19.8 percent of their original costs. The proposed method also generated 1228 wins, 369 ties and 47 losses for the classification task, and 1933 wins, 258 ties and 185 losses for the ordinal regression task compared to the other state-of-the-art active learning algorithms. By taking the particularities of mammographic images, the proposed AL method can indeed reduce the manual annotation work to a great extent without sacrificing the performance of the prediction system for mammographic diagnosis.
Minimization of annotation work: diagnosis of mammographic masses via active learning.
Zhao, Yu; Zhang, Jingyang; Xie, Hongzhi; Zhang, Shuyang; Gu, Lixu
2018-05-22
The prerequisite for establishing an effective prediction system for mammographic diagnosis is the annotation of each mammographic image. The manual annotation work is time-consuming and laborious, which becomes a great hindrance for researchers. In this article, we propose a novel active learning algorithm that can adequately address this problem, leading to the minimization of the labeling costs on the premise of guaranteed performance. Our proposed method is different from the existing active learning methods designed for the general problem as it is specifically designed for mammographic images. Through its modified discriminant functions and improved sample query criteria, the proposed method can fully utilize the pairing of mammographic images and select the most valuable images from both the mediolateral and craniocaudal views. Moreover, in order to extend active learning to the ordinal regression problem, which has no precedent in existing studies, but is essential for mammographic diagnosis (mammographic diagnosis is not only a classification task, but also an ordinal regression task for predicting an ordinal variable, viz. the malignancy risk of lesions), multiple sample query criteria need to be taken into consideration simultaneously. We formulate it as a criteria integration problem and further present an algorithm based on self-adaptive weighted rank aggregation to achieve a good solution. The efficacy of the proposed method was demonstrated on thousands of mammographic images from the digital database for screening mammography. The labeling costs of obtaining optimal performance in the classification and ordinal regression task respectively fell to 33.8 and 19.8 percent of their original costs. The proposed method also generated 1228 wins, 369 ties and 47 losses for the classification task, and 1933 wins, 258 ties and 185 losses for the ordinal regression task compared to the other state-of-the-art active learning algorithms. By taking the particularities of mammographic images, the proposed AL method can indeed reduce the manual annotation work to a great extent without sacrificing the performance of the prediction system for mammographic diagnosis.
Identification of Functionally Related Enzymes by Learning-to-Rank Methods.
Stock, Michiel; Fober, Thomas; Hüllermeier, Eyke; Glinca, Serghei; Klebe, Gerhard; Pahikkala, Tapio; Airola, Antti; De Baets, Bernard; Waegeman, Willem
2014-01-01
Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work, we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.
Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions
2011-01-01
S∩[n](ri + δ) ∙ ∏ j∈S∩[n+1,2n]rj − ∏ i∈ Sri ) | 6 ∑ 16|S|6d|cS | ∙ (∑ ∅6=T⊆S∩[n]δ |T | ∙ ∏ i∈S\\T |ri| ) 6 ∑ 16|S|6d|cS | ∙ 2 |S| ∙ ( δ ∙ ∏ i∈S:ri>1...Diakonikolas, P. Harsha, A. Klivans, R. Meka, P. Raghavendra, R. A. Servedio, and L.-Y. Tan . Bounding the average sensitivity and noise sensitiv- ity of...applications. Information and Computation, 100(1):78–150, 1992. [12] J. Jackson . An efficient membership-query algo- rithm for learning DNF with respect to
Hauser, Susan E; Demner-Fushman, Dina; Jacobs, Joshua L; Humphrey, Susanne M; Ford, Glenn; Thoma, George R
2007-01-01
To evaluate: (1) the effectiveness of wireless handheld computers for online information retrieval in clinical settings; (2) the role of MEDLINE in answering clinical questions raised at the point of care. A prospective single-cohort study: accompanying medical teams on teaching rounds, five internal medicine residents used and evaluated MD on Tap, an application for handheld computers, to seek answers in real time to clinical questions arising at the point of care. All transactions were stored by an intermediate server. Evaluators recorded clinical scenarios and questions, identified MEDLINE citations that answered the questions, and submitted daily and summative reports of their experience. A senior medical librarian corroborated the relevance of the selected citation to each scenario and question. Evaluators answered 68% of 363 background and foreground clinical questions during rounding sessions using a variety of MD on Tap features in an average session length of less than four minutes. The evaluator, the number and quality of query terms, the total number of citations found for a query, and the use of auto-spellcheck significantly contributed to the probability of query success. Handheld computers with Internet access are useful tools for healthcare providers to access MEDLINE in real time. MEDLINE citations can answer specific clinical questions when several medical terms are used to form a query. The MD on Tap application is an effective interface to MEDLINE in clinical settings, allowing clinicians to quickly find relevant citations.
Hauser, Susan E.; Demner-Fushman, Dina; Jacobs, Joshua L.; Humphrey, Susanne M.; Ford, Glenn; Thoma, George R.
2007-01-01
Objective To evaluate: (1) the effectiveness of wireless handheld computers for online information retrieval in clinical settings; (2) the role of MEDLINE® in answering clinical questions raised at the point of care. Design A prospective single-cohort study: accompanying medical teams on teaching rounds, five internal medicine residents used and evaluated MD on Tap, an application for handheld computers, to seek answers in real time to clinical questions arising at the point of care. Measurements All transactions were stored by an intermediate server. Evaluators recorded clinical scenarios and questions, identified MEDLINE citations that answered the questions, and submitted daily and summative reports of their experience. A senior medical librarian corroborated the relevance of the selected citation to each scenario and question. Results Evaluators answered 68% of 363 background and foreground clinical questions during rounding sessions using a variety of MD on Tap features in an average session length of less than four minutes. The evaluator, the number and quality of query terms, the total number of citations found for a query, and the use of auto-spellcheck significantly contributed to the probability of query success. Conclusion Handheld computers with Internet access are useful tools for healthcare providers to access MEDLINE in real time. MEDLINE citations can answer specific clinical questions when several medical terms are used to form a query. The MD on Tap application is an effective interface to MEDLINE in clinical settings, allowing clinicians to quickly find relevant citations. PMID:17712085
Private and Efficient Query Processing on Outsourced Genomic Databases.
Ghasemi, Reza; Al Aziz, Md Momin; Mohammed, Noman; Dehkordi, Massoud Hadian; Jiang, Xiaoqian
2017-09-01
Applications of genomic studies are spreading rapidly in many domains of science and technology such as healthcare, biomedical research, direct-to-consumer services, and legal and forensic. However, there are a number of obstacles that make it hard to access and process a big genomic database for these applications. First, sequencing genomic sequence is a time consuming and expensive process. Second, it requires large-scale computation and storage systems to process genomic sequences. Third, genomic databases are often owned by different organizations, and thus, not available for public usage. Cloud computing paradigm can be leveraged to facilitate the creation and sharing of big genomic databases for these applications. Genomic data owners can outsource their databases in a centralized cloud server to ease the access of their databases. However, data owners are reluctant to adopt this model, as it requires outsourcing the data to an untrusted cloud service provider that may cause data breaches. In this paper, we propose a privacy-preserving model for outsourcing genomic data to a cloud. The proposed model enables query processing while providing privacy protection of genomic databases. Privacy of the individuals is guaranteed by permuting and adding fake genomic records in the database. These techniques allow cloud to evaluate count and top-k queries securely and efficiently. Experimental results demonstrate that a count and a top-k query over 40 Single Nucleotide Polymorphisms (SNPs) in a database of 20 000 records takes around 100 and 150 s, respectively.
Frequently Asked Questions | DOE PAGES
read, download, and analyze at no charge to users. DOE PAGES offers free public access to the best manuscripts are hosted, DOE PAGES enables readers to search them all via a single query. In most cases, free . DOE PAGES offers free public access to the best available full-text version of DOE-funded scholarly
Anavi, Yaron; Kogan, Ilya; Gelbart, Elad; Geva, Ofer; Greenspan, Hayit
2015-08-01
In this work various approaches are investigated for X-ray image retrieval and specifically chest pathology retrieval. Given a query image taken from a data set of 443 images, the objective is to rank images according to similarity. Different features, including binary features, texture features, and deep learning (CNN) features are examined. In addition, two approaches are investigated for the retrieval task. One approach is based on the distance of image descriptors using the above features (hereon termed the "descriptor"-based approach); the second approach ("classification"-based approach) is based on a probability descriptor, generated by a pair-wise classification of each two classes (pathologies) and their decision values using an SVM classifier. Best results are achieved using deep learning features in a classification scheme.
Query Auto-Completion Based on Word2vec Semantic Similarity
NASA Astrophysics Data System (ADS)
Shao, Taihua; Chen, Honghui; Chen, Wanyu
2018-04-01
Query auto-completion (QAC) is the first step of information retrieval, which helps users formulate the entire query after inputting only a few prefixes. Regarding the models of QAC, the traditional method ignores the contribution from the semantic relevance between queries. However, similar queries always express extremely similar search intention. In this paper, we propose a hybrid model FS-QAC based on query semantic similarity as well as the query frequency. We choose word2vec method to measure the semantic similarity between intended queries and pre-submitted queries. By combining both features, our experiments show that FS-QAC model improves the performance when predicting the user’s query intention and helping formulate the right query. Our experimental results show that the optimal hybrid model contributes to a 7.54% improvement in terms of MRR against a state-of-the-art baseline using the public AOL query logs.
GIS tool for California state legislature electoral history
NASA Astrophysics Data System (ADS)
Artham, Swathi
The California State Legislature contains two bodies consisting of the lower house, the California State Assembly, with eighty members, and the upper house, the California State Senate, with forty members. Elections are held for every two years for both Senate and Assembly. The terms of the Senators are staggered so that half the membership is elected every two years, whereas all the Assembly members are elected every two years. The electoral district boundaries vary after every 10-year census. My main objective is to provide a summary of both California State Senate and California State Assembly election results in a single GIS tool, from the years 1970 to 2012. This tool provides information about different trends in the California State Senate and State Assembly elections along the years. This tool was designed to help students, and teachers to interactively learn about the California State Legislature elections. Users can view the election results by selecting a particular year for Senate or Assembly, which results in adding a new layer on the map with a coloring scheme for better understanding of change of parties; red for Republicans, blue for Democrats and green for Independents. Users can click on any district shown on the map using a hotlink tool to see the electoral trends for the districts for the past years. This application provides a powerful Stored Query Language (SQL) query option to enter queries and get election results in the form of tables with various fields. This data can be further used to aid other analysis as per user requirements. This tool also provides various visual statistics using graphs and tables for voter turnout, number of candidates won by each party, number of seats changed from one party to another. It also features a color matrix table that helps users to see trends in California State Senate and Assembly. Every two-year election results are shown in the form of graphs and tables for better understanding by the user. The tool provides two quiz options for users who are willing to test the knowledge they gained using the tool. This tool was developed in JAVA swing and AWT, Map Objects Java Objects (MOJO), Apache Derby, DBF Explorer, HTML5, CSS3 and JavaScript.
EquiX-A Search and Query Language for XML.
ERIC Educational Resources Information Center
Cohen, Sara; Kanza, Yaron; Kogan, Yakov; Sagiv, Yehoshua; Nutt, Werner; Serebrenik, Alexander
2002-01-01
Describes EquiX, a search language for XML that combines querying with searching to query the data and the meta-data content of Web pages. Topics include search engines; a data model for XML documents; search query syntax; search query semantics; an algorithm for evaluating a query on a document; and indexing EquiX queries. (LRW)
Sujansky, Walter V; Faus, Sam A; Stone, Ethan; Brennan, Patricia Flatley
2010-10-01
Online personal health records (PHRs) enable patients to access, manage, and share certain of their own health information electronically. This capability creates the need for precise access-controls mechanisms that restrict the sharing of data to that intended by the patient. The authors describe the design and implementation of an access-control mechanism for PHR repositories that is modeled on the eXtensible Access Control Markup Language (XACML) standard, but intended to reduce the cognitive and computational complexity of XACML. The authors implemented the mechanism entirely in a relational database system using ANSI-standard SQL statements. Based on a set of access-control rules encoded as relational table rows, the mechanism determines via a single SQL query whether a user who accesses patient data from a specific application is authorized to perform a requested operation on a specified data object. Testing of this query on a moderately large database has demonstrated execution times consistently below 100ms. The authors include the details of the implementation, including algorithms, examples, and a test database as Supplementary materials. Copyright © 2010 Elsevier Inc. All rights reserved.
New tools and methods for direct programmatic access to the dbSNP relational database.
Saccone, Scott F; Quan, Jiaxi; Mehta, Gaurang; Bolze, Raphael; Thomas, Prasanth; Deelman, Ewa; Tischfield, Jay A; Rice, John P
2011-01-01
Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale.
Family health history reporting is sensitive to small changes in wording.
Conway-Pearson, Liam S; Christensen, Kurt D; Savage, Sarah K; Huntington, Noelle L; Weitzman, Elissa R; Ziniel, Sonja I; Bacon, Phoebe; Cacioppo, Cara N; Green, Robert C; Holm, Ingrid A
2016-12-01
Family health history is often collected through single-item queries that ask patients whether their family members are affected by certain conditions. The specific wording of these queries may influence what individuals report. Parents of Boston Children's Hospital patients were invited to participate in a Web-based survey about the return of individual genomic research results regarding their children. Participants reported whether 11 types of medical conditions affected them or their family. Randomization determined whether participants were specifically instructed to consider their extended family. Family health history was reported by 2,901 participants. Those asked to consider their extended family were more likely to report a positive family history for 8 of 11 medical conditions. The largest differences were observed for cancer (65.1 vs. 45.7%; P < 0.001), cardiovascular conditions (72.5 vs. 56.0%; P < 0.001), and endocrine/hormonal conditions (50.9 vs. 36.7%; P < 0.001). Small alterations to the way family health history queries are worded can substantially change patient responses. Clinicians and researchers need to be sensitive about patients' tendencies to omit extended family from health history reporting unless specifically asked to consider them.Genet Med 18 12, 1308-1311.
One Shot Detection with Laplacian Object and Fast Matrix Cosine Similarity.
Biswas, Sujoy Kumar; Milanfar, Peyman
2016-03-01
One shot, generic object detection involves searching for a single query object in a larger target image. Relevant approaches have benefited from features that typically model the local similarity patterns. In this paper, we combine local similarity (encoded by local descriptors) with a global context (i.e., a graph structure) of pairwise affinities among the local descriptors, embedding the query descriptors into a low dimensional but discriminatory subspace. Unlike principal components that preserve global structure of feature space, we actually seek a linear approximation to the Laplacian eigenmap that permits us a locality preserving embedding of high dimensional region descriptors. Our second contribution is an accelerated but exact computation of matrix cosine similarity as the decision rule for detection, obviating the computationally expensive sliding window search. We leverage the power of Fourier transform combined with integral image to achieve superior runtime efficiency that allows us to test multiple hypotheses (for pose estimation) within a reasonably short time. Our approach to one shot detection is training-free, and experiments on the standard data sets confirm the efficacy of our model. Besides, low computation cost of the proposed (codebook-free) object detector facilitates rather straightforward query detection in large data sets including movie videos.
Active learning methods for interactive image retrieval.
Gosselin, Philippe Henri; Cord, Matthieu
2008-07-01
Active learning methods have been considered with increased interest in the statistical learning community. Initially developed within a classification framework, a lot of extensions are now being proposed to handle multimedia applications. This paper provides algorithms within a statistical framework to extend active learning for online content-based image retrieval (CBIR). The classification framework is presented with experiments to compare several powerful classification techniques in this information retrieval context. Focusing on interactive methods, active learning strategy is then described. The limitations of this approach for CBIR are emphasized before presenting our new active selection process RETIN. First, as any active method is sensitive to the boundary estimation between classes, the RETIN strategy carries out a boundary correction to make the retrieval process more robust. Second, the criterion of generalization error to optimize the active learning selection is modified to better represent the CBIR objective of database ranking. Third, a batch processing of images is proposed. Our strategy leads to a fast and efficient active learning scheme to retrieve sets of online images (query concept). Experiments on large databases show that the RETIN method performs well in comparison to several other active strategies.
Spatial and symbolic queries for 3D image data
NASA Astrophysics Data System (ADS)
Benson, Daniel C.; Zick, Gregory L.
1992-04-01
We present a query system for an object-oriented biomedical imaging database containing 3-D anatomical structures and their corresponding 2-D images. The graphical interface facilitates the formation of spatial queries, nonspatial or symbolic queries, and combined spatial/symbolic queries. A query editor is used for the creation and manipulation of 3-D query objects as volumes, surfaces, lines, and points. Symbolic predicates are formulated through a combination of text fields and multiple choice selections. Query results, which may include images, image contents, composite objects, graphics, and alphanumeric data, are displayed in multiple views. Objects returned by the query may be selected directly within the views for further inspection or modification, or for use as query objects in subsequent queries. Our image database query system provides visual feedback and manipulation of spatial query objects, multiple views of volume data, and the ability to combine spatial and symbolic queries. The system allows for incremental enhancement of existing objects and the addition of new objects and spatial relationships. The query system is designed for databases containing symbolic and spatial data. This paper discuses its application to data acquired in biomedical 3- D image reconstruction, but it is applicable to other areas such as CAD/CAM, geographical information systems, and computer vision.
GenoQuery: a new querying module for functional annotation in a genomic warehouse
Lemoine, Frédéric; Labedan, Bernard; Froidevaux, Christine
2008-01-01
Motivation: We have to cope with both a deluge of new genome sequences and a huge amount of data produced by high-throughput approaches used to exploit these genomic features. Crossing and comparing such heterogeneous and disparate data will help improving functional annotation of genomes. This requires designing elaborate integration systems such as warehouses for storing and querying these data. Results: We have designed a relational genomic warehouse with an original multi-layer architecture made of a databases layer and an entities layer. We describe a new querying module, GenoQuery, which is based on this architecture. We use the entities layer to define mixed queries. These mixed queries allow searching for instances of biological entities and their properties in the different databases, without specifying in which database they should be found. Accordingly, we further introduce the central notion of alternative queries. Such queries have the same meaning as the original mixed queries, while exploiting complementarities yielded by the various integrated databases of the warehouse. We explain how GenoQuery computes all the alternative queries of a given mixed query. We illustrate how useful this querying module is by means of a thorough example. Availability: http://www.lri.fr/~lemoine/GenoQuery/ Contact: chris@lri.fr, lemoine@lri.fr PMID:18586731
SPARK: Adapting Keyword Query to Semantic Search
NASA Astrophysics Data System (ADS)
Zhou, Qi; Wang, Chong; Xiong, Miao; Wang, Haofen; Yu, Yong
Semantic search promises to provide more accurate result than present-day keyword search. However, progress with semantic search has been delayed due to the complexity of its query languages. In this paper, we explore a novel approach of adapting keywords to querying the semantic web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search. A prototype system named 'SPARK' has been implemented in light of this approach. Given a keyword query, SPARK outputs a ranked list of SPARQL queries as the translation result. The translation in SPARK consists of three major steps: term mapping, query graph construction and query ranking. Specifically, a probabilistic query ranking model is proposed to select the most likely SPARQL query. In the experiment, SPARK achieved an encouraging translation result.
Griffon, N; Schuers, M; Dhombres, F; Merabti, T; Kerdelhué, G; Rollin, L; Darmoni, S J
2016-08-02
Despite international initiatives like Orphanet, it remains difficult to find up-to-date information about rare diseases. The aim of this study is to propose an exhaustive set of queries for PubMed based on terminological knowledge and to evaluate it versus the queries based on expertise provided by the most frequently used resource in Europe: Orphanet. Four rare disease terminologies (MeSH, OMIM, HPO and HRDO) were manually mapped to each other permitting the automatic creation of expended terminological queries for rare diseases. For 30 rare diseases, 30 citations retrieved by Orphanet expert query and/or query based on terminological knowledge were assessed for relevance by two independent reviewers unaware of the query's origin. An adjudication procedure was used to resolve any discrepancy. Precision, relative recall and F-measure were all computed. For each Orphanet rare disease (n = 8982), there was a corresponding terminological query, in contrast with only 2284 queries provided by Orphanet. Only 553 citations were evaluated due to queries with 0 or only a few hits. There were no significant differences between the Orpha query and terminological query in terms of precision, respectively 0.61 vs 0.52 (p = 0.13). Nevertheless, terminological queries retrieved more citations more often than Orpha queries (0.57 vs. 0.33; p = 0.01). Interestingly, Orpha queries seemed to retrieve older citations than terminological queries (p < 0.0001). The terminological queries proposed in this study are now currently available for all rare diseases. They may be a useful tool for both precision or recall oriented literature search.
Integrating a local database into the StarView distributed user interface
NASA Technical Reports Server (NTRS)
Silberberg, D. P.
1992-01-01
A distributed user interface to the Space Telescope Data Archive and Distribution Service (DADS) known as StarView is being developed. The DADS architecture consists of the data archive as well as a relational database catalog describing the archive. StarView is a client/server system in which the user interface is the front-end client to the DADS catalog and archive servers. Users query the DADS catalog from the StarView interface. Query commands are transmitted via a network and evaluated by the database. The results are returned via the network and are displayed on StarView forms. Based on the results, users decide which data sets to retrieve from the DADS archive. Archive requests are packaged by StarView and sent to DADS, which returns the requested data sets to the users. The advantages of distributed client/server user interfaces over traditional one-machine systems are well known. Since users run software on machines separate from the database, the overall client response time is much faster. Also, since the server is free to process only database requests, the database response time is much faster. Disadvantages inherent in this architecture are slow overall database access time due to the network delays, lack of a 'get previous row' command, and that refinements of a previously issued query must be submitted to the database server, even though the domain of values have already been returned by the previous query. This architecture also does not allow users to cross correlate DADS catalog data with other catalogs. Clearly, a distributed user interface would be more powerful if it overcame these disadvantages. A local database is being integrated into StarView to overcome these disadvantages. When a query is made through a StarView form, which is often composed of fields from multiple tables, it is translated to an SQL query and issued to the DADS catalog. At the same time, a local database table is created to contain the resulting rows of the query. The returned rows are displayed on the form as well as inserted into the local database table. Identical results are produced by reissuing the query to either the DADS catalog or to the local table. Relational databases do not provide a 'get previous row' function because of the inherent complexity of retrieving previous rows of multiple-table joins. However, since this function is easily implemented on a single table, StarView uses the local table to retrieve the previous row. Also, StarView issues subsequent query refinements to the local table instead of the DADS catalog, eliminating the network transmission overhead. Finally, other catalogs can be imported into the local database for cross correlation with local tables. Overall, it is believe that this is a more powerful architecture for distributed, database user interfaces.
Array Databases: Agile Analytics (not just) for the Earth Sciences
NASA Astrophysics Data System (ADS)
Baumann, P.; Misev, D.
2015-12-01
Gridded data, such as images, image timeseries, and climate datacubes, today are managed separately from the metadata, and with different, restricted retrieval capabilities. While databases are good at metadata modelled in tables, XML hierarchies, or RDF graphs, they traditionally do not support multi-dimensional arrays.This gap is being closed by Array Databases, pioneered by the scalable rasdaman ("raster data manager") array engine. Its declarative query language, rasql, extends SQL with array operators which are optimized and parallelized on server side. Installations can easily be mashed up securely, thereby enabling large-scale location-transparent query processing in federations. Domain experts value the integration with their commonly used tools leading to a quick learning curve.Earth, Space, and Life sciences, but also Social sciences as well as business have massive amounts of data and complex analysis challenges that are answered by rasdaman. As of today, rasdaman is mature and in operational use on hundreds of Terabytes of timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Additionally, its concepts have shaped international Big Data standards in the field, including the forthcoming array extension to ISO SQL, many of which are supported by both open-source and commercial systems meantime. In the geo field, rasdaman is reference implementation for the Open Geospatial Consortium (OGC) Big Data standard, WCS, now also under adoption by ISO. Further, rasdaman is in the final stage of OSGeo incubation.In this contribution we present array queries a la rasdaman, describe the architecture and novel optimization and parallelization techniques introduced in 2015, and put this in context of the intercontinental EarthServer initiative which utilizes rasdaman for enabling agile analytics on Petascale datacubes.
Hybrid Collaborative Learning for Classification and Clustering in Sensor Networks
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri L.; Sosnowski, Scott; Lane, Terran
2012-01-01
Traditionally, nodes in a sensor network simply collect data and then pass it on to a centralized node that archives, distributes, and possibly analyzes the data. However, analysis at the individual nodes could enable faster detection of anomalies or other interesting events as well as faster responses, such as sending out alerts or increasing the data collection rate. There is an additional opportunity for increased performance if learners at individual nodes can communicate with their neighbors. In previous work, methods were developed by which classification algorithms deployed at sensor nodes can communicate information about event labels to each other, building on prior work with co-training, self-training, and active learning. The idea of collaborative learning was extended to function for clustering algorithms as well, similar to ideas from penta-training and consensus clustering. However, collaboration between these learner types had not been explored. A new protocol was developed by which classifiers and clusterers can share key information about their observations and conclusions as they learn. This is an active collaboration in which learners of either type can query their neighbors for information that they then use to re-train or re-learn the concept they are studying. The protocol also supports broadcasts from the classifiers and clusterers to the rest of the network to announce new discoveries. Classifiers observe an event and assign it a label (type). Clusterers instead group observations into clusters without assigning them a label, and they collaborate in terms of pairwise constraints between two events [same-cluster (mustlink) or different-cluster (cannot-link)]. Fundamentally, these two learner types speak different languages. To bridge this gap, the new communication protocol provides four types of exchanges: hybrid queries for information, hybrid "broadcasts" of learned information, each specified for classifiers-to-clusterers, and clusterers-to-classifiers. The new capability has the potential to greatly expand the in situ analysis abilities of sensor networks. Classifiers seeking to categorize incoming data into different types of events can operate in tandem with clusterers that are sensitive to the occurrence of new kinds of events not known to the classifiers. In contrast to current approaches that treat these operations as independent components, a hybrid collaborative learning system can enable them to learn from each other.
SPARQL Query Re-writing Using Partonomy Based Transformation Rules
NASA Astrophysics Data System (ADS)
Jain, Prateek; Yeh, Peter Z.; Verma, Kunal; Henson, Cory A.; Sheth, Amit P.
Often the information present in a spatial knowledge base is represented at a different level of granularity and abstraction than the query constraints. For querying ontology's containing spatial information, the precise relationships between spatial entities has to be specified in the basic graph pattern of SPARQL query which can result in long and complex queries. We present a novel approach to help users intuitively write SPARQL queries to query spatial data, rather than relying on knowledge of the ontology structure. Our framework re-writes queries, using transformation rules to exploit part-whole relations between geographical entities to address the mismatches between query constraints and knowledge base. Our experiments were performed on completely third party datasets and queries. Evaluations were performed on Geonames dataset using questions from National Geographic Bee serialized into SPARQL and British Administrative Geography Ontology using questions from a popular trivia website. These experiments demonstrate high precision in retrieval of results and ease in writing queries.
A systematic review of data mining and machine learning for air pollution epidemiology.
Bellinger, Colin; Mohomed Jabbar, Mohomed Shazan; Zaïane, Osmar; Osornio-Vargas, Alvaro
2017-11-28
Data measuring airborne pollutants, public health and environmental factors are increasingly being stored and merged. These big datasets offer great potential, but also challenge traditional epidemiological methods. This has motivated the exploration of alternative methods to make predictions, find patterns and extract information. To this end, data mining and machine learning algorithms are increasingly being applied to air pollution epidemiology. We conducted a systematic literature review on the application of data mining and machine learning methods in air pollution epidemiology. We carried out our search process in PubMed, the MEDLINE database and Google Scholar. Research articles applying data mining and machine learning methods to air pollution epidemiology were queried and reviewed. Our search queries resulted in 400 research articles. Our fine-grained analysis employed our inclusion/exclusion criteria to reduce the results to 47 articles, which we separate into three primary areas of interest: 1) source apportionment; 2) forecasting/prediction of air pollution/quality or exposure; and 3) generating hypotheses. Early applications had a preference for artificial neural networks. In more recent work, decision trees, support vector machines, k-means clustering and the APRIORI algorithm have been widely applied. Our survey shows that the majority of the research has been conducted in Europe, China and the USA, and that data mining is becoming an increasingly common tool in environmental health. For potential new directions, we have identified that deep learning and geo-spacial pattern mining are two burgeoning areas of data mining that have good potential for future applications in air pollution epidemiology. We carried out a systematic review identifying the current trends, challenges and new directions to explore in the application of data mining methods to air pollution epidemiology. This work shows that data mining is increasingly being applied in air pollution epidemiology. The potential to support air pollution epidemiology continues to grow with advancements in data mining related to temporal and geo-spacial mining, and deep learning. This is further supported by new sensors and storage mediums that enable larger, better quality data. This suggests that many more fruitful applications can be expected in the future.
2006-06-01
SPARQL SPARQL Protocol and RDF Query Language SQL Structured Query Language SUMO Suggested Upper Merged Ontology SW... Query optimization algorithms are implemented in the Pellet reasoner in order to ensure querying a knowledge base is efficient . These algorithms...memory as a treelike structure in order for the data to be queried . XML Query (XQuery) is the standard language used when querying XML
Implementation of Quantum Private Queries Using Nuclear Magnetic Resonance
NASA Astrophysics Data System (ADS)
Wang, Chuan; Hao, Liang; Zhao, Lian-Jie
2011-08-01
We present a modified protocol for the realization of a quantum private query process on a classical database. Using one-qubit query and CNOT operation, the query process can be realized in a two-mode database. In the query process, the data privacy is preserved as the sender would not reveal any information about the database besides her query information, and the database provider cannot retain any information about the query. We implement the quantum private query protocol in a nuclear magnetic resonance system. The density matrix of the memory registers are constructed.
STBase: one million species trees for comparative biology.
McMahon, Michelle M; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael J
2015-01-01
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.
Tsai, Keng-Chang; Jian, Jhih-Wei; Yang, Ei-Wen; Hsu, Po-Chiang; Peng, Hung-Pin; Chen, Ching-Tai; Chen, Jun-Bo; Chang, Jeng-Yih; Hsu, Wen-Lian; Yang, An-Suei
2012-01-01
Non-covalent protein-carbohydrate interactions mediate molecular targeting in many biological processes. Prediction of non-covalent carbohydrate binding sites on protein surfaces not only provides insights into the functions of the query proteins; information on key carbohydrate-binding residues could suggest site-directed mutagenesis experiments, design therapeutics targeting carbohydrate-binding proteins, and provide guidance in engineering protein-carbohydrate interactions. In this work, we show that non-covalent carbohydrate binding sites on protein surfaces can be predicted with relatively high accuracy when the query protein structures are known. The prediction capabilities were based on a novel encoding scheme of the three-dimensional probability density maps describing the distributions of 36 non-covalent interacting atom types around protein surfaces. One machine learning model was trained for each of the 30 protein atom types. The machine learning algorithms predicted tentative carbohydrate binding sites on query proteins by recognizing the characteristic interacting atom distribution patterns specific for carbohydrate binding sites from known protein structures. The prediction results for all protein atom types were integrated into surface patches as tentative carbohydrate binding sites based on normalized prediction confidence level. The prediction capabilities of the predictors were benchmarked by a 10-fold cross validation on 497 non-redundant proteins with known carbohydrate binding sites. The predictors were further tested on an independent test set with 108 proteins. The residue-based Matthews correlation coefficient (MCC) for the independent test was 0.45, with prediction precision and sensitivity (or recall) of 0.45 and 0.49 respectively. In addition, 111 unbound carbohydrate-binding protein structures for which the structures were determined in the absence of the carbohydrate ligands were predicted with the trained predictors. The overall prediction MCC was 0.49. Independent tests on anti-carbohydrate antibodies showed that the carbohydrate antigen binding sites were predicted with comparable accuracy. These results demonstrate that the predictors are among the best in carbohydrate binding site predictions to date. PMID:22848404
Diabetes and Hypertension Quality Measurement in Four Safety-Net Sites
Benkert, R.; Dennehy, P.; White, J.; Hamilton, A.; Tanner, C.
2014-01-01
Summary Background In this new era after the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009, the literature on lessons learned with electronic health record (EHR) implementation needs to be revisited. Objectives Our objective was to describe what implementation of a commercially available EHR with built-in quality query algorithms showed us about our care for diabetes and hypertension populations in four safety net clinics, specifically feasibility of data retrieval, measurements over time, quality of data, and how our teams used this data. Methods A cross-sectional study was conducted from October 2008 to October 2012 in four safety-net clinics located in the Midwest and Western United States. A data warehouse that stores data from across the U.S was utilized for data extraction from patients with diabetes or hypertension diagnoses and at least two office visits per year. Standard quality measures were collected over a period of two to four years. All sites were engaged in a partnership model with the IT staff and a shared learning process to enhance the use of the quality metrics. Results While use of the algorithms was feasible across sites, challenges occurred when attempting to use the query results for research purposes. There was wide variation of both process and outcome results by individual centers. Composite calculations balanced out the differences seen in the individual measures. Despite using consistent quality definitions, the differences across centers had an impact on numerators and denominators. All sites agreed to a partnership model of EHR implementation, and each center utilized the available resources of the partnership for Center-specific quality initiatives. Conclusions Utilizing a shared EHR, a Regional Extension Center-like partnership model, and similar quality query algorithms allowed safety-net clinics to benchmark and improve the quality of care across differing patient populations and health care delivery models. PMID:25298815
Forecasting influenza in Hong Kong with Google search queries and statistical model fusion
Ramirez Ramirez, L. Leticia; Nezafati, Kusha; Zhang, Qingpeng; Tsui, Kwok-Leung
2017-01-01
Background The objective of this study is to investigate predictive utility of online social media and web search queries, particularly, Google search data, to forecast new cases of influenza-like-illness (ILI) in general outpatient clinics (GOPC) in Hong Kong. To mitigate the impact of sensitivity to self-excitement (i.e., fickle media interest) and other artifacts of online social media data, in our approach we fuse multiple offline and online data sources. Methods Four individual models: generalized linear model (GLM), least absolute shrinkage and selection operator (LASSO), autoregressive integrated moving average (ARIMA), and deep learning (DL) with Feedforward Neural Networks (FNN) are employed to forecast ILI-GOPC both one week and two weeks in advance. The covariates include Google search queries, meteorological data, and previously recorded offline ILI. To our knowledge, this is the first study that introduces deep learning methodology into surveillance of infectious diseases and investigates its predictive utility. Furthermore, to exploit the strength from each individual forecasting models, we use statistical model fusion, using Bayesian model averaging (BMA), which allows a systematic integration of multiple forecast scenarios. For each model, an adaptive approach is used to capture the recent relationship between ILI and covariates. Results DL with FNN appears to deliver the most competitive predictive performance among the four considered individual models. Combing all four models in a comprehensive BMA framework allows to further improve such predictive evaluation metrics as root mean squared error (RMSE) and mean absolute predictive error (MAPE). Nevertheless, DL with FNN remains the preferred method for predicting locations of influenza peaks. Conclusions The proposed approach can be viewed a feasible alternative to forecast ILI in Hong Kong or other countries where ILI has no constant seasonal trend and influenza data resources are limited. The proposed methodology is easily tractable and computationally efficient. PMID:28464015
The NOAO Data Lab PHAT Photometry Database
NASA Astrophysics Data System (ADS)
Olsen, Knut; Williams, Ben; Fitzpatrick, Michael; PHAT Team
2018-01-01
We present a database containing both the combined photometric object catalog and the single epoch measurements from the Panchromatic Hubble Andromeda Treasury (PHAT). This database is hosted by the NOAO Data Lab (http://datalab.noao.edu), and as such exposes a number of data services to the PHAT photometry, including access through a Table Access Protocol (TAP) service, direct PostgreSQL queries, web-based and programmatic query interfaces, remote storage space for personal database tables and files, and a JupyterHub-based Notebook analysis environment, as well as image access through a Simple Image Access (SIA) service. We show how the Data Lab database and Jupyter Notebook environment allow for straightforward and efficient analyses of PHAT catalog data, including maps of object density, depth, and color, extraction of light curves of variable objects, and proper motion exploration.
A study of medical and health queries to web search engines.
Spink, Amanda; Yang, Yin; Jansen, Jim; Nykanen, Pirrko; Lorence, Daniel P; Ozmutlu, Seda; Ozmutlu, H Cenk
2004-03-01
This paper reports findings from an analysis of medical or health queries to different web search engines. We report results: (i). comparing samples of 10000 web queries taken randomly from 1.2 million query logs from the AlltheWeb.com and Excite.com commercial web search engines in 2001 for medical or health queries, (ii). comparing the 2001 findings from Excite and AlltheWeb.com users with results from a previous analysis of medical and health related queries from the Excite Web search engine for 1997 and 1999, and (iii). medical or health advice-seeking queries beginning with the word 'should'. Findings suggest: (i). a small percentage of web queries are medical or health related, (ii). the top five categories of medical or health queries were: general health, weight issues, reproductive health and puberty, pregnancy/obstetrics, and human relationships, and (iii). over time, the medical and health queries may have declined as a proportion of all web queries, as the use of specialized medical/health websites and e-commerce-related queries has increased. Findings provide insights into medical and health-related web querying and suggests some implications for the use of the general web search engines when seeking medical/health information.
Monitoring Moving Queries inside a Safe Region
Al-Khalidi, Haidar; Taniar, David; Alamri, Sultan
2014-01-01
With mobile moving range queries, there is a need to recalculate the relevant surrounding objects of interest whenever the query moves. Therefore, monitoring the moving query is very costly. The safe region is one method that has been proposed to minimise the communication and computation cost of continuously monitoring a moving range query. Inside the safe region the set of objects of interest to the query do not change; thus there is no need to update the query while it is inside its safe region. However, when the query leaves its safe region the mobile device has to reevaluate the query, necessitating communication with the server. Knowing when and where the mobile device will leave a safe region is widely known as a difficult problem. To solve this problem, we propose a novel method to monitor the position of the query over time using a linear function based on the direction of the query obtained by periodic monitoring of its position. Periodic monitoring ensures that the query is aware of its location all the time. This method reduces the costs associated with communications in client-server architecture. Computational results show that our method is successful in handling moving query patterns. PMID:24696652
RDF-GL: A SPARQL-Based Graphical Query Language for RDF
NASA Astrophysics Data System (ADS)
Hogenboom, Frederik; Milea, Viorel; Frasincar, Flavius; Kaymak, Uzay
This chapter presents RDF-GL, a graphical query language (GQL) for RDF. The GQL is based on the textual query language SPARQL and mainly focuses on SPARQL SELECT queries. The advantage of a GQL over textual query languages is that complexity is hidden through the use of graphical symbols. RDF-GL is supported by a Java-based editor, SPARQLinG, which is presented as well. The editor does not only allow for RDF-GL query creation, but also converts RDF-GL queries to SPARQL queries and is able to subsequently execute these. Experiments show that using the GQL in combination with the editor makes RDF querying more accessible for end users.
ERIC Educational Resources Information Center
Silber, Bohne; Rosenstein, Carole
2010-01-01
This study began with a readily understandable impulse: to enumerate the nation's outdoor arts festivals and to identify their shared and divergent traits, considering factors such as event programming, staffing, finances, and audience demographics. The reason for this query is also straightforward. To date, no single report or database carries…
KA-SB: from data integration to large scale reasoning
Roldán-García, María del Mar; Navas-Delgado, Ismael; Kerzazi, Amine; Chniber, Othmane; Molina-Castro, Joaquín; Aldana-Montes, José F
2009-01-01
Background The analysis of information in the biological domain is usually focused on the analysis of data from single on-line data sources. Unfortunately, studying a biological process requires having access to disperse, heterogeneous, autonomous data sources. In this context, an analysis of the information is not possible without the integration of such data. Methods KA-SB is a querying and analysis system for final users based on combining a data integration solution with a reasoner. Thus, the tool has been created with a process divided into two steps: 1) KOMF, the Khaos Ontology-based Mediator Framework, is used to retrieve information from heterogeneous and distributed databases; 2) the integrated information is crystallized in a (persistent and high performance) reasoner (DBOWL). This information could be further analyzed later (by means of querying and reasoning). Results In this paper we present a novel system that combines the use of a mediation system with the reasoning capabilities of a large scale reasoner to provide a way of finding new knowledge and of analyzing the integrated information from different databases, which is retrieved as a set of ontology instances. This tool uses a graphical query interface to build user queries easily, which shows a graphical representation of the ontology and allows users o build queries by clicking on the ontology concepts. Conclusion These kinds of systems (based on KOMF) will provide users with very large amounts of information (interpreted as ontology instances once retrieved), which cannot be managed using traditional main memory-based reasoners. We propose a process for creating persistent and scalable knowledgebases from sets of OWL instances obtained by integrating heterogeneous data sources with KOMF. This process has been applied to develop a demo tool , which uses the BioPax Level 3 ontology as the integration schema, and integrates UNIPROT, KEGG, CHEBI, BRENDA and SABIORK databases. PMID:19796402
Database architectures for Space Telescope Science Institute
NASA Astrophysics Data System (ADS)
Lubow, Stephen
1993-08-01
At STScI nearly all large applications require database support. A general purpose architecture has been developed and is in use that relies upon an extended client-server paradigm. Processing is in general distributed across three processes, each of which generally resides on its own processor. Database queries are evaluated on one such process, called the DBMS server. The DBMS server software is provided by a database vendor. The application issues database queries and is called the application client. This client uses a set of generic DBMS application programming calls through our STDB/NET programming interface. Intermediate between the application client and the DBMS server is the STDB/NET server. This server accepts generic query requests from the application and converts them into the specific requirements of the DBMS server. In addition, it accepts query results from the DBMS server and passes them back to the application. Typically the STDB/NET server is local to the DBMS server, while the application client may be remote. The STDB/NET server provides additional capabilities such as database deadlock restart and performance monitoring. This architecture is currently in use for some major STScI applications, including the ground support system. We are currently investigating means of providing ad hoc query support to users through the above architecture. Such support is critical for providing flexible user interface capabilities. The Universal Relation advocated by Ullman, Kernighan, and others appears to be promising. In this approach, the user sees the entire database as a single table, thereby freeing the user from needing to understand the detailed schema. A software layer provides the translation between the user and detailed schema views of the database. However, many subtle issues arise in making this transformation. We are currently exploring this scheme for use in the Hubble Space Telescope user interface to the data archive system (DADS).
The I4 Online Query Tool for Earth Observations Data
NASA Technical Reports Server (NTRS)
Stefanov, William L.; Vanderbloemen, Lisa A.; Lawrence, Samuel J.
2015-01-01
The NASA Earth Observation System Data and Information System (EOSDIS) delivers an average of 22 terabytes per day of data collected by orbital and airborne sensor systems to end users through an integrated online search environment (the Reverb/ECHO system). Earth observations data collected by sensors on the International Space Station (ISS) are not currently included in the EOSDIS system, and are only accessible through various individual online locations. This increases the effort required by end users to query multiple datasets, and limits the opportunity for data discovery and innovations in analysis. The Earth Science and Remote Sensing Unit of the Exploration Integration and Science Directorate at NASA Johnson Space Center has collaborated with the School of Earth and Space Exploration at Arizona State University (ASU) to develop the ISS Instrument Integration Implementation (I4) data query tool to provide end users a clean, simple online interface for querying both current and historical ISS Earth Observations data. The I4 interface is based on the Lunaserv and Lunaserv Global Explorer (LGE) open-source software packages developed at ASU for query of lunar datasets. In order to avoid mirroring existing databases - and the need to continually sync/update those mirrors - our design philosophy is for the I4 tool to be a pure query engine only. Once an end user identifies a specific scene or scenes of interest, I4 transparently takes the user to the appropriate online location to download the data. The tool consists of two public-facing web interfaces. The Map Tool provides a graphic geobrowser environment where the end user can navigate to an area of interest and select single or multiple datasets to query. The Map Tool displays active image footprints for the selected datasets (Figure 1). Selecting a footprint will open a pop-up window that includes a browse image and a link to available image metadata, along with a link to the online location to order or download the actual data. Search results are either delivered in the form of browse images linked to the appropriate online database, similar to the Map Tool, or they may be transferred within the I4 environment for display as footprints in the Map Tool. Datasets searchable through I4 (http://eol.jsc.nasa.gov/I4_tool) currently include: Crew Earth Observations (CEO) cataloged and uncataloged handheld astronaut photography; Sally Ride EarthKAM; Hyperspectral Imager for the Coastal Ocean (HICO); and the ISS SERVIR Environmental Research and Visualization System (ISERV). The ISS is a unique platform in that it will have multiple users over its lifetime, and that no single remote sensing system has a permanent internal or external berth. The open source I4 tool is designed to enable straightforward addition of new datasets as they become available such as ISS-RapidSCAT, Cloud Aerosol Transport System (CATS), and the High Definition Earth Viewing (HDEV) system. Data from other sensor systems, such as those operated by the ISS International Partners or under the auspices of the US National Laboratory program, can also be added to I4 provided sufficient access to enable searching of data or metadata is available. Commercial providers of remotely sensed data from the ISS may be particularly interested in I4 as an additional means of directing potential customers and clients to their products.
A Query Integrator and Manager for the Query Web
Brinkley, James F.; Detwiler, Landon T.
2012-01-01
We introduce two concepts: the Query Web as a layer of interconnected queries over the document web and the semantic web, and a Query Web Integrator and Manager (QI) that enables the Query Web to evolve. QI permits users to write, save and reuse queries over any web accessible source, including other queries saved in other installations of QI. The saved queries may be in any language (e.g. SPARQL, XQuery); the only condition for interconnection is that the queries return their results in some form of XML. This condition allows queries to chain off each other, and to be written in whatever language is appropriate for the task. We illustrate the potential use of QI for several biomedical use cases, including ontology view generation using a combination of graph-based and logical approaches, value set generation for clinical data management, image annotation using terminology obtained from an ontology web service, ontology-driven brain imaging data integration, small-scale clinical data integration, and wider-scale clinical data integration. Such use cases illustrate the current range of applications of QI and lead us to speculate about the potential evolution from smaller groups of interconnected queries into a larger query network that layers over the document and semantic web. The resulting Query Web could greatly aid researchers and others who now have to manually navigate through multiple information sources in order to answer specific questions. PMID:22531831
Development and evaluation of a biomedical search engine using a predicate-based vector space model.
Kwak, Myungjae; Leroy, Gondy; Martinez, Jesse D; Harwell, Jeffrey
2013-10-01
Although biomedical information available in articles and patents is increasing exponentially, we continue to rely on the same information retrieval methods and use very few keywords to search millions of documents. We are developing a fundamentally different approach for finding much more precise and complete information with a single query using predicates instead of keywords for both query and document representation. Predicates are triples that are more complex datastructures than keywords and contain more structured information. To make optimal use of them, we developed a new predicate-based vector space model and query-document similarity function with adjusted tf-idf and boost function. Using a test bed of 107,367 PubMed abstracts, we evaluated the first essential function: retrieving information. Cancer researchers provided 20 realistic queries, for which the top 15 abstracts were retrieved using a predicate-based (new) and keyword-based (baseline) approach. Each abstract was evaluated, double-blind, by cancer researchers on a 0-5 point scale to calculate precision (0 versus higher) and relevance (0-5 score). Precision was significantly higher (p<.001) for the predicate-based (80%) than for the keyword-based (71%) approach. Relevance was almost doubled with the predicate-based approach-2.1 versus 1.6 without rank order adjustment (p<.001) and 1.34 versus 0.98 with rank order adjustment (p<.001) for predicate--versus keyword-based approach respectively. Predicates can support more precise searching than keywords, laying the foundation for rich and sophisticated information search. Copyright © 2013 Elsevier Inc. All rights reserved.
Reyes-Aldasoro, Constantino Carlos
2017-01-01
In this work, the public database of biomedical literature PubMed was mined using queries with combinations of keywords and year restrictions. It was found that the proportion of Cancer-related entries per year in PubMed has risen from around 6% in 1950 to more than 16% in 2016. This increase is not shared by other conditions such as AIDS, Malaria, Tuberculosis, Diabetes, Cardiovascular, Stroke and Infection some of which have, on the contrary, decreased as a proportion of the total entries per year. Organ-related queries were performed to analyse the variation of some specific cancers. A series of queries related to incidence, funding, and relationship with DNA, Computing and Mathematics, were performed to test correlation between the keywords, with the hope of elucidating the cause behind the rise of Cancer in PubMed. Interestingly, the proportion of Cancer-related entries that contain "DNA", "Computational" or "Mathematical" have increased, which suggests that the impact of these scientific advances on Cancer has been stronger than in other conditions. It is important to highlight that the results obtained with the data mining approach here presented are limited to the presence or absence of the keywords on a single, yet extensive, database. Therefore, results should be observed with caution. All the data used for this work is publicly available through PubMed and the UK's Office for National Statistics. All queries and figures were generated with the software platform Matlab and the files are available as supplementary material.
2017-01-01
In this work, the public database of biomedical literature PubMed was mined using queries with combinations of keywords and year restrictions. It was found that the proportion of Cancer-related entries per year in PubMed has risen from around 6% in 1950 to more than 16% in 2016. This increase is not shared by other conditions such as AIDS, Malaria, Tuberculosis, Diabetes, Cardiovascular, Stroke and Infection some of which have, on the contrary, decreased as a proportion of the total entries per year. Organ-related queries were performed to analyse the variation of some specific cancers. A series of queries related to incidence, funding, and relationship with DNA, Computing and Mathematics, were performed to test correlation between the keywords, with the hope of elucidating the cause behind the rise of Cancer in PubMed. Interestingly, the proportion of Cancer-related entries that contain “DNA”, “Computational” or “Mathematical” have increased, which suggests that the impact of these scientific advances on Cancer has been stronger than in other conditions. It is important to highlight that the results obtained with the data mining approach here presented are limited to the presence or absence of the keywords on a single, yet extensive, database. Therefore, results should be observed with caution. All the data used for this work is publicly available through PubMed and the UK’s Office for National Statistics. All queries and figures were generated with the software platform Matlab and the files are available as supplementary material. PMID:28282418
2015-01-01
Background In recent years, with advances in techniques for protein structure analysis, the knowledge about protein structure and function has been published in a vast number of articles. A method to search for specific publications from such a large pool of articles is needed. In this paper, we propose a method to search for related articles on protein structure analysis by using an article itself as a query. Results Each article is represented as a set of concepts in the proposed method. Then, by using similarities among concepts formulated from databases such as Gene Ontology, similarities between articles are evaluated. In this framework, the desired search results vary depending on the user's search intention because a variety of information is included in a single article. Therefore, the proposed method provides not only one input article (primary article) but also additional articles related to it as an input query to determine the search intention of the user, based on the relationship between two query articles. In other words, based on the concepts contained in the input article and additional articles, we actualize a relevant literature search that considers user intention by varying the degree of attention given to each concept and modifying the concept hierarchy graph. Conclusions We performed an experiment to retrieve relevant papers from articles on protein structure analysis registered in the Protein Data Bank by using three query datasets. The experimental results yielded search results with better accuracy than when user intention was not considered, confirming the effectiveness of the proposed method. PMID:25952498
The Priority of the Question: Focus Questions for Sustained Reasoning in Science
NASA Astrophysics Data System (ADS)
Lustick, David
2010-08-01
Science education standards place a high priority on promoting the skills and dispositions associated with inquiry at all levels of learning. Yet, the questions teachers employ to foster sustained reasoning are most likely borrowed from a textbook, lab manual, or worksheet. Such generic questions generated for a mass audience, lack authenticity and contextual cues that allow learners to immediately appreciate a question’s relevance. Teacher queries intended to motivate, guide, and foster learning through inquiry are known as focus questions. This theoretical article draws upon science education research to present a typology and conceptual framework intended to support science teacher educators as they identify, develop, and evaluate focus questions with their students.
Collaborative mining of graph patterns from multiple sources
NASA Astrophysics Data System (ADS)
Levchuk, Georgiy; Colonna-Romanoa, John
2016-05-01
Intelligence analysts require automated tools to mine multi-source data, including answering queries, learning patterns of life, and discovering malicious or anomalous activities. Graph mining algorithms have recently attracted significant attention in intelligence community, because the text-derived knowledge can be efficiently represented as graphs of entities and relationships. However, graph mining models are limited to use-cases involving collocated data, and often make restrictive assumptions about the types of patterns that need to be discovered, the relationships between individual sources, and availability of accurate data segmentation. In this paper we present a model to learn the graph patterns from multiple relational data sources, when each source might have only a fragment (or subgraph) of the knowledge that needs to be discovered, and segmentation of data into training or testing instances is not available. Our model is based on distributed collaborative graph learning, and is effective in situations when the data is kept locally and cannot be moved to a centralized location. Our experiments show that proposed collaborative learning achieves learning quality better than aggregated centralized graph learning, and has learning time comparable to traditional distributed learning in which a knowledge of data segmentation is needed.
An automated algorithm for determining photometric redshifts of quasars
NASA Astrophysics Data System (ADS)
Wang, Dan; Zhang, Yanxia; Zhao, Yongheng
2010-07-01
We employ k-nearest neighbor algorithm (KNN) for photometric redshift measurement of quasars with the Fifth Data Release (DR5) of the Sloan Digital Sky Survey (SDSS). KNN is an instance learning algorithm where the result of new instance query is predicted based on the closest training samples. The regressor do not use any model to fit and only based on memory. Given a query quasar, we find the known quasars or (training points) closest to the query point, whose redshift value is simply assigned to be the average of the values of its k nearest neighbors. Three kinds of different colors (PSF, Model or Fiber) and spectral redshifts are used as input parameters, separatively. The combination of the three kinds of colors is also taken as input. The experimental results indicate that the best input pattern is PSF + Model + Fiber colors in all experiments. With this pattern, 59.24%, 77.34% and 84.68% of photometric redshifts are obtained within ▵z < 0.1, 0.2 and 0.3, respectively. If only using one kind of colors as input, the model colors achieve the best performance. However, when using two kinds of colors, the best result is achieved by PSF + Fiber colors. In addition, nearest neighbor method (k = 1) shows its superiority compared to KNN (k ≠ 1) for the given sample.
Using Generalized Annotated Programs to Solve Social Network Diffusion Optimization Problems
2013-01-01
as follows: —Let kall be the k value for the SNDOP-ALL query and for each SNDOP query i, let ki be the k for that query. For each query i, set ki... kall − 1. —Number each element of vi ∈ V such that gI(vi) and V C(vi) are true. For the ith SNDOP query, let vi be the corresponding element of V —Let...vertices of S. PROOF. We set up |V | SNDOP-queries as follows: —Let kall be the k value for the SNDOP-ALL query and and for each SNDOP-query i, let ki be
A web-based data-querying tool based on ontology-driven methodology and flowchart-based model.
Ping, Xiao-Ou; Chung, Yufang; Tseng, Yi-Ju; Liang, Ja-Der; Yang, Pei-Ming; Huang, Guan-Tarn; Lai, Feipei
2013-10-08
Because of the increased adoption rate of electronic medical record (EMR) systems, more health care records have been increasingly accumulating in clinical data repositories. Therefore, querying the data stored in these repositories is crucial for retrieving the knowledge from such large volumes of clinical data. The aim of this study is to develop a Web-based approach for enriching the capabilities of the data-querying system along the three following considerations: (1) the interface design used for query formulation, (2) the representation of query results, and (3) the models used for formulating query criteria. The Guideline Interchange Format version 3.5 (GLIF3.5), an ontology-driven clinical guideline representation language, was used for formulating the query tasks based on the GLIF3.5 flowchart in the Protégé environment. The flowchart-based data-querying model (FBDQM) query execution engine was developed and implemented for executing queries and presenting the results through a visual and graphical interface. To examine a broad variety of patient data, the clinical data generator was implemented to automatically generate the clinical data in the repository, and the generated data, thereby, were employed to evaluate the system. The accuracy and time performance of the system for three medical query tasks relevant to liver cancer were evaluated based on the clinical data generator in the experiments with varying numbers of patients. In this study, a prototype system was developed to test the feasibility of applying a methodology for building a query execution engine using FBDQMs by formulating query tasks using the existing GLIF. The FBDQM-based query execution engine was used to successfully retrieve the clinical data based on the query tasks formatted using the GLIF3.5 in the experiments with varying numbers of patients. The accuracy of the three queries (ie, "degree of liver damage," "degree of liver damage when applying a mutually exclusive setting," and "treatments for liver cancer") was 100% for all four experiments (10 patients, 100 patients, 1000 patients, and 10,000 patients). Among the three measured query phases, (1) structured query language operations, (2) criteria verification, and (3) other, the first two had the longest execution time. The ontology-driven FBDQM-based approach enriched the capabilities of the data-querying system. The adoption of the GLIF3.5 increased the potential for interoperability, shareability, and reusability of the query tasks.
Hasan, Md Al Mehedi; Ahmad, Shamim; Molla, Md Khademul Islam
2017-03-28
Predicting the subcellular locations of proteins can provide useful hints that reveal their functions, increase our understanding of the mechanisms of some diseases, and finally aid in the development of novel drugs. As the number of newly discovered proteins has been growing exponentially, which in turns, makes the subcellular localization prediction by purely laboratory tests prohibitively laborious and expensive. In this context, to tackle the challenges, computational methods are being developed as an alternative choice to aid biologists in selecting target proteins and designing related experiments. However, the success of protein subcellular localization prediction is still a complicated and challenging issue, particularly, when query proteins have multi-label characteristics, i.e., if they exist simultaneously in more than one subcellular location or if they move between two or more different subcellular locations. To date, to address this problem, several types of subcellular localization prediction methods with different levels of accuracy have been proposed. The support vector machine (SVM) has been employed to provide potential solutions to the protein subcellular localization prediction problem. However, the practicability of an SVM is affected by the challenges of selecting an appropriate kernel and selecting the parameters of the selected kernel. To address this difficulty, in this study, we aimed to develop an efficient multi-label protein subcellular localization prediction system, named as MKLoc, by introducing multiple kernel learning (MKL) based SVM. We evaluated MKLoc using a combined dataset containing 5447 single-localized proteins (originally published as part of the Höglund dataset) and 3056 multi-localized proteins (originally published as part of the DBMLoc set). Note that this dataset was used by Briesemeister et al. in their extensive comparison of multi-localization prediction systems. Finally, our experimental results indicate that MKLoc not only achieves higher accuracy than a single kernel based SVM system but also shows significantly better results than those obtained from other top systems (MDLoc, BNCs, YLoc+). Moreover, MKLoc requires less computation time to tune and train the system than that required for BNCs and single kernel based SVM.
Visual analytics for semantic queries of TerraSAR-X image content
NASA Astrophysics Data System (ADS)
Espinoza-Molina, Daniela; Alonso, Kevin; Datcu, Mihai
2015-10-01
With the continuous image product acquisition of satellite missions, the size of the image archives is considerably increasing every day as well as the variety and complexity of their content, surpassing the end-user capacity to analyse and exploit them. Advances in the image retrieval field have contributed to the development of tools for interactive exploration and extraction of the images from huge archives using different parameters like metadata, key-words, and basic image descriptors. Even though we count on more powerful tools for automated image retrieval and data analysis, we still face the problem of understanding and analyzing the results. Thus, a systematic computational analysis of these results is required in order to provide to the end-user a summary of the archive content in comprehensible terms. In this context, visual analytics combines automated analysis with interactive visualizations analysis techniques for an effective understanding, reasoning and decision making on the basis of very large and complex datasets. Moreover, currently several researches are focused on associating the content of the images with semantic definitions for describing the data in a format to be easily understood by the end-user. In this paper, we present our approach for computing visual analytics and semantically querying the TerraSAR-X archive. Our approach is mainly composed of four steps: 1) the generation of a data model that explains the information contained in a TerraSAR-X product. The model is formed by primitive descriptors and metadata entries, 2) the storage of this model in a database system, 3) the semantic definition of the image content based on machine learning algorithms and relevance feedback, and 4) querying the image archive using semantic descriptors as query parameters and computing the statistical analysis of the query results. The experimental results shows that with the help of visual analytics and semantic definitions we are able to explain the image content using semantic terms and the relations between them answering questions such as what is the percentage of urban area in a region? or what is the distribution of water bodies in a city?
Behavioral Issues in the Use of Interactive Systems
1976-12-14
communication. American Psychologist, 1971, 26, 949-961. Codd , E. F . Seven steps to rendezvous with the casual user. IBM Research Report, RI 1333. 1974. Conrad...Approved for public releasel distribution unlimited. F LL(I j i’ This ~Research wavs spotdi pr yteEnierons~ao _ 1Repr o seionrin who Zur ichati emte...natural language ( Codd , 1974). Behavioral work has shown that non-programmers could learn to use a laboratory query language in about 3 hours (Thomas
Jadhav, Ashutosh; Andrews, Donna; Fiksdal, Alexander; Kumbamu, Ashok; McCormick, Jennifer B; Misitano, Andrew; Nelsen, Laurie; Ryu, Euijung; Sheth, Amit; Wu, Stephen
2014-01-01
Background The number of people using the Internet and mobile/smart devices for health information seeking is increasing rapidly. Although the user experience for online health information seeking varies with the device used, for example, smart devices (SDs) like smartphones/tablets versus personal computers (PCs) like desktops/laptops, very few studies have investigated how online health information seeking behavior (OHISB) may differ by device. Objective The objective of this study is to examine differences in OHISB between PCs and SDs through a comparative analysis of large-scale health search queries submitted through Web search engines from both types of devices. Methods Using the Web analytics tool, IBM NetInsight OnDemand, and based on the type of devices used (PCs or SDs), we obtained the most frequent health search queries between June 2011 and May 2013 that were submitted on Web search engines and directed users to the Mayo Clinic’s consumer health information website. We performed analyses on “Queries with considering repetition counts (QwR)” and “Queries without considering repetition counts (QwoR)”. The dataset contains (1) 2.74 million and 3.94 million QwoR, respectively for PCs and SDs, and (2) more than 100 million QwR for both PCs and SDs. We analyzed structural properties of the queries (length of the search queries, usage of query operators and special characters in health queries), types of search queries (keyword-based, wh-questions, yes/no questions), categorization of the queries based on health categories and information mentioned in the queries (gender, age-groups, temporal references), misspellings in the health queries, and the linguistic structure of the health queries. Results Query strings used for health information searching via PCs and SDs differ by almost 50%. The most searched health categories are “Symptoms” (1 in 3 search queries), “Causes”, and “Treatments & Drugs”. The distribution of search queries for different health categories differs with the device used for the search. Health queries tend to be longer and more specific than general search queries. Health queries from SDs are longer and have slightly fewer spelling mistakes than those from PCs. Users specify words related to women and children more often than that of men and any other age group. Most of the health queries are formulated using keywords; the second-most common are wh- and yes/no questions. Users ask more health questions using SDs than PCs. Almost all health queries have at least one noun and health queries from SDs are more descriptive than those from PCs. Conclusions This study is a large-scale comparative analysis of health search queries to understand the effects of device type (PCs vs SDs) used on OHISB. The study indicates that the device used for online health information search plays an important role in shaping how health information searches by consumers and patients are executed. PMID:25000537
Jadhav, Ashutosh; Andrews, Donna; Fiksdal, Alexander; Kumbamu, Ashok; McCormick, Jennifer B; Misitano, Andrew; Nelsen, Laurie; Ryu, Euijung; Sheth, Amit; Wu, Stephen; Pathak, Jyotishman
2014-07-04
The number of people using the Internet and mobile/smart devices for health information seeking is increasing rapidly. Although the user experience for online health information seeking varies with the device used, for example, smart devices (SDs) like smartphones/tablets versus personal computers (PCs) like desktops/laptops, very few studies have investigated how online health information seeking behavior (OHISB) may differ by device. The objective of this study is to examine differences in OHISB between PCs and SDs through a comparative analysis of large-scale health search queries submitted through Web search engines from both types of devices. Using the Web analytics tool, IBM NetInsight OnDemand, and based on the type of devices used (PCs or SDs), we obtained the most frequent health search queries between June 2011 and May 2013 that were submitted on Web search engines and directed users to the Mayo Clinic's consumer health information website. We performed analyses on "Queries with considering repetition counts (QwR)" and "Queries without considering repetition counts (QwoR)". The dataset contains (1) 2.74 million and 3.94 million QwoR, respectively for PCs and SDs, and (2) more than 100 million QwR for both PCs and SDs. We analyzed structural properties of the queries (length of the search queries, usage of query operators and special characters in health queries), types of search queries (keyword-based, wh-questions, yes/no questions), categorization of the queries based on health categories and information mentioned in the queries (gender, age-groups, temporal references), misspellings in the health queries, and the linguistic structure of the health queries. Query strings used for health information searching via PCs and SDs differ by almost 50%. The most searched health categories are "Symptoms" (1 in 3 search queries), "Causes", and "Treatments & Drugs". The distribution of search queries for different health categories differs with the device used for the search. Health queries tend to be longer and more specific than general search queries. Health queries from SDs are longer and have slightly fewer spelling mistakes than those from PCs. Users specify words related to women and children more often than that of men and any other age group. Most of the health queries are formulated using keywords; the second-most common are wh- and yes/no questions. Users ask more health questions using SDs than PCs. Almost all health queries have at least one noun and health queries from SDs are more descriptive than those from PCs. This study is a large-scale comparative analysis of health search queries to understand the effects of device type (PCs vs. SDs) used on OHISB. The study indicates that the device used for online health information search plays an important role in shaping how health information searches by consumers and patients are executed.
An evolution-based DNA-binding residue predictor using a dynamic query-driven learning scheme.
Chai, H; Zhang, J; Yang, G; Ma, Z
2016-11-15
DNA-binding proteins play a pivotal role in various biological activities. Identification of DNA-binding residues (DBRs) is of great importance for understanding the mechanism of gene regulations and chromatin remodeling. Most traditional computational methods usually construct their predictors on static non-redundant datasets. They excluded many homologous DNA-binding proteins so as to guarantee the generalization capability of their models. However, those ignored samples may potentially provide useful clues when studying protein-DNA interactions, which have not obtained enough attention. In view of this, we propose a novel method, namely DQPred-DBR, to fill the gap of DBR predictions. First, a large-scale extensible sample pool was compiled. Second, evolution-based features in the form of a relative position specific score matrix and covariant evolutionary conservation descriptors were used to encode the feature space. Third, a dynamic query-driven learning scheme was designed to make more use of proteins with known structure and functions. In comparison with a traditional static model, the introduction of dynamic models could obviously improve the prediction performance. Experimental results from the benchmark and independent datasets proved that our DQPred-DBR had promising generalization capability. It was capable of producing decent predictions and outperforms many state-of-the-art methods. For the convenience of academic use, our proposed method was also implemented as a web server at .
SkyQuery - A Prototype Distributed Query and Cross-Matching Web Service for the Virtual Observatory
NASA Astrophysics Data System (ADS)
Thakar, A. R.; Budavari, T.; Malik, T.; Szalay, A. S.; Fekete, G.; Nieto-Santisteban, M.; Haridas, V.; Gray, J.
2002-12-01
We have developed a prototype distributed query and cross-matching service for the VO community, called SkyQuery, which is implemented with hierarchichal Web Services. SkyQuery enables astronomers to run combined queries on existing distributed heterogeneous astronomy archives. SkyQuery provides a simple, user-friendly interface to run distributed queries over the federation of registered astronomical archives in the VO. The SkyQuery client connects to the portal Web Service, which farms the query out to the individual archives, which are also Web Services called SkyNodes. The cross-matching algorithm is run recursively on each SkyNode. Each archive is a relational DBMS with a HTM index for fast spatial lookups. The results of the distributed query are returned as an XML DataSet that is automatically rendered by the client. SkyQuery also returns the image cutout corresponding to the query result. SkyQuery finds not only matches between the various catalogs, but also dropouts - objects that exist in some of the catalogs but not in others. This is often as important as finding matches. We demonstrate the utility of SkyQuery with a brown-dwarf search between SDSS and 2MASS, and a search for radio-quiet quasars in SDSS, 2MASS and FIRST. The importance of a service like SkyQuery for the worldwide astronomical community cannot be overstated: data on the same objects in various archives is mapped in different wavelength ranges and looks very different due to different errors, instrument sensitivities and other peculiarities of each archive. Our cross-matching algorithm preforms a fuzzy spatial join across multiple catalogs. This type of cross-matching is currently often done by eye, one object at a time. A static cross-identification table for a set of archives would become obsolete by the time it was built - the exponential growth of astronomical data means that a dynamic cross-identification mechanism like SkyQuery is the only viable option. SkyQuery was funded by a grant from the NASA AISR program.
BioMart Central Portal: an open database network for the biological community
Guberman, Jonathan M.; Ai, J.; Arnaiz, O.; Baran, Joachim; Blake, Andrew; Baldock, Richard; Chelala, Claude; Croft, David; Cros, Anthony; Cutts, Rosalind J.; Di Génova, A.; Forbes, Simon; Fujisawa, T.; Gadaleta, E.; Goodstein, D. M.; Gundem, Gunes; Haggarty, Bernard; Haider, Syed; Hall, Matthew; Harris, Todd; Haw, Robin; Hu, S.; Hubbard, Simon; Hsu, Jack; Iyer, Vivek; Jones, Philip; Katayama, Toshiaki; Kinsella, R.; Kong, Lei; Lawson, Daniel; Liang, Yong; Lopez-Bigas, Nuria; Luo, J.; Lush, Michael; Mason, Jeremy; Moreews, Francois; Ndegwa, Nelson; Oakley, Darren; Perez-Llamas, Christian; Primig, Michael; Rivkin, Elena; Rosanoff, S.; Shepherd, Rebecca; Simon, Reinhard; Skarnes, B.; Smedley, Damian; Sperling, Linda; Spooner, William; Stevenson, Peter; Stone, Kevin; Teague, J.; Wang, Jun; Wang, Jianxin; Whitty, Brett; Wong, D. T.; Wong-Erasmus, Marie; Yao, L.; Youens-Clark, Ken; Yung, Christina; Zhang, Junjun; Kasprzyk, Arek
2011-01-01
BioMart Central Portal is a first of its kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools make Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities. Database URL: http://central.biomart.org. PMID:21930507
PubMedReco: A Real-Time Recommender System for PubMed Citations.
Samuel, Hamman W; Zaïane, Osmar R
2017-01-01
We present a recommender system, PubMedReco, for real-time suggestions of medical articles from PubMed, a database of over 23 million medical citations. PubMedReco can recommend medical article citations while users are conversing in a synchronous communication environment such as a chat room. Normally, users would have to leave their chat interface to open a new web browser window, and formulate an appropriate search query to retrieve relevant results. PubMedReco automatically generates the search query and shows relevant citations within the same integrated user interface. PubMedReco analyzes relevant keywords associated with the conversation and uses them to search for relevant citations using the PubMed E-utilities programming interface. Our contributions include improvements to the user experience for searching PubMed from within health forums and chat rooms, and a machine learning model for identifying relevant keywords. We demonstrate the feasibility of PubMedReco using BMJ's Doc2Doc forum discussions.
Influenza-like illness surveillance on Twitter through automated learning of naïve language.
Gesualdo, Francesco; Stilo, Giovanni; Agricola, Eleonora; Gonfiantini, Michaela V; Pandolfi, Elisabetta; Velardi, Paola; Tozzi, Alberto E
2013-01-01
Twitter has the potential to be a timely and cost-effective source of data for syndromic surveillance. When speaking of an illness, Twitter users often report a combination of symptoms, rather than a suspected or final diagnosis, using naïve, everyday language. We developed a minimally trained algorithm that exploits the abundance of health-related web pages to identify all jargon expressions related to a specific technical term. We then translated an influenza case definition into a Boolean query, each symptom being described by a technical term and all related jargon expressions, as identified by the algorithm. Subsequently, we monitored all tweets that reported a combination of symptoms satisfying the case definition query. In order to geolocalize messages, we defined 3 localization strategies based on codes associated with each tweet. We found a high correlation coefficient between the trend of our influenza-positive tweets and ILI trends identified by US traditional surveillance systems.
Influenza-Like Illness Surveillance on Twitter through Automated Learning of Naïve Language
Gesualdo, Francesco; Stilo, Giovanni; Agricola, Eleonora; Gonfiantini, Michaela V.; Pandolfi, Elisabetta; Velardi, Paola; Tozzi, Alberto E.
2013-01-01
Twitter has the potential to be a timely and cost-effective source of data for syndromic surveillance. When speaking of an illness, Twitter users often report a combination of symptoms, rather than a suspected or final diagnosis, using naïve, everyday language. We developed a minimally trained algorithm that exploits the abundance of health-related web pages to identify all jargon expressions related to a specific technical term. We then translated an influenza case definition into a Boolean query, each symptom being described by a technical term and all related jargon expressions, as identified by the algorithm. Subsequently, we monitored all tweets that reported a combination of symptoms satisfying the case definition query. In order to geolocalize messages, we defined 3 localization strategies based on codes associated with each tweet. We found a high correlation coefficient between the trend of our influenza-positive tweets and ILI trends identified by US traditional surveillance systems. PMID:24324799
BioMart Central Portal: an open database network for the biological community.
Guberman, Jonathan M; Ai, J; Arnaiz, O; Baran, Joachim; Blake, Andrew; Baldock, Richard; Chelala, Claude; Croft, David; Cros, Anthony; Cutts, Rosalind J; Di Génova, A; Forbes, Simon; Fujisawa, T; Gadaleta, E; Goodstein, D M; Gundem, Gunes; Haggarty, Bernard; Haider, Syed; Hall, Matthew; Harris, Todd; Haw, Robin; Hu, S; Hubbard, Simon; Hsu, Jack; Iyer, Vivek; Jones, Philip; Katayama, Toshiaki; Kinsella, R; Kong, Lei; Lawson, Daniel; Liang, Yong; Lopez-Bigas, Nuria; Luo, J; Lush, Michael; Mason, Jeremy; Moreews, Francois; Ndegwa, Nelson; Oakley, Darren; Perez-Llamas, Christian; Primig, Michael; Rivkin, Elena; Rosanoff, S; Shepherd, Rebecca; Simon, Reinhard; Skarnes, B; Smedley, Damian; Sperling, Linda; Spooner, William; Stevenson, Peter; Stone, Kevin; Teague, J; Wang, Jun; Wang, Jianxin; Whitty, Brett; Wong, D T; Wong-Erasmus, Marie; Yao, L; Youens-Clark, Ken; Yung, Christina; Zhang, Junjun; Kasprzyk, Arek
2011-01-01
BioMart Central Portal is a first of its kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools make Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities.
Teng, Rui; Leibnitz, Kenji; Miura, Ryu
2013-01-01
An essential application of wireless sensor networks is to successfully respond to user queries. Query packet losses occur in the query dissemination due to wireless communication problems such as interference, multipath fading, packet collisions, etc. The losses of query messages at sensor nodes result in the failure of sensor nodes reporting the requested data. Hence, the reliable and successful dissemination of query messages to sensor nodes is a non-trivial problem. The target of this paper is to enable highly successful query delivery to sensor nodes by localized and energy-efficient discovery, and recovery of query losses. We adopt local and collective cooperation among sensor nodes to increase the success rate of distributed discoveries and recoveries. To enable the scalability in the operations of discoveries and recoveries, we employ a distributed name resolution mechanism at each sensor node to allow sensor nodes to self-detect the correlated queries and query losses, and then efficiently locally respond to the query losses. We prove that the collective discovery of query losses has a high impact on the success of query dissemination and reveal that scalability can be achieved by using the proposed approach. We further study the novel features of the cooperation and competition in the collective recovery at PHY and MAC layers, and show that the appropriate number of detectors can achieve optimal successful recovery rate. We evaluate the proposed approach with both mathematical analyses and computer simulations. The proposed approach enables a high rate of successful delivery of query messages and it results in short route lengths to recover from query losses. The proposed approach is scalable and operates in a fully distributed manner. PMID:23748172
Ontological Approach to Military Knowledge Modeling and Management
2004-03-01
federated search mechanism has to reformulate user queries (expressed using the ontology) in the query languages of the different sources (e.g. SQL...ontologies as a common terminology – Unified query to perform federated search • Query processing – Ontology mapping to sources reformulate queries
NASA Astrophysics Data System (ADS)
Li, C.; Zhu, X.; Guo, W.; Liu, Y.; Huang, H.
2015-05-01
A method suitable for indoor complex semantic query considering the computation of indoor spatial relations is provided According to the characteristics of indoor space. This paper designs ontology model describing the space related information of humans, events and Indoor space objects (e.g. Storey and Room) as well as their relations to meet the indoor semantic query. The ontology concepts are used in IndoorSPARQL query language which extends SPARQL syntax for representing and querying indoor space. And four types specific primitives for indoor query, "Adjacent", "Opposite", "Vertical" and "Contain", are defined as query functions in IndoorSPARQL used to support quantitative spatial computations. Also a method is proposed to analysis the query language. Finally this paper adopts this method to realize indoor semantic query on the study area through constructing the ontology model for the study building. The experimental results show that the method proposed in this paper can effectively support complex indoor space semantic query.
VISAGE: Interactive Visual Graph Querying.
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2016-06-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete , an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with "wildcard" nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE's ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries.
VISAGE: Interactive Visual Graph Querying
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2017-01-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete, an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with “wildcard” nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE’s ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries. PMID:28553670
Deep Correlated Holistic Metric Learning for Sketch-Based 3D Shape Retrieval.
Dai, Guoxian; Xie, Jin; Fang, Yi
2018-07-01
How to effectively retrieve desired 3D models with simple queries is a long-standing problem in computer vision community. The model-based approach is quite straightforward but nontrivial, since people could not always have the desired 3D query model available by side. Recently, large amounts of wide-screen electronic devices are prevail in our daily lives, which makes the sketch-based 3D shape retrieval a promising candidate due to its simpleness and efficiency. The main challenge of sketch-based approach is the huge modality gap between sketch and 3D shape. In this paper, we proposed a novel deep correlated holistic metric learning (DCHML) method to mitigate the discrepancy between sketch and 3D shape domains. The proposed DCHML trains two distinct deep neural networks (one for each domain) jointly, which learns two deep nonlinear transformations to map features from both domains into a new feature space. The proposed loss, including discriminative loss and correlation loss, aims to increase the discrimination of features within each domain as well as the correlation between different domains. In the new feature space, the discriminative loss minimizes the intra-class distance of the deep transformed features and maximizes the inter-class distance of the deep transformed features to a large margin within each domain, while the correlation loss focused on mitigating the distribution discrepancy across different domains. Different from existing deep metric learning methods only with loss at the output layer, our proposed DCHML is trained with loss at both hidden layer and output layer to further improve the performance by encouraging features in the hidden layer also with desired properties. Our proposed method is evaluated on three benchmarks, including 3D Shape Retrieval Contest 2013, 2014, and 2016 benchmarks, and the experimental results demonstrate the superiority of our proposed method over the state-of-the-art methods.
A Visual Interface for Querying Heterogeneous Phylogenetic Databases.
Jamil, Hasan M
2017-01-01
Despite the recent growth in the number of phylogenetic databases, access to these wealth of resources remain largely tool or form-based interface driven. It is our thesis that the flexibility afforded by declarative query languages may offer the opportunity to access these repositories in a better way, and to use such a language to pose truly powerful queries in unprecedented ways. In this paper, we propose a substantially enhanced closed visual query language, called PhyQL, that can be used to query phylogenetic databases represented in a canonical form. The canonical representation presented helps capture most phylogenetic tree formats in a convenient way, and is used as the storage model for our PhyloBase database for which PhyQL serves as the query language. We have implemented a visual interface for the end users to pose PhyQL queries using visual icons, and drag and drop operations defined over them. Once a query is posed, the interface translates the visual query into a Datalog query for execution over the canonical database. Responses are returned as hyperlinks to phylogenies that can be viewed in several formats using the tree viewers supported by PhyloBase. Results cached in PhyQL buffer allows secondary querying on the computed results making it a truly powerful querying architecture.
Federal Register 2010, 2011, 2012, 2013, 2014
2010-04-22
... example, for a single device to which the Vendor has granted two people access, the Vendor should report... Amex BBO service in any calendar month. In order to take advantage of the per-query fee, a NYSE Amex... anticipates will be the most likely to take advantage of the proposed service; (iv) the contribution of market...
New tools and methods for direct programmatic access to the dbSNP relational database
Saccone, Scott F.; Quan, Jiaxi; Mehta, Gaurang; Bolze, Raphael; Thomas, Prasanth; Deelman, Ewa; Tischfield, Jay A.; Rice, John P.
2011-01-01
Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale. PMID:21037260
Sun, Shulei; Chen, Jing; Li, Weizhong; Altintas, Ilkay; Lin, Abel; Peltier, Steve; Stocks, Karen; Allen, Eric E.; Ellisman, Mark; Grethe, Jeffrey; Wooley, John
2011-01-01
The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing and sharing data about microbial biology through an advanced web-based analysis portal. CAMERA collects and links metadata relevant to environmental metagenome data sets with annotation in a semantically-aware environment allowing users to write expressive semantic queries against the database. To meet the needs of the research community, users are able to query metadata categories such as habitat, sample type, time, location and other environmental physicochemical parameters. CAMERA is compliant with the standards promulgated by the Genomic Standards Consortium (GSC), and sustains a role within the GSC in extending standards for content and format of the metagenomic data and metadata and its submission to the CAMERA repository. To ensure wide, ready access to data and annotation, CAMERA also provides data submission tools to allow researchers to share and forward data to other metagenomics sites and community data archives such as GenBank. It has multiple interfaces for easy submission of large or complex data sets, and supports pre-registration of samples for sequencing. CAMERA integrates a growing list of tools and viewers for querying, analyzing, annotating and comparing metagenome and genome data. PMID:21045053
Sun, Shulei; Chen, Jing; Li, Weizhong; Altintas, Ilkay; Lin, Abel; Peltier, Steve; Stocks, Karen; Allen, Eric E; Ellisman, Mark; Grethe, Jeffrey; Wooley, John
2011-01-01
The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing and sharing data about microbial biology through an advanced web-based analysis portal. CAMERA collects and links metadata relevant to environmental metagenome data sets with annotation in a semantically-aware environment allowing users to write expressive semantic queries against the database. To meet the needs of the research community, users are able to query metadata categories such as habitat, sample type, time, location and other environmental physicochemical parameters. CAMERA is compliant with the standards promulgated by the Genomic Standards Consortium (GSC), and sustains a role within the GSC in extending standards for content and format of the metagenomic data and metadata and its submission to the CAMERA repository. To ensure wide, ready access to data and annotation, CAMERA also provides data submission tools to allow researchers to share and forward data to other metagenomics sites and community data archives such as GenBank. It has multiple interfaces for easy submission of large or complex data sets, and supports pre-registration of samples for sequencing. CAMERA integrates a growing list of tools and viewers for querying, analyzing, annotating and comparing metagenome and genome data.
The Protein Disease Database of human body fluids: II. Computer methods and data issues.
Lemkin, P F; Orr, G A; Goldstein, M P; Creed, G J; Myrick, J E; Merril, C R
1995-01-01
The Protein Disease Database (PDD) is a relational database of proteins and diseases. With this database it is possible to screen for quantitative protein abnormalities associated with disease states. These quantitative relationships use data drawn from the peer-reviewed biomedical literature. Assays may also include those observed in high-resolution electrophoretic gels that offer the potential to quantitate many proteins in a single test as well as data gathered by enzymatic or immunologic assays. We are using the Internet World Wide Web (WWW) and the Web browser paradigm as an access method for wide distribution and querying of the Protein Disease Database. The WWW hypertext transfer protocol and its Common Gateway Interface make it possible to build powerful graphical user interfaces that can support easy-to-use data retrieval using query specification forms or images. The details of these interactions are totally transparent to the users of these forms. Using a client-server SQL relational database, user query access, initial data entry and database maintenance are all performed over the Internet with a Web browser. We discuss the underlying design issues, mapping mechanisms and assumptions that we used in constructing the system, data entry, access to the database server, security, and synthesis of derived two-dimensional gel image maps and hypertext documents resulting from SQL database searches.
Query-Adaptive Hash Code Ranking for Large-Scale Multi-View Visual Search.
Liu, Xianglong; Huang, Lei; Deng, Cheng; Lang, Bo; Tao, Dacheng
2016-10-01
Hash-based nearest neighbor search has become attractive in many applications. However, the quantization in hashing usually degenerates the discriminative power when using Hamming distance ranking. Besides, for large-scale visual search, existing hashing methods cannot directly support the efficient search over the data with multiple sources, and while the literature has shown that adaptively incorporating complementary information from diverse sources or views can significantly boost the search performance. To address the problems, this paper proposes a novel and generic approach to building multiple hash tables with multiple views and generating fine-grained ranking results at bitwise and tablewise levels. For each hash table, a query-adaptive bitwise weighting is introduced to alleviate the quantization loss by simultaneously exploiting the quality of hash functions and their complement for nearest neighbor search. From the tablewise aspect, multiple hash tables are built for different data views as a joint index, over which a query-specific rank fusion is proposed to rerank all results from the bitwise ranking by diffusing in a graph. Comprehensive experiments on image search over three well-known benchmarks show that the proposed method achieves up to 17.11% and 20.28% performance gains on single and multiple table search over the state-of-the-art methods.
SING: Subgraph search In Non-homogeneous Graphs
2010-01-01
Background Finding the subgraphs of a graph database that are isomorphic to a given query graph has practical applications in several fields, from cheminformatics to image understanding. Since subgraph isomorphism is a computationally hard problem, indexing techniques have been intensively exploited to speed up the process. Such systems filter out those graphs which cannot contain the query, and apply a subgraph isomorphism algorithm to each residual candidate graph. The applicability of such systems is limited to databases of small graphs, because their filtering power degrades on large graphs. Results In this paper, SING (Subgraph search In Non-homogeneous Graphs), a novel indexing system able to cope with large graphs, is presented. The method uses the notion of feature, which can be a small subgraph, subtree or path. Each graph in the database is annotated with the set of all its features. The key point is to make use of feature locality information. This idea is used to both improve the filtering performance and speed up the subgraph isomorphism task. Conclusions Extensive tests on chemical compounds, biological networks and synthetic graphs show that the proposed system outperforms the most popular systems in query time over databases of medium and large graphs. Other specific tests show that the proposed system is effective for single large graphs. PMID:20170516
BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data
2014-01-01
Background Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data with a limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce knowledge RDF triples on a single node. However, real world biological data differs from the simple synthetic data much. It is difficult to determine whether the synthetic e-commerce data is efficient enough to represent biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. Results We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data into our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. Conclusions Our paper shows that with appropriate configuration Virtuoso and OWLIM-SE can satisfy the basic requirements to load and query biological data less than 8 billion or so on a single node, for the simultaneous access of 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples; For data sets that contain 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best. They do not show overwhelming advantage over each other; For data over 4 billion Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, and our test shows its scalability is poor; Bigdata demonstrates average performance and is a good open source triple store for middle-sized (500 million or so) data set; Mulgara shows a little of fragility. PMID:25089180
BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data.
Wu, Hongyan; Fujiwara, Toyofumi; Yamamoto, Yasunori; Bolleman, Jerven; Yamaguchi, Atsuko
2014-01-01
Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data with a limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce knowledge RDF triples on a single node. However, real world biological data differs from the simple synthetic data much. It is difficult to determine whether the synthetic e-commerce data is efficient enough to represent biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data into our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. Our paper shows that with appropriate configuration Virtuoso and OWLIM-SE can satisfy the basic requirements to load and query biological data less than 8 billion or so on a single node, for the simultaneous access of 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples; For data sets that contain 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best. They do not show overwhelming advantage over each other; For data over 4 billion Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, and our test shows its scalability is poor; Bigdata demonstrates average performance and is a good open source triple store for middle-sized (500 million or so) data set; Mulgara shows a little of fragility.
Which factors predict the time spent answering queries to a drug information centre?
Reppe, Linda A.; Spigset, Olav
2010-01-01
Objective To develop a model based upon factors able to predict the time spent answering drug-related queries to Norwegian drug information centres (DICs). Setting and method Drug-related queries received at 5 DICs in Norway from March to May 2007 were randomly assigned to 20 employees until each of them had answered a minimum of five queries. The employees reported the number of drugs involved, the type of literature search performed, and whether the queries were considered judgmental or not, using a specifically developed scoring system. Main outcome measures The scores of these three factors were added together to define a workload score for each query. Workload and its individual factors were subsequently related to the measured time spent answering the queries by simple or multiple linear regression analyses. Results Ninety-six query/answer pairs were analyzed. Workload significantly predicted the time spent answering the queries (adjusted R2 = 0.22, P < 0.001). Literature search was the individual factor best predicting the time spent answering the queries (adjusted R2 = 0.17, P < 0.001), and this variable also contributed the most in the multiple regression analyses. Conclusion The most important workload factor predicting the time spent handling the queries in this study was the type of literature search that had to be performed. The categorisation of queries as judgmental or not, also affected the time spent answering the queries. The number of drugs involved did not significantly influence the time spent answering drug information queries. PMID:20922480
Personalized query suggestion based on user behavior
NASA Astrophysics Data System (ADS)
Chen, Wanyu; Hao, Zepeng; Shao, Taihua; Chen, Honghui
Query suggestions help users refine their queries after they input an initial query. Previous work mainly concentrated on similarity-based and context-based query suggestion approaches. However, models that focus on adapting to a specific user (personalization) can help to improve the probability of the user being satisfied. In this paper, we propose a personalized query suggestion model based on users’ search behavior (UB model), where we inject relevance between queries and users’ search behavior into a basic probabilistic model. For the relevance between queries, we consider their semantical similarity and co-occurrence which indicates the behavior information from other users in web search. Regarding the current user’s preference to a query, we combine the user’s short-term and long-term search behavior in a linear fashion and deal with the data sparse problem with Bayesian probabilistic matrix factorization (BPMF). In particular, we also investigate the impact of different personalization strategies (the combination of the user’s short-term and long-term search behavior) on the performance of query suggestion reranking. We quantify the improvement of our proposed UB model against a state-of-the-art baseline using the public AOL query logs and show that it beats the baseline in terms of metrics used in query suggestion reranking. The experimental results show that: (i) for personalized ranking, users’ behavioral information helps to improve query suggestion effectiveness; and (ii) given a query, merging information inferred from the short-term and long-term search behavior of a particular user can result in a better performance than both plain approaches.
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce.
Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel
2013-08-01
Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive.
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce
Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel
2013-01-01
Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS – a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive. PMID:24187650
Woo, Hyekyung; Cho, Youngtae; Shim, Eunyoung; Lee, Jong-Koo; Lee, Chang-Gun; Kim, Seong Hwan
2016-07-04
As suggested as early as in 2006, logs of queries submitted to search engines seeking information could be a source for detection of emerging influenza epidemics if changes in the volume of search queries are monitored (infodemiology). However, selecting queries that are most likely to be associated with influenza epidemics is a particular challenge when it comes to generating better predictions. In this study, we describe a methodological extension for detecting influenza outbreaks using search query data; we provide a new approach for query selection through the exploration of contextual information gleaned from social media data. Additionally, we evaluate whether it is possible to use these queries for monitoring and predicting influenza epidemics in South Korea. Our study was based on freely available weekly influenza incidence data and query data originating from the search engine on the Korean website Daum between April 3, 2011 and April 5, 2014. To select queries related to influenza epidemics, several approaches were applied: (1) exploring influenza-related words in social media data, (2) identifying the chief concerns related to influenza, and (3) using Web query recommendations. Optimal feature selection by least absolute shrinkage and selection operator (Lasso) and support vector machine for regression (SVR) were used to construct a model predicting influenza epidemics. In total, 146 queries related to influenza were generated through our initial query selection approach. A considerable proportion of optimal features for final models were derived from queries with reference to the social media data. The SVR model performed well: the prediction values were highly correlated with the recent observed influenza-like illness (r=.956; P<.001) and virological incidence rate (r=.963; P<.001). These results demonstrate the feasibility of using search queries to enhance influenza surveillance in South Korea. In addition, an approach for query selection using social media data seems ideal for supporting influenza surveillance based on search query data.
Woo, Hyekyung; Shim, Eunyoung; Lee, Jong-Koo; Lee, Chang-Gun; Kim, Seong Hwan
2016-01-01
Background As suggested as early as in 2006, logs of queries submitted to search engines seeking information could be a source for detection of emerging influenza epidemics if changes in the volume of search queries are monitored (infodemiology). However, selecting queries that are most likely to be associated with influenza epidemics is a particular challenge when it comes to generating better predictions. Objective In this study, we describe a methodological extension for detecting influenza outbreaks using search query data; we provide a new approach for query selection through the exploration of contextual information gleaned from social media data. Additionally, we evaluate whether it is possible to use these queries for monitoring and predicting influenza epidemics in South Korea. Methods Our study was based on freely available weekly influenza incidence data and query data originating from the search engine on the Korean website Daum between April 3, 2011 and April 5, 2014. To select queries related to influenza epidemics, several approaches were applied: (1) exploring influenza-related words in social media data, (2) identifying the chief concerns related to influenza, and (3) using Web query recommendations. Optimal feature selection by least absolute shrinkage and selection operator (Lasso) and support vector machine for regression (SVR) were used to construct a model predicting influenza epidemics. Results In total, 146 queries related to influenza were generated through our initial query selection approach. A considerable proportion of optimal features for final models were derived from queries with reference to the social media data. The SVR model performed well: the prediction values were highly correlated with the recent observed influenza-like illness (r=.956; P<.001) and virological incidence rate (r=.963; P<.001). Conclusions These results demonstrate the feasibility of using search queries to enhance influenza surveillance in South Korea. In addition, an approach for query selection using social media data seems ideal for supporting influenza surveillance based on search query data. PMID:27377323
Schuers, Matthieu; Joulakian, Mher; Kerdelhué, Gaetan; Segas, Léa; Grosjean, Julien; Darmoni, Stéfan J; Griffon, Nicolas
2017-07-03
MEDLINE is the most widely used medical bibliographic database in the world. Most of its citations are in English and this can be an obstacle for some researchers to access the information the database contains. We created a multilingual query builder to facilitate access to the PubMed subset using a language other than English. The aim of our study was to assess the impact of this multilingual query builder on the quality of PubMed queries for non-native English speaking physicians and medical researchers. A randomised controlled study was conducted among French speaking general practice residents. We designed a multi-lingual query builder to facilitate information retrieval, based on available MeSH translations and providing users with both an interface and a controlled vocabulary in their own language. Participating residents were randomly allocated either the French or the English version of the query builder. They were asked to translate 12 short medical questions into MeSH queries. The main outcome was the quality of the query. Two librarians blind to the arm independently evaluated each query, using a modified published classification that differentiated eight types of errors. Twenty residents used the French version of the query builder and 22 used the English version. 492 queries were analysed. There were significantly more perfect queries in the French group vs. the English group (respectively 37.9% vs. 17.9%; p < 0.01). It took significantly more time for the members of the English group than the members of the French group to build each query, respectively 194 sec vs. 128 sec; p < 0.01. This multi-lingual query builder is an effective tool to improve the quality of PubMed queries in particular for researchers whose first language is not English.
A Web-Based Data-Querying Tool Based on Ontology-Driven Methodology and Flowchart-Based Model
Ping, Xiao-Ou; Chung, Yufang; Liang, Ja-Der; Yang, Pei-Ming; Huang, Guan-Tarn; Lai, Feipei
2013-01-01
Background Because of the increased adoption rate of electronic medical record (EMR) systems, more health care records have been increasingly accumulating in clinical data repositories. Therefore, querying the data stored in these repositories is crucial for retrieving the knowledge from such large volumes of clinical data. Objective The aim of this study is to develop a Web-based approach for enriching the capabilities of the data-querying system along the three following considerations: (1) the interface design used for query formulation, (2) the representation of query results, and (3) the models used for formulating query criteria. Methods The Guideline Interchange Format version 3.5 (GLIF3.5), an ontology-driven clinical guideline representation language, was used for formulating the query tasks based on the GLIF3.5 flowchart in the Protégé environment. The flowchart-based data-querying model (FBDQM) query execution engine was developed and implemented for executing queries and presenting the results through a visual and graphical interface. To examine a broad variety of patient data, the clinical data generator was implemented to automatically generate the clinical data in the repository, and the generated data, thereby, were employed to evaluate the system. The accuracy and time performance of the system for three medical query tasks relevant to liver cancer were evaluated based on the clinical data generator in the experiments with varying numbers of patients. Results In this study, a prototype system was developed to test the feasibility of applying a methodology for building a query execution engine using FBDQMs by formulating query tasks using the existing GLIF. The FBDQM-based query execution engine was used to successfully retrieve the clinical data based on the query tasks formatted using the GLIF3.5 in the experiments with varying numbers of patients. The accuracy of the three queries (ie, “degree of liver damage,” “degree of liver damage when applying a mutually exclusive setting,” and “treatments for liver cancer”) was 100% for all four experiments (10 patients, 100 patients, 1000 patients, and 10,000 patients). Among the three measured query phases, (1) structured query language operations, (2) criteria verification, and (3) other, the first two had the longest execution time. Conclusions The ontology-driven FBDQM-based approach enriched the capabilities of the data-querying system. The adoption of the GLIF3.5 increased the potential for interoperability, shareability, and reusability of the query tasks. PMID:25600078
QKD-based quantum private query without a failure probability
NASA Astrophysics Data System (ADS)
Liu, Bin; Gao, Fei; Huang, Wei; Wen, QiaoYan
2015-10-01
In this paper, we present a quantum-key-distribution (QKD)-based quantum private query (QPQ) protocol utilizing single-photon signal of multiple optical pulses. It maintains the advantages of the QKD-based QPQ, i.e., easy to implement and loss tolerant. In addition, different from the situations in the previous QKD-based QPQ protocols, in our protocol, the number of the items an honest user will obtain is always one and the failure probability is always zero. This characteristic not only improves the stability (in the sense that, ignoring the noise and the attack, the protocol would always succeed), but also benefits the privacy of the database (since the database will no more reveal additional secrets to the honest users). Furthermore, for the user's privacy, the proposed protocol is cheat sensitive, and for security of the database, we obtain an upper bound for the leaked information of the database in theory.
Indexing and retrieving point and region objects
NASA Astrophysics Data System (ADS)
Ibrahim, Azzam T.; Fotouhi, Farshad A.
1996-03-01
R-tree and its variants are examples of spatial data structures for paged-secondary memory. To process a query, these structures require multiple path traversals. In this paper, we present a new image access method, SB+-tree which requires a single path traversal to process a query. Also, SB+-tree will allow commercial databases an access method for spatial objects without a major change, since most commercial databases already support B+-tree as an access method for text data. The SB+-tree can be used for zero and non-zero size data objects. Non-zero size objects are approximated by their minimum bounding rectangles (MBRs). The number of SB+-trees generated is dependent upon the number of dimensions of the approximation of the object. The structure supports efficient spatial operations such as regions-overlap, distance and direction. In this paper, we experimentally and analytically demonstrate the superiority of SB+-tree over R-tree.
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets
Plaku, Erion; Kavraki, Lydia E.
2009-01-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
Grubb, Stephen C.; Bult, Carol J.; Bogue, Molly A.
2014-01-01
The Mouse Phenome Database (MPD; phenome.jax.org) was launched in 2001 as the data coordination center for the international Mouse Phenome Project. MPD integrates quantitative phenotype, gene expression and genotype data into a common annotated framework to facilitate query and analysis. MPD contains >3500 phenotype measurements or traits relevant to human health, including cancer, aging, cardiovascular disorders, obesity, infectious disease susceptibility, blood disorders, neurosensory disorders, drug addiction and toxicity. Since our 2012 NAR report, we have added >70 new data sets, including data from Collaborative Cross lines and Diversity Outbred mice. During this time we have completely revamped our homepage, improved search and navigational aspects of the MPD application, developed several web-enabled data analysis and visualization tools, annotated phenotype data to public ontologies, developed an ontology browser and released new single nucleotide polymorphism query functionality with much higher density coverage than before. Here, we summarize recent data acquisitions and describe our latest improvements. PMID:24243846
Experimental realization of a one-way quantum computer algorithm solving Simon's problem.
Tame, M S; Bell, B A; Di Franco, C; Wadsworth, W J; Rarity, J G
2014-11-14
We report an experimental demonstration of a one-way implementation of a quantum algorithm solving Simon's problem-a black-box period-finding problem that has an exponential gap between the classical and quantum runtime. Using an all-optical setup and modifying the bases of single-qubit measurements on a five-qubit cluster state, key representative functions of the logical two-qubit version's black box can be queried and solved. To the best of our knowledge, this work represents the first experimental realization of the quantum algorithm solving Simon's problem. The experimental results are in excellent agreement with the theoretical model, demonstrating the successful performance of the algorithm. With a view to scaling up to larger numbers of qubits, we analyze the resource requirements for an n-qubit version. This work helps highlight how one-way quantum computing provides a practical route to experimentally investigating the quantum-classical gap in the query complexity model.
Machine Learning Approaches in Cardiovascular Imaging.
Henglin, Mir; Stein, Gillian; Hushcha, Pavel V; Snoek, Jasper; Wiltschko, Alexander B; Cheng, Susan
2017-10-01
Cardiovascular imaging technologies continue to increase in their capacity to capture and store large quantities of data. Modern computational methods, developed in the field of machine learning, offer new approaches to leveraging the growing volume of imaging data available for analyses. Machine learning methods can now address data-related problems ranging from simple analytic queries of existing measurement data to the more complex challenges involved in analyzing raw images. To date, machine learning has been used in 2 broad and highly interconnected areas: automation of tasks that might otherwise be performed by a human and generation of clinically important new knowledge. Most cardiovascular imaging studies have focused on task-oriented problems, but more studies involving algorithms aimed at generating new clinical insights are emerging. Continued expansion in the size and dimensionality of cardiovascular imaging databases is driving strong interest in applying powerful deep learning methods, in particular, to analyze these data. Overall, the most effective approaches will require an investment in the resources needed to appropriately prepare such large data sets for analyses. Notwithstanding current technical and logistical challenges, machine learning and especially deep learning methods have much to offer and will substantially impact the future practice and science of cardiovascular imaging. © 2017 American Heart Association, Inc.
Mining Longitudinal Web Queries: Trends and Patterns.
ERIC Educational Resources Information Center
Wang, Peiling; Berry, Michael W.; Yang, Yiheng
2003-01-01
Analyzed user queries submitted to an academic Web site during a four-year period, using a relational database, to examine users' query behavior, to identify problems they encounter, and to develop techniques for optimizing query analysis and mining. Linguistic analyses focus on query structures, lexicon, and word associations using statistical…
Optimizing a Query by Transformation and Expansion.
Glocker, Katrin; Knurr, Alexander; Dieter, Julia; Dominick, Friederike; Forche, Melanie; Koch, Christian; Pascoe Pérez, Analie; Roth, Benjamin; Ückert, Frank
2017-01-01
In the biomedical sector not only the amount of information produced and uploaded into the web is enormous, but also the number of sources where these data can be found. Clinicians and researchers spend huge amounts of time on trying to access this information and to filter the most important answers to a given question. As the formulation of these queries is crucial, automated query expansion is an effective tool to optimize a query and receive the best possible results. In this paper we introduce the concept of a workflow for an optimization of queries in the medical and biological sector by using a series of tools for expansion and transformation of the query. After the definition of attributes by the user, the query string is compared to previous queries in order to add semantic co-occurring terms to the query. Additionally, the query is enlarged by an inclusion of synonyms. The translation into database specific ontologies ensures the optimal query formulation for the chosen database(s). As this process can be performed in various databases at once, the results are ranked and normalized in order to achieve a comparable list of answers for a question.
WATCHMAN: A Data Warehouse Intelligent Cache Manager
NASA Technical Reports Server (NTRS)
Scheuermann, Peter; Shim, Junho; Vingralek, Radek
1996-01-01
Data warehouses store large volumes of data which are used frequently by decision support applications. Such applications involve complex queries. Query performance in such an environment is critical because decision support applications often require interactive query response time. Because data warehouses are updated infrequently, it becomes possible to improve query performance by caching sets retrieved by queries in addition to query execution plans. In this paper we report on the design of an intelligent cache manager for sets retrieved by queries called WATCHMAN, which is particularly well suited for data warehousing environment. Our cache manager employs two novel, complementary algorithms for cache replacement and for cache admission. WATCHMAN aims at minimizing query response time and its cache replacement policy swaps out entire retrieved sets of queries instead of individual pages. The cache replacement and admission algorithms make use of a profit metric, which considers for each retrieved set its average rate of reference, its size, and execution cost of the associated query. We report on a performance evaluation based on the TPC-D and Set Query benchmarks. These experiments show that WATCHMAN achieves a substantial performance improvement in a decision support environment when compared to a traditional LRU replacement algorithm.
A memory learning framework for effective image retrieval.
Han, Junwei; Ngan, King N; Li, Mingjing; Zhang, Hong-Jiang
2005-04-01
Most current content-based image retrieval systems are still incapable of providing users with their desired results. The major difficulty lies in the gap between low-level image features and high-level image semantics. To address the problem, this study reports a framework for effective image retrieval by employing a novel idea of memory learning. It forms a knowledge memory model to store the semantic information by simply accumulating user-provided interactions. A learning strategy is then applied to predict the semantic relationships among images according to the memorized knowledge. Image queries are finally performed based on a seamless combination of low-level features and learned semantics. One important advantage of our framework is its ability to efficiently annotate images and also propagate the keyword annotation from the labeled images to unlabeled images. The presented algorithm has been integrated into a practical image retrieval system. Experiments on a collection of 10,000 general-purpose images demonstrate the effectiveness of the proposed framework.
Application of Machine Learning Approaches for Protein-protein Interactions Prediction.
Zhang, Mengying; Su, Qiang; Lu, Yi; Zhao, Manman; Niu, Bing
2017-01-01
Proteomics endeavors to study the structures, functions and interactions of proteins. Information of the protein-protein interactions (PPIs) helps to improve our knowledge of the functions and the 3D structures of proteins. Thus determining the PPIs is essential for the study of the proteomics. In this review, in order to study the application of machine learning in predicting PPI, some machine learning approaches such as support vector machine (SVM), artificial neural networks (ANNs) and random forest (RF) were selected, and the examples of its applications in PPIs were listed. SVM and RF are two commonly used methods. Nowadays, more researchers predict PPIs by combining more than two methods. This review presents the application of machine learning approaches in predicting PPI. Many examples of success in identification and prediction in the area of PPI prediction have been discussed, and the PPIs research is still in progress. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Lin, Wei-Zhong; Fang, Jian-An; Xiao, Xuan; Chou, Kuo-Chen
2013-04-05
Predicting protein subcellular localization is a challenging problem, particularly when query proteins have multi-label features meaning that they may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing methods can only be used to deal with the single-label proteins. Actually, multi-label proteins should not be ignored because they usually bear some special function worthy of in-depth studies. By introducing the "multi-label learning" approach, a new predictor, called iLoc-Animal, has been developed that can be used to deal with the systems containing both single- and multi-label animal (metazoan except human) proteins. Meanwhile, to measure the prediction quality of a multi-label system in a rigorous way, five indices were introduced; they are "Absolute-True", "Absolute-False" (or Hamming-Loss"), "Accuracy", "Precision", and "Recall". As a demonstration, the jackknife cross-validation was performed with iLoc-Animal on a benchmark dataset of animal proteins classified into the following 20 location sites: (1) acrosome, (2) cell membrane, (3) centriole, (4) centrosome, (5) cell cortex, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracellular, (11) Golgi apparatus, (12) lysosome, (13) mitochondrion, (14) melanosome, (15) microsome, (16) nucleus, (17) peroxisome, (18) plasma membrane, (19) spindle, and (20) synapse, where many proteins belong to two or more locations. For such a complicated system, the outcomes achieved by iLoc-Animal for all the aforementioned five indices were quite encouraging, indicating that the predictor may become a useful tool in this area. It has not escaped our notice that the multi-label approach and the rigorous measurement metrics can also be used to investigate many other multi-label problems in molecular biology. As a user-friendly web-server, iLoc-Animal is freely accessible to the public at the web-site .
EarthServer - an FP7 project to enable the web delivery and analysis of 3D/4D models
NASA Astrophysics Data System (ADS)
Laxton, John; Sen, Marcus; Passmore, James
2013-04-01
EarthServer aims at open access and ad-hoc analytics on big Earth Science data, based on the OGC geoservice standards Web Coverage Service (WCS) and Web Coverage Processing Service (WCPS). The WCS model defines "coverages" as a unifying paradigm for multi-dimensional raster data, point clouds, meshes, etc., thereby addressing a wide range of Earth Science data including 3D/4D models. WCPS allows declarative SQL-style queries on coverages. The project is developing a pilot implementing these standards, and will also investigate the use of GeoSciML to describe coverages. Integration of WCPS with XQuery will in turn allow coverages to be queried in combination with their metadata and GeoSciML description. The unified service will support navigation, extraction, aggregation, and ad-hoc analysis on coverage data from SQL. Clients will range from mobile devices to high-end immersive virtual reality, and will enable 3D model visualisation using web browser technology coupled with developing web standards. EarthServer is establishing open-source client and server technology intended to be scalable to Petabyte/Exabyte volumes, based on distributed processing, supercomputing, and cloud virtualization. Implementation will be based on the existing rasdaman server technology developed. Services using rasdaman technology are being installed serving the atmospheric, oceanographic, geological, cryospheric, planetary and general earth observation communities. The geology service (http://earthserver.bgs.ac.uk/) is being provided by BGS and at present includes satellite imagery, superficial thickness data, onshore DTMs and 3D models for the Glasgow area. It is intended to extend the data sets available to include 3D voxel models. Use of the WCPS standard allows queries to be constructed against single or multiple coverages. For example on a single coverage data for a particular area can be selected or data with a particular range of pixel values. Queries on multiple surfaces can be constructed to calculate, for example, the thickness between two surfaces in a 3D model or the depth from ground surface to the top of a particular geologic unit. In the first version of the service a simple interface showing some example queries has been implemented in order to show the potential of the technologies. The project aims to develop the services available in light of user feedback, both in terms of the data available, the functionality and the interface. User feedback on the services guides the software and standards development aspects of the project, leading to enhanced versions of the software which will be implemented in upgraded versions of the services during the lifetime of the project.
Assisting Consumer Health Information Retrieval with Query Recommendations
Zeng, Qing T.; Crowell, Jonathan; Plovnick, Robert M.; Kim, Eunjung; Ngo, Long; Dibble, Emily
2006-01-01
Objective: Health information retrieval (HIR) on the Internet has become an important practice for millions of people, many of whom have problems forming effective queries. We have developed and evaluated a tool to assist people in health-related query formation. Design: We developed the Health Information Query Assistant (HIQuA) system. The system suggests alternative/additional query terms related to the user's initial query that can be used as building blocks to construct a better, more specific query. The recommended terms are selected according to their semantic distance from the original query, which is calculated on the basis of concept co-occurrences in medical literature and log data as well as semantic relations in medical vocabularies. Measurements: An evaluation of the HIQuA system was conducted and a total of 213 subjects participated in the study. The subjects were randomized into 2 groups. One group was given query recommendations and the other was not. Each subject performed HIR for both a predefined and a self-defined task. Results: The study showed that providing HIQuA recommendations resulted in statistically significantly higher rates of successful queries (odds ratio = 1.66, 95% confidence interval = 1.16–2.38), although no statistically significant impact on user satisfaction or the users' ability to accomplish the predefined retrieval task was found. Conclusion: Providing semantic-distance-based query recommendations can help consumers with query formation during HIR. PMID:16221944
PAQ: Persistent Adaptive Query Middleware for Dynamic Environments
NASA Astrophysics Data System (ADS)
Rajamani, Vasanth; Julien, Christine; Payton, Jamie; Roman, Gruia-Catalin
Pervasive computing applications often entail continuous monitoring tasks, issuing persistent queries that return continuously updated views of the operational environment. We present PAQ, a middleware that supports applications' needs by approximating a persistent query as a sequence of one-time queries. PAQ introduces an integration strategy abstraction that allows composition of one-time query responses into streams representing sophisticated spatio-temporal phenomena of interest. A distinguishing feature of our middleware is the realization that the suitability of a persistent query's result is a function of the application's tolerance for accuracy weighed against the associated overhead costs. In PAQ, programmers can specify an inquiry strategy that dictates how information is gathered. Since network dynamics impact the suitability of a particular inquiry strategy, PAQ associates an introspection strategy with a persistent query, that evaluates the quality of the query's results. The result of introspection can trigger application-defined adaptation strategies that alter the nature of the query. PAQ's simple API makes developing adaptive querying systems easily realizable. We present the key abstractions, describe their implementations, and demonstrate the middleware's usefulness through application examples and evaluation.
NASA Astrophysics Data System (ADS)
Kuznetsov, Valentin; Riley, Daniel; Afaq, Anzar; Sekhri, Vijay; Guo, Yuyi; Lueking, Lee
2010-04-01
The CMS experiment has implemented a flexible and powerful system enabling users to find data within the CMS physics data catalog. The Dataset Bookkeeping Service (DBS) comprises a database and the services used to store and access metadata related to CMS physics data. To this, we have added a generalized query system in addition to the existing web and programmatic interfaces to the DBS. This query system is based on a query language that hides the complexity of the underlying database structure by discovering the join conditions between database tables. This provides a way of querying the system that is simple and straightforward for CMS data managers and physicists to use without requiring knowledge of the database tables or keys. The DBS Query Language uses the ANTLR tool to build the input query parser and tokenizer, followed by a query builder that uses a graph representation of the DBS schema to construct the SQL query sent to underlying database. We will describe the design of the query system, provide details of the language components and overview of how this component fits into the overall data discovery system architecture.
Spatial aggregation query in dynamic geosensor networks
NASA Astrophysics Data System (ADS)
Yi, Baolin; Feng, Dayang; Xiao, Shisong; Zhao, Erdun
2007-11-01
Wireless sensor networks have been widely used for civilian and military applications, such as environmental monitoring and vehicle tracking. In many of these applications, the researches mainly aim at building sensor network based systems to leverage the sensed data to applications. However, the existing works seldom exploited spatial aggregation query considering the dynamic characteristics of sensor networks. In this paper, we investigate how to process spatial aggregation query over dynamic geosensor networks where both the sink node and sensor nodes are mobile and propose several novel improvements on enabling techniques. The mobility of sensors makes the existing routing protocol based on information of fixed framework or the neighborhood infeasible. We present an improved location-based stateless implicit geographic forwarding (IGF) protocol for routing a query toward the area specified by query window, a diameter-based window aggregation query (DWAQ) algorithm for query propagation and data aggregation in the query window, finally considering the location changing of the sink node, we present two schemes to forward the result to the sink node. Simulation results show that the proposed algorithms can improve query latency and query accuracy.
The Development and Preliminary Application Ofplant Quarantine Remote Teaching System Inchina
NASA Astrophysics Data System (ADS)
Wu, Zhigang; Li, Zhihong; Yang, Ding; Zhang, Guozhen
With the development of modern information technology, the traditional teaching mode becomes more deficient for the requirement of modern education. Plant Quarantine has been accepted as the common course for the universities of agriculture in China after the entry of WTO. But the teaching resources of this course are not enough especially for most universities with lack base. The characteristic of e-learning is regarded as one way to solve the problem of short teaching resource. PQRTS (Plant Quarantine Remote Teaching System) was designed and developed with JSP (Java Sever Pages), MySQL and Tomcat in this study. The system included many kinds of plant quarantine teaching resources, such as international glossary, regulations and standards, multimedia information of quarantine process and pests, ppt files of teaching, and training exercise. The system prototype implemented the functions of remote learning, querying, management, examination and remote discussion. It could be a tool for teaching, teaching assistance and learning online.
Designing learning management system interoperability in semantic web
NASA Astrophysics Data System (ADS)
Anistyasari, Y.; Sarno, R.; Rochmawati, N.
2018-01-01
The extensive adoption of learning management system (LMS) has set the focus on the interoperability requirement. Interoperability is the ability of different computer systems, applications or services to communicate, share and exchange data, information, and knowledge in a precise, effective and consistent way. Semantic web technology and the use of ontologies are able to provide the required computational semantics and interoperability for the automation of tasks in LMS. The purpose of this study is to design learning management system interoperability in the semantic web which currently has not been investigated deeply. Moodle is utilized to design the interoperability. Several database tables of Moodle are enhanced and some features are added. The semantic web interoperability is provided by exploited ontology in content materials. The ontology is further utilized as a searching tool to match user’s queries and available courses. It is concluded that LMS interoperability in Semantic Web is possible to be performed.
Person Authentication Using Learned Parameters of Lifting Wavelet Filters
NASA Astrophysics Data System (ADS)
Niijima, Koichi
2006-10-01
This paper proposes a method for identifying persons by the use of the lifting wavelet parameters learned by kurtosis-minimization. Our learning method uses desirable properties of kurtosis and wavelet coefficients of a facial image. Exploiting these properties, the lifting parameters are trained so as to minimize the kurtosis of lifting wavelet coefficients computed for the facial image. Since this minimization problem is an ill-posed problem, it is solved by the aid of Tikhonov's regularization method. Our learning algorithm is applied to each of the faces to be identified to generate its feature vector whose components consist of the learned parameters. The constructed feature vectors are memorized together with the corresponding faces in a feature vectors database. Person authentication is performed by comparing the feature vector of a query face with those stored in the database. In numerical experiments, the lifting parameters are trained for each of the neutral faces of 132 persons (74 males and 58 females) in the AR face database. Person authentication is executed by using the smile and anger faces of the same persons in the database.
Hoogendam, Arjen; Stalenhoef, Anton FH; Robbé, Pieter F de Vries; Overbeke, A John PM
2008-01-01
Background The use of PubMed to answer daily medical care questions is limited because it is challenging to retrieve a small set of relevant articles and time is restricted. Knowing what aspects of queries are likely to retrieve relevant articles can increase the effectiveness of PubMed searches. The objectives of our study were to identify queries that are likely to retrieve relevant articles by relating PubMed search techniques and tools to the number of articles retrieved and the selection of articles for further reading. Methods This was a prospective observational study of queries regarding patient-related problems sent to PubMed by residents and internists in internal medicine working in an Academic Medical Centre. We analyzed queries, search results, query tools (Mesh, Limits, wildcards, operators), selection of abstract and full-text for further reading, using a portal that mimics PubMed. Results PubMed was used to solve 1121 patient-related problems, resulting in 3205 distinct queries. Abstracts were viewed in 999 (31%) of these queries, and in 126 (39%) of 321 queries using query tools. The average term count per query was 2.5. Abstracts were selected in more than 40% of queries using four or five terms, increasing to 63% if the use of four or five terms yielded 2–161 articles. Conclusion Queries sent to PubMed by physicians at our hospital during daily medical care contain fewer than three terms. Queries using four to five terms, retrieving less than 161 article titles, are most likely to result in abstract viewing. PubMed search tools are used infrequently by our population and are less effective than the use of four or five terms. Methods to facilitate the formulation of precise queries, using more relevant terms, should be the focus of education and research. PMID:18816391
ProBiS-CHARMMing: Web Interface for Prediction and Optimization of Ligands in Protein Binding Sites.
Konc, Janez; Miller, Benjamin T; Štular, Tanja; Lešnik, Samo; Woodcock, H Lee; Brooks, Bernard R; Janežič, Dušanka
2015-11-23
Proteins often exist only as apo structures (unligated) in the Protein Data Bank, with their corresponding holo structures (with ligands) unavailable. However, apoproteins may not represent the amino-acid residue arrangement upon ligand binding well, which is especially problematic for molecular docking. We developed the ProBiS-CHARMMing web interface by connecting the ProBiS ( http://probis.cmm.ki.si ) and CHARMMing ( http://www.charmming.org ) web servers into one functional unit that enables prediction of protein-ligand complexes and allows for their geometry optimization and interaction energy calculation. The ProBiS web server predicts ligands (small compounds, proteins, nucleic acids, and single-atom ligands) that may bind to a query protein. This is achieved by comparing its surface structure against a nonredundant database of protein structures and finding those that have binding sites similar to that of the query protein. Existing ligands found in the similar binding sites are then transposed to the query according to predictions from ProBiS. The CHARMMing web server enables, among other things, minimization and potential energy calculation for a wide variety of biomolecular systems, and it is used here to optimize the geometry of the predicted protein-ligand complex structures using the CHARMM force field and to calculate their interaction energies with the corresponding query proteins. We show how ProBiS-CHARMMing can be used to predict ligands and their poses for a particular binding site, and minimize the predicted protein-ligand complexes to obtain representations of holoproteins. The ProBiS-CHARMMing web interface is freely available for academic users at http://probis.nih.gov.
Tao, Shiqiang; Cui, Licong; Wu, Xi; Zhang, Guo-Qiang
2017-01-01
To help researchers better access clinical data, we developed a prototype query engine called DataSphere for exploring large-scale integrated clinical data repositories. DataSphere expedites data importing using a NoSQL data management system and dynamically renders its user interface for concept-based querying tasks. DataSphere provides an interactive query-building interface together with query translation and optimization strategies, which enable users to build and execute queries effectively and efficiently. We successfully loaded a dataset of one million patients for University of Kentucky (UK) Healthcare into DataSphere with more than 300 million clinical data records. We evaluated DataSphere by comparing it with an instance of i2b2 deployed at UK Healthcare, demonstrating that DataSphere provides enhanced user experience for both query building and execution.
Tao, Shiqiang; Cui, Licong; Wu, Xi; Zhang, Guo-Qiang
2017-01-01
To help researchers better access clinical data, we developed a prototype query engine called DataSphere for exploring large-scale integrated clinical data repositories. DataSphere expedites data importing using a NoSQL data management system and dynamically renders its user interface for concept-based querying tasks. DataSphere provides an interactive query-building interface together with query translation and optimization strategies, which enable users to build and execute queries effectively and efficiently. We successfully loaded a dataset of one million patients for University of Kentucky (UK) Healthcare into DataSphere with more than 300 million clinical data records. We evaluated DataSphere by comparing it with an instance of i2b2 deployed at UK Healthcare, demonstrating that DataSphere provides enhanced user experience for both query building and execution. PMID:29854239
Improve Performance of Data Warehouse by Query Cache
NASA Astrophysics Data System (ADS)
Gour, Vishal; Sarangdevot, S. S.; Sharma, Anand; Choudhary, Vinod
2010-11-01
The primary goal of data warehouse is to free the information locked up in the operational database so that decision makers and business analyst can make queries, analysis and planning regardless of the data changes in operational database. As the number of queries is large, therefore, in certain cases there is reasonable probability that same query submitted by the one or multiple users at different times. Each time when query is executed, all the data of warehouse is analyzed to generate the result of that query. In this paper we will study how using query cache improves performance of Data Warehouse and try to find the common problems faced. These kinds of problems are faced by Data Warehouse administrators which are minimizes response time and improves the efficiency of query in data warehouse overall, particularly when data warehouse is updated at regular interval.
Safari, Leila; Patrick, Jon D
2018-06-01
This paper reports on a generic framework to provide clinicians with the ability to conduct complex analyses on elaborate research topics using cascaded queries to resolve internal time-event dependencies in the research questions, as an extension to the proposed Clinical Data Analytics Language (CliniDAL). A cascaded query model is proposed to resolve internal time-event dependencies in the queries which can have up to five levels of criteria starting with a query to define subjects to be admitted into a study, followed by a query to define the time span of the experiment. Three more cascaded queries can be required to define control groups, control variables and output variables which all together simulate a real scientific experiment. According to the complexity of the research questions, the cascaded query model has the flexibility of merging some lower level queries for simple research questions or adding a nested query to each level to compose more complex queries. Three different scenarios (one of them contains two studies) are described and used for evaluation of the proposed solution. CliniDAL's complex analyses solution enables answering complex queries with time-event dependencies at most in a few hours which manually would take many days. An evaluation of results of the research studies based on the comparison between CliniDAL and SQL solutions reveals high usability and efficiency of CliniDAL's solution. Copyright © 2018 Elsevier Inc. All rights reserved.
A Data Warehouse to Support Condition Based Maintenance (CBM)
2005-05-01
Application ( VBA ) code sequence to import the original MAST-generated CSV and then create a single output table in DBASE IV format. The DBASE IV format...database architecture (Oracle, Sybase, MS- SQL , etc). This design includes table definitions, comments, specification of table attributes, primary and foreign...built queries and applications. Needs the application developers to construct data views. No SQL programming experience. b. Power Database User - knows
Machine Learning in the Big Data Era: Are We There Yet?
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sukumar, Sreenivas Rangan
In this paper, we discuss the machine learning challenges of the Big Data era. We observe that recent innovations in being able to collect, access, organize, integrate, and query massive amounts of data from a wide variety of data sources have brought statistical machine learning under more scrutiny and evaluation for gleaning insights from the data than ever before. In that context, we pose and debate the question - Are machine learning algorithms scaling with the ability to store and compute? If yes, how? If not, why not? We survey recent developments in the state-of-the-art to discuss emerging and outstandingmore » challenges in the design and implementation of machine learning algorithms at scale. We leverage experience from real-world Big Data knowledge discovery projects across domains of national security and healthcare to suggest our efforts be focused along the following axes: (i) the data science challenge - designing scalable and flexible computational architectures for machine learning (beyond just data-retrieval); (ii) the science of data challenge the ability to understand characteristics of data before applying machine learning algorithms and tools; and (iii) the scalable predictive functions challenge the ability to construct, learn and infer with increasing sample size, dimensionality, and categories of labels. We conclude with a discussion of opportunities and directions for future research.« less
Evaluation of Sub Query Performance in SQL Server
NASA Astrophysics Data System (ADS)
Oktavia, Tanty; Sujarwo, Surya
2014-03-01
The paper explores several sub query methods used in a query and their impact on the query performance. The study uses experimental approach to evaluate the performance of each sub query methods combined with indexing strategy. The sub query methods consist of in, exists, relational operator and relational operator combined with top operator. The experimental shows that using relational operator combined with indexing strategy in sub query has greater performance compared with using same method without indexing strategy and also other methods. In summary, for application that emphasized on the performance of retrieving data from database, it better to use relational operator combined with indexing strategy. This study is done on Microsoft SQL Server 2012.
Secure Skyline Queries on Cloud Platform.
Liu, Jinfei; Yang, Juncheng; Xiong, Li; Pei, Jian
2017-04-01
Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically-secure encryption. As a key subroutine, we present a new secure dominance protocol, which can be also used as a building block for other queries. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
Towards Hybrid Online On-Demand Querying of Realtime Data with Stateful Complex Event Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, Qunzhi; Simmhan, Yogesh; Prasanna, Viktor K.
Emerging Big Data applications in areas like e-commerce and energy industry require both online and on-demand queries to be performed over vast and fast data arriving as streams. These present novel challenges to Big Data management systems. Complex Event Processing (CEP) is recognized as a high performance online query scheme which in particular deals with the velocity aspect of the 3-V’s of Big Data. However, traditional CEP systems do not consider data variety and lack the capability to embed ad hoc queries over the volume of data streams. In this paper, we propose H2O, a stateful complex event processing framework,more » to support hybrid online and on-demand queries over realtime data. We propose a semantically enriched event and query model to address data variety. A formal query algebra is developed to precisely capture the stateful and containment semantics of online and on-demand queries. We describe techniques to achieve the interactive query processing over realtime data featured by efficient online querying, dynamic stream data persistence and on-demand access. The system architecture is presented and the current implementation status reported.« less
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce.
Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng
2013-11-01
The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS - a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing.
Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert
2012-01-01
Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the “wisdom of the crowds.” Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., “funky jazz with saxophone,” “spooky electronica,” etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data. PMID:22460786
Game-powered machine learning.
Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert
2012-04-24
Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the "wisdom of the crowds." Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., "funky jazz with saxophone," "spooky electronica," etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data.
The application of machine learning techniques in the clinical drug therapy.
Meng, Huan-Yu; Jin, Wan-Lin; Yan, Cheng-Kai; Yang, Huan
2018-05-25
The development of a novel drug is an extremely complicated process that includes the target identification, design and manufacture, and proper therapy of the novel drug, as well as drug dose selection, drug efficacy evaluation, and adverse drug reaction control. Due to the limited resources, high costs, long duration, and low hit-to-lead ratio in the development of pharmacogenetics and computer technology, machine learning techniques have assisted novel drug development and have gradually received more attention by researchers. According to current research, machine learning techniques are widely applied in the process of the discovery of new drugs and novel drug targets, the decision surrounding proper therapy and drug dose, and the prediction of drug efficacy and adverse drug reactions. In this article, we discussed the history, workflow, and advantages and disadvantages of machine learning techniques in the processes mentioned above. Although the advantages of machine learning techniques are fairly obvious, the application of machine learning techniques is currently limited. With further research, the application of machine techniques in drug development could be much more widespread and could potentially be one of the major methods used in drug development. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Health information technology and implementation science: partners in progress in the VHA.
Hynes, Denise M; Whittier, Erika R; Owens, Arika
2013-03-01
The Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI) has demonstrated how implementation science can enhance the quality of health care. During this time an increasing number of implementation research projects have developed or utilized health information technology (HIT) innovations to leverage the VA's electronic health record and information systems. To describe the HIT approaches used and to characterize the facilitators and barriers to progress within implementation research projects in the VA QUERI program. Nine case studies were selected from among 88 projects and represented 8 of 14 HIT categories identified. Each case study included key informants whose roles on the project were principal investigator, implementation science and informatics development. We conducted documentation analysis and semistructured in-person interviews with key informants for each of the 9 case studies. We used qualitative analysis software to identify and thematically code information and interview responses. : Thematic analyses revealed 3 domains or pathways critical to progression through the QUERI steps. These pathways addressed: (1) compliance and collaboration with information technology policies and procedures; (2) operating within organizational policies and building collaborations with end users, clinicians, and administrators; and (3) obtaining and maintaining research resources and approvals. Sustained efforts in HIT innovation and in implementation science in the Veterans Health Administration demonstrates the interdependencies of these initiatives and the critical pathways that can contribute to progress. Other health care quality improvement efforts that rely on HIT can learn from the Veterans Health Administration experience.
Rossa, Carlos; Lehmann, Thomas; Sloboda, Ronald; Usmani, Nawaid; Tavakoli, Mahdi
2017-08-01
Global modelling has traditionally been the approach taken to estimate needle deflection in soft tissue. In this paper, we propose a new method based on local data-driven modelling of needle deflection. External measurement of needle-tissue interactions is collected from several insertions in ex vivo tissue to form a cloud of data. Inputs to the system are the needle insertion depth, axial rotations, and the forces and torques measured at the needle base by a force sensor. When a new insertion is performed, the just-in-time learning method estimates the model outputs given the current inputs to the needle-tissue system and the historical database. The query is compared to every observation in the database and is given weights according to some similarity criteria. Only a subset of historical data that is most relevant to the query is selected and a local linear model is fit to the selected points to estimate the query output. The model outputs the 3D deflection of the needle tip and the needle insertion force. The proposed approach is validated in ex vivo multilayered biological tissue in different needle insertion scenarios. Experimental results in five different case studies indicate an accuracy in predicting needle deflection of 0.81 and 1.24 mm in the horizontal and vertical lanes, respectively, and an accuracy of 0.5 N in predicting the needle insertion force over 216 needle insertions.
Using search engine query data to track pharmaceutical utilization: a study of statins.
Schuster, Nathaniel M; Rogers, Mary A M; McMahon, Laurence F
2010-08-01
To examine temporal and geographic associations between Google queries for health information and healthcare utilization benchmarks. Retrospective longitudinal study. Using Google Trends and Google Insights for Search data, the search terms Lipitor (atorvastatin calcium; Pfizer, Ann Arbor, MI) and simvastatin were evaluated for change over time and for association with Lipitor revenues. The relationship between query data and community-based resource use per Medicare beneficiary was assessed for 35 US metropolitan areas. Google queries for Lipitor significantly decreased from January 2004 through June 2009 and queries for simvastatin significantly increased (P <.001 for both), particularly after Lipitor came off patent (P <.001 for change in slope). The mean number of Google queries for Lipitor correlated (r = 0.98) with the percentage change in Lipitor global revenues from 2004 to 2008 (P <.001). Query preference for Lipitor over simvastatin was positively associated (r = 0.40) with a community's use of Medicare services. For every 1% increase in utilization of Medicare services in a community, there was a 0.2-unit increase in the ratio of Lipitor queries to simvastatin queries in that community (P = .02). Specific search engine queries for medical information correlate with pharmaceutical revenue and with overall healthcare utilization in a community. This suggests that search query data can track community-wide characteristics in healthcare utilization and have the potential for informing payers and policy makers regarding trends in utilization.
CSRQ: Communication-Efficient Secure Range Queries in Two-Tiered Sensor Networks
Dai, Hua; Ye, Qingqun; Yang, Geng; Xu, Jia; He, Ruiliang
2016-01-01
In recent years, we have seen many applications of secure query in two-tiered wireless sensor networks. Storage nodes are responsible for storing data from nearby sensor nodes and answering queries from Sink. It is critical to protect data security from a compromised storage node. In this paper, the Communication-efficient Secure Range Query (CSRQ)—a privacy and integrity preserving range query protocol—is proposed to prevent attackers from gaining information of both data collected by sensor nodes and queries issued by Sink. To preserve privacy and integrity, in addition to employing the encoding mechanisms, a novel data structure called encrypted constraint chain is proposed, which embeds the information of integrity verification. Sink can use this encrypted constraint chain to verify the query result. The performance evaluation shows that CSRQ has lower communication cost than the current range query protocols. PMID:26907293
SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases.
Schweiger, Dominik; Trajanoski, Zlatko; Pabinger, Stephan
2014-08-15
Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the just recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. This new graphical way of creating queries for biological Semantic Web databases considerably facilitates usability as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
A Cybernetic Design Methodology for 'Intelligent' Online Learning Support
NASA Astrophysics Data System (ADS)
Quinton, Stephen R.
The World Wide Web (WWW) provides learners and knowledge workers convenient access to vast stores of information, so much that present methods for refinement of a query or search result are inadequate - there is far too much potentially useful material. The problem often encountered is that users usually do not recognise what may be useful until they have progressed some way through the discovery, learning, and knowledge acquisition process. Additional support is needed to structure and identify potentially relevant information, and to provide constructive feedback. In short, support for learning is needed. The learning envisioned here is not simply the capacity to recall facts or to recognise objects. The focus is on learning that results in the construction of knowledge. Although most online learning platforms are efficient at delivering information, most do not provide tools that support learning as envisaged in this chapter. It is conceivable that Web-based learning environments can incorporate software systems that assist learners to form new associations between concepts and synthesise information to create new knowledge. This chapter details the rationale and theory behind a research study that aims to evolve Web-based learning environments into 'intelligent thinking' systems that respond to natural language human input. Rather than functioning simply as a means of delivering information, it is argued that online learning solutions will 1 day interact directly with students to support their conceptual thinking and cognitive development.
Chandra Source Catalog: User Interfaces
NASA Astrophysics Data System (ADS)
Bonaventura, Nina; Evans, I. N.; Harbo, P. N.; Rots, A. H.; Tibbetts, M. S.; Van Stone, D. W.; Zografou, P.; Anderson, C. S.; Chen, J. C.; Davis, J. E.; Doe, S. M.; Evans, J. D.; Fabbiano, G.; Galle, E.; Gibbs, D. G.; Glotfelty, K. J.; Grier, J. D.; Hain, R.; Hall, D. M.; He, X.; Houck, J. C.; Karovska, M.; Lauer, J.; McCollough, M. L.; McDowell, J. C.; Miller, J. B.; Mitschang, A. W.; Morgan, D. L.; Nichols, J. S.; Nowak, M. A.; Plummer, D. A.; Primini, F. A.; Refsdal, B. L.; Siemiginowska, A. L.; Sundheim, B. A.; Winkelman, S. L.
2010-03-01
The CSCview data mining interface is available for browsing the Chandra Source Catalog (CSC) and downloading tables of quality-assured source properties and data products. Once the desired source properties and search criteria are entered into the CSCview query form, the resulting source matches are returned in a table along with the values of the requested source properties for each source. (The catalog can be searched on any source property, not just position.) At this point, the table of search results may be saved to a text file, and the available data products for each source may be downloaded. CSCview save files are output in RDB-like and VOTable format. The available CSC data products include event files, spectra, lightcurves, and images, all of which are processed with the CIAO software. CSC data may also be accessed non-interactively with Unix command-line tools such as cURL and Wget, using ADQL 2.0 query syntax. In fact, CSCview features a separate ADQL query form for those who wish to specify this type of query within the GUI. Several interfaces are available for learning if a source is included in the catalog (in addition to CSCview): 1) the CSC interface to Sky in Google Earth shows the footprint of each Chandra observation on the sky, along with the CSC footprint for comparison (CSC source properties are also accessible when a source within a Chandra field-of-view is clicked); 2) the CSC Limiting Sensitivity online tool indicates if a source at an input celestial location was too faint for detection; 3) an IVOA Simple Cone Search interface locates all CSC sources within a specified radius of an R.A. and Dec.; and 4) the CSC-SDSS cross-match service returns the list of sources common to the CSC and SDSS, either all such sources or a subset based on search criteria.
Advanced Query and Data Mining Capabilities for MaROS
NASA Technical Reports Server (NTRS)
Wang, Paul; Wallick, Michael N.; Allard, Daniel A.; Gladden, Roy E.; Hy, Franklin H.
2013-01-01
The Mars Relay Operational Service (MaROS) comprises a number of tools to coordinate, plan, and visualize various aspects of the Mars Relay network. These levels include a Web-based user interface, a back-end "ReSTlet" built in Java, and databases that store the data as it is received from the network. As part of MaROS, the innovators have developed and implemented a feature set that operates on several levels of the software architecture. This new feature is an advanced querying capability through either the Web-based user interface, or through a back-end REST interface to access all of the data gathered from the network. This software is not meant to replace the REST interface, but to augment and expand the range of available data. The current REST interface provides specific data that is used by the MaROS Web application to display and visualize the information; however, the returned information from the REST interface has typically been pre-processed to return only a subset of the entire information within the repository, particularly only the information that is of interest to the GUI (graphical user interface). The new, advanced query and data mining capabilities allow users to retrieve the raw data and/or to perform their own data processing. The query language used to access the repository is a restricted subset of the structured query language (SQL) that can be built safely from the Web user interface, or entered as freeform SQL by a user. The results are returned in a CSV (Comma Separated Values) format for easy exporting to third party tools and applications that can be used for data mining or user-defined visualization and interpretation. This is the first time that a service is capable of providing access to all cross-project relay data from a single Web resource. Because MaROS contains the data for a variety of missions from the Mars network, which span both NASA and ESA, the software also establishes an access control list (ACL) on each data record in the database repository to enforce user access permissions through a multilayered approach.
Jung, HaRim; Song, MoonBae; Youn, Hee Yong; Kim, Ung Mo
2015-09-18
A content-matched (CM) rangemonitoring query overmoving objects continually retrieves the moving objects (i) whose non-spatial attribute values are matched to given non-spatial query values; and (ii) that are currently located within a given spatial query range. In this paper, we propose a new query indexing structure, called the group-aware query region tree (GQR-tree) for efficient evaluation of CMrange monitoring queries. The primary role of the GQR-tree is to help the server leverage the computational capabilities of moving objects in order to improve the system performance in terms of the wireless communication cost and server workload. Through a series of comprehensive simulations, we verify the superiority of the GQR-tree method over the existing methods.
Estimating Missing Features to Improve Multimedia Information Retrieval
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bagherjeiran, A; Love, N S; Kamath, C
Retrieval in a multimedia database usually involves combining information from different modalities of data, such as text and images. However, all modalities of the data may not be available to form the query. The retrieval results from such a partial query are often less than satisfactory. In this paper, we present an approach to complete a partial query by estimating the missing features in the query. Our experiments with a database of images and their associated captions show that, with an initial text-only query, our completion method has similar performance to a full query with both image and text features.more » In addition, when we use relevance feedback, our approach outperforms the results obtained using a full query.« less
Chemical reaction vector embeddings: towards predicting drug metabolism in the human gut microbiome.
Mallory, Emily K; Acharya, Ambika; Rensi, Stefano E; Turnbaugh, Peter J; Bright, Roselie A; Altman, Russ B
2018-01-01
Bacteria in the human gut have the ability to activate, inactivate, and reactivate drugs with both intended and unintended effects. For example, the drug digoxin is reduced to the inactive metabolite dihydrodigoxin by the gut Actinobacterium E. lenta, and patients colonized with high levels of drug metabolizing strains may have limited response to the drug. Understanding the complete space of drugs that are metabolized by the human gut microbiome is critical for predicting bacteria-drug relationships and their effects on individual patient response. Discovery and validation of drug metabolism via bacterial enzymes has yielded >50 drugs after nearly a century of experimental research. However, there are limited computational tools for screening drugs for potential metabolism by the gut microbiome. We developed a pipeline for comparing and characterizing chemical transformations using continuous vector representations of molecular structure learned using unsupervised representation learning. We applied this pipeline to chemical reaction data from MetaCyc to characterize the utility of vector representations for chemical reaction transformations. After clustering molecular and reaction vectors, we performed enrichment analyses and queries to characterize the space. We detected enriched enzyme names, Gene Ontology terms, and Enzyme Consortium (EC) classes within reaction clusters. In addition, we queried reactions against drug-metabolite transformations known to be metabolized by the human gut microbiome. The top results for these known drug transformations contained similar substructure modifications to the original drug pair. This work enables high throughput screening of drugs and their resulting metabolites against chemical reactions common to gut bacteria.
QBIC project: querying images by content, using color, texture, and shape
NASA Astrophysics Data System (ADS)
Niblack, Carlton W.; Barber, Ron; Equitz, Will; Flickner, Myron D.; Glasman, Eduardo H.; Petkovic, Dragutin; Yanker, Peter; Faloutsos, Christos; Taubin, Gabriel
1993-04-01
In the query by image content (QBIC) project we are studying methods to query large on-line image databases using the images' content as the basis of the queries. Examples of the content we use include color, texture, and shape of image objects and regions. Potential applications include medical (`Give me other images that contain a tumor with a texture like this one'), photo-journalism (`Give me images that have blue at the top and red at the bottom'), and many others in art, fashion, cataloging, retailing, and industry. Key issues include derivation and computation of attributes of images and objects that provide useful query functionality, retrieval methods based on similarity as opposed to exact match, query by image example or user drawn image, the user interfaces, query refinement and navigation, high dimensional database indexing, and automatic and semi-automatic database population. We currently have a prototype system written in X/Motif and C running on an RS/6000 that allows a variety of queries, and a test database of over 1000 images and 1000 objects populated from commercially available photo clip art images. In this paper we present the main algorithms for color texture, shape and sketch query that we use, show example query results, and discuss future directions.
a Novel Approach of Indexing and Retrieving Spatial Polygons for Efficient Spatial Region Queries
NASA Astrophysics Data System (ADS)
Zhao, J. H.; Wang, X. Z.; Wang, F. Y.; Shen, Z. H.; Zhou, Y. C.; Wang, Y. L.
2017-10-01
Spatial region queries are more and more widely used in web-based applications. Mechanisms to provide efficient query processing over geospatial data are essential. However, due to the massive geospatial data volume, heavy geometric computation, and high access concurrency, it is difficult to get response in real time. Spatial indexes are usually used in this situation. In this paper, based on k-d tree, we introduce a distributed KD-Tree (DKD-Tree) suitbable for polygon data, and a two-step query algorithm. The spatial index construction is recursive and iterative, and the query is an in memory process. Both the index and query methods can be processed in parallel, and are implemented based on HDFS, Spark and Redis. Experiments on a large volume of Remote Sensing images metadata have been carried out, and the advantages of our method are investigated by comparing with spatial region queries executed on PostgreSQL and PostGIS. Results show that our approach not only greatly improves the efficiency of spatial region query, but also has good scalability, Moreover, the two-step spatial range query algorithm can also save cluster resources to support a large number of concurrent queries. Therefore, this method is very useful when building large geographic information systems.
Secure Skyline Queries on Cloud Platform
Liu, Jinfei; Yang, Juncheng; Xiong, Li; Pei, Jian
2017-01-01
Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically-secure encryption. As a key subroutine, we present a new secure dominance protocol, which can be also used as a building block for other queries. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions. PMID:28883710
NASA Astrophysics Data System (ADS)
Indrayana, I. N. E.; P, N. M. Wirasyanti D.; Sudiartha, I. KG
2018-01-01
Mobile application allow many users to access data from the application without being limited to space, space and time. Over time the data population of this application will increase. Data access time will cause problems if the data record has reached tens of thousands to millions of records.The objective of this research is to maintain the performance of data execution for large data records. One effort to maintain data access time performance is to apply query optimization method. The optimization used in this research is query heuristic optimization method. The built application is a mobile-based financial application using MySQL database with stored procedure therein. This application is used by more than one business entity in one database, thus enabling rapid data growth. In this stored procedure there is an optimized query using heuristic method. Query optimization is performed on a “Select” query that involves more than one table with multiple clausa. Evaluation is done by calculating the average access time using optimized and unoptimized queries. Access time calculation is also performed on the increase of population data in the database. The evaluation results shown the time of data execution with query heuristic optimization relatively faster than data execution time without using query optimization.
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce
Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng
2016-01-01
The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS – a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing. PMID:27617325
Enhancement of plant metabolite fingerprinting by machine learning.
Scott, Ian M; Vermeer, Cornelia P; Liakata, Maria; Corol, Delia I; Ward, Jane L; Lin, Wanchang; Johnson, Helen E; Whitehead, Lynne; Kular, Baldeep; Baker, John M; Walsh, Sean; Dave, Anuja; Larson, Tony R; Graham, Ian A; Wang, Trevor L; King, Ross D; Draper, John; Beale, Michael H
2010-08-01
Metabolite fingerprinting of Arabidopsis (Arabidopsis thaliana) mutants with known or predicted metabolic lesions was performed by (1)H-nuclear magnetic resonance, Fourier transform infrared, and flow injection electrospray-mass spectrometry. Fingerprinting enabled processing of five times more plants than conventional chromatographic profiling and was competitive for discriminating mutants, other than those affected in only low-abundance metabolites. Despite their rapidity and complexity, fingerprints yielded metabolomic insights (e.g. that effects of single lesions were usually not confined to individual pathways). Among fingerprint techniques, (1)H-nuclear magnetic resonance discriminated the most mutant phenotypes from the wild type and Fourier transform infrared discriminated the fewest. To maximize information from fingerprints, data analysis was crucial. One-third of distinctive phenotypes might have been overlooked had data models been confined to principal component analysis score plots. Among several methods tested, machine learning (ML) algorithms, namely support vector machine or random forest (RF) classifiers, were unsurpassed for phenotype discrimination. Support vector machines were often the best performing classifiers, but RFs yielded some particularly informative measures. First, RFs estimated margins between mutant phenotypes, whose relations could then be visualized by Sammon mapping or hierarchical clustering. Second, RFs provided importance scores for the features within fingerprints that discriminated mutants. These scores correlated with analysis of variance F values (as did Kruskal-Wallis tests, true- and false-positive measures, mutual information, and the Relief feature selection algorithm). ML classifiers, as models trained on one data set to predict another, were ideal for focused metabolomic queries, such as the distinctiveness and consistency of mutant phenotypes. Accessible software for use of ML in plant physiology is highlighted.
A novel method for efficient archiving and retrieval of biomedical images using MPEG-7
NASA Astrophysics Data System (ADS)
Meyer, Joerg; Pahwa, Ash
2004-10-01
Digital archiving and efficient retrieval of radiological scans have become critical steps in contemporary medical diagnostics. Since more and more images and image sequences (single scans or video) from various modalities (CT/MRI/PET/digital X-ray) are now available in digital formats (e.g., DICOM-3), hospitals and radiology clinics need to implement efficient protocols capable of managing the enormous amounts of data generated daily in a typical clinical routine. We present a method that appears to be a viable way to eliminate the tedious step of manually annotating image and video material for database indexing. MPEG-7 is a new framework that standardizes the way images are characterized in terms of color, shape, and other abstract, content-related criteria. A set of standardized descriptors that are automatically generated from an image is used to compare an image to other images in a database, and to compute the distance between two images for a given application domain. Text-based database queries can be replaced with image-based queries using MPEG-7. Consequently, image queries can be conducted without any prior knowledge of the keys that were used as indices in the database. Since the decoding and matching steps are not part of the MPEG-7 standard, this method also enables searches that were not planned by the time the keys were generated.
Nosql for Storage and Retrieval of Large LIDAR Data Collections
NASA Astrophysics Data System (ADS)
Boehm, J.; Liu, K.
2015-08-01
Developments in LiDAR technology over the past decades have made LiDAR to become a mature and widely accepted source of geospatial information. This in turn has led to an enormous growth in data volume. The central idea for a file-centric storage of LiDAR point clouds is the observation that large collections of LiDAR data are typically delivered as large collections of files, rather than single files of terabyte size. This split of the dataset, commonly referred to as tiling, was usually done to accommodate a specific processing pipeline. It makes therefore sense to preserve this split. A document oriented NoSQL database can easily emulate this data partitioning, by representing each tile (file) in a separate document. The document stores the metadata of the tile. The actual files are stored in a distributed file system emulated by the NoSQL database. We demonstrate the use of MongoDB a highly scalable document oriented NoSQL database for storing large LiDAR files. MongoDB like any NoSQL database allows for queries on the attributes of the document. As a specialty MongoDB also allows spatial queries. Hence we can perform spatial queries on the bounding boxes of the LiDAR tiles. Inserting and retrieving files on a cloud-based database is compared to native file system and cloud storage transfer speed.
The EarthServer Federation: State, Role, and Contribution to GEOSS
NASA Astrophysics Data System (ADS)
Merticariu, Vlad; Baumann, Peter
2016-04-01
The intercontinental EarthServer initiative has established a European datacube platform with proven scalability: known databases exceed 100 TB, and single queries have been split across more than 1,000 cloud nodes. Its service interface being rigorously based on the OGC "Big Geo Data" standards, Web Coverage Service (WCS) and Web Coverage Processing Service (WCPS), a series of clients can dock into the services, ranging from open-source OpenLayers and QGIS over open-source NASA WorldWind to proprietary ESRI ArcGIS. Datacube fusion in a "mix and match" style is supported by the platform technolgy, the rasdaman Array Database System, which transparently federates queries so that users simply approach any node of the federation to access any data item, internally optimized for minimal data transfer. Notably, rasdaman is part of GEOSS GCI. NASA is contributing its Web WorldWind virtual globe for user-friendly data extraction, navigation, and analysis. Integrated datacube / metadata queries are contributed by CITE. Current federation members include ESA (managed by MEEO sr.l.), Plymouth Marine Laboratory (PML), the European Centre for Medium-Range Weather Forecast (ECMWF), Australia's National Computational Infrastructure, and Jacobs University (adding in Planetary Science). Further data centers have expressed interest in joining. We present the EarthServer approach, discuss its underlying technology, and illustrate the contribution this datacube platform can make to GEOSS.
STBase: One Million Species Trees for Comparative Biology
McMahon, Michelle M.; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael J.
2015-01-01
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user’s query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees. PMID:25679219
Multi-Case Knowledge-Based IMRT Treatment Planning in Head and Neck Cancer
NASA Astrophysics Data System (ADS)
Grzetic, Shelby Mariah
Head and neck cancer (HNC) IMRT treatment planning is a challenging process that relies heavily on the planner's experience. Previously, we used the single, best match from a library of manually planned cases to semi-automatically generate IMRT plans for a new patient. The current multi-case Knowledge Based Radiation Therapy (MC-KBRT) study utilized different matching cases for each of six individual organs-at-risk (OARs), then combined those six cases to create the new treatment plan. From a database of 103 patient plans created by experienced planners, MC-KBRT plans were created for 40 (17 unilateral and 23 bilateral) HNC "query" patients. For each case, 2D beam's-eye-view images were used to find similar geometric "match" patients separately for each of 6 OARs. Dose distributions for each OAR from the 6 matching cases were combined and then warped to suit the query case's geometry. The dose-volume constraints were used to create the new query treatment plan without the need for human decision-making throughout the IMRT optimization. The optimized MC-KBRT plans were compared against the clinically approved plans and Version 1 (previous KBRT using only one matching case with dose warping) using the dose metrics: mean, median, and maximum (brainstem and cord+5mm) doses. Compared to Version 1, MC-KBRT had no significant reduction of the dose to any of the OARs in either unilateral or bilateral cases. Compared to the manually planned unilateral cases, there was significant reduction of the oral cavity mean/median dose (>2Gy) at the expense of the contralateral parotid. Compared to the manually planned bilateral cases, reduction of dose was significant in the ipsilateral parotid, larynx, and oral cavity (>3Gy mean/median) while maintaining PTV coverage. MC-KBRT planning in head and neck cancer generates IMRT plans with better dose sparing than manually created plans. MC-KBRT using multiple case matches does not show significant dose reduction compared to using a single match case with dose warping.
SPSmart: adapting population based SNP genotype databases for fast and comprehensive web access.
Amigo, Jorge; Salas, Antonio; Phillips, Christopher; Carracedo, Angel
2008-10-10
In the last five years large online resources of human variability have appeared, notably HapMap, Perlegen and the CEPH foundation. These databases of genotypes with population information act as catalogues of human diversity, and are widely used as reference sources for population genetics studies. Although many useful conclusions may be extracted by querying databases individually, the lack of flexibility for combining data from within and between each database does not allow the calculation of key population variability statistics. We have developed a novel tool for accessing and combining large-scale genomic databases of single nucleotide polymorphisms (SNPs) in widespread use in human population genetics: SPSmart (SNPs for Population Studies). A fast pipeline creates and maintains a data mart from the most commonly accessed databases of genotypes containing population information: data is mined, summarized into the standard statistical reference indices, and stored into a relational database that currently handles as many as 4 x 10(9) genotypes and that can be easily extended to new database initiatives. We have also built a web interface to the data mart that allows the browsing of underlying data indexed by population and the combining of populations, allowing intuitive and straightforward comparison of population groups. All the information served is optimized for web display, and most of the computations are already pre-processed in the data mart to speed up the data browsing and any computational treatment requested. In practice, SPSmart allows populations to be combined into user-defined groups, while multiple databases can be accessed and compared in a few simple steps from a single query. It performs the queries rapidly and gives straightforward graphical summaries of SNP population variability through visual inspection of allele frequencies outlined in standard pie-chart format. In addition, full numerical description of the data is output in statistical results panels that include common population genetics metrics such as heterozygosity, Fst and In.
A Customizable Dashboarding System for Watershed Model Interpretation
NASA Astrophysics Data System (ADS)
Easton, Z. M.; Collick, A.; Wagena, M. B.; Sommerlot, A.; Fuka, D.
2017-12-01
Stakeholders, including policymakers, agricultural water managers, and small farm managers, can benefit from the outputs of commonly run watershed models. However, the information that each stakeholder needs is be different. While policy makers are often interested in the broader effects that small farm management may have on a watershed during extreme events or over long periods, farmers are often interested in field specific effects at daily or seasonal period. To provide stakeholders with the ability to analyze and interpret data from large scale watershed models, we have developed a framework that can support custom exploration of the large datasets produced. For the volume of data produced by these models, SQL-based data queries are not efficient; thus, we employ a "Not Only SQL" (NO-SQL) query language, which allows data to scale in both quantity and query volumes. We demonstrate a stakeholder customizable Dashboarding system that allows stakeholders to create custom `dashboards' to summarize model output specific to their needs. Dashboarding is a dynamic and purpose-based visual interface needed to display one-to-many database linkages so that the information can be presented for a single time period or dynamically monitored over time and allows a user to quickly define focus areas of interest for their analysis. We utilize a single watershed model that is run four times daily with a combined set of climate projections, which are then indexed, and added to an ElasticSearch datastore. ElasticSearch is a NO-SQL search engine built on top of Apache Lucene, a free and open-source information retrieval software library. Aligned with the ElasticSearch project is the open source visualization and analysis system, Kibana, which we utilize for custom stakeholder dashboarding. The dashboards create a visualization of the stakeholder selected analysis and can be extended to recommend robust strategies to support decision-making.
Observation Data Model Core Components, its Implementation in the Table Access Protocol Version 1.1
NASA Astrophysics Data System (ADS)
Louys, Mireille; Tody, Doug; Dowler, Patrick; Durand, Daniel; Michel, Laurent; Bonnarel, Francos; Micol, Alberto; IVOA DataModel Working Group; Louys, Mireille; Tody, Doug; Dowler, Patrick; Durand, Daniel
2017-05-01
This document defines the core components of the Observation data model that are necessary to perform data discovery when querying data centers for astronomical observations of interest. It exposes use-cases to be carried out, explains the model and provides guidelines for its implementation as a data access service based on the Table Access Protocol (TAP). It aims at providing a simple model easy to understand and to implement by data providers that wish to publish their data into the Virtual Observatory. This interface integrates data modeling and data access aspects in a single service and is named ObsTAP. It will be referenced as such in the IVOA registries. In this document, the Observation Data Model Core Components (ObsCoreDM) defines the core components of queryable metadata required for global discovery of observational data. It is meant to allow a single query to be posed to TAP services at multiple sites to perform global data discovery without having to understand the details of the services present at each site. It defines a minimal set of basic metadata and thus allows for a reasonable cost of implementation by data providers. The combination of the ObsCoreDM with TAP is referred to as an ObsTAP service. As with most of the VO Data Models, ObsCoreDM makes use of STC, Utypes, Units and UCDs. The ObsCoreDM can be serialized as a VOTable. ObsCoreDM can make reference to more complete data models such as Characterisation DM, Spectrum DM or Simple Spectral Line Data Model (SSLDM). ObsCore shares a large set of common concepts with DataSet Metadata Data Model (Cresitello-Dittmar et al. 2016) which binds together most of the data model concepts from the above models in a comprehensive and more general frame work. This current specification on the contrary provides guidelines for implementing these concepts using the TAP protocol and answering ADQL queries. It is dedicated to global discovery.
A high performance, ad-hoc, fuzzy query processing system for relational databases
NASA Technical Reports Server (NTRS)
Mansfield, William H., Jr.; Fleischman, Robert M.
1992-01-01
Database queries involving imprecise or fuzzy predicates are currently an evolving area of academic and industrial research. Such queries place severe stress on the indexing and I/O subsystems of conventional database environments since they involve the search of large numbers of records. The Datacycle architecture and research prototype is a database environment that uses filtering technology to perform an efficient, exhaustive search of an entire database. It has recently been modified to include fuzzy predicates in its query processing. The approach obviates the need for complex index structures, provides unlimited query throughput, permits the use of ad-hoc fuzzy membership functions, and provides a deterministic response time largely independent of query complexity and load. This paper describes the Datacycle prototype implementation of fuzzy queries and some recent performance results.
Jung, HaRim; Song, MoonBae; Youn, Hee Yong; Kim, Ung Mo
2015-01-01
A content-matched (CM) range monitoring query over moving objects continually retrieves the moving objects (i) whose non-spatial attribute values are matched to given non-spatial query values; and (ii) that are currently located within a given spatial query range. In this paper, we propose a new query indexing structure, called the group-aware query region tree (GQR-tree) for efficient evaluation of CM range monitoring queries. The primary role of the GQR-tree is to help the server leverage the computational capabilities of moving objects in order to improve the system performance in terms of the wireless communication cost and server workload. Through a series of comprehensive simulations, we verify the superiority of the GQR-tree method over the existing methods. PMID:26393613
Systems and methods for an extensible business application framework
NASA Technical Reports Server (NTRS)
Bell, David G. (Inventor); Crawford, Michael (Inventor)
2012-01-01
Method and systems for editing data from a query result include requesting a query result using a unique collection identifier for a collection of individual files and a unique identifier for a configuration file that specifies a data structure for the query result. A query result is generated that contains a plurality of fields as specified by the configuration file, by combining each of the individual files associated with a unique identifier for a collection of individual files. The query result data is displayed with a plurality of labels as specified in the configuration file. Edits can be performed by querying a collection of individual files using the configuration file, editing a portion of the query result, and transmitting only the edited information for storage back into a data repository.
Analysis of Information Needs of Users of MEDLINEplus, 2002 – 2003
Scott-Wright, Alicia; Crowell, Jon; Zeng, Qing; Bates, David W.; Greenes, Robert
2006-01-01
We analyzed query logs from use of MEDLINEplus to answer the questions: Are consumers’ health information needs stable over time? and To what extent do users’ queries change over time? To determine log stability, we assessed an Overlap Rate (OR) defined as the number of unique queries common to two adjacent months divided by the total number of unique queries in those months. All exactly matching queries were considered as one unique query. We measured ORs for the top 10 and 100 unique queries of a month and compared these to ORs for the following month. Over ten months, users submitted 12,234,737 queries; only 2,179,571 (17.8%) were unique and these had a mean word count of 2.73 (S.D., 0.24); 121 of 137 (88.3%) unique queries each comprised of exactly matching search term(s) used at least 5000 times were of only one word. We could predict with 95% confidence that the monthly OR for the top 100 unique queries would lie between 67% – 87% when compared with the top 100 from the previous month. The mean month-to-month OR for top 10 queries was 62% (S.D., 20%) indicating significant variability; the lowest OR of 33% between the top 10 in Mar. compared to Apr. was likely due to “new” interest in information about SARS pneumonia in Apr. 2003. Consumers’ health information needs are relatively stable and the 100 most common unique queries are about 77% the same from month to month. Website sponsors should provide a broad range of information about a relatively stable number of topics. Analyses of log similarity may identify media-induced, cyclical, or seasonal changes in areas of consumer interest. PMID:17238431
Big Data and Dysmenorrhea: What Questions Do Women and Men Ask About Menstrual Pain?
Chen, Chen X; Groves, Doyle; Miller, Wendy R; Carpenter, Janet S
2018-04-30
Menstrual pain is highly prevalent among women of reproductive age. As the general public increasingly obtains health information online, Big Data from online platforms provide novel sources to understand the public's perspectives and information needs about menstrual pain. The study's purpose was to describe salient queries about dysmenorrhea using Big Data from a question and answer platform. We performed text-mining of 1.9 billion queries from ChaCha, a United States-based question and answer platform. Dysmenorrhea-related queries were identified by using keyword searching. Each relevant query was split into token words (i.e., meaningful words or phrases) and stop words (i.e., not meaningful functional words). Word Adjacency Graph (WAG) modeling was used to detect clusters of queries and visualize the range of dysmenorrhea-related topics. We constructed two WAG models respectively from queries by women of reproductive age and bymen. Salient themes were identified through inspecting clusters of WAG models. We identified two subsets of queries: Subset 1 contained 507,327 queries from women aged 13-50 years. Subset 2 contained 113,888 queries from men aged 13 or above. WAG modeling revealed topic clusters for each subset. Between female and male subsets, topic clusters overlapped on dysmenorrhea symptoms and management. Among female queries, there were distinctive topics on approaching menstrual pain at school and menstrual pain-related conditions; while among male queries, there was a distinctive cluster of queries on menstrual pain from male's perspectives. Big Data mining of the ChaCha ® question and answer service revealed a series of information needs among women and men on menstrual pain. Findings may be useful in structuring the content and informing the delivery platform for educational interventions.
Lab Streaming Layer Enabled Myo Data Collection Software User Manual
2017-06-07
time - series data over a local network. LSL handles the networking, time -synchronization, (near-) real- time access as well as, optionally, the... series data collection (e.g., brain activity, heart activity, muscle activity) using the LSL application programming interface (API). Time -synchronized...saved to a single extensible data format (XDF) file. Once the time - series data are collected in a Lab Recorder XDF file, users will be able to query
Relational Algebra and SQL: Better Together
ERIC Educational Resources Information Center
McMaster, Kirby; Sambasivam, Samuel; Hadfield, Steven; Wolthuis, Stuart
2013-01-01
In this paper, we describe how database instructors can teach Relational Algebra and Structured Query Language together through programming. Students write query programs consisting of sequences of Relational Algebra operations vs. Structured Query Language SELECT statements. The query programs can then be run interactively, allowing students to…
A Firefly Algorithm-based Approach for Pseudo-Relevance Feedback: Application to Medical Database.
Khennak, Ilyes; Drias, Habiba
2016-11-01
The difficulty of disambiguating the sense of the incomplete and imprecise keywords that are extensively used in the search queries has caused the failure of search systems to retrieve the desired information. One of the most powerful and promising method to overcome this shortcoming and improve the performance of search engines is Query Expansion, whereby the user's original query is augmented by new keywords that best characterize the user's information needs and produce more useful query. In this paper, a new Firefly Algorithm-based approach is proposed to enhance the retrieval effectiveness of query expansion while maintaining low computational complexity. In contrast to the existing literature, the proposed approach uses a Firefly Algorithm to find the best expanded query among a set of expanded query candidates. Moreover, this new approach allows the determination of the length of the expanded query empirically. Experimental results on MEDLINE, the on-line medical information database, show that our proposed approach is more effective and efficient compared to the state-of-the-art.
RiPPAS: A Ring-Based Privacy-Preserving Aggregation Scheme in Wireless Sensor Networks
Zhang, Kejia; Han, Qilong; Cai, Zhipeng; Yin, Guisheng
2017-01-01
Recently, data privacy in wireless sensor networks (WSNs) has been paid increased attention. The characteristics of WSNs determine that users’ queries are mainly aggregation queries. In this paper, the problem of processing aggregation queries in WSNs with data privacy preservation is investigated. A Ring-based Privacy-Preserving Aggregation Scheme (RiPPAS) is proposed. RiPPAS adopts ring structure to perform aggregation. It uses pseudonym mechanism for anonymous communication and uses homomorphic encryption technique to add noise to the data easily to be disclosed. RiPPAS can handle both sum() queries and min()/max() queries, while the existing privacy-preserving aggregation methods can only deal with sum() queries. For processing sum() queries, compared with the existing methods, RiPPAS has advantages in the aspects of privacy preservation and communication efficiency, which can be proved by theoretical analysis and simulation results. For processing min()/max() queries, RiPPAS provides effective privacy preservation and has low communication overhead. PMID:28178197
The Knowledge Structure in Amarakośa
NASA Astrophysics Data System (ADS)
Nair, Sivaja S.; Kulkarni, Amba
Amarakośa is the most celebrated and authoritative ancient thesaurus of Sanskrit. It is one of the books which an Indian child learning through Indian traditional educational system memorizes as early as his first year of formal learning. Though it appears as a linear list of words, close inspection of it shows a rich organisation of words expressing various relations a word bears with other words. Thus when a child studies Amarakośa further, the linear list of words unfolds into a knowledge web. In this paper we describe our effort to make the implicit knowledge in Amarakośa explicit. A model for storing such structure is discussed and a web tool is described that answers the queries by reconstructing the links among words from the structured tables dynamically.
Dynamic Querying of Mass-Storage RDF Data with Rule-Based Entailment Regimes
NASA Astrophysics Data System (ADS)
Ianni, Giovambattista; Krennwallner, Thomas; Martello, Alessandra; Polleres, Axel
RDF Schema (RDFS) as a lightweight ontology language is gaining popularity and, consequently, tools for scalable RDFS inference and querying are needed. SPARQL has become recently a W3C standard for querying RDF data, but it mostly provides means for querying simple RDF graphs only, whereas querying with respect to RDFS or other entailment regimes is left outside the current specification. In this paper, we show that SPARQL faces certain unwanted ramifications when querying ontologies in conjunction with RDF datasets that comprise multiple named graphs, and we provide an extension for SPARQL that remedies these effects. Moreover, since RDFS inference has a close relationship with logic rules, we generalize our approach to select a custom ruleset for specifying inferences to be taken into account in a SPARQL query. We show that our extensions are technically feasible by providing benchmark results for RDFS querying in our prototype system GiaBATA, which uses Datalog coupled with a persistent Relational Database as a back-end for implementing SPARQL with dynamic rule-based inference. By employing different optimization techniques like magic set rewriting our system remains competitive with state-of-the-art RDFS querying systems.
Reconstruction based finger-knuckle-print verification with score level adaptive binary fusion.
Gao, Guangwei; Zhang, Lei; Yang, Jian; Zhang, Lin; Zhang, David
2013-12-01
Recently, a new biometrics identifier, namely finger knuckle print (FKP), has been proposed for personal authentication with very interesting results. One of the advantages of FKP verification lies in its user friendliness in data collection. However, the user flexibility in positioning fingers also leads to a certain degree of pose variations in the collected query FKP images. The widely used Gabor filtering based competitive coding scheme is sensitive to such variations, resulting in many false rejections. We propose to alleviate this problem by reconstructing the query sample with a dictionary learned from the template samples in the gallery set. The reconstructed FKP image can reduce much the enlarged matching distance caused by finger pose variations; however, both the intra-class and inter-class distances will be reduced. We then propose a score level adaptive binary fusion rule to adaptively fuse the matching distances before and after reconstruction, aiming to reduce the false rejections without increasing much the false acceptances. Experimental results on the benchmark PolyU FKP database show that the proposed method significantly improves the FKP verification accuracy.
MetaSEEk: a content-based metasearch engine for images
NASA Astrophysics Data System (ADS)
Beigi, Mandis; Benitez, Ana B.; Chang, Shih-Fu
1997-12-01
Search engines are the most powerful resources for finding information on the rapidly expanding World Wide Web (WWW). Finding the desired search engines and learning how to use them, however, can be very time consuming. The integration of such search tools enables the users to access information across the world in a transparent and efficient manner. These systems are called meta-search engines. The recent emergence of visual information retrieval (VIR) search engines on the web is leading to the same efficiency problem. This paper describes and evaluates MetaSEEk, a content-based meta-search engine used for finding images on the Web based on their visual information. MetaSEEk is designed to intelligently select and interface with multiple on-line image search engines by ranking their performance for different classes of user queries. User feedback is also integrated in the ranking refinement. We compare MetaSEEk with a base line version of meta-search engine, which does not use the past performance of the different search engines in recommending target search engines for future queries.
Applying Query Structuring in Cross-language Retrieval.
ERIC Educational Resources Information Center
Pirkola, Ari; Puolamaki, Deniz; Jarvelin, Kalervo
2003-01-01
Explores ways to apply query structuring in cross-language information retrieval. Tested were: English queries translated into Finnish using an electronic dictionary, and run in a Finnish newspaper databases; effects of compound-based structuring using a proximity operator for translation equivalents of query language compound components; and a…
Querying and Ranking XML Documents.
ERIC Educational Resources Information Center
Schlieder, Torsten; Meuss, Holger
2002-01-01
Discussion of XML, information retrieval, precision, and recall focuses on a retrieval technique that adopts the similarity measure of the vector space model, incorporates the document structure, and supports structured queries. Topics include a query model based on tree matching; structured queries and term-based ranking; and term frequency and…
Advanced Query Formulation in Deductive Databases.
ERIC Educational Resources Information Center
Niemi, Timo; Jarvelin, Kalervo
1992-01-01
Discusses deductive databases and database management systems (DBMS) and introduces a framework for advanced query formulation for end users. Recursive processing is described, a sample extensional database is presented, query types are explained, and criteria for advanced query formulation from the end user's viewpoint are examined. (31…
A Semantic Graph Query Language
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kaplan, I L
2006-10-16
Semantic graphs can be used to organize large amounts of information from a number of sources into one unified structure. A semantic query language provides a foundation for extracting information from the semantic graph. The graph query language described here provides a simple, powerful method for querying semantic graphs.
PropBase Query Layer: a single portal to UK subsurface physical property databases
NASA Astrophysics Data System (ADS)
Kingdon, Andrew; Nayembil, Martin L.; Richardson, Anne E.; Smith, A. Graham
2013-04-01
Until recently, the delivery of geological information for industry and public was achieved by geological mapping. Now pervasively available computers mean that 3D geological models can deliver realistic representations of the geometric location of geological units, represented as shells or volumes. The next phase of this process is to populate these with physical properties data that describe subsurface heterogeneity and its associated uncertainty. Achieving this requires capture and serving of physical, hydrological and other property information from diverse sources to populate these models. The British Geological Survey (BGS) holds large volumes of subsurface property data, derived both from their own research data collection and also other, often commercially derived data sources. This can be voxelated to incorporate this data into the models to demonstrate property variation within the subsurface geometry. All property data held by BGS has for many years been stored in relational databases to ensure their long-term continuity. However these have, by necessity, complex structures; each database contains positional reference data and model information, and also metadata such as sample identification information and attributes that define the source and processing. Whilst this is critical to assessing these analyses, it also hugely complicates the understanding of variability of the property under assessment and requires multiple queries to study related datasets making extracting physical properties from these databases difficult. Therefore the PropBase Query Layer has been created to allow simplified aggregation and extraction of all related data and its presentation of complex data in simple, mostly denormalized, tables which combine information from multiple databases into a single system. The structure from each relational database is denormalized in a generalised structure, so that each dataset can be viewed together in a common format using a simple interface. Data are re-engineered to facilitate easy loading. The query layer structure comprises tables, procedures, functions, triggers, views and materialised views. The structure contains a main table PRB_DATA which contains all of the data with the following attribution: • a unique identifier • the data source • the unique identifier from the parent database for traceability • the 3D location • the property type • the property value • the units • necessary qualifiers • precision information and an audit trail Data sources, property type and units are constrained by dictionaries, a key component of the structure which defines what properties and inheritance hierarchies are to be coded and also guides the process as to what and how these are extracted from the structure. Data types served by the Query Layer include site investigation derived geotechnical data, hydrogeology datasets, regional geochemistry, geophysical logs as well as lithological and borehole metadata. The size and complexity of the data sets with multiple parent structures requires a technically robust approach to keep the layer synchronised. This is achieved through Oracle procedures written in PL/SQL containing the logic required to carry out the data manipulation (inserts, updates, deletes) to keep the layer synchronised with the underlying databases either as regular scheduled jobs (weekly, monthly etc) or invoked on demand. The PropBase Query Layer's implementation has enabled rapid data discovery, visualisation and interpretation of geological data with greater ease, simplifying the parametrisation of 3D model volumes and facilitating the study of intra-unit heterogeneity.
Harris, Daniel R.; Henderson, Darren W.; Kavuluru, Ramakanth; Stromberg, Arnold J.; Johnson, Todd R.
2015-01-01
We present a custom, Boolean query generator utilizing common-table expressions (CTEs) that is capable of scaling with big datasets. The generator maps user-defined Boolean queries, such as those interactively created in clinical-research and general-purpose healthcare tools, into SQL. We demonstrate the effectiveness of this generator by integrating our work into the Informatics for Integrating Biology and the Bedside (i2b2) query tool and show that it is capable of scaling. Our custom generator replaces and outperforms the default query generator found within the Clinical Research Chart (CRC) cell of i2b2. In our experiments, sixteen different types of i2b2 queries were identified by varying four constraints: date, frequency, exclusion criteria, and whether selected concepts occurred in the same encounter. We generated non-trivial, random Boolean queries based on these 16 types; the corresponding SQL queries produced by both generators were compared by execution times. The CTE-based solution significantly outperformed the default query generator and provided a much more consistent response time across all query types (M=2.03, SD=6.64 vs. M=75.82, SD=238.88 seconds). Without costly hardware upgrades, we provide a scalable solution based on CTEs with very promising empirical results centered on performance gains. The evaluation methodology used for this provides a means of profiling clinical data warehouse performance. PMID:25192572
RCQ-GA: RDF Chain Query Optimization Using Genetic Algorithms
NASA Astrophysics Data System (ADS)
Hogenboom, Alexander; Milea, Viorel; Frasincar, Flavius; Kaymak, Uzay
The application of Semantic Web technologies in an Electronic Commerce environment implies a need for good support tools. Fast query engines are needed for efficient querying of large amounts of data, usually represented using RDF. We focus on optimizing a special class of SPARQL queries, the so-called RDF chain queries. For this purpose, we devise a genetic algorithm called RCQ-GA that determines the order in which joins need to be performed for an efficient evaluation of RDF chain queries. The approach is benchmarked against a two-phase optimization algorithm, previously proposed in literature. The more complex a query is, the more RCQ-GA outperforms the benchmark in solution quality, execution time needed, and consistency of solution quality. When the algorithms are constrained by a time limit, the overall performance of RCQ-GA compared to the benchmark further improves.
Query Language for Location-Based Services: A Model Checking Approach
NASA Astrophysics Data System (ADS)
Hoareau, Christian; Satoh, Ichiro
We present a model checking approach to the rationale, implementation, and applications of a query language for location-based services. Such query mechanisms are necessary so that users, objects, and/or services can effectively benefit from the location-awareness of their surrounding environment. The underlying data model is founded on a symbolic model of space organized in a tree structure. Once extended to a semantic model for modal logic, we regard location query processing as a model checking problem, and thus define location queries as hybrid logicbased formulas. Our approach is unique to existing research because it explores the connection between location models and query processing in ubiquitous computing systems, relies on a sound theoretical basis, and provides modal logic-based query mechanisms for expressive searches over a decentralized data structure. A prototype implementation is also presented and will be discussed.
Towards Building a High Performance Spatial Query System for Large Scale Medical Imaging Data.
Aji, Ablimit; Wang, Fusheng; Saltz, Joel H
2012-11-06
Support of high performance queries on large volumes of scientific spatial data is becoming increasingly important in many applications. This growth is driven by not only geospatial problems in numerous fields, but also emerging scientific applications that are increasingly data- and compute-intensive. For example, digital pathology imaging has become an emerging field during the past decade, where examination of high resolution images of human tissue specimens enables more effective diagnosis, prediction and treatment of diseases. Systematic analysis of large-scale pathology images generates tremendous amounts of spatially derived quantifications of micro-anatomic objects, such as nuclei, blood vessels, and tissue regions. Analytical pathology imaging provides high potential to support image based computer aided diagnosis. One major requirement for this is effective querying of such enormous amount of data with fast response, which is faced with two major challenges: the "big data" challenge and the high computation complexity. In this paper, we present our work towards building a high performance spatial query system for querying massive spatial data on MapReduce. Our framework takes an on demand index building approach for processing spatial queries and a partition-merge approach for building parallel spatial query pipelines, which fits nicely with the computing model of MapReduce. We demonstrate our framework on supporting multi-way spatial joins for algorithm evaluation and nearest neighbor queries for microanatomic objects. To reduce query response time, we propose cost based query optimization to mitigate the effect of data skew. Our experiments show that the framework can efficiently support complex analytical spatial queries on MapReduce.
Towards Building a High Performance Spatial Query System for Large Scale Medical Imaging Data
Aji, Ablimit; Wang, Fusheng; Saltz, Joel H.
2013-01-01
Support of high performance queries on large volumes of scientific spatial data is becoming increasingly important in many applications. This growth is driven by not only geospatial problems in numerous fields, but also emerging scientific applications that are increasingly data- and compute-intensive. For example, digital pathology imaging has become an emerging field during the past decade, where examination of high resolution images of human tissue specimens enables more effective diagnosis, prediction and treatment of diseases. Systematic analysis of large-scale pathology images generates tremendous amounts of spatially derived quantifications of micro-anatomic objects, such as nuclei, blood vessels, and tissue regions. Analytical pathology imaging provides high potential to support image based computer aided diagnosis. One major requirement for this is effective querying of such enormous amount of data with fast response, which is faced with two major challenges: the “big data” challenge and the high computation complexity. In this paper, we present our work towards building a high performance spatial query system for querying massive spatial data on MapReduce. Our framework takes an on demand index building approach for processing spatial queries and a partition-merge approach for building parallel spatial query pipelines, which fits nicely with the computing model of MapReduce. We demonstrate our framework on supporting multi-way spatial joins for algorithm evaluation and nearest neighbor queries for microanatomic objects. To reduce query response time, we propose cost based query optimization to mitigate the effect of data skew. Our experiments show that the framework can efficiently support complex analytical spatial queries on MapReduce. PMID:24501719
Query Expansion and Query Translation as Logical Inference.
ERIC Educational Resources Information Center
Nie, Jian-Yun
2003-01-01
Examines query expansion during query translation in cross language information retrieval and develops a general framework for inferential information retrieval in two particular contexts: using fuzzy logic and probability theory. Obtains evaluation formulas that are shown to strongly correspond to those used in other information retrieval models.…
End-User Use of Data Base Query Language: Pros and Cons.
ERIC Educational Resources Information Center
Nicholes, Walter
1988-01-01
Man-machine interface, the concept of a computer "query," a review of database technology, and a description of the use of query languages at Brigham Young University are discussed. The pros and cons of end-user use of database query languages are explored. (Author/MLW)
Information Retrieval Using UMLS-based Structured Queries
Fagan, Lawrence M.; Berrios, Daniel C.; Chan, Albert; Cucina, Russell; Datta, Anupam; Shah, Maulik; Surendran, Sujith
2001-01-01
During the last three years, we have developed and described components of ELBook, a semantically based information-retrieval system [1-4]. Using these components, domain experts can specify a query model, indexers can use the query model to index documents, and end-users can search these documents for instances of indexed queries.
A Relational Algebra Query Language for Programming Relational Databases
ERIC Educational Resources Information Center
McMaster, Kirby; Sambasivam, Samuel; Anderson, Nicole
2011-01-01
In this paper, we describe a Relational Algebra Query Language (RAQL) and Relational Algebra Query (RAQ) software product we have developed that allows database instructors to teach relational algebra through programming. Instead of defining query operations using mathematical notation (the approach commonly taken in database textbooks), students…
The Approximability of Learning and Constraint Satisfaction Problems
2010-10-07
further improved this result to NP ⊆ naPCP1,3/4+²(O(log(n)),3). Around the same time, Zwick [141] showed that naPCP1,5/8(O(log(n)),3)⊆ BPP by giving a...randomized polynomial-time 5/8-approximation algorithm for satisfiable 3CSP. Therefore unless NP⊆ BPP , the best s must be bigger than 5/8. Zwick... BPP [141]. We think that Question 5.1.2 addresses an important missing part in understanding the 3-query PCP systems. In addition, as is mentioned the
Indexing and retrieval of multimedia objects at different levels of granularity
NASA Astrophysics Data System (ADS)
Faudemay, Pascal; Durand, Gwenael; Seyrat, Claude; Tondre, Nicolas
1998-10-01
Intelligent access to multimedia databases for `naive user' should probably be based on queries formulation by `intelligent agents'. These agents should `understand' the semantics of the contents, learn user preferences and deliver to the user a subset of the source contents, for further navigation. The goal of such systems should be to enable `zero-command' access to the contents, while keeping the freedom of choice of the user. Such systems should interpret multimedia contents in terms of multiple audiovisual objects (from video to visual or audio object), and on actions and scenarios.
Heterogeneous distributed query processing: The DAVID system
NASA Technical Reports Server (NTRS)
Jacobs, Barry E.
1985-01-01
The objective of the Distributed Access View Integrated Database (DAVID) project is the development of an easy to use computer system with which NASA scientists, engineers and administrators can uniformly access distributed heterogeneous databases. Basically, DAVID will be a database management system that sits alongside already existing database and file management systems. Its function is to enable users to access the data in other languages and file systems without having to learn the data manipulation languages. Given here is an outline of a talk on the DAVID project and several charts.
Pattern Activity Clustering and Evaluation (PACE)
NASA Astrophysics Data System (ADS)
Blasch, Erik; Banas, Christopher; Paul, Michael; Bussjager, Becky; Seetharaman, Guna
2012-06-01
With the vast amount of network information available on activities of people (i.e. motions, transportation routes, and site visits) there is a need to explore the salient properties of data that detect and discriminate the behavior of individuals. Recent machine learning approaches include methods of data mining, statistical analysis, clustering, and estimation that support activity-based intelligence. We seek to explore contemporary methods in activity analysis using machine learning techniques that discover and characterize behaviors that enable grouping, anomaly detection, and adversarial intent prediction. To evaluate these methods, we describe the mathematics and potential information theory metrics to characterize behavior. A scenario is presented to demonstrate the concept and metrics that could be useful for layered sensing behavior pattern learning and analysis. We leverage work on group tracking, learning and clustering approaches; as well as utilize information theoretical metrics for classification, behavioral and event pattern recognition, and activity and entity analysis. The performance evaluation of activity analysis supports high-level information fusion of user alerts, data queries and sensor management for data extraction, relations discovery, and situation analysis of existing data.
Discriminative Multi-View Interactive Image Re-Ranking.
Li, Jun; Xu, Chang; Yang, Wankou; Sun, Changyin; Tao, Dacheng
2017-07-01
Given an unreliable visual patterns and insufficient query information, content-based image retrieval is often suboptimal and requires image re-ranking using auxiliary information. In this paper, we propose a discriminative multi-view interactive image re-ranking (DMINTIR), which integrates user relevance feedback capturing users' intentions and multiple features that sufficiently describe the images. In DMINTIR, heterogeneous property features are incorporated in the multi-view learning scheme to exploit their complementarities. In addition, a discriminatively learned weight vector is obtained to reassign updated scores and target images for re-ranking. Compared with other multi-view learning techniques, our scheme not only generates a compact representation in the latent space from the redundant multi-view features but also maximally preserves the discriminative information in feature encoding by the large-margin principle. Furthermore, the generalization error bound of the proposed algorithm is theoretically analyzed and shown to be improved by the interactions between the latent space and discriminant function learning. Experimental results on two benchmark data sets demonstrate that our approach boosts baseline retrieval quality and is competitive with the other state-of-the-art re-ranking strategies.