Dictionary-driven protein annotation.
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-09-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes.
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Query-seeded iterative sequence similarity searching improves selectivity 5–20-fold
Li, Weizhong; Lopez, Rodrigo
2017-01-01
Abstract Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic to a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity. PMID:27923999
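The query-seeding step described in this abstract can be illustrated with a minimal sketch. It assumes a pairwise alignment given as two equal-length gapped strings; the function name and the toy sequences are hypothetical and are not taken from psisearch or psiblast. Wherever the subject row has a gap opposite a query residue, the query residue is copied in, so the row contributed to the query-based MSA always has the ungapped query's length.

```python
def seed_subject_with_query(aligned_query: str, aligned_subject: str, gap: str = "-") -> str:
    """Return the subject row of a query-based MSA with its gaps filled by query residues.

    Hypothetical helper: both inputs are rows of one pairwise alignment.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have equal length")
    seeded = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap:
            # Insertion relative to the query: no column exists for it
            # in a query-based MSA, so it is dropped.
            continue
        # Gap in the subject: anchor the column to the query residue.
        seeded.append(q if s == gap else s)
    return "".join(seeded)


# Toy alignment (made-up sequences).
row = seed_subject_with_query("MKT-LLV", "MR-AL-V")
print(row)  # MRTLLV
```

The seeded row keeps every aligned subject residue and differs from an unseeded row only at gapped positions, which is the sense in which the resulting model stays anchored to the original query.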
LSD: Large Survey Database framework
NASA Astrophysics Data System (ADS)
Juric, Mario
2012-09-01
The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures.
Using search engine query data to track pharmaceutical utilization: a study of statins.
Schuster, Nathaniel M; Rogers, Mary A M; McMahon, Laurence F
2010-08-01
To examine temporal and geographic associations between Google queries for health information and healthcare utilization benchmarks. Retrospective longitudinal study. Using Google Trends and Google Insights for Search data, the search terms Lipitor (atorvastatin calcium; Pfizer, Ann Arbor, MI) and simvastatin were evaluated for change over time and for association with Lipitor revenues. The relationship between query data and community-based resource use per Medicare beneficiary was assessed for 35 US metropolitan areas. Google queries for Lipitor significantly decreased from January 2004 through June 2009 and queries for simvastatin significantly increased (P <.001 for both), particularly after Lipitor came off patent (P <.001 for change in slope). The mean number of Google queries for Lipitor correlated (r = 0.98) with the percentage change in Lipitor global revenues from 2004 to 2008 (P <.001). Query preference for Lipitor over simvastatin was positively associated (r = 0.40) with a community's use of Medicare services. For every 1% increase in utilization of Medicare services in a community, there was a 0.2-unit increase in the ratio of Lipitor queries to simvastatin queries in that community (P = .02). Specific search engine queries for medical information correlate with pharmaceutical revenue and with overall healthcare utilization in a community. This suggests that search query data can track community-wide characteristics in healthcare utilization and have the potential for informing payers and policy makers regarding trends in utilization.
Dictionary-driven protein annotation
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-01-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. 
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12202776
Querying archetype-based EHRs by search ontology-based XPath engineering.
Kropf, Stefan; Uciteli, Alexandr; Schierle, Katrin; Krücken, Peter; Denecke, Kerstin; Herre, Heinrich
2018-05-11
Legacy data and new structured data can be stored in a standardized format as XML-based EHRs on XML databases. Querying documents on these databases is crucial for answering research questions. Instead of using free text searches, which lead to false positive results, precision can be increased by constraining the search to certain parts of documents. A search ontology-based specification of queries on XML documents defines search concepts and relates them to parts in the XML document structure. Such a query specification method is introduced in practice and evaluated by applying concrete research questions formulated in natural language to a data collection for information retrieval purposes. The search is performed by search ontology-based XPath engineering that reuses ontologies and XML-related W3C standards. The key result is that the specification of research questions can be supported by the usage of search ontology-based XPath engineering. A deeper recognition of entities and a semantic understanding of the content are necessary for a further improvement of precision and recall. A key limitation is that the application of the introduced process requires skills in ontology and software development. In the future, the time-consuming ontology development could be overcome by implementing a new clinical role: the clinical ontologist. The introduced Search Ontology XML extension connects Search Terms to certain parts in XML documents and enables an ontology-based definition of queries. Search ontology-based XPath engineering can support research question answering by the specification of complex XPath expressions without deep syntax knowledge about XPaths.
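The gain in precision from constraining a search to certain parts of a document can be sketched with Python's standard library. The element names, the `name` attribute and the sample record below are invented for illustration and do not reflect an archetype-based EHR schema or the authors' Search Ontology extension; the point is only that the same search term matches fewer, more relevant nodes once an XPath restricts where it may occur.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-section record; not a real EHR format.
DOC = """<record>
  <section name="diagnosis"><text>Chronic gastritis</text></section>
  <section name="history"><text>No gastritis in family history</text></section>
</record>"""


def search(root, term, xpath=".//text"):
    """Return the text of every node selected by `xpath` that contains `term`."""
    return [
        el.text
        for el in root.findall(xpath)
        if el.text and term.lower() in el.text.lower()
    ]


root = ET.fromstring(DOC)

# Free-text style search: every <text> node is a candidate.
free_text_hits = search(root, "gastritis")

# Constrained search: only <text> nodes inside the diagnosis section.
constrained_hits = search(root, "gastritis", ".//section[@name='diagnosis']/text")

print(len(free_text_hits), constrained_hits)
```

Here the unconstrained search also returns the negated mention in the history section, which is the kind of false positive the abstract attributes to free text searches.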
An efficient approach for video information retrieval
NASA Astrophysics Data System (ADS)
Dong, Daoguo; Xue, Xiangyang
2005-01-01
Today, more and more video information can be accessed through the internet, satellite, etc. Retrieving specific video information from large-scale video databases has become an important and challenging research topic in the area of multimedia information retrieval. In this paper, we introduce a new and efficient index structure, OVA-File, which is a variant of VA-File. In OVA-File, approximations that are close to each other in data space are stored in nearby positions of the approximation file. The benefit is that only the subset of approximations close to the query vector needs to be visited to obtain the query result. Both a shot query algorithm and a video clip algorithm are proposed to support video information retrieval efficiently. The experimental results showed that queries based on OVA-File were much faster than those based on VA-File, with a small loss of result quality.
Measuring Up: Implementing a Dental Quality Measure in the Electronic Health Record Context
Bhardwaj, Aarti; Ramoni, Rachel; Kalenderian, Elsbeth; Neumann, Ana; Hebballi, Nutan B; White, Joel M; McClellan, Lyle; Walji, Muhammad F
2015-01-01
Background Quality improvement requires quality measures that are validly implementable. In this work, we assessed the feasibility and performance of an automated electronic Meaningful Use dental clinical quality measure (percentage of children who received fluoride varnish). Methods We defined how to implement the automated measure queries in a dental electronic health record (EHR). Within records identified through automated query, we manually reviewed a subsample to assess the performance of the query. Results The automated query found 71.0% of patients to have had fluoride varnish compared to 77.6% found using the manual chart review. The automated quality measure performance was 90.5% sensitivity, 90.8% specificity, 96.9% positive predictive value, and 75.2% negative predictive value. Conclusions Our findings support the feasibility of automated dental quality measure queries in the context of sufficient structured data. Information noted only in the free text rather than in structured data would require natural language processing approaches to effectively query. Practical Implications To participate in self-directed quality improvement, dental clinicians must embrace the accountability era. Commitment to quality will require enhanced documentation in order to support near-term automated calculation of quality measures. PMID:26562736
Merglen, Arnaud; Courvoisier, Delphine S; Combescure, Christophe; Garin, Nicolas; Perrier, Arnaud; Perneger, Thomas V
2012-01-01
Background Clinicians perform searches in PubMed daily, but retrieving relevant studies is challenging due to the rapid expansion of medical knowledge. Little is known about the performance of search strategies when they are applied to answer specific clinical questions. Objective To compare the performance of 15 PubMed search strategies in retrieving relevant clinical trials on therapeutic interventions. Methods We used Cochrane systematic reviews to identify relevant trials for 30 clinical questions. Search terms were extracted from the abstract using a predefined procedure based on the population, interventions, comparison, outcomes (PICO) framework and combined into queries. We tested 15 search strategies that varied in their query (PIC or PICO), use of PubMed’s Clinical Queries therapeutic filters (broad or narrow), search limits, and PubMed links to related articles. We assessed sensitivity (recall) and positive predictive value (precision) of each strategy on the first 2 PubMed pages (40 articles) and on the complete search output. Results The performance of the search strategies varied widely according to the clinical question. Unfiltered searches and those using the broad filter of Clinical Queries produced large outputs and retrieved few relevant articles within the first 2 pages, resulting in a median sensitivity of only 10%–25%. In contrast, all searches using the narrow filter performed significantly better, with a median sensitivity of about 50% (all P < .001 compared with unfiltered queries) and positive predictive values of 20%–30% (P < .001 compared with unfiltered queries). This benefit was consistent for most clinical questions. Searches based on related articles retrieved about a third of the relevant studies. Conclusions The Clinical Queries narrow filter, along with well-formulated queries based on the PICO framework, provided the greatest aid in retrieving relevant clinical trials within the first 2 PubMed pages.
These results can help clinicians apply effective strategies to answer their questions at the point of care. PMID:22693047
Agoritsas, Thomas; Merglen, Arnaud; Courvoisier, Delphine S; Combescure, Christophe; Garin, Nicolas; Perrier, Arnaud; Perneger, Thomas V
2012-06-12
Clinicians perform searches in PubMed daily, but retrieving relevant studies is challenging due to the rapid expansion of medical knowledge. Little is known about the performance of search strategies when they are applied to answer specific clinical questions. To compare the performance of 15 PubMed search strategies in retrieving relevant clinical trials on therapeutic interventions. We used Cochrane systematic reviews to identify relevant trials for 30 clinical questions. Search terms were extracted from the abstract using a predefined procedure based on the population, interventions, comparison, outcomes (PICO) framework and combined into queries. We tested 15 search strategies that varied in their query (PIC or PICO), use of PubMed's Clinical Queries therapeutic filters (broad or narrow), search limits, and PubMed links to related articles. We assessed sensitivity (recall) and positive predictive value (precision) of each strategy on the first 2 PubMed pages (40 articles) and on the complete search output. The performance of the search strategies varied widely according to the clinical question. Unfiltered searches and those using the broad filter of Clinical Queries produced large outputs and retrieved few relevant articles within the first 2 pages, resulting in a median sensitivity of only 10%-25%. In contrast, all searches using the narrow filter performed significantly better, with a median sensitivity of about 50% (all P < .001 compared with unfiltered queries) and positive predictive values of 20%-30% (P < .001 compared with unfiltered queries). This benefit was consistent for most clinical questions. Searches based on related articles retrieved about a third of the relevant studies. The Clinical Queries narrow filter, along with well-formulated queries based on the PICO framework, provided the greatest aid in retrieving relevant clinical trials within the first 2 PubMed pages.
These results can help clinicians apply effective strategies to answer their questions at the point of care.
SAM: String-based sequence search algorithm for mitochondrial DNA database queries
Röck, Alexander; Irwin, Jodi; Dür, Arne; Parsons, Thomas; Parson, Walther
2011-01-01
The analysis of the haploid mitochondrial (mt) genome has numerous applications in forensic and population genetics, as well as in disease studies. Although mtDNA haplotypes are usually determined by sequencing, they are rarely reported as a nucleotide string. Traditionally they are presented in a difference-coded position-based format relative to the corrected version of the first sequenced mtDNA. This convention requires recommendations for standardized sequence alignment that is known to vary between scientific disciplines, even between laboratories. As a consequence, database searches that are vital for the interpretation of mtDNA data can suffer from biased results when query and database haplotypes are annotated differently. In the forensic context that would usually lead to underestimation of the absolute and relative frequencies. To address this issue we introduce SAM, a string-based search algorithm that converts query and database sequences to position-free nucleotide strings and thus eliminates the possibility that identical sequences will be missed in a database query. The mere application of a BLAST algorithm would not be a sufficient remedy as it uses a heuristic approach and does not address properties specific to mtDNA, such as phylogenetically stable but also rapidly evolving insertion and deletion events. The software presented here provides additional flexibility to incorporate phylogenetic data, site-specific mutation rates, and other biologically relevant information that would refine the interpretation of mitochondrial DNA data. The manuscript is accompanied by freeware and example data sets that can be used to evaluate the new software (http://stringvalidation.org). PMID:21056022
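The alignment problem this abstract describes can be made concrete with a small sketch. It is not the SAM software; the function name, the seven-base toy reference and the simplified difference notation (`3T` for a substitution, `4DEL` for a deletion, `5.1C` for an insertion after position 5) are assumptions chosen for illustration. Converting difference-coded haplotypes back to plain nucleotide strings shows how two differently annotated haplotypes can be the same sequence.

```python
import re
from typing import Iterable


def haplotype_to_string(reference: str, differences: Iterable[str]) -> str:
    """Apply difference-coded variants (1-based positions) to a reference string."""
    subs, dels, ins = {}, set(), {}
    for diff in differences:
        m = re.fullmatch(r"(\d+)(?:\.(\d+))?([ACGT]|DEL)", diff)
        if m is None:
            raise ValueError(f"unsupported difference code: {diff}")
        pos, ins_idx, base = int(m.group(1)), m.group(2), m.group(3)
        if ins_idx is not None:
            if base == "DEL":
                raise ValueError(f"an insertion needs a base: {diff}")
            # Insertion placed after reference position `pos`.
            ins.setdefault(pos, []).append((int(ins_idx), base))
        elif base == "DEL":
            dels.add(pos)
        else:
            subs[pos] = base
    out = []
    for pos, ref_base in enumerate(reference, start=1):
        if pos not in dels:
            out.append(subs.get(pos, ref_base))
        out.extend(b for _, b in sorted(ins.get(pos, [])))
    return "".join(out)


# Toy reference with a C-stretch; an extra C can be annotated at either end of it.
REFERENCE = "ACCCCTG"
lab_a = haplotype_to_string(REFERENCE, ["5.1C"])
lab_b = haplotype_to_string(REFERENCE, ["1.1C"])
print(lab_a, lab_b, lab_a == lab_b)
```

A position-based comparison of `5.1C` against `1.1C` reports a mismatch, whereas comparing the position-free strings identifies the two haplotypes as identical.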
Monitoring Moving Queries inside a Safe Region
Al-Khalidi, Haidar; Taniar, David; Alamri, Sultan
2014-01-01
With mobile moving range queries, there is a need to recalculate the relevant surrounding objects of interest whenever the query moves. Therefore, monitoring the moving query is very costly. The safe region is one method that has been proposed to minimise the communication and computation cost of continuously monitoring a moving range query. Inside the safe region the set of objects of interest to the query does not change; thus there is no need to update the query while it is inside its safe region. However, when the query leaves its safe region the mobile device has to reevaluate the query, necessitating communication with the server. Knowing when and where the mobile device will leave a safe region is widely known as a difficult problem. To solve this problem, we propose a novel method to monitor the position of the query over time using a linear function based on the direction of the query obtained by periodic monitoring of its position. Periodic monitoring ensures that the query is aware of its location all the time. This method reduces the costs associated with communications in a client-server architecture. Computational results show that our method is successful in handling moving query patterns. PMID:24696652
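One way to picture the linear-function idea is the sketch below. It makes two assumptions that the abstract does not state: the safe region is a circle, and the query keeps the velocity estimated from its last two periodic position samples. The function names are hypothetical. Under those assumptions, the time at which the query leaves the region is the non-negative root of a quadratic.

```python
import math


def estimate_velocity(p_prev, p_curr, dt):
    """Velocity from two periodic position samples taken `dt` time units apart."""
    return ((p_curr[0] - p_prev[0]) / dt, (p_curr[1] - p_prev[1]) / dt)


def time_to_exit(p, v, center, radius):
    """Time until a point moving linearly from `p` with velocity `v` leaves the circle.

    Solves |p + v*t - center|^2 = radius^2 for the root with t >= 0.
    """
    dx, dy = p[0] - center[0], p[1] - center[1]
    a = v[0] ** 2 + v[1] ** 2
    c = dx * dx + dy * dy - radius ** 2
    if c > 0:
        return 0.0  # already outside the safe region
    if a == 0:
        return math.inf  # stationary query never leaves
    b = 2 * (dx * v[0] + dy * v[1])
    disc = b * b - 4 * a * c  # non-negative because c <= 0
    return (-b + math.sqrt(disc)) / (2 * a)


# Two samples one time unit apart, safe region of radius 5 around the origin.
v = estimate_velocity((0.0, 0.0), (1.0, 0.0), dt=1.0)
t_exit = time_to_exit((1.0, 0.0), v, center=(0.0, 0.0), radius=5.0)
print(t_exit)  # 4.0
```

With such an estimate the device only needs to contact the server around the predicted exit time, and a new position sample that changes the direction simply triggers a fresh estimate.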
Identifying QT prolongation from ECG impressions using a general-purpose Natural Language Processor
Denny, Joshua C.; Miller, Randolph A.; Waitman, Lemuel Russell; Arrieta, Mark; Peterson, Joshua F.
2009-01-01
Objective Typically detected via electrocardiograms (ECGs), QT interval prolongation is a known risk factor for sudden cardiac death. Since medications can promote or exacerbate the condition, detection of QT interval prolongation is important for clinical decision support. We investigated the accuracy of natural language processing (NLP) for identifying QT prolongation from cardiologist-generated, free-text ECG impressions compared to corrected QT (QTc) thresholds reported by ECG machines. Methods After integrating negation detection to a locally-developed natural language processor, the KnowledgeMap concept identifier, we evaluated NLP-based detection of QT prolongation compared to the calculated QTc on a set of 44,318 ECGs obtained from hospitalized patients. We also created a string query using regular expressions to identify QT prolongation. We calculated sensitivity and specificity of the methods using manual physician review of the cardiologist-generated reports as the gold standard. To investigate causes of “false positive” calculated QTc, we manually reviewed randomly selected ECGs with a long calculated QTc but no mention of QT prolongation. Separately, we validated the performance of the negation detection algorithm on 5,000 manually-categorized ECG phrases for any medical concept (not limited to QT prolongation) prior to developing the NLP query for QT prolongation. Results The NLP query for QT prolongation correctly identified 2,364 of 2,373 ECGs with QT prolongation with a sensitivity of 0.996 and a positive predictive value of 1.000. There were no false positives. The regular expression query had a sensitivity of 0.999 and positive predictive value of 0.982. In contrast, the positive predictive value of common QTc thresholds derived from ECG machines was 0.07–0.25 with corresponding sensitivities of 0.994–0.046. The negation detection algorithm had a recall of 0.973 and precision of 0.982 for 10,490 concepts found within ECG impressions. 
Conclusions NLP and regular expression queries of cardiologists’ ECG interpretations can more effectively identify QT prolongation than the automated QTc intervals reported by ECG machines. Future clinical decision support could employ NLP queries to detect QTc prolongation and other reported ECG abnormalities. PMID:18938105
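The headline figures for the NLP query follow directly from the counts given in the Results. The short check below uses only numbers stated in the abstract (2,364 of 2,373 ECGs identified, no false positives); the function names are illustrative.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positives that the query found."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of the query's positive calls that were correct."""
    return true_pos / (true_pos + false_pos)


true_pos = 2364               # ECGs with QT prolongation found by the NLP query
false_neg = 2373 - true_pos   # ECGs with QT prolongation that it missed
false_pos = 0                 # "There were no false positives."

nlp_sensitivity = round(sensitivity(true_pos, false_neg), 3)
nlp_ppv = round(positive_predictive_value(true_pos, false_pos), 3)
print(nlp_sensitivity, nlp_ppv)  # 0.996 1.0
```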
Privacy Risks from Genomic Data-Sharing Beacons.
Shringarpure, Suyash S; Bustamante, Carlos D
2015-11-05
The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries--such as "Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?"--with either "yes" or "no." Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Family health history reporting is sensitive to small changes in wording.
Conway-Pearson, Liam S; Christensen, Kurt D; Savage, Sarah K; Huntington, Noelle L; Weitzman, Elissa R; Ziniel, Sonja I; Bacon, Phoebe; Cacioppo, Cara N; Green, Robert C; Holm, Ingrid A
2016-12-01
Family health history is often collected through single-item queries that ask patients whether their family members are affected by certain conditions. The specific wording of these queries may influence what individuals report. Parents of Boston Children's Hospital patients were invited to participate in a Web-based survey about the return of individual genomic research results regarding their children. Participants reported whether 11 types of medical conditions affected them or their family. Randomization determined whether participants were specifically instructed to consider their extended family. Family health history was reported by 2,901 participants. Those asked to consider their extended family were more likely to report a positive family history for 8 of 11 medical conditions. The largest differences were observed for cancer (65.1 vs. 45.7%; P < 0.001), cardiovascular conditions (72.5 vs. 56.0%; P < 0.001), and endocrine/hormonal conditions (50.9 vs. 36.7%; P < 0.001). Small alterations to the way family health history queries are worded can substantially change patient responses. Clinicians and researchers need to be sensitive about patients' tendencies to omit extended family from health history reporting unless specifically asked to consider them. Genet Med 18(12), 1308-1311.
Standley, Daron M; Toh, Hiroyuki; Nakamura, Haruki
2008-09-01
A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 were found to align to templates with the same Pfam family as the query or had sequence identities ≥ 30%. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized. 2008 Wiley-Liss, Inc.
An intuitive graphical webserver for multiple-choice protein sequence search.
Banky, Daniel; Szalkai, Balazs; Grolmusz, Vince
2014-04-10
Every day tens of thousands of sequence searches and sequence alignment queries are submitted to webservers. The capitalized word "BLAST" has become a verb, describing the act of performing sequence search and alignment. However, if one needs to search for sequences that contain, for example, two hydrophobic and three polar residues at five given positions, forming the query on the most frequently used webservers will be difficult. Some servers support the formation of queries with regular expressions, but most users are unfamiliar with their syntax. Here we present an intuitive, easily applicable webserver, the Protein Sequence Analysis server, that allows the formation of multiple-choice queries by simply drawing the residues to their positions; if more than one residue is drawn to the same position, then they will be nicely stacked on the user interface, indicating the multiple choice at the given position. This computer-game-like interface is natural and intuitive, and the coloring of the residues makes it possible to form queries requiring not just certain amino acids in the given positions, but also small nonpolar, negatively charged, hydrophobic, positively charged, or polar ones. The webserver is available at http://psa.pitgroup.org. Copyright © 2014 Elsevier B.V. All rights reserved.
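For readers who do know regular expressions, the abstract's example of two hydrophobic and three polar residues at five positions maps onto one character class per position. The sketch below is not the server's implementation; the residue groupings are one common convention assumed here for illustration (classifications vary), and the function name and test sequence are hypothetical.

```python
import re

# Assumed residue groups; other classifications exist.
HYDROPHOBIC = "AVLIMFW"
POLAR = "STNQ"


def build_pattern(choices):
    """Compile a regex from a list of per-position residue choices.

    Each entry is a string of allowed one-letter codes, or None for "any residue".
    """
    parts = []
    for allowed in choices:
        if allowed is None:
            parts.append(".")
        else:
            parts.append("[" + "".join(sorted(set(allowed))) + "]")
    return re.compile("".join(parts))


# Two hydrophobic residues followed by three polar ones.
pattern = build_pattern([HYDROPHOBIC, HYDROPHOBIC, POLAR, POLAR, POLAR])
match = pattern.search("GGAVSTNKK")  # made-up sequence
print(pattern.pattern)
print(match.group(), match.start())
```

Stacking several residues on one position in the graphical interface corresponds to listing them inside one bracketed class here.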
VISAGE: Interactive Visual Graph Querying.
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2016-06-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete, an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with "wildcard" nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE's ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries.
VISAGE: Interactive Visual Graph Querying
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2017-01-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete, an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with “wildcard” nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE’s ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries. PMID:28553670
Intelligent web image retrieval system
NASA Astrophysics Data System (ADS)
Hong, Sungyong; Lee, Chungwoo; Nah, Yunmook
2001-07-01
Recently, web sites such as e-business and shopping mall sites have come to handle large amounts of image information. To find a specific image in these sources, we usually use web search engines or image database engines, which rely on keyword-only or color-based retrieval and offer limited search capabilities. This paper presents an intelligent web image retrieval system. We propose the system architecture, texture- and color-based image classification and indexing techniques, and representation schemes for user usage patterns. A query can be given by providing keywords, by selecting one or more sample texture patterns, by assigning color values within positional color blocks, or by combining some or all of these factors. The system keeps track of the user's preferences by generating user query logs and automatically adds more search information to subsequent user queries. To show the usefulness of the proposed system, some experimental results on recall and precision are also presented.
ERIC Educational Resources Information Center
Chung, EunKyung; Yoon, JungWon
2009-01-01
Introduction: The purpose of this study is to compare characteristics and features of user supplied tags and search query terms for images on the "Flickr" Website in terms of categories of pictorial meanings and level of term specificity. Method: This study focuses on comparisons between tags and search queries using Shatford's categorization…
Static Analysis of Mobile Programs
2017-02-01
information flow analysis has the potential to significantly aid human auditors, but it is handicapped by high false positive rates. Instead, auditors ...presents these specifications to a human auditor for validation. We have implemented this framework for a taint analysis of Android apps that relies on...of queries to a human auditor. 6.4 Inferring Library Information Flow Specifications Using Dynamic Analysis In [15], we present a technique to mine
Searching for cancer information on the internet: analyzing natural language search queries.
Bader, Judith L; Theofanos, Mary Frances
2003-12-11
Searching for health information is one of the most-common tasks performed by Internet users. Many users begin searching on popular search engines rather than on prominent health information sites. We know that many visitors to our (National Cancer Institute) Web site, cancer.gov, arrive via links in search engine result. To learn more about the specific needs of our general-public users, we wanted to understand what lay users really wanted to know about cancer, how they phrased their questions, and how much detail they used. The National Cancer Institute partnered with AskJeeves, Inc to develop a methodology to capture, sample, and analyze 3 months of cancer-related queries on the Ask.com Web site, a prominent United States consumer search engine, which receives over 35 million queries per week. Using a benchmark set of 500 terms and word roots supplied by the National Cancer Institute, AskJeeves identified a test sample of cancer queries for 1 week in August 2001. From these 500 terms only 37 appeared ≥ 5 times/day over the trial test week in 17208 queries. Using these 37 terms, 204165 instances of cancer queries were found in the Ask.com query logs for the actual test period of June-August 2001. Of these, 7500 individual user questions were randomly selected for detailed analysis and assigned to appropriate categories. The exact language of sample queries is presented. Considering multiples of the same questions, the sample of 7500 individual user queries represented 76077 queries (37% of the total 3-month pool). Overall 78.37% of sampled Cancer queries asked about 14 specific cancer types. Within each cancer type, queries were sorted into appropriate subcategories including at least the following: General Information, Symptoms, Diagnosis and Testing, Treatment, Statistics, Definition, and Cause/Risk/Link.
The most-common specific cancer types mentioned in queries were Digestive/Gastrointestinal/Bowel (15.0%), Breast (11.7%), Skin (11.3%), and Genitourinary (10.5%). Additional subcategories of queries about specific cancer types varied, depending on user input. Queries that were not specific to a cancer type were also tracked and categorized. Natural-language searching affords users the opportunity to fully express their information needs and can aid users naïve to the content and vocabulary. The specific queries analyzed for this study reflect news and research studies reported during the study dates and would surely change with different study dates. Analyzing queries from search engines represents one way of knowing what kinds of content to provide to users of a given Web site. Users ask questions using whole sentences and keywords, often misspelling words. Providing the option for natural-language searching does not obviate the need for good information architecture, usability engineering, and user testing in order to optimize user experience.
Searching for Cancer Information on the Internet: Analyzing Natural Language Search Queries
Theofanos, Mary Frances
2003-01-01
Background Searching for health information is one of the most-common tasks performed by Internet users. Many users begin searching on popular search engines rather than on prominent health information sites. We know that many visitors to our (National Cancer Institute) Web site, cancer.gov, arrive via links in search engine result. Objective To learn more about the specific needs of our general-public users, we wanted to understand what lay users really wanted to know about cancer, how they phrased their questions, and how much detail they used. Methods The National Cancer Institute partnered with AskJeeves, Inc to develop a methodology to capture, sample, and analyze 3 months of cancer-related queries on the Ask.com Web site, a prominent United States consumer search engine, which receives over 35 million queries per week. Using a benchmark set of 500 terms and word roots supplied by the National Cancer Institute, AskJeeves identified a test sample of cancer queries for 1 week in August 2001. From these 500 terms only 37 appeared ≥ 5 times/day over the trial test week in 17208 queries. Using these 37 terms, 204165 instances of cancer queries were found in the Ask.com query logs for the actual test period of June-August 2001. Of these, 7500 individual user questions were randomly selected for detailed analysis and assigned to appropriate categories. The exact language of sample queries is presented. Results Considering multiples of the same questions, the sample of 7500 individual user queries represented 76077 queries (37% of the total 3-month pool). Overall 78.37% of sampled Cancer queries asked about 14 specific cancer types. Within each cancer type, queries were sorted into appropriate subcategories including at least the following: General Information, Symptoms, Diagnosis and Testing, Treatment, Statistics, Definition, and Cause/Risk/Link. 
The most-common specific cancer types mentioned in queries were Digestive/Gastrointestinal/Bowel (15.0%), Breast (11.7%), Skin (11.3%), and Genitourinary (10.5%). Additional subcategories of queries about specific cancer types varied, depending on user input. Queries that were not specific to a cancer type were also tracked and categorized. Conclusions Natural-language searching affords users the opportunity to fully express their information needs and can aid users naïve to the content and vocabulary. The specific queries analyzed for this study reflect news and research studies reported during the study dates and would surely change with different study dates. Analyzing queries from search engines represents one way of knowing what kinds of content to provide to users of a given Web site. Users ask questions using whole sentences and keywords, often misspelling words. Providing the option for natural-language searching does not obviate the need for good information architecture, usability engineering, and user testing in order to optimize user experience. PMID:14713659
Measuring up: Implementing a dental quality measure in the electronic health record context.
Bhardwaj, Aarti; Ramoni, Rachel; Kalenderian, Elsbeth; Neumann, Ana; Hebballi, Nutan B; White, Joel M; McClellan, Lyle; Walji, Muhammad F
2016-01-01
Quality improvement requires using quality measures that can be implemented in a valid manner. Using guidelines set forth by the Meaningful Use portion of the Health Information Technology for Economic and Clinical Health Act, the authors assessed the feasibility and performance of an automated electronic Meaningful Use dental clinical quality measure to determine the percentage of children who received fluoride varnish. The authors defined how to implement the automated measure queries in a dental electronic health record. Within records identified through automated query, the authors manually reviewed a subsample to assess the performance of the query. The automated query results revealed that 71.0% of patients had fluoride varnish compared with the manual chart review results that indicated 77.6% of patients had fluoride varnish. The automated quality measure performance results indicated 90.5% sensitivity, 90.8% specificity, 96.9% positive predictive value, and 75.2% negative predictive value. The authors' findings support the feasibility of using automated dental quality measure queries in the context of sufficient structured data. Information noted only in free text rather than in structured data would require using natural language processing approaches to effectively query electronic health records. To participate in self-directed quality improvement, dental clinicians must embrace the accountability era. Commitment to quality will require enhanced documentation to support near-term automated calculation of quality measures. Copyright © 2016 American Dental Association. Published by Elsevier Inc. All rights reserved.
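The four performance figures reported in this abstract come from comparing the automated query against manual chart review in a 2x2 table. A minimal Python sketch of that arithmetic follows; the function name and the counts in the usage line are hypothetical and are not the study's data.

```python
def query_performance(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Score an automated query against manual review (the reference standard).

    tp/fp: records the query flagged that review confirmed / did not confirm.
    tn/fn: records the query did not flag that review also rejected / confirmed.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases the query found
        "specificity": tn / (tn + fp),  # share of non-cases the query excluded
        "ppv": tp / (tp + fp),          # share of flagged records that were true cases
        "npv": tn / (tn + fn),          # share of unflagged records that were non-cases
    }


# Hypothetical counts, for illustration only.
metrics = query_performance(tp=90, fp=5, tn=45, fn=10)
```

Note that positive and negative predictive values depend on how common the measured event is in the reviewed subsample, so they do not transfer directly to populations with a different prevalence.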
Kaas, Quentin; Ruiz, Manuel; Lefranc, Marie-Paule
2004-01-01
IMGT/3Dstructure-DB and IMGT/Structural-Query are a novel 3D structure database and a new tool for immunological proteins. They are part of IMGT, the international ImMunoGenetics information system®, a high-quality integrated knowledge resource specializing in immunoglobulins (IG), T cell receptors (TR), major histocompatibility complex (MHC) and related proteins of the immune system (RPI) of human and other vertebrate species, which consists of databases, Web resources and interactive on-line tools. IMGT/3Dstructure-DB data are described according to the IMGT Scientific chart rules based on the IMGT-ONTOLOGY concepts. IMGT/3Dstructure-DB provides IMGT gene and allele identification of IG, TR and MHC proteins with known 3D structures, domain delimitations, amino acid positions according to the IMGT unique numbering and renumbered coordinate flat files. Moreover IMGT/3Dstructure-DB provides 2D graphical representations (or Collier de Perles) and results of contact analysis. The IMGT/StructuralQuery tool allows search of this database based on specific structural characteristics. IMGT/3Dstructure-DB and IMGT/StructuralQuery are freely available at http://imgt.cines.fr. PMID:14681396
ERIC Educational Resources Information Center
Lyall-Wilson, Jennifer Rae
2013-01-01
The dissertation research explores an approach to automatic concept-based query expansion to improve search engine performance. It uses a network-based approach for identifying the concept represented by the user's query and is founded on the idea that a collection-specific association thesaurus can be used to create a reasonable representation of…
Virtual Observatory Interfaces to the Chandra Data Archive
NASA Astrophysics Data System (ADS)
Tibbetts, M.; Harbo, P.; Van Stone, D.; Zografou, P.
2014-05-01
The Chandra Data Archive (CDA) plays a central role in the operation of the Chandra X-ray Center (CXC) by providing access to Chandra data. Proprietary interfaces have been the backbone of the CDA throughout the Chandra mission. While these interfaces continue to provide the depth and breadth of mission specific access Chandra users expect, the CXC has been adding Virtual Observatory (VO) interfaces to the Chandra proposal catalog and observation catalog. VO interfaces provide standards-based access to Chandra data through simple positional queries or more complex queries using the Astronomical Data Query Language. Recent development at the CDA has generalized our existing VO services to create a suite of services that can be configured to provide VO interfaces to any dataset. This approach uses a thin web service layer for the individual VO interfaces, a middle-tier query component which is shared among the VO interfaces for parsing, scheduling, and executing queries, and existing web services for file and data access. The CXC VO services provide Simple Cone Search (SCS), Simple Image Access (SIA), and Table Access Protocol (TAP) implementations for both the Chandra proposal and observation catalogs within the existing archive architecture. Our work with the Chandra proposal and observation catalogs, as well as additional datasets beyond the CDA, illustrates how we can provide configurable VO services to extend core archive functionality.
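The "simple positional queries" mentioned in this abstract amount to returning every catalog row within a given angular radius of a sky position. The Python sketch below illustrates only that selection rule on an in-memory list; it is not the CXC service code, and the catalog rows, field names, and function names are hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the rows whose position lies within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


# Hypothetical catalog rows (degrees), for illustration only.
catalog = [
    {"obsid": "obs-1", "ra": 10.0, "dec": 20.0},
    {"obsid": "obs-2", "ra": 10.1, "dec": 20.05},
    {"obsid": "obs-3", "ra": 50.0, "dec": -30.0},
]
hits = cone_search(catalog, ra=10.0, dec=20.0, radius_deg=0.5)
```

A production service would push this predicate into the database rather than scan rows in application code; the linear scan here is only meant to make the geometry explicit.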
Concept-based query language approach to enterprise information systems
NASA Astrophysics Data System (ADS)
Niemi, Timo; Junkkari, Marko; Järvelin, Kalervo
2014-01-01
In enterprise information systems (EISs) it is necessary to model, integrate and compute very diverse data. In advanced EISs the stored data often are based both on structured (e.g. relational) and semi-structured (e.g. XML) data models. In addition, the ad hoc information needs of end-users may require the manipulation of data-oriented (structural), behavioural and deductive aspects of data. Contemporary languages capable of treating this kind of diversity suit only persons with good programming skills. In this paper we present a concept-oriented query language approach to manipulate this diversity so that the programming skill requirements are considerably reduced. In our query language, the features which need technical knowledge are hidden in application-specific concepts and structures. Therefore, users need not be aware of the underlying technology. Application-specific concepts and structures are represented by the modelling primitives of the extended RDOOM (relational deductive object-oriented modelling) which contains primitives for all crucial real world relationships (is-a relationship, part-of relationship, association), XML documents and views. Our query language also supports intensional and extensional-intensional queries, in addition to conventional extensional queries. In its query formulation, the end-user combines available application-specific concepts and structures through shared variables.
Luo, Yuan; Szolovits, Peter
2016-01-01
In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing on narrative clinical notes in electronic medical records (EMRs) and efficiently retrieving such annotations that satisfy position constraints. Efficient storage and retrieval of stand-off annotations can facilitate tasks such as mapping unstructured text to electronic medical record ontologies. We first formulate this problem into the interval query problem, for which optimal query/update time is in general logarithm. We next perform a tight time complexity analysis on the basic interval tree query algorithm and show its nonoptimality when being applied to a collection of 13 query types from Allen’s interval algebra. We then study two closely related state-of-the-art interval query algorithms, proposed query reformulations, and augmentations to the second algorithm. Our proposed algorithm achieves logarithmic time stabbing-max query time complexity and solves the stabbing-interval query tasks on all of Allen’s relations in logarithmic time, attaining the theoretic lower bound. Updating time is kept logarithmic and the space requirement is kept linear at the same time. We also discuss interval management in external memory models and higher dimensions. PMID:27478379
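To make the stabbing query concrete, the sketch below builds a basic static centered interval tree over stand-off annotations and returns every annotation that covers a given text offset. This is the textbook structure, not the augmented algorithm the authors propose, and the class name, tuple layout, and sample annotations are assumptions made for illustration.

```python
class IntervalNode:
    """Static centered interval tree over (start, end, label) tuples, end inclusive."""

    def __init__(self, intervals):
        # Use the median endpoint as the center; at least one interval contains it,
        # so both subtrees are strictly smaller and the recursion terminates.
        points = sorted(p for start, end, _ in intervals for p in (start, end))
        self.center = points[len(points) // 2]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def stab(self, q):
        """Return all intervals that contain the offset q."""
        if q == self.center:
            return list(self.by_start)
        out = []
        if q < self.center:
            # Intervals stored here end at or after the center, so only starts matter.
            for iv in self.by_start:
                if iv[0] > q:
                    break
                out.append(iv)
            child = self.left
        else:
            # Intervals stored here start at or before the center, so only ends matter.
            for iv in self.by_end:
                if iv[1] < q:
                    break
                out.append(iv)
            child = self.right
        if child is not None:
            out.extend(child.stab(q))
        return out


# Hypothetical annotations as character offsets into a note.
annotations = [(0, 4, "A"), (3, 9, "B"), (10, 12, "C"), (2, 3, "D")]
tree = IntervalNode(annotations)
covering = sorted(label for _, _, label in tree.stab(3))
```

The tree is built once and is not updated here; supporting logarithmic updates and the full set of Allen's relations is exactly where the paper's augmentations go beyond this sketch.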
Knowledge Query Language (KQL)
2016-02-12
Currently, queries for data ...retrieval from non-Structured Query Language (NoSQL) data stores are tightly coupled to the specific implementation of the data store implementation...independent of the storage content and format for querying NoSQL or relational data stores. This approach uses address expressions (or A-Expressions
Fragger: a protein fragment picker for structural queries.
Berenger, Francois; Simoncini, David; Voet, Arnout; Shrestha, Rojan; Zhang, Kam Y J
2017-01-01
Protein modeling and design activities often require querying the Protein Data Bank (PDB) with a structural fragment, possibly containing gaps. For some applications, it is preferable to work on a specific subset of the PDB or with unpublished structures. These requirements, along with specific user needs, motivated the creation of a new software to manage and query 3D protein fragments. Fragger is a protein fragment picker that allows protein fragment databases to be created and queried. All fragment lengths are supported and any set of PDB files can be used to create a database. Fragger can efficiently search a fragment database with a query fragment and a distance threshold. Matching fragments are ranked by distance to the query. The query fragment can have structural gaps and the allowed amino acid sequences matching a query can be constrained via a regular expression of one-letter amino acid codes. Fragger also incorporates a tool to compute the backbone RMSD of one versus many fragments in high throughput. Fragger should be useful for protein design, loop grafting and related structural bioinformatics tasks.
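The query described here combines three filters: a regular expression over one-letter amino acid codes, a distance threshold, and ranking by distance. The Python sketch below shows how those pieces could fit together; it is not Fragger's implementation or interface. The function names and toy fragments are hypothetical, and the RMSD is computed without superposition, on the assumption that the coordinates already share a frame.

```python
import math
import re


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(a) != len(b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
                  for (x1, y1, z1), (x2, y2, z2) in zip(a, b))
    return math.sqrt(squared / len(a))


def search_fragments(fragments, query_coords, max_rmsd, seq_pattern):
    """Return (sequence, rmsd) hits within max_rmsd, closest first.

    fragments: iterable of (sequence, coords); seq_pattern: regex over one-letter codes.
    """
    regex = re.compile(seq_pattern)
    hits = []
    for seq, coords in fragments:
        if len(coords) != len(query_coords) or not regex.fullmatch(seq):
            continue
        distance = rmsd(query_coords, coords)
        if distance <= max_rmsd:
            hits.append((seq, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Toy three-residue fragments, for illustration only.
fragments = [
    ("GAS", [(0, 0, 0), (1, 0, 0), (2, 0, 0)]),
    ("GLS", [(0, 0, 0), (1, 1, 0), (2, 0, 0)]),
    ("PAS", [(0, 0, 0), (1, 0, 0), (2, 0, 0)]),
]
query = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
ranked = search_fragments(fragments, query, max_rmsd=1.0, seq_pattern="G[AL]S")
```

Here "PAS" matches the geometry exactly but is rejected by the sequence constraint, which is the behavior the abstract's regular-expression filter implies.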
SPARK: Adapting Keyword Query to Semantic Search
NASA Astrophysics Data System (ADS)
Zhou, Qi; Wang, Chong; Xiong, Miao; Wang, Haofen; Yu, Yong
Semantic search promises to provide more accurate result than present-day keyword search. However, progress with semantic search has been delayed due to the complexity of its query languages. In this paper, we explore a novel approach of adapting keywords to querying the semantic web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search. A prototype system named 'SPARK' has been implemented in light of this approach. Given a keyword query, SPARK outputs a ranked list of SPARQL queries as the translation result. The translation in SPARK consists of three major steps: term mapping, query graph construction and query ranking. Specifically, a probabilistic query ranking model is proposed to select the most likely SPARQL query. In the experiment, SPARK achieved an encouraging translation result.
Knowledge Query Language (KQL)
2016-02-01
Currently, queries for data ...retrieval from non-Structured Query Language (NoSQL) data stores are tightly coupled to the specific implementation of the data store implementation, making...of the storage content and format for querying NoSQL or relational data stores. This approach uses address expressions (or A-Expressions) embedded in
A New Publicly Available Chemical Query Language, CSRML ...
A new XML-based query language, CSRML, has been developed for representing chemical substructures, molecules, reaction rules, and reactions. CSRML queries are capable of integrating additional forms of information beyond the simple substructure (e.g., SMARTS) or reaction transformation (e.g., SMIRKS, reaction SMILES) queries currently in use. Chemotypes, a term used to represent advanced CSRML queries for repeated application, can be encoded not only with connectivity and topology, but also with properties of atoms, bonds, electronic systems, or molecules. The CSRML language has been developed in parallel with a public set of chemotypes, i.e., the ToxPrint chemotypes, which are designed to provide excellent coverage of environmental, regulatory and commercial use chemical space, as well as to represent features and frameworks believed to be especially relevant to toxicity concerns. A software application, ChemoTyper, has also been developed and made publicly available to enable chemotype searching and fingerprinting against a target structure set. The public ChemoTyper houses the ToxPrint chemotype CSRML dictionary, as well as a reference implementation so that the query specifications may be adopted by other chemical structure knowledge systems. The full specifications of the XML standard used in CSRML-based chemotypes are publicly available to facilitate and encourage the exchange of structural knowledge. Paper details specifications for a new XML-based query lan
An adaptable architecture for patient cohort identification from diverse data sources.
Bache, Richard; Miles, Simon; Taweel, Adel
2013-12-01
We define and validate an architecture for systems that identify patient cohorts for clinical trials from multiple heterogeneous data sources. This architecture has an explicit query model capable of supporting temporal reasoning and expressing eligibility criteria independently of the representation of the data used to evaluate them. The architecture has the key feature that queries defined according to the query model are both pre and post-processed and this is used to address both structural and semantic heterogeneity. The process of extracting the relevant clinical facts is separated from the process of reasoning about them. A specific instance of the query model is then defined and implemented. We show that the specific instance of the query model has wide applicability. We then describe how it is used to access three diverse data warehouses to determine patient counts. Although the proposed architecture requires greater effort to implement the query model than would be the case for using just SQL and accessing a data-based management system directly, this effort is justified because it supports both temporal reasoning and heterogeneous data sources. The query model only needs to be implemented once no matter how many data sources are accessed. Each additional source requires only the implementation of a lightweight adaptor. The architecture has been used to implement a specific query model that can express complex eligibility criteria and access three diverse data warehouses thus demonstrating the feasibility of this approach in dealing with temporal reasoning and data heterogeneity.
The role of organizational research in implementing evidence-based practice: QUERI Series
Yano, Elizabeth M
2008-01-01
Background Health care organizations exert significant influence on the manner in which clinicians practice and the processes and outcomes of care that patients experience. A greater understanding of the organizational milieu into which innovations will be introduced, as well as the organizational factors that are likely to foster or hinder the adoption and use of new technologies, care arrangements and quality improvement (QI) strategies are central to the effective implementation of research into practice. Unfortunately, much implementation research seems to not recognize or adequately address the influence and importance of organizations. Using examples from the U.S. Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI), we describe the role of organizational research in advancing the implementation of evidence-based practice into routine care settings. Methods Using the six-step QUERI process as a foundation, we present an organizational research framework designed to improve and accelerate the implementation of evidence-based practice into routine care. Specific QUERI-related organizational research applications are reviewed, with discussion of the measures and methods used to apply them. We describe these applications in the context of a continuum of organizational research activities to be conducted before, during and after implementation. Results Since QUERI's inception, various approaches to organizational research have been employed to foster progress through QUERI's six-step process. We report on how explicit integration of the evaluation of organizational factors into QUERI planning has informed the design of more effective care delivery system interventions and enabled their improved "fit" to individual VA facilities or practices. We examine the value and challenges in conducting organizational research, and briefly describe the contributions of organizational theory and environmental context to the research framework. 
Conclusion Understanding the organizational context of delivering evidence-based practice is a critical adjunct to efforts to systematically improve quality. Given the size and diversity of VA practices, coupled with unique organizational data sources, QUERI is well-positioned to make valuable contributions to the field of implementation science. More explicit accommodation of organizational inquiry into implementation research agendas has helped QUERI researchers to better frame and extend their work as they move toward regional and national spread activities. PMID:18510749
The effectiveness of position- and composition-specific gap costs for protein similarity searches.
Stojmirović, Aleksandar; Gertz, E Michael; Altschul, Stephen F; Yu, Yi-Kuo
2008-07-01
The flexibility in gap cost enjoyed by hidden Markov models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance. These results suggest possible improvements to the PSI-BLAST protein database search program. The scripts for performing evaluations are available upon request from the authors.
Structuring Legacy Pathology Reports by openEHR Archetypes to Enable Semantic Querying.
Kropf, Stefan; Krücken, Peter; Mueller, Wolf; Denecke, Kerstin
2017-05-18
Clinical information is often stored as free text, e.g. in discharge summaries or pathology reports. These documents are semi-structured using section headers, numbered lists, items and classification strings. However, it is still challenging to retrieve relevant documents since keyword searches applied on complete unstructured documents result in many false positive retrieval results. We are concentrating on the processing of pathology reports as an example for unstructured clinical documents. The objective is to transform reports semi-automatically into an information structure that enables an improved access and retrieval of relevant data. The data is expected to be stored in a standardized, structured way to make it accessible for queries that are applied to specific sections of a document (section-sensitive queries) and for information reuse. Our processing pipeline comprises information modelling, section boundary detection and section-sensitive queries. For enabling a focused search in unstructured data, documents are automatically structured and transformed into a patient information model specified through openEHR archetypes. The resulting XML-based pathology electronic health records (PEHRs) are queried by XQuery and visualized by XSLT in HTML. Pathology reports (PRs) can be reliably structured into sections by a keyword-based approach. The information modelling using openEHR allows saving time in the modelling process since many archetypes can be reused. The resulting standardized, structured PEHRs allow accessing relevant data by retrieving data matching user queries. Mapping unstructured reports into a standardized information model is a practical solution for a better access to data. Archetype-based XML enables section-sensitive retrieval and visualisation by well-established XML techniques. Focussing the retrieval to particular sections has the potential of saving retrieval time and improving the accuracy of the retrieval.
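The keyword-based section detection and section-sensitive querying described in this abstract can be sketched in a few lines of Python. The version below keeps sections in a dictionary instead of the openEHR-based XML and XQuery used by the authors, and the header keywords, function names, and sample report are all hypothetical.

```python
import re

# Hypothetical section headers; a real list would come from the report templates.
SECTION_KEYWORDS = ("Clinical information", "Macroscopy", "Microscopy", "Diagnosis")

HEADER = re.compile(
    r"^(%s)\s*:\s*" % "|".join(map(re.escape, SECTION_KEYWORDS)),
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(report: str) -> dict:
    """Split a free-text report into {section name: text} at known header keywords."""
    matches = list(HEADER.finditer(report))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[match.group(1).lower()] = report[match.end():end].strip()
    return sections


def query_section(sections: dict, name: str, term: str) -> bool:
    """Section-sensitive query: look for term only inside the named section."""
    return term.lower() in sections.get(name.lower(), "").lower()


# Made-up report text, for illustration only.
report = (
    "Macroscopy: Tissue sample, 2 cm.\n"
    "Microscopy: Atypical cells present.\n"
    "Diagnosis: Finding consistent with example text.\n"
)
sections = split_sections(report)
```

Restricting the search to one section is what removes the false positives a whole-document keyword search would return: "atypical" matches in the microscopy section of this report but not in the diagnosis section.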
Embedding strategies for effective use of information from multiple sequence alignments.
Henikoff, S.; Henikoff, J. G.
1997-01-01
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain. PMID:9070452
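For readers unfamiliar with position-specific scoring matrices, the Python sketch below scores every window of a sequence against a small PSSM and reports the best ungapped hit. It illustrates PSSM scoring only, not the embedding strategy or the modified Smith-Waterman search described in the abstract; the matrix values, default score, and function name are hypothetical.

```python
def best_pssm_hit(sequence, pssm, default=-1):
    """Slide the PSSM along the sequence and return (start, score) of the best window.

    pssm: one dict per motif position mapping residue -> score; residues missing
    from a column receive the default score. Returns None if the sequence is
    shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical three-column matrix, for illustration only.
toy_pssm = [{"G": 2, "A": 1}, {"K": 3}, {"S": 2, "T": 2}]
hit = best_pssm_hit("MAGKTLL", toy_pssm)
```

Because each column carries its own scores, a conserved position (the lysine column here) can dominate the match while variable positions tolerate substitutions, which is the information a single representative sequence cannot express.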
NASA Technical Reports Server (NTRS)
Denney, Ewen W.; Naylor, Dwight; Pai, Ganesh
2014-01-01
Querying a safety case to show how the various stakeholders' concerns about system safety are addressed has been put forth as one of the benefits of argument-based assurance (in a recent study by the Health Foundation, UK, which reviewed the use of safety cases in safety-critical industries). However, neither the literature nor current practice offer much guidance on querying mechanisms appropriate for, or available within, a safety case paradigm. This paper presents a preliminary approach that uses a formal basis for querying safety cases, specifically Goal Structuring Notation (GSN) argument structures. Our approach semantically enriches GSN arguments with domain-specific metadata that the query language leverages, along with its inherent structure, to produce views. We have implemented the approach in our toolset AdvoCATE, and illustrate it by application to a fragment of the safety argument for an Unmanned Aircraft System (UAS) being developed at NASA Ames. We also discuss the potential practical utility of our query mechanism within the context of the existing framework for UAS safety assurance.
AthMethPre: a web server for the prediction and query of mRNA m6A sites in Arabidopsis thaliana.
Xiang, Shunian; Yan, Zhangming; Liu, Ke; Zhang, Yaou; Sun, Zhirong
2016-10-18
N6-Methyladenosine (m6A) is the most prevalent and abundant modification in mRNA and has been linked to many key biological processes. High-throughput experiments have generated m6A peaks across the transcriptome of A. thaliana, but the specific methylated sites were not assigned, which impedes the understanding of m6A functions in plants. Computational prediction of mRNA m6A sites has therefore become urgently important. Here, we present a method to predict the m6A sites for A. thaliana mRNA sequence(s). To predict the m6A sites of an mRNA sequence, we employed a support vector machine to build a classifier using the features of the positional flanking nucleotide sequence and the position-independent k-mer nucleotide spectrum. Our method achieved good performance and was deployed as a web server to provide a service for the prediction of A. thaliana m6A sites. The server also provides a comprehensive database of predicted transcriptome-wide m6A sites and curated m6A-seq peaks from the literature for query and visualization. The AthMethPre web server is the first web server that provides a user-friendly tool for the prediction and query of A. thaliana mRNA m6A sites, which is freely accessible for public use at .
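As an illustration of the position-independent k-mer nucleotide spectrum mentioned in this abstract, the sketch below computes normalized k-mer frequencies for a sequence. The function name, the default k and the RNA alphabet are assumptions; the abstract does not state which k values or normalization the authors used, and the SVM training step is not shown.

```python
from itertools import product


def kmer_spectrum(sequence, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed order over the alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Treat DNA-style input as RNA so that T and U are counted together.
    seq = sequence.upper().replace("T", "U")
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing ambiguous symbols are skipped
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]
```

The resulting vector has a fixed length of len(alphabet) ** k regardless of sequence length, which is what makes it usable as a classifier feature vector.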
An adaptable architecture for patient cohort identification from diverse data sources
Bache, Richard; Miles, Simon; Taweel, Adel
2013-01-01
Objective We define and validate an architecture for systems that identify patient cohorts for clinical trials from multiple heterogeneous data sources. This architecture has an explicit query model capable of supporting temporal reasoning and expressing eligibility criteria independently of the representation of the data used to evaluate them. Method The architecture has the key feature that queries defined according to the query model are both pre and post-processed and this is used to address both structural and semantic heterogeneity. The process of extracting the relevant clinical facts is separated from the process of reasoning about them. A specific instance of the query model is then defined and implemented. Results We show that the specific instance of the query model has wide applicability. We then describe how it is used to access three diverse data warehouses to determine patient counts. Discussion Although the proposed architecture requires greater effort to implement the query model than would be the case for using just SQL and accessing a data-based management system directly, this effort is justified because it supports both temporal reasoning and heterogeneous data sources. The query model only needs to be implemented once no matter how many data sources are accessed. Each additional source requires only the implementation of a lightweight adaptor. Conclusions The architecture has been used to implement a specific query model that can express complex eligibility criteria and access three diverse data warehouses thus demonstrating the feasibility of this approach in dealing with temporal reasoning and data heterogeneity. PMID:24064442
HomPPI: a class of sequence homology based protein-protein interface prediction methods
2011-01-01
Background Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. Results We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. 
The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably identified. The HomPPI web server is available at http://homppi.cs.iastate.edu/. Conclusions Sequence homology-based methods offer a class of computationally efficient and reliable approaches for predicting the protein-protein interface residues that participate in either obligate or transient interactions. For query proteins involved in transient interactions, the reliability of interface residue prediction can be improved by exploiting knowledge of putative interaction partners. PMID:21682895
NASA Astrophysics Data System (ADS)
Li, C.; Zhu, X.; Guo, W.; Liu, Y.; Huang, H.
2015-05-01
A method suitable for complex indoor semantic queries, which takes the computation of indoor spatial relations into account, is provided according to the characteristics of indoor space. This paper designs an ontology model describing the space-related information of humans, events and indoor space objects (e.g. Storey and Room), as well as their relations, to meet the needs of indoor semantic queries. The ontology concepts are used in the IndoorSPARQL query language, which extends SPARQL syntax for representing and querying indoor space. Four types of specific primitives for indoor queries, "Adjacent", "Opposite", "Vertical" and "Contain", are defined as query functions in IndoorSPARQL to support quantitative spatial computations. A method is also proposed to analyze the query language. Finally, this paper adopts this method to realize indoor semantic queries on the study area by constructing the ontology model for the study building. The experimental results show that the proposed method can effectively support complex indoor space semantic queries.
Portal to the GALEX Data Archive
NASA Astrophysics Data System (ADS)
Smith, M. A.; Conti, A.; Shiao, B.; Volpicelli, C. A.
2004-05-01
In early February MAST began its hosting of the GALEX public "Early Release Observations" images (40,000 objects) and spectra (1000 objects). MAST will host a much larger "first release," the GALEX DR1, in October, 2004. In this poster we describe features of our on-line website at http://galex.stsci.edu for researchers interested in downloading and browsing GALEX UV image and spectral data. The site is based on MS .NET technology, and user queries are entered for classes of objects or sky regions on "MAST-like" query forms or as detailed queries written in SQL. In the latter case examples are provided to tailor a query to a user's specifications. The site provides novel features, such as tooltips that return keyword definitions, "active images" that return object classification and coordinate information in a 2.5 arcmin radius around the selected object, self-documentation of terms and tables, and of course a tutorial for new navigators. The GALEX database employs a Hierarchical Triangular Mesh system for rapid data discovery, neighbor searches, and cross correlations with other catalogs. Our "GMAX" tool allows coplotting of object positions for objects observed by GALEX and other US-NVO compliant mission websites such as Sloan, 2MASS, FIRST.... As a member of the new Skynode network, GALEX has reported its web services to the US-NVO registry. This permits users to generate queries from other sites to cross-correlate, compare, and plot GALEX data using US-NVO protocols. Future plans for limited on-line data analysis and footprint services are described.
Yousefi-Nooraie, Reza; Irani, Shirin; Mortaz-Hedjri, Soroush; Shakiba, Behnam
2013-10-01
The aim of this study was to compare the performance of three search methods in the retrieval of relevant clinical trials from PubMed to answer specific clinical questions. The included studies of a sample of 100 Cochrane reviews that were recorded in PubMed were considered the reference standard. The search queries were formulated based on the systematic review titles. Precision, recall and number of retrieved records were compared for limiting the results to the clinical trial publication type and for using the sensitive and specific clinical queries filters. The number of keywords and the presence of specific names of intervention and syndrome in the search keywords were used in a model to predict the recalls and precisions. The Clinical queries-sensitive search strategy retrieved the largest number of records (33) and had the highest recall (41.6%) and lowest precision (4.8%). The presence of a specific intervention name was the only significant predictor of all recalls and precisions (P = 0.016). The recall and precision of the combination of simple clinical search queries and methodological search filters to find clinical trials in various subjects were considerably low. The limit field strategy yielded higher precision, fewer retrieved records and approximately similar recall, compared with the clinical queries-sensitive strategy. The presence of a specific intervention name in the search keywords increased both recall and precision. © 2010 John Wiley & Sons Ltd.
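The precision and recall figures compared in this abstract follow the standard definitions, which can be sketched as below. The function name and the record identifiers in the usage note are hypothetical; the study's reference standard was the set of included studies of each Cochrane review.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall of a search result against a reference standard.

    precision = relevant records retrieved / all records retrieved
    recall    = relevant records retrieved / all relevant records
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a search that returns four records of which two are among three reference-standard trials has a precision of 0.5 and a recall of 2/3. A more sensitive filter typically retrieves more records, which raises recall while lowering precision, the trade-off the abstract reports.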
In-context query reformulation for failing SPARQL queries
NASA Astrophysics Data System (ADS)
Viswanathan, Amar; Michaelis, James R.; Cassidy, Taylor; de Mel, Geeth; Hendler, James
2017-05-01
Knowledge bases for decision support systems are growing increasingly complex, through continued advances in data ingest and management approaches. However, humans do not possess the cognitive capabilities to retain a bird's-eye view of such knowledge bases, and may end up issuing unsatisfiable queries to such systems. This work focuses on the implementation of a query reformulation approach for graph-based knowledge bases, specifically designed to support the Resource Description Framework (RDF). The reformulation approach presented is instance- and schema-aware. Thus, in contrast to relaxation techniques found in the state-of-the-art, the presented approach produces in-context query reformulation.
SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases.
Schweiger, Dominik; Trajanoski, Zlatko; Pabinger, Stephan
2014-08-15
Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the just recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. This new graphical way of creating queries for biological Semantic Web databases considerably facilitates usability as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
Privacy Risks from Genomic Data-Sharing Beacons
Shringarpure, Suyash S.; Bustamante, Carlos D.
2015-01-01
The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries—such as “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?”—with either “yes” or “no.” Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes. PMID:26522470
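The allele-presence query interface described in this abstract can be sketched as follows. This only illustrates the yes/no protocol; it does not implement the authors' likelihood-ratio test. The class name and the representation of a genome as a set of (chromosome, position, allele) tuples are assumptions made for illustration.

```python
class Beacon:
    """Toy beacon: answers whether any stored genome carries a given allele."""

    def __init__(self, genomes):
        # Each genome is an iterable of (chromosome, position, allele) tuples.
        # Only the union is kept, so a response never names an individual.
        self._alleles = set()
        for genome in genomes:
            self._alleles.update(genome)

    def query(self, chromosome, position, allele):
        """Return True ("yes") if at least one genome has the allele, else False."""
        return (chromosome, position, allele) in self._alleles
```

Even though each response is a single bit about the pooled data, the abstract's point is that many such responses for one person's alleles can, taken together, reveal whether that person is in the beacon.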
Walter User’s Manual (Version 1.0).
1987-09-01
queries and/or commands. 1.2 - How Walter Uses the Screen As shown in Figure 1-1, Walter divides the screen of your terminal into five separate areas...our attention to queries and how to submit them to the database. 1.3.1 - Submitting Queries A query is an expression consisting of words, parentheses...dates, but also with ranges of dates, such as "oct 15 : nov 15". Walter recognizes three kinds of dates: * Specific dates of the form [date <month> <day
A Query Integrator and Manager for the Query Web
Brinkley, James F.; Detwiler, Landon T.
2012-01-01
We introduce two concepts: the Query Web as a layer of interconnected queries over the document web and the semantic web, and a Query Web Integrator and Manager (QI) that enables the Query Web to evolve. QI permits users to write, save and reuse queries over any web accessible source, including other queries saved in other installations of QI. The saved queries may be in any language (e.g. SPARQL, XQuery); the only condition for interconnection is that the queries return their results in some form of XML. This condition allows queries to chain off each other, and to be written in whatever language is appropriate for the task. We illustrate the potential use of QI for several biomedical use cases, including ontology view generation using a combination of graph-based and logical approaches, value set generation for clinical data management, image annotation using terminology obtained from an ontology web service, ontology-driven brain imaging data integration, small-scale clinical data integration, and wider-scale clinical data integration. Such use cases illustrate the current range of applications of QI and lead us to speculate about the potential evolution from smaller groups of interconnected queries into a larger query network that layers over the document and semantic web. The resulting Query Web could greatly aid researchers and others who now have to manually navigate through multiple information sources in order to answer specific questions. PMID:22531831
Optimizing a Query by Transformation and Expansion.
Glocker, Katrin; Knurr, Alexander; Dieter, Julia; Dominick, Friederike; Forche, Melanie; Koch, Christian; Pascoe Pérez, Analie; Roth, Benjamin; Ückert, Frank
2017-01-01
In the biomedical sector not only the amount of information produced and uploaded into the web is enormous, but also the number of sources where these data can be found. Clinicians and researchers spend huge amounts of time on trying to access this information and to filter the most important answers to a given question. As the formulation of these queries is crucial, automated query expansion is an effective tool to optimize a query and receive the best possible results. In this paper we introduce the concept of a workflow for an optimization of queries in the medical and biological sector by using a series of tools for expansion and transformation of the query. After the definition of attributes by the user, the query string is compared to previous queries in order to add semantic co-occurring terms to the query. Additionally, the query is enlarged by an inclusion of synonyms. The translation into database specific ontologies ensures the optimal query formulation for the chosen database(s). As this process can be performed in various databases at once, the results are ranked and normalized in order to achieve a comparable list of answers for a question.
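The synonym-inclusion step of the workflow described in this abstract can be sketched as a simple Boolean query expansion. The synonym table, function name and output syntax are assumptions; the abstract does not specify the query syntax of the target databases, and the co-occurrence and ontology-translation steps are not shown.

```python
def expand_query(terms, synonyms):
    """Expand each term with its synonyms and join the groups with AND.

    `synonyms` maps a lower-cased term to a list of alternative terms.
    """
    groups = []
    for term in terms:
        variants = [term] + synonyms.get(term.lower(), [])
        groups.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(groups)
```

With a hypothetical synonym table {"tumor": ["neoplasm"]}, the terms ["tumor", "therapy"] expand to "(tumor OR neoplasm) AND (therapy)". Alternatives within a group broaden recall, while the AND between groups preserves the intent of the original query.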
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce.
Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel
2013-08-01
Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have shown that the performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a library for processing spatial queries, and as an integrated software package in Hive.
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce
Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel
2013-01-01
Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS – a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have shown that the performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a library for processing spatial queries, and as an integrated software package in Hive. PMID:24187650
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce.
Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng
2013-11-01
The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes of data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS - a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing.
Predicting Drug Recalls From Internet Search Engine Queries.
Yom-Tov, Elad
2017-01-01
Batches of pharmaceuticals are sometimes recalled from the market when a safety issue or a defect is detected in specific production runs of a drug. Such problems are usually detected when patients or healthcare providers report abnormalities to medical authorities. Here, we test the hypothesis that defective production lots can be detected earlier by monitoring queries to Internet search engines. We extracted queries from the USA to the Bing search engine, which mentioned one of the 5195 pharmaceutical drugs during 2015 and all recall notifications issued by the Food and Drug Administration (FDA) during that year. By using attributes that quantify the change in query volume at the state level, we attempted to predict if a recall of a specific drug will be ordered by the FDA in a time horizon ranging from 1 to 40 days in the future. Our results show that future drug recalls can indeed be identified with an AUC of 0.791 and a lift at 5% of approximately 6 when predicting a recall occurring one day ahead. This performance degrades as prediction is made for longer periods ahead. The most indicative attributes for prediction are sudden spikes in query volume about a specific medicine in each state. Recalls of prescription drugs and those estimated to be of medium-risk are more likely to be identified using search query data. These findings suggest that aggregated Internet search engine data can be used to facilitate early warning of faulty batches of medicines.
Virgilio, Massimiliano; Jordaens, Kurt; Breman, Floris C; Backeljau, Thierry; De Meyer, Marc
2012-01-01
We propose a general working strategy to deal with incomplete reference libraries in the DNA barcoding identification of species. Considering that (1) queries with a large genetic distance with their best DNA barcode match are more likely to be misidentified and (2) imposing a distance threshold profitably reduces identification errors, we modelled relationships between identification performances and distance thresholds in four DNA barcode libraries of Diptera (n = 4270), Lepidoptera (n = 7577), Hymenoptera (n = 2067) and Tephritidae (n = 602 DNA barcodes). In all cases, more restrictive distance thresholds produced a gradual increase in the proportion of true negatives, a gradual decrease of false positives and more abrupt variations in the proportions of true positives and false negatives. More restrictive distance thresholds improved precision, yet negatively affected accuracy due to the higher proportions of queries discarded (viz. having a distance query-best match above the threshold). Using a simple linear regression we calculated an ad hoc distance threshold for the tephritid library producing an estimated relative identification error <0.05. According to the expectations, when we used this threshold for the identification of 188 independently collected tephritids, less than 5% of queries with a distance query-best match below the threshold were misidentified. Ad hoc thresholds can be calculated for each particular reference library of DNA barcodes and should be used as cut-off mark defining whether we can proceed identifying the query with a known estimated error probability (e.g. 5%) or whether we should discard the query and consider alternative/complementary identification methods.
Virgilio, Massimiliano; Jordaens, Kurt; Breman, Floris C.; Backeljau, Thierry; De Meyer, Marc
2012-01-01
We propose a general working strategy to deal with incomplete reference libraries in the DNA barcoding identification of species. Considering that (1) queries with a large genetic distance with their best DNA barcode match are more likely to be misidentified and (2) imposing a distance threshold profitably reduces identification errors, we modelled relationships between identification performances and distance thresholds in four DNA barcode libraries of Diptera (n = 4270), Lepidoptera (n = 7577), Hymenoptera (n = 2067) and Tephritidae (n = 602 DNA barcodes). In all cases, more restrictive distance thresholds produced a gradual increase in the proportion of true negatives, a gradual decrease of false positives and more abrupt variations in the proportions of true positives and false negatives. More restrictive distance thresholds improved precision, yet negatively affected accuracy due to the higher proportions of queries discarded (viz. having a distance query-best match above the threshold). Using a simple linear regression we calculated an ad hoc distance threshold for the tephritid library producing an estimated relative identification error <0.05. According to the expectations, when we used this threshold for the identification of 188 independently collected tephritids, less than 5% of queries with a distance query-best match below the threshold were misidentified. Ad hoc thresholds can be calculated for each particular reference library of DNA barcodes and should be used as cut-off mark defining whether we can proceed identifying the query with a known estimated error probability (e.g. 5%) or whether we should discard the query and consider alternative/complementary identification methods. PMID:22359600
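The cut-off rule described in this abstract, in which a query is identified only when its distance to the best match falls at or below an ad hoc threshold, can be sketched as below. The function name, the dictionary input and the example threshold are assumptions; the authors derived their threshold by linear regression on the tephritid library, which is not reproduced here.

```python
def identify(distances, threshold):
    """Identify a query from its distances to reference barcodes.

    `distances` maps a species name to the genetic distance between the query
    and that species' closest reference barcode. Returns the best-matching
    species, or None when the best match lies above the threshold and the
    query should be discarded in favour of other identification methods.
    """
    if not distances:
        return None
    best = min(distances, key=distances.get)
    if distances[best] > threshold:
        return None
    return best
```

Lowering the threshold discards more queries, which is why the abstract reports improved precision but reduced overall accuracy under more restrictive thresholds.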
Chan, Emily H; Sahai, Vikram; Conrad, Corrie; Brownstein, John S
2011-05-01
A variety of obstacles including bureaucracy and lack of resources have interfered with timely detection and reporting of dengue cases in many endemic countries. Surveillance efforts have turned to modern data sources, such as Internet search queries, which have been shown to be effective for monitoring influenza-like illnesses. However, few have evaluated the utility of web search query data for other diseases, especially those of high morbidity and mortality or where a vaccine may not exist. In this study, we aimed to assess whether web search queries are a viable data source for the early detection and monitoring of dengue epidemics. Bolivia, Brazil, India, Indonesia and Singapore were chosen for analysis based on available data and adequate search volume. For each country, a univariate linear model was then built by fitting a time series of the fraction of Google search query volume for specific dengue-related queries from that country against a time series of official dengue case counts for a time-frame within 2003-2010. The specific combination of queries used was chosen to maximize model fit. Spurious spikes in the data were also removed prior to model fitting. The final models, fit using a training subset of the data, were cross-validated against both the overall dataset and a holdout subset of the data. All models were found to fit the data quite well, with validation correlations ranging from 0.82 to 0.99. Web search query data were found to be capable of tracking dengue activity in Bolivia, Brazil, India, Indonesia and Singapore. Whereas traditional dengue data from official sources are often not available until after some substantial delay, web search query data are available in near real-time. These data represent a valuable complement to traditional dengue surveillance.
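The modelling step described in this abstract, a univariate linear fit on a training subset followed by a correlation check on a holdout subset, can be sketched in plain Python. All numbers below are invented for illustration and are not data from the study; the function names are also assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical series: fraction of search volume and official case counts.
search_fraction = [0.10, 0.15, 0.22, 0.30, 0.41, 0.38, 0.27, 0.19]
case_counts = [120, 170, 260, 330, 470, 420, 310, 200]

# Fit on the first five periods, validate on the remaining three.
slope, intercept = fit_line(search_fraction[:5], case_counts[:5])
predicted = [slope * s + intercept for s in search_fraction[5:]]
holdout_correlation = pearson(predicted, case_counts[5:])
```

Validating on periods the model never saw is what distinguishes the reported validation correlations from a simple goodness-of-fit figure on the training data.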
Lee, Donghyun; Lee, Hojun
2016-01-01
Background Internet search query data reflect the attitudes of the users, using which we can measure the past orientation to commit suicide. Examinations of past orientation often highlight certain predispositions of attitude, many of which can be suicide risk factors. Objective To investigate the relationship between past orientation and suicide rate by examining Google search queries. Methods We measured the past orientation using Google search query data by comparing the search volumes of the past year and those of the future year, across the 50 US states and the District of Columbia during the period from 2004 to 2012. We constructed a panel dataset with independent variables as control variables; we then undertook an analysis using multiple ordinary least squares regression and methods that leverage the Akaike information criterion and the Bayesian information criterion. Results It was found that past orientation had a positive relationship with the suicide rate (P≤.001) and that it improves the goodness-of-fit of the model regarding the suicide rate. Unemployment rate (P≤.001 in Models 3 and 4), Gini coefficient (P≤.001), and population growth rate (P≤.001) had a positive relationship with the suicide rate, whereas the gross state product (P≤.001) showed a negative relationship with the suicide rate. Conclusions We empirically identified the positive relationship between the suicide rate and past orientation, which was measured by big data-driven Google search query. PMID:26868917
Lee, Donghyun; Lee, Hojun; Choi, Munkee
2016-02-11
Internet search query data reflect the attitudes of the users, using which we can measure the past orientation to commit suicide. Examinations of past orientation often highlight certain predispositions of attitude, many of which can be suicide risk factors. To investigate the relationship between past orientation and suicide rate by examining Google search queries. We measured the past orientation using Google search query data by comparing the search volumes of the past year and those of the future year, across the 50 US states and the District of Columbia during the period from 2004 to 2012. We constructed a panel dataset with independent variables as control variables; we then undertook an analysis using multiple ordinary least squares regression and methods that leverage the Akaike information criterion and the Bayesian information criterion. It was found that past orientation had a positive relationship with the suicide rate (P ≤ .001) and that it improves the goodness-of-fit of the model regarding the suicide rate. Unemployment rate (P ≤ .001 in Models 3 and 4), Gini coefficient (P ≤ .001), and population growth rate (P ≤ .001) had a positive relationship with the suicide rate, whereas the gross state product (P ≤ .001) showed a negative relationship with the suicide rate. We empirically identified the positive relationship between the suicide rate and past orientation, which was measured by big data-driven Google search query.
Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets
NASA Astrophysics Data System (ADS)
Juric, Mario
2011-01-01
The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ("column groups"), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping "cells" by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce "kernels" that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 Grows of PanSTARRS+SDSS data (220 GB) in less than 15 minutes on a dual CPU machine. In a cluster environment, we achieved bandwidths of 17 Gbits/sec (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.
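The horizontal partitioning and per-cell kernels described above can be illustrated with a minimal sketch. This is not LSD's API: the function names, the cell sizes, the row layout and the sample rows are all hypothetical, and the cells here do not overlap, whereas LSD's cells partially overlap.

```python
from collections import defaultdict

def cell_id(lon, lat, t, cell_deg=10.0, cell_days=30.0):
    """Map a (lon, lat, t) position to a hypothetical integer cell key."""
    return (int(lon // cell_deg), int(lat // cell_deg), int(t // cell_days))

def partition(rows):
    """Group rows by the space-time cell they fall into."""
    cells = defaultdict(list)
    for row in rows:
        cells[cell_id(row["lon"], row["lat"], row["t"])].append(row)
    return cells

def map_kernel(cell_key, cell_rows):
    """Per-cell 'map' step: emit one (band, 1) pair for every row in the cell."""
    for row in cell_rows:
        yield row["band"], 1

def reduce_kernel(pairs):
    """'Reduce' step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Made-up catalog rows, for illustration only.
rows = [
    {"lon": 12.3, "lat": -5.0, "t": 10.0, "band": "g"},
    {"lon": 14.9, "lat": -2.1, "t": 12.0, "band": "r"},
    {"lon": 200.0, "lat": 45.0, "t": 400.0, "band": "g"},
]

cells = partition(rows)
pairs = [p for key, cell_rows in cells.items() for p in map_kernel(key, cell_rows)]
counts = reduce_kernel(pairs)
```

Because each cell is processed independently in the map step, the per-cell calls could in principle be handed to separate workers; the sketch runs them sequentially.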
Assisting Consumer Health Information Retrieval with Query Recommendations
Zeng, Qing T.; Crowell, Jonathan; Plovnick, Robert M.; Kim, Eunjung; Ngo, Long; Dibble, Emily
2006-01-01
Objective: Health information retrieval (HIR) on the Internet has become an important practice for millions of people, many of whom have problems forming effective queries. We have developed and evaluated a tool to assist people in health-related query formation. Design: We developed the Health Information Query Assistant (HIQuA) system. The system suggests alternative/additional query terms related to the user's initial query that can be used as building blocks to construct a better, more specific query. The recommended terms are selected according to their semantic distance from the original query, which is calculated on the basis of concept co-occurrences in medical literature and log data as well as semantic relations in medical vocabularies. Measurements: An evaluation of the HIQuA system was conducted and a total of 213 subjects participated in the study. The subjects were randomized into 2 groups. One group was given query recommendations and the other was not. Each subject performed HIR for both a predefined and a self-defined task. Results: The study showed that providing HIQuA recommendations resulted in statistically significantly higher rates of successful queries (odds ratio = 1.66, 95% confidence interval = 1.16–2.38), although no statistically significant impact on user satisfaction or the users' ability to accomplish the predefined retrieval task was found. Conclusion: Providing semantic-distance-based query recommendations can help consumers with query formation during HIR. PMID:16221944
Mansouri, Ava; Sarayani, Amir; Ashouri, Asieh; Sherafatmand, Mona; Hadjibabaie, Molouk; Gholami, Kheirollah
2015-01-01
Self-Medication (SM), i.e. using medications to treat oneself, is a major concern for health researchers and policy makers. The terms "self medication" or "self-medication" (SM terms) have been used to explain various concepts while several terms have also been employed to define this practice. Hence, retrieving relevant publications would require exhaustive literature screening. So, we assessed the current situation of SM terms in the literature to improve the relevancy of search outcomes. In this Systematic exploration, SM terms were searched in the 6 following databases and publisher's portals till April 2012: Web of Science, Scopus, PubMed, Google scholar, ScienceDirect, and Wiley. A simple search query was used to include only publications with SM terms. We used Relative-Risk (RR) to estimate the probability of SM terms use in related compared to unrelated publications. Sensitivity and specificity of SM terms as keywords in search query were also calculated. Relevant terms to SM practice were extracted and their Likelihood Ratio positive and negative (LR+/-) were calculated to assess their effect on the probability of search outcomes relevancy in addition to previous search queries. We also evaluated the content of unrelated publications. All mentioned steps were performed in title (TI) and title or abstract (TIAB) of publications. 1999 related and 1917 unrelated publications were found. SM terms RR was 4.5 in TI and 2.1 in TIAB. SM terms sensitivity and specificity respectively were 55.4% and 87.7% in TI and 84.0% and 59.5% in TIAB. "OTC" and "Over-The-Counter Medication", with LR+ 16.78 and 16.30 respectively, provided the most conclusive increase in the probability of the relevancy of publications. The most common unrelated SM themes were self-medication hypothesis, drug abuse and Zoopharmacognosy. 
Due to relatively low specificity or sensitivity of SM terms, relevant terms should be employed in search queries and clear definitions of SM applications should be applied to improve the relevancy of publications.
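As a worked illustration of the likelihood-ratio arithmetic, the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity can be applied to the sensitivity and specificity values reported above for SM terms. This is a sketch of the textbook formulas only; the study computed LR+/- for additional relevant terms, and the function name below is hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

# Sensitivity and specificity of SM terms as reported in the abstract.
ti = likelihood_ratios(0.554, 0.877)    # title only (TI)
tiab = likelihood_ratios(0.840, 0.595)  # title or abstract (TIAB)
```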
Mansouri, Ava; Sarayani, Amir; Ashouri, Asieh; Sherafatmand, Mona; Hadjibabaie, Molouk; Gholami, Kheirollah
2015-01-01
Background Self-Medication (SM), i.e. using medications to treat oneself, is a major concern for health researchers and policy makers. The terms “self medication” or “self-medication” (SM terms) have been used to explain various concepts while several terms have also been employed to define this practice. Hence, retrieving relevant publications would require exhaustive literature screening. So, we assessed the current situation of SM terms in the literature to improve the relevancy of search outcomes. Methods In this Systematic exploration, SM terms were searched in the 6 following databases and publisher’s portals till April 2012: Web of Science, Scopus, PubMed, Google scholar, ScienceDirect, and Wiley. A simple search query was used to include only publications with SM terms. We used Relative-Risk (RR) to estimate the probability of SM terms use in related compared to unrelated publications. Sensitivity and specificity of SM terms as keywords in search query were also calculated. Relevant terms to SM practice were extracted and their Likelihood Ratio positive and negative (LR+/-) were calculated to assess their effect on the probability of search outcomes relevancy in addition to previous search queries. We also evaluated the content of unrelated publications. All mentioned steps were performed in title (TI) and title or abstract (TIAB) of publications. Results 1999 related and 1917 unrelated publications were found. SM terms RR was 4.5 in TI and 2.1 in TIAB. SM terms sensitivity and specificity respectively were 55.4% and 87.7% in TI and 84.0% and 59.5% in TIAB. “OTC” and “Over-The-Counter Medication”, with LR+ 16.78 and 16.30 respectively, provided the most conclusive increase in the probability of the relevancy of publications. The most common unrelated SM themes were self-medication hypothesis, drug abuse and Zoopharmacognosy. 
Conclusions Due to relatively low specificity or sensitivity of SM terms, relevant terms should be employed in search queries and clear definitions of SM applications should be applied to improve the relevancy of publications. PMID:25932634
Which factors predict the time spent answering queries to a drug information centre?
Reppe, Linda A.; Spigset, Olav
2010-01-01
Objective To develop a model based upon factors able to predict the time spent answering drug-related queries to Norwegian drug information centres (DICs). Setting and method Drug-related queries received at 5 DICs in Norway from March to May 2007 were randomly assigned to 20 employees until each of them had answered a minimum of five queries. The employees reported the number of drugs involved, the type of literature search performed, and whether the queries were considered judgmental or not, using a specifically developed scoring system. Main outcome measures The scores of these three factors were added together to define a workload score for each query. Workload and its individual factors were subsequently related to the measured time spent answering the queries by simple or multiple linear regression analyses. Results Ninety-six query/answer pairs were analyzed. Workload significantly predicted the time spent answering the queries (adjusted R2 = 0.22, P < 0.001). Literature search was the individual factor best predicting the time spent answering the queries (adjusted R2 = 0.17, P < 0.001), and this variable also contributed the most in the multiple regression analyses. Conclusion The most important workload factor predicting the time spent handling the queries in this study was the type of literature search that had to be performed. The categorisation of queries as judgmental or not, also affected the time spent answering the queries. The number of drugs involved did not significantly influence the time spent answering drug information queries. PMID:20922480
Personalized query suggestion based on user behavior
NASA Astrophysics Data System (ADS)
Chen, Wanyu; Hao, Zepeng; Shao, Taihua; Chen, Honghui
Query suggestions help users refine their queries after they input an initial query. Previous work mainly concentrated on similarity-based and context-based query suggestion approaches. However, models that focus on adapting to a specific user (personalization) can help to improve the probability of the user being satisfied. In this paper, we propose a personalized query suggestion model based on users’ search behavior (UB model), where we inject relevance between queries and users’ search behavior into a basic probabilistic model. For the relevance between queries, we consider their semantical similarity and co-occurrence which indicates the behavior information from other users in web search. Regarding the current user’s preference to a query, we combine the user’s short-term and long-term search behavior in a linear fashion and deal with the data sparse problem with Bayesian probabilistic matrix factorization (BPMF). In particular, we also investigate the impact of different personalization strategies (the combination of the user’s short-term and long-term search behavior) on the performance of query suggestion reranking. We quantify the improvement of our proposed UB model against a state-of-the-art baseline using the public AOL query logs and show that it beats the baseline in terms of metrics used in query suggestion reranking. The experimental results show that: (i) for personalized ranking, users’ behavioral information helps to improve query suggestion effectiveness; and (ii) given a query, merging information inferred from the short-term and long-term search behavior of a particular user can result in a better performance than both plain approaches.
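The linear combination of short-term and long-term behavior mentioned above can be sketched as follows. This shows only the weighting step and omits the probabilistic model and BPMF entirely; the function names, the weight parameter and all scores are hypothetical.

```python
def combined_preference(short_term, long_term, weight=0.5):
    """Linear mix of short-term and long-term preference scores."""
    return weight * short_term + (1.0 - weight) * long_term

def rerank(candidates, short_scores, long_scores, weight=0.5):
    """Order candidate suggestions by the combined preference, highest first."""
    scored = {
        q: combined_preference(short_scores.get(q, 0.0), long_scores.get(q, 0.0), weight)
        for q in candidates
    }
    return sorted(candidates, key=lambda q: scored[q], reverse=True)

# Made-up suggestion candidates and preference scores, for illustration only.
candidates = ["jaguar car", "jaguar animal", "jaguar os"]
short_scores = {"jaguar car": 0.9, "jaguar animal": 0.1}
long_scores = {"jaguar car": 0.2, "jaguar animal": 0.8, "jaguar os": 0.3}

balanced = rerank(candidates, short_scores, long_scores, weight=0.5)
long_only = rerank(candidates, short_scores, long_scores, weight=0.0)
```

Setting the weight to 0 or 1 recovers the two plain strategies (long-term only, short-term only) that the abstract compares against the merged one.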
Development of a web-based video management and application processing system
NASA Astrophysics Data System (ADS)
Chan, Shermann S.; Wu, Yi; Li, Qing; Zhuang, Yueting
2001-07-01
How to facilitate efficient video manipulation and access in a web-based environment is becoming a popular trend for video applications. In this paper, we present a web-oriented video management and application processing system, based on our previous work on multimedia database and content-based retrieval. In particular, we extend the VideoMAP architecture with specific web-oriented mechanisms, which include: (1) Concurrency control facilities for the editing of video data among different types of users, such as Video Administrator, Video Producer, Video Editor, and Video Query Client; different users are assigned various priority levels for different operations on the database. (2) Versatile video retrieval mechanism which employs a hybrid approach by integrating a query-based (database) mechanism with content-based retrieval (CBR) functions; its specific language (CAROL/ST with CBR) supports spatio-temporal semantics of video objects, and also offers an improved mechanism to describe visual content of videos by content-based analysis method. (3) Query profiling database which records the "histories" of various clients' query activities; such profiles can be used to provide the default query template when a similar query is encountered by the same kind of users. An experimental prototype system is being developed based on the existing VideoMAP prototype system, using Java and VC++ on the PC platform.
Dynamic Querying of Mass-Storage RDF Data with Rule-Based Entailment Regimes
NASA Astrophysics Data System (ADS)
Ianni, Giovambattista; Krennwallner, Thomas; Martello, Alessandra; Polleres, Axel
RDF Schema (RDFS) as a lightweight ontology language is gaining popularity and, consequently, tools for scalable RDFS inference and querying are needed. SPARQL has recently become a W3C standard for querying RDF data, but it mostly provides means for querying simple RDF graphs only, whereas querying with respect to RDFS or other entailment regimes is left outside the current specification. In this paper, we show that SPARQL faces certain unwanted ramifications when querying ontologies in conjunction with RDF datasets that comprise multiple named graphs, and we provide an extension for SPARQL that remedies these effects. Moreover, since RDFS inference has a close relationship with logic rules, we generalize our approach to select a custom ruleset for specifying inferences to be taken into account in a SPARQL query. We show that our extensions are technically feasible by providing benchmark results for RDFS querying in our prototype system GiaBATA, which uses Datalog coupled with a persistent Relational Database as a back-end for implementing SPARQL with dynamic rule-based inference. By employing different optimization techniques like magic set rewriting, our system remains competitive with state-of-the-art RDFS querying systems.
NASA Astrophysics Data System (ADS)
Skotniczny, Zbigniew
1989-12-01
The Query by Forms (QbF) system is a user-oriented interactive tool for querying large relational databases with minimal query definition cost. The system was worked out under the assumption that the user's time and effort for defining needed queries is the most severe bottleneck. The system may be applied in any Rdb/VMS database system and is recommended for specific information systems of any project where end-user queries cannot be foreseen. The tool is dedicated to specialists of an application domain who have to analyze data maintained in a database from any needed point of view and who do not need to know commercial database languages. The paper presents the system developed as a compromise between its functionality and usability. User-system communication via a menu-driven "tree-like" structure of screen-forms which produces a query definition and execution is discussed in detail. Output of query results (printed reports and graphics) is also discussed. Finally the paper shows one application of QbF to a HERA-project.
Pentoney, Christopher; Harwell, Jeff; Leroy, Gondy
2014-01-01
Searching for medical information online is a common activity. While it has been shown that forming good queries is difficult, Google's query suggestion tool, a type of query expansion, aims to facilitate query formation. However, it is unknown how this expansion, which is based on what others searched for, affects the information gathering of the online community. To measure the impact of social-based query expansion, this study compared it with content-based expansion, i.e., what is really in the text. We used 138,906 medical queries from the AOL User Session Collection and expanded them using Google's Autocomplete method (social-based) and the content of the Google Web Corpus (content-based). We evaluated the specificity and ambiguity of the expansion terms for trigram queries. We also looked at the impact on the actual results using domain diversity and expansion edit distance. Results showed that the social-based method provided more precise expansion terms as well as terms that were less ambiguous. Expanded queries do not differ significantly in diversity when expanded using the social-based method (6.72 different domains returned in the first ten results, on average) vs. content-based method (6.73 different domains, on average).
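The domain-diversity measure quoted above (distinct domains among the first ten results) can be sketched as follows. This is a minimal illustration, not the study's code: the function name is hypothetical, the URLs are placeholders, and the sketch treats the full host name as the "domain", so `www.example.org` and `example.org` would count separately.

```python
from urllib.parse import urlparse

def domain_diversity(result_urls, top_n=10):
    """Count distinct host names among the first top_n result URLs."""
    domains = {urlparse(url).netloc.lower() for url in result_urls[:top_n]}
    domains.discard("")  # ignore entries that have no host part
    return len(domains)

# Placeholder result list, for illustration only.
results = [
    "https://example.org/a",
    "https://example.org/b",
    "https://example.com/x",
    "http://EXAMPLE.com/y",
]
diversity = domain_diversity(results)
```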
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce
Aji, Ablimit; Sun, Xiling; Vo, Hoang; Liu, Qioaling; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel; Wang, Fusheng
2016-01-01
The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes of data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS – a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for workload specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing. PMID:27617325
Aggregating Queries Against Large Inventories of Remotely Accessible Data
NASA Astrophysics Data System (ADS)
Gallagher, J. H. R.; Fulker, D. W.
2016-12-01
Those seeking to discover data for a specific purpose often encounter search results that are so large as to be useless without computing assistance. This situation arises, with increasing frequency, in part because repositories contain ever greater numbers of granules, and their granularities may well be poorly aligned or even orthogonal to the data-selection needs of the user. This presentation describes a recently developed service for simultaneously querying large lists of OPeNDAP-accessible granules to extract specified data. The specifications include a richly expressive set of data-selection criteria—applicable to content as well as metadata—and the service has been tested successfully against lists naming hundreds of thousands of granules. Querying such numbers of local files (i.e., granules) on a desktop or laptop computer is practical (by using a scripting language, e.g.), but this practicality is diminished when the data are remote and thus best accessed through a Web-services interface. In these cases, which are increasingly common, scripted queries can take many hours because of inherent network latencies. Furthermore, communication dropouts can add fragility to such scripts, yielding gaps in the acquired results. In contrast, OPeNDAP's new aggregated-query services enable data discovery in the context of very large inventory sizes. These capabilities have been developed for use with OPeNDAP's Hyrax server, which is an open-source realization of DAP (for "Data Access Protocol," a specification widely used in NASA, NOAA and other data-intensive contexts). These aggregated-query services exhibit good response times (on the order of seconds, not hours) even for inventories that list hundreds of thousands of source granules.
Secure searching of biomarkers through hybrid homomorphic encryption scheme.
Kim, Miran; Song, Yongsoo; Cheon, Jung Hee
2017-07-26
As genome sequencing technology develops rapidly, there has lately been an increasing need to keep genomic data secure even when stored in the cloud and still used for research. We are interested in designing a protocol for the secure outsourcing matching problem on encrypted data. We propose an efficient method to securely search a matching position with the query data and extract some information at the position. After decryption, only a small number of comparisons with the query information should be performed in plaintext state. We apply this method to find a set of biomarkers in encrypted genomes. The important feature of our method is to encode a genomic database as a single element of a polynomial ring. Since our method requires a single homomorphic multiplication of the hybrid scheme for query computation, it has the advantage over the previous methods in parameter size, computation complexity, and communication cost. In particular, the extraction procedure not only prevents leakage of database information that has not been queried by the user but also reduces the communication cost by half. We evaluate the performance of our method and verify that the computation on large-scale personal data can be securely and practically outsourced to a cloud environment during data analysis. It takes about 3.9 s to search-and-extract the reference and alternate sequences at the queried position in a database of size 4M. Our solution for finding a set of biomarkers in DNA sequences shows the progress of cryptographic techniques in terms of their capability to support real-world genome data analysis in a cloud environment.
Spatial information semantic query based on SPARQL
NASA Astrophysics Data System (ADS)
Xiao, Zhifeng; Huang, Lei; Zhai, Xiaofang
2009-10-01
How can the efficiency of spatial information inquiries be enhanced in today's fast-growing information age? We are rich in geospatial data but poor in up-to-date geospatial information and knowledge that are ready to be accessed by public users. This paper adopts an approach for querying spatial semantics by building a Web Ontology Language (OWL) format ontology and introducing SPARQL Protocol and RDF Query Language (SPARQL) to search spatial semantic relations. It is important to establish spatial semantics that support effective spatial reasoning for performing semantic queries. Compared to earlier keyword-based and information retrieval techniques that rely on syntax, we use semantic approaches in our spatial query system. Semantic approaches need to be developed by ontology, so we use OWL to describe spatial information extracted from the large-scale map of Wuhan. Spatial information expressed by ontology with formal semantics is available to machines for processing and to people for understanding. The approach is illustrated by introducing a case study for using SPARQL to query geo-spatial ontology instances of Wuhan. The paper shows that making use of SPARQL to search OWL ontology instances can ensure the result's accuracy and applicability. The result also indicates that constructing a geo-spatial semantic query system has positive effects on forming spatial query and retrieval.
A Coding Method for Efficient Subgraph Querying on Vertex- and Edge-Labeled Graphs
Zhu, Lei; Song, Qinbao; Guo, Yuchen; Du, Lei; Zhu, Xiaoyan; Wang, Guangtao
2014-01-01
Labeled graphs are widely used to model complex data in many domains, so subgraph querying has been attracting more and more attention from researchers around the world. Unfortunately, subgraph querying is very time consuming since it involves subgraph isomorphism testing that is known to be an NP-complete problem. In this paper, we propose a novel coding method for subgraph querying that is based on Laplacian spectrum and the number of walks. Our method follows the filtering-and-verification framework and works well on graph databases with frequent updates. We also propose novel two-step filtering conditions that can filter out most false positives and prove that the two-step filtering conditions satisfy the no-false-negative requirement (no dismissal in answers). Extensive experiments on both real and synthetic graphs show that, compared with six existing counterpart methods, our method can effectively improve the efficiency of subgraph querying. PMID:24853266
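The filtering-and-verification framework named above can be sketched in a few lines. The filter used here is a simple label-count check swapped in for illustration; it is not the paper's coding method based on the Laplacian spectrum and the number of walks. The graph representation, function names and sample graphs are hypothetical, and the verification step is a brute-force search suitable only for tiny graphs. Each undirected edge is assumed to be stored once.

```python
from collections import Counter
from itertools import permutations

def label_filter(query, graph):
    """Necessary condition: the graph has every vertex and edge label
    at least as often as the query. It never rejects a true answer."""
    qv, gv = Counter(query["vlabels"].values()), Counter(graph["vlabels"].values())
    qe, ge = Counter(query["elabels"].values()), Counter(graph["elabels"].values())
    return (all(gv[lab] >= n for lab, n in qv.items())
            and all(ge[lab] >= n for lab, n in qe.items()))

def edge_label(graph, u, v):
    """Label of the undirected edge {u, v}, or None if it does not exist."""
    return graph["elabels"].get((u, v), graph["elabels"].get((v, u)))

def verify(query, graph):
    """Brute-force subgraph isomorphism test (exponential time)."""
    qnodes = list(query["vlabels"])
    for image in permutations(graph["vlabels"], len(qnodes)):
        m = dict(zip(qnodes, image))
        if any(query["vlabels"][q] != graph["vlabels"][m[q]] for q in qnodes):
            continue
        if all(edge_label(graph, m[u], m[v]) == lab
               for (u, v), lab in query["elabels"].items()):
            return True
    return False

def subgraph_query(query, database):
    """Filter cheaply first, then verify only the surviving candidates."""
    candidates = [name for name, g in database.items() if label_filter(query, g)]
    return [name for name in candidates if verify(query, database[name])]

# Made-up query and database graphs, for illustration only.
query = {"vlabels": {0: "C", 1: "O"}, "elabels": {(0, 1): "double"}}
database = {
    "g1": {"vlabels": {"a": "C", "b": "O", "c": "H"},
           "elabels": {("a", "b"): "double", ("a", "c"): "single"}},
    "g2": {"vlabels": {"a": "C", "b": "O"},
           "elabels": {("a", "b"): "single"}},
    "g3": {"vlabels": {"a": "C", "b": "O", "c": "C"},
           "elabels": {("a", "c"): "double", ("c", "b"): "single"}},
}
answers = subgraph_query(query, database)
```

In this example `g2` is discarded by the filter alone, while `g3` passes the filter but fails verification, which is the kind of false positive that stronger filtering conditions aim to remove before the expensive step.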
Use of controlled vocabularies to improve biomedical information retrieval tasks.
Pasche, Emilie; Gobeill, Julien; Vishnyakova, Dina; Ruch, Patrick; Lovis, Christian
2013-01-01
The high heterogeneity of biomedical vocabulary is a major obstacle for information retrieval in large biomedical collections. Therefore, using biomedical controlled vocabularies is crucial for managing these contents. We investigate the impact of query expansion based on controlled vocabularies to improve the effectiveness of two search engines. Our strategy relies on the enrichment of users' queries with additional terms, directly derived from such vocabularies applied to infectious diseases and chemical patents. We observed that query expansion based on pathogen names resulted in improvements of the top-precision of our first search engine, while the normalization of diseases degraded the top-precision. The expansion of chemical entities, which was performed on the second search engine, positively affected the mean average precision. We have shown that query expansion of some types of biomedical entities has great potential to improve search effectiveness; therefore, fine-tuning of query expansion strategies could help improve the performance of search engines.
Image Retrieval by Color Semantics with Incomplete Knowledge.
ERIC Educational Resources Information Center
Corridoni, Jacopo M.; Del Bimbo, Alberto; Vicario, Enrico
1998-01-01
Presents a system which supports image retrieval by high-level chromatic contents, the sensations that color accordances generate on the observer. Surveys Itten's theory of color semantics and discusses image description and query specification. Presents examples of visual querying. (AEF)
Johnson, Amy K; Mikati, Tarek; Mehta, Supriya D
2016-11-09
US surveillance of sexually transmitted diseases (STDs) is often delayed and incomplete which creates missed opportunities to identify and respond to trends in disease. Internet search engine data has the potential to be an efficient, economical and representative enhancement to the established surveillance system. Google Trends allows the download of de-identified search engine data, which has been used to demonstrate the positive and statistically significant association between STD-related search terms and STD rates. In this study, search engine user content was identified by surveying specific exposure groups of individuals (STD clinic patients and university students) aged 18-35. Participants were asked to list the terms they use to search for STD-related information. Google Correlate was used to validate search term content. On average STD clinic participant queries were longer compared to student queries. STD clinic participants were more likely to report using search terms that were related to symptomatology such as describing symptoms of STDs, while students were more likely to report searching for general information. These differences in search terms by subpopulation have implications for STD surveillance in populations at most risk for disease acquisition.
Chan, Emily H.; Sahai, Vikram; Conrad, Corrie; Brownstein, John S.
2011-01-01
Background A variety of obstacles including bureaucracy and lack of resources have interfered with timely detection and reporting of dengue cases in many endemic countries. Surveillance efforts have turned to modern data sources, such as Internet search queries, which have been shown to be effective for monitoring influenza-like illnesses. However, few have evaluated the utility of web search query data for other diseases, especially those of high morbidity and mortality or where a vaccine may not exist. In this study, we aimed to assess whether web search queries are a viable data source for the early detection and monitoring of dengue epidemics. Methodology/Principal Findings Bolivia, Brazil, India, Indonesia and Singapore were chosen for analysis based on available data and adequate search volume. For each country, a univariate linear model was then built by fitting a time series of the fraction of Google search query volume for specific dengue-related queries from that country against a time series of official dengue case counts for a time-frame within 2003–2010. The specific combination of queries used was chosen to maximize model fit. Spurious spikes in the data were also removed prior to model fitting. The final models, fit using a training subset of the data, were cross-validated against both the overall dataset and a holdout subset of the data. All models were found to fit the data quite well, with validation correlations ranging from 0.82 to 0.99. Conclusions/Significance Web search query data were found to be capable of tracking dengue activity in Bolivia, Brazil, India, Indonesia and Singapore. Whereas traditional dengue data from official sources are often not available until after some substantial delay, web search query data are available in near real-time. These data represent a valuable complement to assist with traditional dengue surveillance. PMID:21647308
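The modelling step described above (fit a univariate linear model on a training subset, then check the correlation on a holdout subset) can be sketched as follows. The numbers are invented for illustration and are not data from the study; the function names and the train/holdout split are likewise assumptions, and query selection and spike removal are omitted.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Invented time series: query-volume fraction (arbitrary units) and case counts.
query_fraction = [0.10, 0.12, 0.20, 0.35, 0.30, 0.18, 0.11, 0.25, 0.40, 0.22]
case_counts = [52, 60, 105, 170, 150, 95, 58, 120, 205, 110]

train, holdout = slice(0, 7), slice(7, 10)
intercept, slope = fit_line(query_fraction[train], case_counts[train])
predicted = [intercept + slope * q for q in query_fraction[holdout]]
holdout_r = pearson(predicted, case_counts[holdout])
```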
Method for indexing and retrieving manufacturing-specific digital imagery based on image content
Ferrell, Regina K.; Karnowski, Thomas P.; Tobin, Jr., Kenneth W.
2004-06-15
A method for indexing and retrieving manufacturing-specific digital images based on image content comprises three steps. First, at least one feature vector can be extracted from a manufacturing-specific digital image stored in an image database. In particular, each extracted feature vector corresponds to a particular characteristic of the manufacturing-specific digital image, for instance, a digital image modality and overall characteristic, a substrate/background characteristic, and an anomaly/defect characteristic. Notably, the extracting step includes generating a defect mask using a detection process. Second, using an unsupervised clustering method, each extracted feature vector can be indexed in a hierarchical search tree. Third, a manufacturing-specific digital image associated with a feature vector stored in the hierarchical search tree can be retrieved, wherein the manufacturing-specific digital image has image content comparably related to the image content of the query image. More particularly, the retrieving step can include two data reductions, the first performed based upon a query vector extracted from a query image. Subsequently, a user can select relevant images resulting from the first data reduction. From the selection, a prototype vector can be calculated, from which a second-level data reduction can be performed. The second-level data reduction can result in a subset of feature vectors comparable to the prototype vector, and further comparable to the query vector. An additional fourth step can include managing the hierarchical search tree by substituting a vector average for several redundant feature vectors encapsulated by nodes in the hierarchical search tree.
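The two-level data reduction described above can be sketched as follows, under stated assumptions: the prototype vector is taken to be the mean of the user-selected feature vectors, similarity is Euclidean distance, and a flat linear scan stands in for the hierarchical search tree. None of these choices is confirmed by the abstract, and all names and vectors are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest(index, target, k):
    """Names of the k feature vectors closest to target (linear scan)."""
    return sorted(index, key=lambda name: dist(index[name], target))[:k]

# Made-up two-dimensional feature vectors, for illustration only.
index = {
    "img1": [0.0, 0.0],
    "img2": [1.0, 0.0],
    "img3": [0.0, 2.0],
    "img4": [5.0, 5.0],
}
query_vector = [0.2, 0.1]

# First reduction: candidates near the query vector.
first = nearest(index, query_vector, k=3)

# The user marks some of the returned images as relevant (hypothetical choice).
selected = ["img1", "img2"]
prototype = mean_vector([index[name] for name in selected])

# Second reduction: restrict the first result to vectors near the prototype.
second = nearest({name: index[name] for name in first}, prototype, k=2)
```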
Hybrid ontology for semantic information retrieval model using keyword matching indexing system.
Uthayan, K R; Mala, G S Anandha
2015-01-01
Ontology is the process of growth and elucidation of concepts of an information domain being common for a group of users. Establishing ontology into information retrieval is a normal method to develop searching effects of relevant information users require. Keywords matching process with historical or information domain is significant in recent calculations for assisting the best match for specific input queries. This research presents a better querying mechanism for information retrieval which integrates the ontology queries with keyword search. The ontology-based query is changed into a primary order to predicate logic uncertainty which is used for routing the query to the appropriate servers. Matching algorithms characterize warm area of researches in computer science and artificial intelligence. In text matching, it is more dependable to study semantics model and query for conditions of semantic matching. This research develops the semantic matching results between input queries and information in ontology field. The contributed algorithm is a hybrid method that is based on matching extracted instances from the queries and information field. The queries and information domain is focused on semantic matching, to discover the best match and to progress the executive process. In conclusion, the hybrid ontology in semantic web is sufficient to retrieve the documents when compared to standard ontology.
Hybrid Ontology for Semantic Information Retrieval Model Using Keyword Matching Indexing System
Uthayan, K. R.; Anandha Mala, G. S.
2015-01-01
Ontology is the process of growth and elucidation of concepts of an information domain being common for a group of users. Establishing ontology into information retrieval is a normal method to develop searching effects of relevant information users require. Keywords matching process with historical or information domain is significant in recent calculations for assisting the best match for specific input queries. This research presents a better querying mechanism for information retrieval which integrates the ontology queries with keyword search. The ontology-based query is changed into a primary order to predicate logic uncertainty which is used for routing the query to the appropriate servers. Matching algorithms characterize warm area of researches in computer science and artificial intelligence. In text matching, it is more dependable to study semantics model and query for conditions of semantic matching. This research develops the semantic matching results between input queries and information in ontology field. The contributed algorithm is a hybrid method that is based on matching extracted instances from the queries and information field. The queries and information domain is focused on semantic matching, to discover the best match and to progress the executive process. In conclusion, the hybrid ontology in semantic web is sufficient to retrieve the documents when compared to standard ontology. PMID:25922851
Multidimensional indexing structure for use with linear optimization queries
NASA Technical Reports Server (NTRS)
Bergman, Lawrence David (Inventor); Castelli, Vittorio (Inventor); Chang, Yuan-Chi (Inventor); Li, Chung-Sheng (Inventor); Smith, John Richard (Inventor)
2002-01-01
Linear optimization queries, which usually arise in various decision support and resource planning applications, are queries that retrieve top N data records (where N is an integer greater than zero) which satisfy a specific optimization criterion. The optimization criterion is to either maximize or minimize a linear equation. The coefficients of the linear equation are given at query time. Methods and apparatus are disclosed for constructing, maintaining and utilizing a multidimensional indexing structure of database records to improve the execution speed of linear optimization queries. Database records with numerical attributes are organized into a number of layers and each layer represents a geometric structure called convex hull. Such linear optimization queries are processed by searching from the outer-most layer of this multi-layer indexing structure inwards. At least one record per layer will satisfy the query criterion and the number of layers needed to be searched depends on the spatial distribution of records, the query-issued linear coefficients, and N, the number of records to be returned. When N is small compared to the total size of the database, answering the query typically requires searching only a small fraction of all relevant records, resulting in a tremendous speedup as compared to linearly scanning the entire dataset.
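The layered convex-hull idea in the abstract above can be sketched in two dimensions as follows. This is an illustrative simplification, not the patented structure: it assumes distinct 2-D records, builds the layers by repeatedly peeling the convex hull, and answers a top-N maximization query by examining only the outermost N layers (the k-th best record always has an equally good candidate within the first k layers). All function names are hypothetical, and the early-termination and maintenance logic described in the abstract is omitted.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices of distinct 2-D points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def build_layers(points):
    """Peel convex hulls repeatedly; layers[0] is the outermost layer."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers

def top_n_max(layers, coeffs, n):
    """Return n records maximizing coeffs[0]*x + coeffs[1]*y,
    looking only at records in the outermost n layers."""
    candidates = [p for layer in layers[:n] for p in layer]
    return sorted(candidates,
                  key=lambda p: coeffs[0] * p[0] + coeffs[1] * p[1],
                  reverse=True)[:n]

# Illustrative records with two numerical attributes (assumed data).
records = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5), (2, 3), (8, 2),
           (7, 8), (3, 7), (5, 1), (1, 5), (9, 5), (5, 9)]
layers = build_layers(records)
print(top_n_max(layers, coeffs=(1, 2), n=3))
```

A minimization query can reuse the same function by negating the coefficients. When N is small relative to the number of records, only the few outer layers are touched, which is the source of the speedup over a full scan.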
HU_DB at TREC 2014 Microblog Track
2014-11-01
available only via the API. Hence, we could not perform stemming on this corpus. Thus, queries using terms in a different grammatical form from that...of a tweet, could not return the tweet. A possible solution to this problem would have been expanding the query with modified grammatical forms of...looking for specific information do not seem to be suited for TTG. Some of the queries given for the task fell into the second category , which made
Child pornography in peer-to-peer networks.
Steel, Chad M S
2009-08-01
The presence of child pornography in peer-to-peer networks is not disputed, but there has been little effort done to quantify and analyze the distribution and nature of that content to-date. By performing an analysis of queries and query hits on the largest peer-to-peer network, we are able to both quantify and describe the nature of querying by child pornographers as well as the content they are sharing. Child pornography related content was identified and analyzed in 235,513 user queries and 194,444 query hits. The research confirmed a large amount of peer-to-peer traffic is dedicated to child pornography, but supply and demand must be separated for a better understanding. The most prevalent query and the top two most prevalent filenames returned as query hits were child pornography related. However, it would be inaccurate to state child pornography dominates peer-to-peer as 1% of all queries were related to child pornography and 1.45% of all query hits (unique filenames) were related to child pornography, consistent with a smaller study (Hughes et al., 2008). In addition to the above, research indicates that the median age searched for was 13 years old, and the majority of queries were gender-neutral, but of those with gender-related terms, 79% were female-oriented. Distribution-wise, the vast majority of content-specific searches are for movies at 99%, though images are still the most prevalent in availability. There is no shortage of child pornography supply and demand on peer-to-peer networks and by analyzing how consumers seek and distributors advertise content we can better understand their motivations. Understanding the behavior of child pornographers and how they search for content when contrasted with those sharing content provides a basis for finding and combating that behavior. For law enforcement, knowing the specific terms used allows more timely and accurate forensics and better identification of those seeking and distributing child pornography. 
For Internet researchers, better filtering and monitoring is possible. For mental health professionals, understanding the preferences and behaviors of those searching supports more effective treatment.
2011-01-01
Background When a specimen belongs to a species not yet represented in DNA barcode reference libraries there is disagreement over the effectiveness of using sequence comparisons to assign the query accurately to a higher taxon. Library completeness and the assignment criteria used have been proposed as critical factors affecting the accuracy of such assignments but have not been thoroughly investigated. We explored the accuracy of assignments to genus, tribe and subfamily in the Sphingidae, using the almost complete global DNA barcode reference library (1095 species) available for this family. Costa Rican sphingids (118 species), a well-documented, diverse subset of the family, with each of the tribes and subfamilies represented were used as queries. We simulated libraries with different levels of completeness (10-100% of the available species), and recorded assignments (positive or ambiguous) and their accuracy (true or false) under six criteria. Results A liberal tree-based criterion assigned 83% of queries accurately to genus, 74% to tribe and 90% to subfamily, compared to a strict tree-based criterion, which assigned 75% of queries accurately to genus, 66% to tribe and 84% to subfamily, with a library containing 100% of available species (but excluding the species of the query). The greater number of true positives delivered by more relaxed criteria was negatively balanced by the occurrence of more false positives. This effect was most sharply observed with libraries of the lowest completeness where, for example at the genus level, 32% of assignments were false positives with the liberal criterion versus < 1% when using the strict. We observed little difference (< 8% using the liberal criterion) however, in the overall accuracy of the assignments between the lowest and highest levels of library completeness at the tribe and subfamily level. 
Conclusions Our results suggest that when using a strict tree-based criterion for higher taxon assignment with DNA barcodes, the likelihood of assigning a query a genus name incorrectly is very low, if a genus name is provided it has a high likelihood of being accurate, and if no genus match is available the query can nevertheless be assigned to a subfamily with high accuracy regardless of library completeness. DNA barcoding often correctly assigned sphingid moths to higher taxa when species matches were unavailable, suggesting that barcode reference libraries can be useful for higher taxon assignments long before they achieve complete species coverage. PMID:21806794
Influenza-like illness surveillance on Twitter through automated learning of naïve language.
Gesualdo, Francesco; Stilo, Giovanni; Agricola, Eleonora; Gonfiantini, Michaela V; Pandolfi, Elisabetta; Velardi, Paola; Tozzi, Alberto E
2013-01-01
Twitter has the potential to be a timely and cost-effective source of data for syndromic surveillance. When speaking of an illness, Twitter users often report a combination of symptoms, rather than a suspected or final diagnosis, using naïve, everyday language. We developed a minimally trained algorithm that exploits the abundance of health-related web pages to identify all jargon expressions related to a specific technical term. We then translated an influenza case definition into a Boolean query, each symptom being described by a technical term and all related jargon expressions, as identified by the algorithm. Subsequently, we monitored all tweets that reported a combination of symptoms satisfying the case definition query. In order to geolocalize messages, we defined 3 localization strategies based on codes associated with each tweet. We found a high correlation coefficient between the trend of our influenza-positive tweets and ILI trends identified by US traditional surveillance systems.
Influenza-Like Illness Surveillance on Twitter through Automated Learning of Naïve Language
Gesualdo, Francesco; Stilo, Giovanni; Agricola, Eleonora; Gonfiantini, Michaela V.; Pandolfi, Elisabetta; Velardi, Paola; Tozzi, Alberto E.
2013-01-01
Twitter has the potential to be a timely and cost-effective source of data for syndromic surveillance. When speaking of an illness, Twitter users often report a combination of symptoms, rather than a suspected or final diagnosis, using naïve, everyday language. We developed a minimally trained algorithm that exploits the abundance of health-related web pages to identify all jargon expressions related to a specific technical term. We then translated an influenza case definition into a Boolean query, each symptom being described by a technical term and all related jargon expressions, as identified by the algorithm. Subsequently, we monitored all tweets that reported a combination of symptoms satisfying the case definition query. In order to geolocalize messages, we defined 3 localization strategies based on codes associated with each tweet. We found a high correlation coefficient between the trend of our influenza-positive tweets and ILI trends identified by US traditional surveillance systems. PMID:24324799
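To illustrate how a case definition can be turned into a Boolean query over jargon expressions, as the abstract above describes, here is a small sketch. Each symptom is an OR over its expressions, and the symptoms are combined with AND/OR. The case definition (fever AND (cough OR sore throat)) and the term lists are hypothetical placeholders; they are not the definition or the jargon learned in the study.

```python
import re

# Hypothetical symptom vocabulary: technical term -> everyday expressions.
SYMPTOM_TERMS = {
    "fever": {"fever", "high temperature", "burning up"},
    "cough": {"cough", "coughing", "hacking"},
    "sore throat": {"sore throat", "throat hurts", "scratchy throat"},
}

def mentions(text, terms):
    """True if any expression occurs in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in terms)

def satisfies_case_definition(text):
    """Assumed case definition: fever AND (cough OR sore throat)."""
    has = {symptom: mentions(text, terms) for symptom, terms in SYMPTOM_TERMS.items()}
    return has["fever"] and (has["cough"] or has["sore throat"])

tweets = [
    "Burning up and my throat hurts so bad",
    "Coughing all night, no sleep",
    "I have a fever but otherwise fine",
]
flagged = [t for t in tweets if satisfies_case_definition(t)]
print(flagged)
```

Only messages that report a qualifying combination of symptoms are flagged; a single symptom on its own does not satisfy the query.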
Deng, Michelle; Zollanvari, Amin; Alterovitz, Gil
2012-01-01
The immense corpus of biomedical literature existing today poses challenges in information search and integration. Many links between pieces of knowledge occur or are significant only under certain contexts, rather than under the entire corpus. This study proposes using networks of ontology concepts, linked based on their co-occurrences in annotations of abstracts of biomedical literature and descriptions of experiments, to draw conclusions based on context-specific queries and to better integrate existing knowledge. In particular, a Bayesian network framework is constructed to allow for the linking of related terms from two biomedical ontologies under the queried context concept. Edges in such a Bayesian network allow associations between biomedical concepts to be quantified and inference to be made about the existence of some concepts given prior information about others. This approach could potentially be a powerful inferential tool for context-specific queries, applicable to ontologies in other fields as well.
Deng, Michelle; Zollanvari, Amin; Alterovitz, Gil
2012-01-01
The immense corpus of biomedical literature existing today poses challenges in information search and integration. Many links between pieces of knowledge occur or are significant only under certain contexts—rather than under the entire corpus. This study proposes using networks of ontology concepts, linked based on their co-occurrences in annotations of abstracts of biomedical literature and descriptions of experiments, to draw conclusions based on context-specific queries and to better integrate existing knowledge. In particular, a Bayesian network framework is constructed to allow for the linking of related terms from two biomedical ontologies under the queried context concept. Edges in such a Bayesian network allow associations between biomedical concepts to be quantified and inference to be made about the existence of some concepts given prior information about others. This approach could potentially be a powerful inferential tool for context-specific queries, applicable to ontologies in other fields as well. PMID:22779044
Time series patterns and language support in DBMS
NASA Astrophysics Data System (ADS)
Telnarova, Zdenka
2017-07-01
This contribution focuses on the Time Series pattern type as a semantically rich representation of data. An example of implementing this pattern type in traditional database management systems is briefly presented. There are many approaches to manipulating and querying patterns. The crucial issues are a systematic approach to pattern management and a specific pattern query language that takes the semantics of patterns into account. The query language SQL-TS for manipulating patterns is demonstrated on Time Series data.
Ahmadi, Sepideh; Rabiee, Navid; Rabiee, Mohammad
2018-06-06
Aptamers have several advantages that make them prominent candidates for diagnosing and treating diseases, including the prevention and treatment of diabetes. In this opinion-based mini review article, we investigate DNA- and RNA-based hybrid molecules, specifically aptamers, and conclude that they are a promising prospect for the early diagnosis and treatment of diabetes.
White, Ryen W; Horvitz, Eric
2017-03-01
A statistical model that predicts the appearance of strong evidence of a lung carcinoma diagnosis via analysis of large-scale anonymized logs of web search queries from millions of people across the United States. To evaluate the feasibility of screening patients at risk of lung carcinoma via analysis of signals from online search activity. We identified people who issue special queries that provide strong evidence of a recent diagnosis of lung carcinoma. We then considered patterns of symptoms expressed as searches about concerning symptoms over several months prior to the appearance of the landmark web queries. We built statistical classifiers that predict the future appearance of landmark queries based on the search log signals. This was a retrospective log analysis of the online activity of millions of web searchers seeking health-related information online. Of web searchers who queried for symptoms related to lung carcinoma, some (n = 5443 of 4 813 985) later issued queries that provide strong evidence of recent clinical diagnosis of lung carcinoma and are regarded as positive cases in our analysis. Additional evidence on the reliability of these queries as representing clinical diagnoses is based on the significant increase in follow-on searches for treatments and medications for these searchers and on the correlation between lung carcinoma incidence rates and our log-based statistics. The remaining symptom searchers (n = 4 808 542) are regarded as negative cases. Performance of the statistical model for early detection from online search behavior, for different lead times, different sets of signals, and different cohorts of searchers stratified by potential risk. 
The statistical classifier predicting the future appearance of landmark web queries based on search log signals identified searchers who later input queries consistent with a lung carcinoma diagnosis, with a true-positive rate ranging from 3% to 57% for false-positive rates ranging from 0.00001 to 0.001, respectively. The methods can be used to identify people at highest risk up to a year in advance of the inferred diagnosis time. The 5 factors associated with the highest relative risk (RR) were evidence of family history (RR = 7.548; 95% CI, 3.937-14.470), age (RR = 3.558; 95% CI, 3.357-3.772), radon (RR = 2.529; 95% CI, 1.137-5.624), primary location (RR = 2.463; 95% CI, 1.364-4.446), and occupation (RR = 1.969; 95% CI, 1.143-3.391). Evidence of smoking (RR = 1.646; 95% CI, 1.032-2.260) was important but not top-ranked, which was due to the difficulty of identifying smoking history from search terms. Pattern recognition based on data drawn from large-scale web search queries holds opportunity for identifying risk factors and frames new directions with early detection of lung carcinoma.
SP2Bench: A SPARQL Performance Benchmark
NASA Astrophysics Data System (ADS)
Schmidt, Michael; Hornung, Thomas; Meier, Michael; Pinkel, Christoph; Lausen, Georg
A meaningful analysis and comparison of both existing storage schemes for RDF data and evaluation approaches for SPARQL queries necessitates a comprehensive and universal benchmark platform. We present SP2Bench, a publicly available, language-specific performance benchmark for the SPARQL query language. SP2Bench is settled in the DBLP scenario and comprises a data generator for creating arbitrarily large DBLP-like documents and a set of carefully designed benchmark queries. The generated documents mirror vital key characteristics and social-world distributions encountered in the original DBLP data set, while the queries implement meaningful requests on top of this data, covering a variety of SPARQL operator constellations and RDF access patterns. In this chapter, we discuss requirements and desiderata for SPARQL benchmarks and present the SP2Bench framework, including its data generator, benchmark queries and performance metrics.
Object-Oriented Query Language For Events Detection From Images Sequences
NASA Astrophysics Data System (ADS)
Ganea, Ion Eugen
2015-09-01
This paper presents a method for representing events extracted from image sequences and the query language used for event detection. Using an object-oriented model, the spatial and temporal relationships between salient objects, and also between events, are stored and queried. This work aims to unify the storing and querying phases of video event processing. The object-oriented language syntax used for event processing allows the instantiation of index classes in order to improve the accuracy of the query results. The experiments were performed on image sequences from the sports domain and show the reliability and robustness of the proposed language. To extend the language, a specific syntax will be added for constructing templates for abnormal events and for detecting incidents, which is the final goal of the research.
Improved data retrieval from TreeBASE via taxonomic and linguistic data enrichment
Anwar, Nadia; Hunt, Ela
2009-01-01
Background TreeBASE, the only data repository for phylogenetic studies, is not being used effectively since it does not meet the taxonomic data retrieval requirements of the systematics community. We show, through an examination of the queries performed on TreeBASE, that data retrieval using taxon names is unsatisfactory. Results We report on a new wrapper supporting taxon queries on TreeBASE by utilising a Taxonomy and Classification Database (TCl-Db) we created. TCl-Db holds merged and consolidated taxonomic names from multiple data sources and can be used to translate hierarchical, vernacular and synonym queries into specific query terms in TreeBASE. The query expansion supported by TCl-Db shows very significant information retrieval quality improvement. The wrapper can be accessed at the URL The methodology we developed is scalable and can be applied to new data, as those become available in the future. Conclusion Significantly improved data retrieval quality is shown for all queries, and additional flexibility is achieved via user-driven taxonomy selection. PMID:19426482
Bat-Inspired Algorithm Based Query Expansion for Medical Web Information Retrieval.
Khennak, Ilyes; Drias, Habiba
2017-02-01
With the increasing amount of medical data available on the Web, looking for health information has become one of the most widely searched topics on the Internet. Patients and people of several backgrounds are now using Web search engines to acquire medical information, including information about a specific disease, medical treatment or professional advice. Nonetheless, due to a lack of medical knowledge, many laypeople have difficulties in forming appropriate queries to articulate their inquiries, which makes their search queries imprecise due to the use of unclear keywords. The use of these ambiguous and vague queries to describe the patients' needs has resulted in a failure of Web search engines to retrieve accurate and relevant information. One of the most natural and promising methods to overcome this drawback is Query Expansion. In this paper, an original approach based on Bat Algorithm is proposed to improve the retrieval effectiveness of query expansion in the medical field. In contrast to the existing literature, the proposed approach uses Bat Algorithm to find the best expanded query among a set of expanded query candidates, while maintaining low computational complexity. Moreover, this new approach allows the determination of the length of the expanded query empirically. Numerical results on MEDLINE, the on-line medical information database, show that the proposed approach is more effective and efficient compared to the baseline.
Query construction, entropy, and generalization in neural-network models
NASA Astrophysics Data System (ADS)
Sollich, Peter
1994-05-01
We study query construction algorithms, which aim at improving the generalization ability of systems that learn from examples by choosing optimal, nonredundant training sets. We set up a general probabilistic framework for deriving such algorithms from the requirement of optimizing a suitable objective function; specifically, we consider the objective functions entropy (or information gain) and generalization error. For two learning scenarios, the high-low game and the linear perceptron, we evaluate the generalization performance obtained by applying the corresponding query construction algorithms and compare it to training on random examples. We find qualitative differences between the two scenarios due to the different structure of the underlying rules (nonlinear and ``noninvertible'' versus linear); in particular, for the linear perceptron, random examples lead to the same generalization ability as a sequence of queries in the limit of an infinite number of examples. We also investigate learning algorithms which are ill matched to the learning environment and find that, in this case, minimum entropy queries can in fact yield a lower generalization ability than random examples. Finally, we study the efficiency of single queries and its dependence on the learning history, i.e., on whether the previous training examples were generated randomly or by querying, and the difference between globally and locally optimal query construction.
A similarity-based data warehousing environment for medical images.
Teixeira, Jefferson William; Annibal, Luana Peixoto; Felipe, Joaquim Cezar; Ciferri, Ricardo Rodrigues; Ciferri, Cristina Dutra de Aguiar
2015-11-01
A core issue of the decision-making process in the medical field is to support the execution of analytical (OLAP) similarity queries over images in data warehousing environments. In this paper, we focus on this issue. We propose imageDWE, a non-conventional data warehousing environment that enables the storage of intrinsic features taken from medical images in a data warehouse and supports OLAP similarity queries over them. To comply with this goal, we introduce the concept of perceptual layer, which is an abstraction used to represent an image dataset according to a given feature descriptor in order to enable similarity search. Based on this concept, we propose the imageDW, an extended data warehouse with dimension tables specifically designed to support one or more perceptual layers. We also detail how to build an imageDW and how to load image data into it. Furthermore, we show how to process OLAP similarity queries composed of a conventional predicate and a similarity search predicate that encompasses the specification of one or more perceptual layers. Moreover, we introduce an index technique to improve the OLAP query processing over images. We carried out performance tests over a data warehouse environment that consolidated medical images from exams of several modalities. The results demonstrated the feasibility and efficiency of our proposed imageDWE to manage images and to process OLAP similarity queries. The results also demonstrated that the use of the proposed index technique guaranteed a great improvement in query processing.
Hanauer, David A; Wu, Danny T Y; Yang, Lei; Mei, Qiaozhu; Murkowski-Steffy, Katherine B; Vydiswaran, V G Vinod; Zheng, Kai
2017-03-01
The utility of biomedical information retrieval environments can be severely limited when users lack expertise in constructing effective search queries. To address this issue, we developed a computer-based query recommendation algorithm that suggests semantically interchangeable terms based on an initial user-entered query. In this study, we assessed the value of this approach, which has broad applicability in biomedical information retrieval, by demonstrating its application as part of a search engine that facilitates retrieval of information from electronic health records (EHRs). The query recommendation algorithm utilizes MetaMap to identify medical concepts from search queries and indexed EHR documents. Synonym variants from UMLS are used to expand the concepts along with a synonym set curated from historical EHR search logs. The empirical study involved 33 clinicians and staff who evaluated the system through a set of simulated EHR search tasks. User acceptance was assessed using the widely used technology acceptance model. The search engine's performance was rated consistently higher with the query recommendation feature turned on vs. off. The relevance of computer-recommended search terms was also rated high, and in most cases the participants had not thought of these terms on their own. The questions on perceived usefulness and perceived ease of use received overwhelmingly positive responses. A vast majority of the participants wanted the query recommendation feature to be available to assist in their day-to-day EHR search tasks. Challenges persist for users to construct effective search queries when retrieving information from biomedical documents including those from EHRs. This study demonstrates that semantically-based query recommendation is a viable solution to addressing this challenge.
Ayers, John W; Althouse, Benjamin M; Ribisl, Kurt M; Emery, Sherry
2014-05-01
The Internet is revolutionizing tobacco control, but few have harnessed the Web for surveillance. We demonstrate for the first time an approach for analyzing aggregate Internet search queries that captures precise changes in population considerations about tobacco. We compared tobacco-related Google queries originating in the United States during the week of the State Children's Health Insurance Program (SCHIP) 2009 cigarette excise tax increase with a historic baseline. Specific queries were then ranked according to their relative increases while also considering approximations of changes in absolute search volume. Individual queries with the largest relative increases the week of the SCHIP tax were "cigarettes Indian reservations" 640% (95% CI, 472-918), "free cigarettes online" 557% (95% CI, 432-756), and "Indian reservations cigarettes" 542% (95% CI, 414-733), amounting to about 7,500 excess searches. By themes, the largest relative increases were tribal cigarettes 246% (95% CI, 228-265), "free" cigarettes 215% (95% CI, 191-242), and cigarette stores 176% (95% CI, 160-193), accounting for 21,000, 27,000, and 90,000 excess queries. All avoidance queries, including those aforementioned themes, relatively increased 150% (95% CI, 144-155) or 550,000 from their baseline. All cessation queries increased 46% (95% CI, 44-48), or 175,000, around SCHIP; including themes for "cold turkey" 19% (95% CI, 11-27) or 2,600, cessation products 47% (95% CI, 44-50) or 78,000, and dubious cessation approaches (e.g., hypnosis) 40% (95% CI, 33-47) or 2,300. The SCHIP tax motivated specific changes in population considerations. Our strategy can support evaluations that temporally link tobacco control measures with instantaneous population reactions, as well as serve as a springboard for traditional studies, for example, including survey questionnaire design.
Clone DB: an integrated NCBI resource for clone-associated data
Schneider, Valerie A.; Chen, Hsiu-Chuan; Clausen, Cliff; Meric, Peter A.; Zhou, Zhigang; Bouk, Nathan; Husain, Nora; Maglott, Donna R.; Church, Deanna M.
2013-01-01
The National Center for Biotechnology Information (NCBI) Clone DB (http://www.ncbi.nlm.nih.gov/clone/) is an integrated resource providing information about and facilitating access to clones, which serve as valuable research reagents in many fields, including genome sequencing and variation analysis. Clone DB represents an expansion and replacement of the former NCBI Clone Registry and has records for genomic and cell-based libraries and clones representing more than 100 different eukaryotic taxa. Records provide details of library construction, associated sequences, map positions and information about resource distribution. Clone DB is indexed in the NCBI Entrez system and can be queried by fields that include organism, clone name, gene name and sequence identifier. Whenever possible, genomic clones are mapped to reference assemblies and their map positions provided in clone records. Clones mapping to specific genomic regions can also be searched for using the NCBI Clone Finder tool, which accepts queries based on sequence coordinates or features such as gene or transcript names. Clone DB makes reports of library, clone and placement data on its FTP site available for download. With Clone DB, users now have available to them a centralized resource that provides them with the tools they will need to make use of these important research reagents. PMID:23193260
Computer systems and methods for the query and visualization of multidimensional databases
Stolte, Chris; Tang, Diane L.; Hanrahan, Patrick
2017-04-25
A method of generating a data visualization is performed at a computer having a display, one or more processors, and memory. The memory stores one or more programs for execution by the one or more processors. The process receives user specification of a plurality of characteristics of a data visualization. The data visualization is based on data from a multidimensional database. The characteristics specify at least x-position and y-position of data marks corresponding to tuples of data retrieved from the database. The process generates a data visualization according to the specified plurality of characteristics. The data visualization has an x-axis defined based on data for one or more first fields from the database that specify x-position of the data marks and the data visualization has a y-axis defined based on data for one or more second fields from the database that specify y-position of the data marks.
Efficient privacy-preserving string search and an application in genomics.
Shimizu, Kana; Nuida, Koji; Rätsch, Gunnar
2016-06-01
Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private and the server the database. We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within ≈ 4.6 s and ≈ 10.8 s for client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm. Availability: https://github.com/iskana/PBWT-sec and https://github.com/ratschlab/PBWT-sec. Contact: shimizu-kana@aist.go.jp or Gunnar.Ratsch@ratschlab.org. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Efficient privacy-preserving string search and an application in genomics
Shimizu, Kana; Nuida, Koji; Rätsch, Gunnar
2016-01-01
Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private and the server the database. Approach: We propose a novel approach that combines efficient string data structures such as the Burrows–Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows–Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within ≈ 4.6 s and ≈ 10.8 s for client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm. Availability and implementation: https://github.com/iskana/PBWT-sec and https://github.com/ratschlab/PBWT-sec. Contacts: shimizu-kana@aist.go.jp or Gunnar.Ratsch@ratschlab.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27153731
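The abstract above builds on iterative queries over a Burrows–Wheeler-transformed index. As a rough illustration of that building block only, the following Python sketch computes a plain (non-positional) Burrows–Wheeler transform and counts pattern occurrences by backward search. It is not the authors' PBWT-sec algorithm: it contains no encryption or oblivious transfer, the function names `bwt` and `count_occurrences` are hypothetical, and it assumes the text does not contain the sentinel `$` and uses only characters that sort after it.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for short texts)."""
    s = text + "$"  # assumed sentinel: absent from text, sorts before all text chars
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text by backward search."""
    # first[c] = number of characters in the text that sort strictly before c
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        # occurrences of ch in bwt_str[:i]; a real index would precompute this
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)  # current interval of matching sorted rotations
    for ch in reversed(pattern):  # one iterative query step per pattern character
        if ch not in first:
            return 0
        lo = first[ch] + rank(ch, lo)
        hi = first[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(index, count_occurrences(index, "ana"))
```

Each loop iteration narrows an interval using only two rank lookups, which is the kind of step-by-step query that the abstract describes concealing from the server.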
Aftermath of Bustamante attack on genomic beacon service.
Aziz, Md Momin Al; Ghasemi, Reza; Waliullah, Md; Mohammed, Noman
2017-07-26
With the enormous need for a federated ecosystem for holding global genomic and clinical data, the Global Alliance for Genomics and Health (GA4GH) has created an international website called the beacon service, which allows a researcher to find out beforehand whether a specific dataset can be used for his or her research. This simple web service is quite useful, as it allows queries such as whether a certain position of a target chromosome has a specific nucleotide. However, the increased integration of individuals' genomic data into clinical practice and research has raised serious privacy concerns. Though the answers to such queries are only yes or no in the Beacon network, they have serious privacy implications, as demonstrated in recent work by Shringarpure and Bustamante. In their attack model, the authors demonstrated that, with a limited number of queries, the presence of an individual in any dataset can be determined. We propose two lightweight algorithms (based on randomized response) which capture the efficacy while preserving the privacy of the participants in a genomic beacon service. We also elaborate the strengths and weaknesses of the attack by explaining some of its statistical and mathematical models using a real-world genomic database. We extend their experimental simulations to different adversarial assumptions and parameters. We experimentally evaluated the solutions on the original attack model with different parameters for a better understanding of the privacy and utility tradeoffs provided by these two methods. Also, the statistical analysis further elaborates the different aspects of the prior attack, which leads to better risk management for the participants in a beacon service. The differentially private and lightweight solutions discussed here will make the attack much more difficult to succeed while maintaining the fundamental motivation of the beacon database network.
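The abstract above names randomized response as the basis of its two algorithms without giving their details. The Python sketch below shows only the textbook form of randomized response applied to a yes/no beacon answer; it is not either of the authors' algorithms. The function name `randomized_response` and the parameter `p_truth` are hypothetical, chosen for illustration.

```python
import random


def randomized_response(true_answer: bool, p_truth: float, rng: random.Random) -> bool:
    """Return the true yes/no answer with probability p_truth,
    otherwise return a fair coin flip that ignores the true answer."""
    if not 0.0 <= p_truth <= 1.0:
        raise ValueError("p_truth must lie in [0, 1]")
    if rng.random() < p_truth:
        return true_answer
    return rng.random() < 0.5


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demonstration is repeatable
    answers = [randomized_response(True, 0.8, rng) for _ in range(10_000)]
    # Expected fraction of "yes" for a true "yes": 0.8 + 0.2 * 0.5 = 0.9
    print(sum(answers) / len(answers))
```

Under this scheme a single "yes" no longer proves that the queried allele is present, which is the intuition behind the privacy and utility tradeoff the abstract evaluates: a lower `p_truth` hides more but makes the beacon less useful.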
Toward An Unstructured Mesh Database
NASA Astrophysics Data System (ADS)
Rezaei Mahdiraji, Alireza; Baumann, Peter
2014-05-01
Unstructured meshes are used in several application domains such as earth sciences (e.g., seismology), medicine, oceanography, climate modeling, and GIS as approximate representations of physical objects. Meshes subdivide a domain into smaller geometric elements (called cells) which are glued together by incidence relationships. The subdivision of a domain allows computational manipulation of complicated physical structures. For instance, seismologists model earthquakes using elastic wave propagation solvers on hexahedral meshes. Such a hexahedral mesh contains several hundred million grid points and millions of hexahedral cells. Each vertex node in the mesh stores a multitude of data fields. To run simulations on such meshes, one needs to iterate over all the cells, iterate over incident cells to a given cell, retrieve coordinates of cells, assign data values to cells, etc. Although meshes are used in many application domains, to the best of our knowledge there is no database vendor that supports unstructured mesh features. Currently, the main tools for querying and manipulating unstructured meshes are mesh libraries, e.g., CGAL and GrAL. Mesh libraries are dedicated libraries which include mesh algorithms and can be run on mesh representations. The libraries do not scale with dataset size, do not have a declarative query language, and need deep C++ knowledge for query implementations. Furthermore, due to high coupling between the implementations and input file structure, the implementations are less reusable and costly to maintain. A dedicated mesh database offers the following advantages: 1) declarative querying, 2) ease of maintenance, 3) hiding mesh storage structure from applications, and 4) transparent query optimization. To design a mesh database, the first challenge is to define a suitable generic data model for unstructured meshes.
We proposed the ImG-Complexes data model as a generic topological mesh data model which extends the incidence graph model to multi-incidence relationships. We instrument the ImG model with sets of optional and application-specific constraints which can be used to check the validity of meshes for a specific class of objects such as manifold, pseudo-manifold, and simplicial manifold. We conducted experiments to measure the performance of the graph database solution in processing mesh queries and compare it with the GrAL mesh library and the PostgreSQL database on synthetic and real mesh datasets. The experiments show that each system performs well on specific types of mesh queries, e.g., graph databases perform well on global path-intensive queries. In the future, we will investigate database operations for the ImG model and design a mesh query language.
An intelligent user interface for browsing satellite data catalogs
NASA Technical Reports Server (NTRS)
Cromp, Robert F.; Crook, Sharon
1989-01-01
A large scale domain-independent spatial data management expert system that serves as a front-end to databases containing spatial data is described. This system is unique for two reasons. First, it uses spatial search techniques to generate a list of all the primary keys that fall within a user's spatial constraints prior to invoking the database management system, thus substantially decreasing the amount of time required to answer a user's query. Second, a domain-independent query expert system uses a domain-specific rule base to preprocess the user's English query, effectively mapping a broad class of queries into a smaller subset that can be handled by a commercial natural language processing system. The methods used by the spatial search module and the query expert system are explained, and the system architecture for the spatial data management expert system is described. The system is applied to data from the International Ultraviolet Explorer (IUE) satellite, and results are given.
ESTminer: a Web interface for mining EST contig and cluster databases.
Huang, Yecheng; Pumphrey, Janie; Gingle, Alan R
2005-03-01
ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame that allows the selection of contigs/clusters with specific cDNA library makeup or a threshold number of members. The results are displayed as color-coded tree nodes, where the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user configurable links. Also, the interface allows 'queries within queries' where the result set of a query is further filtered by the subsequent query. ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp. Contact: agingle@uga.edu.
Horvath, Dragos; Marcou, Gilles; Varnek, Alexandre
2013-07-22
This study is an exhaustive analysis of the neighborhood behavior over a large coherent data set (ChEMBL target/ligand pairs of known Ki, for 165 targets with >50 associated ligands each). It focuses on similarity-based virtual screening (SVS) success defined by the ascertained optimality index. This is a weighted compromise between purity and retrieval rate of active hits in the neighborhood of an active query. One key issue addressed here is the impact of Tversky asymmetric weighing of query vs candidate features (represented as integer-value ISIDA colored fragment/pharmacophore triplet count descriptor vectors). The nearly 3/4 million independent SVS runs showed that Tversky scores with a strong bias in favor of query-specific features are, by far, the most successful and the least failure-prone out of a set of nine other dissimilarity scores. These include classical Tanimoto, which failed to defend its privileged status in practical SVS applications. Tversky performance is not significantly conditioned by tuning of its bias parameter α. Both initial "guesses" of α = 0.9 and 0.7 were more successful than Tanimoto (which was, in turn, better than Euclid). Tversky was eventually tested in exhaustive similarity searching within the library of 1.6 M commercial + bioactive molecules at http://infochim.u-strasbg.fr/webserv/VSEngine.html , comparing favorably to Tanimoto in terms of "scaffold hopping" propensity. Therefore, it should be used at least as often as, perhaps in parallel to, Tanimoto in SVS. Analysis with respect to query subclasses highlighted relationships of query complexity (simply expressed in terms of pharmacophore pattern counts) and/or target nature vs SVS success likelihood. SVS runs using more complex queries are more robust with respect to the choice of their operational premises (descriptors, metric). Yet, they are best handled by "pro-query" Tversky scores at α > 0.5. 
Among simpler queries, one may distinguish between "growable" (allowing for active analogs with additional features), and a few "conservative" queries not allowing any growth. These (typically bioactive amine transporter ligands) form the specific application domain of "pro-candidate" biased Tversky scores at α < 0.5.
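To make the asymmetric weighting discussed above concrete, the following Python sketch computes a Tversky score on integer count vectors and recovers Tanimoto as the symmetric special case. The exact formula used in the study is not given in the abstract, so this is one common count-vector generalization taken as an assumption. The function names, the explicit `beta` parameter, and the pairing `beta = 1 - alpha` in the example are likewise assumptions, and the vectors are invented toy data rather than ISIDA descriptors.

```python
from typing import Sequence


def tversky(query: Sequence[int], candidate: Sequence[int],
            alpha: float, beta: float) -> float:
    """Tversky similarity on non-negative count vectors.

    alpha weights counts present only in the query; beta weights counts
    present only in the candidate.
    """
    if len(query) != len(candidate):
        raise ValueError("vectors must have the same length")
    common = sum(min(q, c) for q, c in zip(query, candidate))
    query_only = sum(max(q - c, 0) for q, c in zip(query, candidate))
    candidate_only = sum(max(c - q, 0) for q, c in zip(query, candidate))
    denominator = common + alpha * query_only + beta * candidate_only
    return common / denominator if denominator else 0.0


def tanimoto(query: Sequence[int], candidate: Sequence[int]) -> float:
    """Tanimoto is the symmetric case alpha = beta = 1."""
    return tversky(query, candidate, 1.0, 1.0)


if __name__ == "__main__":
    query = [2, 1, 0, 3]      # toy descriptor counts, not real data
    candidate = [1, 1, 2, 3]
    print(tanimoto(query, candidate))
    # "Pro-query" setting: query-only features are penalized heavily,
    # while extra candidate features are penalized lightly.
    print(tversky(query, candidate, alpha=0.9, beta=0.1))
```

With a high `alpha`, a candidate that carries all of the query's features plus additional ones still scores well, which matches the "growable" query behavior described in the abstract.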
Balaur, Irina; Saqi, Mansoor; Barat, Ana; Lysenko, Artem; Mazein, Alexander; Rawlings, Christopher J; Ruskin, Heather J; Auffray, Charles
2017-10-01
The development of colorectal cancer (CRC)-the third most common cancer type-has been associated with deregulations of cellular mechanisms stimulated by both genetic and epigenetic events. StatEpigen is a manually curated and annotated database, containing information on interdependencies between genetic and epigenetic signals, and specialized currently for CRC research. Although StatEpigen provides a well-developed graphical user interface for information retrieval, advanced queries involving associations between multiple concepts can benefit from more detailed graph representation of the integrated data. This can be achieved by using a graph database (NoSQL) approach. Data were extracted from StatEpigen and imported to our newly developed EpiGeNet, a graph database for storage and querying of conditional relationships between molecular (genetic and epigenetic) events observed at different stages of colorectal oncogenesis. We illustrate the enhanced capability of EpiGeNet for exploration of different queries related to colorectal tumor progression; specifically, we demonstrate the query process for (i) stage-specific molecular events, (ii) most frequently observed genetic and epigenetic interdependencies in colon adenoma, and (iii) paths connecting key genes reported in CRC and associated events. The EpiGeNet framework offers improved capability for management and visualization of data on molecular events specific to CRC initiation and progression.
Analyzing Medical Image Search Behavior: Semantics and Prediction of Query Results.
De-Arteaga, Maria; Eggel, Ivan; Kahn, Charles E; Müller, Henning
2015-10-01
Log files of information retrieval systems that record user behavior have been used to improve the outcomes of retrieval systems, understand user behavior, and predict events. In this article, a log file of the ARRS GoldMiner search engine containing 222,005 consecutive queries is analyzed. Time stamps are available for each query, as well as masked IP addresses, which makes it possible to identify queries from the same person. This article describes the ways in which physicians (or Internet searchers interested in medical images) search and proposes potential improvements by suggesting query modifications. For example, many queries contain only a few terms and therefore are not specific; others contain spelling mistakes or non-medical terms that likely lead to poor or empty results. One of the goals of this report is to predict the number of results a query will have, since such a model allows search engines to automatically propose query modifications in order to avoid result lists that are empty or too large. This prediction is made based on characteristics of the query terms themselves. Prediction of empty results has an accuracy above 88%, and thus can be used to automatically modify the query to avoid empty result sets for a user. The semantic analysis and data of reformulations done by users in the past can aid the development of better search systems, particularly to improve results for novice users. Therefore, this paper gives important ideas to better understand how people search and how to use this knowledge to improve the performance of specialized medical search engines.
Hierarchical classification method and its application in shape representation
NASA Astrophysics Data System (ADS)
Ireton, M. A.; Oakley, John P.; Xydeas, Costas S.
1992-04-01
In this paper we describe a technique for performing shape-based content retrieval of images from a large database. In order to be able to formulate such user-generated queries about visual objects, we have developed a hierarchical classification technique. This hierarchical classification technique enables similarity matching between objects, with the position in the hierarchy signifying the level of generality to be used in the query. The classification technique is unsupervised, robust, and general; it can be applied to any suitable parameter set. To establish the potential of this classifier for aiding visual querying, we have applied it to the classification of the 2-D outlines of leaves.
TDR Targets: a chemogenomics resource for neglected diseases.
Magariños, María P; Carmona, Santiago J; Crowther, Gregory J; Ralph, Stuart A; Roos, David S; Shanmugam, Dhanasekaran; Van Voorhis, Wesley C; Agüero, Fernán
2012-01-01
The TDR Targets Database (http://tdrtargets.org) has been designed and developed as an online resource to facilitate the rapid identification and prioritization of molecular targets for drug development, focusing on pathogens responsible for neglected human diseases. The database integrates pathogen specific genomic information with functional data (e.g. expression, phylogeny, essentiality) for genes collected from various sources, including literature curation. This information can be browsed and queried using an extensive web interface with functionalities for combining, saving, exporting and sharing the query results. Target genes can be ranked and prioritized using numerical weights assigned to the criteria used for querying. In this report we describe recent updates to the TDR Targets database, including the addition of new genomes (specifically helminths), and integration of chemical structure, property and bioactivity information for biological ligands, drugs and inhibitors and cheminformatic tools for querying and visualizing these chemical data. These changes greatly facilitate exploration of linkages (both known and predicted) between genes and small molecules, yielding insight into whether particular proteins may be druggable, effectively allowing the navigation of chemical space in a genomics context.
TDR Targets: a chemogenomics resource for neglected diseases
Magariños, María P.; Carmona, Santiago J.; Crowther, Gregory J.; Ralph, Stuart A.; Roos, David S.; Shanmugam, Dhanasekaran; Van Voorhis, Wesley C.; Agüero, Fernán
2012-01-01
The TDR Targets Database (http://tdrtargets.org) has been designed and developed as an online resource to facilitate the rapid identification and prioritization of molecular targets for drug development, focusing on pathogens responsible for neglected human diseases. The database integrates pathogen specific genomic information with functional data (e.g. expression, phylogeny, essentiality) for genes collected from various sources, including literature curation. This information can be browsed and queried using an extensive web interface with functionalities for combining, saving, exporting and sharing the query results. Target genes can be ranked and prioritized using numerical weights assigned to the criteria used for querying. In this report we describe recent updates to the TDR Targets database, including the addition of new genomes (specifically helminths), and integration of chemical structure, property and bioactivity information for biological ligands, drugs and inhibitors and cheminformatic tools for querying and visualizing these chemical data. These changes greatly facilitate exploration of linkages (both known and predicted) between genes and small molecules, yielding insight into whether particular proteins may be druggable, effectively allowing the navigation of chemical space in a genomics context. PMID:22116064
Jadhav, Ashutosh; Andrews, Donna; Fiksdal, Alexander; Kumbamu, Ashok; McCormick, Jennifer B; Misitano, Andrew; Nelsen, Laurie; Ryu, Euijung; Sheth, Amit; Wu, Stephen
2014-01-01
Background: The number of people using the Internet and mobile/smart devices for health information seeking is increasing rapidly. Although the user experience for online health information seeking varies with the device used, for example, smart devices (SDs) like smartphones/tablets versus personal computers (PCs) like desktops/laptops, very few studies have investigated how online health information seeking behavior (OHISB) may differ by device. Objective: The objective of this study is to examine differences in OHISB between PCs and SDs through a comparative analysis of large-scale health search queries submitted through Web search engines from both types of devices. Methods: Using the Web analytics tool, IBM NetInsight OnDemand, and based on the type of devices used (PCs or SDs), we obtained the most frequent health search queries between June 2011 and May 2013 that were submitted on Web search engines and directed users to the Mayo Clinic’s consumer health information website. We performed analyses on “Queries with considering repetition counts (QwR)” and “Queries without considering repetition counts (QwoR)”. The dataset contains (1) 2.74 million and 3.94 million QwoR, respectively for PCs and SDs, and (2) more than 100 million QwR for both PCs and SDs. We analyzed structural properties of the queries (length of the search queries, usage of query operators and special characters in health queries), types of search queries (keyword-based, wh-questions, yes/no questions), categorization of the queries based on health categories and information mentioned in the queries (gender, age-groups, temporal references), misspellings in the health queries, and the linguistic structure of the health queries. Results: Query strings used for health information searching via PCs and SDs differ by almost 50%. The most searched health categories are “Symptoms” (1 in 3 search queries), “Causes”, and “Treatments & Drugs”. 
The distribution of search queries for different health categories differs with the device used for the search. Health queries tend to be longer and more specific than general search queries. Health queries from SDs are longer and have slightly fewer spelling mistakes than those from PCs. Users specify words related to women and children more often than words related to men or any other age group. Most of the health queries are formulated using keywords; the second-most common are wh- and yes/no questions. Users ask more health questions using SDs than PCs. Almost all health queries have at least one noun and health queries from SDs are more descriptive than those from PCs. Conclusions: This study is a large-scale comparative analysis of health search queries to understand the effects of device type (PCs vs SDs) used on OHISB. The study indicates that the device used for online health information search plays an important role in shaping how health information searches by consumers and patients are executed. PMID:25000537
Jadhav, Ashutosh; Andrews, Donna; Fiksdal, Alexander; Kumbamu, Ashok; McCormick, Jennifer B; Misitano, Andrew; Nelsen, Laurie; Ryu, Euijung; Sheth, Amit; Wu, Stephen; Pathak, Jyotishman
2014-07-04
The number of people using the Internet and mobile/smart devices for health information seeking is increasing rapidly. Although the user experience for online health information seeking varies with the device used, for example, smart devices (SDs) like smartphones/tablets versus personal computers (PCs) like desktops/laptops, very few studies have investigated how online health information seeking behavior (OHISB) may differ by device. The objective of this study is to examine differences in OHISB between PCs and SDs through a comparative analysis of large-scale health search queries submitted through Web search engines from both types of devices. Using the Web analytics tool, IBM NetInsight OnDemand, and based on the type of devices used (PCs or SDs), we obtained the most frequent health search queries between June 2011 and May 2013 that were submitted on Web search engines and directed users to the Mayo Clinic's consumer health information website. We performed analyses on "Queries with considering repetition counts (QwR)" and "Queries without considering repetition counts (QwoR)". The dataset contains (1) 2.74 million and 3.94 million QwoR, respectively for PCs and SDs, and (2) more than 100 million QwR for both PCs and SDs. We analyzed structural properties of the queries (length of the search queries, usage of query operators and special characters in health queries), types of search queries (keyword-based, wh-questions, yes/no questions), categorization of the queries based on health categories and information mentioned in the queries (gender, age-groups, temporal references), misspellings in the health queries, and the linguistic structure of the health queries. Query strings used for health information searching via PCs and SDs differ by almost 50%. The most searched health categories are "Symptoms" (1 in 3 search queries), "Causes", and "Treatments & Drugs". 
The distribution of search queries for different health categories differs with the device used for the search. Health queries tend to be longer and more specific than general search queries. Health queries from SDs are longer and have slightly fewer spelling mistakes than those from PCs. Users specify words related to women and children more often than words related to men or any other age group. Most of the health queries are formulated using keywords; the second-most common are wh- and yes/no questions. Users ask more health questions using SDs than PCs. Almost all health queries have at least one noun and health queries from SDs are more descriptive than those from PCs. This study is a large-scale comparative analysis of health search queries to understand the effects of device type (PCs vs. SDs) used on OHISB. The study indicates that the device used for online health information search plays an important role in shaping how health information searches by consumers and patients are executed.
Computing health quality measures using Informatics for Integrating Biology and the Bedside.
Klann, Jeffrey G; Murphy, Shawn N
2013-04-19
The Health Quality Measures Format (HQMF) is a Health Level 7 (HL7) standard for expressing computable Clinical Quality Measures (CQMs). Creating tools to process HQMF queries in clinical databases will become increasingly important as the United States moves forward with its Health Information Technology Strategic Plan to Stages 2 and 3 of the Meaningful Use incentive program (MU2 and MU3). Informatics for Integrating Biology and the Bedside (i2b2) is one of the analytical databases used as part of the Office of the National Coordinator (ONC)'s Query Health platform to move toward this goal. Our goal is to integrate i2b2 with the Query Health HQMF architecture, to prepare for other HQMF use-cases (such as MU2 and MU3), and to articulate the functional overlap between i2b2 and HQMF. Therefore, we analyze the structure of HQMF, and then we apply this understanding to HQMF computation on the i2b2 clinical analytical database platform. Specifically, we develop a translator between two query languages, HQMF and i2b2, so that the i2b2 platform can compute HQMF queries. We use the HQMF structure of queries for aggregate reporting, which define clinical data elements and the temporal and logical relationships between them. We use the i2b2 XML format, which allows flexible querying of a complex clinical data repository in an easy-to-understand domain-specific language. The translator can represent nearly any i2b2-XML query as HQMF and execute in i2b2 nearly any HQMF query expressible in i2b2-XML. This translator is part of the freely available reference implementation of the QueryHealth initiative. We analyze limitations of the conversion and find it covers many, but not all, of the complex temporal and logical operators required by quality measures. HQMF is an expressive language for defining quality measures, and it will be important to understand and implement for CQM computation, in both meaningful use and population health. 
However, its current form might allow complexity that is intractable for current database systems (both in terms of implementation and computation). Our translator, which supports the subset of HQMF currently expressible in i2b2-XML, may represent the beginnings of a practical compromise. It is being pilot-tested in two Query Health demonstration projects, and it can be further expanded to balance computational tractability with the advanced features needed by measure developers.
Computing Health Quality Measures Using Informatics for Integrating Biology and the Bedside
Murphy, Shawn N
2013-01-01
Background: The Health Quality Measures Format (HQMF) is a Health Level 7 (HL7) standard for expressing computable Clinical Quality Measures (CQMs). Creating tools to process HQMF queries in clinical databases will become increasingly important as the United States moves forward with its Health Information Technology Strategic Plan to Stages 2 and 3 of the Meaningful Use incentive program (MU2 and MU3). Informatics for Integrating Biology and the Bedside (i2b2) is one of the analytical databases used as part of the Office of the National Coordinator (ONC)’s Query Health platform to move toward this goal. Objective: Our goal is to integrate i2b2 with the Query Health HQMF architecture, to prepare for other HQMF use-cases (such as MU2 and MU3), and to articulate the functional overlap between i2b2 and HQMF. Therefore, we analyze the structure of HQMF, and then we apply this understanding to HQMF computation on the i2b2 clinical analytical database platform. Specifically, we develop a translator between two query languages, HQMF and i2b2, so that the i2b2 platform can compute HQMF queries. Methods: We use the HQMF structure of queries for aggregate reporting, which define clinical data elements and the temporal and logical relationships between them. We use the i2b2 XML format, which allows flexible querying of a complex clinical data repository in an easy-to-understand domain-specific language. Results: The translator can represent nearly any i2b2-XML query as HQMF and execute in i2b2 nearly any HQMF query expressible in i2b2-XML. This translator is part of the freely available reference implementation of the QueryHealth initiative. We analyze limitations of the conversion and find it covers many, but not all, of the complex temporal and logical operators required by quality measures. 
Conclusions: HQMF is an expressive language for defining quality measures, and it will be important to understand and implement for CQM computation, in both meaningful use and population health. However, its current form might allow complexity that is intractable for current database systems (both in terms of implementation and computation). Our translator, which supports the subset of HQMF currently expressible in i2b2-XML, may represent the beginnings of a practical compromise. It is being pilot-tested in two Query Health demonstration projects, and it can be further expanded to balance computational tractability with the advanced features needed by measure developers. PMID:23603227
Raising the IQ in full-text searching via intelligent querying
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kero, R.; Russell, L.; Swietlik, C.
1994-11-01
Current Information Retrieval (IR) technologies allow for efficient access to relevant information, provided that user-selected query terms coincide with the specific linguistic choices made by the authors whose works constitute the text-base. Therefore, the challenge is to enhance the limited searching capability of state-of-the-practice IR. This can be done either with augmented clients that overcome current server searching deficiencies, or with added capabilities that can augment searching algorithms on the servers. The technology being investigated is that of deductive databases, with a set of new techniques called cooperative answering. This technology utilizes semantic networks to allow for navigation between possible query search term alternatives. The augmented search terms are passed to an IR engine and the results can be compared. The project utilizes the OSTI Environment, Safety and Health Thesaurus to populate the domain-specific semantic network and the text base of ES&H-related documents from the Facility Profile Information Management System as the domain-specific search space.
Fast Query-Optimized Kernel-Machine Classification
NASA Technical Reports Server (NTRS)
Mazzoni, Dominic; DeCoste, Dennis
2004-01-01
A recently developed algorithm performs kernel-machine classification via incremental approximate nearest support vectors. The algorithm implements support-vector machines (SVMs) at speeds 10 to 100 times those attainable by use of conventional SVM algorithms. The algorithm offers potential benefits for classification of images, recognition of speech, recognition of handwriting, and diverse other applications in which there are requirements to discern patterns in large sets of data. SVMs constitute a subset of kernel machines (KMs), which have become popular as models for machine learning and, more specifically, for automated classification of input data on the basis of labeled training data. While similar in many ways to k-nearest-neighbors (k-NN) models and artificial neural networks (ANNs), SVMs tend to be more accurate. Using representations that scale only linearly in the numbers of training examples, while exploring nonlinear (kernelized) feature spaces that are exponentially larger than the original input dimensionality, KMs elegantly and practically overcome the classic curse of dimensionality. However, the price that one must pay for the power of KMs is that query-time complexity scales linearly with the number of training examples, making KMs often orders of magnitude more computationally expensive than are ANNs, decision trees, and other popular machine learning alternatives. The present algorithm treats an SVM classifier as a special form of a k-NN. The algorithm is based partly on an empirical observation that one can often achieve the same classification as that of an exact KM by using only a small fraction of the nearest support vectors (SVs) of a query. The exact KM output is a weighted sum over the kernel values between the query and the SVs. In this algorithm, the KM output is approximated with a k-NN classifier, the output of which is a weighted sum only over the kernel values involving k selected SVs. 
Before query time, statistics are gathered about how misleading the output of the k-NN model can be, relative to the outputs of the exact KM for a representative set of examples, for each possible k from 1 to the total number of SVs. From these statistics, upper and lower thresholds are derived for each step k. These thresholds identify output levels for which the particular variant of the k-NN model already leans so strongly positively or negatively that a reversal in sign is unlikely, given the weaker SV neighbors still remaining. At query time, the partial output of each query is incrementally updated, stopping as soon as it exceeds the predetermined statistical thresholds of the current step. For an easy query, stopping can occur as early as step k = 1. For more difficult queries, stopping might not occur until nearly all SVs are touched. A key empirical observation is that this approach can tolerate very approximate nearest-neighbor orderings. In experiments, SVs and queries were projected to a subspace comprising the top few principal-component dimensions, and neighbor orderings were computed in that subspace. This approach ensured that the overhead of the nearest-neighbor computations was insignificant, relative to that of the exact KM computation.
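The early-stopping scheme described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the RBF kernel, the function names, and the assumption that the per-step thresholds `lo` and `hi` have already been derived offline are all choices made for the example. The sketch also orders support vectors by exact distance, whereas the abstract notes that an approximate ordering is sufficient.

```python
import math

def rbf(x, y, gamma=1.0):
    # RBF kernel; the kernel choice is an assumption of this sketch.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def exact_output(query, svs, alphas, bias=0.0, gamma=1.0):
    # Exact kernel-machine output: weighted sum over all support vectors.
    return bias + sum(a * rbf(query, sv, gamma) for sv, a in zip(svs, alphas))

def incremental_output(query, svs, alphas, lo, hi, bias=0.0, gamma=1.0):
    """Accumulate the output nearest-SV-first and stop early.

    lo[k-1] and hi[k-1] are the lower and upper thresholds for step k,
    assumed to have been estimated before query time.
    Returns (partial_output, steps_used).
    """
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, svs[i]))

    order = sorted(range(len(svs)), key=sq_dist)  # nearest SVs first
    total = bias
    for k, i in enumerate(order, start=1):
        total += alphas[i] * rbf(query, svs[i], gamma)
        # Stop once the partial sum leans strongly enough either way.
        if total > hi[k - 1] or total < lo[k - 1]:
            return total, k
    return total, len(order)
```

With two support vectors of opposite sign, a query lying on top of one of them stops after a single step, while a query equidistant from both touches every support vector.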
Visually defining and querying consistent multi-granular clinical temporal abstractions.
Combi, Carlo; Oliboni, Barbara
2012-02-01
The main goal of this work is to propose a framework for the visual specification and query of consistent multi-granular clinical temporal abstractions. We focus on the issue of querying patient clinical information by visually defining and composing temporal abstractions, i.e., high level patterns derived from several time-stamped raw data. In particular, we focus on the visual specification of consistent temporal abstractions with different granularities and on the visual composition of different temporal abstractions for querying clinical databases. Temporal abstractions on clinical data provide a concise and high-level description of temporal raw data, and a suitable way to support decision making. Granularities define partitions on the time line and allow one to represent time and, thus, temporal clinical information at different levels of detail, according to the requirements coming from the represented clinical domain. The visual representation of temporal information has been studied for several years in clinical domains. Proposed visualization techniques must be easy and quick to understand, and could benefit from visual metaphors that do not lead to ambiguous interpretations. Recently, physical metaphors such as strips, springs, weights, and wires have been proposed and evaluated with clinical users for the specification of temporal clinical abstractions. Visual approaches to boolean queries have been considered in recent years, and they confirmed that visual support for the specification of complex boolean queries is both an important and difficult research topic. We propose and describe a visual language for the definition of temporal abstractions based on a set of intuitive metaphors (striped wall, plastered wall, brick wall), allowing the clinician to use different granularities. 
A new algorithm, underlying the visual language, allows the physician to specify only consistent abstractions, i.e., abstractions not containing contradictory conditions on the component abstractions. Moreover, we propose a visual query language where different temporal abstractions can be composed to build complex queries: temporal abstractions are visually connected through the usual logical connectives AND, OR, and NOT. The proposed visual language allows one to simply define temporal abstractions by using intuitive metaphors, and to specify temporal intervals related to abstractions by using different temporal granularities. The physician can interact with the designed and implemented tool by point-and-click selections, and can visually compose queries involving several temporal abstractions. The evaluation of the proposed granularity-related metaphors consisted of two parts: (i) solving 30 interpretation exercises by choosing the correct interpretation of a given screenshot representing a possible scenario, and (ii) solving a complex exercise, by visually specifying through the interface a scenario described only in natural language. The exercises were done by 13 subjects. The percentages of correct answers to the interpretation exercises differed slightly across the considered metaphors (54.4 for striped wall, 73.3 for plastered wall, 61 for brick wall, and 61 for no wall), but post hoc statistical analysis on means confirmed that the differences were not statistically significant. The results of the user satisfaction questionnaire on the proposed granularity-related metaphors confirmed that there was no preference for any one of them. 
The evaluation of the proposed logical notation consisted of two parts: (i) solving five interpretation exercises, each presented as a screenshot representing a possible scenario together with three different possible interpretations, of which only one was correct, and (ii) solving five exercises, by visually defining through the interface a scenario described only in natural language. The exercises increased in difficulty. The evaluation involved a total of 31 subjects. Results of this evaluation phase confirmed the soundness of the proposed solution, even in comparison with a well-known proposal based on a tabular query form (the only significant difference is that our proposal requires more time for the training phase: 21 min versus 14 min). In this work we have considered the issue of visually composing and querying temporal clinical patient data. In this context we have proposed a visual framework for the specification of consistent temporal abstractions with different granularities and for the visual composition of different temporal abstractions to build (possibly) complex queries on clinical databases. A new algorithm has been proposed to check the consistency of the specified granular abstraction. From the evaluation of the proposed metaphors and interfaces and from the comparison of the visual query language with a well-known visual method for boolean queries, the soundness of the overall system has been confirmed; moreover, pros and cons and possible improvements emerged from the comparison of different visual metaphors and solutions.
Manchester visual query language
NASA Astrophysics Data System (ADS)
Oakley, John P.; Davis, Darryl N.; Shann, Richard T.
1993-04-01
We report a database language for visual retrieval which allows queries on image feature information which has been computed and stored along with images. The language is novel in that it provides facilities for dealing with feature data which has actually been obtained from image analysis. Each line in the Manchester Visual Query Language (MVQL) takes a set of objects as input and produces another, usually smaller, set as output. The MVQL constructs are mainly based on proven operators from the field of digital image analysis. An example is the Hough-group operator, which takes as input a specification for the objects to be grouped, a specification for the relevant Hough space, and a definition of the voting rule. The output is a ranked list of high-scoring bins. The query could be directed towards one particular image or an entire image database; in the latter case, the bins in the output list would in general be associated with different images. We have implemented MVQL in two layers. The command interpreter is a Lisp program which maps each MVQL line to a sequence of commands which are used to control a specialized database engine. The latter is a hybrid graph/relational system which provides low-level support for inheritance and schema evolution. In the paper we outline the language and provide examples of useful queries. We also describe our solution to the engineering problems associated with the implementation of MVQL.
Analysis of DNS Cache Effects on Query Distribution
2013-01-01
This paper studies the DNS cache effects that occur on query distribution at the CN top-level domain (TLD) server. We first filter out malformed DNS queries, classified into six categories, to remove pollution from the log data. A model for DNS resolution, more specifically DNS caching, is presented. We demonstrate the presence and magnitude of DNS cache effects and cache sharing effects on the request distribution through an analytic model and simulation. CN TLD log data results are provided and analyzed based on the cache model. The approximate TTL distribution for domain names is inferred quantitatively. PMID:24396313
Analysis of DNS cache effects on query distribution.
Wang, Zheng
2013-01-01
This paper studies the DNS cache effects that occur on query distribution at the CN top-level domain (TLD) server. We first filter out malformed DNS queries, classified into six categories, to remove pollution from the log data. A model for DNS resolution, more specifically DNS caching, is presented. We demonstrate the presence and magnitude of DNS cache effects and cache sharing effects on the request distribution through an analytic model and simulation. CN TLD log data results are provided and analyzed based on the cache model. The approximate TTL distribution for domain names is inferred quantitatively.
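The basic cache effect studied here, where a resolver answers repeat queries from its cache so that only some of them reach the TLD server, can be shown with a toy simulation. This is a minimal sketch for a single domain name and a single caching resolver with a fixed TTL; the function name and the fixed-TTL assumption are illustrative and are not taken from the paper's model.

```python
def upstream_queries(arrival_times, ttl):
    """Return the arrival times that miss the cache and go upstream.

    arrival_times: times at which clients ask the resolver for one name.
    ttl: how long a fetched answer stays in the resolver's cache.
    """
    forwarded = []
    expires = None  # time at which the cached answer becomes stale
    for t in sorted(arrival_times):
        if expires is None or t >= expires:
            forwarded.append(t)   # cache miss: query reaches the server
            expires = t + ttl     # answer is cached until this time
        # otherwise the query is answered from the cache
    return forwarded
```

The server therefore sees at most one query per TTL window from each resolver, which is why the distribution observed at the server differs from the distribution of client requests.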
Locating Sequence on FPC Maps and Selecting a Minimal Tiling Path
Engler, Friedrich W.; Hatfield, James; Nelson, William; Soderlund, Carol A.
2003-01-01
This study discusses three software tools: the first two aid in integrating sequence with an FPC physical map, and the third automatically selects a minimal tiling path given genomic draft sequence and BAC end sequences. The first tool, FSD (FPC Simulated Digest), takes a sequenced clone and adds it back to the map based on a fingerprint generated by an in silico digest of the clone. This allows verification of sequenced clone positions and the integration of sequenced clones that were not originally part of the FPC map. The second tool, BSS (Blast Some Sequence), takes a query sequence and positions it on the map based on sequence associated with the clones in the map. BSS has multiple uses as follows: (1) When the query is a file of marker sequences, they can be added as electronic markers. (2) When the query is draft sequence, the results of BSS can be used to close gaps in a sequenced clone or the physical map. (3) When the query is a sequenced clone and the target is BAC end sequences, one may select the next clone for sequencing using both sequence comparison results and map location. (4) When the query is whole-genome draft sequence and the target is BAC end sequences, the results can be used to select many clones for a minimal tiling path at once. The third tool, pickMTP, automates the majority of this last usage of BSS. Results are presented using the rice FPC map, BAC end sequences, and whole-genome shotgun from Syngenta. PMID:12915486
Computer systems and methods for the query and visualization of multidimensional databases
Stolte, Chris; Tang, Diane L.; Hanrahan, Patrick
2006-08-08
A method and system for producing graphics. A hierarchical structure of a database is determined. A visual table, comprising a plurality of panes, is constructed by providing a specification that is in a language based on the hierarchical structure of the database. In some cases, this language can include fields that are in the database schema. The database is queried to retrieve a set of tuples in accordance with the specification. A subset of the set of tuples is associated with a pane in the plurality of panes.
Computer systems and methods for the query and visualization of multidimensional database
Stolte, Chris; Tang, Diane L.; Hanrahan, Patrick
2010-05-11
A method and system for producing graphics. A hierarchical structure of a database is determined. A visual table, comprising a plurality of panes, is constructed by providing a specification that is in a language based on the hierarchical structure of the database. In some cases, this language can include fields that are in the database schema. The database is queried to retrieve a set of tuples in accordance with the specification. A subset of the set of tuples is associated with a pane in the plurality of panes.
Huebner-Bloder, Gudrun; Duftschmid, Georg; Kohler, Michael; Rinner, Christoph; Saboor, Samrend; Ammenwerth, Elske
2012-01-01
Cross-institutional longitudinal Electronic Health Records (EHR), as currently being introduced in Austria, increase the challenge of information overload for healthcare professionals. We developed an innovative cross-institutional EHR query prototype that offers extended query options, including searching for specific information items or sets of information items. The available query options were derived from a systematic analysis of information needs of diabetes specialists during patient encounters. The prototype operates in an IHE-XDS-based environment where ISO/EN 13606-structured documents are available. We conducted a controlled study with seven diabetes specialists to assess the feasibility of this EHR query prototype and its impact on the efficient retrieval of patient information to answer typical clinical questions. The controlled study showed that the specialists were quicker and more successful (measured in percentage of expected information items found) in finding patient information compared to the standard full-document search options. The participants also appreciated the extended query options. PMID:23304308
2014-01-01
Introduction: The Internet is revolutionizing tobacco control, but few have harnessed the Web for surveillance. We demonstrate for the first time an approach for analyzing aggregate Internet search queries that captures precise changes in population considerations about tobacco. Methods: We compared tobacco-related Google queries originating in the United States during the week of the State Children’s Health Insurance Program (SCHIP) 2009 cigarette excise tax increase with a historic baseline. Specific queries were then ranked according to their relative increases while also considering approximations of changes in absolute search volume. Results: Individual queries with the largest relative increases the week of the SCHIP tax were “cigarettes Indian reservations” 640% (95% CI, 472–918), “free cigarettes online” 557% (95% CI, 432–756), and “Indian reservations cigarettes” 542% (95% CI, 414–733), amounting to about 7,500 excess searches. By themes, the largest relative increases were tribal cigarettes 246% (95% CI, 228–265), “free” cigarettes 215% (95% CI, 191–242), and cigarette stores 176% (95% CI, 160–193), accounting for 21,000, 27,000, and 90,000 excess queries. All avoidance queries, including those aforementioned themes, relatively increased 150% (95% CI, 144–155) or 550,000 from their baseline. All cessation queries increased 46% (95% CI, 44–48), or 175,000, around SCHIP; including themes for “cold turkey” 19% (95% CI, 11–27) or 2,600, cessation products 47% (95% CI, 44–50) or 78,000, and dubious cessation approaches (e.g., hypnosis) 40% (95% CI, 33–47) or 2,300. Conclusions: The SCHIP tax motivated specific changes in population considerations. Our strategy can support evaluations that temporally link tobacco control measures with instantaneous population reactions, as well as serve as a springboard for traditional studies, for example, including survey questionnaire design. PMID:24323570
2017-01-01
Reusing the data from healthcare information systems can effectively facilitate clinical trials (CTs). Selecting candidate patients who meet CT recruitment criteria is a central task. Related work either depends on a database administrator (DBA) to convert the recruitment criteria to native SQL queries or involves data mapping between a standard ontology/information model and the individual data source schema. This paper proposes an alternative computer-aided CT recruitment paradigm, based on syntax translation between different domain-specific languages (DSLs). In this paradigm, the CT recruitment criteria are first formally represented as production rules. The referenced rule variables are all from the underlying database schema. Then the production rule is translated to an intermediate query-oriented DSL (e.g., LINQ). Finally, the intermediate DSL is directly mapped to native database queries (e.g., SQL), a step automated by object-relational mapping (ORM). PMID:29065644
Faceted Visualization of Three Dimensional Neuroanatomy By Combining Ontology with Faceted Search
Veeraraghavan, Harini; Miller, James V.
2013-01-01
In this work, we present a faceted-search-based approach for visualization of anatomy by combining a three-dimensional digital atlas with an anatomy ontology. Specifically, our approach provides a drill-down search interface that exposes the relevant pieces of information (obtained by searching the ontology) for a user query. Hence, the user can produce visualizations starting with minimally specified queries. Furthermore, by automatically translating the user queries into the controlled terminology, our approach eliminates the need for the user to use controlled terminology. We demonstrate the scalability of our approach using an abdominal atlas and the same ontology. We implemented our visualization tool on the open-source 3D Slicer software. We present results of our visualization approach by combining a modified Foundational Model of Anatomy (FMA) ontology with the Surgical Planning Laboratory (SPL) Brain 3D digital atlas, and geometric models specific to patients computed using the SPL brain tumor dataset. PMID:24006207
Faceted visualization of three dimensional neuroanatomy by combining ontology with faceted search.
Veeraraghavan, Harini; Miller, James V
2014-04-01
In this work, we present a faceted-search-based approach for visualization of anatomy by combining a three-dimensional digital atlas with an anatomy ontology. Specifically, our approach provides a drill-down search interface that exposes the relevant pieces of information (obtained by searching the ontology) for a user query. Hence, the user can produce visualizations starting with minimally specified queries. Furthermore, by automatically translating the user queries into the controlled terminology, our approach eliminates the need for the user to use controlled terminology. We demonstrate the scalability of our approach using an abdominal atlas and the same ontology. We implemented our visualization tool on the open-source 3D Slicer software. We present results of our visualization approach by combining a modified Foundational Model of Anatomy (FMA) ontology with the Surgical Planning Laboratory (SPL) Brain 3D digital atlas, and geometric models specific to patients computed using the SPL brain tumor dataset.
Zhang, Yinsheng; Zhang, Guoming; Shang, Qian
2017-01-01
Reusing the data from healthcare information systems can effectively facilitate clinical trials (CTs). Selecting candidate patients who meet CT recruitment criteria is a central task. Related work either depends on a database administrator (DBA) to convert the recruitment criteria to native SQL queries or involves data mapping between a standard ontology/information model and the individual data source schema. This paper proposes an alternative computer-aided CT recruitment paradigm, based on syntax translation between different domain-specific languages (DSLs). In this paradigm, the CT recruitment criteria are first formally represented as production rules. The referenced rule variables are all from the underlying database schema. Then the production rule is translated to an intermediate query-oriented DSL (e.g., LINQ). Finally, the intermediate DSL is directly mapped to native database queries (e.g., SQL), a step automated by object-relational mapping (ORM).
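The general idea of translating a rule whose variables are schema columns into a native query can be sketched in a few lines. This is only an illustration under stated assumptions: the paper goes through an intermediate DSL such as LINQ and an ORM, whereas this sketch maps a list of conditions straight to parameterized SQL. The table name `patients`, its columns, the `id` column, and both function names are hypothetical.

```python
import sqlite3

# Rule operators mapped to their SQL spelling.
OPS = {"==": "=", "!=": "<>", ">": ">", ">=": ">=", "<": "<", "<=": "<="}

def rule_to_sql(table, conditions):
    """Translate [(column, operator, value), ...] into (sql, params).

    All conditions are joined with AND; values are passed as parameters
    rather than being interpolated into the SQL text.
    """
    if not conditions:
        raise ValueError("a rule needs at least one condition")
    if not table.isidentifier():
        raise ValueError("invalid table name")
    clauses, params = [], []
    for column, op, value in conditions:
        if not column.isidentifier():
            raise ValueError("invalid column name")
        clauses.append(f"{column} {OPS[op]} ?")
        params.append(value)
    sql = f"SELECT id FROM {table} WHERE " + " AND ".join(clauses)
    return sql, params

def select_candidates(conn, table, conditions):
    # Run the translated rule and return the matching patient ids.
    sql, params = rule_to_sql(table, conditions)
    return [row[0] for row in conn.execute(sql, params)]
```

A criterion such as "age at least 65 and diagnosis equal to a given code" becomes `[("age", ">=", 65), ("diagnosis", "==", code)]` in this representation.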
CUFID-query: accurate network querying through random walk based network flow estimation.
Jeong, Hyundoo; Qian, Xiaoning; Yoon, Byung-Jun
2017-12-28
Functional modules in biological networks consist of numerous biomolecules and their complicated interactions. Recent studies have shown that biomolecules in a functional module tend to have similar interaction patterns and that such modules are often conserved across biological networks of different species. As a result, such conserved functional modules can be identified through comparative analysis of biological networks. In this work, we propose a novel network querying algorithm based on the CUFID (Comparative network analysis Using the steady-state network Flow to IDentify orthologous proteins) framework combined with an efficient seed-and-extension approach. The proposed algorithm, CUFID-query, can accurately detect conserved functional modules as small subnetworks in the target network that are expected to perform similar functions to the given query functional module. The CUFID framework was recently developed for probabilistic pairwise global comparison of biological networks, and it has been applied to pairwise global network alignment, where the framework was shown to yield accurate network alignment results. In the proposed CUFID-query algorithm, we adopt the CUFID framework and extend it for local network alignment, specifically to solve network querying problems. First, in the seed selection phase, the proposed method utilizes the CUFID framework to compare the query and the target networks and to predict the probabilistic node-to-node correspondence between the networks. Next, the algorithm selects and greedily extends the seed in the target network by iteratively adding nodes that have frequent interactions with other nodes in the seed network, in a way that the conductance of the extended network is maximally reduced. Finally, CUFID-query removes irrelevant nodes from the querying results based on the personalized PageRank vector for the induced network that includes the fully extended network and its neighboring nodes. 
Through extensive performance evaluation based on biological networks with known functional modules, we show that CUFID-query outperforms the existing state-of-the-art algorithms in terms of prediction accuracy and biological significance of the predictions.
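The greedy extension step, in which nodes are added so that the conductance of the growing subnetwork drops as much as possible, can be sketched as follows. This covers only that one step, for an unweighted graph given as an adjacency dictionary; it does not include the CUFID correspondence scores used for seed selection or the personalized PageRank pruning. The function names, the `max_size` cap, and the convention of returning infinity when one side has zero volume are assumptions of the sketch.

```python
def conductance(adj, nodes):
    """Cut edges leaving `nodes` divided by the smaller side's volume."""
    nodes = set(nodes)
    cut = sum(1 for u in nodes for v in adj[u] if v not in nodes)
    vol_in = sum(len(adj[u]) for u in nodes)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom else float("inf")

def extend_seed(adj, seed, max_size):
    """Greedily add the neighbor that lowers conductance the most."""
    module = set(seed)
    while len(module) < max_size:
        frontier = {v for u in module for v in adj[u]} - module
        if not frontier:
            break
        current = conductance(adj, module)
        # sorted() makes tie-breaking deterministic
        best = min(sorted(frontier),
                   key=lambda v: conductance(adj, module | {v}))
        if conductance(adj, module | {best}) >= current:
            break  # no candidate reduces conductance any further
        module.add(best)
    return module
```

On a graph made of two triangles joined by a single edge, starting from one node of a triangle, the extension stops once the whole triangle has been collected, because adding a node from the other triangle would raise the conductance.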
Towards computational improvement of DNA database indexing and short DNA query searching.
Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska
2014-09-03
In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach, identifying all query hits in the database, without having to examine all entries in the indexed data structure, limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported, if the database is searched against a query shorter than [Formula: see text] nucleotides, such that [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution to this drawback is also presented.
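Indexing a DNA string with a mapping function and then looking up exact hits can be sketched generically as below. This is not the indexing equation proposed in the paper, which the abstract does not give: the base-4 encoding, the word length `w`, and the function names are assumptions for illustration. The sketch handles only queries at least as long as the indexed word and rejects shorter ones rather than guessing at the paper's fix for that case.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    # Map a DNA word to an integer by reading it as a base-4 number.
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value

def build_index(db, w):
    # Record the start position of every length-w word in the database.
    index = defaultdict(list)
    for i in range(len(db) - w + 1):
        index[encode(db[i:i + w])].append(i)
    return index

def find_hits(db, index, w, query):
    """Return start positions of exact occurrences of `query` in `db`."""
    if len(query) < w:
        raise ValueError("query is shorter than the indexed word length")
    # Only positions whose first w bases match need to be verified.
    candidates = index.get(encode(query[:w]), [])
    return [i for i in candidates if db[i:i + len(query)] == query]
```

Only the positions listed under the query's first word are checked, so the search does not scan every entry of the index.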
Generalized query-based active learning to identify differentially methylated regions in DNA.
Haque, Md Muksitul; Holder, Lawrence B; Skinner, Michael K; Cook, Diane J
2013-01-01
Active learning is a supervised learning technique that reduces the number of examples required for building a successful classifier, because it can choose the data it learns from. This technique holds promise for many biological domains in which classified examples are expensive and time-consuming to obtain. Most traditional active learning methods ask very specific queries to the Oracle (e.g., a human expert) to label an unlabeled example. The example may consist of numerous features, many of which are irrelevant. Removing such features will create a shorter query with only relevant features, and it will be easier for the Oracle to answer. We propose a generalized query-based active learning (GQAL) approach that constructs generalized queries based on multiple instances. By constructing appropriately generalized queries, we can achieve higher accuracy compared to traditional active learning methods. We apply our active learning method to find differentially DNA methylated regions (DMRs). DMRs are DNA locations in the genome that are known to be involved in tissue differentiation, epigenetic regulation, and disease. We also apply our method on 13 other data sets and show that our method is better than another popular active learning technique.
Chikayama, Eisuke; Yamashina, Ryo; Komatsu, Keiko; Tsuboi, Yuuri; Sakata, Kenji; Kikuchi, Jun; Sekiyama, Yasuyo
2016-01-01
Foods from agriculture and fishery products are processed using various technologies. Molecular mixture analysis during food processing has the potential to help us understand the molecular mechanisms involved, thus enabling better cooking of the analyzed foods. To date, there has been no web-based tool focusing on accumulating Nuclear Magnetic Resonance (NMR) spectra from various types of food processing. Therefore, we have developed a novel web-based tool, FoodPro, that includes a food NMR spectrum database and computes covariance and correlation spectra to tasting and hardness. As a result, FoodPro has accumulated 236 aqueous (extracted in D2O) and 131 hydrophobic (extracted in CDCl3) experimental bench-top 60-MHz NMR spectra, 1753 tastings scored by volunteers, and 139 hardness measurements recorded by a penetrometer, all placed into a core database. The database content was roughly classified into fish and vegetable groups from the viewpoint of different spectrum patterns. FoodPro can query a user food NMR spectrum, search similar NMR spectra with a specified similarity threshold, and then compute estimated tasting and hardness, covariance, and correlation spectra to tasting and hardness. Querying fish spectra exemplified specific covariance spectra to tasting and hardness, giving positive covariance for tasting at 1.31 ppm for lactate and 3.47 ppm for glucose and a positive covariance for hardness at 3.26 ppm for trimethylamine N-oxide. PMID:27775560
Chikayama, Eisuke; Yamashina, Ryo; Komatsu, Keiko; Tsuboi, Yuuri; Sakata, Kenji; Kikuchi, Jun; Sekiyama, Yasuyo
2016-10-19
Foods from agriculture and fishery products are processed using various technologies. Molecular mixture analysis during food processing has the potential to help us understand the molecular mechanisms involved, thus enabling better cooking of the analyzed foods. To date, there has been no web-based tool focusing on accumulating Nuclear Magnetic Resonance (NMR) spectra from various types of food processing. Therefore, we have developed a novel web-based tool, FoodPro, that includes a food NMR spectrum database and computes covariance and correlation spectra to tasting and hardness. As a result, FoodPro has accumulated 236 aqueous (extracted in D₂O) and 131 hydrophobic (extracted in CDCl₃) experimental bench-top 60-MHz NMR spectra, 1753 tastings scored by volunteers, and 139 hardness measurements recorded by a penetrometer, all placed into a core database. The database content was roughly classified into fish and vegetable groups from the viewpoint of different spectrum patterns. FoodPro can query a user food NMR spectrum, search similar NMR spectra with a specified similarity threshold, and then compute estimated tasting and hardness, covariance, and correlation spectra to tasting and hardness. Querying fish spectra exemplified specific covariance spectra to tasting and hardness, giving positive covariance for tasting at 1.31 ppm for lactate and 3.47 ppm for glucose and a positive covariance for hardness at 3.26 ppm for trimethylamine N-oxide.
BlackOPs: increasing confidence in variant detection through mappability filtering.
Cabanski, Christopher R; Wilkerson, Matthew D; Soloway, Matthew; Parker, Joel S; Liu, Jinze; Prins, Jan F; Marron, J S; Perou, Charles M; Hayes, D Neil
2013-10-01
Identifying variants using high-throughput sequencing data is currently a challenge because true biological variants can be indistinguishable from technical artifacts. One source of technical artifact results from incorrectly aligning experimentally observed sequences to their true genomic origin ('mismapping') and inferring differences in mismapped sequences to be true variants. We developed BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences derived from the reference genome, aligns these sequences by custom parameters, detects variants and outputs a blacklist of positions and alleles caused by mismapping. Blacklists contain thousands of artifact variants that are indistinguishable from true variants and, for a given sample, are expected to be almost completely false positives. We show that these blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist specific to their experimental setup. We queried the dbSNP and COSMIC variant databases and found numerous variants indistinguishable from mapping errors. We demonstrate how filtering against blacklist positions reduces the number of potential false variants using an RNA-seq glioblastoma cell line data set. In summary, accounting for mapping-caused variants tuned to experimental setups reduces false positives and, therefore, improves genome characterization by high-throughput sequencing.
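The filtering step described above can be sketched as a set lookup on position and allele. This is a minimal illustration, assuming variants are represented as `(chrom, pos, alt)` tuples; the function name and toy coordinates are hypothetical and do not reflect BlackOPs' actual file formats or interface.

```python
def filter_variants(variants, blacklist):
    """Split variant calls into (kept, removed) using a blacklist.

    variants:  list of (chrom, pos, alt) tuples.
    blacklist: set of (chrom, pos, alt) tuples attributed to mismapping.
    """
    kept, removed = [], []
    for variant in variants:
        if variant in blacklist:
            removed.append(variant)
        else:
            kept.append(variant)
    return kept, removed


# Hypothetical toy data.
blacklist = {("chr1", 12345, "T"), ("chr7", 55500, "A")}
calls = [("chr1", 12345, "T"), ("chr1", 12345, "G"), ("chr2", 200, "C")]
kept, removed = filter_variants(calls, blacklist)
```

Matching on the allele as well as the position means that a different alternate allele at a blacklisted position is still retained.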
A novel methodology for querying web images
NASA Astrophysics Data System (ADS)
Prabhakara, Rashmi; Lee, Ching Cheng
2005-01-01
Ever since the advent of the Internet, there has been immense growth in the amount of image data available on the World Wide Web. With so many images available, an efficient and effective image retrieval system is required to make use of this information. This research presents an effective image matching and indexing technique that improves on existing integrated image retrieval methods. The proposed technique follows a two-phase approach, integrating query by topic and query by example specification methods. The first phase consists of topic-based image retrieval using an improved text information retrieval (IR) technique that makes use of the structured format of HTML documents. It consists of a focused crawler that lets the user enter not only the keyword for the topic-based search but also the scope in which the user wants to find the images. The second phase uses the query by example specification to perform a low-level content-based image match, retrieving a smaller set of results that are relatively closer to the example image. Information related to the image feature is automatically extracted from the query image by the image processing system. A color-feature-based technique that is not computationally intensive is used to perform content-based matching of images. The main goal is to develop a functional image search and indexing system and to demonstrate that better retrieval results can be achieved with this proposed hybrid search technique.
A novel methodology for querying web images
NASA Astrophysics Data System (ADS)
Prabhakara, Rashmi; Lee, Ching Cheng
2004-12-01
Ever since the advent of the Internet, there has been immense growth in the amount of image data available on the World Wide Web. With so many images available, an efficient and effective image retrieval system is required to make use of this information. This research presents an effective image matching and indexing technique that improves on existing integrated image retrieval methods. The proposed technique follows a two-phase approach, integrating query by topic and query by example specification methods. The first phase consists of topic-based image retrieval using an improved text information retrieval (IR) technique that makes use of the structured format of HTML documents. It consists of a focused crawler that lets the user enter not only the keyword for the topic-based search but also the scope in which the user wants to find the images. The second phase uses the query by example specification to perform a low-level content-based image match, retrieving a smaller set of results that are relatively closer to the example image. Information related to the image feature is automatically extracted from the query image by the image processing system. A color-feature-based technique that is not computationally intensive is used to perform content-based matching of images. The main goal is to develop a functional image search and indexing system and to demonstrate that better retrieval results can be achieved with this proposed hybrid search technique.
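The abstract does not say which color feature is used. One common inexpensive choice is a coarse color histogram compared by histogram intersection, sketched below. The bin count, function names and toy images are assumptions for illustration, not the authors' method.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalised coarse RGB histogram.

    pixels: iterable of (r, g, b) tuples with channel values in 0..255.
    """
    step = 256 / bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0

    def bin_of(value):
        # Clamp so that value 255 always lands in the last bin.
        return min(int(value / step), bins_per_channel - 1)

    for r, g, b in pixels:
        index = (bin_of(r) * bins_per_channel + bin_of(g)) * bins_per_channel + bin_of(b)
        hist[index] += 1
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return [h / count for h in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical toy images given as flat pixel lists.
red_image = [(255, 0, 0)] * 4
mixed_image = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)]
same_score = histogram_intersection(color_histogram(red_image), color_histogram(red_image))
mixed_score = histogram_intersection(color_histogram(red_image), color_histogram(mixed_image))
```

Because the histogram is computed once per image and compared with a single pass over a small vector, the matching cost stays low, which fits the stated aim of a technique that is not computationally intensive.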
VIGOR: Interactive Visual Exploration of Graph Query Results.
Pienta, Robert; Hohman, Fred; Endert, Alex; Tamersoy, Acar; Roundy, Kevin; Gates, Chris; Navathe, Shamkant; Chau, Duen Horng
2018-01-01
Finding patterns in graphs has become a vital challenge in many domains, from biological systems and network security to finance (e.g., finding money laundering rings of bankers and business owners). While there is significant interest in graph databases and querying techniques, less research has focused on helping analysts make sense of underlying patterns within a group of subgraph results. Visualizing graph query results is challenging, requiring effective summarization of a large number of subgraphs, each having potentially shared node-values, rich node features, and flexible structure across queries. We present VIGOR, a novel interactive visual analytics system for exploring and making sense of query results. VIGOR uses multiple coordinated views, leveraging different data representations and organizations to streamline analysts' sensemaking process. VIGOR contributes: (1) an exemplar-based interaction technique, where an analyst starts with a specific result and relaxes constraints to find other similar results, or starts with only the structure (i.e., without node value constraints) and adds constraints to narrow in on specific results; and (2) a novel feature-aware subgraph result summarization. Through a collaboration with Symantec, we demonstrate how VIGOR helps tackle real-world problems through the discovery of security blindspots in a cybersecurity dataset with over 11,000 incidents. We also evaluate VIGOR with a within-subjects study, demonstrating VIGOR's ease of use over a leading graph database management system, and its ability to help analysts understand their results at higher speed and make fewer errors.
Progressive content-based retrieval of image and video with adaptive and iterative refinement
NASA Technical Reports Server (NTRS)
Li, Chung-Sheng (Inventor); Turek, John Joseph Edward (Inventor); Castelli, Vittorio (Inventor); Chen, Ming-Syan (Inventor)
1998-01-01
A method and apparatus for minimizing the time required to obtain results for a content based query in a data base. More specifically, with this invention, the data base is partitioned into a plurality of groups. Then, a schedule or sequence of groups is assigned to each of the operations of the query, where the schedule represents the order in which an operation of the query will be applied to the groups in the schedule. Each schedule is arranged so that each application of the operation operates on the group which will yield intermediate results that are closest to final results.
Ad-Hoc Queries over Document Collections - A Case Study
NASA Astrophysics Data System (ADS)
Löser, Alexander; Lutter, Steffen; Düssel, Patrick; Markl, Volker
We discuss the novel problem of supporting analytical business intelligence queries over web-based textual content, e.g., BI-style reports based on 100.000's of documents from an ad-hoc web search result. Neither conventional search engines nor conventional Business Intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. "Google Squared" or our system GOOLAP.info, are examples of these kinds of systems. They execute information extraction methods over one or several document collections at query time and integrate extracted records into a common view or tabular structure. Frequent extraction and object resolution failures cause incomplete records which could not be joined into a record answering the query. Our focus is the identification of join-reordering heuristics maximizing the size of complete records answering a structured query. With respect to given costs for document extraction we propose two novel join-operations: The multi-way CJ-operator joins records from multiple relationships extracted from a single document. The two-way join-operator DJ ensures data density by removing incomplete records from results. In a preliminary case study we observe that our join-reordering heuristics positively impact result size, record density and lower execution costs.
Moon, Rachel Y; Hauck, Fern R; Kellams, Ann L; Colson, Eve R; Geller, Nicole L; Heeren, Timothy C; Kerr, Stephen M; Corwin, Michael J
To assess how mothers' choice of e-mail or text messages (SMS) to receive safe sleep communications is associated with educational video viewing and responses to care practice queries. Seven hundred ninety-two new mothers received safe sleep-related communications for 60 days after newborn hospital discharge as part of a trial of health education interventions on infant care practices. Mothers chose e-mail or SMS for study communications and were sent 22 short safe sleep videos and 41 queries regarding infant care practices. Study communications via e-mail were elected by 55.7% of participants. The SMS group had a modestly higher overall view rate of videos (59.1% vs 54.4%; adjusted odds ratio [aOR], 1.39; 95% confidence interval [CI], 1.07-1.81) and a substantially higher response rate to queries (70.0% vs 45.2%; aOR, 3.48; 95% CI, 2.74-4.43). Participants more commonly opted to receive infant care practice videos and queries via e-mail. SMS was associated with higher viewing and response rates, especially for query responses. These results highlight the importance of understanding how specific modalities of communication might vary in reach. Copyright © 2017 Academic Pediatric Association. Published by Elsevier Inc. All rights reserved.
Xiao, Fuyuan; Aritsugi, Masayoshi; Wang, Qing; Zhang, Rong
2016-09-01
For efficient and sophisticated analysis of complex event patterns that appear in streams of big data from health care information systems and support for decision-making, a triaxial hierarchical model is proposed in this paper. Our triaxial hierarchical model is developed by focusing on hierarchies among nested event pattern queries with an event concept hierarchy, thereby allowing us to identify the relationships among the expressions and sub-expressions of the queries extensively. We devise a cost-based heuristic by means of the triaxial hierarchical model to find an optimised query execution plan in terms of the costs of both the operators and the communications between them. According to the triaxial hierarchical model, we can also calculate how to reuse the results of the common sub-expressions in multiple queries. By integrating the optimised query execution plan with the reuse schemes, a multi-query optimisation strategy is developed to accomplish efficient processing of multiple nested event pattern queries. We present empirical studies in which the performance of the multi-query optimisation strategy was examined under various stream input rates and workloads. Specifically, the workloads of pattern queries can be used to support monitoring of patients' conditions. Experiments with varying stream input rates correspond to changes in the number of patients that a system must manage, whereas burst input rates correspond to sudden rushes of patients to be taken care of. The experimental results show that, in Workload 1, our proposal improves throughput by about 4 and 2 times compared with the related works, respectively; in Workload 2, by about 3 and 2 times compared with the related works, respectively; and in Workload 3, by about 6 times compared with the related work.
The experimental results demonstrated that our proposal was able to process complex queries efficiently which can support health information systems and further decision-making. Copyright © 2016 Elsevier B.V. All rights reserved.
Artificial Intelligence - Research and Applications
1975-05-01
G, »aln H, Harrow A, Brain B, Deutsch P, Duda R, Flues T, Garvey P. Hart G, Hendrlx 0, Lynch B. Meyer M. Pattner C . Sacerdotl D ...System a. The Procedural Net b. Task-Specific Knowledge c . The Planning Algorithm d . The Execution Algorithm 3. The Semantics of Assembly and...101 3. Querying State Description Models 103 a. Truth Values 103 b. Generators Instead of Backtracking 104 c . The Query Functions 107 d
Ong, Edison; Xiang, Zuoshuang; Zhao, Bin; Liu, Yue; Lin, Yu; Zheng, Jie; Mungall, Chris; Courtot, Mélanie; Ruttenberg, Alan; He, Yongqun
2017-01-01
Linked Data (LD) aims to achieve interconnected data by representing entities using Uniform Resource Identifiers (URIs), and sharing information using the Resource Description Framework (RDF) and HTTP. Ontologies, which logically represent entities and relations in specific domains, are the basis of LD. Ontobee (http://www.ontobee.org/) is a linked ontology data server that stores ontology information using RDF triple store technology and supports query, visualization and linkage of ontology terms. Ontobee is also the default linked data server for publishing and browsing biomedical ontologies in the Open Biological Ontology (OBO) Foundry (http://obofoundry.org) library. Ontobee currently hosts more than 180 ontologies (including 131 OBO Foundry Library ontologies) with over four million terms. Ontobee provides a user-friendly web interface for querying and visualizing the details and hierarchy of a specific ontology term. Using the eXtensible Stylesheet Language Transformation (XSLT) technology, Ontobee is able to dereference a single ontology term URI, and then output RDF/eXtensible Markup Language (XML) for computer processing or display the HTML information on a web browser for human users. Statistics and detailed information are generated and displayed for each ontology listed in Ontobee. In addition, a SPARQL web interface is provided for custom advanced SPARQL queries of one or multiple ontologies. PMID:27733503
A new reference implementation of the PSICQUIC web service.
del-Toro, Noemi; Dumousseau, Marine; Orchard, Sandra; Jimenez, Rafael C; Galeota, Eugenia; Launay, Guillaume; Goll, Johannes; Breuer, Karin; Ono, Keiichiro; Salwinski, Lukasz; Hermjakob, Henning
2013-07-01
The Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) specification was created by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to enable computational access to molecular-interaction data resources by means of a standard Web Service and query language. Currently providing >150 million binary interaction evidences from 28 servers globally, the PSICQUIC interface allows the concurrent search of multiple molecular-interaction information resources using a single query. Here, we present an extension of the PSICQUIC specification (version 1.3), which has been released to be compliant with the enhanced standards in molecular interactions. The new release also includes a new reference implementation of the PSICQUIC server available to the data providers. It offers augmented web service capabilities and improves the user experience. PSICQUIC has been running for almost 5 years, with a user base growing from only 4 data providers to 28 (April 2013) allowing access to 151 310 109 binary interactions. The power of this web service is shown in PSICQUIC View web application, an example of how to simultaneously query, browse and download results from the different PSICQUIC servers. This application is free and open to all users with no login requirement (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml).
Harvesting implementation for the GI-cat distributed catalog
NASA Astrophysics Data System (ADS)
Boldrini, Enrico; Papeschi, Fabrizio; Bigagli, Lorenzo; Mazzetti, Paolo
2010-05-01
The GI-cat framework implements a distributed catalog service supporting different international standards and interoperability arrangements in use by the geoscientific community. The distribution functionality, in conjunction with the mediation functionality, makes it possible to seamlessly query remote heterogeneous data sources, including OGC Web Services (e.g. OGC CSW, WCS, WFS and WMS), community standards such as UNIDATA THREDDS/OPeNDAP, SeaDataNet CDI (Common Data Index), GBIF (Global Biodiversity Information Facility) services and OpenSearch engines. In the GI-cat modular architecture, a distributor component carries out the distribution functionality by delegating queries to the mediator components (one for each different data source). Each of these mediator components is able to query a specific data source and convert the results back by mapping the foreign data model to the GI-cat internal one, based on ISO 19139. In order to cope with deployment scenarios in which local data is expected, a harvesting approach has been tried. The new strategy comes in addition to the consolidated distributed approach, allowing the user to switch between a remote and a local search at will for each federated resource; this extends GI-cat configuration possibilities. The harvesting strategy is built in GI-cat around a local cache component, implemented as a native XML database and based on eXist. The different heterogeneous sources are queried for the bulk of available data; this data is then injected into the cache component after being converted to the GI-cat data model. The query and conversion steps are performed by the mediator components that are part of the GI-cat framework. Afterward, each new query can be run against the local data stored in the cache component.
Considering the advantages and shortcomings of both the harvesting and query distribution approaches, it emerges that user-driven tuning is required to get the best of both. This is often related to the specific user scenarios to be implemented. GI-cat proved to be a flexible framework for addressing user needs. The GI-cat configurator tool was updated to make such tuning possible: each data source can be configured to use either the harvesting or the query distribution approach; in the former case an appropriate harvesting interval can be set.
Steppan, Martin; Kraus, Ludwig; Piontek, Daniela; Siciliano, Valeria
2013-01-01
Prevalence estimation of cannabis use is usually based on self-report data. Although there is evidence on the reliability of this data source, its cross-cultural validity is still a major concern. External objective criteria are needed for this purpose. In this study, cannabis-related search engine query data are used as an external criterion. Data on cannabis use were taken from the 2007 European School Survey Project on Alcohol and Other Drugs (ESPAD). Provincial data came from three Italian nation-wide studies using the same methodology (2006-2008; ESPAD-Italia). Information on cannabis-related search engine query data was based on Google search volume indices (GSI). (1) Reliability analysis was conducted for GSI. (2) Latent measurement models of "true" cannabis prevalence were tested using perceived availability, web-based cannabis searches and self-reported prevalence as indicators. (3) Structure models were set up to test the influences of response tendencies and geographical position (latitude, longitude). In order to test the stability of the models, analyses were conducted on country level (Europe, US) and on provincial level in Italy. Cannabis-related GSI were found to be highly reliable and constant over time. The overall measurement model was highly significant in both data sets. On country level, no significant effects of response bias indicators and geographical position on perceived availability, web-based cannabis searches and self-reported prevalence were found. On provincial level, latitude had a significant positive effect on availability indicating that perceived availability of cannabis in northern Italy was higher than expected from the other indicators. Although GSI showed weaker associations with cannabis use than perceived availability, the findings underline the external validity and usefulness of search engine query data as external criteria. 
The findings suggest an acceptable relative comparability of national (provincial) prevalence estimates of cannabis use that are based on a common survey methodology. Search engine query data are too weak an indicator to serve as the sole basis for prevalence estimates, but in combination with other sources (waste water analysis, sales of cigarette paper) they may provide satisfactory estimates. Copyright © 2012. Published by Elsevier B.V.
Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks
Zhao, Yongan; Carey, Knox; Lloyd, David; Sofia, Heidi; Baker, Dixie; Flicek, Paul; Shringarpure, Suyash; Bustamante, Carlos; Wang, Shuang; Jiang, Xiaoqian; Ohno-Machado, Lucila; Tang, Haixu; Wang, XiaoFeng; Hubaux, Jean-Pierre
2018-01-01
The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context—a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or “beacon”) is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertains and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual’s whole genome sequence), the individual’s membership in a beacon can be inferred through repeated queries for variants present in the individual’s genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets. PMID:28339683
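The third strategy, a per-user access budget for each individual genome, can be illustrated with a simple counter. The sketch below is a deliberate simplification under stated assumptions: every "yes" contribution costs one unit, and a genome stops contributing for a user once that user's budget for it is spent. The class name, the flat cost model and the toy data are hypothetical, and the accounting proposed in the paper may differ.

```python
from collections import defaultdict


class BudgetedBeacon:
    """Toy beacon that limits how often each user can 'touch' each genome."""

    def __init__(self, genomes, budget_per_genome):
        # genomes: dict mapping individual id -> set of (chrom, pos, allele).
        self.genomes = genomes
        self.budget = budget_per_genome
        self.used = defaultdict(int)  # (user, individual) -> units charged

    def query(self, user, chrom, pos, allele):
        """Return True if any genome still within budget carries the allele."""
        key = (chrom, pos, allele)
        answer = False
        for individual, alleles in self.genomes.items():
            if key not in alleles:
                continue
            if self.used[(user, individual)] >= self.budget:
                continue  # this genome no longer contributes for this user
            self.used[(user, individual)] += 1
            answer = True
        return answer


# Hypothetical toy data: two individuals, budget of one access each.
beacon = BudgetedBeacon(
    {"p1": {("1", 100, "A")}, "p2": {("1", 100, "A"), ("2", 50, "T")}},
    budget_per_genome=1,
)
first = beacon.query("alice", "1", 100, "A")   # charges p1 and p2
second = beacon.query("alice", "2", 50, "T")   # p2 is exhausted for alice
other = beacon.query("bob", "2", 50, "T")      # bob has his own budget
```

The point of the design is that repeated queries targeting one person's variants stop yielding information about that person, while other users and other genomes are unaffected.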
Seasonality in seeking mental health information on Google.
Ayers, John W; Althouse, Benjamin M; Allem, Jon-Patrick; Rosenquist, J Niels; Ford, Daniel E
2013-05-01
Population mental health surveillance is an important challenge limited by resource constraints, long time lags in data collection, and stigma. One promising approach to bridge similar gaps elsewhere has been the use of passively generated digital data. This article assesses the viability of aggregate Internet search queries for real-time monitoring of several mental health problems, specifically in regard to seasonal patterns of seeking out mental health information. All Google mental health queries were monitored in the U.S. and Australia from 2006 to 2010. Additionally, queries were subdivided among those including the terms ADHD (attention deficit-hyperactivity disorder); anxiety; bipolar; depression; anorexia or bulimia (eating disorders); OCD (obsessive-compulsive disorder); schizophrenia; and suicide. A wavelet phase analysis was used to isolate seasonal components in the trends, and based on this model, the mean search volume in winter was compared with that in summer, as performed in 2012. All mental health queries followed seasonal patterns with winter peaks and summer troughs amounting to a 14% (95% CI=11%, 16%) difference in volume for the U.S. and 11% (95% CI=7%, 15%) for Australia. These patterns also were evident for all specific subcategories of illness or problem. For instance, seasonal differences ranged from 7% (95% CI=5%, 10%) for anxiety (followed by OCD, bipolar, depression, suicide, ADHD, schizophrenia) to 37% (95% CI=31%, 44%) for eating disorder queries in the U.S. Several nonclinical motivators for query seasonality (such as media trends or academic interest) were explored and rejected. Information seeking on Google across all major mental illnesses and/or problems followed seasonal patterns similar to those found for seasonal affective disorder. 
These are the first data published on patterns of seasonality in information seeking encompassing all the major mental illnesses, notable also because they likely would have gone undetected using traditional surveillance. Copyright © 2013. Published by Elsevier Inc.
GELLO: an object-oriented query and expression language for clinical decision support.
Sordo, Margarita; Ogunyemi, Omolola; Boxwala, Aziz A; Greenes, Robert A
2003-01-01
GELLO is a purpose-specific, object-oriented (OO) query and expression language. GELLO is the result of a concerted effort of the Decision Systems Group (DSG) working with the HL7 Clinical Decision Support Technical Committee (CDSTC) to provide the HL7 community with a common format for data encoding and manipulation. GELLO will soon be submitted for ballot to the HL7 CDSTC for consideration as a standard.
Mamouras, Konstantinos; Raghothaman, Mukund; Alur, Rajeev; Ives, Zachary G; Khanna, Sanjeev
2017-06-01
Real-time decision making in emerging IoT applications typically relies on computing quantitative summaries of large data streams in an efficient and incremental manner. To simplify the task of programming the desired logic, we propose StreamQRE, which provides natural and high-level constructs for processing streaming data. Our language has a novel integration of linguistic constructs from two distinct programming paradigms: streaming extensions of relational query languages and quantitative extensions of regular expressions. The former allows the programmer to employ relational constructs to partition the input data by keys and to integrate data streams from different sources, while the latter can be used to exploit the logical hierarchy in the input stream for modular specifications. We first present the core language with a small set of combinators, formal semantics, and a decidable type system. We then show how to express a number of common patterns with illustrative examples. Our compilation algorithm translates the high-level query into a streaming algorithm with precise complexity bounds on per-item processing time and total memory footprint. We also show how to integrate approximation algorithms into our framework. We report on an implementation in Java, and evaluate it with respect to existing high-performance engines for processing streaming data. Our experimental evaluation shows that (1) StreamQRE allows more natural and succinct specification of queries compared to existing frameworks, (2) the throughput of our implementation is higher than comparable systems (for example, two-to-four times greater than RxJava), and (3) the approximation algorithms supported by our implementation can lead to substantial memory savings.
Mamouras, Konstantinos; Raghothaman, Mukund; Alur, Rajeev; Ives, Zachary G.; Khanna, Sanjeev
2017-01-01
Real-time decision making in emerging IoT applications typically relies on computing quantitative summaries of large data streams in an efficient and incremental manner. To simplify the task of programming the desired logic, we propose StreamQRE, which provides natural and high-level constructs for processing streaming data. Our language has a novel integration of linguistic constructs from two distinct programming paradigms: streaming extensions of relational query languages and quantitative extensions of regular expressions. The former allows the programmer to employ relational constructs to partition the input data by keys and to integrate data streams from different sources, while the latter can be used to exploit the logical hierarchy in the input stream for modular specifications. We first present the core language with a small set of combinators, formal semantics, and a decidable type system. We then show how to express a number of common patterns with illustrative examples. Our compilation algorithm translates the high-level query into a streaming algorithm with precise complexity bounds on per-item processing time and total memory footprint. We also show how to integrate approximation algorithms into our framework. We report on an implementation in Java, and evaluate it with respect to existing high-performance engines for processing streaming data. Our experimental evaluation shows that (1) StreamQRE allows more natural and succinct specification of queries compared to existing frameworks, (2) the throughput of our implementation is higher than comparable systems (for example, two-to-four times greater than RxJava), and (3) the approximation algorithms supported by our implementation can lead to substantial memory savings. PMID:29151821
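The combination of key-based partitioning with incremental summaries can be illustrated with a small sketch that keeps a running mean per key using constant work per item and memory proportional to the number of keys. The reported implementation is in Java; this sketch is written in Python for brevity and does not use or imitate the StreamQRE API. The class name and toy stream are assumptions.

```python
class KeyedRunningMean:
    """Partition a stream by key and maintain a running mean for each key."""

    def __init__(self):
        self._count = {}
        self._total = {}

    def push(self, key, value):
        """Consume one item and return the updated mean for its key."""
        self._count[key] = self._count.get(key, 0) + 1
        self._total[key] = self._total.get(key, 0.0) + value
        return self._total[key] / self._count[key]


# Hypothetical toy stream of (key, measurement) pairs.
aggregator = KeyedRunningMean()
stream = [("sensorA", 10.0), ("sensorB", 4.0), ("sensorA", 20.0)]
outputs = [aggregator.push(key, value) for key, value in stream]
```

Each item is processed once and never stored, which is the property that makes per-item time and total memory bounds of the kind mentioned in the abstract possible to state.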
Irrelevance Reasoning in Knowledge Based Systems
NASA Technical Reports Server (NTRS)
Levy, A. Y.
1993-01-01
This dissertation considers the problem of reasoning about irrelevance of knowledge in a principled and efficient manner. Specifically, it is concerned with two key problems: (1) developing algorithms for automatically deciding what parts of a knowledge base are irrelevant to a query and (2) the utility of relevance reasoning. The dissertation describes a novel tool, the query-tree, for reasoning about irrelevance. Based on the query-tree, we develop several algorithms for deciding what formulas are irrelevant to a query. Our general framework sheds new light on the problem of detecting independence of queries from updates. We present new results that significantly extend previous work in this area. The framework also provides a setting in which to investigate the connection between the notion of irrelevance and the creation of abstractions. We propose a new approach to research on reasoning with abstractions, in which we investigate the properties of an abstraction by considering the irrelevance claims on which it is based. We demonstrate the potential of the approach for the cases of abstraction of predicates and projection of predicate arguments. Finally, we describe an application of relevance reasoning to the domain of modeling physical devices.
Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples
Wilks, Christopher; Gaddipati, Phani; Nellore, Abhinav
2018-01-01
Motivation: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Results: Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Availability and implementation: Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Contact: chris.wilks@jhu.edu or langmea@cs.jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28968689
Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples.
Wilks, Christopher; Gaddipati, Phani; Nellore, Abhinav; Langmead, Ben
2018-01-01
As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Contact: chris.wilks@jhu.edu or langmea@cs.jhu.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
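The combination of an ordered index over junction coordinates with an inverted index over samples, as described in the abstract above, can be illustrated with a small sketch. This is not Snaptron's code or API: the toy records, the `query` function and its parameters are hypothetical, and a sorted list searched with `bisect` stands in for a B-tree.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical toy junctions: (id, chromosome, start, end, {sample_id: read_count}).
# The id equals the position in this list so it can be used for direct lookup.
JUNCTIONS = [
    (0, "chr1", 100, 200, {"s1": 5, "s2": 1}),
    (1, "chr1", 150, 400, {"s2": 7}),
    (2, "chr1", 500, 900, {"s1": 2, "s3": 4}),
    (3, "chr2", 120, 300, {"s3": 9}),
]

# Ordered index: per chromosome, a sorted list of (start, junction_id).
by_start = defaultdict(list)
# Inverted index: sample_id -> set of junction ids observed in that sample.
by_sample = defaultdict(set)

for jid, chrom, start, end, coverage in JUNCTIONS:
    by_start[chrom].append((start, jid))
    for sample_id in coverage:
        by_sample[sample_id].add(jid)
for chrom in by_start:
    by_start[chrom].sort()


def query(chrom, lo, hi, samples=None):
    """Return ids of junctions fully contained in [lo, hi] on chrom.

    If samples is given, keep only junctions observed in at least one of them.
    """
    index = by_start[chrom]
    # Binary search narrows the scan to junctions whose start lies in [lo, hi].
    left = bisect_left(index, (lo, -1))
    right = bisect_right(index, (hi, float("inf")))
    hits = [jid for _, jid in index[left:right] if JUNCTIONS[jid][3] <= hi]
    if samples is not None:
        allowed = set().union(*(by_sample[s] for s in samples))
        hits = [jid for jid in hits if jid in allowed]
    return hits
```

Constraining the region first and the samples second keeps the candidate list short before the set membership test; a real query planner would choose the order based on index selectivity.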
Ontology-Driven Provenance Management in eScience: An Application in Parasite Research
NASA Astrophysics Data System (ADS)
Sahoo, Satya S.; Weatherly, D. Brent; Mutharaju, Raghava; Anantharam, Pramod; Sheth, Amit; Tarleton, Rick L.
Provenance, from the French word "provenir", describes the lineage or history of a data entity. Provenance is critical information in scientific applications to verify experiment process, validate data quality and associate trust values with scientific results. Current industrial scale eScience projects require an end-to-end provenance management infrastructure. This infrastructure needs to be underpinned by formal semantics to enable analysis of large scale provenance information by software applications. Further, effective analysis of provenance information requires well-defined query mechanisms to support complex queries over large datasets. This paper introduces an ontology-driven provenance management infrastructure for biology experiment data, as part of the Semantic Problem Solving Environment (SPSE) for Trypanosoma cruzi (T.cruzi). This provenance infrastructure, called T.cruzi Provenance Management System (PMS), is underpinned by (a) a domain-specific provenance ontology called Parasite Experiment ontology, (b) specialized query operators for provenance analysis, and (c) a provenance query engine. The query engine uses a novel optimization technique based on materialized views called materialized provenance views (MPV) to scale with increasing data size and query complexity. This comprehensive ontology-driven provenance infrastructure not only allows effective tracking and management of ongoing experiments in the Tarleton Research Group at the Center for Tropical and Emerging Global Diseases (CTEGD), but also enables researchers to retrieve the complete provenance information of scientific results for publication in literature.
Bientzle, Martina; Griewatz, Jan; Kimmerle, Joachim; Küppers, Julia; Cress, Ulrike; Lammerding-Koeppel, Maria
2015-11-25
Medical expert forums on the Internet play an increasing role in patient counseling. Therefore, it is important to understand how doctor-patient communication is influenced in such forums both by features of the patients or advice seekers, as expressed in their forum queries, and by characteristics of the medical experts involved. In this experimental study, we aimed to examine in what way (1) the particular wording of patient queries and (2) medical experts' therapeutic health concepts (for example, beliefs around adhering to a distinctly scientific understanding of diagnosis and treatment and a clear focus on evidence-based medicine) impact communication behavior of the medical experts in an Internet forum. Advanced medical students (in their ninth semester of medical training) were recruited as participants. Participation in the online forum was part of a communication training embedded in a gynecology course. We first measured their biomedical therapeutic health concept (hereinafter called "biomedical concept"). Then they participated in an online forum where they answered fictitious patient queries about mammography screening that either included scientific or emotional wording in a between-group design. We analyzed participants' replies with regard to the following dimensions: their use of scientific or emotional wording, the amount of communicated information, and their attempt to build a positive doctor-patient relationship. This study was carried out with 117 medical students (73 women, 41 men, 3 did not indicate their sex). We found evidence that both the wording of patient queries and the participants' biomedical concept influenced participants' response behavior. They answered emotional patient queries in a more emotional way (mean 0.92, SD 1.02) than scientific patient queries (mean 0.26, SD 0.55; t74=3.48, P<.001, d=0.81). 
We also found a significant interaction effect between participants' use of scientific or emotional wording and type of patient query (F2,74=10.29, P<.01, partial η²=0.12) indicating that participants used scientific wording independently of the type of patient query, whereas they used emotional wording particularly when replying to emotional patient queries. In addition, the more pronounced the medical experts' biomedical concept was, the more scientifically (adjusted β=.20; F1,75=2.95, P=.045) and the less emotionally (adjusted β=-.22; F1,74=3.66, P=.03) they replied to patient queries. Finally, we found that participants' biomedical concept predicted their engagement in relationship building (adjusted β=-.26): The more pronounced their biomedical concept was, the less they attempted to build a positive doctor-patient relationship (F1,74=5.39, P=.02). Communication training for medical experts could aim to address this issue of recognizing patients' communication styles and needs in certain situations in order to teach medical experts how to take those aspects adequately into account. In addition, communication training should also make medical experts aware of their individual therapeutic health concepts and the consequential implications in communication situations.
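The effect size reported for emotional versus scientific queries (d=0.81) can be reproduced from the means and standard deviations given in the abstract. The sketch below assumes equal group sizes, so the pooled standard deviation is the root mean square of the two SDs; the abstract does not state the group sizes, so this is an assumption rather than the authors' exact computation.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d with a pooled SD that assumes equally sized groups."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Values from the abstract: emotional queries (mean 0.92, SD 1.02)
# versus scientific queries (mean 0.26, SD 0.55).
d = cohens_d(0.92, 1.02, 0.26, 0.55)
print(round(d, 2))  # prints 0.81
```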
Accelerating Research Impact in a Learning Health Care System
Elwy, A. Rani; Sales, Anne E.; Atkins, David
2017-01-01
Background: Since 1998, the Veterans Health Administration (VHA) Quality Enhancement Research Initiative (QUERI) has supported more rapid implementation of research into clinical practice. Objectives: With the passage of the Veterans Access, Choice and Accountability Act of 2014 (Choice Act), QUERI further evolved to support VHA’s transformation into a Learning Health Care System by aligning science with clinical priority goals based on a strategic planning process and alignment of funding priorities with updated VHA priority goals in response to the Choice Act. Design: QUERI updated its strategic goals in response to independent assessments mandated by the Choice Act that recommended VHA reduce variation in care by providing a clear path to implement best practices. Specifically, QUERI updated its application process to ensure its centers (Programs) focus on cross-cutting VHA priorities and specify roadmaps for implementation of research-informed practices across different settings. QUERI also increased funding for scientific evaluations of the Choice Act and other policies in response to Commission on Care recommendations. Results: QUERI’s national network of Programs deploys effective practices using implementation strategies across different settings. QUERI Choice Act evaluations informed the law’s further implementation, setting the stage for additional rigorous national evaluations of other VHA programs and policies including community provider networks. Conclusions: Grounded in implementation science and evidence-based policy, QUERI serves as an example of how to operationalize core components of a Learning Health Care System, notably through rigorous evaluation and scientific testing of implementation strategies to ultimately reduce variation in quality and improve overall population health. PMID:27997456
An XML-Based Manipulation and Query Language for Rule-Based Information
NASA Astrophysics Data System (ADS)
Mansour, Essam; Höpfner, Hagen
Rules are utilized to assist in the monitoring process that is required in activities such as disease management and customer relationship management. These rules are specified according to the application's best practices. Most research efforts emphasize the specification and execution of these rules. Few focus on managing these rules as one object that has a management life-cycle. This paper presents our manipulation and query language, which was developed to facilitate the maintenance of this object during its life-cycle and to query the information it contains. The language is based on an XML-based model. Furthermore, we evaluate the model and language using a prototype system applied to a clinical case study.
Small sum privacy and large sum utility in data publishing.
Fu, Ada Wai-Chee; Wang, Ke; Wong, Raymond Chi-Wing; Wang, Jia; Jiang, Minhao
2014-08-01
While the study of privacy preserving data publishing has drawn a lot of interest, some recent work has shown that existing mechanisms do not limit all inferences about individuals. This paper is a positive note in response to this finding. We point out that not all inference attacks should be countered, in contrast to all existing works known to us, and based on this we propose a model called SPLU. This model protects sensitive information, by which we refer to answers for aggregate queries with small sums, while queries with large sums are answered with higher accuracy. Using SPLU, we introduce a sanitization algorithm to protect data while maintaining high data utility for queries with large sums. Empirical results show that our method behaves as desired. Copyright © 2014 Elsevier Inc. All rights reserved.
Brolan, Claire E; Te, Vannarath; Floden, Nadia; Hill, Peter S; Forman, Lisa
2017-01-01
Since the new global health and development goal, Sustainable Development Goal (SDG) 3, and its nine targets and four means of implementation were introduced to the world through a United Nations (UN) General Assembly resolution in September 2015, right to health practitioners have queried whether this goal mirrors the content of the human right to health in international law. This study examines the text of the UN SDG resolution, Transforming our world: the 2030 Agenda for Sustainable Development, from a right to health minimalist and right to health maximalist analytic perspective. When reviewing the UN SDG resolution's text, a right to health minimalist questions whether the content of the right to health is at least implicitly included in this document, specifically focusing on SDG 3 and its metrics framework. A right to health maximalist, on the other hand, queries whether the content of the right to health is explicitly included. This study finds that whether the right to health is contained in the UN SDG resolution, and the SDG metrics therein, ultimately depends on the individual analyst's subjective persuasion in relation to right to health minimalism or maximalism. We conclude that the UN General Assembly's lack of cogency on the right to health's position in the UN SDG resolution will continue to blur if not divest human rights' (and specifically the right to health's) integral relationship to high-level development planning, implementation and SDG monitoring and evaluation efforts.
Heterogeneous database integration in biomedicine.
Sujansky, W
2001-08-01
The rapid expansion of biomedical knowledge, reduction in computing costs, and spread of internet access have created an ocean of electronic data. The decentralized nature of our scientific community and healthcare system, however, has resulted in a patchwork of diverse, or heterogeneous, database implementations, making access to and aggregation of data across databases very difficult. The database heterogeneity problem applies equally to clinical data describing individual patients and biological data characterizing our genome. Specifically, databases are highly heterogeneous with respect to the data models they employ, the data schemas they specify, the query languages they support, and the terminologies they recognize. Heterogeneous database systems attempt to unify disparate databases by providing uniform conceptual schemas that resolve representational heterogeneities, and by providing querying capabilities that aggregate and integrate distributed data. Research in this area has applied a variety of database and knowledge-based techniques, including semantic data modeling, ontology definition, query translation, query optimization, and terminology mapping. Existing systems have addressed heterogeneous database integration in the realms of molecular biology, hospital information systems, and application portability.
Gao, JianZhao; Tao, Xue-Wen; Zhao, Jia; Feng, Yuan-Ming; Cai, Yu-Dong; Zhang, Ning
2017-01-01
Lysine acetylation, one type of post-translational modification (PTM), plays key roles in cellular regulation and can be involved in a variety of human diseases. However, identifying lysine acetylation sites with traditional experimental approaches is often costly and time-consuming. Therefore, effective computational methods should be developed to predict the acetylation sites. In this study, we developed a position-specific method for epsilon lysine acetylation site prediction. Sequences of acetylated proteins were retrieved from the UniProt database. Various kinds of features such as position specific scoring matrix (PSSM), amino acid factors (AAF), and disorders were incorporated. A feature selection method based on mRMR (Maximum Relevance Minimum Redundancy) and IFS (Incremental Feature Selection) was employed. Finally, 319 optimal features were selected from a total of 541 features. Using the 319 optimal features to encode peptides, a predictor was constructed based on dagging. As a result, an accuracy of 69.56% with an MCC of 0.2792 was achieved. We analyzed the optimal features, which suggested some important factors determining the lysine acetylation sites. We developed a position-specific method for epsilon lysine acetylation site prediction. A set of optimal features was selected. Analysis of the optimal features provided insights into the mechanism of lysine acetylation sites and guidance for experimental validation. Copyright © Bentham Science Publishers.
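The two performance figures quoted above, accuracy and the Matthews correlation coefficient (MCC), are both functions of the binary confusion matrix. A minimal sketch of the standard formulas follows; the counts in the usage line are hypothetical and are not taken from the study.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, for illustration only.
example_accuracy = accuracy(tp=90, tn=80, fp=20, fn=10)
example_mcc = mcc(tp=90, tn=80, fp=20, fn=10)
```

Unlike accuracy, MCC stays near zero for a predictor that favors the majority class, which is why the two are often reported together for imbalanced site-prediction data.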
Why do people google movement disorders? An infodemiological study of information seeking behaviors.
Brigo, Francesco; Erro, Roberto
2016-05-01
Millions of people worldwide search Google or Wikipedia every day for health-related information. The aim of this study was to evaluate and interpret web search queries for terms related to movement disorders (MD) in English-speaking countries and their changes over time. We analyzed information regarding the volume of online searches in Google and Wikipedia for the most common MD and their treatments. We determined the highest search volume peaks to identify possible relations with online news headlines. The volume of searches for some MD-related queries entered in Google increased enormously over time. Most queries were related to definition, subtypes, symptoms and treatment (mostly to adverse effects or to possible alternative treatments). The highest peaks of MD search queries were temporally related to news about celebrities suffering from MD, to specific mass-media events or to news concerning pharmaceutical companies or scientific discoveries on MD. An increasing number of people use Google and Wikipedia to look for terms related to MD to obtain information on definitions, causes and symptoms, possibly to aid initial self-diagnosis. MD information demand and the actual prevalence of different MDs do not travel together: web search volume may mirror patients' fears and worries about some particular disorders perceived as more serious than others, or may be driven by the release of news about celebrities suffering from MD, "breaking news" or specific mass-media events regarding MD.
Web image retrieval using an effective topic and content-based technique
NASA Astrophysics Data System (ADS)
Lee, Ching-Cheng; Prabhakara, Rashmi
2005-03-01
There has been exponential growth in the amount of image data available on the World Wide Web since the early development of the Internet. With such a large amount of useful information and images available, an effective image retrieval system is greatly needed. In this paper, we present an effective approach with both image matching and indexing techniques that improves on existing integrated image retrieval methods. This technique follows a two-phase approach, integrating query by topic and query by example specification methods. In the first phase, topic-based image retrieval is performed by using an improved text information retrieval (IR) technique that makes use of the structured format of HTML documents. This technique consists of a focused crawler that lets the user enter not only the keyword for the topic-based search but also the scope in which the user wants to find the images. In the second phase, we use query by example specification to perform a low-level content-based image match in order to retrieve a smaller and relatively closer set of results for the example image. From this, information related to the image feature is automatically extracted from the query image. The main objective of our approach is to develop a functional image search and indexing technique and to demonstrate that better retrieval results can be achieved.
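The two-phase idea described above (narrow by topic text first, then rank the survivors by low-level similarity to an example image) can be sketched as follows. The abstract does not specify the image features or the similarity measure, so the colour histograms, the histogram-intersection score, the record layout and all names here are assumptions for illustration.

```python
def histogram_intersection(hist_a, hist_b):
    """Similarity of two normalized histograms: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def two_phase_search(images, keyword, example_hist, top_k=2):
    """Phase 1: keep images whose surrounding text mentions the keyword.
    Phase 2: rank those candidates by similarity to the example image."""
    keyword = keyword.lower()
    candidates = [img for img in images if keyword in img["text"].lower()]
    ranked = sorted(
        candidates,
        key=lambda img: histogram_intersection(img["hist"], example_hist),
        reverse=True,
    )
    return [img["name"] for img in ranked[:top_k]]


# Hypothetical crawl results: text taken from the page near each image,
# plus a 3-bin normalized colour histogram.
IMAGES = [
    {"name": "a.jpg", "text": "Red sunset over the beach", "hist": [0.7, 0.2, 0.1]},
    {"name": "b.jpg", "text": "Beach volleyball", "hist": [0.2, 0.3, 0.5]},
    {"name": "c.jpg", "text": "Mountain sunset", "hist": [0.7, 0.2, 0.1]},
    {"name": "d.jpg", "text": "Quiet beach at dawn", "hist": [0.4, 0.3, 0.3]},
]

results = two_phase_search(IMAGES, "beach", example_hist=[0.6, 0.3, 0.1])
```

Running the cheap text filter first means the more expensive content comparison only touches images that are already on topic.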
Ong, Edison; Xiang, Zuoshuang; Zhao, Bin; Liu, Yue; Lin, Yu; Zheng, Jie; Mungall, Chris; Courtot, Mélanie; Ruttenberg, Alan; He, Yongqun
2017-01-04
Linked Data (LD) aims to achieve interconnected data by representing entities using Uniform Resource Identifiers (URIs), and sharing information using Resource Description Frameworks (RDFs) and HTTP. Ontologies, which logically represent entities and relations in specific domains, are the basis of LD. Ontobee (http://www.ontobee.org/) is a linked ontology data server that stores ontology information using RDF triple store technology and supports query, visualization and linkage of ontology terms. Ontobee is also the default linked data server for publishing and browsing biomedical ontologies in the Open Biological Ontology (OBO) Foundry (http://obofoundry.org) library. Ontobee currently hosts more than 180 ontologies (including 131 OBO Foundry Library ontologies) with over four million terms. Ontobee provides a user-friendly web interface for querying and visualizing the details and hierarchy of a specific ontology term. Using the eXtensible Stylesheet Language Transformation (XSLT) technology, Ontobee is able to dereference a single ontology term URI, and then output RDF/eXtensible Markup Language (XML) for computer processing or display the HTML information on a web browser for human users. Statistics and detailed information are generated and displayed for each ontology listed in Ontobee. In addition, a SPARQL web interface is provided for custom advanced SPARQL queries of one or multiple ontologies. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Kettunen, Jyrki; Eirola, Emil; Paakkonen, Heikki
2018-01-01
Background: Some of the temporal variations and clock-like rhythms that govern several different health-related behaviors can be traced in near real-time with the help of search engine data. This is especially useful when studying phenomena where little or no traditional data exist. One specific area where traditional data are incomplete is the study of diurnal mood variations, or daily changes in individuals’ overall mood state in relation to depression-like symptoms. Objective: The objective of this exploratory study was to analyze diurnal variations for interest in depression on the Web to discover hourly patterns of depression interest and help seeking. Methods: Hourly query volume data for 6 depression-related queries in Finland were downloaded from Google Trends in March 2017. A continuous wavelet transform (CWT) was applied to the hourly data to focus on the diurnal variation. Longer term trends and noise were also eliminated from the data to extract the diurnal variation for each query term. An analysis of variance was conducted to determine the statistical differences between the distributions of each hour. Data were also trichotomized and analyzed in 3 time blocks to make comparisons between different time periods during the day. Results: Search volumes for all depression-related query terms showed a unimodal regular pattern during the 24 hours of the day. All queries feature clear peaks during the nighttime hours around 11 PM to 4 AM and troughs between 5 AM and 10 PM. In the means of the CWT-reconstructed data, the differences in nighttime and daytime interest are evident, with a difference of 37.3 percentage points (pp) for the term “Depression,” 33.5 pp for “Masennustesti,” 30.6 pp for “Masennus,” 12.8 pp for “Depression test,” 12.0 pp for “Masennus testi,” and 11.8 pp for “Masennus oireet.” The trichotomization showed peaks in the first time block (00.00 AM-7.59 AM) for all 6 terms. 
The search volumes then decreased significantly during the second time block (8.00 AM-3.59 PM) for the terms “Masennus oireet” (P<.001), “Masennus” (P=.001), “Depression” (P=.005), and “Depression test” (P=.004). Higher search volumes for the terms “Masennus” (P=.14), “Masennustesti” (P=.07), and “Depression test” (P=.10) were present between the second and third time blocks. Conclusions: Help seeking for depression has clear diurnal patterns, with significant rise in depression-related query volumes toward the evening and night. Thus, search engine query data support the notion of the evening-worse pattern in diurnal mood variation. Information on the timely nature of depression-related interest on an hourly level could improve the chances for early intervention, which is beneficial for positive health outcomes. PMID:29792291
Tana, Jonas Christoffer; Kettunen, Jyrki; Eirola, Emil; Paakkonen, Heikki
2018-05-23
Some of the temporal variations and clock-like rhythms that govern several different health-related behaviors can be traced in near real-time with the help of search engine data. This is especially useful when studying phenomena where little or no traditional data exist. One specific area where traditional data are incomplete is the study of diurnal mood variations, or daily changes in individuals' overall mood state in relation to depression-like symptoms. The objective of this exploratory study was to analyze diurnal variations for interest in depression on the Web to discover hourly patterns of depression interest and help seeking. Hourly query volume data for 6 depression-related queries in Finland were downloaded from Google Trends in March 2017. A continuous wavelet transform (CWT) was applied to the hourly data to focus on the diurnal variation. Longer term trends and noise were also eliminated from the data to extract the diurnal variation for each query term. An analysis of variance was conducted to determine the statistical differences between the distributions of each hour. Data were also trichotomized and analyzed in 3 time blocks to make comparisons between different time periods during the day. Search volumes for all depression-related query terms showed a unimodal regular pattern during the 24 hours of the day. All queries feature clear peaks during the nighttime hours around 11 PM to 4 AM and troughs between 5 AM and 10 PM. In the means of the CWT-reconstructed data, the differences in nighttime and daytime interest are evident, with a difference of 37.3 percentage points (pp) for the term "Depression," 33.5 pp for "Masennustesti," 30.6 pp for "Masennus," 12.8 pp for "Depression test," 12.0 pp for "Masennus testi," and 11.8 pp for "Masennus oireet." The trichotomization showed peaks in the first time block (00.00 AM-7.59 AM) for all 6 terms. 
The search volumes then decreased significantly during the second time block (8.00 AM-3.59 PM) for the terms "Masennus oireet" (P<.001), "Masennus" (P=.001), "Depression" (P=.005), and "Depression test" (P=.004). Higher search volumes for the terms "Masennus" (P=.14), "Masennustesti" (P=.07), and "Depression test" (P=.10) were present between the second and third time blocks. Help seeking for depression has clear diurnal patterns, with significant rise in depression-related query volumes toward the evening and night. Thus, search engine query data support the notion of the evening-worse pattern in diurnal mood variation. Information on the timely nature of depression-related interest on an hourly level could improve the chances for early intervention, which is beneficial for positive health outcomes. ©Jonas Christoffer Tana, Jyrki Kettunen, Emil Eirola, Heikki Paakkonen. Originally published in JMIR Mental Health (http://mental.jmir.org), 23.05.2018.
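The trichotomization step described above, in which hourly volumes are grouped into three 8-hour blocks and then compared, can be sketched in a few lines. The first two block boundaries follow the abstract; the third (16:00 to 23:59) is inferred as the remainder of the day. The data are synthetic, and the wavelet filtering and significance testing used in the study are not reproduced here.

```python
from statistics import mean

# Hour ranges for the three blocks. Blocks 1 and 2 follow the abstract;
# block 3 is inferred as the remaining hours of the day.
BLOCKS = {
    "block1": range(0, 8),    # 00:00-07:59
    "block2": range(8, 16),   # 08:00-15:59
    "block3": range(16, 24),  # 16:00-23:59
}


def block_means(observations):
    """Average volume per block for (hour, volume) pairs, hour in 0..23."""
    grouped = {name: [] for name in BLOCKS}
    for hour, volume in observations:
        for name, hours in BLOCKS.items():
            if hour in hours:
                grouped[name].append(volume)
                break
    return {name: mean(values) for name, values in grouped.items() if values}


# Synthetic relative search volumes, for illustration only.
synthetic = [(h, 80 if h < 8 else 40 if h < 16 else 60) for h in range(24)]
means = block_means(synthetic)
```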
NASA Astrophysics Data System (ADS)
McWhirter, J.; Boler, F. M.; Bock, Y.; Jamason, P.; Squibb, M. B.; Noll, C. E.; Blewitt, G.; Kreemer, C. W.
2010-12-01
Three geodesy Archive Centers, Scripps Orbit and Permanent Array Center (SOPAC), NASA's Crustal Dynamics Data Information System (CDDIS) and UNAVCO are engaged in a joint effort to define and develop a common Web Service Application Programming Interface (API) for accessing geodetic data holdings. This effort is funded by the NASA ROSES ACCESS Program to modernize the original GPS Seamless Archive Centers (GSAC) technology which was developed in the 1990s. A new web service interface, the GSAC-WS, is being developed to provide uniform and expanded mechanisms through which users can access our data repositories. In total, our respective archives hold tens of millions of files and contain a rich collection of site/station metadata. Though we serve similar user communities, we currently provide a range of different access methods, query services and metadata formats. This leads to a lack of consistency in the user's experience and a duplication of engineering efforts. The GSAC-WS API and its reference implementation in an underlying Java-based GSAC Service Layer (GSL) supports metadata and data queries into site/station oriented data archives. The general nature of this API makes it applicable to a broad range of data systems. The overall goals of this project include providing consistent and rich query interfaces for end users and client programs, the development of enabling technology to facilitate third party repositories in developing these web service capabilities and to enable the ability to perform data queries across a collection of federated GSAC-WS enabled repositories. A fundamental challenge faced in this project is to provide a common suite of query services across a heterogeneous collection of data yet enabling each repository to expose their specific metadata holdings. To address this challenge we are developing a "capabilities" based service where a repository can describe its specific query and metadata capabilities. 
Furthermore, the architecture of the GSL is based on a model-view paradigm that decouples the underlying data model semantics from particular representations of the data model. This will allow for the GSAC-WS enabled repositories to evolve their service offerings to incorporate new metadata definition formats (e.g., ISO-19115, FGDC, JSON, etc.) and new techniques for accessing their holdings. Building on the core GSAC-WS implementations the project is also developing a federated/distributed query service. This service will seamlessly integrate with the GSAC Service Layer and will support data and metadata queries across a collection of federated GSAC repositories.
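The "capabilities" idea and the federated query service described above can be illustrated with a small sketch in which each repository advertises the query parameters it supports and a federated query is sent only to repositories that can answer it. This is not the GSAC-WS API: the class, the function, the repository names and the metadata fields are all hypothetical.

```python
class Repository:
    """A toy archive that advertises which query parameters it supports."""

    def __init__(self, name, capabilities, sites):
        self.name = name
        self.capabilities = set(capabilities)
        self.sites = sites  # list of metadata dicts, one per site

    def describe(self):
        """Return the advertised capabilities in a stable order."""
        return sorted(self.capabilities)

    def query(self, **criteria):
        """Return sites matching every criterion; reject unsupported ones."""
        unsupported = set(criteria) - self.capabilities
        if unsupported:
            raise ValueError(f"{self.name} does not support: {sorted(unsupported)}")
        return [
            site for site in self.sites
            if all(site.get(key) == value for key, value in criteria.items())
        ]


def federated_query(repositories, **criteria):
    """Query only the repositories whose capabilities cover the criteria."""
    results = {}
    for repo in repositories:
        if set(criteria) <= repo.capabilities:
            results[repo.name] = repo.query(**criteria)
    return results


# Hypothetical repositories and metadata fields.
REPOS = [
    Repository("repo_a", ["site_code", "country"], [
        {"site_code": "AAA1", "country": "US"},
        {"site_code": "BBB2", "country": "CA"},
    ]),
    Repository("repo_b", ["site_code", "country", "antenna_type"], [
        {"site_code": "CCC3", "country": "US", "antenna_type": "choke_ring"},
    ]),
]
```

A call such as `federated_query(REPOS, antenna_type="choke_ring")` skips `repo_a` rather than failing, which is one way a common query suite can coexist with repository-specific metadata.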
JBioWH: an open-source Java framework for bioinformatics data integration
Vera, Roberto; Perez-Riverol, Yasset; Perez, Sonia; Ligeti, Balázs; Kertész-Farkas, Attila; Pongor, Sándor
2013-01-01
The Java BioWareHouse (JBioWH) project is an open-source platform-independent programming framework that allows a user to build his/her own integrated database from the most popular data sources. JBioWH can be used for intensive querying of multiple data sources and the creation of streamlined task-specific data sets on local PCs. JBioWH is based on a MySQL relational database scheme and includes JAVA API parser functions for retrieving data from 20 public databases (e.g. NCBI, KEGG, etc.). It also includes a client desktop application for (non-programmer) users to query data. In addition, JBioWH can be tailored for use in specific circumstances, including the handling of massive queries for high-throughput analyses or CPU intensive calculations. The framework is provided with complete documentation and application examples and it can be downloaded from the Project Web site at http://code.google.com/p/jbiowh. A MySQL server is available for demonstration purposes at hydrax.icgeb.trieste.it:3307. Database URL: http://code.google.com/p/jbiowh PMID:23846595
JBioWH: an open-source Java framework for bioinformatics data integration.
Vera, Roberto; Perez-Riverol, Yasset; Perez, Sonia; Ligeti, Balázs; Kertész-Farkas, Attila; Pongor, Sándor
2013-01-01
The Java BioWareHouse (JBioWH) project is an open-source platform-independent programming framework that allows a user to build his/her own integrated database from the most popular data sources. JBioWH can be used for intensive querying of multiple data sources and the creation of streamlined task-specific data sets on local PCs. JBioWH is based on a MySQL relational database scheme and includes JAVA API parser functions for retrieving data from 20 public databases (e.g. NCBI, KEGG, etc.). It also includes a client desktop application for (non-programmer) users to query data. In addition, JBioWH can be tailored for use in specific circumstances, including the handling of massive queries for high-throughput analyses or CPU intensive calculations. The framework is provided with complete documentation and application examples and it can be downloaded from the Project Web site at http://code.google.com/p/jbiowh. A MySQL server is available for demonstration purposes at hydrax.icgeb.trieste.it:3307. Database URL: http://code.google.com/p/jbiowh.
Semantator: semantic annotator for converting biomedical text to linked data.
Tao, Cui; Song, Dezhao; Sharma, Deepak; Chute, Christopher G
2013-10-01
More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools. The annotated results will be stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and was used by the biomedical ontology community, who provided positive feedback. Our evaluation results indicated that (1) Semantator can perform the annotation functionalities as designed; (2) Semantator can be adopted in real applications in clinical and transactional research; and (3) the annotated results using Semantator can be easily used in Semantic-web-based reasoning tools for further inference. Copyright © 2013 Elsevier Inc. All rights reserved.
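Storing annotations as subject-predicate-object triples and querying them by pattern, as the abstract describes for RDF and SPARQL, can be illustrated with a toy in-memory matcher. This is plain Python standing in for an RDF store and a SPARQL engine, not Semantator's implementation; the identifiers and the `ex:` vocabulary are hypothetical.

```python
# Hypothetical annotations extracted from two clinical notes.
TRIPLES = [
    ("note1#a1", "rdf:type", "ex:Medication"),
    ("note1#a1", "ex:text", "aspirin"),
    ("note1#a2", "rdf:type", "ex:Diagnosis"),
    ("note1#a2", "ex:text", "migraine"),
    ("note2#a1", "rdf:type", "ex:Medication"),
    ("note2#a1", "ex:text", "ibuprofen"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def texts_of_type(triples, annotation_type):
    """Join two patterns: find annotations of a type, then their text."""
    subjects = [s for s, _, _ in match(triples, p="rdf:type", o=annotation_type)]
    return [o for s in subjects for _, _, o in match(triples, s=s, p="ex:text")]


medications = texts_of_type(TRIPLES, "ex:Medication")
```

The join in `texts_of_type` corresponds to a query with two triple patterns sharing one variable, which is the basic building block of SPARQL graph patterns.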
2011-03-01
Figure 7. RDS preferences widget after loading an unusual font (left) and RDS SPARQL query widget (right)...Entered By Individual: SGT Juan Gonzalez DOI: 2007-01-06 13:00:00 Date Entered: 2007-01-06 23:32:03 Subject: Al-Qaeda Reading Material Source...NetKernel and RDS-specific modules are specified with a URL
MANTIS: a phylogenetic framework for multi-species genome comparisons.
Tzika, Athanasia C; Helaers, Raphaël; Van de Peer, Yves; Milinkovitch, Michel C
2008-01-15
Practitioners of comparative genomics face huge analytical challenges as whole genome sequences and functional/expression data accumulate. Furthermore, the field would greatly benefit from a better integration of this wealth of data with evolutionary concepts. Here, we present MANTIS, a relational database for the analysis of (i) gains and losses of genes on specific branches of the metazoan phylogeny, (ii) reconstructed genome content of ancestral species and (iii) over- or under-representation of functions/processes and tissue specificity of gained, duplicated and lost genes. MANTIS estimates the most likely positions of gene losses on the true phylogeny using a maximum-likelihood function. A user-friendly interface and an extensive query system allow users to investigate questions pertaining to gene identity, phylogenetic mapping and function/expression parameters. MANTIS is freely available at http://www.mantisdb.org and constitutes the missing link between multi-species genome comparisons and functional analyses.
Geographic Video 3d Data Model And Retrieval
NASA Astrophysics Data System (ADS)
Han, Z.; Cui, C.; Kong, Y.; Wu, H.
2014-04-01
Geographic video includes both spatial and temporal geographic features acquired through ground-based or non-ground-based cameras. With the popularity of video capture devices such as smartphones, the volume of user-generated geographic video clips has grown significantly, and this growth is accelerating. Such a massive and increasing volume poses a major challenge to efficient video management and query. Most of today's video management and query techniques are based on signal-level content extraction and cannot fully utilize the geographic information of the videos. This paper introduces a geographic video 3D data model based on spatial information. The main idea of the model is to utilize the location, trajectory and azimuth information acquired by sensors such as GPS receivers and 3D electronic compasses in conjunction with video contents. The raw spatial information is synthesized into point, line, polygon and solid geometries according to camcorder parameters such as focal length and angle of view. For video segments and video frames, we define three categories of geometry objects using the geometry model of the OGC Simple Features Specification for SQL. Videos can then be queried by computing the spatial relation between query objects and these geometry objects, such as VFLocation, VSTrajectory, VSFOView and VFFovCone. We describe the query methods in detail using the structured query language (SQL). The experiments indicate that the model is a multi-objective, integrated, loosely coupled, flexible and extensible data model for the management of geographic stereo video.
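The field-of-view query described in this abstract can be illustrated with a small planar sketch. This is not the authors' implementation (their objects are OGC geometries queried through SQL); it assumes local metric x/y coordinates, an azimuth measured clockwise from north, and an assumed maximum viewing range. The names `FrameFov`, `fov_contains` and `query_frames` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class FrameFov:
    """Hypothetical per-frame field of view in local planar coordinates (metres)."""
    x: float            # camera easting
    y: float            # camera northing
    azimuth_deg: float  # viewing direction, clockwise from north
    angle_deg: float    # horizontal angle of view
    range_m: float      # assumed maximum visible distance


def fov_contains(fov: FrameFov, px: float, py: float) -> bool:
    """Return True if the point (px, py) lies inside the 2D view sector."""
    dx, dy = px - fov.x, py - fov.y
    dist = math.hypot(dx, dy)
    if dist > fov.range_m:
        return False
    if dist == 0.0:
        return True  # the camera position itself counts as visible
    # Bearing of the point as seen from the camera, clockwise from north.
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0
    # Smallest absolute angular difference between bearing and azimuth.
    diff = abs((bearing - fov.azimuth_deg + 180.0) % 360.0 - 180.0)
    return diff <= fov.angle_deg / 2.0


def query_frames(frames, px, py):
    """Return the indices of all frames whose view sector covers the point."""
    return [i for i, f in enumerate(frames) if fov_contains(f, px, py)]
```

For example, `query_frames(frames, 0.0, 50.0)` returns the indices of the frames that look at a point 50 m north of the origin. A production system would express the same test as a spatial predicate over stored geometries rather than in application code.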
Associative memory model for searching an image database by image snippet
NASA Astrophysics Data System (ADS)
Khan, Javed I.; Yun, David Y.
1994-09-01
This paper presents an associative memory called multidimensional holographic associative computing (MHAC), which can potentially be used to perform feature-based image database queries using an image snippet. MHAC has the unique capability to selectively focus on specific segments of a query frame during associative retrieval. As a result, this model can perform a search on the basis of featural significance described by a subset of the snippet pixels. This capability is critical for visual query in image databases because quite often the cognitive index features in the snippet are statistically weak. Unlike conventional artificial associative memories, MHAC uses a two-level representation and incorporates additional meta-knowledge about the reliability status of segments of information it receives and forwards. In this paper we present the analysis of focus characteristics of MHAC.
Motivated Proteins: A web application for studying small three-dimensional protein motifs
Leader, David P; Milner-White, E James
2009-01-01
Background Small loop-shaped motifs are common constituents of the three-dimensional structure of proteins. Typically they comprise between three and seven amino acid residues, and are defined by a combination of dihedral angles and hydrogen bonding partners. The most abundant of these are αβ-motifs, asx-motifs, asx-turns, β-bulges, β-bulge loops, β-turns, nests, niches, Schellmann loops, ST-motifs, ST-staples and ST-turns. We have constructed a database of such motifs from a range of high-quality protein structures and built a web application as a visual interface to this. Description The web application, Motivated Proteins, provides access to these 12 motifs (with 48 sub-categories) in a database of over 400 representative proteins. Queries can be made for specific categories or sub-categories of motif, motifs in the vicinity of ligands, motifs which include part of an enzyme active site, overlapping motifs, or motifs which include a particular amino acid sequence. Individual proteins can be specified, or, where appropriate, motifs for all proteins listed. The results of queries are presented in textual form as an (X)HTML table, and may be saved as parsable plain text or XML. Motifs can be viewed and manipulated either individually or in the context of the protein in the Jmol applet structural viewer. Cartoons of the motifs imposed on a linear representation of protein secondary structure are also provided. Summary information for the motifs is available, as are histograms of amino acid distribution, and graphs of dihedral angles at individual positions in the motifs. Conclusion Motivated Proteins is a publicly and freely accessible web application that enables protein scientists to study small three-dimensional motifs without requiring knowledge of either Structured Query Language or the underlying database schema. PMID:19210785
Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks.
Raisaro, Jean Louis; Tramèr, Florian; Ji, Zhanglong; Bu, Diyue; Zhao, Yongan; Carey, Knox; Lloyd, David; Sofia, Heidi; Baker, Dixie; Flicek, Paul; Shringarpure, Suyash; Bustamante, Carlos; Wang, Shuang; Jiang, Xiaoqian; Ohno-Machado, Lucila; Tang, Haixu; Wang, XiaoFeng; Hubaux, Jean-Pierre
2017-07-01
The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or "beacon") is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertains and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual's whole genome sequence), the individual's membership in a beacon can be inferred through repeated queries for variants present in the individual's genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.
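The third strategy in this abstract, a per-user budget on each individual genome, can be sketched as a toy beacon. This is a deliberately simplified illustration, not the paper's method: it charges one unit per "yes" contribution, whereas the abstract does not say how the budget is charged. The class `BudgetedBeacon` and its parameters are hypothetical.

```python
from collections import defaultdict


class BudgetedBeacon:
    """Toy beacon answering allele-presence queries under a per-user budget."""

    def __init__(self, genomes, budget_per_genome):
        # genomes: dict mapping individual id -> set of (chromosome, position, allele)
        self.genomes = genomes
        self.budget = budget_per_genome
        # spent[user][individual] counts how often that individual's genome
        # has contributed to a "yes" answer for that user (assumed charging rule).
        self.spent = defaultdict(lambda: defaultdict(int))

    def query(self, user, chrom, pos, allele):
        """Return True if any carrier with remaining budget has the allele."""
        variant = (chrom, pos, allele)
        carriers = [g for g, variants in self.genomes.items() if variant in variants]
        # Genomes whose budget is exhausted for this user no longer answer.
        allowed = [g for g in carriers if self.spent[user][g] < self.budget]
        for g in allowed:
            self.spent[user][g] += 1
        return bool(allowed)
```

Once a user has exhausted the budget of every carrier of a variant, the beacon answers "no" to that user, which limits how much evidence repeated queries can accumulate about any single genome.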
An interactive system for computer-aided diagnosis of breast masses.
Wang, Xingwei; Li, Lihua; Liu, Wei; Xu, Weidong; Lederman, Dror; Zheng, Bin
2012-10-01
Although mammography is the only clinically accepted imaging modality for screening the general population to detect breast cancer, interpreting mammograms is difficult and yields relatively low sensitivity and specificity. To provide radiologists with "a visual aid" in interpreting mammograms, we developed and tested an interactive system for computer-aided detection and diagnosis (CAD) of mass-like cancers. Using this system, an observer can view CAD-cued mass regions depicted on one image and then query any suspicious regions (either cued or not cued by CAD). The CAD scheme automatically segments the suspicious region or accepts a manually defined region and computes a set of image features. Using a content-based image retrieval (CBIR) algorithm, CAD searches for a set of reference images depicting "abnormalities" similar to the queried region. Based on image retrieval results and a decision algorithm, a classification score is assigned to the queried region. In this study, a reference database with 1,800 malignant mass regions and 1,800 benign and CAD-generated false-positive regions was used. A modified CBIR algorithm with a new function of stretching the attributes in the multi-dimensional space and decision scheme was optimized using a genetic algorithm. Using a leave-one-out testing method to classify suspicious mass regions, we compared the classification performance using two CBIR algorithms with either equally weighted or optimally stretched attributes. Using the modified CBIR algorithm, the area under the receiver operating characteristic curve was significantly increased from 0.865 ± 0.006 to 0.897 ± 0.005 (p < 0.001). This study demonstrated the feasibility of developing an interactive CAD system with a large reference database and achieving improved performance.
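The retrieval-then-score idea in this abstract can be sketched as a nearest-neighbour search with per-attribute weights. This is only an assumed reading: "stretching" is modelled here as a weight on each squared feature difference, and the score as the malignant fraction among the k most similar references. The abstract does not give the actual decision algorithm or the genetic-algorithm optimization, and the function names are hypothetical.

```python
import math


def stretched_distance(a, b, weights):
    """Euclidean distance with each attribute axis scaled by an assumed weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def classification_score(query, references, weights, k=3):
    """Score a queried region by its k most similar reference regions.

    references: list of (feature_vector, is_malignant) pairs.
    Returns the fraction of malignant regions among the k nearest references.
    """
    ranked = sorted(references, key=lambda r: stretched_distance(query, r[0], weights))
    nearest = ranked[:k]
    return sum(1 for _, malignant in nearest if malignant) / len(nearest)
```

Changing the weights changes which references count as "similar", which is why optimizing them can alter the classification score for the same query region.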
Bibliometric analysis on Australian rural health publications from 2006 to 2012.
Mendis, Kumara; Edwards, Tegan; Stevens, Wendy; McCrossin, Tim
2014-08-01
To review Australian rural health (ARH) publications in PubMed from 2006 to 2012 and address ARH issues raised by the 2013 Health and Medical Research report. Retrospective observational study. Internet-based bibliometric analysis using PubMed. MEDLINE-indexed ARH publications from 2006 to 2012 were retrieved using PubMed queries. ARH publications were defined as Australian publications that explore issues relevant to the health of the regional, rural or remote Australian population. Two authors independently reviewed a random sample of 5% of publications for validity. Analysis determined country of origin (Australia); publications relevant to the National Health Priority Areas, the 2013 National Rural Health Alliance priority areas and Rural Clinical Schools/University Departments of Rural Health; and journal frequencies and publication types. ARH publications increased from 286 in 2006 to 393 in 2012 and made up 1.4% of all Australian PubMed publications. Combined, the health priority areas were addressed in 52% of ARH publications. Rural Clinical Schools/University Departments of Rural Health articles made up 7% of ARH publications. An increase in cohort studies, systematic reviews and reviews indicated improved quality of articles. ARH articles were most commonly published in the Australian Journal of Rural Health (15.9%), Rural and Remote Health (13.4%) and the Medical Journal of Australia (6.3%). Striking a balance between broadening the queries (increasing sensitivity) and limiting the false positives by restricting the breadth of the queries (increasing specificity) was the main limitation. This reproducible analysis, repeated at given timelines, can track the progress of ARH publications and provide directions regarding future rural health research. © 2014 National Rural Health Alliance Inc.
Yu, Amy Y X; Quan, Hude; McRae, Andrew; Wagner, Gabrielle O; Hill, Michael D; Coutts, Shelagh B
2017-09-18
Validation of administrative data case definitions is key for accurate passive surveillance of disease. Transient ischemic attack (TIA) is a condition primarily managed in the emergency department. However, prior validation studies have focused on data after inpatient hospitalization. We aimed to determine the validity of the Canadian 10th International Classification of Diseases (ICD-10-CA) codes for TIA in the national ambulatory administrative database. We performed a diagnostic accuracy study of four ICD-10-CA case definition algorithms for TIA in the emergency department setting. The study population was obtained from two ongoing studies on the diagnosis of TIA and minor stroke versus stroke mimic using serum biomarkers and neuroimaging. Two reference standards were used 1) the emergency department clinical diagnosis determined by chart abstractors and 2) the 90-day final diagnosis, both obtained by stroke neurologists, to calculate the sensitivity, specificity, positive and negative predictive values (PPV and NPV) of the ICD-10-CA algorithms for TIA. Among 417 patients, emergency department adjudication showed 163 (39.1%) TIA, 155 (37.2%) ischemic strokes, and 99 (23.7%) stroke mimics. The most restrictive algorithm, defined as a TIA code in the main position had the lowest sensitivity (36.8%), but highest specificity (92.5%) and PPV (76.0%). The most inclusive algorithm, defined as a TIA code in any position with and without query prefix had the highest sensitivity (63.8%), but lowest specificity (81.5%) and PPV (68.9%). Sensitivity, specificity, PPV, and NPV were overall lower when using the 90-day diagnosis as reference standard. Emergency department administrative data reflect diagnosis of suspected TIA with high specificity, but underestimate the burden of disease. Future studies are necessary to understand the reasons for the low to moderate sensitivity.
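The four accuracy measures reported in this abstract follow directly from a 2x2 table of algorithm result against reference standard. The sketch below uses the standard definitions; the function name and the counts in the usage note are illustrative only and are not taken from the study.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Compute standard diagnostic accuracy measures from 2x2 counts.

    tp: code present, reference positive    fp: code present, reference negative
    fn: code absent, reference positive     tn: code absent, reference negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases the code captures
        "specificity": tn / (tn + fp),  # share of non-cases correctly not coded
        "ppv": tp / (tp + fp),          # share of coded cases that are true cases
        "npv": tn / (tn + fn),          # share of uncoded cases that are non-cases
    }
```

With hypothetical counts `diagnostic_accuracy(tp=40, fp=5, fn=10, tn=45)`, sensitivity is 0.8 and specificity is 0.9. A more restrictive case definition typically moves counts from `tp` to `fn` and from `fp` to `tn`, lowering sensitivity while raising specificity, which is the trade-off the abstract reports.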
Hynes, Denise M; Weddle, Timothy; Smith, Nina; Whittier, Erika; Atkins, David; Francis, Joseph
2010-01-01
As the Department of Veterans Affairs (VA) Health Services Research and Development Service's Quality Enhancement Research Initiative (QUERI) has progressed, health information technology (HIT) has occupied a crucial role in implementation research projects. We evaluated the role of HIT in VA QUERI implementation research, including HIT use and development, the contributions implementation research has made to HIT development, and HIT-related barriers and facilitators to implementation research. Key informants from nine disease-specific QUERI Centers. Documentation analysis of 86 implementation project abstracts followed up by semi-structured interviews with key informants from each of the nine QUERI centers. We used qualitative and descriptive analyses. We found: (1) HIT provided data and information to facilitate implementation research, (2) implementation research helped to further HIT development in a variety of uses including the development of clinical decision support systems (23 of 86 implementation research projects), and (3) common HIT barriers to implementation research existed but could be overcome by collaborations with clinical and administrative leadership. Our review of the implementation research progress in the VA revealed interdependency on an HIT infrastructure and research-based development. Collaboration with multiple stakeholders is a key factor in successful use and development of HIT in implementation research efforts and in advancing evidence-based practice.
Random and Directed Walk-Based Top-k Queries in Wireless Sensor Networks
Fu, Jun-Song; Liu, Yun
2015-01-01
In wireless sensor networks, filter-based top-k query approaches are the state-of-the-art solutions and have been extensively researched in the literature. However, they are very sensitive to the network parameters, including the size of the network, the dynamics of the sensors’ readings and declines in the overall range of all the readings. In this work, a random walk-based top-k query approach called RWTQ and a directed walk-based top-k query approach called DWTQ are proposed. At the beginning of a top-k query, one or several tokens are sent to the specific node(s) in the network by the base station. Then, each token walks in the network independently to record and process the readings in a random or directed way. A strategy of choosing the “right” way in DWTQ is carefully designed for the token(s) to arrive at the high-value regions as soon as possible. When designing the walking strategy for DWTQ, the spatial correlations of the readings are also considered. Theoretical analysis and simulation results indicate that both RWTQ and DWTQ are very robust against the parameters discussed above. In addition, DWTQ outperforms TAG, FILA and EXTOK in transmission cost, energy consumption and network lifetime. PMID:26016914
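The token-walk idea in this abstract can be illustrated with a single token performing a uniform random walk while keeping the k largest readings it has seen. This is a minimal sketch under simplifying assumptions (one token, a static graph, no energy or transmission model) and is not the RWTQ protocol itself; the function name and parameters are hypothetical.

```python
import heapq
import random


def random_walk_top_k(neighbors, readings, start, k, steps, seed=0):
    """Walk a token over the network and collect the k largest readings seen.

    neighbors: dict mapping node -> list of adjacent nodes
    readings:  dict mapping node -> sensor reading
    Returns a list of (reading, node) pairs sorted from largest to smallest.
    """
    rng = random.Random(seed)
    visited = {start}
    top = [(readings[start], start)]  # min-heap holding the current top-k
    node = start
    for _ in range(steps):
        if not neighbors[node]:
            break  # isolated node: the token cannot move
        node = rng.choice(neighbors[node])
        if node in visited:
            continue  # reading already processed by this token
        visited.add(node)
        item = (readings[node], node)
        if len(top) < k:
            heapq.heappush(top, item)
        elif item > top[0]:
            heapq.heapreplace(top, item)  # evict the smallest kept reading
    return sorted(top, reverse=True)
```

A directed variant in the spirit of DWTQ would replace the uniform `rng.choice` with a rule that prefers neighbours likely to hold high readings; the abstract does not specify that rule, so it is not sketched here.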
Federated ontology-based queries over cancer data
2012-01-01
Background Personalised medicine provides patients with treatments that are specific to their genetic profiles. It requires efficient data sharing of disparate data types across a variety of scientific disciplines, such as molecular biology, pathology, radiology and clinical practice. Personalised medicine aims to offer the safest and most effective therapeutic strategy based on the gene variations of each subject. In particular, this is valid in oncology, where knowledge about genetic mutations has already led to new therapies. Current molecular biology techniques (microarrays, proteomics, epigenetic technology and improved DNA sequencing technology) enable better characterisation of cancer tumours. The vast amounts of data, however, coupled with the use of different terms - or semantic heterogeneity - in each discipline makes the retrieval and integration of information difficult. Results Existing software infrastructures for data-sharing in the cancer domain, such as caGrid, support access to distributed information. caGrid follows a service-oriented model-driven architecture. Each data source in caGrid is associated with metadata at increasing levels of abstraction, including syntactic, structural, reference and domain metadata. The domain metadata consists of ontology-based annotations associated with the structural information of each data source. However, caGrid's current querying functionality is given at the structural metadata level, without capitalising on the ontology-based annotations. This paper presents the design of and theoretical foundations for distributed ontology-based queries over cancer research data. Concept-based queries are reformulated to the target query language, where join conditions between multiple data sources are found by exploiting the semantic annotations. The system has been implemented, as a proof of concept, over the caGrid infrastructure. The approach is applicable to other model-driven architectures. 
A graphical user interface has been developed, supporting ontology-based queries over caGrid data sources. An extensive evaluation of the query reformulation technique is included. Conclusions To support personalised medicine in oncology, it is crucial to retrieve and integrate molecular, pathology, radiology and clinical data in an efficient manner. The semantic heterogeneity of the data makes this a challenging task. Ontologies provide a formal framework to support querying and integration. This paper provides an ontology-based solution for querying distributed databases over service-oriented, model-driven infrastructures. PMID:22373043
KnowledgeLink: Impact of Context-Sensitive Information Retrieval on Clinicians' Information Needs
Maviglia, Saverio M.; Yoon, Catherine S.; Bates, David W.; Kuperman, Gilad
2006-01-01
Objective: Infobuttons are message-based content search and retrieval functions embedded within other applications that dynamically return information relevant to the clinical task at hand. The objective of this study was to determine whether infobuttons effectively answer providers' questions about medications or affect patient care decisions. Design: The authors implemented and evaluated a medication infobutton application called KnowledgeLink. Health care providers at 18 outpatient clinics were randomized to one of two versions of KnowledgeLink, one that linked to information from Micromedex (Thomson Micromedex, Greenwood Village, Co) and the other to material from SkolarMD (Wolters Kluwer Health, Palo Alto, CA). Measurements: Data were collected about the frequency of use and demographics of users, patients, and drugs that were queried. Users were periodically surveyed with short questionnaires and then with a more extensive survey at the end of one year. Results: During the first year, KnowledgeLink was used 7,972 times by 359 users to look up information about 1,723 medications for 4,961 patients. Clinicians used KnowledgeLink twice a month on average, and during an average of 1.2% of patient encounters. KnowledgeLink was used by a wide variety of medical staff, not just physicians and nurse practitioners. The frequency of usage and the questions asked varied with user role (primary care physician, specialist physician, nurse practitioner). Although the median KnowledgeLink session was brief (21 seconds), KnowledgeLink answered users' queries 84% of the time, and altered patient care decisions 15% of the time. Users rated KnowledgeLink favorably on multiple scales, recommended extending KnowledgeLink to other content domains, and suggested enhancing the interface to allow refinement of the query and selection of the target resource. Conclusion: An infobutton can satisfy information needs about medications. 
Although used infrequently and for brief sessions, KnowledgeLink was positively received, answered most users' questions, and had a significant impact on medical decision making. The next steps would be to broaden the domains that KnowledgeLink covers to more specifically tailor results to the user type, to provide options when queries are not immediately answered, and to implement KnowledgeLink within other electronic clinical applications. PMID:16221942
Challenging a dogma: five-year survival does not equal cure in all colorectal cancer patients.
Abdel-Rahman, Omar
2018-02-01
The current study evaluated the factors affecting 10- to 20-year survival among long-term survivors (>5 years) of colorectal cancer (CRC). The Surveillance, Epidemiology and End Results (SEER) database (1988-2008) was queried through the SEER*Stat program. Univariate probability of overall and cancer-specific survival was determined and the difference between groups was examined. Multivariate analysis for factors affecting overall and cancer-specific survival was also conducted. Among node-positive patients (Dukes C), 34% of the deaths beyond 5 years can be attributed to CRC, while among M1 patients, 63% of the deaths beyond 5 years can be attributed to CRC. The following factors were predictors of better overall survival in multivariate analysis: younger age, white race (versus black race), female gender, right colon location (versus rectal location), earlier stage and surgery (P < 0.0001 for all parameters). Similarly, the following factors were predictors of better cancer-specific survival in multivariate analysis: younger age, white race (versus black race), female gender, right colon location (versus left colon and rectal locations), earlier stage and surgery (P < 0.0001 for all parameters). Among node-positive long-term CRC survivors, more than one third of all deaths can be attributed to CRC.
Accuracy of telephone reference service in health sciences libraries.
Paskoff, B M
1991-01-01
Six factual queries were unobtrusively telephoned to fifty-one U.S. academic health sciences and hospital libraries. The majority of the queries (63.4%) were answered accurately. Referrals to another library or information source were made for 25.2% of the queries. Eleven answers (3.6%) were inaccurate, and no answer was provided for 7.8% of the queries. There was a correlation between the number of accurate answers provided and the presence of at least one staff member with a master's degree in library and information science. The correlation between employing a librarian certified by the Medical Library Association (MLA) and providing accurate answers was significant. The majority of referrals were to specific sources. If these "helpful referrals" are counted with accurate answers as correct responses, they total 76.8% of the answers. In a follow-up survey, five libraries stated that they did not provide accurate answers because they did not own an appropriate source. Staff-related problems were given as reasons for other than accurate answers by two of the libraries, while eight indicated that library policy prevented them from providing answers to the public. PMID:2039904
Hripcsak, George; Knirsch, Charles; Zhou, Li; Wilcox, Adam; Melton, Genevieve B
2007-03-01
Data mining in electronic medical records may facilitate clinical research, but much of the structured data may be miscoded, incomplete, or non-specific. The exploitation of narrative data using natural language processing may help, although nesting, varying granularity, and repetition remain challenges. In a study of community-acquired pneumonia using electronic records, these issues led to poor classification. Limiting queries to accurate, complete records led to vastly reduced, possibly biased samples. We exploited knowledge latent in the electronic records to improve classification. A similarity metric was used to cluster cases. We defined discordance as the degree to which cases within a cluster give different answers for some query that addresses a classification task of interest. Cases with higher discordance are more likely to be incorrectly classified, and can be reviewed manually to adjust the classification, improve the query, or estimate the likely accuracy of the query. In a study of pneumonia--in which the ICD9-CM coding was found to be very poor--the discordance measure was statistically significantly correlated with classification correctness (.45; 95% CI .15-.62).
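The discordance measure defined in this abstract can be sketched for a yes/no query. The formula below is one assumed reading, since the abstract gives only the verbal definition: a case's discordance is the fraction of the other cases in its cluster whose query answer differs from its own. The function and argument names are hypothetical.

```python
from collections import defaultdict


def discordance(cluster_of, answer_of):
    """Score each case by how much its cluster disagrees with its query answer.

    cluster_of: dict mapping case id -> cluster label
    answer_of:  dict mapping case id -> answer the query gives for that case
    Returns a dict mapping case id -> discordance in [0, 1].
    """
    members = defaultdict(list)
    for case, cluster in cluster_of.items():
        members[cluster].append(case)

    scores = {}
    for case, cluster in cluster_of.items():
        others = [c for c in members[cluster] if c != case]
        if not others:
            scores[case] = 0.0  # singleton cluster: nothing to disagree with
            continue
        differing = sum(1 for c in others if answer_of[c] != answer_of[case])
        scores[case] = differing / len(others)
    return scores
```

Cases with the highest scores are the natural candidates for manual review, because they sit among similar cases that the query classifies differently.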
NASA Technical Reports Server (NTRS)
Ong, K. G.; Wang, J.; Singh, R. S.; Bachas, L. G.; Grimes, C. A.; Daunert, S. (Principal Investigator)
2001-01-01
A new technique is presented for in-vivo remote query measurement of the complex permittivity spectra of a biological culture solution. A sensor comprised of a printed inductor-capacitor resonant-circuit is placed within the culture solution of interest, with the impedance spectrum of the sensor measured using a remotely located loop antenna; the complex permittivity spectra of the culture is calculated from the measured impedance spectrum. The remote query nature of the sensor platform enables, for example, the in-vivo real-time monitoring of bacteria or yeast growth from within sealed opaque containers. The wireless monitoring technique does not require a specific alignment between sensor and antenna. Results are presented for studies conducted on laboratory strains of Bacillus subtilis, Escherichia coli JM109, Pseudomonas putida and Saccharomyces cerevisiae.
Ballesteros, Michael F.; Webb, Kevin; McClure, Roderick J.
2017-01-01
Introduction The Centers for Disease Control and Prevention (CDC) developed the Web-based Injury Statistics Query and Reporting System (WISQARS™) to meet the data needs of injury practitioners. In 2015, CDC completed a Portfolio Review of this system to inform its future development. Methods Evaluation questions addressed utilization, technology and innovation, data sources, and tools and training. Data were collected through environmental scans, a review of peer-reviewed and grey literature, a web search, and stakeholder interviews. Results Review findings led to specific recommendations for each evaluation question. Response CDC reviewed each recommendation and initiated several enhancements that will improve the ability of injury prevention practitioners to leverage these data, better make sense of query results, and incorporate findings and key messages into prevention practices. PMID:28454867
Published physiologically based pharmacokinetic (PBPK) models from peer-reviewed articles are often well-parameterized, thoroughly-vetted, and can be utilized as excellent resources for the construction of models pertaining to related chemicals. Specifically, chemical-specific pa...
Brolan, Claire E; Te, Vannarath; Floden, Nadia; Hill, Peter S; Forman, Lisa
2017-01-01
Since the new global health and development goal, Sustainable Development Goal (SDG) 3, and its nine targets and four means of implementation were introduced to the world through a United Nations (UN) General Assembly resolution in September 2015, right to health practitioners have queried whether this goal mirrors the content of the human right to health in international law. This study examines the text of the UN SDG resolution, Transforming our world: the 2030 Agenda for Sustainable Development, from a right to health minimalist and right to health maximalist analytic perspective. When reviewing the UN SDG resolution’s text, a right to health minimalist questions whether the content of the right to health is at least implicitly included in this document, specifically focusing on SDG 3 and its metrics framework. A right to health maximalist, on the other hand, queries whether the content of the right to health is explicitly included. This study finds that whether the right to health is contained in the UN SDG resolution, and the SDG metrics therein, ultimately depends on the individual analyst’s subjective persuasion in relation to right to health minimalism or maximalism. We conclude that the UN General Assembly’s lack of cogency on the right to health’s position in the UN SDG resolution will continue to blur if not divest human rights’ (and specifically the right to health’s) integral relationship to high-level development planning, implementation and SDG monitoring and evaluation efforts. PMID:29225946
Secure and Privacy-Preserving Body Sensor Data Collection and Query Scheme.
Zhu, Hui; Gao, Lijuan; Li, Hui
2016-02-01
With the development of body sensor networks and the pervasiveness of smart phones, different types of personal data can be collected in real time by body sensors, and the potential value of massive personal data has attracted considerable interest recently. However, the privacy issues of sensitive personal data are still challenging today. Aiming at these challenges, in this paper, we focus on the threats from telemetry interface and present a secure and privacy-preserving body sensor data collection and query scheme, named SPCQ, for outsourced computing. In the proposed SPCQ scheme, users' personal information is collected by body sensors in different types and converted into multi-dimension data, and each dimension is converted into the form of a number and uploaded to the cloud server, which provides a secure, efficient and accurate data query service, while the privacy of sensitive personal information and users' query data is guaranteed. Specifically, based on an improved homomorphic encryption technology over composite order group, we propose a special weighted Euclidean distance contrast algorithm (WEDC) for multi-dimension vectors over encrypted data. With the SPCQ scheme, the confidentiality of sensitive personal data, the privacy of data users' queries and accurate query service can be achieved in the cloud server. Detailed analysis shows that SPCQ can resist various security threats from telemetry interface. In addition, we also implement SPCQ on an embedded device, smart phone and laptop with a real medical database, and extensive simulation results demonstrate that our proposed SPCQ scheme is highly efficient in terms of computation and communication costs.
Noar, Seth M; Ribisl, Kurt M; Althouse, Benjamin M; Willoughby, Jessica Fitts; Ayers, John W
2013-12-01
Announcements of cancer diagnoses from public figures may stimulate cancer information seeking and media coverage about cancer. This study used digital surveillance to quantify the effects of pancreatic cancer public figure announcements on online cancer information seeking and cancer media coverage. We compiled a list of public figures (N = 25) who had been diagnosed with or had died from pancreatic cancer between 2006 and 2011. We specified interrupted time series models using data from Google Trends to examine search query shifts for pancreatic cancer and other cancers. Weekly media coverage archived on Google News was also analyzed. Most public figures' pancreatic cancer announcements corresponded with no appreciable change in pancreatic cancer search queries or media coverage. In contrast, Patrick Swayze's diagnosis was associated with a 285% (95% confidence interval [CI]: 212 to 360) increase in pancreatic cancer search queries, though it was only weakly associated with increases in pancreatic cancer media coverage. Steve Jobs's death was associated with a 197% (95% CI: 131 to 266) increase in pancreatic cancer queries and a 3517% (95% CI: 2882 to 4492) increase in pancreatic cancer media coverage. In general, a doubling in pancreatic cancer-specific media coverage corresponded with a 325% increase in pancreatic cancer queries. Digital surveillance is an important tool for future cancer control research and practice. The current application of these methods suggested that pancreatic cancer announcements (diagnosis or death) by particular public figures stimulated media coverage of and online information seeking for pancreatic cancer.
Secure and Privacy-Preserving Body Sensor Data Collection and Query Scheme
Zhu, Hui; Gao, Lijuan; Li, Hui
2016-01-01
With the development of body sensor networks and the pervasiveness of smart phones, different types of personal data can be collected in real time by body sensors, and the potential value of massive personal data has attracted considerable interest recently. However, the privacy issues of sensitive personal data are still challenging today. Aiming at these challenges, in this paper, we focus on the threats from telemetry interface and present a secure and privacy-preserving body sensor data collection and query scheme, named SPCQ, for outsourced computing. In the proposed SPCQ scheme, users’ personal information is collected by body sensors in different types and converted into multi-dimension data, and each dimension is converted into the form of a number and uploaded to the cloud server, which provides a secure, efficient and accurate data query service, while the privacy of sensitive personal information and users’ query data is guaranteed. Specifically, based on an improved homomorphic encryption technology over composite order group, we propose a special weighted Euclidean distance contrast algorithm (WEDC) for multi-dimension vectors over encrypted data. With the SPCQ scheme, the confidentiality of sensitive personal data, the privacy of data users’ queries and accurate query service can be achieved in the cloud server. Detailed analysis shows that SPCQ can resist various security threats from telemetry interface. In addition, we also implement SPCQ on an embedded device, smart phone and laptop with a real medical database, and extensive simulation results demonstrate that our proposed SPCQ scheme is highly efficient in terms of computation and communication costs. PMID:26840319
Automatic generation of investigator bibliographies for institutional research networking systems.
Johnson, Stephen B; Bales, Michael E; Dine, Daniel; Bakken, Suzanne; Albert, Paul J; Weng, Chunhua
2014-10-01
Publications are a key data source for investigator profiles and research networking systems. We developed ReCiter, an algorithm that automatically extracts bibliographies from PubMed using institutional information about the target investigators. ReCiter executes a broad query against PubMed, groups the results into clusters that appear to constitute distinct author identities and selects the cluster that best matches the target investigator. Using information about investigators from one of our institutions, we compared ReCiter results to queries based on author name and institution and to citations extracted manually from the Scopus database. Five judges created a gold standard using citations of a random sample of 200 investigators. About half of the 10,471 potential investigators had no matching citations in PubMed, and about 45% had fewer than 70 citations. Interrater agreement (Fleiss' kappa) for the gold standard was 0.81. Scopus achieved the best recall (sensitivity) of 0.81, while name-based queries had 0.78 and ReCiter had 0.69. ReCiter attained the best precision (positive predictive value) of 0.93 while Scopus had 0.85 and name-based queries had 0.31. ReCiter accesses the most current citation data, uses limited computational resources and minimizes manual entry by investigators. Generation of bibliographies using name-based queries will not yield high accuracy. Proprietary databases can perform well but require manual effort. Automated generation with higher recall is possible but requires additional knowledge about investigators. Copyright © 2014 Elsevier Inc. All rights reserved.
Automatic generation of investigator bibliographies for institutional research networking systems
Johnson, Stephen B.; Bales, Michael E.; Dine, Daniel; Bakken, Suzanne; Albert, Paul J.; Weng, Chunhua
2014-01-01
Objective: Publications are a key data source for investigator profiles and research networking systems. We developed ReCiter, an algorithm that automatically extracts bibliographies from PubMed using institutional information about the target investigators. Methods: ReCiter executes a broad query against PubMed, groups the results into clusters that appear to constitute distinct author identities and selects the cluster that best matches the target investigator. Using information about investigators from one of our institutions, we compared ReCiter results to queries based on author name and institution and to citations extracted manually from the Scopus database. Five judges created a gold standard using citations of a random sample of 200 investigators. Results: About half of the 10,471 potential investigators had no matching citations in PubMed, and about 45% had fewer than 70 citations. Interrater agreement (Fleiss’ kappa) for the gold standard was 0.81. Scopus achieved the best recall (sensitivity) of 0.81, while name-based queries had 0.78 and ReCiter had 0.69. ReCiter attained the best precision (positive predictive value) of 0.93 while Scopus had 0.85 and name-based queries had 0.31. Discussion: ReCiter accesses the most current citation data, uses limited computational resources and minimizes manual entry by investigators. Generation of bibliographies using name-based queries will not yield high accuracy. Proprietary databases can perform well but require manual effort. Automated generation with higher recall is possible but requires additional knowledge about investigators. PMID:24694772
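The interrater agreement statistic reported above, Fleiss' kappa, can be computed from a table of rating counts. The following is a minimal sketch of the standard formula; the rating table is a hypothetical illustration with five raters and is not the study's data.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters assigning item i to category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Proportion of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical table: 4 citations, 5 judges, categories (accept, reject).
ratings = [[5, 0], [3, 2], [0, 5], [4, 1]]
print(round(fleiss_kappa(ratings), 3))  # prints 0.479
```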
[Limiting a Medline/PubMed query to the "best" articles using the JCR relative impact factor].
Avillach, P; Kerdelhué, G; Devos, P; Maisonneuve, H; Darmoni, S J
2014-12-01
Medline/PubMed is the most frequently used medical bibliographic research database. The aim of this study was to propose a new generic method to limit any Medline/PubMed query based on the relative impact factor and the A & B categories of the SIGAPS score. The feasibility study used the entire PubMed corpus, then ten diseases that are frequent in terms of PubMed indexing, and then the citations of four Nobel prize winners. The relative impact factor (RIF) was calculated by medical specialty defined in Journal Citation Reports. The two queries, which included all the journals in category A (or A OR B), were added to any Medline/PubMed query as a central point of the feasibility study. Limitation using the SIGAPS category A was larger than when using the Core Clinical Journals (CCJ): 15.65% of the PubMed corpus vs 8.64% for CCJ. The response time of this limit applied to the entire PubMed corpus was less than two seconds. For five diseases out of ten, limiting the citations with the RIF was more effective than with the CCJ. For the four Nobel prize winners, limiting the citations with the RIF was more effective than with the CCJ. The feasibility study to apply a new filter based on the relative impact factor on any Medline/PubMed query was positive. Copyright © 2014 Elsevier Masson SAS. All rights reserved.
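Adding a journal limit to an arbitrary query amounts to AND-ing the query with an OR-list of journals. The following is a minimal sketch using PubMed's `[ta]` journal field tag; the function name is hypothetical and the two journals are placeholders, not the actual SIGAPS category A list.

```python
from typing import List


def limit_to_journals(query: str, journals: List[str]) -> str:
    """Restrict a PubMed query to a list of journal title abbreviations."""
    if not journals:
        return query
    journal_clause = " OR ".join(f'"{name}"[ta]' for name in journals)
    return f"({query}) AND ({journal_clause})"


# Placeholder journal list for illustration only.
print(limit_to_journals("asthma", ["Lancet", "N Engl J Med"]))
# prints (asthma) AND ("Lancet"[ta] OR "N Engl J Med"[ta])
```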
OpenSearch technology for geospatial resources discovery
NASA Astrophysics Data System (ADS)
Papeschi, Fabrizio; Enrico, Boldrini; Mazzetti, Paolo
2010-05-01
In 2005, the term Web 2.0 was coined by Tim O'Reilly to describe a quickly growing set of Web-based applications that share a common philosophy of "mutually maximizing collective intelligence and added value for each participant by formalized and dynamic information sharing". Around the same period, OpenSearch, a new Web 2.0 technology, was developed. More properly, OpenSearch is a collection of technologies that allow publishing of search results in a format suitable for syndication and aggregation. It is a way for websites and search engines to publish search results in a standard and accessible format. Due to its strong impact on the way the Web is perceived by users and also due to its relevance for businesses, Web 2.0 has attracted the attention of both mass media and the scientific community. The explosive growth in popularity of Web 2.0 technologies like OpenSearch, together with practical applications of Service Oriented Architecture (SOA), has resulted in an increased interest in the similarities, convergence, and potential synergy of these two concepts. SOA is considered as the philosophy of encapsulating application logic in services with a uniformly defined interface and making these publicly available via discovery mechanisms. Service consumers may then retrieve these services, compose and use them according to their current needs. The great degree of similarity between SOA and Web 2.0 may be leading to a convergence between the two paradigms. They also show divergent elements, such as the Web 2.0 support for human interaction as opposed to the typical SOA machine-to-machine interaction. In line with these considerations, the Geospatial Information (GI) domain is also taking its first steps towards a new approach to data publishing and discovery, in particular by taking advantage of the OpenSearch technology.
A specific GI niche is represented by the OGC Catalog Service for Web (CSW) that is part of the OGC Web Services (OWS) specifications suite, which provides a set of services for discovery, access, and processing of geospatial resources in a SOA framework. GI-cat is a distributed CSW framework implementation developed by the ESSI Lab of the Italian National Research Council (CNR-IMAA) and the University of Florence. It provides brokering and mediation functionalities towards heterogeneous resources and inventories, exposing several standard interfaces for query distribution. This work focuses on a new GI-cat interface which allows the catalog to be queried according to the OpenSearch syntax specification, thus filling the gap between the SOA architectural design of the CSW and Web 2.0. At the moment, there is no OGC standard specification about this topic, but an official change request has been proposed in order to enable the OGC catalogues to support OpenSearch queries. In this change request, an OpenSearch extension is proposed providing a standard mechanism to query a resource based on temporal and geographic extents. Two new catalog operations are also proposed, in order to publish a suitable OpenSearch interface. This extended interface is implemented in the modular GI-cat architecture by adding a new profiling module called "OpenSearch profiler". Since GI-cat also acts as a clearinghouse catalog, another component called "OpenSearch accessor" is added in order to access OpenSearch compliant services. An important role in the GI-cat extension is played by the adopted mapping strategy. Two different kinds of mapping are required: query mapping and response element mapping. Query mapping is provided in order to fit the simple OpenSearch query syntax to the complex CSW query expressed by the OGC Filter syntax.
The GI-cat internal data model is based on the ISO-19115 profile, which is more complex than the simple XML syndication formats, such as RSS 2.0 and Atom 1.0, suggested by OpenSearch. Once response elements are available, they need to be translated from the GI-cat internal data model to the above-mentioned syndication formats in order to be presented; the mapping process is bidirectional. When GI-cat is used to access OpenSearch compliant services, the CSW query must be mapped to the OpenSearch query, and the response elements must be translated according to the GI-cat internal data model. As a result of these extensions, GI-cat provides a user-friendly facade to the complex CSW interface, thus enabling it to be queried, for example, using a browser toolbar.
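An OpenSearch query with temporal and geographic extents is typically expressed by filling a URL template. The following is a minimal sketch of that substitution step only. The endpoint and the query-string keys are hypothetical, and the placeholder names `geo:box`, `time:start` and `time:end` are assumed from the OpenSearch Geo and Time extension drafts; the parameter names in the change request itself are not given in the abstract.

```python
import re
from urllib.parse import quote

# Hypothetical template; a trailing "?" inside braces marks an optional parameter.
TEMPLATE = ("http://example.org/gi-cat/opensearch?q={searchTerms}"
            "&bbox={geo:box?}&ts={time:start?}&te={time:end?}")


def fill_template(template: str, values: dict) -> str:
    """Substitute OpenSearch template parameters with URL-encoded values."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe=",:")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")
    return re.sub(r"\{([^{}?]+)(\?)?\}", substitute, template)


url = fill_template(TEMPLATE, {
    "searchTerms": "land cover",
    "geo:box": "-10,35,20,48",      # west,south,east,north
    "time:start": "2009-01-01",
})
print(url)
```

A catalog-side profiler would then translate these simple key-value constraints into the richer CSW filter expression; that translation is not shown here.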
Spatial Query for Planetary Data
NASA Technical Reports Server (NTRS)
Shams, Khawaja S.; Crockett, Thomas M.; Powell, Mark W.; Joswig, Joseph C.; Fox, Jason M.
2011-01-01
Science investigators need to quickly and effectively assess past observations of specific locations on a planetary surface. This innovation involves a location-based search technology that was adapted and applied to planetary science data to support a spatial query capability for mission operations software. High-performance location-based searching requires the use of spatial data structures for database organization. Spatial data structures are designed to organize datasets based on their coordinates in a way that is optimized for location-based retrieval. The particular spatial data structure that was adapted for planetary data search is the R+ tree.
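The basic operation behind location-based retrieval is a bounding-box overlap test. The following is a minimal sketch of that test with a linear scan standing in for the index; it is not an R+ tree, which would avoid the scan by descending only into nodes whose rectangles overlap the search window. All names and coordinates are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def intersects(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes overlap or touch."""
    return (a.min_lon <= b.max_lon and b.min_lon <= a.max_lon and
            a.min_lat <= b.max_lat and b.min_lat <= a.max_lat)


def search(observations: Dict[str, Box], window: Box) -> List[str]:
    """Return the names of observations whose footprint overlaps the window."""
    return sorted(name for name, box in observations.items()
                  if intersects(box, window))


# Hypothetical observation footprints on a planetary surface.
observations = {
    "img_001": Box(10, 10, 20, 20),
    "img_002": Box(30, 30, 40, 40),
    "img_003": Box(18, 18, 35, 35),
}
print(search(observations, Box(15, 15, 19, 19)))  # prints ['img_001', 'img_003']
```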
Predicting user click behaviour in search engine advertisements
NASA Astrophysics Data System (ADS)
Daryaie Zanjani, Mohammad; Khadivi, Shahram
2015-10-01
According to the specific requirements and interests of users, search engines select and display advertisements that match user needs and have higher probability of attracting users' attention based on their previous search history. New objects such as user, advertisement or query cause a deterioration of precision in targeted advertising due to their lack of history. This article surveys this challenge. In the case of new objects, we first extract similar observed objects to the new object and then we use their history as the history of new object. Similarity between objects is measured based on correlation, which is a relation between user and advertisement when the advertisement is displayed to the user. This method is used for all objects, so it has helped us to accurately select relevant advertisements for users' queries. In our proposed model, we assume that similar users behave in a similar manner. We find that users with few queries are similar to new users. We will show that correlation between users and advertisements' keywords is high. Thus, users who pay attention to advertisements' keywords, click similar advertisements. In addition, users who pay attention to specific brand names might have similar behaviours too.
RadSearch: a RIS/PACS integrated query tool
NASA Astrophysics Data System (ADS)
Tsao, Sinchai; Documet, Jorge; Moin, Paymann; Wang, Kevin; Liu, Brent J.
2008-03-01
Radiology Information Systems (RIS) contain a wealth of information that can be used for research, education, and practice management. However, the sheer amount of information available makes querying specific data difficult and time consuming. Previous work has shown that a clinical RIS database and its RIS text reports can be extracted, duplicated and indexed for searches while complying with HIPAA and IRB requirements. This project's intent is to provide a software tool, the RadSearch Toolkit, to allow intelligent indexing and parsing of RIS reports for easy yet powerful searches. In addition, the project aims to seamlessly query and retrieve associated images from the Picture Archiving and Communication System (PACS) in situations where an integrated RIS/PACS is in place - even subselecting individual series, such as in an MRI study. RadSearch's application of simple text parsing techniques to index text-based radiology reports will allow the search engine to quickly return relevant results. This powerful combination will be useful in both private practice and academic settings; administrators can easily obtain complex practice management information such as referral patterns; researchers can conduct retrospective studies with specific, multiple criteria; teaching institutions can quickly and effectively create thorough teaching files.
REDIdb: the RNA editing database.
Picardi, Ernesto; Regina, Teresa Maria Rosaria; Brennicke, Axel; Quagliariello, Carla
2007-01-01
The RNA Editing Database (REDIdb) is an interactive, web-based database created and designed with the aim to allocate RNA editing events such as substitutions, insertions and deletions occurring in a wide range of organisms. The database contains both fully and partially sequenced DNA molecules for which editing information is available either by experimental inspection (in vitro) or by computational detection (in silico). Each record of REDIdb is organized in a specific flat-file containing a description of the main characteristics of the entry, a feature table with the editing events and related details and a sequence zone with both the genomic sequence and the corresponding edited transcript. REDIdb is a relational database in which the browsing and identification of editing sites has been simplified by means of two facilities to either graphically display genomic or cDNA sequences or to show the corresponding alignment. In both cases, all editing sites are highlighted in colour and their relative positions are detailed by mousing over. New editing positions can be directly submitted to REDIdb after a user-specific registration to obtain authorized secure access. This first version of REDIdb database stores 9964 editing events and can be freely queried at http://biologia.unical.it/py_script/search.html.
Earth-Base: A Free And Open Source, RESTful Earth Sciences Platform
NASA Astrophysics Data System (ADS)
Kishor, P.; Heim, N. A.; Peters, S. E.; McClennen, M.
2012-12-01
This presentation describes the motivation, concept, and architecture behind Earth-Base, a web-based, RESTful data-management, analysis and visualization platform for earth sciences data. Traditionally, web applications have been built to access data directly from a database using a scripting language. While such applications are great at bringing results to a wide audience, they are limited in scope to the imagination and capabilities of the application developer. Earth-Base decouples the data store from the web application by introducing an intermediate "data application" tier. The data application's job is to query the data store using self-documented, RESTful URIs, and send the results back formatted as JavaScript Object Notation (JSON). Decoupling the data store from the application allows virtually limitless flexibility in developing applications, whether web-based for human consumption or programmatic for machine consumption. It also allows outside developers to use the data in their own applications, potentially creating applications that the original data creator and app developer may not have even thought of. Standardized specifications for URI-based querying and JSON-formatted results make querying and developing applications easy. URI-based querying also makes it easy to use distributed datasets. Companion mechanisms for querying data snapshots (also known as time-travel), usage tracking and license management, and verification of semantic equivalence of data are also described. The latter promotes the "What You Expect Is What You Get" (WYEIWYG) principle that can aid in data citation and verification.
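The role of the intermediate "data application" tier can be sketched as a function that turns a URI into a JSON response. This is only an illustration under stated assumptions: the collection name, the fields and the in-memory data store are hypothetical and do not reflect Earth-Base's actual URI scheme.

```python
import json
from urllib.parse import parse_qs, urlparse

# Hypothetical in-memory data store standing in for a real database.
DATA = {
    "occurrences": [
        {"id": 1, "taxon": "Trilobita", "period": "Cambrian"},
        {"id": 2, "taxon": "Brachiopoda", "period": "Cambrian"},
        {"id": 3, "taxon": "Trilobita", "period": "Ordovician"},
    ]
}


def handle(uri: str) -> str:
    """Resolve a URI such as /occurrences?period=Cambrian to a JSON string."""
    parsed = urlparse(uri)
    collection = DATA.get(parsed.path.strip("/"), [])
    # Keep the first value given for each query parameter.
    filters = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    rows = [row for row in collection
            if all(str(row.get(key)) == value for key, value in filters.items())]
    return json.dumps({"uri": uri, "count": len(rows), "records": rows})


print(handle("/occurrences?period=Cambrian&taxon=Trilobita"))
```

Because the response is plain JSON keyed by the requesting URI, a browser application and a script can consume the same endpoint without knowing anything about the underlying data store.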
Ji, Yanqing; Ying, Hao; Tran, John; Dews, Peter; Massanari, R Michael
2016-07-19
Finding highly relevant articles from biomedical databases is challenging not only because it is often difficult to accurately express a user's underlying intention through keywords but also because a keyword-based query normally returns a long list of hits with many citations being unwanted by the user. This paper proposes a novel biomedical literature search system, called BiomedSearch, which supports complex queries and relevance feedback. The system employed association mining techniques to build a k-profile representing a user's relevance feedback. More specifically, we developed a weighted interest measure and an association mining algorithm to find the strength of association between a query and each concept in the article(s) selected by the user as feedback. The top concepts were utilized to form a k-profile used for the next-round search. BiomedSearch relies on Unified Medical Language System (UMLS) knowledge sources to map text files to standard biomedical concepts. It was designed to support queries with any levels of complexity. A prototype of BiomedSearch software was made and it was preliminarily evaluated using the Genomics data from TREC (Text Retrieval Conference) 2006 Genomics Track. Initial experiment results indicated that BiomedSearch increased the mean average precision (MAP) for a set of queries. With UMLS and association mining techniques, BiomedSearch can effectively utilize users' relevance feedback to improve the performance of biomedical literature search.
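Mean average precision (MAP), the evaluation measure cited above, averages per-query average precision over a set of queries. The following is a minimal sketch of the standard definition; the document identifiers and relevance judgments are hypothetical and unrelated to the TREC data.

```python
from typing import List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision values at the rank of each relevant hit."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with their ranked results and relevant documents.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(round(mean_average_precision(runs), 4))  # prints 0.6667
```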
An Application Programming Interface for Synthetic Snowflake Particle Structure and Scattering Data
NASA Technical Reports Server (NTRS)
Lammers, Matthew; Kuo, Kwo-Sen
2017-01-01
The work by Kuo and colleagues on growing synthetic snowflakes and calculating their single-scattering properties has demonstrated great potential to improve the retrievals of snowfall. To grant colleagues flexible and targeted access to their large collection of sizes and shapes at fifteen (15) microwave frequencies, we have developed a web-based Application Programming Interface (API) integrated with NASA Goddard's Precipitation Processing System (PPS) Group. It is our hope that the API will enable convenient programmatic utilization of the database. To help users better understand the API's capabilities, we have developed an interactive web interface called the OpenSSP API Query Builder, which implements an intuitive system of mechanisms for selecting shapes, sizes, and frequencies to generate queries, with which the API can then extract and return data from the database. The Query Builder also allows for the specification of normalized particle size distributions by setting pertinent parameters, with which the API can also return mean geometric and scattering properties for each size bin. Additionally, the Query Builder interface enables downloading of raw scattering and particle structure data packages. This presentation will describe some of the challenges and successes associated with developing such an API. Examples of its usage will be shown both through downloading output and pulling it into a spreadsheet, as well as querying the API programmatically and working with the output in code.
Inverse multipath fingerprinting for millimeter wave V2I beam alignment.
DOT National Transportation Integrated Search
2017-05-01
Efficient beam alignment is a crucial component in millimeter wave systems with analog beamforming, especially in fast-changing vehicular settings. This paper uses the vehicles position (e.g., available via GPS) to query the multipath fingerprint ...
Does an Otolaryngology-Specific Database Have Added Value? A Comparative Feasibility Analysis.
Bellmunt, Angela M; Roberts, Rhonda; Lee, Walter T; Schulz, Kris; Pynnonen, Melissa A; Crowson, Matthew G; Witsell, David; Parham, Kourosh; Langman, Alan; Vambutas, Andrea; Ryan, Sheila E; Shin, Jennifer J
2016-07-01
There are multiple nationally representative databases that support epidemiologic and outcomes research, and it is unknown whether an otolaryngology-specific resource would prove indispensable or superfluous. Therefore, our objective was to determine the feasibility of analyses in the National Ambulatory Medical Care Survey (NAMCS) and National Hospital Ambulatory Medical Care Survey (NHAMCS) databases as compared with the otolaryngology-specific Creating Healthcare Excellence through Education and Research (CHEER) database. Parallel analyses in 2 data sets. Ambulatory visits in the United States. To test a fixed hypothesis that could be directly compared between data sets, we focused on a condition with expected prevalence high enough to substantiate availability in both. This query also encompassed a broad span of diagnoses to sample the breadth of available information. Specifically, we compared an assessment of suspected risk factors for sensorineural hearing loss in subjects 0 to 21 years of age, according to a predetermined protocol. We also assessed the feasibility of 6 additional diagnostic queries among all age groups. In the NAMCS/NHAMCS data set, the number of measured observations was not sufficient to support reliable numeric conclusions (percentage standard error among risk factors: 38.6-92.1). Analysis of the CHEER database demonstrated that age, sex, meningitis, and cytomegalovirus were statistically significant factors associated with pediatric sensorineural hearing loss (P < .01). Among the 6 additional diagnostic queries assessed, NAMCS/NHAMCS usage was also infeasible; the CHEER database contained 1585 to 212,521 more observations per annum. An otolaryngology-specific database has added utility when compared with already available national ambulatory databases. © American Academy of Otolaryngology—Head and Neck Surgery Foundation 2016.
Reppe, Linda Amundstuen; Spigset, Olav; Kampmann, Jens Peter; Damkier, Per; Christensen, Hanne Rolighed; Böttiger, Ylva; Schjøtt, Jan
2017-05-01
The aim of this study was to identify structure and language elements affecting the quality of responses from Scandinavian drug information centres (DICs). Six different fictitious drug-related queries were sent to each of seven Scandinavian DICs. The centres were blinded to which queries were part of the study. The responses were assessed qualitatively by six clinical pharmacologists (internal experts) and six general practitioners (GPs, external experts). In addition, linguistic aspects of the responses were evaluated by a plain language expert. The quality of responses was generally judged as satisfactory to good. Presenting specific advice and conclusions was considered to improve the quality of the responses. However, small nuances in language formulations could affect the individual judgments of the experts, e.g. on whether or not advice was given. Some experts preferred the use of primary sources to the use of secondary and tertiary sources. Both internal and external experts criticised the use of abbreviations, professional terminology and study findings that were left unexplained. The plain language expert emphasised the importance of defining and explaining pharmacological terms to ensure that enquirers understand the response as intended. In addition, more use of active voice and a less compressed text structure would be desirable. This evaluation of responses to DIC queries may give some indications of how to improve written responses to drug-related queries with respect to language and text structure. Giving specific advice and precise conclusions and avoiding overly compressed language and non-standard abbreviations may help to reach this goal.
Wong, Paul Wai-Ching; Fu, King-Wa; Yau, Rickey Sai-Pong; Ma, Helen Hei-Man; Law, Yik-Wa; Chang, Shu-Sen; Yip, Paul Siu-Fai
2013-01-11
The Internet's potential impact on suicide is of major public health interest as easy online access to pro-suicide information or specific suicide methods may increase suicide risk among vulnerable Internet users. Little is known, however, about users' actual searching and browsing behaviors of online suicide-related information. To investigate what webpages people actually clicked on after searching with suicide-related queries on a search engine and to examine what queries people used to get access to pro-suicide websites. A retrospective observational study was done. We used a web search dataset released by America Online (AOL). The dataset was randomly sampled from all AOL subscribers' web queries between March and May 2006 and generated by 657,000 service subscribers. We found 5526 search queries (0.026%, 5526/21,000,000) that included the keyword "suicide". The 5526 search queries included 1586 different search terms and were generated by 1625 unique subscribers (0.25%, 1625/657,000). Of these queries, 61.38% (3392/5526) were followed by users clicking on a search result. Of these 3392 queries, 1344 (39.62%) webpages were clicked on by 930 unique users but only 1314 of those webpages were accessible during the study period. Each clicked-through webpage was classified into 11 categories. The categories of the most visited webpages were: entertainment (30.13%; 396/1314), scientific information (18.31%; 240/1314), and community resources (14.53%; 191/1314). Among the 1314 accessed webpages, we could identify only two pro-suicide websites. We found that the search terms used to access these sites included "commiting suicide with a gas oven", "hairless goat", "pictures of murder by strangulation", and "photo of a severe burn". A limitation of our study is that the database may be dated and confined to mainly English webpages. 
Searching or browsing suicide-related or pro-suicide webpages was uncommon, although a small group of users did access websites that contain detailed suicide method information.
Evaluation of an ontological resource for pharmacovigilance.
Jaulent, Marie-Christine; Alecu, Iulian
2009-01-01
In this work, we present a methodology for evaluating an ontology designed in a previous study to describe adverse drug reactions. We evaluate it in terms of its fitness for grouping cases in pharmacovigilance. We define as the gold standard the Standardized MedDRA Queries (SMQs), developed manually to group terms representing similar medical conditions. We perform an automatic search in the ontology in order to retrieve concepts related to the medical conditions. An optimal query is built for each medical condition. The evaluation relies on the comparison between the terms in the SMQ and the terms subsumed by the query. The result is quantified by sensitivity and specificity. We applied this methodology to 24 SMQs and obtained a mean sensitivity of 0.82. This work validates the semantic resource and provides, in perspective, tools to maintain the ontology as knowledge evolves.
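The comparison between the terms in an SMQ and the terms subsumed by a query can be quantified as follows. This is a minimal sketch: the term identifiers are hypothetical, and it assumes a known universe of candidate terms against which true negatives are counted, a detail the abstract does not specify.

```python
from typing import Set, Tuple


def sensitivity_specificity(retrieved: Set[str], gold: Set[str],
                            universe: Set[str]) -> Tuple[float, float]:
    """Sensitivity and specificity of retrieved terms against a gold standard."""
    tp = len(retrieved & gold)             # gold terms the query found
    fn = len(gold - retrieved)             # gold terms the query missed
    fp = len(retrieved - gold)             # extra terms the query returned
    tn = len(universe - retrieved - gold)  # terms correctly left out
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical universe of ten terms, a gold-standard group and a query result.
universe = {f"t{i}" for i in range(10)}
gold = {"t0", "t1", "t2", "t3"}
retrieved = {"t0", "t1", "t2", "t7"}
sens, spec = sensitivity_specificity(retrieved, gold, universe)
print(sens, round(spec, 3))  # prints 0.75 0.833
```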
Supporting diagnosis and treatment in medical care based on Big Data processing.
Lupşe, Oana-Sorina; Crişan-Vida, Mihaela; Stoicu-Tivadar, Lăcrămioara; Bernard, Elena
2014-01-01
With information and data in all domains growing every day, it is difficult to manage and extract useful knowledge for specific situations. This paper presents an integrated system architecture to support the activity in the Ob-Gin departments with further developments in using new technology to manage Big Data processing - using Google BigQuery - in the medical domain. The data collected and processed with Google BigQuery results from different sources: two Obstetrics & Gynaecology Departments, the TreatSuggest application - an application for suggesting treatments, and a home foetal surveillance system. Data is uploaded in Google BigQuery from Bega Hospital Timişoara, Romania. The analysed data is useful for the medical staff, researchers and statisticians from public health domain. The current work describes the technological architecture and its processing possibilities that in the future will be proved based on quality criteria to lead to a better decision process in diagnosis and public health.
Development of a replicated database of DHCP data for evaluation of drug use.
Graber, S E; Seneker, J A; Stahl, A A; Franklin, K O; Neel, T E; Miller, R A
1996-01-01
This case report describes development and testing of a method to extract clinical information stored in the Veterans Affairs (VA) Decentralized Hospital Computer System (DHCP) for the purpose of analyzing data about groups of patients. The authors used a microcomputer-based, structured query language (SQL)-compatible, relational database system to replicate a subset of the Nashville VA Hospital's DHCP patient database. This replicated database contained the complete current Nashville DHCP prescription, provider, patient, and drug data sets, and a subset of the laboratory data. A pilot project employed this replicated database to answer questions that might arise in drug-use evaluation, such as identification of cases of polypharmacy, suboptimal drug regimens, and inadequate laboratory monitoring of drug therapy. These database queries included as candidates for review all prescriptions for all outpatients. The queries demonstrated that specific drug-use events could be identified for any time interval represented in the replicated database. PMID:8653451
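A query of the kind mentioned above (identifying possible polypharmacy) might look like the following minimal sketch. The table layout, column names, and threshold are hypothetical and are not taken from DHCP or from the authors' replicated database; an in-memory SQLite database stands in for the SQL-compatible relational system.

```python
import sqlite3


def polypharmacy_candidates(conn, day, min_drugs):
    """Return (patient_id, n_drugs) for patients with at least `min_drugs`
    distinct drugs active on `day`. The schema is an assumption."""
    return conn.execute(
        """
        SELECT patient_id, COUNT(DISTINCT drug) AS n_drugs
        FROM prescriptions
        WHERE start_day <= ? AND end_day >= ?
        GROUP BY patient_id
        HAVING COUNT(DISTINCT drug) >= ?
        ORDER BY patient_id
        """,
        (day, day, min_drugs),
    ).fetchall()


# Tiny made-up data set for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions "
    "(patient_id INTEGER, drug TEXT, start_day INTEGER, end_day INTEGER)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        (1, "A", 0, 30), (1, "B", 10, 40), (1, "C", 20, 50),
        (2, "A", 0, 30), (2, "B", 35, 60),
    ],
)
result = polypharmacy_candidates(conn, day=25, min_drugs=3)
print(result)
```

Because the date bounds are query parameters, the same statement can be run for any interval covered by the replicated data.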
Ashkenazy, Haim; Abadi, Shiran; Martz, Eric; Chay, Ofer; Mayrose, Itay; Pupko, Tal; Ben-Tal, Nir
2016-01-01
The degree of evolutionary conservation of an amino acid in a protein or a nucleic acid in DNA/RNA reflects a balance between its natural tendency to mutate and the overall need to retain the structural integrity and function of the macromolecule. The ConSurf web server (http://consurf.tau.ac.il), established over 15 years ago, analyses the evolutionary pattern of the amino/nucleic acids of the macromolecule to reveal regions that are important for structure and/or function. Starting from a query sequence or structure, the server automatically collects homologues, infers their multiple sequence alignment and reconstructs a phylogenetic tree that reflects their evolutionary relations. These data are then used, within a probabilistic framework, to estimate the evolutionary rates of each sequence position. Here we introduce several new features into ConSurf, including automatic selection of the best evolutionary model used to infer the rates, the ability to homology-model query proteins, prediction of the secondary structure of query RNA molecules from sequence, the ability to view the biological assembly of a query (in addition to the single chain), mapping of the conservation grades onto 2D RNA models and an advanced view of the phylogenetic tree that enables interactively rerunning ConSurf with the taxa of a sub-tree. PMID:27166375
A semantic proteomics dashboard (SemPoD) for data management in translational research.
Jayapandian, Catherine P; Zhao, Meng; Ewing, Rob M; Zhang, Guo-Qiang; Sahoo, Satya S
2012-01-01
One of the primary challenges in translational research data management is breaking down the barriers between the multiple data silos and the integration of 'omics data with clinical information to complete the cycle from the bench to the bedside. The role of contextual metadata, also called provenance information, is a key factor in effective data integration, reproducibility of results, correct attribution of original source, and answering research queries involving "What", "Where", "When", "Which", "Who", "How", and "Why" (also known as the W7 model). At present, however, there is limited or no effective approach to managing and leveraging provenance information for integrating data across studies or projects. Hence, there is an urgent need for a paradigm shift in creating a "provenance-aware" informatics platform to address this challenge. We introduce an ontology-driven, intuitive Semantic Proteomics Dashboard (SemPoD) that uses provenance together with domain information (semantic provenance) to enable researchers to query, compare, and correlate different types of data across multiple projects, and allow integration with legacy data to support their ongoing research. The SemPoD platform, currently in use at the Case Center for Proteomics and Bioinformatics (CPB), consists of three components: (a) Ontology-driven Visual Query Composer, (b) Result Explorer, and (c) Query Manager. Currently, SemPoD allows provenance-aware querying of 1153 mass-spectrometry experiments from 20 different projects. SemPoD uses the systems molecular biology provenance ontology (SysPro) to support a dynamic query composition interface, which automatically updates the components of the query interface based on previous user selections and efficiently prunes the result set using a "smart filtering" approach.
The SysPro ontology re-uses terms from the PROV-ontology (PROV-O) being developed by the World Wide Web Consortium (W3C) provenance working group, the minimum information required for reporting a molecular interaction experiment (MIMIx), and the minimum information about a proteomics experiment (MIAPE) guidelines. SemPoD was evaluated in terms of both user feedback and system scalability. SemPoD is an intuitive and powerful provenance ontology-driven data access and query platform that uses the MIAPE and MIMIx metadata guidelines to create an integrated view over large-scale systems molecular biology datasets. SemPoD leverages the SysPro ontology to create an intuitive dashboard for biologists to compose queries, explore the results, and use a query manager for storing queries for later use. SemPoD can be deployed over many existing database applications storing 'omics data, including, as illustrated here, the LabKey data-management system. The initial user feedback evaluating the usability and functionality of SemPoD has been very positive, and it is being considered for wider deployment beyond the proteomics domain and in other 'omics' centers.
Clinical application of brain imaging for the diagnosis of mood disorders: the current state of play
Savitz, J B; Rauch, S L; Drevets, W C
2013-01-01
In response to queries about whether brain imaging technology has reached the point where it is useful for making a clinical diagnosis and for helping to guide treatment selection, the American Psychiatric Association (APA) has recently written a position paper on the Clinical Application of Brain Imaging in Psychiatry. The following perspective piece is based on our contribution to this APA position paper, which specifically emphasized the application of neuroimaging in mood disorders. We present an introductory overview of the challenges faced by researchers in developing valid and reliable biomarkers for psychiatric disorders, followed by a synopsis of the extant neuroimaging findings in mood disorders, and an evidence-based review of the current research on brain imaging biomarkers in adult mood disorders. Although there are a number of promising results, by the standards proposed below, we argue that there are currently no brain imaging biomarkers that are clinically useful for establishing diagnosis or predicting treatment outcome in mood disorders. PMID:23546169
Savitz, J B; Rauch, S L; Drevets, W C
2013-05-01
In response to queries about whether brain imaging technology has reached the point where it is useful for making a clinical diagnosis and for helping to guide treatment selection, the American Psychiatric Association (APA) has recently written a position paper on the Clinical Application of Brain Imaging in Psychiatry. The following perspective piece is based on our contribution to this APA position paper, which specifically emphasized the application of neuroimaging in mood disorders. We present an introductory overview of the challenges faced by researchers in developing valid and reliable biomarkers for psychiatric disorders, followed by a synopsis of the extant neuroimaging findings in mood disorders, and an evidence-based review of the current research on brain imaging biomarkers in adult mood disorders. Although there are a number of promising results, by the standards proposed below, we argue that there are currently no brain imaging biomarkers that are clinically useful for establishing diagnosis or predicting treatment outcome in mood disorders.
Dugan, J M; Berrios, D C; Liu, X; Kim, D K; Kaizer, H; Fagan, L M
1999-01-01
Our group has built an information retrieval system based on a complex semantic markup of medical textbooks. We describe the construction of a set of web-based knowledge-acquisition tools that expedites the collection and maintenance of the concepts required for text markup and the search interface required for information retrieval from the marked text. In the text markup system, domain experts (DEs) identify sections of text that contain one or more elements from a finite set of concepts. End users can then query the text using a predefined set of questions, each of which identifies a subset of complementary concepts. The search process matches that subset of concepts to relevant points in the text. The current process requires that the DE invest significant time to generate the required concepts and questions. We propose a new system--called ACQUIRE (Acquisition of Concepts and Queries in an Integrated Retrieval Environment)--that assists a DE in two essential tasks in the text-markup process. First, it helps her to develop, edit, and maintain the concept model: the set of concepts with which she marks the text. Second, ACQUIRE helps her to develop a query model: the set of specific questions that end users can later use to search the marked text. The DE incorporates concepts from the concept model when she creates the questions in the query model. The major benefit of the ACQUIRE system is a reduction in the time and effort required for the text-markup process. We compared the process of concept- and query-model creation using ACQUIRE to the process used in previous work by rebuilding two existing models that we previously constructed manually. We observed a significant decrease in the time required to build and maintain the concept and query models.
Tourassi, Georgia D; Harrawood, Brian; Singh, Swatee; Lo, Joseph Y; Floyd, Carey E
2007-01-01
The purpose of this study was to evaluate image similarity measures employed in an information-theoretic computer-assisted detection (IT-CAD) scheme. The scheme was developed for content-based retrieval and detection of masses in screening mammograms. The study is aimed toward an interactive clinical paradigm where physicians query the proposed IT-CAD scheme on mammographic locations that are either visually suspicious or indicated as suspicious by other cuing CAD systems. The IT-CAD scheme provides an evidence-based, second opinion for query mammographic locations using a knowledge database of mass and normal cases. In this study, eight entropy-based similarity measures were compared with respect to retrieval precision and detection accuracy using a database of 1820 mammographic regions of interest. The IT-CAD scheme was then validated on a separate database for false positive reduction of progressively more challenging visual cues generated by an existing, in-house mass detection system. The study showed that the image similarity measures fall into one of two categories; one category is better suited to the retrieval of semantically similar cases while the second is more effective with knowledge-based decisions regarding the presence of a true mass in the query location. In addition, the IT-CAD scheme yielded a substantial reduction in false-positive detections while maintaining high detection rate for malignant masses.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tourassi, Georgia D.; Harrawood, Brian; Singh, Swatee
The purpose of this study was to evaluate image similarity measures employed in an information-theoretic computer-assisted detection (IT-CAD) scheme. The scheme was developed for content-based retrieval and detection of masses in screening mammograms. The study is aimed toward an interactive clinical paradigm where physicians query the proposed IT-CAD scheme on mammographic locations that are either visually suspicious or indicated as suspicious by other cuing CAD systems. The IT-CAD scheme provides an evidence-based, second opinion for query mammographic locations using a knowledge database of mass and normal cases. In this study, eight entropy-based similarity measures were compared with respect to retrieval precision and detection accuracy using a database of 1820 mammographic regions of interest. The IT-CAD scheme was then validated on a separate database for false positive reduction of progressively more challenging visual cues generated by an existing, in-house mass detection system. The study showed that the image similarity measures fall into one of two categories; one category is better suited to the retrieval of semantically similar cases while the second is more effective with knowledge-based decisions regarding the presence of a true mass in the query location. In addition, the IT-CAD scheme yielded a substantial reduction in false-positive detections while maintaining high detection rate for malignant masses.
Achieve Location Privacy-Preserving Range Query in Vehicular Sensing
Lu, Rongxing; Ma, Maode; Bao, Haiyong
2017-01-01
Modern vehicles are equipped with a plethora of on-board sensors and large on-board storage, which enables them to gather and store various local-relevant data. However, the wide application of vehicular sensing has its own challenges, among which location-privacy preservation and data query accuracy are two critical problems. In this paper, we propose a novel range query scheme, which helps the data requester to accurately retrieve the sensed data from the distributive on-board storage in vehicular ad hoc networks (VANETs) with location privacy preservation. The proposed scheme exploits structured scalars to denote the locations of data requesters and vehicles, and achieves the privacy-preserving location matching with the homomorphic Paillier cryptosystem technique. Detailed security analysis shows that the proposed range query scheme can successfully preserve the location privacy of the involved data requesters and vehicles, and protect the confidentiality of the sensed data. In addition, performance evaluations are conducted to show the efficiency of the proposed scheme, in terms of computation delay and communication overhead. Specifically, the computation delay and communication overhead are not dependent on the length of the scalar, and they are only proportional to the number of vehicles. PMID:28786943
Achieve Location Privacy-Preserving Range Query in Vehicular Sensing.
Kong, Qinglei; Lu, Rongxing; Ma, Maode; Bao, Haiyong
2017-08-08
Modern vehicles are equipped with a plethora of on-board sensors and large on-board storage, which enables them to gather and store various local-relevant data. However, the wide application of vehicular sensing has its own challenges, among which location-privacy preservation and data query accuracy are two critical problems. In this paper, we propose a novel range query scheme, which helps the data requester to accurately retrieve the sensed data from the distributive on-board storage in vehicular ad hoc networks (VANETs) with location privacy preservation. The proposed scheme exploits structured scalars to denote the locations of data requesters and vehicles, and achieves the privacy-preserving location matching with the homomorphic Paillier cryptosystem technique. Detailed security analysis shows that the proposed range query scheme can successfully preserve the location privacy of the involved data requesters and vehicles, and protect the confidentiality of the sensed data. In addition, performance evaluations are conducted to show the efficiency of the proposed scheme, in terms of computation delay and communication overhead. Specifically, the computation delay and communication overhead are not dependent on the length of the scalar, and they are only proportional to the number of vehicles.
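The additive homomorphism that makes the Paillier cryptosystem useful for this kind of private matching can be illustrated with a toy sketch. This is not the paper's range query scheme: the primes are tiny and insecure, the function names are made up for the example, and the simplification g = n + 1 is a common textbook choice assumed here.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # assumed simplification
    mu = pow(lam, -1, n)                               # needs Python 3.8+
    return (n, g), (lam, mu)


def encrypt(pub, m, rng):
    n, g = pub
    n2 = n * n
    while True:  # pick a random r that is invertible modulo n
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


rng = random.Random(0)
pub, priv = keygen(61, 53)
c1 = encrypt(pub, 17, rng)
c2 = encrypt(pub, 25, rng)
# Multiplying ciphertexts adds the underlying plaintexts modulo n.
c_sum = (c1 * c2) % (pub[0] ** 2)
total = decrypt(pub, priv, c_sum)
print(total)
```

A party holding only ciphertexts can therefore combine encrypted values without learning them, which is the property a privacy-preserving location comparison can build on.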
Cameron, Delroy; Sheth, Amit P; Jaykumar, Nishita; Thirunarayan, Krishnaprasad; Anand, Gaurish; Smith, Gary A
2014-12-01
While contemporary semantic search systems offer to improve classical keyword-based search, they are not always adequate for complex domain specific information needs. The domain of prescription drug abuse, for example, requires knowledge of both ontological concepts and "intelligible constructs" not typically modeled in ontologies. These intelligible constructs convey essential information that include notions of intensity, frequency, interval, dosage and sentiments, which could be important to the holistic needs of the information seeker. In this paper, we present a hybrid approach to domain specific information retrieval that integrates ontology-driven query interpretation with synonym-based query expansion and domain specific rules, to facilitate search in social media on prescription drug abuse. Our framework is based on a context-free grammar (CFG) that defines the query language of constructs interpretable by the search system. The grammar provides two levels of semantic interpretation: 1) a top-level CFG that facilitates retrieval of diverse textual patterns, which belong to broad templates and 2) a low-level CFG that enables interpretation of specific expressions belonging to such textual patterns. These low-level expressions occur as concepts from four different categories of data: 1) ontological concepts, 2) concepts in lexicons (such as emotions and sentiments), 3) concepts in lexicons with only partial ontology representation, called lexico-ontology concepts (such as side effects and routes of administration (ROA)), and 4) domain specific expressions (such as date, time, interval, frequency and dosage) derived solely through rules. Our approach is embodied in a novel Semantic Web platform called PREDOSE, which provides search support for complex domain specific information needs in prescription drug abuse epidemiology. 
When applied to a corpus of over 1 million drug abuse-related web forum posts, our search framework proved effective in retrieving relevant documents when compared with three existing search systems.
Cameron, Delroy; Sheth, Amit P.; Jaykumar, Nishita; Thirunarayan, Krishnaprasad; Anand, Gaurish; Smith, Gary A.
2015-01-01
While contemporary semantic search systems offer to improve classical keyword-based search, they are not always adequate for complex domain specific information needs. The domain of prescription drug abuse, for example, requires knowledge of both ontological concepts and “intelligible constructs” not typically modeled in ontologies. These intelligible constructs convey essential information that include notions of intensity, frequency, interval, dosage and sentiments, which could be important to the holistic needs of the information seeker. In this paper, we present a hybrid approach to domain specific information retrieval that integrates ontology-driven query interpretation with synonym-based query expansion and domain specific rules, to facilitate search in social media on prescription drug abuse. Our framework is based on a context-free grammar (CFG) that defines the query language of constructs interpretable by the search system. The grammar provides two levels of semantic interpretation: 1) a top-level CFG that facilitates retrieval of diverse textual patterns, which belong to broad templates and 2) a low-level CFG that enables interpretation of specific expressions belonging to such textual patterns. These low-level expressions occur as concepts from four different categories of data: 1) ontological concepts, 2) concepts in lexicons (such as emotions and sentiments), 3) concepts in lexicons with only partial ontology representation, called lexico-ontology concepts (such as side effects and routes of administration (ROA)), and 4) domain specific expressions (such as date, time, interval, frequency and dosage) derived solely through rules. Our approach is embodied in a novel Semantic Web platform called PREDOSE, which provides search support for complex domain specific information needs in prescription drug abuse epidemiology. 
When applied to a corpus of over 1 million drug abuse-related web forum posts, our search framework proved effective in retrieving relevant documents when compared with three existing search systems. PMID:25814917
Added Value of Selected Images Embedded Into Radiology Reports to Referring Clinicians
Iyer, Veena R.; Hahn, Peter F.; Blaszkowsky, Lawrence S.; Thayer, Sarah P.; Halpern, Elkan F.; Harisinghani, Mukesh G.
2011-01-01
Purpose The aim of this study was to evaluate the added utility of embedding images for findings described in radiology text reports to referring clinicians. Methods Thirty-five cases referred for abdominal CT scans in 2007 and 2008 were included. Referring physicians were asked to view text-only reports, followed by the same reports with pertinent images embedded. For each pair of reports, a questionnaire was administered. A 5-point, Likert-type scale was used to assess if the clinical query was satisfactorily answered by the text-only report. A “yes-or-no” question was used to assess whether the report with images answered the clinical query better; a positive answer to this question generated “yes-or-no” queries to examine whether the report with images helped in making a more confident decision on management, whether it reduced time spent in forming the plan, and whether it altered management. The questionnaire asked whether a radiologist would be contacted with queries on reading the text-only report and the report with images. Results In 32 of 35 cases, the text-only reports satisfactorily answered the clinical queries. In these 32 cases, the reports with attached images helped in making more confident management decisions and reduced time in planning management. Attached images altered management in 2 cases. Radiologists would have been consulted for clarifications in 21 and 10 cases on reading the text-only reports and the reports with embedded images, respectively. Conclusions Providing relevant images with reports saves time, increases physicians' confidence in deciding treatment plans, and can alter management. PMID:20193926
Hu, Qiyue; Peng, Zhengwei; Kostrowicki, Jaroslav; Kuki, Atsuo
2011-01-01
Pfizer Global Virtual Library (PGVL) of 10^13 readily synthesizable molecules offers a tremendous opportunity for lead optimization and scaffold hopping in drug discovery projects. However, mining into a chemical space of this size presents a challenge for the concomitant design informatics due to the fact that standard molecular similarity searches against a collection of explicit molecules cannot be utilized, since no chemical information system could create and manage more than 10^8 explicit molecules. Nevertheless, by accepting a tolerable level of false negatives in search results, we were able to bypass the need for full 10^13 enumeration and enabled the efficient similarity search and retrieval into this huge chemical space for practical usage by medicinal chemists. In this report, two search methods (LEAP1 and LEAP2) are presented. The first method uses PGVL reaction knowledge to disassemble the incoming search query molecule into a set of reactants and then uses reactant-level similarities into actual available starting materials to focus on a much smaller sub-region of the full virtual library compound space. This sub-region is then explicitly enumerated and searched via a standard similarity method using the original query molecule. The second method uses a fuzzy mapping onto candidate reactions and does not require exact disassembly of the incoming query molecule. Instead Basis Products (or capped reactants) are mapped into the query molecule and the resultant asymmetric similarity scores are used to prioritize the corresponding reactions and reactant sets. All sets of Basis Products are inherently indexed to specific reactions and specific starting materials. This again allows focusing on a much smaller sub-region for explicit enumeration and subsequent standard product-level similarity search. A set of validation studies were conducted.
The results have shown that the level of false negatives for the disassembly-based method is acceptable when the query molecule can be recognized for exact disassembly, and the fuzzy reaction mapping method based on Basis Products has an even better performance in terms of lower false-negative rate because it is not limited by the requirement that the query molecule needs to be recognized by any disassembly algorithm. Both search methods have been implemented and accessed through a powerful desktop molecular design tool (see ref. (33) for details). The chapter will end with a comparison of published search methods against large virtual chemical space.
Climate Change: On Scientists and Advocacy
NASA Technical Reports Server (NTRS)
Schmidt, Gavin A.
2014-01-01
Last year, I asked a crowd of a few hundred geoscientists from around the world what positions related to climate science and policy they would be comfortable publicly advocating. I presented a list of recommendations that included increased research funding, greater resources for education, and specific emission reduction technologies. In almost every case, a majority of the audience felt comfortable arguing for them. The only clear exceptions were related to geo-engineering research and nuclear power. I had queried the researchers because the relationship between science and advocacy is marked by many assumptions and little clarity. This despite the fact that the basic question of how scientists can be responsible advocates on issues related to their expertise has been discussed for decades most notably in the case of climate change by the late Stephen Schneider.
Application based on ArcObject inquiry and Google maps demonstration to real estate database
NASA Astrophysics Data System (ADS)
Hwang, JinTsong
2007-06-01
The real estate industry in Taiwan has been flourishing in recent years. Acquiring varied and abundant information about real estate for sale is a goal shared by consumers and brokerages. It is therefore important to gather all pertinent information before viewing a property. This benefits real estate agents, who can provide as much information as possible and thereby solidify the buyer's interest, and it may also save time and manpower costs if something turns out to be unsuitable. Most brokerage sites already use the Internet as a medium for publicity; however, their contents are limited to the specific property itself, and their query functions mostly offer only searching by condition. This paper proposes a website query interface that offers zone queries by spatial analysis for non-GIS users, through a user-friendly interface developed with ArcObject in VB6, in addition to query by condition. The inquiry results are shown on a web page that embeds functions of Google Maps and the UrMap API. In addition, the inquiry results are presented in multimedia form, including a hyperlink to Google Earth showing the surroundings of the property, a Virtual Reality scene of the house, a panorama of the building interior, and so on. The website thus provides an additional spatial solution for querying and presenting abundant real estate information in two-dimensional and three-dimensional views.
Tai, David; Fang, Jianwen
2012-08-27
The large sizes of today's chemical databases require efficient algorithms to perform similarity searches. It can be very time consuming to compare two large chemical databases. This paper seeks to build upon existing research efforts by describing a novel strategy for accelerating existing search algorithms for comparing large chemical collections. The quest for efficiency has focused on developing better indexing algorithms by creating heuristics for searching an individual chemical against a chemical library by detecting and eliminating needless similarity calculations. For comparing two chemical collections, these algorithms simply execute searches for each chemical in the query set sequentially. The strategy presented in this paper achieves a speedup over these algorithms by indexing the set of all query chemicals so that redundant calculations that arise in the case of sequential searches are eliminated. We implement this novel algorithm by developing a similarity search program called Symmetric inDexing or SymDex. SymDex shows over a 232% maximum speedup compared to the state-of-the-art single query search algorithm over real data for various fingerprint lengths. Considerable speedup is even seen for batch searches where query set sizes are relatively small compared to typical database sizes. To the best of our knowledge, SymDex is the first search algorithm designed specifically for comparing chemical libraries. It can be adapted to most, if not all, existing indexing algorithms and shows potential for accelerating future similarity search algorithms for comparing chemical databases.
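The idea of "detecting and eliminating needless similarity calculations" in a single-query search can be sketched with a well-known bit-count bound on Tanimoto similarity. This is a generic illustration and not the SymDex algorithm: fingerprints are modelled as Python integers, and the function names and example data are assumptions.

```python
def popcount(x):
    """Number of set bits in a fingerprint stored as a non-negative int."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bit fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def search(query, library, threshold):
    """Return (index, similarity) pairs with similarity >= threshold.

    Tanimoto(a, b) can never exceed min(|a|, |b|) / max(|a|, |b|), so any
    library entry whose bit counts already violate the threshold is skipped
    without computing the full similarity.
    """
    q_bits = popcount(query)
    hits = []
    for idx, fp in enumerate(library):
        f_bits = popcount(fp)
        hi, lo = max(q_bits, f_bits), min(q_bits, f_bits)
        if hi and lo / hi < threshold:
            continue  # pruned by the bit-count bound
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((idx, score))
    return hits


# Made-up 8-bit fingerprints for illustration.
library = [0b11110000, 0b11111100, 0b00000001, 0b11110001]
query = 0b11110000
hits = search(query, library, threshold=0.6)
print(hits)
```

Comparing two collections by calling `search` once per query chemical is the sequential baseline that the abstract describes; the paper's contribution is to index the query set as well, which this sketch does not attempt.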
ASON: An OWL-S based ontology for astrophysical services
NASA Astrophysics Data System (ADS)
Louge, T.; Karray, M. H.; Archimède, B.; Knödlseder, J.
2018-07-01
Modern astrophysics relies heavily on Web services to expose most of the data coming from many different instruments and research efforts worldwide. The virtual observatory (VO) has been designed to allow scientists to locate, retrieve and analyze useful information among those heterogeneous data. The use of ontologies has been studied in the VO context for astrophysical concerns such as object types or the subjects of astrophysical services. From an operational point of view, however, the ontological description of astrophysical services for interoperability and querying has yet to be addressed. In this paper, we design a global ontology (Astrophysical Services ONtology, ASON) based on the Web Ontology Language for Services (OWL-S) to enhance existing astrophysical services description. By expressing VO-specific and non-VO-specific service designs together, it will improve the automation of service queries and allow automatic composition of heterogeneous astrophysical services.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Whiteaker, Jeffrey R.; Halusa, Goran; Hoofnagle, Andrew N.
2016-02-12
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) of the National Cancer Institute (NCI) has launched an Assay Portal (http://assays.cancer.gov) to serve as an open-source repository of well-characterized targeted proteomic assays. The portal is designed to curate and disseminate highly characterized, targeted mass spectrometry (MS)-based assays by providing detailed assay performance characterization data, standard operating procedures, and access to reagents. Assay content is accessed via the portal through queries to find assays targeting proteins associated with specific cellular pathways, protein complexes, or specific chromosomal regions. The position of the peptide analytes for which there are available assays are mapped relative to other features of interest in the protein, such as sequence domains, isoforms, single nucleotide polymorphisms, and post-translational modifications. The overarching goals are to enable robust quantification of all human proteins and to standardize the quantification of targeted MS-based assays to ultimately enable harmonization of results over time and across laboratories.
Whiteaker, Jeffrey R; Halusa, Goran N; Hoofnagle, Andrew N; Sharma, Vagisha; MacLean, Brendan; Yan, Ping; Wrobel, John A; Kennedy, Jacob; Mani, D R; Zimmerman, Lisa J; Meyer, Matthew R; Mesri, Mehdi; Boja, Emily; Carr, Steven A; Chan, Daniel W; Chen, Xian; Chen, Jing; Davies, Sherri R; Ellis, Matthew J C; Fenyö, David; Hiltke, Tara; Ketchum, Karen A; Kinsinger, Chris; Kuhn, Eric; Liebler, Daniel C; Liu, Tao; Loss, Michael; MacCoss, Michael J; Qian, Wei-Jun; Rivers, Robert; Rodland, Karin D; Ruggles, Kelly V; Scott, Mitchell G; Smith, Richard D; Thomas, Stefani; Townsend, R Reid; Whiteley, Gordon; Wu, Chaochao; Zhang, Hui; Zhang, Zhen; Rodriguez, Henry; Paulovich, Amanda G
2016-01-01
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) of the National Cancer Institute (NCI) has launched an Assay Portal (http://assays.cancer.gov) to serve as an open-source repository of well-characterized targeted proteomic assays. The portal is designed to curate and disseminate highly characterized, targeted mass spectrometry (MS)-based assays by providing detailed assay performance characterization data, standard operating procedures, and access to reagents. Assay content is accessed via the portal through queries to find assays targeting proteins associated with specific cellular pathways, protein complexes, or specific chromosomal regions. The position of the peptide analytes for which there are available assays are mapped relative to other features of interest in the protein, such as sequence domains, isoforms, single nucleotide polymorphisms, and posttranslational modifications. The overarching goals are to enable robust quantification of all human proteins and to standardize the quantification of targeted MS-based assays to ultimately enable harmonization of results over time and across laboratories.
New NED XML/VOtable Services and Client Interface Applications
NASA Astrophysics Data System (ADS)
Pevunova, O.; Good, J.; Mazzarella, J.; Berriman, G. B.; Madore, B.
2005-12-01
The NASA/IPAC Extragalactic Database (NED) provides data and cross-identifications for over 7 million extragalactic objects fused from thousands of survey catalogs and journal articles. The data cover all frequencies from radio through gamma rays and include positions, redshifts, photometry and spectral energy distributions (SEDs), sizes, and images. NED services have traditionally supplied data in HTML format for connections from Web browsers, and a custom ASCII data structure for connections by remote computer programs written in the C programming language. We describe new services that provide responses from NED queries in XML documents compliant with the international virtual observatory VOtable protocol. The XML/VOtable services support cone searches, all-sky searches based on object attributes (survey names, cross-IDs, redshifts, flux densities), and requests for detailed object data. Initial services have been inserted into the NVO registry, and others will follow soon. The first client application is a Style Sheet specification for rendering NED VOtable query results in Web browsers that support XML. The second prototype application is a Java applet that allows users to compare multiple SEDs. The new XML/VOtable output mode will also simplify the integration of data from NED into visualization and analysis packages, software agents, and other virtual observatory applications. We show an example SED from NED plotted using VOPlot. The NED website is: http://nedwww.ipac.caltech.edu.
An evolution-based DNA-binding residue predictor using a dynamic query-driven learning scheme.
Chai, H; Zhang, J; Yang, G; Ma, Z
2016-11-15
DNA-binding proteins play a pivotal role in various biological activities. Identification of DNA-binding residues (DBRs) is of great importance for understanding the mechanism of gene regulations and chromatin remodeling. Most traditional computational methods usually construct their predictors on static non-redundant datasets. They excluded many homologous DNA-binding proteins so as to guarantee the generalization capability of their models. However, those ignored samples may potentially provide useful clues when studying protein-DNA interactions, which have not obtained enough attention. In view of this, we propose a novel method, namely DQPred-DBR, to fill the gap of DBR predictions. First, a large-scale extensible sample pool was compiled. Second, evolution-based features in the form of a relative position specific score matrix and covariant evolutionary conservation descriptors were used to encode the feature space. Third, a dynamic query-driven learning scheme was designed to make more use of proteins with known structure and functions. In comparison with a traditional static model, the introduction of dynamic models could obviously improve the prediction performance. Experimental results from the benchmark and independent datasets proved that our DQPred-DBR had promising generalization capability. It was capable of producing decent predictions and outperforms many state-of-the-art methods. For the convenience of academic use, our proposed method was also implemented as a web server at .
Improving average ranking precision in user searches for biomedical research datasets
Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick
2017-01-01
Availability of research datasets is a keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participants’ best submissions. Overall, it ranks in the top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed a positive impact on the system’s performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. 
Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data PMID:29220475
Berlin, Conny; Blanch, Carles; Lewis, David J; Maladorno, Dionigi D; Michel, Christiane; Petrin, Michael; Sarp, Severine; Close, Philippe
2012-06-01
The detection of safety signals with medicines is an essential activity to protect public health. Despite widespread acceptance, it is unclear whether recently applied statistical algorithms provide enhanced performance characteristics when compared with traditional systems. Novartis has adopted a novel system for automated signal detection on the basis of disproportionality methods within a safety data mining application (Empirica™ Signal System [ESS]). ESS uses two algorithms for routine analyses: empirical Bayes Multi-item Gamma Poisson Shrinker and logistic regression (LR). A model was developed comprising 14 medicines, categorized as "new" or "established." A standard was prepared on the basis of safety findings selected from traditional sources. ESS results were compared with the standard to calculate the positive predictive value (PPV), specificity, and sensitivity. PPVs of the lower one-sided 5% and 0.05% confidence limits of the Bayes geometric mean (EB05) and of the LR odds ratio (LR0005) almost coincided for all the drug-event combinations studied. There was no obvious difference comparing the PPV of the leading Medical Dictionary for Regulatory Activities (MedDRA) terms to the PPV for all terms. The PPV of narrow MedDRA query searches was higher than that for broad searches. The widely used threshold value of EB05 = 2.0 or LR0005 = 2.0 together with more than three spontaneous reports of the drug-event combination produced balanced results for PPV, sensitivity, and specificity. Consequently, performance characteristics were best for leading terms with narrow MedDRA query searches irrespective of applying Multi-item Gamma Poisson Shrinker or LR at a threshold value of 2.0. This research formed the basis for the configuration of ESS for signal detection at Novartis. Copyright © 2011 John Wiley & Sons, Ltd.
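The three performance measures named in this abstract follow directly from a 2x2 comparison of the flagged drug-event combinations against the reference standard. A minimal sketch, assuming hypothetical counts (the abstract does not report the underlying table) and a hypothetical helper name `signal_metrics`:

```python
def signal_metrics(tp, fp, fn, tn):
    """Return (PPV, sensitivity, specificity) from 2x2 counts.

    tp: combinations flagged by the system and present in the reference standard
    fp: flagged by the system but absent from the standard
    fn: present in the standard but not flagged
    tn: neither flagged nor present in the standard
    """
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return ppv, sensitivity, specificity


# Hypothetical counts, for illustration only (not taken from the study).
ppv, sens, spec = signal_metrics(tp=40, fp=10, fn=20, tn=130)
print(f"PPV={ppv:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the detection threshold typically trades sensitivity for PPV, which is why the abstract speaks of a threshold giving "balanced" results.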
SIRSALE: integrated video database management tools
NASA Astrophysics Data System (ADS)
Brunie, Lionel; Favory, Loic; Gelas, J. P.; Lefevre, Laurent; Mostefaoui, Ahmed; Nait-Abdesselam, F.
2002-07-01
Video databases became an active field of research during the last decade. The main objective of such systems is to let users search, access and play back distributed stored video data in a user-friendly way, just as they do with traditional distributed databases. Hence, such systems need to deal with hard issues: (a) video documents generate huge volumes of data and are time sensitive (streams must be delivered at a specific bitrate), (b) the contents of video data are very hard to extract automatically and need to be annotated by humans. To cope with these issues, many approaches have been proposed in the literature, including data models, query languages, video indexing, etc. In this paper, we present SIRSALE: a set of video database management tools that allow users to manipulate video documents and streams stored in large distributed repositories. All the proposed tools are based on generic models that can be customized for specific applications using ad-hoc adaptation modules. More precisely, SIRSALE allows users to: (a) browse video documents by structure (sequences, scenes, shots) and (b) query the video database content by using a graphical tool adapted to the nature of the target video documents. This paper also presents an annotation interface which allows archivists to describe the content of video documents. All these tools are coupled to a video player integrating remote VCR functionalities and are based on active network technology. We therefore present how dedicated active services allow optimized transport of video streams (with Tamanoir active nodes). We then describe experiments using SIRSALE on an archive of news video and soccer matches. The system has been demonstrated to professionals, with positive feedback. Finally, we discuss open issues and present some perspectives.
Masseroli, Marco; Kaitoua, Abdulrahman; Pinoli, Pietro; Ceri, Stefano
2016-12-01
While a huge amount of (epi)genomic data of multiple types is becoming available by using Next Generation Sequencing (NGS) technologies, the most important emerging problem is the so-called tertiary analysis, concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the use of the Genomic Data Model (GDM), a simple data model which links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats which have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM supports as well all the formats describing experimental datasets (e.g., including copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., regarding transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. The GDM is able to homogeneously describe semantically heterogeneous data and makes the ground for providing data interoperability, e.g., achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery. Copyright © 2016 Elsevier Inc. All rights reserved.
Downing, N Lance; Adler-Milstein, Julia; Palma, Jonathan P; Lane, Steven; Eisenberg, Matthew; Sharp, Christopher; Longhurst, Christopher A
2017-01-01
Provider organizations increasingly have the ability to exchange patient health information electronically. Organizational health information exchange (HIE) policy decisions can impact the extent to which external information is readily available to providers, but this relationship has not been well studied. Our objective was to examine the relationship between electronic exchange of patient health information across organizations and organizational HIE policy decisions. We focused on 2 key decisions: whether to automatically search for information from other organizations and whether to require HIE-specific patient consent. We conducted a retrospective time series analysis of the effect of automatic querying and the patient consent requirement on the monthly volume of clinical summaries exchanged. We could not assess degree of use or usefulness of summaries, organizational decision-making processes, or generalizability to other vendors. Between 2013 and 2015, clinical summary exchange volume increased by 1349% across 11 organizations. Nine of the 11 systems were set up to enable auto-querying, and auto-querying was associated with a significant increase in the monthly rate of exchange (P = .006 for change in trend). Seven of the 11 organizations did not require patient consent specifically for HIE, and these organizations experienced a greater increase in volume of exchange over time compared to organizations that required consent. Automatic querying and limited consent requirements are organizational HIE policy decisions that impact the volume of exchange, and ultimately the information available to providers to support optimal care. Future efforts to ensure effective HIE may need to explicitly address these factors. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Query Auto-Completion Based on Word2vec Semantic Similarity
NASA Astrophysics Data System (ADS)
Shao, Taihua; Chen, Honghui; Chen, Wanyu
2018-04-01
Query auto-completion (QAC) is the first step of information retrieval, which helps users formulate the entire query after inputting only a few prefixes. Regarding the models of QAC, the traditional method ignores the contribution from the semantic relevance between queries. However, similar queries always express extremely similar search intention. In this paper, we propose a hybrid model FS-QAC based on query semantic similarity as well as the query frequency. We choose word2vec method to measure the semantic similarity between intended queries and pre-submitted queries. By combining both features, our experiments show that FS-QAC model improves the performance when predicting the user’s query intention and helping formulate the right query. Our experimental results show that the optimal hybrid model contributes to a 7.54% improvement in terms of MRR against a state-of-the-art baseline using the public AOL query logs.
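The idea of combining query frequency with semantic similarity can be illustrated with a small ranking function. This is only a sketch under stated assumptions: the linear interpolation with a `weight` parameter, the function names, and the toy two-dimensional vectors (standing in for word2vec embeddings) are all hypothetical, since the abstract does not specify how FS-QAC combines the two features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_completions(prefix, context_vec, candidates, weight=0.5):
    """Rank candidate queries that start with `prefix`.

    candidates maps query -> (past frequency, embedding vector).
    Assumed score: weight * normalised frequency
                   + (1 - weight) * cosine similarity to the previous query.
    """
    matching = {q: fv for q, fv in candidates.items() if q.startswith(prefix)}
    if not matching:
        return []
    max_freq = max(freq for freq, _ in matching.values()) or 1
    scored = []
    for query, (freq, vec) in matching.items():
        semantic = cosine(context_vec, vec)
        scored.append((weight * freq / max_freq + (1 - weight) * semantic, query))
    return [query for _, query in sorted(scored, reverse=True)]


# Toy vectors and frequencies (hypothetical values, for illustration only).
candidates = {
    "java download": (90, [1.0, 0.0]),
    "java island": (100, [0.0, 1.0]),
    "jaguar car": (50, [0.5, 0.5]),
}
previous_query_vec = [0.9, 0.1]  # the previously submitted query
print(rank_completions("jav", previous_query_vec, candidates))
```

With `weight=1.0` the function falls back to a purely frequency-based ranking, which makes it easy to compare the hybrid ordering against a frequency-only baseline.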
EquiX-A Search and Query Language for XML.
ERIC Educational Resources Information Center
Cohen, Sara; Kanza, Yaron; Kogan, Yakov; Sagiv, Yehoshua; Nutt, Werner; Serebrenik, Alexander
2002-01-01
Describes EquiX, a search language for XML that combines querying with searching to query the data and the meta-data content of Web pages. Topics include search engines; a data model for XML documents; search query syntax; search query semantics; an algorithm for evaluating a query on a document; and indexing EquiX queries. (LRW)
searchSCF: Using MongoDB to Enable Richer Searches of Locally Hosted Science Data Repositories
NASA Astrophysics Data System (ADS)
Knosp, B.
2016-12-01
Science teams today are in the unusual position of almost having too much data available to them. Modern sensors and models are capable of outputting terabytes of data per day, which can make it difficult to find specific subsets of data. The sheer size of files can also make it time consuming to retrieve this big data from national data archive centers. Thus, many science teams choose to store what data they can on their local systems, but they are not always equipped with tools to help them intelligently organize and search their data. In its local data repository, the Aura Microwave Limb Sounder (MLS) science team at NASA's Jet Propulsion Laboratory has collected over 300TB of atmospheric science data from 71 missions/models that aid in validation, algorithm development, and research activities. When the project began, the team developed a MySQL database to aid in data queries, but this database was only designed to keep track of MLS and a few ancillary data sets, leaving much of the data uncatalogued. The team has also seen database query time rise over the life of the mission. Even though the MLS science team's data holdings are not the size of a national data center's, team members still need tools to help them discover and utilize the data that they have on-hand. Over the past year, members of the science team have been looking for solutions to (1) store information on all the data sets they have collected in a single database, (2) store more metadata about each data file, (3) develop queries that can find relationships among these disparate data types, and (4) plug any new functions developed around this database into existing analysis, visualization, and web tools, transparently to users. In this presentation, I will discuss the searchSCF package that is currently under development. This package includes a NoSQL database management system (MongoDB) and a set of Python tools that both ingests data into the database and supports user queries. 
I will also highlight case studies of how this system could be used by the MLS science team, and how it could be implemented by other science teams with local data repositories.
Efficient strategies to find diagnostic test accuracy studies in kidney journals.
Rogerson, Thomas E; Ladhani, Maleeka; Mitchell, Ruth; Craig, Jonathan C; Webster, Angela C
2015-08-01
Nephrologists looking for quick answers to diagnostic clinical questions in MEDLINE can use a range of published search strategies or Clinical Query limits to improve the precision of their searches. We aimed to evaluate existing search strategies for finding diagnostic test accuracy studies in nephrology journals. We assessed the accuracy of 14 search strategies for retrieving diagnostic test accuracy studies from three nephrology journals indexed in MEDLINE. Two investigators hand searched the same journals to create a reference set of diagnostic test accuracy studies to compare search strategy results against. We identified 103 diagnostic test accuracy studies, accounting for 2.1% of all studies published. The most specific search strategy was the Narrow Clinical Queries limit (sensitivity: 0.20, 95% CI 0.13-0.29; specificity: 0.99, 95% CI 0.99-0.99). Using the Narrow Clinical Queries limit, a searcher would need to screen three (95% CI 2-6) articles to find one diagnostic study. The most sensitive search strategy was van der Weijden 1999 Extended (sensitivity: 0.95; 95% CI 0.89-0.98; specificity 0.55, 95% CI 0.53-0.56) but required a searcher to screen 24 (95% CI 23-26) articles to find one diagnostic study. Bachmann 2002 was the best balanced search strategy, which was sensitive (0.88, 95% CI 0.81-0.94), but also specific (0.74, 95% CI 0.73-0.75), with a number needed to screen of 15 (95% CI 14-17). Diagnostic studies are infrequently published in nephrology journals. The addition of a strategy for diagnostic studies to a subject search strategy in MEDLINE may reduce the records needed to screen while preserving adequate search sensitivity for routine clinical use. © 2015 Asian Pacific Society of Nephrology.
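The "number needed to screen" figures in this abstract can be approximated from the reported sensitivity, specificity and prevalence, because the number needed to screen is the reciprocal of precision. A minimal sketch, assuming a hypothetical helper name and using only the rounded values quoted above (so small differences from the published figures are expected):

```python
def number_needed_to_screen(sensitivity, specificity, prevalence):
    """Articles a searcher must screen to find one target study (1 / precision)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    precision = true_pos / (true_pos + false_pos)
    return 1 / precision


prevalence = 0.021  # 2.1% of published studies were diagnostic accuracy studies
strategies = {
    "Narrow Clinical Queries": (0.20, 0.99),
    "van der Weijden 1999 Extended": (0.95, 0.55),
    "Bachmann 2002": (0.88, 0.74),
}
for name, (sens, spec) in strategies.items():
    nns = number_needed_to_screen(sens, spec, prevalence)
    print(f"{name}: about {nns:.1f} articles per diagnostic study")
```

The computed values come out at roughly 3, 23 and 15 articles, in line with the reported 3, 24 and 15 given the rounding of the inputs.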
Harsha, Asheesh K; Schmitt, J Eric; Stavropoulos, S William
2014-01-01
To analyze Internet search data to characterize the temporal and geographic interest of Internet users in the United States in varicose vein treatment. From January 1, 2004, to September 1, 2012, the Google Trends tool was used to analyze query data for "varicose vein treatment" to identify individuals seeking treatment information for varicose veins. The term "varicose vein treatment" returned a search volume index (SVI), representing the search frequency relative to the total search volume during a specific time interval and region. Linear regression analysis and Kruskal-Wallis one-way analysis of variance were employed to characterize search results. Search traffic for varicose vein treatment increased by 520% over the 104-month study period. There was an annual mean increase of 28% (range, -18%-100%; standard deviation [SD], 35%), with a statistically significant linear increase in average yearly SVI over time (R(2) = 0.94, P < .0001). All years showed positive growth in mean SVI except for 2008 (18% decrease). There were statistically significant differences in SVI by month (Kruskal-Wallis, P < .0001) with significantly higher mean SVI compared with other months in May (190% increase; range, 26%-670%; SD, 15%) and June (209% increase; range, 35%-700%; SD, 20%). The southern United States showed significantly higher search traffic than all other regions (Tukey-Kramer, P < .00001). There have been significant increases in Internet search traffic related to varicose vein treatment in the past 8 years. Reflected in this trend is an annual peak in search traffic in the late spring months with an overall geographic bias toward southern states. Rigorous analysis of Internet search queries for medical procedures may prove useful to guide the efficient use of limited resources and marketing dollars. © 2013 The Society of Interventional Radiology Published by SIR All rights reserved.
Spatial and symbolic queries for 3D image data
NASA Astrophysics Data System (ADS)
Benson, Daniel C.; Zick, Gregory L.
1992-04-01
We present a query system for an object-oriented biomedical imaging database containing 3-D anatomical structures and their corresponding 2-D images. The graphical interface facilitates the formation of spatial queries, nonspatial or symbolic queries, and combined spatial/symbolic queries. A query editor is used for the creation and manipulation of 3-D query objects as volumes, surfaces, lines, and points. Symbolic predicates are formulated through a combination of text fields and multiple choice selections. Query results, which may include images, image contents, composite objects, graphics, and alphanumeric data, are displayed in multiple views. Objects returned by the query may be selected directly within the views for further inspection or modification, or for use as query objects in subsequent queries. Our image database query system provides visual feedback and manipulation of spatial query objects, multiple views of volume data, and the ability to combine spatial and symbolic queries. The system allows for incremental enhancement of existing objects and the addition of new objects and spatial relationships. The query system is designed for databases containing symbolic and spatial data. This paper discusses its application to data acquired in biomedical 3-D image reconstruction, but it is applicable to other areas such as CAD/CAM, geographical information systems, and computer vision.
GenoQuery: a new querying module for functional annotation in a genomic warehouse
Lemoine, Frédéric; Labedan, Bernard; Froidevaux, Christine
2008-01-01
Motivation: We have to cope with both a deluge of new genome sequences and a huge amount of data produced by high-throughput approaches used to exploit these genomic features. Crossing and comparing such heterogeneous and disparate data will help improving functional annotation of genomes. This requires designing elaborate integration systems such as warehouses for storing and querying these data. Results: We have designed a relational genomic warehouse with an original multi-layer architecture made of a databases layer and an entities layer. We describe a new querying module, GenoQuery, which is based on this architecture. We use the entities layer to define mixed queries. These mixed queries allow searching for instances of biological entities and their properties in the different databases, without specifying in which database they should be found. Accordingly, we further introduce the central notion of alternative queries. Such queries have the same meaning as the original mixed queries, while exploiting complementarities yielded by the various integrated databases of the warehouse. We explain how GenoQuery computes all the alternative queries of a given mixed query. We illustrate how useful this querying module is by means of a thorough example. Availability: http://www.lri.fr/~lemoine/GenoQuery/ Contact: chris@lri.fr, lemoine@lri.fr PMID:18586731
G-Bean: an ontology-graph based web tool for biomedical literature retrieval
2014-01-01
Background Currently, most people use NCBI's PubMed to search the MEDLINE database, an important bibliographical information source for life science and biomedical information. However, PubMed has some drawbacks that make it difficult to find relevant publications pertaining to users' individual intentions, especially for non-expert users. To ameliorate the disadvantages of PubMed, we developed G-Bean, a graph based biomedical search engine, to search biomedical articles in MEDLINE database more efficiently. Methods G-Bean addresses PubMed's limitations with three innovations: (1) Parallel document index creation: a multithreaded index creation strategy is employed to generate the document index for G-Bean in parallel; (2) Ontology-graph based query expansion: an ontology graph is constructed by merging four major UMLS (Version 2013AA) vocabularies, MeSH, SNOMEDCT, CSP and AOD, to cover all concepts in National Library of Medicine (NLM) database; a Personalized PageRank algorithm is used to compute concept relevance in this ontology graph and the Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme is used to re-rank the concepts. The top 500 ranked concepts are selected for expanding the initial query to retrieve more accurate and relevant information; (3) Retrieval and re-ranking of documents based on user's search intention: after the user selects any article from the existing search results, G-Bean analyzes user's selections to determine his/her true search intention and then uses more relevant and more specific terms to retrieve additional related articles. The new articles are presented to the user in the order of their relevance to the already selected articles. Results Performance evaluation with 106 OHSUMED benchmark queries shows that G-Bean returns more relevant results than PubMed does when using these queries to search the MEDLINE database. 
PubMed could not even return any search result for some OHSUMED queries because it failed to form the appropriate Boolean query statement automatically from the natural language query strings. G-Bean is available at http://bioinformatics.clemson.edu/G-Bean/index.php. Conclusions G-Bean addresses PubMed's limitations with ontology-graph based query expansion, automatic document indexing, and user search intention discovery. It shows significant advantages in finding relevant articles from the MEDLINE database to meet the information need of the user. PMID:25474588
G-Bean: an ontology-graph based web tool for biomedical literature retrieval.
Wang, James Z; Zhang, Yuanyuan; Dong, Liang; Li, Lin; Srimani, Pradip K; Yu, Philip S
2014-01-01
Currently, most people use NCBI's PubMed to search the MEDLINE database, an important bibliographical information source for life science and biomedical information. However, PubMed has some drawbacks that make it difficult to find relevant publications pertaining to users' individual intentions, especially for non-expert users. To ameliorate the disadvantages of PubMed, we developed G-Bean, a graph based biomedical search engine, to search biomedical articles in MEDLINE database more efficiently. G-Bean addresses PubMed's limitations with three innovations: (1) Parallel document index creation: a multithreaded index creation strategy is employed to generate the document index for G-Bean in parallel; (2) Ontology-graph based query expansion: an ontology graph is constructed by merging four major UMLS (Version 2013AA) vocabularies, MeSH, SNOMEDCT, CSP and AOD, to cover all concepts in National Library of Medicine (NLM) database; a Personalized PageRank algorithm is used to compute concept relevance in this ontology graph and the Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme is used to re-rank the concepts. The top 500 ranked concepts are selected for expanding the initial query to retrieve more accurate and relevant information; (3) Retrieval and re-ranking of documents based on user's search intention: after the user selects any article from the existing search results, G-Bean analyzes user's selections to determine his/her true search intention and then uses more relevant and more specific terms to retrieve additional related articles. The new articles are presented to the user in the order of their relevance to the already selected articles. Performance evaluation with 106 OHSUMED benchmark queries shows that G-Bean returns more relevant results than PubMed does when using these queries to search the MEDLINE database. 
PubMed could not even return any search result for some OHSUMED queries because it failed to form the appropriate Boolean query statement automatically from the natural language query strings. G-Bean is available at http://bioinformatics.clemson.edu/G-Bean/index.php. G-Bean addresses PubMed's limitations with ontology-graph based query expansion, automatic document indexing, and user search intention discovery. It shows significant advantages in finding relevant articles from the MEDLINE database to meet the information need of the user.
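The concept-relevance step described above relies on Personalized PageRank, in which the random walk restarts at the concepts matching the user's query instead of at a uniformly random node. The following is a sketch of the general algorithm by power iteration, not of G-Bean's implementation: the toy concept graph, the damping value of 0.85 and the function name are assumptions, and every neighbour is assumed to appear as a key of the graph.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration Personalized PageRank on a small directed graph.

    graph: dict mapping node -> list of neighbour nodes (all present as keys)
    seeds: nodes matching the query; the random walk restarts there
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * restart[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for nb in neighbours:
                    new_rank[nb] += share
            else:
                # Dangling node: return its mass to the seed concepts.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * restart[n]
        rank = new_rank
    return rank


# Toy ontology graph (hypothetical concepts and links, for illustration only).
concepts = {
    "asthma": ["bronchial disease", "wheezing"],
    "bronchial disease": ["asthma", "lung disease"],
    "wheezing": ["asthma"],
    "lung disease": ["bronchial disease"],
    "fracture": ["bone disease"],
    "bone disease": ["fracture"],
}
scores = personalized_pagerank(concepts, seeds={"asthma"})
for concept in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(concept, round(scores[concept], 3))
```

Concepts unreachable from the seed (here the bone-related pair) keep a score of zero, so only concepts connected to the query are candidates for expansion; a TF-IDF re-ranking step such as the one the abstract mentions would then be applied to the top-scoring concepts.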
Griffon, N; Schuers, M; Dhombres, F; Merabti, T; Kerdelhué, G; Rollin, L; Darmoni, S J
2016-08-02
Despite international initiatives like Orphanet, it remains difficult to find up-to-date information about rare diseases. The aim of this study is to propose an exhaustive set of queries for PubMed based on terminological knowledge and to evaluate it versus the queries based on expertise provided by the most frequently used resource in Europe: Orphanet. Four rare disease terminologies (MeSH, OMIM, HPO and HRDO) were manually mapped to each other, permitting the automatic creation of expanded terminological queries for rare diseases. For 30 rare diseases, 30 citations retrieved by the Orphanet expert query and/or the query based on terminological knowledge were assessed for relevance by two independent reviewers unaware of the query's origin. An adjudication procedure was used to resolve any discrepancy. Precision, relative recall and F-measure were all computed. For each Orphanet rare disease (n = 8982), there was a corresponding terminological query, in contrast with only 2284 queries provided by Orphanet. Only 553 citations were evaluated due to queries with 0 or only a few hits. There was no significant difference in precision between the Orpha query and the terminological query (0.61 vs 0.52; p = 0.13). Nevertheless, terminological queries retrieved more citations more often than Orpha queries (0.57 vs. 0.33; p = 0.01). Interestingly, Orpha queries seemed to retrieve older citations than terminological queries (p < 0.0001). The terminological queries proposed in this study are now available for all rare diseases. They may be a useful tool for both precision- and recall-oriented literature searches.
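Precision, relative recall and F-measure can be computed per query from sets of citation identifiers. The sketch below assumes the usual pooled definition of relative recall (relevant citations found by one query divided by the relevant citations found by all compared queries together); the abstract does not spell out its exact definition, and the identifiers and function name are hypothetical.

```python
def evaluate(retrieved, relevant_pool):
    """Return (precision, relative recall, F-measure) for one query.

    retrieved: set of citation ids returned by the query
    relevant_pool: citations judged relevant across ALL compared queries;
        used as the recall denominator because the true number of relevant
        citations in the database is unknown
    """
    hits = retrieved & relevant_pool
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_pool) if relevant_pool else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical citation ids, for illustration only.
expert_query = {1, 2, 3, 4}
terminological_query = {3, 4, 5, 6, 7, 8}
judged_relevant = {2, 3, 4, 5, 6}  # pooled relevance judgements

print(evaluate(expert_query, judged_relevant))
print(evaluate(terminological_query, judged_relevant))
```

In this toy case the broader query loses some precision but gains relative recall, the same kind of trade-off the abstract reports between the two query types.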
Querying Co-regulated Genes on Diverse Gene Expression Datasets Via Biclustering.
Deveci, Mehmet; Küçüktunç, Onur; Eren, Kemal; Bozdağ, Doruk; Kaya, Kamer; Çatalyürek, Ümit V
2016-01-01
Rapid development and increasing popularity of gene expression microarrays have resulted in a number of studies on the discovery of co-regulated genes. One important way of discovering such co-regulations is the query-based search since gene co-expressions may indicate a shared role in a biological process. Although there exist promising query-driven search methods adapting clustering, they fail to capture many genes that function in the same biological pathway because microarray datasets are fraught with spurious samples or samples of diverse origin, or the pathways might be regulated under only a subset of samples. On the other hand, a class of clustering algorithms known as biclustering algorithms which simultaneously cluster both the items and their features are useful while analyzing gene expression data, or any data in which items are related in only a subset of their samples. This means that genes need not be related in all samples to be clustered together. Because many genes only interact under specific circumstances, biclustering may recover the relationships that traditional clustering algorithms can easily miss. In this chapter, we briefly summarize the literature using biclustering for querying co-regulated genes. Then we present a novel biclustering approach and evaluate its performance by a thorough experimental analysis.
Analyzing Document Retrievability in Patent Retrieval Settings
NASA Astrophysics Data System (ADS)
Bashir, Shariq; Rauber, Andreas
Most information retrieval settings, such as web search, are typically precision-oriented, i.e. they focus on retrieving a small number of highly relevant documents. However, in specific domains, such as patent retrieval or law, recall becomes more relevant than precision: in these cases the goal is to find all relevant documents, requiring algorithms to be tuned more towards recall at the cost of precision. This raises important questions with respect to retrievability and search engine bias: depending on how the similarity between a query and documents is measured, certain documents may be more or less retrievable in certain systems, up to some documents not being retrievable at all within common threshold settings. Biases may be oriented towards popularity of documents (increasing weight of references), towards length of documents, favour the use of rare or common words; rely on structural information such as metadata or headings, etc. Existing accessibility measurement techniques are limited as they measure retrievability with respect to all possible queries. In this paper, we improve accessibility measurement by considering sets of relevant and irrelevant queries for each document. This simulates how recall oriented users create their queries when searching for relevant information. We evaluate retrievability scores using a corpus of patents from US Patent and Trademark Office.
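The idea of a per-document retrievability score can be sketched as below. The abstract does not give the formula, so this assumes a simple cumulative form: a document scores one point for every query that ranks it within a cutoff. The term-overlap ranker, the cutoff, and the toy corpus are all hypothetical stand-ins for a real retrieval system.

```python
def rank_documents(query, docs):
    """Rank documents by the number of query terms they contain (toy ranker)."""
    q_terms = set(query.lower().split())
    scores = {
        doc_id: len(q_terms & set(text.lower().split()))
        for doc_id, text in docs.items()
    }
    # Keep only matching documents; sort by descending score, then by ID.
    matching = [d for d in scores if scores[d] > 0]
    return sorted(matching, key=lambda d: (-scores[d], d))


def retrievability(docs, queries, cutoff):
    """Count, for each document, the queries that rank it within `cutoff`."""
    counts = {doc_id: 0 for doc_id in docs}
    for query in queries:
        for doc_id in rank_documents(query, docs)[:cutoff]:
            counts[doc_id] += 1
    return counts


# Hypothetical mini-corpus and query set.
docs = {
    "p1": "wireless antenna signal amplifier",
    "p2": "antenna mount bracket",
    "p3": "chemical coating process",
}
queries = ["antenna signal", "antenna bracket", "signal amplifier"]

scores = retrievability(docs, queries, cutoff=1)
print(scores)
```

A document that no query in the set can bring above the cutoff, like `p3` here, gets a score of zero, which is the kind of bias the abstract is concerned with.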
Sehnal, David; Pravda, Lukáš; Svobodová Vařeková, Radka; Ionescu, Crina-Maria; Koča, Jaroslav
2015-07-01
Well defined biomacromolecular patterns such as binding sites, catalytic sites, specific protein or nucleic acid sequences, etc. precisely modulate many important biological phenomena. We introduce PatternQuery, a web-based application designed for detection and fast extraction of such patterns. The application uses a unique query language with Python-like syntax to define the patterns that will be extracted from datasets provided by the user, or from the entire Protein Data Bank (PDB). Moreover, the database-wide search can be restricted using a variety of criteria, such as PDB ID, resolution, and organism of origin, to provide only relevant data. The extraction generally takes a few seconds for several hundreds of entries, up to approximately one hour for the whole PDB. The detected patterns are made available for download to enable further processing, as well as presented in a clear tabular and graphical form directly in the browser. The unique design of the language and the provided service could pave the way towards novel PDB-wide analyses, which were either difficult or unfeasible in the past. The application is available free of charge at http://ncbr.muni.cz/PatternQuery. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
ERIC Educational Resources Information Center
Lonergan, David
2010-01-01
Situations arise almost every day of a reference librarian's working life in which a student (or a professor, a colleague, or a person off the street) asks a straightforward question that the librarian recognizes in a specific way. On the other hand, there are specific questions that come up fairly often and routinely lead to further queries. What…
An advanced web query interface for biological databases
Latendresse, Mario; Karp, Peter D.
2010-01-01
Although most web-based biological databases (DBs) offer some type of web-based form to allow users to author DB queries, these query forms are quite restricted in the complexity of DB queries that they can formulate. They can typically query only one DB, and can query only a single type of object at a time (e.g. genes) with no possible interaction between the objects—that is, in SQL parlance, no joins are allowed between DB objects. Writing precise queries against biological DBs is usually left to a programmer skillful enough in complex DB query languages like SQL. We present a web interface for building precise queries for biological DBs that can construct much more precise queries than most web-based query forms, yet that is user friendly enough to be used by biologists. It supports queries containing multiple conditions, and connecting multiple object types without using the join concept, which is unintuitive to biologists. This interactive web interface is called the Structured Advanced Query Page (SAQP). Users interactively build up a wide range of query constructs. Interactive documentation within the SAQP describes the schema of the queried DBs. The SAQP is based on BioVelo, a query language based on list comprehension. The SAQP is part of the Pathway Tools software and is available as part of several bioinformatics web sites powered by Pathway Tools, including the BioCyc.org site that contains more than 500 Pathway/Genome DBs. PMID:20624715
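How a list comprehension can connect two object types without an explicit join can be illustrated with a Python analogue. This is not BioVelo syntax and not the SAQP's generated code; the records, field names, and threshold are invented for the example.

```python
# Hypothetical records for two object types.
genes = [
    {"id": "G1", "name": "geneA", "product": "P1"},
    {"id": "G2", "name": "geneB", "product": "P2"},
    {"id": "G3", "name": "geneC", "product": "P3"},
]
proteins = [
    {"id": "P1", "mw_kd": 28.7},
    {"id": "P2", "mw_kd": 42.9},
    {"id": "P3", "mw_kd": 116.5},
]

# "Genes whose product is heavier than 40 kD": two generators and two
# conditions in one comprehension, with no join keyword anywhere.
heavy = [
    (g["name"], p["mw_kd"])
    for g in genes
    for p in proteins
    if g["product"] == p["id"] and p["mw_kd"] > 40.0
]
print(heavy)
```

The condition linking a gene to its product plays the role a join would play in SQL, but it reads as an ordinary filter on the objects being enumerated.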
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liang, Ying; Gao, Yajun; Jones, Alan M.
The three-member family of Arabidopsis extra-large G proteins (XLG1-3) defines the prototype of an atypical Gα subunit in the heterotrimeric G protein complex. Some recent evidence indicates that XLG subunits operate along with the Gβγ dimer in root morphology, stress responsiveness, and cytokinin-induced development; however, downstream targets of activated XLG proteins in the stress pathways are rarely known. In order to assemble a set of candidate XLG-targeted proteins, a yeast two-hybrid complementation-based screen was performed using XLG protein baits to query interactions between XLG and partner proteins found in glucose-treated seedlings, roots, and Arabidopsis cells in culture. Seventy-two interactors were identified and >60% of a test set displayed in vivo interaction with XLG proteins. Gene co-expression analysis shows that >70% of the interactors are positively correlated with the corresponding XLG partners. Gene Ontology enrichment for all the candidates indicates stress responses and posits a molecular mechanism involving a specific set of transcription factor partners to XLG. Genes encoding two of these transcription factors, SZF1 and 2, require XLG proteins for full NaCl-induced expression. Furthermore, the subcellular localization of the XLG proteins in the nucleus, endosome, and plasma membrane is dependent on the specific interacting partner.
Liang, Ying; Gao, Yajun; Jones, Alan M.
2017-06-13
The three-member family of Arabidopsis extra-large G proteins (XLG1-3) defines the prototype of an atypical Gα subunit in the heterotrimeric G protein complex. Some recent evidence indicates that XLG subunits operate along with the Gβγ dimer in root morphology, stress responsiveness, and cytokinin-induced development; however, downstream targets of activated XLG proteins in the stress pathways are rarely known. In order to assemble a set of candidate XLG-targeted proteins, a yeast two-hybrid complementation-based screen was performed using XLG protein baits to query interactions between XLG and partner proteins found in glucose-treated seedlings, roots, and Arabidopsis cells in culture. Seventy-two interactors were identified and >60% of a test set displayed in vivo interaction with XLG proteins. Gene co-expression analysis shows that >70% of the interactors are positively correlated with the corresponding XLG partners. Gene Ontology enrichment for all the candidates indicates stress responses and posits a molecular mechanism involving a specific set of transcription factor partners to XLG. Genes encoding two of these transcription factors, SZF1 and 2, require XLG proteins for full NaCl-induced expression. Furthermore, the subcellular localization of the XLG proteins in the nucleus, endosome, and plasma membrane is dependent on the specific interacting partner.
Standard biological parts knowledgebase.
Galdzicki, Michal; Rodriguez, Cesar; Chandran, Deepak; Sauro, Herbert M; Gennari, John H
2011-02-24
We have created the Knowledgebase of Standard Biological Parts (SBPkb) as a publicly accessible Semantic Web resource for synthetic biology (sbolstandard.org). The SBPkb allows researchers to query and retrieve standard biological parts for research and use in synthetic biology. Its initial version includes all of the information about parts stored in the Registry of Standard Biological Parts (partsregistry.org). SBPkb transforms this information so that it is computable, using our semantic framework for synthetic biology parts. This framework, known as SBOL-semantic, was built as part of the Synthetic Biology Open Language (SBOL), a project of the Synthetic Biology Data Exchange Group. SBOL-semantic represents commonly used synthetic biology entities, and its purpose is to improve the distribution and exchange of descriptions of biological parts. In this paper, we describe the data and our methods for transformation to SBPkb, and we demonstrate the value of our knowledgebase with a set of sample queries. We use RDF technology and SPARQL queries to retrieve candidate "promoter" parts that are known to be both negatively and positively regulated. This method provides new web-based data access, enabling searches for parts that are not currently possible.
Assistant Superintendent Hiring Criteria Used by Golf Course Superintendents
ERIC Educational Resources Information Center
Schlossberg, Maxim J.; Greene, Wilmot; Karnok, Keith J.
2004-01-01
Of the many opportunities available upon graduating, most turfgrass management/turfgrass science students seek assistant golf course superintendent positions. By tradition, faculty are responsible for preparing graduates to serve as capable assistant superintendents. Moreover, faculty are queried for guidance on how to best compete for these…
SPARQL Query Re-writing Using Partonomy Based Transformation Rules
NASA Astrophysics Data System (ADS)
Jain, Prateek; Yeh, Peter Z.; Verma, Kunal; Henson, Cory A.; Sheth, Amit P.
Often the information present in a spatial knowledge base is represented at a different level of granularity and abstraction than the query constraints. For querying ontologies containing spatial information, the precise relationships between spatial entities have to be specified in the basic graph pattern of a SPARQL query, which can result in long and complex queries. We present a novel approach to help users intuitively write SPARQL queries to query spatial data, rather than relying on knowledge of the ontology structure. Our framework re-writes queries, using transformation rules to exploit part-whole relations between geographical entities to address the mismatches between query constraints and the knowledge base. Our experiments were performed on entirely third-party datasets and queries. Evaluations were performed on the Geonames dataset using questions from the National Geographic Bee serialized into SPARQL, and on the British Administrative Geography Ontology using questions from a popular trivia website. These experiments demonstrate high precision in retrieval of results and ease in writing queries.
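The granularity mismatch described here, and the use of part-whole relations to bridge it, can be sketched in plain Python. This is only an illustration of the idea and not the authors' rewriting rules or SPARQL machinery: the `part_of` map, the `located_in` facts, and the function names are hypothetical.

```python
# Hypothetical part-whole hierarchy: child region -> containing region.
part_of = {
    "Dayton": "Ohio",
    "Ohio": "United States",
    "Houston": "Texas",
    "Texas": "United States",
    "Oxford": "England",
    "England": "United Kingdom",
}

# Hypothetical facts, stored at city granularity.
facts = [
    ("Wright State University", "located_in", "Dayton"),
    ("Rice University", "located_in", "Houston"),
    ("University of Oxford", "located_in", "Oxford"),
]


def containing_regions(entity):
    """Follow part_of links upward, guarding against cycles."""
    seen = []
    while entity in part_of and part_of[entity] not in seen:
        entity = part_of[entity]
        seen.append(entity)
    return seen


def expand_region(region):
    """Rewrite a region constraint into the region plus all of its parts."""
    parts = {e for e in part_of if region in containing_regions(e)}
    return {region} | parts


def find_located_in(region):
    """Answer a coarse query against facts stored at a finer granularity."""
    targets = expand_region(region)
    return sorted(s for s, p, o in facts if p == "located_in" and o in targets)


print(find_located_in("United States"))
```

The user asks at country level while the facts sit at city level; expanding the constraint through the partonomy lets the short query match without the user spelling out the intermediate relations.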
NASA Technical Reports Server (NTRS)
Brown, David B.
1988-01-01
A history of the Query Utility Environment for Software Testing (QUEST)/Ada is presented. A fairly comprehensive literature review which is targeted toward issues of Ada testing is given. The definition of the system structure and the high level interfaces are then presented. The design of the three major components is described. The QUEST/Ada IORL System Specifications to this point in time are included in the Appendix. A paper is also included in the appendix which gives statistical evidence of the validity of the test case generation approach which is being integrated into QUEST/Ada.
NASA Astrophysics Data System (ADS)
Arenas, Marcelo; Gutierrez, Claudio; Pérez, Jorge
The Resource Description Framework (RDF) is the standard data model for representing information about World Wide Web resources. In January 2008, the W3C released its recommendation for querying RDF data, a query language called SPARQL. In this chapter, we give a detailed description of the semantics of this language. We start by focusing on the definition of a formal semantics for the core part of SPARQL, and then move to the definition for the entire language, including all the features in the W3C specification of SPARQL, such as blank nodes in graph patterns and bag semantics for solutions.
Graphical modeling and query language for hospitals.
Barzdins, Janis; Barzdins, Juris; Rencis, Edgars; Sostaks, Agris
2013-01-01
So far there has been little evidence that implementation of health information technologies (HIT) is leading to health care cost savings. One of the reasons for this lack of impact by HIT likely lies in the complexity of business process ownership in hospitals. The goal of our research is to develop a business model-based method for hospital use which would allow doctors to directly retrieve ad hoc information from various hospital databases. We have developed a special domain-specific process modelling language called MedMod. Formally, we define the MedMod language as a profile on UML Class diagrams, but we also demonstrate it on examples, where we explain the semantics of all its elements informally. Moreover, we have developed the Process Query Language (PQL), which is based on the MedMod process definition language. The purpose of PQL is to allow a doctor to query (filter) runtime data of a hospital's processes described using MedMod. The MedMod language tries to overcome deficiencies in existing process modeling languages by allowing one to specify a loosely defined sequence of the steps to be performed in the clinical process. The main advantages of PQL lie in two areas, usability and efficiency: 1) the view on data through the "glasses" of a familiar process; 2) the simple and easy-to-perceive means of setting filtering conditions, which require no more expertise than using spreadsheet applications; 3) the dynamic response to each step in the construction of the complete query, which greatly shortens the learning curve and reduces the error rate; and 4) the selected means of filtering and data retrieval, which allow queries to be executed in O(n) time with respect to the size of the dataset. We plan to continue developing this project with three further steps. First, we are planning to develop user-friendly graphical editors for the MedMod process modeling and query languages.
The second step is to evaluate the usability of the proposed language and tool, involving physicians from several hospitals in Latvia and working with real data from these hospitals. Our third step is to develop an efficient implementation of the query language.
SAFE: SPARQL Federation over RDF Data Cubes with Access Control.
Khan, Yasar; Saleem, Muhammad; Mehdi, Muntazir; Hogan, Aidan; Mehmood, Qaiser; Rebholz-Schuhmann, Dietrich; Sahay, Ratnesh
2017-02-01
Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains, resources are sensitive and access to these resources is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain, real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisers, etc. Therefore, the legal and ethical concerns of (i) preserving the anonymity of patients (or clinical subjects) and (ii) respecting data ownership through access control are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data. We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances; it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change.
We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.
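The coupling of source selection with access control can be sketched with an index that, as the abstract describes, stores only predicates and endpoints. This is a deliberately simplified illustration and not SAFE's algorithm: it omits the join-aware pruning entirely, and the endpoint URLs, predicate names, and `select_sources` function are hypothetical.

```python
def select_sources(query_predicates, predicate_index, allowed_endpoints):
    """Map each query predicate to the authorised endpoints that expose it.

    Endpoints that do not list the predicate (irrelevant) or that the user
    may not access (unauthorised) are never contacted.
    """
    selection = {}
    for predicate in query_predicates:
        candidates = predicate_index.get(predicate, set())
        selection[predicate] = sorted(candidates & allowed_endpoints)
    return selection


# Hypothetical index: predicates and endpoints only, no data instances.
predicate_index = {
    "qb:dataSet": {
        "http://hospital-a.example/sparql",
        "http://lab-b.example/sparql",
    },
    "ex:patientCount": {
        "http://hospital-a.example/sparql",
        "http://trial-c.example/sparql",
    },
}

# Hypothetical access rights for one user profile.
allowed = {"http://hospital-a.example/sparql", "http://lab-b.example/sparql"}

plan = select_sources(["qb:dataSet", "ex:patientCount"], predicate_index, allowed)
print(plan)
```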
2013-01-01
Background The Internet’s potential impact on suicide is of major public health interest as easy online access to pro-suicide information or specific suicide methods may increase suicide risk among vulnerable Internet users. Little is known, however, about users’ actual searching and browsing behaviors of online suicide-related information. Objective To investigate what webpages people actually clicked on after searching with suicide-related queries on a search engine and to examine what queries people used to get access to pro-suicide websites. Methods A retrospective observational study was done. We used a web search dataset released by America Online (AOL). The dataset was randomly sampled from all AOL subscribers’ web queries between March and May 2006 and generated by 657,000 service subscribers. Results We found 5526 search queries (0.026%, 5526/21,000,000) that included the keyword "suicide". The 5526 search queries included 1586 different search terms and were generated by 1625 unique subscribers (0.25%, 1625/657,000). Of these queries, 61.38% (3392/5526) were followed by users clicking on a search result. Of these 3392 queries, 1344 (39.62%) webpages were clicked on by 930 unique users but only 1314 of those webpages were accessible during the study period. Each clicked-through webpage was classified into 11 categories. The categories of the most visited webpages were: entertainment (30.13%; 396/1314), scientific information (18.31%; 240/1314), and community resources (14.53%; 191/1314). Among the 1314 accessed webpages, we could identify only two pro-suicide websites. We found that the search terms used to access these sites included “commiting suicide with a gas oven”, “hairless goat”, “pictures of murder by strangulation”, and “photo of a severe burn”. A limitation of our study is that the database may be dated and confined to mainly English webpages. 
Conclusions Searching or browsing suicide-related or pro-suicide webpages was uncommon, although a small group of users did access websites that contain detailed suicide method information. PMID:23305632
Genomes as geography: using GIS technology to build interactive genome feature maps
Dolan, Mary E; Holden, Constance C; Beard, M Kate; Bult, Carol J
2006-01-01
Background Many commonly used genome browsers display sequence annotations and related attributes as horizontal data tracks that can be toggled on and off according to user preferences. Most genome browsers use only simple keyword searches and limit the display of detailed annotations to one chromosomal region of the genome at a time. We have employed concepts, methodologies, and tools that were developed for the display of geographic data to develop a Genome Spatial Information System (GenoSIS) for displaying genomes spatially, and interacting with genome annotations and related attribute data. In contrast to the paradigm of horizontally stacked data tracks used by most genome browsers, GenoSIS uses the concept of registered spatial layers composed of spatial objects for integrated display of diverse data. In addition to basic keyword searches, GenoSIS supports complex queries, including spatial queries, and dynamically generates genome maps. Our adaptation of the geographic information system (GIS) model in a genome context supports spatial representation of genome features at multiple scales with a versatile and expressive query capability beyond that supported by existing genome browsers. Results We implemented an interactive genome sequence feature map for the mouse genome in GenoSIS, an application that uses ArcGIS, a commercially available GIS software system. The genome features and their attributes are represented as spatial objects and data layers that can be toggled on and off according to user preferences or displayed selectively in response to user queries. GenoSIS supports the generation of custom genome maps in response to complex queries about genome features based on both their attributes and locations. Our example application of GenoSIS to the mouse genome demonstrates the powerful visualization and query capability of mature GIS technology applied in a novel domain. 
Conclusion Mapping tools developed specifically for geographic data can be exploited to display, explore and interact with genome data. The approach we describe here is organism independent and is equally useful for linear and circular chromosomes. One of the unique capabilities of GenoSIS compared to existing genome browsers is the capacity to generate genome feature maps dynamically in response to complex attribute and spatial queries. PMID:16984652
Titanbrowse: a new paradigm for access, visualization and analysis of hyperspectral imaging
NASA Astrophysics Data System (ADS)
Penteado, Paulo F.
2016-10-01
Currently there are archives and tools to explore remote sensing imaging, but these lack some functionality needed for hyperspectral imagers: 1) Querying and serving only whole datacubes is not enough, since in each cube there is typically a large variation in observation geometry over the spatial pixels. Thus, often the most useful unit for selecting observations of interest is not a whole cube but rather a single spectrum. 2) Pixel-specific geometric data included in the standard pipelines is calculated at only one point per pixel. Particularly for selections of pixels from many different cubes, or observations near the limb, it is necessary to know the actual extent of each pixel. 3) Database queries need to select not only by metadata, but also by the spectral data. For instance, one query might look for atypical values of some band, or atypical relations between bands, denoting spectral features (such as ratios or differences between bands). 4) There is a need to evaluate arbitrary, dynamically defined, complex functions of the data (beyond just simple arithmetic operations), both for selection in the queries and for visualization, to interactively tune the queries to the observations of interest. 5) Making the most useful query for some analysis often requires interactive visualization integrated with data selection and processing, because the user needs to explore how different functions of the data vary over the observations without having to download data and import it into visualization software. 6) Complementary to interactive use, an API allowing programmatic access to the system is needed for systematic data analyses. 7) Direct access to calibrated and georeferenced data is needed, without having to download data and software and learn to process it. We present titanbrowse, a database, exploration and visualization system for Cassini VIMS observations of Titan, designed to fulfill the aforementioned needs.
While it originally ran on data in the user's computer, we are now developing an online version, so that users do not need to download software and data. The server, which we maintain, processes the queries and communicates the results to the client the user runs. http://ppenteado.net/titanbrowse.
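Selecting individual spectra by both metadata and a function of the spectral data, as points 1, 3 and 4 of the abstract call for, can be sketched as below. This is not the titanbrowse API; the record layout, band indices, thresholds, and function names are hypothetical.

```python
def band_ratio(spectrum, numerator, denominator):
    """Ratio between two bands of a single spectrum (band indices assumed)."""
    return spectrum["bands"][numerator] / spectrum["bands"][denominator]


def select_spectra(spectra, predicate):
    """Return the IDs of single spectra (not whole cubes) matching a predicate."""
    return [s["id"] for s in spectra if predicate(s)]


# Hypothetical per-pixel records: geometry metadata plus band values.
spectra = [
    {"id": "cubeA/px0", "emission_deg": 12.0, "bands": [0.10, 0.05, 0.20]},
    {"id": "cubeA/px1", "emission_deg": 75.0, "bands": [0.12, 0.02, 0.21]},
    {"id": "cubeB/px7", "emission_deg": 30.0, "bands": [0.09, 0.01, 0.18]},
]

# The selection mixes a metadata condition with a dynamically defined
# function of the spectral data.
hits = select_spectra(
    spectra,
    lambda s: s["emission_deg"] < 60.0 and band_ratio(s, 0, 1) > 5.0,
)
print(hits)
```

Passing the predicate as a function is one way to let the selection criterion be redefined interactively; a real system would evaluate it on the server rather than over an in-memory list.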
Bitsch, A; Jacobi, S; Melber, C; Wahnschaffe, U; Simetska, N; Mangelsdorf, I
2006-12-01
A database for repeated dose toxicity data has been developed. Studies were selected by data quality. Review documents or risk assessments were used to get a pre-screened selection of available valid data. The structure of the chemicals should be rather simple for well-defined chemical categories. The database consists of three core data sets for each chemical: (1) structural features and physico-chemical data, (2) data on study design, (3) study results. To allow consistent queries with a high degree of standardization, categories and glossaries were developed for relevant parameters. At present, the database consists of 364 chemicals investigated in 1018 studies, which resulted in a total of 6002 specific effects. Standard queries have been developed, which allow analyzing the influence of structural features or PC data on LOELs, target organs and effects. Furthermore, it can be used as an expert system. First queries have shown that the database is a very valuable tool.
Performing private database queries in a real-world environment using a quantum protocol.
Chan, Philip; Lucio-Martinez, Itzel; Mo, Xiaofan; Simon, Christoph; Tittel, Wolfgang
2014-06-10
In the well-studied cryptographic primitive 1-out-of-N oblivious transfer, a user retrieves a single element from a database of size N without the database learning which element was retrieved. While it has previously been shown that a secure implementation of 1-out-of-N oblivious transfer is impossible against arbitrarily powerful adversaries, recent research has revealed an interesting class of private query protocols based on quantum mechanics in a cheat sensitive model. Specifically, a practical protocol does not need to guarantee that the database provider cannot learn what element was retrieved if doing so carries the risk of detection. The latter is sufficient motivation to keep a database provider honest. However, none of the previously proposed protocols could cope with noisy channels. Here we present a fault-tolerant private query protocol, in which the novel error correction procedure is integral to the security of the protocol. Furthermore, we present a proof-of-concept demonstration of the protocol over a deployed fibre.
Clean Air Markets - Compliance Query Wizard
The Compliance Query Wizard is part of a suite of Clean Air Markets-related tools that are accessible at http://ampd.epa.gov/ampd/. The Compliance module provides final compliance results. Using the Compliance Query Wizard, the user can find compliance information associated with specific programs, facilities, states or time frames. Quick Reports and Prepackaged Datasets are also available for data that are commonly requested. Final compliance results are available for all years since 1995 for the Acid Rain Program and for the various NOx trading programs EPA has operated since 1999. EPA's Clean Air Markets Division (CAMD) includes several market-based regulatory programs designed to improve air quality and ecosystems. The most well-known of these programs are EPA's Acid Rain Program and the NOx Programs, which reduce emissions of sulfur dioxide (SO2) and nitrogen oxides (NOx), compounds that adversely affect air quality, the environment, and public health. CAMD also plays an integral role in the development and implementation of the Clean Air Interstate Rule (CAIR).
Performing private database queries in a real-world environment using a quantum protocol
Chan, Philip; Lucio-Martinez, Itzel; Mo, Xiaofan; Simon, Christoph; Tittel, Wolfgang
2014-01-01
In the well-studied cryptographic primitive 1-out-of-N oblivious transfer, a user retrieves a single element from a database of size N without the database learning which element was retrieved. While it has previously been shown that a secure implementation of 1-out-of-N oblivious transfer is impossible against arbitrarily powerful adversaries, recent research has revealed an interesting class of private query protocols based on quantum mechanics in a cheat sensitive model. Specifically, a practical protocol does not need to guarantee that the database provider cannot learn what element was retrieved if doing so carries the risk of detection. The latter is sufficient motivation to keep a database provider honest. However, none of the previously proposed protocols could cope with noisy channels. Here we present a fault-tolerant private query protocol, in which the novel error correction procedure is integral to the security of the protocol. Furthermore, we present a proof-of-concept demonstration of the protocol over a deployed fibre. PMID:24913129
Interactive and Versatile Navigation of Structural Databases.
Korb, Oliver; Kuhn, Bernd; Hert, Jérôme; Taylor, Neil; Cole, Jason; Groom, Colin; Stahl, Martin
2016-05-12
We present CSD-CrossMiner, a novel tool for pharmacophore-based searches in crystal structure databases. Intuitive pharmacophore queries describing, among others, protein-ligand interaction patterns, ligand scaffolds, or protein environments can be built and modified interactively. Matching crystal structures are overlaid onto the query and visualized as soon as they are available, enabling the researcher to quickly modify a hypothesis on the fly. We exemplify the utility of the approach by showing applications relevant to real-world drug discovery projects, including the identification of novel fragments for a specific protein environment or scaffold hopping. The ability to concurrently search protein-ligand binding sites extracted from the Protein Data Bank (PDB) and small organic molecules from the Cambridge Structural Database (CSD) using the same pharmacophore query further emphasizes the flexibility of CSD-CrossMiner. We believe that CSD-CrossMiner closes an important gap in mining structural data and will allow users to extract more value from the growing number of available crystal structures.
Clean Air Markets - Allowances Query Wizard
The Allowances Query Wizard is part of a suite of Clean Air Markets-related tools that are accessible at http://camddataandmaps.epa.gov/gdm/index.cfm. The Allowances module allows the user to view allowance data associated with EPA's emissions trading programs. Allowance data can be specified and organized using the Allowance Query Wizard to find allowances information associated with specific accounts, companies, transactions, programs, facilities, representatives, allowance type, or by date. Quick Reports and Prepackaged Datasets are also available for data that are commonly requested. EPA's Clean Air Markets Division (CAMD) includes several market-based regulatory programs designed to improve air quality and ecosystems. The most well-known of these programs are EPA's Acid Rain Program and the NOx Programs, which reduce emissions of sulfur dioxide (SO2) and nitrogen oxides (NOx), compounds that adversely affect air quality, the environment, and public health. CAMD also plays an integral role in the development and implementation of the Clean Air Interstate Rule (CAIR).
Cadastral Positioning Accuracy Improvement: a Case Study in Malaysia
NASA Astrophysics Data System (ADS)
Hashim, N. M.; Omar, A. H.; Omar, K. M.; Abdullah, N. M.; Yatim, M. H. M.
2016-09-01
A cadastral map is a parcel-based information source specifically designed to delimit land boundaries. In Malaysia, the cadastral map is under the authority of the Department of Surveying and Mapping Malaysia (DSMM). With the growth of spatial technology, especially Geographical Information Systems (GIS), DSMM decided to modernize and reform its cadastral legacy datasets by generating an accurate digital representation of cadastral parcels. These legacy databases are usually derived from paper parcel maps known as certified plans. After the cadastral modernization, the new cadastral database will no longer be based on single, static paper parcel maps, but on a global digital map. Despite the strict process of cadastral modernization, this reform has raised unexpected issues that need to be addressed. The main focus of this study is to review the issues generated by this transition. The transformed cadastral database should be further treated to minimize inherent errors and to fit it to the new satellite-based coordinate system with high positional accuracy. The results of this review will serve as a foundation for investigating a systematic and effective method for Positional Accuracy Improvement (PAI) in cadastral database modernization.
2006-06-01
SPARQL SPARQL Protocol and RDF Query Language SQL Structured Query Language SUMO Suggested Upper Merged Ontology SW... Query optimization algorithms are implemented in the Pellet reasoner in order to ensure querying a knowledge base is efficient. These algorithms...memory as a treelike structure in order for the data to be queried. XML Query (XQuery) is the standard language used when querying XML
Implementation of Quantum Private Queries Using Nuclear Magnetic Resonance
NASA Astrophysics Data System (ADS)
Wang, Chuan; Hao, Liang; Zhao, Lian-Jie
2011-08-01
We present a modified protocol for the realization of a quantum private query process on a classical database. Using a one-qubit query and a CNOT operation, the query process can be realized in a two-mode database. In the query process, data privacy is preserved, as the sender would not reveal any information about the database besides her query information, and the database provider cannot retain any information about the query. We implement the quantum private query protocol in a nuclear magnetic resonance system. The density matrices of the memory registers are constructed.
Gauging interest of the general public in laser-assisted in situ keratomileusis eye surgery.
Stein, Joshua D; Childers, David M; Nan, Bin; Mian, Shahzad I
2013-07-01
To assess interest among members of the general public in laser-assisted in situ keratomileusis (LASIK) surgery and how levels of interest in this procedure have changed over time in the United States and other countries. Using the Google Trends Web site, we determined the weekly frequency of queries involving the term "LASIK" from January 1, 2007, through January 1, 2011, in the United States, United Kingdom, Canada, and India. We fit separate regression models for each of the countries to assess whether residents of these countries differed in their querying rates on specific dates and over time. Similar analyses were performed to compare 4 US states. Additional regression models compared general public interest in LASIK surgery before and after the release of a 2008 Food and Drug Administration report describing complaints associated with this procedure. During 2007 to 2011, the Google query rate for "LASIK" was highest among persons residing in India, followed by the United Kingdom, Canada, and the United States. During this time period, the query rate declined by 40% in the United States, 24% in India, and 22% in the United Kingdom, and it increased by 8% in Canada. In all 4 of the US states examined, the query rate declined-by 52% in Florida, 56% in New York, 54% in Texas, and 42% in California. Interest in LASIK declined further among US citizens after the Food and Drug Administration report release. Interest among the general public in LASIK surgery has been waning in recent years.
Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search
Muhammad, Khan; Baik, Sung Wook
2017-01-01
In recent years, image databases are growing at exponential rates, making their management, indexing, and retrieval, very challenging. Typical image retrieval systems rely on sample images as queries. However, in the absence of sample query images, hand-drawn sketches are also used. The recent adoption of touch screen input devices makes it very convenient to quickly draw shaded sketches of objects to be used for querying image databases. This paper presents a mechanism to provide access to visual information based on users’ hand-drawn partially colored sketches using touch screen devices. A key challenge for sketch-based image retrieval systems is to cope with the inherent ambiguity in sketches due to the lack of colors, textures, shading, and drawing imperfections. To cope with these issues, we propose to fine-tune a deep convolutional neural network (CNN) using augmented dataset to extract features from partially colored hand-drawn sketches for query specification in a sketch-based image retrieval framework. The large augmented dataset contains natural images, edge maps, hand-drawn sketches, de-colorized, and de-texturized images which allow CNN to effectively model visual contents presented to it in a variety of forms. The deep features extracted from CNN allow retrieval of images using both sketches and full color images as queries. We also evaluated the role of partial coloring or shading in sketches to improve the retrieval performance. The proposed method is tested on two large datasets for sketch recognition and sketch-based image retrieval and achieved better classification and retrieval performance than many existing methods. PMID:28859140
Implementation of relational data base management systems on micro-computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huang, C.L.
1982-01-01
This dissertation describes an implementation of a Relational Data Base Management System on a microcomputer. A specific floppy-disk-based hardware system called TERAK is used, and a high-level query interface similar to a subset of the SEQUEL language is provided. The system contains sub-systems such as I/O, file management, virtual memory management, query system, B-tree management, scanner, command interpreter, expression compiler, garbage collection, linked list manipulation, disk space management, etc. The software has been implemented to fulfill the following goals: (1) It is highly modularized. (2) The system is physically segmented into 16 logically independent, overlayable segments, in a way such that a minimal amount of memory is needed at execution time. (3) A virtual memory system is simulated that provides the system with seemingly unlimited memory space. (4) A language translator is applied to recognize user requests in the query language. The code generation of this translator generates compact code for the execution of UPDATE, DELETE, and QUERY commands. (5) A complete set of basic functions needed for on-line data base manipulations is provided through the use of a friendly query interface. (6) Dependency on the environment (both software and hardware) is eliminated as much as possible, so that it would be easy to transplant the system to other computers. (7) Each relation is simulated as a sequential file. It is intended to be a highly efficient, single-user system suited to be used by small or medium-sized organizations for, say, administrative purposes. Experiments show that quite satisfying results have indeed been achieved.
NASA Astrophysics Data System (ADS)
Xiong, Wei; Qiu, Bo; Tian, Qi; Mueller, Henning; Xu, Changsheng
2005-04-01
Medical image retrieval is still mainly a research domain with a large variety of applications and techniques. With the ImageCLEF 2004 benchmark, an evaluation framework has been created that includes a database, query topics and ground truth data. Eleven systems (with a total of more than 50 runs) compared their performance in various configurations. The results show that no single feature performs well on all query tasks. Key to successful retrieval is rather the selection of features and feature weights based on a specific set of input features, thus on the query task. In this paper we propose a novel method based on query topic dependent image features (QTDIF) for content-based medical image retrieval. These feature sets are designed to capture both inter-category and intra-category statistical variations to achieve good retrieval performance in terms of recall and precision. We have used Gaussian Mixture Models (GMM) and blob representation to model medical images and construct the proposed novel QTDIF for CBIR. Finally, trained multi-class support vector machines (SVM) are used for image similarity ranking. The proposed methods have been tested over the Casimage database with around 9000 images, for the given 26 image topics, used for ImageCLEF 2004. The retrieval performance has been compared with the medGIFT system, which is based on the GNU Image Finding Tool (GIFT). The experimental results show that the proposed QTDIF-based CBIR can provide significantly better performance than systems based on general features only.
Silva, Sara; Gouveia-Oliveira, Rodrigo; Maretzek, António; Carriço, João; Gudnason, Thorolfur; Kristinsson, Karl G; Ekdahl, Karl; Brito-Avô, António; Tomasz, Alexander; Sanches, Ilda Santos; Lencastre, Hermínia de; Almeida, Jonas
2003-01-01
Background EURIS (European Resistance Intervention Study) was launched as a multinational study in September of 2000 to identify the multitude of complex risk factors that contribute to the high carriage rate of drug resistant Streptococcus pneumoniae strains in children attending Day Care Centers in several European countries. Access to the very large number of data required the development of a web-based infrastructure – EURISWEB – that includes a relational online database, coupled with a query system for data retrieval, and allows integrative storage of demographic, clinical and molecular biology data generated in EURIS. Methods All components of the system were developed using open source programming tools: data storage management was supported by PostgreSQL, and the hypertext preprocessor to generate the web pages was implemented using PHP. The query system is based on a software agent running in the background specifically developed for EURIS. Results The website currently contains data related to 13,500 nasopharyngeal samples and over one million measures taken from 5,250 individual children, as well as over one thousand pre-made and user-made queries aggregated into several reports, approximately. It is presently in use by participating researchers from three countries (Iceland, Portugal and Sweden). Conclusion An operational model centered on a PHP engine builds the interface between the user and the database automatically, allowing an easy maintenance of the system. The query system is also sufficiently adaptable to allow the integration of several advanced data analysis procedures far more demanding than simple queries, eventually including artificial intelligence predictive models. PMID:12846930
A study of medical and health queries to web search engines.
Spink, Amanda; Yang, Yin; Jansen, Jim; Nykanen, Pirrko; Lorence, Daniel P; Ozmutlu, Seda; Ozmutlu, H Cenk
2004-03-01
This paper reports findings from an analysis of medical or health queries to different web search engines. We report results: (i) comparing samples of 10000 web queries taken randomly from 1.2 million query logs from the AlltheWeb.com and Excite.com commercial web search engines in 2001 for medical or health queries, (ii) comparing the 2001 findings from Excite and AlltheWeb.com users with results from a previous analysis of medical and health related queries from the Excite Web search engine for 1997 and 1999, and (iii) on medical or health advice-seeking queries beginning with the word 'should'. Findings suggest: (i) a small percentage of web queries are medical or health related, (ii) the top five categories of medical or health queries were: general health, weight issues, reproductive health and puberty, pregnancy/obstetrics, and human relationships, and (iii) over time, the medical and health queries may have declined as a proportion of all web queries, as the use of specialized medical/health websites and e-commerce-related queries has increased. Findings provide insights into medical and health-related web querying and suggest some implications for the use of the general web search engines when seeking medical/health information.
RDF-GL: A SPARQL-Based Graphical Query Language for RDF
NASA Astrophysics Data System (ADS)
Hogenboom, Frederik; Milea, Viorel; Frasincar, Flavius; Kaymak, Uzay
This chapter presents RDF-GL, a graphical query language (GQL) for RDF. The GQL is based on the textual query language SPARQL and mainly focuses on SPARQL SELECT queries. The advantage of a GQL over textual query languages is that complexity is hidden through the use of graphical symbols. RDF-GL is supported by a Java-based editor, SPARQLinG, which is presented as well. The editor does not only allow for RDF-GL query creation, but also converts RDF-GL queries to SPARQL queries and is able to subsequently execute these. Experiments show that using the GQL in combination with the editor makes RDF querying more accessible for end users.
2009-04-01
information on user’s interests. In that case, the polarity takes the value of zero. Positive polarity examples: Query, Question/Assertion, cut/paste, chat ...Polarity Query (Keywords/Question/Assertion) 1 +1 cut/paste 0.9 +1 Selection from list 0.8 +1 Saving/printing 0.7 +1 Chat 0.6 +1 Reading doc/Web...3. logging all VIGEstimates (from UMS and IMS separately) and user snap shots as xml files for post‐process analysis As new InfoPacks come into the
CDAO-Store: Ontology-driven Data Integration for Phylogenetic Analysis
2011-01-01
Background The Comparative Data Analysis Ontology (CDAO) is an ontology developed, as part of the EvoInfo and EvoIO groups supported by the National Evolutionary Synthesis Center, to provide semantic descriptions of data and transformations commonly found in the domain of phylogenetic analysis. The core concepts of the ontology enable the description of phylogenetic trees and associated character data matrices. Results Using CDAO as the semantic back-end, we developed a triple-store, named CDAO-Store. CDAO-Store is a RDF-based store of phylogenetic data, including a complete import of TreeBASE. CDAO-Store provides a programmatic interface, in the form of web services, and a web-based front-end, to perform both user-defined as well as domain-specific queries; domain-specific queries include search for nearest common ancestors, minimum spanning clades, filter multiple trees in the store by size, author, taxa, tree identifier, algorithm or method. In addition, CDAO-Store provides a visualization front-end, called CDAO-Explorer, which can be used to view both character data matrices and trees extracted from the CDAO-Store. CDAO-Store provides import capabilities, enabling the addition of new data to the triple-store; files in PHYLIP, MEGA, nexml, and NEXUS formats can be imported and their CDAO representations added to the triple-store. Conclusions CDAO-Store is made up of a versatile and integrated set of tools to support phylogenetic analysis. To the best of our knowledge, CDAO-Store is the first semantically-aware repository of phylogenetic data with domain-specific querying capabilities. The portal to CDAO-Store is available at http://www.cs.nmsu.edu/~cdaostore. PMID:21496247
CDAO-store: ontology-driven data integration for phylogenetic analysis.
Chisham, Brandon; Wright, Ben; Le, Trung; Son, Tran Cao; Pontelli, Enrico
2011-04-15
The Comparative Data Analysis Ontology (CDAO) is an ontology developed, as part of the EvoInfo and EvoIO groups supported by the National Evolutionary Synthesis Center, to provide semantic descriptions of data and transformations commonly found in the domain of phylogenetic analysis. The core concepts of the ontology enable the description of phylogenetic trees and associated character data matrices. Using CDAO as the semantic back-end, we developed a triple-store, named CDAO-Store. CDAO-Store is a RDF-based store of phylogenetic data, including a complete import of TreeBASE. CDAO-Store provides a programmatic interface, in the form of web services, and a web-based front-end, to perform both user-defined as well as domain-specific queries; domain-specific queries include search for nearest common ancestors, minimum spanning clades, filter multiple trees in the store by size, author, taxa, tree identifier, algorithm or method. In addition, CDAO-Store provides a visualization front-end, called CDAO-Explorer, which can be used to view both character data matrices and trees extracted from the CDAO-Store. CDAO-Store provides import capabilities, enabling the addition of new data to the triple-store; files in PHYLIP, MEGA, nexml, and NEXUS formats can be imported and their CDAO representations added to the triple-store. CDAO-Store is made up of a versatile and integrated set of tools to support phylogenetic analysis. To the best of our knowledge, CDAO-Store is the first semantically-aware repository of phylogenetic data with domain-specific querying capabilities. The portal to CDAO-Store is available at http://www.cs.nmsu.edu/~cdaostore.
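One of the domain-specific queries named in the abstract above, the search for a nearest common ancestor, can be illustrated with a minimal Python sketch. It assumes a tree stored as a plain child-to-parent dictionary; the taxon names and function names are hypothetical, and this is not CDAO-Store's actual web service interface or RDF representation.

```python
def ancestors(parent, node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in parent:          # the root has no entry in `parent`
        node = parent[node]
        path.append(node)
    return path


def nearest_common_ancestor(parent, a, b):
    """Return the deepest node lying on both root paths, or None if there is none."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):   # walk upward from b
        if node in seen:
            return node
    return None


# Hypothetical toy tree, stored as child -> parent.
tree = {
    "human": "hominini",
    "chimp": "hominini",
    "hominini": "homininae",
    "gorilla": "homininae",
}
result = nearest_common_ancestor(tree, "human", "gorilla")
```

In this sketch a node counts as its own ancestor, so querying a taxon together with one of its ancestors returns that ancestor.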
Cumulative query method for influenza surveillance using search engine data.
Seo, Dong-Woo; Jo, Min-Woo; Sohn, Chang Hwan; Shin, Soo-Yong; Lee, JaeHo; Yu, Maengsoo; Kim, Won Young; Lim, Kyoung Soo; Lee, Sang-Il
2014-12-16
Internet search queries have become an important data source in syndromic surveillance systems. However, there is currently no syndromic surveillance system using Internet search query data in South Korea. The objective of this study was to examine correlations between our cumulative query method and national influenza surveillance data. Our study was based on the local search engine, Daum (approximately 25% market share), and influenza-like illness (ILI) data from the Korea Centers for Disease Control and Prevention. A quota sampling survey was conducted with 200 participants to obtain popular queries. We divided the study period into two sets: Set 1 (the 2009/10 epidemiological year for development set 1 and 2010/11 for validation set 1) and Set 2 (2010/11 for development set 2 and 2011/12 for validation set 2). Pearson's correlation coefficients were calculated between the Daum data and the ILI data for the development set. We selected the combined queries for which the correlation coefficients were .7 or higher and listed them in descending order. Then, we created a cumulative query method n, with n representing the number of cumulative combined queries in descending order of the correlation coefficient. In validation set 1, 13 cumulative query methods were applied, and 8 had higher correlation coefficients (min=.916, max=.943) than that of the highest single combined query. Further, 11 of 13 cumulative query methods had an r value of ≥.7, whereas only 4 of 13 combined queries had an r value of ≥.7. In validation set 2, 8 of 15 cumulative query methods showed higher correlation coefficients (min=.975, max=.987) than that of the highest single combined query. All 15 cumulative query methods had an r value of ≥.7, whereas only 6 of 15 combined queries had an r value of ≥.7. The cumulative query method showed relatively higher correlation with national influenza surveillance data than combined queries in the development and validation sets.
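The selection and accumulation steps described in this abstract can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes that "cumulative" means summing the weekly search volumes of the top n queries, and the function names and toy numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient (inputs must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cumulative_query_methods(query_series, ili, threshold=0.7):
    """Build 'cumulative query method n' for n = 1..k.

    query_series maps a query name to its weekly search volume.
    Assumption: method n is the element-wise sum of the n best-correlated
    queries, which is then correlated against the ILI series.
    """
    scored = [(pearson(series, ili), name) for name, series in query_series.items()]
    kept = sorted((s for s in scored if s[0] >= threshold), reverse=True)

    methods = []
    running = [0.0] * len(ili)
    for n, (_, name) in enumerate(kept, start=1):
        running = [a + b for a, b in zip(running, query_series[name])]
        methods.append((n, [q for _, q in kept[:n]], pearson(running, ili)))
    return methods


# Hypothetical toy data: five weeks of ILI rates and search volumes.
ili = [1, 2, 3, 4, 5]
queries = {
    "flu": [2, 4, 6, 8, 10],     # r = 1.0 with ili
    "fever": [1, 3, 2, 5, 4],    # r = 0.8 with ili
    "noise": [5, 1, 4, 2, 3],    # r = -0.3, dropped by the threshold
}
methods = cumulative_query_methods(queries, ili)
```

Each returned tuple holds n, the queries included, and the correlation of their summed volume with the ILI series, so the methods can be compared against the best single query as the abstract does.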
A Modular Framework for Transforming Structured Data into HTML with Machine-Readable Annotations
NASA Astrophysics Data System (ADS)
Patton, E. W.; West, P.; Rozell, E.; Zheng, J.
2010-12-01
There is a plethora of web-based Content Management Systems (CMS) available for maintaining, among other things, projects and data. However, each system varies in its capabilities, and content is often stored separately and accessed via non-uniform web interfaces. Moving from one CMS to another (e.g., MediaWiki to Drupal) can be cumbersome, especially if a large quantity of data must be adapted to the new system. To standardize the creation, display, management, and sharing of project information, we have assembled a framework that uses existing web technologies to transform data provided by any service that supports SPARQL Protocol and RDF Query Language (SPARQL) queries into HTML fragments, allowing it to be embedded in any existing website. The framework utilizes a two-tier Extensible Stylesheet Language Transformation (XSLT) that uses existing ontologies (e.g., Friend-of-a-Friend, Dublin Core) to interpret query results and render them as HTML documents. These ontologies can be used in conjunction with custom ontologies suited to individual needs (e.g., domain-specific ontologies for describing data records). Furthermore, this transformation process encodes machine-readable annotations, namely, the Resource Description Framework in attributes (RDFa), into the resulting HTML, so that capable parsers and search engines can extract the relationships between entities (e.g., people, organizations, datasets). To facilitate editing of content, the framework provides a web-based form system, mapping each query to a dynamically generated form that can be used to modify and create entities, while keeping the native data store up-to-date. This open framework makes it easy to duplicate data across many different sites, allowing researchers to distribute their data in many different online forums.
In this presentation we will outline the structure of queries and the stylesheets used to transform them, followed by a brief walkthrough that follows the data from storage to human- and machine-accessible web page. We conclude with a discussion on content caching and steps toward performing queries across multiple domains.
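The core idea of the framework above, turning SPARQL query results into an HTML fragment that carries RDFa annotations, can be sketched as follows. The framework itself uses XSLT stylesheets; this sketch swaps in plain Python for illustration. It assumes results in the standard SPARQL JSON results layout (`results` / `bindings`), and the variable names `person` and `name`, the function name, and the sample data are hypothetical.

```python
from html import escape


def bindings_to_rdfa(results, subject_var="person", name_var="name"):
    """Render SPARQL JSON results as an HTML list annotated with RDFa.

    Each row becomes a list item whose `about` attribute names the entity
    and whose `property` attribute tags the displayed literal with FOAF.
    """
    items = []
    for row in results["results"]["bindings"]:
        uri = escape(row[subject_var]["value"], quote=True)
        name = escape(row[name_var]["value"])
        items.append(
            f'<li about="{uri}" typeof="foaf:Person">'
            f'<span property="foaf:name">{name}</span></li>'
        )
    return (
        '<ul prefix="foaf: http://xmlns.com/foaf/0.1/">\n'
        + "\n".join(items)
        + "\n</ul>"
    )


# Hypothetical query result in the SPARQL JSON results layout.
sample = {
    "head": {"vars": ["person", "name"]},
    "results": {
        "bindings": [
            {
                "person": {"type": "uri", "value": "http://example.org/alice"},
                "name": {"type": "literal", "value": "Alice & Co."},
            }
        ]
    },
}
fragment = bindings_to_rdfa(sample)
```

The fragment can be embedded in any page; an RDFa-aware parser would recover the triple linking the entity URI to its `foaf:name`, while a browser simply shows the name.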
Towards ontology-driven navigation of the lipid bibliosphere
Baker, Christopher JO; Kanagasabai, Rajaraman; Ang, Wee Tiong; Veeramani, Anitha; Low, Hong-Sang; Wenk, Markus R
2008-01-01
Background The indexing of scientific literature and content is a relevant and contemporary requirement within life science information systems. Navigating information available in legacy formats continues to be a challenge both in enterprise and academic domains. The emergence of semantic web technologies and their fusion with artificial intelligence techniques has provided a new toolkit with which to address these data integration challenges. In the emerging field of lipidomics such navigation challenges are barriers to the translation of scientific results into actionable knowledge, critical to the treatment of diseases such as Alzheimer's syndrome, Mycobacterium infections and cancer. Results We present a literature-driven workflow involving document delivery and natural language processing steps generating tagged sentences containing lipid, protein and disease names, which are instantiated to custom designed lipid ontology. We describe the design challenges in capturing lipid nomenclature, the mandate of the ontology and its role as query model in the navigation of the lipid bibliosphere. We illustrate the extent of the description logic-based A-box query capability provided by the instantiated ontology using a graphical query composer to query sentences describing lipid-protein and lipid-disease correlations. Conclusion As scientists accept the need to readjust the manner in which we search for information and derive knowledge we illustrate a system that can constrain the literature explosion and knowledge navigation problems. Specifically we have focussed on solving this challenge for lipidomics researchers who have to deal with the lack of standardized vocabulary, differing classification schemes, and a wide array of synonyms before being able to derive scientific insights. 
The use of the OWL-DL variant of the Web Ontology Language (OWL) and description logic reasoning is pivotal in this regard, providing the lipid scientist with advanced query access to the results of text mining algorithms instantiated into the ontology. The visual query paradigm assists in the adoption of this technology. PMID:18315858
Dugan, J. M.; Berrios, D. C.; Liu, X.; Kim, D. K.; Kaizer, H.; Fagan, L. M.
1999-01-01
Our group has built an information retrieval system based on a complex semantic markup of medical textbooks. We describe the construction of a set of web-based knowledge-acquisition tools that expedites the collection and maintenance of the concepts required for text markup and the search interface required for information retrieval from the marked text. In the text markup system, domain experts (DEs) identify sections of text that contain one or more elements from a finite set of concepts. End users can then query the text using a predefined set of questions, each of which identifies a subset of complementary concepts. The search process matches that subset of concepts to relevant points in the text. The current process requires that the DE invest significant time to generate the required concepts and questions. We propose a new system--called ACQUIRE (Acquisition of Concepts and Queries in an Integrated Retrieval Environment)--that assists a DE in two essential tasks in the text-markup process. First, it helps her to develop, edit, and maintain the concept model: the set of concepts with which she marks the text. Second, ACQUIRE helps her to develop a query model: the set of specific questions that end users can later use to search the marked text. The DE incorporates concepts from the concept model when she creates the questions in the query model. The major benefit of the ACQUIRE system is a reduction in the time and effort required for the text-markup process. We compared the process of concept- and query-model creation using ACQUIRE to the process used in previous work by rebuilding two existing models that we previously constructed manually. We observed a significant decrease in the time required to build and maintain the concept and query models. PMID:10566457
Towards ontology-driven navigation of the lipid bibliosphere.
Baker, Christopher Jo; Kanagasabai, Rajaraman; Ang, Wee Tiong; Veeramani, Anitha; Low, Hong-Sang; Wenk, Markus R
2008-01-01
The indexing of scientific literature and content is a relevant and contemporary requirement within life science information systems. Navigating information available in legacy formats continues to be a challenge both in enterprise and academic domains. The emergence of semantic web technologies and their fusion with artificial intelligence techniques has provided a new toolkit with which to address these data integration challenges. In the emerging field of lipidomics such navigation challenges are barriers to the translation of scientific results into actionable knowledge, critical to the treatment of diseases such as Alzheimer's syndrome, Mycobacterium infections and cancer. We present a literature-driven workflow involving document delivery and natural language processing steps generating tagged sentences containing lipid, protein and disease names, which are instantiated to custom designed lipid ontology. We describe the design challenges in capturing lipid nomenclature, the mandate of the ontology and its role as query model in the navigation of the lipid bibliosphere. We illustrate the extent of the description logic-based A-box query capability provided by the instantiated ontology using a graphical query composer to query sentences describing lipid-protein and lipid-disease correlations. As scientists accept the need to readjust the manner in which we search for information and derive knowledge we illustrate a system that can constrain the literature explosion and knowledge navigation problems. Specifically we have focussed on solving this challenge for lipidomics researchers who have to deal with the lack of standardized vocabulary, differing classification schemes, and a wide array of synonyms before being able to derive scientific insights. 
The use of the OWL-DL variant of the Web Ontology Language (OWL) and description logic reasoning is pivotal in this regard, providing the lipid scientist with advanced query access to the results of text mining algorithms instantiated into the ontology. The visual query paradigm assists in the adoption of this technology.
GeoNetwork powered GI-cat: a geoportal hybrid solution
NASA Astrophysics Data System (ADS)
Baldini, Alessio; Boldrini, Enrico; Santoro, Mattia; Mazzetti, Paolo
2010-05-01
To set up a Spatial Data Infrastructure (SDI), the creation of a system for metadata management and discovery plays a fundamental role. An effective solution is the use of a geoportal (e.g. the FAO/ESA geoportal), which has the important benefit of being accessible from a web browser. In this work we present a solution based on integrating two of the available frameworks: GeoNetwork and GI-cat. GeoNetwork is open-source software designed to improve the accessibility of a wide variety of data, together with the associated ancillary information (metadata), at different scales and from multidisciplinary sources; data are organized and documented in a standard and consistent way. GeoNetwork implements both the Portal and Catalog components of a Spatial Data Infrastructure (SDI) defined in the OGC Reference Architecture. It provides tools for managing and publishing metadata on spatial data and related services. GeoNetwork allows harvesting of various types of web data sources, e.g. OGC Web Services (CSW, WCS, WMS). GI-cat is a distributed catalog based on a service-oriented framework of modular components and can be customized and tailored to support different deployment scenarios. It can federate a multiplicity of catalog services, as well as inventory and access services, in order to discover and access heterogeneous ESS resources. The federated resources are exposed by GI-cat through several standard catalog interfaces (e.g. OGC CSW AP ISO, OpenSearch, etc.) and by the GI-cat extended interface. Specific components, called Accessors, implement mediation services for interfacing with heterogeneous service providers, each of which exposes a specific standard specification. These mediating components resolve the multiplicity of provider data models by mapping them onto the GI-cat internal data model, which implements the ISO 19115 Core profile.
Accessors also implement the query protocol mapping: they translate the query requests expressed according to the interface protocols exposed by GI-cat into the multiple query dialects spoken by the resource service providers. Currently, a number of well-accepted catalog and inventory services are supported, including several OGC Web Services, THREDDS Data Server, SeaDataNet Common Data Index, GBIF and OpenSearch engines. A GeoNetwork powered GI-cat has been developed in order to exploit the best of the two frameworks. The new system uses a modified version of the GeoNetwork web interface that adds the capability of querying the specified GI-cat catalog as well as the GeoNetwork internal database. The resulting system consists of a geoportal in which GI-cat plays the role of the search engine. This new system allows the query to be distributed across the different types of data sources linked to a GI-cat instance. The metadata results of the query are then visualized by the GeoNetwork web interface. This configuration was tested in the framework of GIIDA, a project of the Italian National Research Council (CNR) focused on data accessibility and interoperability. A second advantage of this solution is achieved by setting up a GeoNetwork catalog among the accessors of the GI-cat instance. Such a configuration in turn allows GI-cat to run the query against the internal GeoNetwork database. This makes both the harvesting and metadata editor functionalities provided by GeoNetwork and the distributed search functionality of GI-cat available in a consistent way through the same web interface.
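The mediation idea behind Accessors, one query translated into each provider's own dialect, can be sketched in Python as below. This is only an illustration of the pattern: the class names, the `translate` method, and the exact query strings are hypothetical and are not GI-cat's actual interfaces or the precise syntax any given provider requires.

```python
from urllib.parse import urlencode


class OpenSearchAccessor:
    """Hypothetical accessor: renders keywords as an OpenSearch-style query string."""

    def translate(self, keywords):
        return urlencode({"q": " ".join(keywords)})


class CswAccessor:
    """Hypothetical accessor: renders keywords as a CQL-like text constraint."""

    def translate(self, keywords):
        return " AND ".join(f"AnyText LIKE '%{k}%'" for k in keywords)


def distribute(keywords, accessors):
    """Fan one keyword query out to every accessor, each in its own dialect."""
    return {name: acc.translate(keywords) for name, acc in accessors.items()}


requests = distribute(
    ["sea", "temperature"],
    {"opensearch": OpenSearchAccessor(), "csw": CswAccessor()},
)
```

Keeping the translation inside each accessor means a new provider type can be supported by adding one class, without changing the code that distributes the query.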
Huang, Chung-Chi; Lu, Zhiyong
2016-01-01
Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems primarily rely on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological entities (e.g. chemicals, diseases and genes) with little effort on discovering semantic relations. In this work, we aim to discover biomedical semantic relations in PubMed queries in an automated and unsupervised fashion. Specifically, we focus on extracting and understanding the contextual information (or context patterns) that is used by PubMed users to represent semantic relations between entities such as ‘CHEMICAL-1 compared to CHEMICAL-2.’ With the advances in automatic named entity recognition, we first tag entities in PubMed queries and then use tagged entities as knowledge to recognize pattern semantics. More specifically, we transform PubMed queries into context patterns involving participating entities, which are subsequently projected to latent topics via latent semantic analysis (LSA) to avoid the data sparseness and specificity issues. Finally, we mine semantically similar contextual patterns or semantic relations based on LSA topic distributions. Our two separate evaluation experiments of chemical-chemical (CC) and chemical–disease (CD) relations show that the proposed approach significantly outperforms a baseline method, which simply measures pattern semantics by similarity in participating entities. The highest performance achieved by our approach is nearly 0.9 and 0.85 respectively for the CC and CD task when compared against the ground truth in terms of normalized discounted cumulative gain (nDCG), a standard measure of ranking quality. 
These results suggest that our approach can effectively identify and return related semantic patterns in a ranked order covering diverse bio-entity relations. To assess the potential utility of our automated top-ranked patterns of a given relation in semantic search, we performed a pilot study on frequently sought semantic relations in PubMed and observed improved literature retrieval effectiveness based on post-hoc human relevance evaluation. Further investigation in larger tests and in real-world scenarios is warranted. PMID:27016698
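For readers unfamiliar with the ranking measure used above, the following is a minimal sketch of normalized discounted cumulative gain (nDCG). It assumes the common formulation with linear gain and a log2 rank discount; the abstract does not state which nDCG variant the authors used, so the exact gain function here is an assumption.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that already lists items in descending order of relevance scores 1.0; placing a highly relevant pattern lower in the list reduces the score.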
Dhanasekaran, A Ranjitha; Pearson, Jon L; Ganesan, Balasubramanian; Weimer, Bart C
2015-02-25
Mass spectrometric analysis of microbial metabolism provides a long list of possible compounds. Restricting the identification of the possible compounds to those produced by the specific organism would benefit the identification process. Currently, identification of mass spectrometry (MS) data is commonly done using empirically derived compound databases. Unfortunately, most databases contain relatively few compounds, leaving long lists of unidentified molecules. Incorporating genome-encoded metabolism enables MS output identification that may not be included in databases. Using an organism's genome as a database restricts metabolite identification to only those compounds that the organism can produce. To address the challenge of metabolomic analysis from MS data, a web-based application to directly search genome-constructed metabolic databases was developed. The user query returns a genome-restricted list of possible compound identifications along with the putative metabolic pathways based on the name, formula, SMILES structure, and the compound mass as defined by the user. Multiple queries can be done simultaneously by submitting a text file created by the user or obtained from the MS analysis software. The user can also provide parameters specific to the experiment's MS analysis conditions, such as mass deviation, adducts, and detection mode during the query so as to provide additional levels of evidence to produce the tentative identification. The query results are provided as an HTML page and downloadable text file of possible compounds that are restricted to a specific genome. Hyperlinks provided in the HTML file connect the user to the curated metabolic databases housed in ProCyc, a Pathway Tools platform, as well as the KEGG Pathway database for visualization and metabolic pathway analysis. Metabolome Searcher, a web-based tool, facilitates putative compound identification of MS output based on genome-restricted metabolic capability. 
This enables researchers to rapidly extend the possible identifications of large data sets for metabolites that are not in compound databases. Putative compound names with their associated metabolic pathways from metabolomics data sets are returned to the user for additional biological interpretation and visualization. This novel approach enables compound identification by restricting the possible masses to those encoded in the genome.
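The core lookup described above, matching an observed mass against a genome-restricted compound list under a chosen adduct and mass deviation, can be sketched as follows. This is not the Metabolome Searcher implementation: the function name, the tiny compound table and the parts-per-million tolerance convention are assumptions made for illustration. The adduct shifts are the usual proton and sodium-ion mass differences for singly charged ions.

```python
from typing import Dict, List, Tuple

# Mass shifts (Da) for a few singly charged adducts; illustrative subset only.
ADDUCT_SHIFTS: Dict[str, float] = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M-H]-": -1.007276,
}


def match_mass(observed_mz: float,
               compounds: Dict[str, float],
               adduct: str,
               tolerance_ppm: float) -> List[Tuple[str, float]]:
    """Return (compound, error in ppm) pairs within the tolerance, best first.

    `compounds` maps names to monoisotopic masses and stands in for the list
    of metabolites that a given genome is predicted to produce.
    """
    shift = ADDUCT_SHIFTS[adduct]
    hits: List[Tuple[str, float]] = []
    for name, mono_mass in compounds.items():
        expected_mz = mono_mass + shift
        error_ppm = (observed_mz - expected_mz) / expected_mz * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            hits.append((name, error_ppm))
    return sorted(hits, key=lambda hit: abs(hit[1]))
```

Restricting `compounds` to one organism's predicted metabolites is what keeps the candidate list short; the same observed mass searched against a general compound database would typically return more candidates.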
van Baal, Sjozef; Kaimakis, Polynikis; Phommarinh, Manyphong; Koumbi, Daphne; Cuppens, Harry; Riccardino, Francesca; Macek, Milan; Scriver, Charles R; Patrinos, George P
2007-01-01
Frequency of INherited Disorders database (FINDbase) (http://www.findbase.org) is a relational database, derived from the ETHNOS software, recording frequencies of causative mutations leading to inherited disorders worldwide. Database records include the population and ethnic group, the disorder name and the related gene, accompanied by links to any corresponding locus-specific mutation database, to the respective Online Mendelian Inheritance in Man entries and the mutation together with its frequency in that population. The initial information is derived from the published literature, locus-specific databases and genetic disease consortia. FINDbase offers a user-friendly query interface, providing instant access to the list and frequencies of the different mutations. Query outputs can be either in a table or graphical format, accompanied by reference(s) on the data source. Registered users from three different groups, namely administrator, national coordinator and curator, are responsible for database curation and/or data entry/correction online via a password-protected interface. Database access is free of charge and there are no registration requirements for data querying. FINDbase provides a simple, web-based system for population-based mutation data collection and retrieval and can serve not only as a valuable online tool for molecular genetic testing of inherited disorders but also as a non-profit model for sustainable database funding, in the form of a 'database-journal'.
Bandyopadhyay, Deepak; Huan, Jun; Prins, Jan; Snoeyink, Jack; Wang, Wei; Tropsha, Alexander
2009-11-01
Protein function prediction is one of the central problems in computational biology. We present a novel automated protein structure-based function prediction method using libraries of local residue packing patterns that are common to most proteins in a known functional family. Critical to this approach is the representation of a protein structure as a graph where residue vertices (residue name used as a vertex label) are connected by geometrical proximity edges. The approach employs two steps. First, it uses a fast subgraph mining algorithm to find all occurrences of family-specific labeled subgraphs for all well characterized protein structural and functional families. Second, it queries a new structure for occurrences of a set of motifs characteristic of a known family, using a graph index to speed up Ullman's subgraph isomorphism algorithm. The confidence of function inference from structure depends on the number of family-specific motifs found in the query structure compared with their distribution in a large non-redundant database of proteins. This method can assign a new structure to a specific functional family in cases where sequence alignments, sequence patterns, structural superposition and active site templates fail to provide accurate annotation.
Using Generalized Annotated Programs to Solve Social Network Diffusion Optimization Problems
2013-01-01
as follows: —Let kall be the k value for the SNDOP-ALL query and for each SNDOP query i, let ki be the k for that query. For each query i, set ki... kall − 1. —Number each element of vi ∈ V such that gI(vi) and V C(vi) are true. For the ith SNDOP query, let vi be the corresponding element of V —Let...vertices of S. PROOF. We set up |V | SNDOP-queries as follows: —Let kall be the k value for the SNDOP-ALL query and and for each SNDOP-query i, let ki be
A web-based data-querying tool based on ontology-driven methodology and flowchart-based model.
Ping, Xiao-Ou; Chung, Yufang; Tseng, Yi-Ju; Liang, Ja-Der; Yang, Pei-Ming; Huang, Guan-Tarn; Lai, Feipei
2013-10-08
Because of the increased adoption rate of electronic medical record (EMR) systems, more health care records have been increasingly accumulating in clinical data repositories. Therefore, querying the data stored in these repositories is crucial for retrieving the knowledge from such large volumes of clinical data. The aim of this study is to develop a Web-based approach for enriching the capabilities of the data-querying system along the three following considerations: (1) the interface design used for query formulation, (2) the representation of query results, and (3) the models used for formulating query criteria. The Guideline Interchange Format version 3.5 (GLIF3.5), an ontology-driven clinical guideline representation language, was used for formulating the query tasks based on the GLIF3.5 flowchart in the Protégé environment. The flowchart-based data-querying model (FBDQM) query execution engine was developed and implemented for executing queries and presenting the results through a visual and graphical interface. To examine a broad variety of patient data, the clinical data generator was implemented to automatically generate the clinical data in the repository, and the generated data, thereby, were employed to evaluate the system. The accuracy and time performance of the system for three medical query tasks relevant to liver cancer were evaluated based on the clinical data generator in the experiments with varying numbers of patients. In this study, a prototype system was developed to test the feasibility of applying a methodology for building a query execution engine using FBDQMs by formulating query tasks using the existing GLIF. The FBDQM-based query execution engine was used to successfully retrieve the clinical data based on the query tasks formatted using the GLIF3.5 in the experiments with varying numbers of patients. 
The accuracy of the three queries (ie, "degree of liver damage," "degree of liver damage when applying a mutually exclusive setting," and "treatments for liver cancer") was 100% for all four experiments (10 patients, 100 patients, 1000 patients, and 10,000 patients). Among the three measured query phases, (1) structured query language operations, (2) criteria verification, and (3) other, the first two had the longest execution time. The ontology-driven FBDQM-based approach enriched the capabilities of the data-querying system. The adoption of the GLIF3.5 increased the potential for interoperability, shareability, and reusability of the query tasks.
Fu, Lawrence D.; Aphinyanaphongs, Yindalon; Wang, Lily; Aliferis, Constantin F.
2011-01-01
Evaluating the biomedical literature and health-related websites for quality are challenging information retrieval tasks. Current commonly used methods include impact factor for journals, PubMed’s clinical query filters and machine learning-based filter models for articles, and PageRank for websites. Previous work has focused on the average performance of these methods without considering the topic, and it is unknown how performance varies for specific topics or focused searches. Clinicians, researchers, and users should be aware when expected performance is not achieved for specific topics. The present work analyzes the behavior of these methods for a variety of topics. Impact factor, clinical query filters, and PageRank vary widely across different topics while a topic-specific impact factor and machine learning-based filter models are more stable. The results demonstrate that a method may perform excellently on average but struggle when used on a number of narrower topics. Topic adjusted metrics and other topic robust methods have an advantage in such situations. Users of traditional topic-sensitive metrics should be aware of their limitations. PMID:21419864
Tang, Guo-Qing; Maxwell, E. Stuart
2008-01-01
The amphibian Xenopus provides a model organism for investigating microRNA expression during vertebrate embryogenesis and development. Searching available Xenopus genome databases using known human pre-miRNAs as query sequences, more than 300 genes encoding 142 Xenopus tropicalis miRNAs were identified. Analysis of Xenopus tropicalis miRNA genes revealed a predominate positioning within introns of protein-coding and nonprotein-coding RNA Pol II-transcribed genes. MiRNA genes were also located in pre-mRNA exons and positioned intergenically between known protein-coding genes. Many miRNA species were found in multiple locations and in more than one genomic context. MiRNA genes were also clustered throughout the genome, indicating the potential for the cotranscription and coordinate expression of miRNAs located in a given cluster. Northern blot analysis confirmed the expression of many identified miRNAs in both X. tropicalis and X. laevis. Comparison of X. tropicalis and X. laevis blots revealed comparable expression profiles, although several miRNAs exhibited species-specific expression in different tissues. More detailed analysis revealed that for some miRNAs, the tissue-specific expression profile of the pri-miRNA precursor was distinctly different from that of the mature miRNA profile. Differential miRNA precursor processing in both the nucleus and cytoplasm was implicated in the observed tissue-specific differences. These observations indicated that post-transcriptional processing plays an important role in regulating miRNA expression in the amphibian Xenopus. PMID:18032731
Semantics Enabled Queries in EuroGEOSS: a Discovery Augmentation Approach
NASA Astrophysics Data System (ADS)
Santoro, M.; Mazzetti, P.; Fugazza, C.; Nativi, S.; Craglia, M.
2010-12-01
One of the main challenges in Earth Science Informatics is to build interoperability frameworks which allow users to discover, evaluate, and use information from different scientific domains. This needs to address multidisciplinary interoperability challenges concerning both technological and scientific aspects. From the technological point of view, it is necessary to provide a set of special interoperability arrangement in order to develop flexible frameworks that allow a variety of loosely-coupled services to interact with each other. From a scientific point of view, it is necessary to document clearly the theoretical and methodological assumptions underpinning applications in different scientific domains, and develop cross-domain ontologies to facilitate interdisciplinary dialogue and understanding. In this presentation we discuss a brokering approach that extends the traditional Service Oriented Architecture (SOA) adopted by most Spatial Data Infrastructures (SDIs) to provide the necessary special interoperability arrangements. In the EC-funded EuroGEOSS (A European approach to GEOSS) project, we distinguish among three possible functional brokering components: discovery, access and semantics brokers. This presentation focuses on the semantics broker, the Discovery Augmentation Component (DAC), which was specifically developed to address the three thematic areas covered by the EuroGEOSS project: biodiversity, forestry and drought. The EuroGEOSS DAC federates both semantics (e.g. SKOS repositories) and ISO-compliant geospatial catalog services. The DAC can be queried using common geospatial constraints (i.e. what, where, when, etc.). Two different augmented discovery styles are supported: a) automatic query expansion; b) user assisted query expansion. In the first case, the main discovery steps are: i. the query keywords (the what constraint) are “expanded” with related concepts/terms retrieved from the set of federated semantic services. 
A default expansion regards the multilinguality relationship; ii. The resulting queries are submitted to the federated catalog services; iii. The DAC performs a “smart” aggregation of the queries results and provides them back to the client. In the second case, the main discovery steps are: i. the user browses the federated semantic repositories and selects the concepts/terms-of-interest; ii. The DAC creates the set of geospatial queries based on the selected concepts/terms and submits them to the federated catalog services; iii. The DAC performs a “smart” aggregation of the queries results and provides them back to the client. A Graphical User Interface (GUI) was also developed for testing and interacting with the DAC. The entire brokering framework is deployed in the context of EuroGEOSS infrastructure and it is used in a couple of GEOSS AIP-3 use scenarios: the “e-Habitat Use Scenario” for the Biodiversity and Climate Change topic, and the “Comprehensive Drought Index Use Scenario” for Water/Drought topic
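The automatic query expansion style (steps i to iii above) can be sketched in a few lines. This is an illustration only, not the DAC implementation: the thesaurus is a plain dictionary standing in for the federated semantic services, the catalogs are in-memory substring searches, and the "smart" aggregation, which the abstract does not specify, is replaced here by simple de-duplication on record identifier. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]                  # (record identifier, title)
Catalog = Callable[[str], List[Record]]   # search term -> matching records


def make_catalog(records: Dict[str, str]) -> Catalog:
    """Build a toy catalog that matches a term against record titles."""
    def search(term: str) -> List[Record]:
        needle = term.casefold()
        return [(rid, title) for rid, title in records.items()
                if needle in title.casefold()]
    return search


def expand_keyword(keyword: str, thesaurus: Dict[str, List[str]]) -> List[str]:
    """Step i: add related terms (e.g. translations) to the original keyword."""
    expanded = [keyword]
    for term in thesaurus.get(keyword, []):
        if term not in expanded:
            expanded.append(term)
    return expanded


def augmented_discovery(keyword: str,
                        thesaurus: Dict[str, List[str]],
                        catalogs: List[Catalog]) -> Dict[str, str]:
    """Steps ii and iii: query every catalog with every term, then merge."""
    merged: Dict[str, str] = {}
    for term in expand_keyword(keyword, thesaurus):
        for catalog in catalogs:
            for record_id, title in catalog(term):
                merged.setdefault(record_id, title)  # de-duplicate by identifier
    return merged
```

The user-assisted style differs only in step i: the list of terms would come from the user's selection rather than from `expand_keyword`.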
Modeling Group Interactions via Open Data Sources
2011-08-30
data. The state-of-art search engines are designed to help general query-specific search and not suitable for finding disconnected online groups. The...groups, (2) developing innovative mathematical and statistical models and efficient algorithms that leverage existing search engines and employ
SkyQuery - A Prototype Distributed Query and Cross-Matching Web Service for the Virtual Observatory
NASA Astrophysics Data System (ADS)
Thakar, A. R.; Budavari, T.; Malik, T.; Szalay, A. S.; Fekete, G.; Nieto-Santisteban, M.; Haridas, V.; Gray, J.
2002-12-01
We have developed a prototype distributed query and cross-matching service for the VO community, called SkyQuery, which is implemented with hierarchical Web Services. SkyQuery enables astronomers to run combined queries on existing distributed heterogeneous astronomy archives. SkyQuery provides a simple, user-friendly interface to run distributed queries over the federation of registered astronomical archives in the VO. The SkyQuery client connects to the portal Web Service, which farms the query out to the individual archives, which are also Web Services called SkyNodes. The cross-matching algorithm is run recursively on each SkyNode. Each archive is a relational DBMS with an HTM index for fast spatial lookups. The results of the distributed query are returned as an XML DataSet that is automatically rendered by the client. SkyQuery also returns the image cutout corresponding to the query result. SkyQuery finds not only matches between the various catalogs, but also dropouts - objects that exist in some of the catalogs but not in others. This is often as important as finding matches. We demonstrate the utility of SkyQuery with a brown-dwarf search between SDSS and 2MASS, and a search for radio-quiet quasars in SDSS, 2MASS and FIRST. The importance of a service like SkyQuery for the worldwide astronomical community cannot be overstated: data on the same objects in various archives is mapped in different wavelength ranges and looks very different due to different errors, instrument sensitivities and other peculiarities of each archive. Our cross-matching algorithm performs a fuzzy spatial join across multiple catalogs. This type of cross-matching is currently often done by eye, one object at a time. A static cross-identification table for a set of archives would become obsolete by the time it was built - the exponential growth of astronomical data means that a dynamic cross-identification mechanism like SkyQuery is the only viable option.
SkyQuery was funded by a grant from the NASA AISR program.
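The idea of a fuzzy spatial join that reports both matches and dropouts can be illustrated with a small two-catalog sketch. It is not the SkyQuery algorithm: it uses a brute-force nearest-neighbour search with a fixed angular tolerance instead of an HTM index and recursive evaluation across SkyNodes, and the function names and catalog layout are assumptions.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[float, float]  # (right ascension, declination) in degrees


def angular_separation_deg(ra1: float, dec1: float,
                           ra2: float, dec2: float) -> float:
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a: Dict[str, Position],
                cat_b: Dict[str, Position],
                tolerance_arcsec: float) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Pair each object in cat_a with its nearest cat_b neighbour, if close enough.

    Returns (matches, dropouts), where dropouts are cat_a objects with no
    cat_b counterpart within the tolerance.
    """
    tolerance_deg = tolerance_arcsec / 3600.0
    matches: List[Tuple[str, str]] = []
    dropouts: List[str] = []
    for id_a, (ra_a, dec_a) in cat_a.items():
        best_id, best_sep = None, tolerance_deg
        for id_b, (ra_b, dec_b) in cat_b.items():
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        if best_id is None:
            dropouts.append(id_a)
        else:
            matches.append((id_a, best_id))
    return matches, dropouts
```

The brute-force inner loop costs one comparison per pair of objects, which is why a spatial index such as the HTM index mentioned above matters at archive scale.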
Applications of Derandomization Theory in Coding
NASA Astrophysics Data System (ADS)
Cheraghchi, Mahdi
2011-07-01
Randomized techniques play a fundamental role in theoretical computer science and discrete mathematics, in particular for the design of efficient algorithms and construction of combinatorial objects. The basic goal in derandomization theory is to eliminate or reduce the need for randomness in such randomized constructions. In this thesis, we explore some applications of the fundamental notions in derandomization theory to problems outside the core of theoretical computer science, and in particular, certain problems related to coding theory. First, we consider the wiretap channel problem which involves a communication system in which an intruder can eavesdrop a limited portion of the transmissions, and construct efficient and information-theoretically optimal communication protocols for this model. Then we consider the combinatorial group testing problem. In this classical problem, one aims to determine a set of defective items within a large population by asking a number of queries, where each query reveals whether a defective item is present within a specified group of items. We use randomness condensers to explicitly construct optimal, or nearly optimal, group testing schemes for a setting where the query outcomes can be highly unreliable, as well as the threshold model where a query returns positive if the number of defectives passes a certain threshold. Finally, we design ensembles of error-correcting codes that achieve the information-theoretic capacity of a large class of communication channels, and then use the obtained ensembles for construction of explicit capacity achieving codes. [This is a shortened version of the actual abstract in the thesis.]
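To make the group testing setting concrete, here is a minimal sketch of the classical noiseless case, in which each query reports whether a pool contains at least one defective item. It is not the condenser-based construction of the thesis and does not handle unreliable outcomes or the threshold model; the pool design in the usage note is hand-picked for illustration. Decoding uses the simple elimination rule that any item appearing in a negative pool cannot be defective.

```python
from typing import List, Sequence, Set


def run_tests(pools: Sequence[Sequence[int]], defectives: Set[int]) -> List[bool]:
    """Simulate noiseless queries: a pool is positive iff it holds a defective."""
    return [any(item in defectives for item in pool) for pool in pools]


def eliminate_decode(n_items: int,
                     pools: Sequence[Sequence[int]],
                     outcomes: Sequence[bool]) -> Set[int]:
    """Return the items not ruled out by any negative pool.

    The result always contains every true defective; whether it contains
    anything else depends on how the pools were designed.
    """
    candidates = set(range(n_items))
    for pool, positive in zip(pools, outcomes):
        if not positive:
            candidates -= set(pool)
    return candidates
```

For example, with six items arranged in a 2-by-3 grid and pools given by its two rows and three columns (`[[0, 1, 2], [3, 4, 5], [0, 3], [1, 4], [2, 5]]`), five queries suffice to isolate a single defective item.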
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zamora, Antonio
Advanced Natural Language Processing Tools for Web Information Retrieval, Content Analysis, and Synthesis. The goal of this SBIR was to implement and evaluate several advanced Natural Language Processing (NLP) tools and techniques to enhance the precision and relevance of search results by analyzing and augmenting search queries and by helping to organize the search output obtained from heterogeneous databases and web pages containing textual information of interest to DOE and the scientific-technical user communities in general. The SBIR investigated 1) the incorporation of spelling checkers in search applications, 2) identification of significant phrases and concepts using a combination of linguistic and statistical techniques, and 3) enhancement of the query interface and search retrieval results through the use of semantic resources, such as thesauri. A search program with a flexible query interface was developed to search reference databases with the objective of enhancing search results from web queries or queries of specialized search systems such as DOE's Information Bridge. The DOE ETDE/INIS Joint Thesaurus was processed to create a searchable database. Term frequencies and term co-occurrences were used to enhance the web information retrieval by providing algorithmically-derived objective criteria to organize relevant documents into clusters containing significant terms. A thesaurus provides an authoritative overview and classification of a field of knowledge. By organizing the results of a search using the thesaurus terminology, the output is more meaningful than when the results are just organized based on the terms that co-occur in the retrieved documents, some of which may not be significant. An attempt was made to take advantage of the hierarchy provided by broader and narrower terms, as well as other field-specific information in the thesauri. 
The search program uses linguistic morphological routines to find relevant entries regardless of whether terms are stored in singular or plural form. Implementation of additional inflectional morphology processes for verbs can enhance retrieval further, but this has to be balanced by the possibility of broadening the results too much. In addition to the DOE energy thesaurus, other sources of specialized organized knowledge such as the Medical Subject Headings (MeSH), the Unified Medical Language System (UMLS), and Wikipedia were investigated. The supporting role of the NLP thesaurus search program was enhanced by incorporating spelling aid and a part-of-speech tagger to cope with misspellings in the queries and to determine the grammatical roles of the query words and identify nouns for special processing. To improve precision, multiple modes of searching were implemented, including Boolean operators and field-specific searches. Programs to convert a thesaurus or reference file into searchable support files can be deployed easily, and the resulting files are immediately searchable to produce relevance-ranked results with built-in spelling aid, morphological processing, and advanced search logic. Demonstration systems were built for several databases, including the DOE energy thesaurus.
Teng, Rui; Leibnitz, Kenji; Miura, Ryu
2013-01-01
An essential application of wireless sensor networks is to successfully respond to user queries. Query packet losses occur in the query dissemination due to wireless communication problems such as interference, multipath fading, packet collisions, etc. The losses of query messages at sensor nodes result in the failure of sensor nodes reporting the requested data. Hence, the reliable and successful dissemination of query messages to sensor nodes is a non-trivial problem. The target of this paper is to enable highly successful query delivery to sensor nodes by localized and energy-efficient discovery, and recovery of query losses. We adopt local and collective cooperation among sensor nodes to increase the success rate of distributed discoveries and recoveries. To enable the scalability in the operations of discoveries and recoveries, we employ a distributed name resolution mechanism at each sensor node to allow sensor nodes to self-detect the correlated queries and query losses, and then efficiently locally respond to the query losses. We prove that the collective discovery of query losses has a high impact on the success of query dissemination and reveal that scalability can be achieved by using the proposed approach. We further study the novel features of the cooperation and competition in the collective recovery at PHY and MAC layers, and show that the appropriate number of detectors can achieve optimal successful recovery rate. We evaluate the proposed approach with both mathematical analyses and computer simulations. The proposed approach enables a high rate of successful delivery of query messages and it results in short route lengths to recover from query losses. The proposed approach is scalable and operates in a fully distributed manner. PMID:23748172
Ontological Approach to Military Knowledge Modeling and Management
2004-03-01
federated search mechanism has to reformulate user queries (expressed using the ontology) in the query languages of the different sources (e.g. SQL...ontologies as a common terminology – Unified query to perform federated search • Query processing – Ontology mapping to sources reformulate queries
A Visual Interface for Querying Heterogeneous Phylogenetic Databases.
Jamil, Hasan M
2017-01-01
Despite the recent growth in the number of phylogenetic databases, access to this wealth of resources remains largely driven by tool- or form-based interfaces. It is our thesis that the flexibility afforded by declarative query languages may offer the opportunity to access these repositories in a better way, and to use such a language to pose truly powerful queries in unprecedented ways. In this paper, we propose a substantially enhanced closed visual query language, called PhyQL, that can be used to query phylogenetic databases represented in a canonical form. The canonical representation presented helps capture most phylogenetic tree formats in a convenient way, and is used as the storage model for our PhyloBase database for which PhyQL serves as the query language. We have implemented a visual interface for the end users to pose PhyQL queries using visual icons, and drag and drop operations defined over them. Once a query is posed, the interface translates the visual query into a Datalog query for execution over the canonical database. Responses are returned as hyperlinks to phylogenies that can be viewed in several formats using the tree viewers supported by PhyloBase. Results cached in the PhyQL buffer allow secondary querying of the computed results, making it a truly powerful querying architecture.
Goetz, Matthew B; Bowman, Candice; Hoang, Tuyen; Anaya, Henry; Osborn, Teresa; Gifford, Allen L; Asch, Steven M
2008-03-19
We describe how we used the framework of the U.S. Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI) to develop a program to improve rates of diagnostic testing for the Human Immunodeficiency Virus (HIV). This venture was prompted by the observation by the CDC that 25% of HIV-infected patients do not know their diagnosis - a point of substantial importance to the VA, which is the largest provider of HIV care in the United States. Following the QUERI steps (or process), we evaluated: 1) whether undiagnosed HIV infection is a high-risk, high-volume clinical issue within the VA, 2) whether there are evidence-based recommendations for HIV testing, 3) whether there are gaps in the performance of VA HIV testing, and 4) the barriers and facilitators to improving current practice in the VA. Based on our findings, we developed and initiated a QUERI step 4/phase 1 pilot project using the precepts of the Chronic Care Model. Our improvement strategy relies upon electronic clinical reminders to provide decision support; audit/feedback as a clinical information system, and appropriate changes in delivery system design. These activities are complemented by academic detailing and social marketing interventions to achieve provider activation. Our preliminary formative evaluation indicates the need to ensure leadership and team buy-in, address facility-specific barriers, refine the reminder, and address factors that contribute to inter-clinic variances in HIV testing rates. Preliminary unadjusted data from the first seven months of our program show 3-5 fold increases in the proportion of at-risk patients who are offered HIV testing at the VA sites (stations) where the pilot project has been undertaken; no change was seen at control stations. This project demonstrates the early success of the application of the QUERI process to the development of a program to improve HIV testing rates. 
Preliminary unadjusted results show that the coordinated use of audit/feedback, provider activation, and organizational change can increase HIV testing rates for at-risk patients. We are refining our program prior to extending our work to a small-scale, multi-site evaluation (QUERI step 4/phase 2). We also plan to evaluate the durability/sustainability of the intervention effect, the costs of HIV testing, and the number of newly identified HIV-infected patients. Ultimately, we will evaluate this program in other geographically dispersed stations (QUERI step 4/phases 3 and 4).
Goetz, Matthew B; Bowman, Candice; Hoang, Tuyen; Anaya, Henry; Osborn, Teresa; Gifford, Allen L; Asch, Steven M
2008-01-01
Background We describe how we used the framework of the U.S. Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI) to develop a program to improve rates of diagnostic testing for the Human Immunodeficiency Virus (HIV). This venture was prompted by the observation by the CDC that 25% of HIV-infected patients do not know their diagnosis – a point of substantial importance to the VA, which is the largest provider of HIV care in the United States. Methods Following the QUERI steps (or process), we evaluated: 1) whether undiagnosed HIV infection is a high-risk, high-volume clinical issue within the VA, 2) whether there are evidence-based recommendations for HIV testing, 3) whether there are gaps in the performance of VA HIV testing, and 4) the barriers and facilitators to improving current practice in the VA. Based on our findings, we developed and initiated a QUERI step 4/phase 1 pilot project using the precepts of the Chronic Care Model. Our improvement strategy relies upon electronic clinical reminders to provide decision support; audit/feedback as a clinical information system, and appropriate changes in delivery system design. These activities are complemented by academic detailing and social marketing interventions to achieve provider activation. Results Our preliminary formative evaluation indicates the need to ensure leadership and team buy-in, address facility-specific barriers, refine the reminder, and address factors that contribute to inter-clinic variances in HIV testing rates. Preliminary unadjusted data from the first seven months of our program show 3–5 fold increases in the proportion of at-risk patients who are offered HIV testing at the VA sites (stations) where the pilot project has been undertaken; no change was seen at control stations. Discussion This project demonstrates the early success of the application of the QUERI process to the development of a program to improve HIV testing rates. 
Preliminary unadjusted results show that the coordinated use of audit/feedback, provider activation, and organizational change can increase HIV testing rates for at-risk patients. We are refining our program prior to extending our work to a small-scale, multi-site evaluation (QUERI step 4/phase 2). We also plan to evaluate the durability/sustainability of the intervention effect, the costs of HIV testing, and the number of newly identified HIV-infected patients. Ultimately, we will evaluate this program in other geographically dispersed stations (QUERI step 4/phases 3 and 4). PMID:18353185
Sánchez-de-Madariaga, Ricardo; Muñoz, Adolfo; Castro, Antonio L; Moreno, Oscar; Pascual, Mario
2018-01-01
This research shows a protocol to assess the computational complexity of querying relational and non-relational (NoSQL (not only Structured Query Language)) standardized electronic health record (EHR) medical information database systems (DBMS). It uses a set of three doubling-sized databases, i.e. databases storing 5000, 10,000 and 20,000 realistic standardized EHR extracts, in three different database management systems (DBMS): relational MySQL object-relational mapping (ORM), document-based NoSQL MongoDB, and native extensible markup language (XML) NoSQL eXist. The average response times to six complexity-increasing queries were computed, and the results showed a linear behavior in the NoSQL cases. In the NoSQL field, MongoDB presents a much flatter linear slope than eXist. NoSQL systems may also be more appropriate to maintain standardized medical information systems due to the special nature of the updating policies of medical information, which should not affect the consistency and efficiency of the data stored in NoSQL databases. One limitation of this protocol is the lack of direct results of improved relational systems such as archetype relational mapping (ARM) with the same data. However, the interpolation of doubling-size database results to those presented in the literature and other published results suggests that NoSQL systems might be more appropriate in many specific scenarios and problems to be solved. For example, NoSQL may be appropriate for document-based tasks such as EHR extracts used in clinical practice, or edition and visualization, or situations where the aim is not only to query medical information, but also to restore the EHR in exactly its original form. PMID:29608174
Data Access Based on a Guide Map of the Underwater Wireless Sensor Network
Wei, Zhengxian; Song, Min; Yin, Guisheng; Wang, Hongbin; Cheng, Albert M. K.
2017-01-01
Underwater wireless sensor networks (UWSNs) represent an area of increasing research interest, as data storage, discovery, and query of UWSNs are always challenging issues. In this paper, a data access based on a guide map (DAGM) method is proposed for UWSNs. In DAGM, the metadata describes the abstracts of data content and the storage location. The center ring is composed of nodes according to the shortest average data query path in the network in order to store the metadata, and the data guide map organizes, diffuses and synchronizes the metadata in the center ring, providing the most time-saving and energy-efficient data query service for the user. For this method, firstly the data is stored in the UWSN. The storage node is determined, the data is transmitted from the sensor node (data generation source) to the storage node, and the metadata is generated for it. Then, the metadata is sent to the center ring node that is the nearest to the storage node and the data guide map organizes the metadata, diffusing and synchronizing it to the other center ring nodes. Finally, when there is query data in any user node, the data guide map will select a center ring node nearest to the user to process the query sentence, and based on the shortest transmission delay and lowest energy consumption, data transmission routing is generated according to the storage location abstract in the metadata. Hence, specific application data transmission from the storage node to the user is completed. The simulation results demonstrate that DAGM has advantages with respect to data access time and network energy consumption. PMID:29039757
Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS.
Yu, Hwanjo; Kim, Taehoon; Oh, Jinoh; Ko, Ilhwan; Kim, Sungchul; Han, Wook-Shin
2010-04-16
Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without being integrated with the keyword queries, and users have to provide a large number of training documents to get a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multi-level relevance feedback in real time on PubMed. RefMed supports multi-level relevance feedback by using the RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into RDBMS to support both keyword queries and the multi-level relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. RefMed is the first multi-level relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user's feedback and efficiently processes the function to return relevant articles in real time.
Sánchez-de-Madariaga, Ricardo; Muñoz, Adolfo; Castro, Antonio L; Moreno, Oscar; Pascual, Mario
2018-03-19
This research shows a protocol to assess the computational complexity of querying relational and non-relational (NoSQL (not only Structured Query Language)) standardized electronic health record (EHR) medical information database systems (DBMS). It uses a set of three doubling-sized databases, i.e. databases storing 5000, 10,000 and 20,000 realistic standardized EHR extracts, in three different database management systems (DBMS): relational MySQL object-relational mapping (ORM), document-based NoSQL MongoDB, and native extensible markup language (XML) NoSQL eXist. The average response times to six complexity-increasing queries were computed, and the results showed a linear behavior in the NoSQL cases. In the NoSQL field, MongoDB presents a much flatter linear slope than eXist. NoSQL systems may also be more appropriate to maintain standardized medical information systems due to the special nature of the updating policies of medical information, which should not affect the consistency and efficiency of the data stored in NoSQL databases. One limitation of this protocol is the lack of direct results of improved relational systems such as archetype relational mapping (ARM) with the same data. However, the interpolation of doubling-size database results to those presented in the literature and other published results suggests that NoSQL systems might be more appropriate in many specific scenarios and problems to be solved. For example, NoSQL may be appropriate for document-based tasks such as EHR extracts used in clinical practice, or edition and visualization, or situations where the aim is not only to query medical information, but also to restore the EHR in exactly its original form.
Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS
2010-01-01
Background Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without being integrated with the keyword queries, and users have to provide a large number of training documents to get a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multi-level relevance feedback in real time on PubMed. Results RefMed supports multi-level relevance feedback by using the RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into RDBMS to support both keyword queries and the multi-level relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. Conclusions RefMed is the first multi-level relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user’s feedback and efficiently processes the function to return relevant articles in real time. PMID:20406504
Retrieving high-resolution images over the Internet from an anatomical image database
NASA Astrophysics Data System (ADS)
Strupp-Adams, Annette; Henderson, Earl
1999-12-01
The Visible Human Data set is an important contribution to the national collection of anatomical images. To enhance the availability of these images, the National Library of Medicine has supported the design and development of a prototype object-oriented image database which imports, stores, and distributes high resolution anatomical images in both pixel and voxel formats. One of the key database modules is its client-server Internet interface. This Web interface provides a query engine with retrieval access to high-resolution anatomical images that range in size from 100KB for browser viewable rendered images, to 1GB for anatomical structures in voxel file formats. The Web query and retrieval client-server system is composed of applet GUIs, servlets, and RMI application modules which communicate with each other to allow users to query for specific anatomical structures, and retrieve image data as well as associated anatomical images from the database. Selected images can be downloaded individually as single files via HTTP or downloaded in batch-mode over the Internet to the user's machine through an applet that uses Netscape's Object Signing mechanism. The image database uses ObjectDesign's object-oriented DBMS, ObjectStore, which has a Java interface. The query and retrieval system has been tested with a Java-CDE window system, and on the x86 architecture using Windows NT 4.0. This paper describes the Java applet client search engine that queries the database; the Java client module that enables users to view anatomical images online; the Java application server interface to the database which organizes data returned to the user, and its distribution engine that allows users to download image files individually and/or in batch-mode.
Data Access Based on a Guide Map of the Underwater Wireless Sensor Network.
Wei, Zhengxian; Song, Min; Yin, Guisheng; Song, Houbing; Wang, Hongbin; Ma, Xuefei; Cheng, Albert M K
2017-10-17
Underwater wireless sensor networks (UWSNs) represent an area of increasing research interest, as data storage, discovery, and query of UWSNs are always challenging issues. In this paper, a data access based on a guide map (DAGM) method is proposed for UWSNs. In DAGM, the metadata describes the abstracts of data content and the storage location. The center ring is composed of nodes according to the shortest average data query path in the network in order to store the metadata, and the data guide map organizes, diffuses and synchronizes the metadata in the center ring, providing the most time-saving and energy-efficient data query service for the user. For this method, firstly the data is stored in the UWSN. The storage node is determined, the data is transmitted from the sensor node (data generation source) to the storage node, and the metadata is generated for it. Then, the metadata is sent to the center ring node that is the nearest to the storage node and the data guide map organizes the metadata, diffusing and synchronizing it to the other center ring nodes. Finally, when there is query data in any user node, the data guide map will select a center ring node nearest to the user to process the query sentence, and based on the shortest transmission delay and lowest energy consumption, data transmission routing is generated according to the storage location abstract in the metadata. Hence, specific application data transmission from the storage node to the user is completed. The simulation results demonstrate that DAGM has advantages with respect to data access time and network energy consumption.
An ontology-driven tool for structured data acquisition using Web forms.
Gonçalves, Rafael S; Tu, Samson W; Nyulas, Csongor I; Tierney, Michael J; Musen, Mark A
2017-08-01
Structured data acquisition is a common task that is widely performed in biomedicine. However, current solutions for this task are far from providing a means to structure data in such a way that it can be automatically employed in decision making (e.g., in our example application domain of clinical functional assessment, for determining eligibility for disability benefits) based on conclusions derived from acquired data (e.g., assessment of impaired motor function). To use data in these settings, we need it structured in a way that can be exploited by automated reasoning systems, for instance, in the Web Ontology Language (OWL), the de facto ontology language for the Web. We tackle the problem of generating Web-based assessment forms from OWL ontologies, and aggregating input gathered through these forms as an ontology of "semantically-enriched" form data that can be queried using an RDF query language, such as SPARQL. We developed an ontology-based structured data acquisition system, which we present through its specific application to the clinical functional assessment domain. We found that data gathered through our system is highly amenable to automatic analysis using queries. We demonstrated how ontologies can be used to help structure Web-based forms and to semantically enrich the data elements of the acquired structured data. The ontologies associated with the enriched data elements enable automated inferences and provide a rich vocabulary for performing queries.
An ontology-based search engine for protein-protein interactions
2010-01-01
Background Keyword matching or ID matching is the most common searching method in a large database of protein-protein interactions. They are purely syntactic methods, and retrieve the records in the database that contain a keyword or ID specified in a query. Such syntactic search methods often retrieve too few search results or no results despite many potential matches present in the database. Results We have developed a new method for representing protein-protein interactions and the Gene Ontology (GO) using modified Gödel numbers. This representation is hidden from users but enables a search engine using the representation to efficiently search protein-protein interactions in a biologically meaningful way. Given a query protein with optional search conditions expressed in one or more GO terms, the search engine finds all the interaction partners of the query protein by unique prime factorization of the modified Gödel numbers representing the query protein and the search conditions. Conclusion Representing the biological relations of proteins and their GO annotations by modified Gödel numbers makes a search engine efficiently find all protein-protein interactions by prime factorization of the numbers. Keyword matching or ID matching search methods often miss the interactions involving a protein that has no explicit annotations matching the search condition, but our search engine retrieves such interactions as well if they satisfy the search condition with a more specific term in the ontology. PMID:20122195
An ontology-based search engine for protein-protein interactions.
Park, Byungkyu; Han, Kyungsook
2010-01-18
Keyword matching or ID matching is the most common searching method in a large database of protein-protein interactions. They are purely syntactic methods, and retrieve the records in the database that contain a keyword or ID specified in a query. Such syntactic search methods often retrieve too few search results or no results despite many potential matches present in the database. We have developed a new method for representing protein-protein interactions and the Gene Ontology (GO) using modified Gödel numbers. This representation is hidden from users but enables a search engine using the representation to efficiently search protein-protein interactions in a biologically meaningful way. Given a query protein with optional search conditions expressed in one or more GO terms, the search engine finds all the interaction partners of the query protein by unique prime factorization of the modified Gödel numbers representing the query protein and the search conditions. Representing the biological relations of proteins and their GO annotations by modified Gödel numbers makes a search engine efficiently find all protein-protein interactions by prime factorization of the numbers. Keyword matching or ID matching search methods often miss the interactions involving a protein that has no explicit annotations matching the search condition, but our search engine retrieves such interactions as well if they satisfy the search condition with a more specific term in the ontology.
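The prime-factorization idea in this abstract can be illustrated with a minimal sketch. It is not the authors' modified Gödel numbering, whose details the abstract does not give; the toy ontology, the protein names and every helper function below are hypothetical. Each term gets a distinct prime, and a protein's code is the product of the primes of its annotated terms and all their ancestors, so a query term matches whenever its prime divides the code, even when the protein is annotated only with a more specific term.

```python
from math import prod

# Hypothetical toy ontology: child term -> parent terms (not real GO data).
PARENTS = {
    "binding": [],
    "protein binding": ["binding"],
    "kinase binding": ["protein binding"],
    "catalytic activity": [],
}

def first_primes(n):
    """Return the first n primes by trial division (fine for a toy example)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Assign one distinct prime to every term.
PRIME_OF = dict(zip(PARENTS, first_primes(len(PARENTS))))

def with_ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def encode(terms):
    """Encode a set of annotations as a product of primes (ancestors included)."""
    closure = set()
    for term in terms:
        closure |= with_ancestors(term)
    return prod(PRIME_OF[t] for t in closure)

def matches(code, query_term):
    """A protein satisfies the query if the query term's prime divides its code."""
    return code % PRIME_OF[query_term] == 0

# Hypothetical proteins, annotated only with their most specific terms.
proteins = {
    "P1": encode({"kinase binding"}),
    "P2": encode({"catalytic activity"}),
}
hits = sorted(name for name, code in proteins.items() if matches(code, "binding"))
```

Here a search for the general term "binding" finds P1 although P1 carries no explicit "binding" annotation, which is the behaviour a purely syntactic keyword match would miss.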
Using patient lists to add value to integrated data repositories.
Wade, Ted D; Zelarney, Pearlanne T; Hum, Richard C; McGee, Sylvia; Batson, Deborah H
2014-12-01
Patient lists are project-specific sets of patients that can be queried in integrated data repositories (IDR's). By allowing a set of patients to be an addition to the qualifying conditions of a query, returned results will refer to, and only to, that set of patients. We report a variety of use cases for such lists, including: restricting retrospective chart review to a defined set of patients; following a set of patients for practice management purposes; distributing "honest-brokered" (deidentified) data; adding phenotypes to biosamples; and enhancing the content of study or registry data. Among the capabilities needed to implement patient lists in an IDR are: capture of patient identifiers from a query and feedback of these into the IDR; the existence of a permanent internal identifier in the IDR that is mappable to external identifiers; the ability to add queryable attributes to the IDR; the ability to merge data from multiple queries; and suitable control over user access and de-identification of results. We implemented patient lists in a custom IDR of our own design. We reviewed capabilities of other published IDRs for focusing on sets of patients. The widely used i2b2 IDR platform has various ways to address patient sets, and it could be modified to add the low-overhead version of patient lists that we describe.
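The core mechanism, a patient list acting as one more qualifying condition of a query, can be sketched as follows. This is an assumption-laden illustration, not the authors' IDR or the i2b2 schema: the table names, column names, list name and all rows are hypothetical, synthetic placeholders, and an in-memory SQLite database stands in for the repository.

```python
import sqlite3

# In-memory stand-in for an IDR; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE observation (patient_id INTEGER, code TEXT, value REAL);
    CREATE TABLE patient_list (list_name TEXT, patient_id INTEGER);
""")
conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", [
    (1, "HbA1c", 7.9), (2, "HbA1c", 8.4), (3, "HbA1c", 6.1), (4, "HbA1c", 9.0),
])
# A project-specific patient list, keyed by the internal patient identifier.
conn.executemany("INSERT INTO patient_list VALUES (?, ?)", [
    ("chart_review", 2), ("chart_review", 3),
])

def query_with_list(conn, list_name, code, min_value):
    """Apply the usual qualifying conditions, restricted to one patient list."""
    rows = conn.execute(
        """
        SELECT o.patient_id, o.value
        FROM observation AS o
        JOIN patient_list AS pl ON pl.patient_id = o.patient_id
        WHERE pl.list_name = ? AND o.code = ? AND o.value >= ?
        ORDER BY o.patient_id
        """,
        (list_name, code, min_value),
    )
    return rows.fetchall()

result = query_with_list(conn, "chart_review", "HbA1c", 7.0)
```

Patients 1 and 4 meet the clinical condition but are not on the list, so only patient 2 is returned; the join is what makes the results refer to, and only to, the listed patients.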
Woo, Hyekyung; Cho, Youngtae; Shim, Eunyoung; Lee, Jong-Koo; Lee, Chang-Gun; Kim, Seong Hwan
2016-07-04
As suggested as early as in 2006, logs of queries submitted to search engines seeking information could be a source for detection of emerging influenza epidemics if changes in the volume of search queries are monitored (infodemiology). However, selecting queries that are most likely to be associated with influenza epidemics is a particular challenge when it comes to generating better predictions. In this study, we describe a methodological extension for detecting influenza outbreaks using search query data; we provide a new approach for query selection through the exploration of contextual information gleaned from social media data. Additionally, we evaluate whether it is possible to use these queries for monitoring and predicting influenza epidemics in South Korea. Our study was based on freely available weekly influenza incidence data and query data originating from the search engine on the Korean website Daum between April 3, 2011 and April 5, 2014. To select queries related to influenza epidemics, several approaches were applied: (1) exploring influenza-related words in social media data, (2) identifying the chief concerns related to influenza, and (3) using Web query recommendations. Optimal feature selection by least absolute shrinkage and selection operator (Lasso) and support vector machine for regression (SVR) were used to construct a model predicting influenza epidemics. In total, 146 queries related to influenza were generated through our initial query selection approach. A considerable proportion of optimal features for final models were derived from queries with reference to the social media data. The SVR model performed well: the prediction values were highly correlated with the recent observed influenza-like illness (r=.956; P<.001) and virological incidence rate (r=.963; P<.001). These results demonstrate the feasibility of using search queries to enhance influenza surveillance in South Korea. 
In addition, an approach for query selection using social media data seems ideal for supporting influenza surveillance based on search query data.
Woo, Hyekyung; Shim, Eunyoung; Lee, Jong-Koo; Lee, Chang-Gun; Kim, Seong Hwan
2016-01-01
Background As suggested as early as in 2006, logs of queries submitted to search engines seeking information could be a source for detection of emerging influenza epidemics if changes in the volume of search queries are monitored (infodemiology). However, selecting queries that are most likely to be associated with influenza epidemics is a particular challenge when it comes to generating better predictions. Objective In this study, we describe a methodological extension for detecting influenza outbreaks using search query data; we provide a new approach for query selection through the exploration of contextual information gleaned from social media data. Additionally, we evaluate whether it is possible to use these queries for monitoring and predicting influenza epidemics in South Korea. Methods Our study was based on freely available weekly influenza incidence data and query data originating from the search engine on the Korean website Daum between April 3, 2011 and April 5, 2014. To select queries related to influenza epidemics, several approaches were applied: (1) exploring influenza-related words in social media data, (2) identifying the chief concerns related to influenza, and (3) using Web query recommendations. Optimal feature selection by least absolute shrinkage and selection operator (Lasso) and support vector machine for regression (SVR) were used to construct a model predicting influenza epidemics. Results In total, 146 queries related to influenza were generated through our initial query selection approach. A considerable proportion of optimal features for final models were derived from queries with reference to the social media data. The SVR model performed well: the prediction values were highly correlated with the recent observed influenza-like illness (r=.956; P<.001) and virological incidence rate (r=.963; P<.001). Conclusions These results demonstrate the feasibility of using search queries to enhance influenza surveillance in South Korea. 
In addition, an approach for query selection using social media data seems ideal for supporting influenza surveillance based on search query data. PMID:27377323
Schuers, Matthieu; Joulakian, Mher; Kerdelhué, Gaetan; Segas, Léa; Grosjean, Julien; Darmoni, Stéfan J; Griffon, Nicolas
2017-07-03
MEDLINE is the most widely used medical bibliographic database in the world. Most of its citations are in English and this can be an obstacle for some researchers to access the information the database contains. We created a multilingual query builder to facilitate access to the PubMed subset using a language other than English. The aim of our study was to assess the impact of this multilingual query builder on the quality of PubMed queries for non-native English speaking physicians and medical researchers. A randomised controlled study was conducted among French speaking general practice residents. We designed a multi-lingual query builder to facilitate information retrieval, based on available MeSH translations and providing users with both an interface and a controlled vocabulary in their own language. Participating residents were randomly allocated either the French or the English version of the query builder. They were asked to translate 12 short medical questions into MeSH queries. The main outcome was the quality of the query. Two librarians blind to the arm independently evaluated each query, using a modified published classification that differentiated eight types of errors. Twenty residents used the French version of the query builder and 22 used the English version. 492 queries were analysed. There were significantly more perfect queries in the French group vs. the English group (respectively 37.9% vs. 17.9%; p < 0.01). It took significantly more time for the members of the English group than the members of the French group to build each query, respectively 194 sec vs. 128 sec; p < 0.01. This multi-lingual query builder is an effective tool to improve the quality of PubMed queries in particular for researchers whose first language is not English.
A Web-Based Data-Querying Tool Based on Ontology-Driven Methodology and Flowchart-Based Model
Ping, Xiao-Ou; Chung, Yufang; Liang, Ja-Der; Yang, Pei-Ming; Huang, Guan-Tarn; Lai, Feipei
2013-01-01
Background Because of the increased adoption rate of electronic medical record (EMR) systems, more health care records have been increasingly accumulating in clinical data repositories. Therefore, querying the data stored in these repositories is crucial for retrieving the knowledge from such large volumes of clinical data. Objective The aim of this study is to develop a Web-based approach for enriching the capabilities of the data-querying system along the three following considerations: (1) the interface design used for query formulation, (2) the representation of query results, and (3) the models used for formulating query criteria. Methods The Guideline Interchange Format version 3.5 (GLIF3.5), an ontology-driven clinical guideline representation language, was used for formulating the query tasks based on the GLIF3.5 flowchart in the Protégé environment. The flowchart-based data-querying model (FBDQM) query execution engine was developed and implemented for executing queries and presenting the results through a visual and graphical interface. To examine a broad variety of patient data, the clinical data generator was implemented to automatically generate the clinical data in the repository, and the generated data, thereby, were employed to evaluate the system. The accuracy and time performance of the system for three medical query tasks relevant to liver cancer were evaluated based on the clinical data generator in the experiments with varying numbers of patients. Results In this study, a prototype system was developed to test the feasibility of applying a methodology for building a query execution engine using FBDQMs by formulating query tasks using the existing GLIF. The FBDQM-based query execution engine was used to successfully retrieve the clinical data based on the query tasks formatted using the GLIF3.5 in the experiments with varying numbers of patients. 
The accuracy of the three queries (ie, “degree of liver damage,” “degree of liver damage when applying a mutually exclusive setting,” and “treatments for liver cancer”) was 100% for all four experiments (10 patients, 100 patients, 1000 patients, and 10,000 patients). Among the three measured query phases, (1) structured query language operations, (2) criteria verification, and (3) other, the first two had the longest execution time. Conclusions The ontology-driven FBDQM-based approach enriched the capabilities of the data-querying system. The adoption of the GLIF3.5 increased the potential for interoperability, shareability, and reusability of the query tasks. PMID:25600078
Spatial Processes in Linear Ordering
ERIC Educational Resources Information Center
von Hecker, Ulrich; Klauer, Karl Christoph; Wolf, Lukas; Fazilat-Pour, Masoud
2016-01-01
Memory performance in linear order reasoning tasks (A > B, B > C, C > D, etc.) shows quicker and more accurate responses to queries on wider (AD) than narrower (AB) pairs on a hypothetical linear mental model (A -- B -- C -- D). While indicative of an analogue representation, research so far has not provided positive evidence for spatial…
Student Query Trend Assessment with Semantical Annotation and Artificial Intelligent Multi-Agents
ERIC Educational Resources Information Center
Malik, Kaleem Razzaq; Mir, Rizwan Riaz; Farhan, Muhammad; Rafiq, Tariq; Aslam, Muhammad
2017-01-01
Research in era of data representation to contribute and improve key data policy involving the assessment of learning, training and English language competency. Students are required to communicate in English with high level impact using language and influence. The electronic technology works to assess students' questions positively enabling…
ERIC Educational Resources Information Center
Layton, Rebekah L.; Brandt, Patrick D.; Freeman, Ashalla M.; Harrell, Jessica R.; Hall, Joshua D.; Sinche, Melanie
2016-01-01
A national sample of PhD-trained scientists completed training, accepted subsequent employment in academic and nonacademic positions, and were queried about their previous graduate training and current employment. Respondents indicated factors contributing to their employment decision (e.g., working conditions, salary, job security). The data…
Processing uncertain RFID data in traceability supply chains.
Xie, Dong; Xiao, Jie; Guo, Guangjun; Jiang, Tong
2014-01-01
Radio Frequency Identification (RFID) is widely used to track and trace objects in traceability supply chains. However, the massive uncertain data produced by RFID readers cannot be used effectively and efficiently in RFID application systems. Following the analysis of key features of RFID objects, this paper proposes a new framework for effectively and efficiently processing uncertain RFID data, and supporting a variety of queries for tracking and tracing RFID objects. We adjust different smoothing windows according to different rates of uncertain data, employ different strategies to process uncertain readings, and distinguish ghost, missing, and incomplete data according to their apparent positions. We propose a comprehensive data model which is suitable for different application scenarios. In addition, a path coding scheme is proposed to significantly compress massive data by aggregating the path sequence, the position, and the time intervals. The scheme is suitable for cyclic or long paths. Moreover, we further propose a processing algorithm for group and independent objects. Experimental evaluations show that our approach is effective and efficient in terms of the compression and traceability queries.
Processing Uncertain RFID Data in Traceability Supply Chains
Xie, Dong; Xiao, Jie
2014-01-01
Radio Frequency Identification (RFID) is widely used to track and trace objects in traceability supply chains. However, massive uncertain data produced by RFID readers are not effective and efficient to be used in RFID application systems. Following the analysis of key features of RFID objects, this paper proposes a new framework for effectively and efficiently processing uncertain RFID data, and supporting a variety of queries for tracking and tracing RFID objects. We adjust different smoothing windows according to different rates of uncertain data, employ different strategies to process uncertain readings, and distinguish ghost, missing, and incomplete data according to their apparent positions. We propose a comprehensive data model which is suitable for different application scenarios. In addition, a path coding scheme is proposed to significantly compress massive data by aggregating the path sequence, the position, and the time intervals. The scheme is suitable for cyclic or long paths. Moreover, we further propose a processing algorithm for group and independent objects. Experimental evaluations show that our approach is effective and efficient in terms of the compression and traceability queries. PMID:24737978
Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.
Pandey, Prashant; Almodaresi, Fatemeh; Bender, Michael A; Ferdman, Michael; Johnson, Rob; Patro, Rob
2018-06-18
Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.
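The idea of an exact sequence-search index can be illustrated with a toy sketch. It assumes plain Python dicts and sets in place of Mantis's actual data structures, which the abstract does not detail; the function names and the `theta` threshold (the fraction of a query's k-mers that must occur in an experiment) are hypothetical choices for illustration only.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(experiments, k):
    """Map each k-mer to the set of experiment ids whose reads contain it."""
    index = defaultdict(set)
    for exp_id, reads in experiments.items():
        for read in reads:
            for km in kmers(read, k):
                index[km].add(exp_id)
    return index

def search(index, query, k, theta=0.8):
    """Return experiments holding at least a fraction theta of the query's k-mers."""
    q = kmers(query, k)
    if not q:
        return set()
    hits = defaultdict(int)
    for km in q:
        for exp_id in index.get(km, ()):
            hits[exp_id] += 1
    return {e for e, n in hits.items() if n >= theta * len(q)}

# Hypothetical toy data: two "experiments" with one read each.
experiments = {"exp1": ["ACGTACGT"], "exp2": ["TTTTACGG"]}
index = build_index(experiments, k=4)
print(search(index, "ACGTAC", 4))
```

Because every k-mer is stored exactly, this toy index cannot report false positives, which is the property the abstract contrasts with Bloom-filter-based approaches; it makes no attempt at the space efficiency the paper reports.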
Using internet searches for influenza surveillance.
Polgreen, Philip M; Chen, Yiling; Pennock, David M; Nelson, Forrest D
2008-12-01
The Internet is an important source of health information. Thus, the frequency of Internet searches may provide information regarding infectious disease activity. As an example, we examined the relationship between searches for influenza and actual influenza occurrence. Using search queries from the Yahoo! search engine ( http://search.yahoo.com ) from March 2004 through May 2008, we counted daily unique queries originating in the United States that contained influenza-related search terms. Counts were divided by the total number of searches, and the resulting daily fraction of searches was averaged over the week. We estimated linear models, using searches with 1-10-week lead times as explanatory variables to predict the percentage of cultures positive for influenza and deaths attributable to pneumonia and influenza in the United States. With use of the frequency of searches, our models predicted an increase in cultures positive for influenza 1-3 weeks in advance of when they occurred (P < .001), and similar models predicted an increase in mortality attributable to pneumonia and influenza up to 5 weeks in advance (P < .001). Search-term surveillance may provide an additional tool for disease surveillance.
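The data preparation described above (daily counts divided by total searches, averaged over the week, then regressed with a lead time) can be sketched as follows. This is a minimal illustration with a single explanatory variable per model and hypothetical function names; the study's actual model specification is not given in the abstract.

```python
def weekly_search_fraction(daily_flu, daily_total):
    """Average the daily fraction flu/total over consecutive 7-day blocks."""
    fractions = [f / t for f, t in zip(daily_flu, daily_total)]
    return [sum(fractions[i:i + 7]) / 7
            for i in range(0, len(fractions) - 6, 7)]

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_with_lead(search_weekly, outcome_weekly, lead):
    """Regress the outcome in week t + lead on the search fraction in week t."""
    x = search_weekly[:len(search_weekly) - lead]
    y = outcome_weekly[lead:]
    return fit_line(x, y)
```

Calling `fit_with_lead` for each lead from 1 to 10 weeks would give one candidate model per lead time, which could then be compared on how well each predicts the surveillance outcome.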
Mining Longitudinal Web Queries: Trends and Patterns.
ERIC Educational Resources Information Center
Wang, Peiling; Berry, Michael W.; Yang, Yiheng
2003-01-01
Analyzed user queries submitted to an academic Web site during a four-year period, using a relational database, to examine users' query behavior, to identify problems they encounter, and to develop techniques for optimizing query analysis and mining. Linguistic analyses focus on query structures, lexicon, and word associations using statistical…
WATCHMAN: A Data Warehouse Intelligent Cache Manager
NASA Technical Reports Server (NTRS)
Scheuermann, Peter; Shim, Junho; Vingralek, Radek
1996-01-01
Data warehouses store large volumes of data which are used frequently by decision support applications. Such applications involve complex queries. Query performance in such an environment is critical because decision support applications often require interactive query response time. Because data warehouses are updated infrequently, it becomes possible to improve query performance by caching sets retrieved by queries in addition to query execution plans. In this paper we report on the design of an intelligent cache manager for sets retrieved by queries called WATCHMAN, which is particularly well suited for data warehousing environment. Our cache manager employs two novel, complementary algorithms for cache replacement and for cache admission. WATCHMAN aims at minimizing query response time and its cache replacement policy swaps out entire retrieved sets of queries instead of individual pages. The cache replacement and admission algorithms make use of a profit metric, which considers for each retrieved set its average rate of reference, its size, and execution cost of the associated query. We report on a performance evaluation based on the TPC-D and Set Query benchmarks. These experiments show that WATCHMAN achieves a substantial performance improvement in a decision support environment when compared to a traditional LRU replacement algorithm.
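A profit-driven cache of retrieved sets might look like the sketch below. The abstract names the three inputs to the profit metric but not how they are combined, so the formula `rate * cost / size` is an assumption, as are the class and function names and the simplified admission test (a new set is rejected if it would displace any set with equal or higher profit).

```python
from dataclasses import dataclass

@dataclass
class RetrievedSet:
    query_id: str
    size: int      # space occupied by the retrieved set
    cost: float    # execution cost of the associated query
    rate: float    # average rate of reference

    @property
    def profit(self):
        # Assumed combination of the three factors named in the abstract.
        return self.rate * self.cost / self.size

def admit(cache, capacity, new):
    """Return the cache contents after trying to admit `new`."""
    used = sum(s.size for s in cache)
    victims = []
    # Pick eviction candidates in order of increasing profit.
    for s in sorted(cache, key=lambda s: s.profit):
        if used + new.size <= capacity:
            break
        victims.append(s)
        used -= s.size
    if used + new.size > capacity:
        return cache  # does not fit even with an empty cache
    if any(v.profit >= new.profit for v in victims):
        return cache  # admission check: keep the more profitable sets
    return [s for s in cache if s not in victims] + [new]
```

Unlike LRU, which would always evict the least recently used entry, this policy keeps an expensive, frequently referenced result in the cache even when a cheaper result was touched more recently.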
Infant Temperament: Stability by Age, Gender, Birth Order, Term Status, and SES
Bornstein, Marc H.; Putnick, Diane L.; Gartstein, Maria A.; Hahn, Chun-Shin; Auestad, Nancy; O’Connor, Deborah L.
2015-01-01
Two complementary studies focused on stability of infant temperament across the first year and considered infant age, gender, birth order, term status, and socioeconomic status (SES) as moderators. Study 1 consisted of 73 mothers of firstborn term girls and boys queried at 2, 5, and 13 months of age. Study 2 consisted of 335 mothers of infants of different gender, birth order, term status, and SES queried at 6 and 12 months. Consistent positive and negative affectivity factors emerged at all time-points across both studies. Infant temperament proved stable and robust across gender, birth order, term status, and SES. Stability coefficients for temperament factors and scales were medium to large for shorter (<9 months) inter-assessment intervals and small to medium for longer (>10 months) intervals. PMID:25865034
Mobile agent location in distributed environments
NASA Astrophysics Data System (ADS)
Fountoukis, S. G.; Argyropoulos, I. P.
2012-12-01
An agent is a small program acting on behalf of a user or an application which plays the role of a user. Artificial intelligence can be encapsulated in agents so that they can be capable of both behaving autonomously and showing an elementary decision ability regarding movement and some specific actions. Therefore they are often called autonomous mobile agents. In a distributed system, they can move themselves from one processing node to another through the interconnecting network infrastructure. Their purpose is to collect useful information and to carry it back to their user. Also, agents are used to start, monitor and stop processes running on the individual interconnected processing nodes of computer cluster systems. An agent has a unique id to discriminate itself from other agents and a current position. The position can be expressed as the address of the processing node which currently hosts the agent. Very often, it is necessary for a user, a processing node or another agent to know the current position of an agent in a distributed system. Several procedures and algorithms have been proposed for the purpose of position location of mobile agents. The most basic of all employs a fixed computing node, which acts as agent position repository, receiving messages from all the moving agents and keeping records of their current positions. The fixed node responds to position queries and informs users, other nodes and other agents about the position of an agent. Herein, a model is proposed that considers pairs and triples of agents instead of single ones. A location method, which is investigated in this paper, attempts to exploit this model.
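The baseline scheme with a fixed repository node can be sketched in a few lines. This is an in-process illustration only: the class and method names are hypothetical, network messaging is replaced by direct method calls, and the pair/triple model proposed in the paper is not covered because the abstract does not describe it.

```python
class PositionRepository:
    """Fixed node that records where each mobile agent currently resides."""

    def __init__(self):
        self._positions = {}  # agent id -> address of the hosting node

    def report(self, agent_id, node_address):
        """Called by an agent after it arrives at a new processing node."""
        self._positions[agent_id] = node_address

    def locate(self, agent_id):
        """Answer a position query; returns None for an unknown agent."""
        return self._positions.get(agent_id)

repo = PositionRepository()
repo.report("agent-7", "10.0.0.12")   # agent starts on one node
repo.report("agent-7", "10.0.0.31")   # agent migrates and reports again
print(repo.locate("agent-7"))
```

Every migration costs one message to the repository, and every lookup costs one round trip, which is the overhead that alternative location schemes try to reduce.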
Standard Biological Parts Knowledgebase
Galdzicki, Michal; Rodriguez, Cesar; Chandran, Deepak; Sauro, Herbert M.; Gennari, John H.
2011-01-01
We have created the Knowledgebase of Standard Biological Parts (SBPkb) as a publically accessible Semantic Web resource for synthetic biology (sbolstandard.org). The SBPkb allows researchers to query and retrieve standard biological parts for research and use in synthetic biology. Its initial version includes all of the information about parts stored in the Registry of Standard Biological Parts (partsregistry.org). SBPkb transforms this information so that it is computable, using our semantic framework for synthetic biology parts. This framework, known as SBOL-semantic, was built as part of the Synthetic Biology Open Language (SBOL), a project of the Synthetic Biology Data Exchange Group. SBOL-semantic represents commonly used synthetic biology entities, and its purpose is to improve the distribution and exchange of descriptions of biological parts. In this paper, we describe the data, our methods for transformation to SBPkb, and finally, we demonstrate the value of our knowledgebase with a set of sample queries. We use RDF technology and SPARQL queries to retrieve candidate “promoter” parts that are known to be both negatively and positively regulated. This method provides new web based data access to perform searches for parts that are not currently possible. PMID:21390321
Visual information mining in remote sensing image archives
NASA Astrophysics Data System (ADS)
Pelizzari, Andrea; Descargues, Vincent; Datcu, Mihai P.
2002-01-01
The present article focuses on the development of interactive exploratory tools for visually mining the image content in large remote sensing archives. Two aspects are treated: the iconic visualization of the global information in the archive and the progressive visualization of the image details. The proposed methods are integrated in the Image Information Mining (I2M) system. The images and image structure in the I2M system are indexed based on a probabilistic approach. The resulting links are managed by a relational data base. Both the intrinsic complexity of the observed images and the diversity of user requests result in a great number of associations in the data base. Thus new tools have been designed to visualize, in iconic representation the relationships created during a query or information mining operation: the visualization of the query results positioned on the geographical map, quick-looks gallery, visualization of the measure of goodness of the query, visualization of the image space for statistical evaluation purposes. Additionally the I2M system is enhanced with progressive detail visualization in order to allow better access for operator inspection. I2M is a three-tier Java architecture and is optimized for the Internet.
An Automated Approach to Reasoning Under Multiple Perspectives
NASA Technical Reports Server (NTRS)
deBessonet, Cary
2004-01-01
This is the final report with emphasis on research during the last term. The context for the research has been the development of an automated reasoning technology for use in SMS (Symbolic Manipulation System), a system used to build and query knowledge bases (KBs) using a special knowledge representation language SL (Symbolic Language). SMS interprets assertive SL input and enters the results as components of its universe. The system operates in two basic modes: 1) constructive mode (for building KBs); and 2) query/search mode (for querying KBs). Query satisfaction consists of matching query components with KB components. The system allows "penumbral matches," that is, matches that do not exactly meet the specifications of the query, but which are deemed relevant for the conversational context. If the user wants to know whether SMS has information that holds, say, for "any chow," the scope of relevancy might be set so that the system would respond based on a finding that it has information that holds for "most dogs," although this is not exactly what was called for by the query. The response would be qualified accordingly, as would normally be the case in ordinary human conversation. The general goal of the research was to develop an approach by which assertive content could be interpreted from multiple perspectives so that reasoning operations could be successfully conducted over the results.
The interpretation of an SL statement such as, "{person believes [captain (asserted (perhaps)) (astronaut saw (comet (bright)))]}," which in English would amount to asserting something to the effect that, "Some person believes that a captain perhaps asserted that an astronaut saw a bright comet," would require the recognition of multiple perspectives, including some that are: a) epistemically-based (focusing on "believes"); b) assertion-based (focusing on "asserted"); c) perception-based (focusing on "saw"); d) adjectivally-based (focusing on "bright"); and e) modally-based (focusing on "perhaps"). Any conclusion reached under a line of reasoning that employs such an assertion or its associated implications should somehow reflect the employed perspectives. The investigators made significant progress in developing an approach that would enable a system to conduct reasoning operations over assertions of this kind while maintaining consistency in its knowledge bases. Significant accomplishments were made in the areas of: 1) integration and inferencing; 2) generation of perspectives, including wholistic and composite views; and 3) consistency maintenance.
Unstructured medical image query using big data - An epilepsy case study.
Istephan, Sarmad; Siadat, Mohammad-Reza
2016-02-01
Big data technologies are critical to the medical field, which requires new frameworks to leverage them. Such frameworks would help medical experts test hypotheses by querying huge volumes of unstructured medical data to provide better patient care. The objective of this work is to implement and examine the feasibility of having such a framework to provide efficient querying of unstructured data in unlimited ways. The feasibility study was conducted specifically in the epilepsy field. The proposed framework evaluates a query in two phases. In phase 1, structured data is used to filter the clinical data warehouse. In phase 2, feature extraction modules are executed on the unstructured data in a distributed manner via Hadoop to complete the query. Three modules have been created: volume comparer, surface to volume conversion and average intensity. The framework allows for user-defined modules to be imported to provide unlimited ways to process the unstructured data, hence potentially extending the application of this framework beyond the epilepsy field. Two types of criteria were used to validate the feasibility of the proposed framework - the ability/accuracy of fulfilling an advanced medical query and the efficiency that Hadoop provides. For the first criterion, the framework executed an advanced medical query that spanned both structured and unstructured data with accurate results. For the second criterion, different architectures were explored to evaluate the performance of various Hadoop configurations and were compared to a traditional Single Server Architecture (SSA). The surface to volume conversion module performed up to 40 times faster than the SSA (using a 20 node Hadoop cluster) and the average intensity module performed up to 85 times faster than the SSA (using a 40 node Hadoop cluster). Furthermore, the 40 node Hadoop cluster executed the average intensity module on 10,000 models in 3h, which was not even practical for the SSA.
The current study is limited to the epilepsy field, and further research and more feature extraction modules are required to show its applicability in other medical domains. The proposed framework advances data-driven medicine by unleashing the content of unstructured medical data in an efficient and unlimited way to be harnessed by medical experts.
PAQ: Persistent Adaptive Query Middleware for Dynamic Environments
NASA Astrophysics Data System (ADS)
Rajamani, Vasanth; Julien, Christine; Payton, Jamie; Roman, Gruia-Catalin
Pervasive computing applications often entail continuous monitoring tasks, issuing persistent queries that return continuously updated views of the operational environment. We present PAQ, a middleware that supports applications' needs by approximating a persistent query as a sequence of one-time queries. PAQ introduces an integration strategy abstraction that allows composition of one-time query responses into streams representing sophisticated spatio-temporal phenomena of interest. A distinguishing feature of our middleware is the realization that the suitability of a persistent query's result is a function of the application's tolerance for accuracy weighed against the associated overhead costs. In PAQ, programmers can specify an inquiry strategy that dictates how information is gathered. Since network dynamics impact the suitability of a particular inquiry strategy, PAQ associates an introspection strategy with a persistent query, that evaluates the quality of the query's results. The result of introspection can trigger application-defined adaptation strategies that alter the nature of the query. PAQ's simple API makes developing adaptive querying systems easily realizable. We present the key abstractions, describe their implementations, and demonstrate the middleware's usefulness through application examples and evaluation.
NASA Astrophysics Data System (ADS)
Kuznetsov, Valentin; Riley, Daniel; Afaq, Anzar; Sekhri, Vijay; Guo, Yuyi; Lueking, Lee
2010-04-01
The CMS experiment has implemented a flexible and powerful system enabling users to find data within the CMS physics data catalog. The Dataset Bookkeeping Service (DBS) comprises a database and the services used to store and access metadata related to CMS physics data. To this, we have added a generalized query system in addition to the existing web and programmatic interfaces to the DBS. This query system is based on a query language that hides the complexity of the underlying database structure by discovering the join conditions between database tables. This provides a way of querying the system that is simple and straightforward for CMS data managers and physicists to use without requiring knowledge of the database tables or keys. The DBS Query Language uses the ANTLR tool to build the input query parser and tokenizer, followed by a query builder that uses a graph representation of the DBS schema to construct the SQL query sent to the underlying database. We will describe the design of the query system, provide details of the language components and give an overview of how this component fits into the overall data discovery system architecture.
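Discovering join conditions from a graph of the schema can be illustrated with a breadth-first search between two tables. The sketch below is an assumption about how such a query builder could work, not the DBS implementation: the table names, column names, and helper functions are all hypothetical, and the parsing step that ANTLR performs is omitted.

```python
from collections import deque

def join_path(schema, start, goal):
    """Breadth-first search; returns a list of (table_a, table_b, condition)."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        if table == goal:
            break
        for neighbor, cond in schema.get(table, {}).items():
            if neighbor not in prev:
                prev[neighbor] = (table, cond)
                queue.append(neighbor)
    if goal not in prev:
        return None  # the two tables are not connected
    path, node = [], goal
    while prev[node] is not None:
        parent, cond = prev[node]
        path.append((parent, node, cond))
        node = parent
    return path[::-1]

def build_sql(schema, select_col, start, goal, where):
    """Assemble a SELECT statement, adding the joins found by join_path."""
    joins = join_path(schema, start, goal)
    if joins is None:
        raise ValueError(f"no join path from {start} to {goal}")
    sql = f"SELECT {select_col} FROM {start}"
    for _, table, cond in joins:
        sql += f" JOIN {table} ON {cond}"
    return sql + f" WHERE {where}"

# Hypothetical schema graph: table -> {neighbor table: join condition}.
schema = {
    "dataset": {"block": "block.dataset_id = dataset.id"},
    "block": {"dataset": "block.dataset_id = dataset.id",
              "file": "file.block_id = block.id"},
    "file": {"block": "file.block_id = block.id"},
}
print(build_sql(schema, "file.name", "dataset", "file", "dataset.name = 'x'"))
```

The user names only the column wanted and the filter; the intermediate `block` table is supplied by the graph search. A production builder would also need to bind filter values as parameters rather than interpolating them into the SQL string.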
Spatial aggregation query in dynamic geosensor networks
NASA Astrophysics Data System (ADS)
Yi, Baolin; Feng, Dayang; Xiao, Shisong; Zhao, Erdun
2007-11-01
Wireless sensor networks have been widely used for civilian and military applications, such as environmental monitoring and vehicle tracking. In many of these applications, research mainly aims at building sensor-network-based systems that deliver the sensed data to applications. However, existing work has seldom addressed spatial aggregation queries that account for the dynamic characteristics of sensor networks. In this paper, we investigate how to process spatial aggregation queries over dynamic geosensor networks where both the sink node and sensor nodes are mobile, and we propose several novel improvements on enabling techniques. The mobility of sensors makes existing routing protocols based on information about a fixed framework or the neighborhood infeasible. We present an improved location-based stateless implicit geographic forwarding (IGF) protocol for routing a query toward the area specified by the query window, and a diameter-based window aggregation query (DWAQ) algorithm for query propagation and data aggregation in the query window. Finally, considering the changing location of the sink node, we present two schemes to forward the result to the sink node. Simulation results show that the proposed algorithms can improve query latency and query accuracy.
The Chandra Source Catalog: User Interface
NASA Astrophysics Data System (ADS)
Bonaventura, Nina; Evans, I. N.; Harbo, P. N.; Rots, A. H.; Tibbetts, M. S.; Van Stone, D. W.; Zografou, P.; Anderson, C. S.; Chen, J. C.; Davis, J. E.; Doe, S. M.; Evans, J. D.; Fabbiano, G.; Galle, E.; Gibbs, D. G.; Glotfelty, K. J.; Grier, J. D.; Hain, R.; Hall, D. M.; He, X.; Houck, J. C.; Karovska, M.; Lauer, J.; McCollough, M. L.; McDowell, J. C.; Miller, J. B.; Mitschang, A. W.; Morgan, D. L.; Nichols, J. S.; Nowak, M. A.; Plummer, D. A.; Primini, F. A.; Refsdal, B. L.; Siemiginowska, A. L.; Sundheim, B. A.; Winkelman, S. L.
2009-01-01
The Chandra Source Catalog (CSC) is the definitive catalog of all X-ray sources detected by Chandra. The CSC is presented to the user in two tables: the Master Chandra Source Table and the Table of Individual Source Observations. Each distinct X-ray source identified in the CSC is represented by a single master source entry and one or more individual source entries. If a source is unaffected by confusion and pile-up in multiple observations, the individual source observations are merged to produce a master source. In each table, a row represents a source, and each column a quantity that is officially part of the catalog. The CSC contains positions and multi-band fluxes for the sources, as well as derived spatial, spectral, and temporal source properties. The CSC also includes associated source region and full-field data products for each source, including images, photon event lists, light curves, and spectra. The master source properties represent the best estimates of the properties of a source, and are presented in the following categories: Position and Position Errors, Source Flags, Source Extent and Errors, Source Fluxes, Source Significance, Spectral Properties, and Source Variability. The CSC Data Access GUI provides direct access to the source properties and data products contained in the catalog. The user may query the catalog database via a web-style search or an SQL command-line query. Each query returns a table of source properties, along with the option to browse and download associated data products. The GUI is designed to run in a web browser with Java version 1.5 or higher, and may be accessed via a link on the CSC website homepage (http://cxc.harvard.edu/csc/). As an alternative to the GUI, the contents of the CSC may be accessed directly through a URL, using the command-line tool, cURL. Support: NASA contract NAS8-03060 (CXC).
On describing human white matter anatomy: the white matter query language.
Wassermann, Demian; Makris, Nikos; Rathi, Yogesh; Shenton, Martha; Kikinis, Ron; Kubicki, Marek; Westin, Carl-Fredrik
2013-01-01
The main contribution of this work is the careful syntactical definition of major white matter tracts in the human brain based on a neuroanatomist's expert knowledge. We present a technique to formally describe white matter tracts and to automatically extract them from diffusion MRI data. The framework is based on a novel query language with a near-to-English textual syntax. This query language allows us to construct a dictionary of anatomical definitions describing white matter tracts. The definitions include adjacent gray and white matter regions, and rules for spatial relations. This enables automated coherent labeling of white matter anatomy across subjects. We use our method to encode anatomical knowledge in human white matter describing 10 association and 8 projection tracts per hemisphere and 7 commissural tracts. The technique is shown to be comparable in accuracy to manual labeling. We present results applying this framework to create a white matter atlas from 77 healthy subjects, and we use this atlas in a proof-of-concept study to detect tract changes specific to schizophrenia.
Software Helps Retrieve Information Relevant to the User
NASA Technical Reports Server (NTRS)
Mathe, Natalie; Chen, James
2003-01-01
The Adaptive Indexing and Retrieval Agent (ARNIE) is a code library, designed to be used by an application program, that assists human users in retrieving desired information in a hypertext setting. Using ARNIE, the program implements a computational model for interactively learning what information each human user considers relevant in context. The model, called a "relevance network," incrementally adapts retrieved information to users' individual profiles on the basis of feedback from the users regarding specific queries. The model also generalizes such knowledge for subsequent derivation of relevant references for similar queries and profiles, thereby assisting users in filtering information by relevance. ARNIE thus enables users to categorize and share information of interest in various contexts. ARNIE encodes the relevance and structure of information in a neural network dynamically configured with a genetic algorithm. ARNIE maintains an internal database, wherein it saves associations, and from which it returns associated items in response to a query. A C++ compiler for a platform on which ARNIE will be utilized is necessary for creating the ARNIE library but is not necessary for the execution of the software.
NASA Astrophysics Data System (ADS)
Chmiel, P.; Ganzha, M.; Jaworska, T.; Paprzycki, M.
2017-10-01
Nowadays, as part of the systematic growth in the volume and variety of information that can be found on the Internet, we also observe a dramatic increase in the sizes of available image collections. There are many ways to help users browse and select images of interest. One popular approach is Content-Based Image Retrieval (CBIR) systems, which allow users to search for images that match their interests, expressed in the form of images (query by example). However, we believe that image search and retrieval could take advantage of semantic technologies. We have decided to test this hypothesis. Specifically, on the basis of knowledge captured in the CBIR, we have developed a domain ontology of residential real estate (detached houses, in particular). This allows us to semantically represent each image (and its constitutive architectural elements) represented within the CBIR. The proposed ontology was extended to capture not only the elements resulting from image segmentation, but also "spatial relations" between them. As a result, a new approach to querying the image database (semantic querying) has materialized, thus extending capabilities of the developed system.
NASA Astrophysics Data System (ADS)
Taira, Ricky K.; Wong, Clement; Johnson, David; Bhushan, Vikas; Rivera, Monica; Huang, Lu J.; Aberle, Denise R.; Cardenas, Alfonso F.; Chu, Wesley W.
1995-05-01
With the increase in the volume and distribution of images and text available in PACS and medical electronic health-care environments it becomes increasingly important to maintain indexes that summarize the content of these multi-media documents. Such indices are necessary to quickly locate relevant patient cases for research, patient management, and teaching. The goal of this project is to develop an intelligent document retrieval system that allows researchers to request patient cases based on document content. Thus we wish to retrieve patient cases from electronic information archives that could include a combined specification of patient demographics, low-level radiologic findings (size, shape, number), intermediate-level radiologic findings (e.g., atelectasis, infiltrates, etc.) and/or high-level pathology constraints (e.g., well-differentiated small cell carcinoma). The cases could be distributed among multiple heterogeneous databases such as PACS, RIS, and HIS. Content-based retrieval systems go beyond the capabilities of simple key-word or string-based retrieval matching systems. These systems require a knowledge base to comprehend the generality/specificity of a concept (thus knowing the subclasses or related concepts to a given concept) and knowledge of the various string representations for each concept (i.e., synonyms, lexical variants, etc.). We have previously reported on a data integration mediation layer that allows transparent access to multiple heterogeneous distributed medical databases (HIS, RIS, and PACS). The data access layer of our architecture currently has limited query processing capabilities. Given a patient hospital identification number, the access mediation layer collects all documents in RIS and HIS and returns this information to a specified workstation location.
In this paper we report on our efforts to extend the query processing capabilities of the system by creation of custom query interfaces, an intelligent query processing engine, and a document-content index that can be generated automatically (i.e., no manual authoring or changes to the normal clinical protocols).
A unified framework for managing provenance information in translational research
2011-01-01
Background A critical aspect of the NIH Translational Research roadmap, which seeks to accelerate the delivery of "bench-side" discoveries to patient's "bedside," is the management of the provenance metadata that keeps track of the origin and history of data resources as they traverse the path from the bench to the bedside and back. A comprehensive provenance framework is essential for researchers to verify the quality of data, reproduce scientific results published in peer-reviewed literature, validate scientific process, and associate trust value with data and results. Traditional approaches to provenance management have focused on only partial sections of the translational research life cycle and they do not incorporate "domain semantics", which is essential to support domain-specific querying and analysis by scientists. Results We identify a common set of challenges in managing provenance information across the pre-publication and post-publication phases of data in the translational research lifecycle. We define the semantic provenance framework (SPF), underpinned by the Provenir upper-level provenance ontology, to address these challenges in the four stages of provenance metadata: (a) Provenance collection - during data generation (b) Provenance representation - to support interoperability, reasoning, and incorporate domain semantics (c) Provenance storage and propagation - to allow efficient storage and seamless propagation of provenance as the data is transferred across applications (d) Provenance query - to support queries with increasing complexity over large data size and also support knowledge discovery applications We apply the SPF to two exemplar translational research projects, namely the Semantic Problem Solving Environment for Trypanosoma cruzi (T.cruzi SPSE) and the Biomedical Knowledge Repository (BKR) project, to demonstrate its effectiveness. 
Conclusions The SPF provides a unified framework to effectively manage provenance of translational research data during pre and post-publication phases. This framework is underpinned by an upper-level provenance ontology called Provenir that is extended to create domain-specific provenance ontologies to facilitate provenance interoperability, seamless propagation of provenance, automated querying, and analysis. PMID:22126369
Possible Effects of Dietary Anthocyanins on Diabetes and Insulin Resistance.
Turrini, Eleonora; Ferruzzi, Lorenzo; Fimognari, Carmela
2017-01-01
Diabetes is reaching epidemic proportions worldwide. Many dietary compounds have been found to exert health beneficial effects against different pathologies including diabetes. Most bioactive compounds have been identified in fruits and vegetables and their mechanisms of action explored both in vitro and in vivo. In particular, great interest has been given to polyphenols and especially to a specific subset of molecules, i.e. anthocyanins. Several lines of evidence suggest that anthocyanins have positive effects on human health by inducing a number of biological activities. This review will give an overview on the influence of dietary anthocyanins on preventing and managing type 2 diabetes. In particular, in vitro and in vivo studies will be presented. The article also reviews the potential clinical impact of the antidiabetic activity of anthocyanins and outlines the major challenges of using anthocyanins for diabetes treatment. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Hoogendam, Arjen; Stalenhoef, Anton FH; Robbé, Pieter F de Vries; Overbeke, A John PM
2008-01-01
Background The use of PubMed to answer daily medical care questions is limited because it is challenging to retrieve a small set of relevant articles and time is restricted. Knowing what aspects of queries are likely to retrieve relevant articles can increase the effectiveness of PubMed searches. The objectives of our study were to identify queries that are likely to retrieve relevant articles by relating PubMed search techniques and tools to the number of articles retrieved and the selection of articles for further reading. Methods This was a prospective observational study of queries regarding patient-related problems sent to PubMed by residents and internists in internal medicine working in an Academic Medical Centre. We analyzed queries, search results, query tools (MeSH, Limits, wildcards, operators), selection of abstract and full-text for further reading, using a portal that mimics PubMed. Results PubMed was used to solve 1121 patient-related problems, resulting in 3205 distinct queries. Abstracts were viewed in 999 (31%) of these queries, and in 126 (39%) of 321 queries using query tools. The average term count per query was 2.5. Abstracts were selected in more than 40% of queries using four or five terms, increasing to 63% if the use of four or five terms yielded 2–161 articles. Conclusion Queries sent to PubMed by physicians at our hospital during daily medical care contain fewer than three terms. Queries using four to five terms, retrieving fewer than 161 article titles, are most likely to result in abstract viewing. PubMed search tools are used infrequently by our population and are less effective than the use of four or five terms. Methods to facilitate the formulation of precise queries, using more relevant terms, should be the focus of education and research. PMID:18816391
O'Sullivan, D; Wilk, S; Michalowski, W; Slowinski, R; Thomas, R; Kadzinski, M; Farion, K
2014-01-01
Online medical knowledge repositories such as MEDLINE and The Cochrane Library are increasingly used by physicians to retrieve articles to aid with clinical decision making. The prevailing approach for organizing retrieved articles is in the form of a rank-ordered list, with the assumption that the higher an article is presented on a list, the more relevant it is. Despite this common list-based organization, it is seldom studied how physicians perceive the association between the relevance of articles and the order in which articles are presented. In this paper we describe a case study that captured physician preferences for 3-element lists of medical articles in order to learn how to organize medical knowledge for decision-making. Comprehensive relevance evaluations were developed to represent 3-element lists of hypothetical articles that may be retrieved from an online medical knowledge source such as MEDLINE or The Cochrane Library. Comprehensive relevance evaluations assess not only an article's relevance for a query, but also whether it has been placed on the correct list position. In other words an article may be relevant and correctly placed on a result list (e.g. the most relevant article appears first in the result list), an article may be relevant for a query but placed on an incorrect list position (e.g. the most relevant article appears second in a result list), or an article may be irrelevant for a query yet still appear in the result list. The relevance evaluations were presented to six senior physicians who were asked to express their preferences for an article's relevance and its position on a list by pairwise comparisons representing different combinations of 3-element lists. The elicited preferences were assessed using a novel GRIP (Generalized Regression with Intensities of Preference) method and represented as an additive value function. Value functions were derived for individual physicians as well as the group of physicians. 
The results show that physicians assign significant value to the 1st position on a list and they expect that the most relevant article is presented first. Whilst physicians still prefer obtaining a correctly placed article on position 2, they are also quite satisfied with a misplaced relevant article. Low consideration of the 3rd position was uniformly confirmed. Our findings confirm the importance of placing the most relevant article on the 1st position on a list and the importance paid to position on a list significantly diminishes after the 2nd position. The derived value functions may be used by developers of clinical decision support applications to decide how best to organize medical knowledge for decision making and to create personalized evaluation measures that can augment typical measures used to evaluate information retrieval systems.
Ben-Ishay, Offir; Daoud, Mai; Peled, Zvi; Brauner, Eran; Bahouth, Hany; Kluger, Yoram
2015-01-01
In pediatric care, the role of focused abdominal sonography in trauma (FAST) remains ill defined. The objective of this study was to assess the sensitivity and specificity of FAST for detecting free peritoneal fluid in children. The trauma registry of a single level I pediatric trauma center was queried for the results of FAST examination of consecutive pediatric (<18 years) blunt trauma patients over a period of 36 months, from January 2010 to December 2012. Demographics, type of injuries, FAST results, computerized tomography (CT) results, and operative findings were reviewed. During the study period, 543 injured pediatric patients (mean age 8.2 ± 5 years) underwent FAST examinations. In 95 (17.5 %) FAST was positive for free peritoneal fluid. CT examination was performed in 219 (40.3 %) children. Positive FAST examination was confirmed by CT scan in 61/73 (83.6 %). CT detected intra-peritoneal fluid in 62/448 (13.8 %) of the patients with negative FAST results. These findings correspond to a sensitivity of 50 %, specificity of 88 %, positive predictive value (PPV) of 84 %, and a negative predictive value (NPV) of 58 %. In patients who had negative FAST results and no CT examination (302), no missed abdominal injury was detected on clinical ground. FAST examination in the young age group (<2 years) yielded lower sensitivity and specificity (36 and 78 % respectively) with a PPV of only 50 %. This study shows that although a positive FAST evaluation does not necessarily correlate with an IAI, a negative one strongly suggests the absence of an IAI, with a high NPV. These findings are emphasized in the analysis of the subgroup of children less than 2 years of age. FAST examination tempered with sound clinical judgment seems to be an effective tool to discriminate injured children in need of further imaging evaluation.
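The accuracy figures reported above can be reproduced from the counts given in the abstract. The sketch below assumes that the 219 CT examinations divide into the 73 FAST-positive patients and 146 FAST-negative patients (219 minus 73); that split is an inference, not something the abstract states. Function and variable names are illustrative only.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV as fractions."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Counts from the abstract, with CT as the reference standard.
tp = 61            # positive FAST confirmed by CT
fp = 73 - 61       # positive FAST with CT, not confirmed
fn = 62            # negative FAST, fluid detected on CT
# Assumption: every remaining CT scan was done in a FAST-negative patient.
ct_with_negative_fast = 219 - 73
tn = ct_with_negative_fast - fn

metrics = diagnostic_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under that assumption the values come out at roughly 49.6%, 87.5%, 83.6% and 57.5%, which agree with the reported 50%, 88%, 84% and 58% after rounding.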
Putting Space Physics Data Facility (SPDF) Services to Good Use
NASA Astrophysics Data System (ADS)
Candey, R. M.; Bilitza, D.; Chimiak, R.; Cooper, J. F.; Garcia, L. N.; Harris, B.; Johnson, R. C.; King, J. H.; Kovalick, T.; Leckner, H.; Liu, M.; McGuire, R. E.; Papitashvili, N. E.; Roberts, A.
2009-12-01
The Space Physics Data Facility (SPDF) project provides heliophysics science-enabling information services and is the most widely used single access point to heliophysics science data and orbits from NASA's solar-heliospheric satellite missions. Our emphasis has been on active service of the best digital data products and key ancillary information with graphics, listings and production of subsetted or merged files (mass downloads or parameter-specific selections). Our services today include: (1) the Heliophysics Resource Gateway (HRG) data finding service (also known as the Virtual Space Physics Observatory or VSPO); (2) data services including the Coordinated Data Analysis Web (CDAWeb), the OMNIWeb compilation of interplanetary parameters (mapped to the Earth's bow shock) and related indices, and their large underlying collection of datasets; (3) orbit information and display services including the Satellite Situation Center (SSCWeb) and the 4D Orbit Viewer interactive Java client; and (4) the Common Data Format (CDF) software library and file format and science file format translation suite. Upcoming is (5) the Heliospheric Event List Manager (HELM) to coordinate lists of interesting events and provide a mechanism for tying together the above services and others. We describe several research projects that heavily used SPDF's services and resulted in publications. Although not actually all used at once, the following research scenario shows how SPDF and VxO services can be combined for studying solar events that produce energetic particles and effects at Earth: use the HRG/VSPO to locate data of interest, perhaps query OMNIWeb for times when energetic particle solar activity is high and query the SSCWeb orbit location service for when Cluster, Geotail, Polar/IMAGE are in position to measure the cusp, magnetotail and the Earth's aurora, respectively. Also query SSCWeb for times when Polar and magnetometer ground stations are on the same field lines. 
Using these times, use CDAWeb to browse data from these spacecraft, and add Wind and ACE field and plasma data to identify interplanetary shocks arriving at Earth. Use HRG to find and retrieve SOHO LASCO CME data at SDAC. Use the SSCWeb 4D Orbit Viewer to display the relative spacecraft positions and geophysical boundaries and to follow the magnetic footpoints of the satellites. Confirm auroral substorm activity by a quick browse of IMAGE FUV and TIMED GUVI data as movies showing the expanding and intensifying auroral oval. Finally, pull these data directly into your own analysis tool (such as ViSBARD or some model in IDL) via our web services or simple FTP transfer to complete the analysis.
An Initial Log Analysis of Usage Patterns on a Research Networking System
Boland, Mary Regina; Trembowelski, Sylvia; Bakken, Suzanne; Weng, Chunhua
2012-01-01
Usage data for research networking systems (RNSs) are valuable but generally unavailable for understanding scientific professionals’ information needs and online collaborator seeking behaviors. This study contributes a method for evaluating RNSs and initial usage knowledge of one RNS obtained from using this method. We designed a log for an institutional RNS, defined categories of users and tasks, and analyzed correlations between usage patterns and user and query types. Our results show that scientific professionals spend more time performing deep Web searching on RNSs than generic Google users and we also show that retrieving scientist profiles is faster on an RNS than on Google (3.5 seconds vs. 34.2 seconds) whereas organization‐specific browsing on an RNS takes longer than on Google (117.0 seconds vs. 34.2 seconds). Usage patterns vary by user role, e.g., faculty performed more informational queries than administrators, which implies role‐specific user support is needed for RNSs. Clin Trans Sci 2012; Volume 5: 340–347 PMID:22883612
Requests for electromyography in Rome: a critical evaluation
Di Fabio, Roberto; Castagnoli, Claudio; Madrigale, Andrea; Barrella, Massimo; Serrao, Mariano; Pierelli, Francesco
2013-01-01
Summary To date, there exist no data reporting the level of suitability of requests for electromyography examinations (EMGs) in Rome. The records of 1,220 consecutive patients (age: 57.6±15.0 years; 400 M, 820 F) in two neurophysiology laboratories were collected and analyzed. In total, 1,317 EMGs were requested, mainly by general practitioners (GPs) (57%) and orthopedic specialists (18%). The most common diagnoses were L4-L5 radiculopathy (22%) and carpal tunnel syndrome (21%); 332 examinations (25%) were normal. 68% of requests were not accompanied by any specific query. The concordance between initial hypothesis/final post-EMG diagnosis was low (<20%). When a specific query was indicated, the initial suspicion was confirmed by EMG in 54% of GP requests and 64% of requests by specialists (p=0.03). No difference in diagnostic ability was found between specialists (p>0.05). In 17% of cases, the EMG was deemed diagnostically useless by the neurophysiologist, which seems to indicate potentially suboptimal prescription of EMGs. PMID:24598396
Debruyne, Philip R; Johnson, Philip J; Pottel, Lies; Daniels, Susanna; Greer, Rachel; Hodgkinson, Elizabeth; Kelly, Stephen; Lycke, Michelle; Samol, Jens; Mason, Julie; Kimber, Donna; Loucaides, Eileen; Parmar, Mahesh Kb; Harvey, Sally
2015-06-01
Clarity and accuracy of the pharmacy aspects of cancer clinical trial protocols are essential. Inconsistencies and ambiguities in such protocols have the potential to delay research and jeopardise both patient safety and collection of credible data. The Chemotherapy and Pharmacy Advisory Service was established by the UK National Cancer Research Network, currently known as National Institute for Health Research Clinical Research Network, to improve the quality of pharmacy-related content in cancer clinical research protocols. This article reports the scope of Chemotherapy and Pharmacy Advisory Service, its methodology of mandated protocol review and pharmacy-related guidance initiatives and its current impact. Over a 6-year period (2008-2013) since the inception of Chemotherapy and Pharmacy Advisory Service, cancer clinical trial protocols were reviewed by the service, prior to implementation at clinical trial sites. A customised Review Checklist was developed and used by a panel of experts to standardise the review process and report back queries and inconsistencies to chief investigators. Based on common queries, a Standard Protocol Template comprising specific guidance on drug-related content and a Pharmacy Manual Template were developed. In addition, a guidance framework was established to address 'ad hoc' pharmacy-related queries. The most common remarks made at protocol review have been summarised and categorised through retrospective analysis. In order to evaluate the impact of the service, chief investigators were asked to respond to queries made at protocol review and make appropriate changes to their protocols. Responses from chief investigators have been collated and acceptance rates determined. A total of 176 protocols were reviewed. The median number of remarks per protocol was 26, of which 20 were deemed clinically relevant and mainly concerned the drug regimen, support medication, frequency and type of monitoring and drug supply aspects. 
Further analysis revealed that 62% of chief investigators responded to the review. All responses were positive with an overall acceptance rate of 89% of the proposed protocol changes. Review of pharmacy content of cancer clinical trial protocols is feasible and exposes many undetected clinically relevant issues that could hinder efficient trial conduct. Our service audit revealed that the majority of suggestions were effectively incorporated in the final protocols. The refinement of existing and development of new pharmacy-related guidance documents by Chemotherapy and Pharmacy Advisory Service might aid in better and safer clinical research. © The Author(s) 2015.
Lavorgna, Giovanni; Triunfo, Riccardo; Santoni, Federico; Orfanelli, Ugo; Noci, Sara; Bulfone, Alessandro; Zanetti, Gianluigi; Casari, Giorgio
2005-07-01
An increasing number of eukaryotic and prokaryotic genes are being found to have natural antisense transcripts (NATs). There is also growing evidence to suggest that antisense transcription could play a key role in many human diseases. Consequently, there have been several recent attempts to set up computational procedures aimed at identifying novel NATs. Our group has developed the AntiHunter program for the identification of expressed sequence tag (EST) antisense transcripts from BLAST output. In order to perform an analysis, the program requires a genomic sequence plus an associated list of transcript names and coordinates of the genomic region. After masking the repeated regions, the program carries out a BLASTN search of this sequence in the selected EST database, reporting via email the EST entries that reveal an antisense transcript according to the user-supplied list. Here, we present the newly developed version 2.0 of the AntiHunter tool. Several improvements have been added to this version of the program in order to increase its ability to detect a larger number of antisense ESTs. As a result, AntiHunter can now detect, on average, >45% more antisense ESTs with little or no increase in the percentage of the false positives. We also raised the maximum query size to 3 Mb (previously 1 Mb). Moreover, we found that a reasonable trade-off between the program search sensitivity and the maximum allowed size of the input-query sequence could be obtained by querying the database with the MEGABLAST program, rather than by using the BLAST one. We now offer this new opportunity to users, i.e. if choosing the MEGABLAST option, users can input a query sequence up to 30 Mb long, thus considerably improving the possibility to analyze longer query regions. The AntiHunter tool is freely available at http://bioinfo.crs4.it/AH2.0.
LAILAPS-QSM: A RESTful API and JAVA library for semantic query suggestions.
Chen, Jinbo; Scholz, Uwe; Zhou, Ruonan; Lange, Matthias
2018-03-01
In order to access and filter content of life-science databases, full text search is a widely applied query interface. But its high flexibility and intuitiveness are paid for with potentially imprecise and incomplete query results. To reduce this drawback, query assistance systems suggest those combinations of keywords with the highest potential to match most of the relevant data records. Widespread approaches are syntactic query corrections that avoid misspelling and support expansion of words by suffixes and prefixes. Synonym expansion approaches apply thesauri, ontologies, and query logs. All need laborious curation and maintenance. Furthermore, access to query logs is in general restricted. Approaches that infer related queries from a query profile, such as research field, geographic location, co-authorship, or affiliation, require user registration and public accessibility of that profile, which conflicts with privacy concerns. To overcome these drawbacks, we implemented LAILAPS-QSM, a machine learning approach that reconstructs possible linguistic contexts of a given keyword query. The context is inferred from the text records that are stored in the databases that are going to be queried or, for general-purpose query suggestion, extracted from PubMed abstracts and UniProt data. The supplied tool suite enables the pre-processing of these text records and the further computation of customized distributed word vectors. The latter are used to suggest alternative keyword queries. An evaluation of the query suggestion quality was done for plant science use cases. Locally present experts enable a cost-efficient quality assessment in the categories trait, biological entity, taxonomy, affiliation, and metabolic function, which has been performed using ontology term similarities. The LAILAPS-QSM mean information content similarity for 15 representative queries is 0.70, whereas 34% have a score above 0.80. 
In comparison, the information content similarity for query suggestions made by human experts is 0.90. The software is available either as a tool set to build and train dedicated query suggestion services or as an already trained general-purpose RESTful web service. The service uses open interfaces to be seamlessly embeddable into database frontends. The JAVA implementation uses highly optimized data structures and streamlined code to provide fast and scalable response for web service calls. The source code of LAILAPS-QSM is available under GNU General Public License version 2 in Bitbucket GIT repository: https://bitbucket.org/ipk_bit_team/bioescorte-suggestion.
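The idea of suggesting alternative keywords from distributed word vectors can be illustrated with a nearest-neighbour lookup by cosine similarity. This is only a minimal sketch and not the LAILAPS-QSM implementation or its API: the three-dimensional vectors, the terms, and the function names are all invented for illustration, whereas a real system would use vectors trained on database records or PubMed/UniProt text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def suggest(keyword, vectors, top_n=2):
    """Return the top_n terms whose vectors are closest to the keyword's."""
    query_vec = vectors[keyword]
    scored = [
        (cosine(query_vec, vec), term)
        for term, vec in vectors.items()
        if term != keyword
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_n]]

# Hypothetical toy vectors; real word vectors have hundreds of dimensions.
TOY_VECTORS = {
    "drought": [0.9, 0.1, 0.0],
    "water stress": [0.8, 0.2, 0.1],
    "salinity": [0.6, 0.4, 0.1],
    "flowering": [0.1, 0.9, 0.2],
    "barley": [0.2, 0.3, 0.9],
}

suggestions = suggest("drought", TOY_VECTORS)
print(suggestions)
```

With these toy values the two closest terms to "drought" are "water stress" and "salinity".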
Tao, Shiqiang; Cui, Licong; Wu, Xi; Zhang, Guo-Qiang
2017-01-01
To help researchers better access clinical data, we developed a prototype query engine called DataSphere for exploring large-scale integrated clinical data repositories. DataSphere expedites data importing using a NoSQL data management system and dynamically renders its user interface for concept-based querying tasks. DataSphere provides an interactive query-building interface together with query translation and optimization strategies, which enable users to build and execute queries effectively and efficiently. We successfully loaded a dataset of one million patients for University of Kentucky (UK) Healthcare into DataSphere with more than 300 million clinical data records. We evaluated DataSphere by comparing it with an instance of i2b2 deployed at UK Healthcare, demonstrating that DataSphere provides enhanced user experience for both query building and execution. PMID:29854239
Improve Performance of Data Warehouse by Query Cache
NASA Astrophysics Data System (ADS)
Gour, Vishal; Sarangdevot, S. S.; Sharma, Anand; Choudhary, Vinod
2010-11-01
The primary goal of a data warehouse is to free the information locked up in the operational database so that decision makers and business analysts can query, analyze and plan regardless of data changes in the operational database. Because the number of queries is large, in certain cases there is a reasonable probability that the same query is submitted by one or multiple users at different times. Each time a query is executed, all the data of the warehouse is analyzed to generate the result of that query. In this paper we study how using a query cache improves the performance of a data warehouse and try to identify the common problems faced. These problems are faced by data warehouse administrators who aim to minimize response time and improve the overall efficiency of queries in the data warehouse, particularly when the data warehouse is updated at regular intervals.
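A query cache of the kind described can be sketched as a dictionary keyed by the query text, cleared whenever the warehouse is refreshed. The example below is a minimal illustration using Python's built-in sqlite3 module as a stand-in for a warehouse; the class name, the `sales` table, and the clear-everything invalidation policy are assumptions made for the sketch, not details from the paper.

```python
import sqlite3

class CachedWarehouse:
    """Wrap a connection and reuse results of repeated identical queries."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def query(self, sql, params=()):
        key = (sql.strip(), tuple(params))
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        rows = self.conn.execute(sql, params).fetchall()
        self.cache[key] = rows
        return rows

    def refresh(self, rows):
        """Load new rows, then drop cached results that may now be stale."""
        self.conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        self.conn.commit()
        self.cache.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
warehouse = CachedWarehouse(conn)
warehouse.refresh([("north", 100.0), ("south", 50.0), ("north", 25.0)])

sql = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
first = warehouse.query(sql)    # executed against the database
second = warehouse.query(sql)   # served from the cache
warehouse.refresh([("south", 10.0)])
third = warehouse.query(sql)    # cache was cleared, so executed again
print(first, third, warehouse.hits, warehouse.misses)
```

Clearing the whole cache on each refresh is the simplest policy that stays correct under regular updates; a finer-grained scheme would invalidate only the entries that touch the refreshed tables.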
Safari, Leila; Patrick, Jon D
2018-06-01
This paper reports on a generic framework to provide clinicians with the ability to conduct complex analyses on elaborate research topics using cascaded queries to resolve internal time-event dependencies in the research questions, as an extension to the proposed Clinical Data Analytics Language (CliniDAL). A cascaded query model is proposed to resolve internal time-event dependencies in the queries, which can have up to five levels of criteria, starting with a query to define subjects to be admitted into a study, followed by a query to define the time span of the experiment. Three more cascaded queries can be required to define control groups, control variables and output variables, which all together simulate a real scientific experiment. Depending on the complexity of the research questions, the cascaded query model has the flexibility of merging some lower level queries for simple research questions or adding a nested query to each level to compose more complex queries. Three different scenarios (one of which contains two studies) are described and used for evaluation of the proposed solution. CliniDAL's complex analyses solution enables answering complex queries with time-event dependencies in a few hours at most, whereas answering them manually would take many days. An evaluation of results of the research studies based on the comparison between CliniDAL and SQL solutions reveals high usability and efficiency of CliniDAL's solution. Copyright © 2018 Elsevier Inc. All rights reserved.
Querying temporal clinical databases on granular trends.
Combi, Carlo; Pozzi, Giuseppe; Rossato, Rosalba
2012-04-01
This paper focuses on the identification of temporal trends involving different granularities in clinical databases, where data are temporal in nature: for example, while follow-up visit data are usually stored at the granularity of working days, queries on these data could require to consider trends either at the granularity of months ("find patients who had an increase of systolic blood pressure within a single month") or at the granularity of weeks ("find patients who had steady states of diastolic blood pressure for more than 3 weeks"). Representing and reasoning properly on temporal clinical data at different granularities are important both to guarantee the efficacy and the quality of care processes and to detect emergency situations. Temporal sequences of data acquired during a care process provide a significant source of information not only to search for a particular value or an event at a specific time, but also to detect some clinically-relevant patterns for temporal data. We propose a general framework for the description and management of temporal trends by considering specific temporal features with respect to the chosen time granularity. Temporal aspects of data are considered within temporal relational databases, first formally by using a temporal extension of the relational calculus, and then by showing how to map these relational expressions to plain SQL queries. Throughout the paper we consider the clinical domain of hemodialysis, where several parameters are periodically sampled during every session. Copyright © 2011 Elsevier Inc. All rights reserved.
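The first example query in the abstract ("find patients who had an increase of systolic blood pressure within a single month") can be approximated in plain SQL by grouping day-granularity visits into calendar months. The sketch below uses Python's sqlite3 module; the `visit` table, its columns, the sample rows, and the simplified reading of "increase" as any later and higher reading in the same calendar month are all assumptions, and the paper's formal trend definitions may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visit (patient_id INTEGER, visit_day TEXT, sbp INTEGER)"
)
conn.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [
        (1, "2011-03-02", 120), (1, "2011-03-20", 135),  # rise inside March
        (2, "2011-03-28", 118), (2, "2011-04-04", 140),  # rise spans two months
        (3, "2011-05-03", 130), (3, "2011-05-17", 125),  # fall inside May
    ],
)

# Pair each visit with a later visit of the same patient, keep only pairs
# falling in the same calendar month, and require a higher later reading.
INCREASE_WITHIN_MONTH = """
SELECT DISTINCT a.patient_id
FROM visit AS a
JOIN visit AS b
  ON a.patient_id = b.patient_id
 AND a.visit_day < b.visit_day
 AND strftime('%Y-%m', a.visit_day) = strftime('%Y-%m', b.visit_day)
WHERE b.sbp > a.sbp
ORDER BY a.patient_id
"""

patients = [row[0] for row in conn.execute(INCREASE_WITHIN_MONTH)]
print(patients)
```

Only patient 1 qualifies here: patient 2's increase crosses a month boundary, which is exactly the kind of distinction that depends on the chosen granularity.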
Marcus, Brian S; Carlson, Jestin N; Hegde, Gajanan G; Shang, Jennifer; Venkat, Arvind
2016-03-01
We sought to evaluate whether health care professionals' viewpoints differed on the role of ethics committees and hospitals in the resolution of clinical ethical dilemmas based on practice location. We conducted a survey study from December 21, 2013 to March 15, 2014 of health care professionals at six hospitals (one tertiary care academic medical center, three large community hospitals and two small community hospitals). The survey consisted of eight clinical ethics cases followed by statements on whether there was a role for the ethics committee or hospital in their resolution, what that role might be and case specific queries. Respondents used a 5-point Likert scale to express their degree of agreement with the premises posed. We used the ANOVA test to evaluate whether respondent views significantly varied based on practice location. 240 health care professionals (108-tertiary care center, 92-large community hospitals, 40-small community hospitals) completed the survey (response rate: 63.6 %). Only three individual queries of 32 showed any significant response variations across practice locations. Overall, viewpoints did not vary across practice locations within question categories on whether the ethics committee or hospital had a role in case resolution, what that role might be and case specific queries. In this multicenter survey study, the viewpoints of health care professionals on the role of ethics committees or hospitals in the resolution of clinical ethics cases varied little based on practice location.
Davis, Margot T; Mulvaney-Day, Norah; Larson, Mary Jo; Hoover, Ronald; Mauch, Danna
2014-12-01
Recent reports reinforce the widespread interest in complementary and alternative medicine (CAM), not only among military personnel with combat-related disorders, but also among providers who are pressed to respond to patient demand for these therapies. However, an understanding of utilization of CAM therapies in this population is lacking. The goals of this study are to synthesize the content of self-report population surveys with information on use of CAM in military and veteran populations, assess gaps in knowledge, and suggest ways to address current limitations. The research team conducted a literature review of population surveys to identify CAM definitions, whether military status was queried, the medical and psychological conditions queried, and each specific CAM question. Utilization estimates specific to military/veterans were summarized and limitations to knowledge were classified. Seven surveys of CAM utilization were conducted with military/veteran groups. In addition, 7 household surveys queried military status, although there was no military/veteran subgroup analysis. Definitions of CAM varied widely, limiting cross-survey analysis. Among active duty and Reserve military, CAM use ranged between 37% and 46%. Survey estimates do not specify CAM use that is associated with a medical or behavioral health condition. Comparisons between surveys are hampered by variation in methodologies. Too little is known about reasons for using CAM and conditions for which it is used. Additional information could be drawn from current surveys with additional subgroup analysis, and future surveys of CAM should include a military status variable.
Evaluation of Sub Query Performance in SQL Server
NASA Astrophysics Data System (ADS)
Oktavia, Tanty; Sujarwo, Surya
2014-03-01
The paper explores several sub query methods used in a query and their impact on query performance. The study uses an experimental approach to evaluate the performance of each sub query method combined with an indexing strategy. The sub query methods consist of IN, EXISTS, a relational operator, and a relational operator combined with the TOP operator. The experiments show that using a relational operator combined with an indexing strategy in a sub query performs better than using the same method without an indexing strategy, and better than the other methods. In summary, for applications that emphasize the performance of retrieving data from a database, it is better to use a relational operator combined with an indexing strategy. This study was done on Microsoft SQL Server 2012.
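The four sub query styles named above can be written side by side to show that they express the same question. This sketch runs on SQLite through Python's sqlite3 module rather than on SQL Server, so `LIMIT 1` stands in for `TOP 1`; the schema, data, and index name are invented, and the example demonstrates equivalence of results only, not the performance ranking reported in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Ana'), (2, 'Budi'), (3, 'Citra');
INSERT INTO orders VALUES (10, 1, 20.0), (11, 2, 75.0), (12, 2, 5.0);
-- Indexing strategy: index the column the sub queries look up.
CREATE INDEX idx_orders_customer ON orders (customer_id);
""")

# Question asked in four ways: which customer placed an order above 50?
QUERIES = {
    "in": """SELECT name FROM customer
             WHERE id IN (SELECT customer_id FROM orders WHERE amount > 50)""",
    "exists": """SELECT name FROM customer AS c
                 WHERE EXISTS (SELECT 1 FROM orders AS o
                               WHERE o.customer_id = c.id AND o.amount > 50)""",
    "relational": """SELECT name FROM customer
                     WHERE id = (SELECT customer_id FROM orders
                                 WHERE amount > 50)""",
    # On SQL Server the inner query would read SELECT TOP 1 customer_id ...
    "relational_top": """SELECT name FROM customer
                         WHERE id = (SELECT customer_id FROM orders
                                     WHERE amount > 50
                                     ORDER BY amount DESC LIMIT 1)""",
}

results = {label: conn.execute(sql).fetchall() for label, sql in QUERIES.items()}
print(results)
```

The plain relational form is only safe when the sub query yields a single row, as it does in this data; SQL Server raises an error if it returns more than one, which is one reason to combine it with TOP.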
Secure Skyline Queries on Cloud Platform.
Liu, Jinfei; Yang, Juncheng; Xiong, Li; Pei, Jian
2017-04-01
Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically-secure encryption. As a key subroutine, we present a new secure dominance protocol, which can be also used as a building block for other queries. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
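For readers unfamiliar with the skyline operator, the sketch below shows the plaintext dominance test and a naive skyline computation. It is not the secure protocol from the paper: there is no encryption here, and the function names and the hotel data are assumptions made for illustration. Smaller values are treated as better in every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to beach); lower is better for both.
hotels = [(50, 8), (60, 3), (80, 1), (90, 4), (55, 9)]
print(skyline(hotels))
```

In this example (90, 4) is dominated by (60, 3) and (55, 9) is dominated by (50, 8), so three hotels remain. A secure variant would have to evaluate the same comparisons without revealing the values or the outcome to the server.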
Distributed query plan generation using multiobjective genetic algorithm.
Panicker, Shina; Kumar, T V Vijay
2014-01-01
A distributed query processing strategy, which is a key performance determinant in accessing distributed databases, aims to minimize the total query processing cost. One way to achieve this is by generating efficient distributed query plans that involve fewer sites for processing a query. In the case of distributed relational databases, the number of possible query plans increases exponentially with respect to the number of relations accessed by the query and the number of sites where these relations reside. Consequently, computing optimal distributed query plans becomes a complex problem. This distributed query plan generation (DQPG) problem has already been addressed using single objective genetic algorithm, where the objective is to minimize the total query processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC). In this paper, this DQPG problem is formulated and solved as a biobjective optimization problem with the two objectives being minimize total LPC and minimize total CC. These objectives are simultaneously optimized using a multiobjective genetic algorithm NSGA-II. Experimental comparison of the proposed NSGA-II based DQPG algorithm with the single objective genetic algorithm shows that the former performs comparatively better and converges quickly towards optimal solutions for an observed crossover and mutation probability.
Distributed Query Plan Generation Using Multiobjective Genetic Algorithm
Panicker, Shina; Vijay Kumar, T. V.
2014-01-01
A distributed query processing strategy, which is a key performance determinant in accessing distributed databases, aims to minimize the total query processing cost. One way to achieve this is by generating efficient distributed query plans that involve fewer sites for processing a query. In the case of distributed relational databases, the number of possible query plans increases exponentially with respect to the number of relations accessed by the query and the number of sites where these relations reside. Consequently, computing optimal distributed query plans becomes a complex problem. This distributed query plan generation (DQPG) problem has already been addressed using single objective genetic algorithm, where the objective is to minimize the total query processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC). In this paper, this DQPG problem is formulated and solved as a biobjective optimization problem with the two objectives being minimize total LPC and minimize total CC. These objectives are simultaneously optimized using a multiobjective genetic algorithm NSGA-II. Experimental comparison of the proposed NSGA-II based DQPG algorithm with the single objective genetic algorithm shows that the former performs comparatively better and converges quickly towards optimal solutions for an observed crossover and mutation probability. PMID:24963513
Towards Hybrid Online On-Demand Querying of Realtime Data with Stateful Complex Event Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, Qunzhi; Simmhan, Yogesh; Prasanna, Viktor K.
Emerging Big Data applications in areas like e-commerce and energy industry require both online and on-demand queries to be performed over vast and fast data arriving as streams. These present novel challenges to Big Data management systems. Complex Event Processing (CEP) is recognized as a high performance online query scheme which in particular deals with the velocity aspect of the 3-V’s of Big Data. However, traditional CEP systems do not consider data variety and lack the capability to embed ad hoc queries over the volume of data streams. In this paper, we propose H2O, a stateful complex event processing framework, to support hybrid online and on-demand queries over realtime data. We propose a semantically enriched event and query model to address data variety. A formal query algebra is developed to precisely capture the stateful and containment semantics of online and on-demand queries. We describe techniques to achieve the interactive query processing over realtime data featured by efficient online querying, dynamic stream data persistence and on-demand access. The system architecture is presented and the current implementation status reported.
Query Health: standards-based, cross-platform population health surveillance
Klann, Jeffrey G; Buck, Michael D; Brown, Jeffrey; Hadley, Marc; Elmore, Richard; Weber, Griffin M; Murphy, Shawn N
2014-01-01
Objective Understanding population-level health trends is essential to effectively monitor and improve public health. The Office of the National Coordinator for Health Information Technology (ONC) Query Health initiative is a collaboration to develop a national architecture for distributed, population-level health queries across diverse clinical systems with disparate data models. Here we review Query Health activities, including a standards-based methodology, an open-source reference implementation, and three pilot projects. Materials and methods Query Health defined a standards-based approach for distributed population health queries, using an ontology based on the Quality Data Model and Consolidated Clinical Document Architecture, Health Quality Measures Format (HQMF) as the query language, the Query Envelope as the secure transport layer, and the Quality Reporting Document Architecture as the result language. Results We implemented this approach using Informatics for Integrating Biology and the Bedside (i2b2) and hQuery for data analytics and PopMedNet for access control, secure query distribution, and response. We deployed the reference implementation at three pilot sites: two public health departments (New York City and Massachusetts) and one pilot designed to support Food and Drug Administration post-market safety surveillance activities. The pilots were successful, although improved cross-platform data normalization is needed. Discussions This initiative resulted in a standards-based methodology for population health queries, a reference implementation, and revision of the HQMF standard. It also informed future directions regarding interoperability and data access for ONC's Data Access Framework initiative. Conclusions Query Health was a test of the learning health system that supplied a functional methodology and reference implementation for distributed population health queries that has been validated at three sites. PMID:24699371
Query Health: standards-based, cross-platform population health surveillance.
Klann, Jeffrey G; Buck, Michael D; Brown, Jeffrey; Hadley, Marc; Elmore, Richard; Weber, Griffin M; Murphy, Shawn N
2014-01-01
Understanding population-level health trends is essential to effectively monitor and improve public health. The Office of the National Coordinator for Health Information Technology (ONC) Query Health initiative is a collaboration to develop a national architecture for distributed, population-level health queries across diverse clinical systems with disparate data models. Here we review Query Health activities, including a standards-based methodology, an open-source reference implementation, and three pilot projects. Query Health defined a standards-based approach for distributed population health queries, using an ontology based on the Quality Data Model and Consolidated Clinical Document Architecture, Health Quality Measures Format (HQMF) as the query language, the Query Envelope as the secure transport layer, and the Quality Reporting Document Architecture as the result language. We implemented this approach using Informatics for Integrating Biology and the Bedside (i2b2) and hQuery for data analytics and PopMedNet for access control, secure query distribution, and response. We deployed the reference implementation at three pilot sites: two public health departments (New York City and Massachusetts) and one pilot designed to support Food and Drug Administration post-market safety surveillance activities. The pilots were successful, although improved cross-platform data normalization is needed. This initiative resulted in a standards-based methodology for population health queries, a reference implementation, and revision of the HQMF standard. It also informed future directions regarding interoperability and data access for ONC's Data Access Framework initiative. Query Health was a test of the learning health system that supplied a functional methodology and reference implementation for distributed population health queries that has been validated at three sites. Published by the BMJ Publishing Group Limited. 
76 FR 55373 - Combined Notice of Filings #1
Federal Register 2010, 2011, 2012, 2013, 2014
2011-09-07
....13(a)(2)(iii: PJM Queue Position W2-075--Original Service Agreement No. 3039 to be effective 7/28... the Commission's eLibrary system by clicking on the links or querying the docket number. Any person... 214 of the Commission's Regulations (18 CFR 385.211 and 385.214) on or before 5 p.m. Eastern time on...
A parallel data management system for large-scale NASA datasets
NASA Technical Reports Server (NTRS)
Srivastava, Jaideep
1993-01-01
The past decade has experienced a phenomenal growth in the amount of data and resultant information generated by NASA's operations and research projects. A key application is the reprocessing problem which has been identified to require data management capabilities beyond those available today (PRAT93). The Intelligent Information Fusion (IIF) system (ROEL91) is an ongoing NASA project which has similar requirements. Deriving our understanding of NASA's future data management needs based on the above, this paper describes an approach to using parallel computer systems (processor and I/O architectures) to develop an efficient parallel database management system to address the needs. Specifically, we propose to investigate issues in low-level record organizations and management, complex query processing, and query compilation and scheduling.
Huang, Jidong; Zheng, Rong; Emery, Sherry
2013-01-01
Despite the tremendous economic and health costs imposed on China by tobacco use, China lacks a proactive and systematic tobacco control surveillance and evaluation system, hampering research progress on tobacco-focused surveillance and evaluation studies. This paper uses online search query analyses to investigate changes in online search behavior among Chinese Internet users in response to the adoption of the national indoor public place smoking ban. Baidu Index and Google Trends were used to examine the volume of search queries containing three key search terms "Smoking Ban(s)," "Quit Smoking," and "Electronic Cigarette(s)," along with the news coverage on the smoking ban, for the period 2009-2011. Our results show that the announcement and adoption of the indoor public place smoking ban in China generated significant increases in news coverage on smoking bans. There was a strong positive correlation between the media coverage of smoking bans and the volume of "Smoking Ban(s)" and "Quit Smoking" related search queries. The volume of search queries related to "Electronic Cigarette(s)" was also correlated with the smoking ban news coverage. To the extent it altered smoking-related online searches, our analyses suggest that the smoking ban had a significant effect, at least in the short run, on Chinese Internet users' smoking-related behaviors. This research introduces a novel analytic tool, which could serve as an alternative tobacco control evaluation and behavior surveillance tool in the absence of timely or comprehensive population surveillance system. This research also highlights the importance of a comprehensive approach to tobacco control in China.
Gao, Xiang; Lin, Huaiying; Revanna, Kashi; Dong, Qunfeng
2017-05-10
Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to the limitations in the existing methods which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement. We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and further classification reliabilities are evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes. Reliable species-level classification for 16S rRNA or other phylogenetic marker genes is critical for microbiome research. Our software shows significantly higher classification accuracy than the existing tools and we provide probabilistic-based confidence scores to evaluate the reliability of our taxonomic classification assignments based on multiple database matches to query sequences. 
Despite its higher computational costs, our method is still suitable for analyzing large-scale microbiome datasets for practical purposes. Furthermore, our method can be applied for taxonomic classification of any phylogenetic marker gene sequences. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA .
Cluster-Based Query Expansion Using Language Modeling for Biomedical Literature Retrieval
ERIC Educational Resources Information Center
Xu, Xuheng
2011-01-01
The tremendously huge volume of biomedical literature, scientists' specific information needs, long terms of multiple words, and fundamental problems of synonym and polysemy have been challenging issues facing the biomedical information retrieval community researchers. Search engines have significantly improved the efficiency and effectiveness of…
Guiding Students to Answers: Query Recommendation
ERIC Educational Resources Information Center
Yilmazel, Ozgur
2011-01-01
This paper reports on a guided navigation system built on the textbook search engine developed at Anadolu University to support distance education students. The search engine uses Turkish Language specific language processing modules to enable searches over course material presented in Open Education Faculty textbooks. We implemented a guided…
CSRQ: Communication-Efficient Secure Range Queries in Two-Tiered Sensor Networks
Dai, Hua; Ye, Qingqun; Yang, Geng; Xu, Jia; He, Ruiliang
2016-01-01
In recent years, we have seen many applications of secure query in two-tiered wireless sensor networks. Storage nodes are responsible for storing data from nearby sensor nodes and answering queries from Sink. It is critical to protect data security from a compromised storage node. In this paper, the Communication-efficient Secure Range Query (CSRQ)—a privacy and integrity preserving range query protocol—is proposed to prevent attackers from gaining information of both data collected by sensor nodes and queries issued by Sink. To preserve privacy and integrity, in addition to employing the encoding mechanisms, a novel data structure called encrypted constraint chain is proposed, which embeds the information of integrity verification. Sink can use this encrypted constraint chain to verify the query result. The performance evaluation shows that CSRQ has lower communication cost than the current range query protocols. PMID:26907293
Triage by ranking to support the curation of protein interactions
Pasche, Emilie; Gobeill, Julien; Rech de Laval, Valentine; Gleizes, Anne; Michel, Pierre-André; Bairoch, Amos
2017-01-01
Abstract Today, molecular biology databases are the cornerstone of knowledge sharing for life and health sciences. The curation and maintenance of these resources are labour intensive. Although text mining is gaining impetus among curators, its integration in curation workflow has not yet been widely adopted. The Swiss Institute of Bioinformatics Text Mining and CALIPHO groups joined forces to design a new curation support system named nextA5. In this report, we explore the integration of novel triage services to support the curation of two types of biological data: protein–protein interactions (PPIs) and post-translational modifications (PTMs). The recognition of PPIs and PTMs poses a special challenge, as it not only requires the identification of biological entities (proteins or residues), but also that of particular relationships (e.g. binding or position). These relationships cannot be described with onto-terminological descriptors such as the Gene Ontology for molecular functions, which makes the triage task more challenging. Prioritizing papers for these tasks thus requires the development of different approaches. In this report, we propose a new method to prioritize articles containing information specific to PPIs and PTMs. The new resources (RESTful APIs, semantically annotated MEDLINE library) enrich the neXtA5 platform. We tuned the article prioritization model on a set of 100 proteins previously annotated by the CALIPHO group. The effectiveness of the triage service was tested with a dataset of 200 annotated proteins. We defined two sets of descriptors to support automatic triage: the first set to enrich for papers with PPI data, and the second for PTMs. All occurrences of these descriptors were marked-up in MEDLINE and indexed, thus constituting a semantically annotated version of MEDLINE. These annotations were then used to estimate the relevance of a particular article with respect to the chosen annotation type. 
This relevance score was combined with a local vector-space search engine to generate a ranked list of PMIDs. We also evaluated a query refinement strategy, which adds specific keywords (such as ‘binds’ or ‘interacts’) to the original query. Compared to PubMed, the search effectiveness of the nextA5 triage service is improved by 190% for the prioritization of papers with PPI information and by 260% for papers with PTM information. Combining advanced retrieval and query refinement strategies with automatically enriched MEDLINE contents is effective for improving triage in complex curation tasks such as the curation of PPIs and PTMs. Database URL: http://candy.hesge.ch/nextA5 PMID:29220432
Improving accuracy for identifying related PubMed queries by an integrated approach.
Lu, Zhiyong; Wilbur, W John
2009-10-01
PubMed is the most widely used tool for searching biomedical literature online. As with many other online search tools, a user often types a series of multiple related queries before retrieving satisfactory results to fulfill a single information need. Meanwhile, it is also a common phenomenon to see a user type queries on unrelated topics in a single session. In order to study PubMed users' search strategies, it is necessary to be able to automatically separate unrelated queries and group together related queries. Here, we report a novel approach combining both lexical and contextual analyses for segmenting PubMed query sessions and identifying related queries and compare its performance with the previous approach based solely on concept mapping. We experimented with our integrated approach on sample data consisting of 1539 pairs of consecutive user queries in 351 user sessions. The prediction results of 1396 pairs agreed with the gold-standard annotations, achieving an overall accuracy of 90.7%. This demonstrates that our approach is significantly better than the previously published method. By applying this approach to a one day query log of PubMed, we found that a significant proportion of information needs involved more than one PubMed query, and that most of the consecutive queries for the same information need are lexically related. Finally, the proposed PubMed distance is shown to be an accurate and meaningful measure for determining the contextual similarity between biological terms. The integrated approach can play a critical role in handling real-world PubMed query log data as is demonstrated in our experiments.
Improving accuracy for identifying related PubMed queries by an integrated approach
Lu, Zhiyong; Wilbur, W. John
2009-01-01
PubMed is the most widely used tool for searching biomedical literature online. As with many other online search tools, a user often types a series of multiple related queries before retrieving satisfactory results to fulfill a single information need. Meanwhile, it is also a common phenomenon to see a user type queries on unrelated topics in a single session. In order to study PubMed users’ search strategies, it is necessary to be able to automatically separate unrelated queries and group together related queries. Here, we report a novel approach combining both lexical and contextual analyses for segmenting PubMed query sessions and identifying related queries and compare its performance with the previous approach based solely on concept mapping. We experimented with our integrated approach on sample data consisting of 1,539 pairs of consecutive user queries in 351 user sessions. The prediction results of 1,396 pairs agreed with the gold-standard annotations, achieving an overall accuracy of 90.7%. This demonstrates that our approach is significantly better than the previously published method. By applying this approach to a one day query log of PubMed, we found that a significant proportion of information needs involved more than one PubMed query, and that most of the consecutive queries for the same information need are lexically related. Finally, the proposed PubMed distance is shown to be an accurate and meaningful measure for determining the contextual similarity between biological terms. The integrated approach can play a critical role in handling real-world PubMed query log data as is demonstrated in our experiments. PMID:19162232
Multi-Bit Quantum Private Query
NASA Astrophysics Data System (ADS)
Shi, Wei-Xu; Liu, Xing-Tong; Wang, Jian; Tang, Chao-Jing
2015-09-01
Most of the existing Quantum Private Query (QPQ) protocols provide only a single-bit query service, and thus have to be repeated several times when more bits are retrieved. Wei et al.'s scheme for block queries requires a high-dimension quantum key distribution system, which is still restricted to the laboratory. Here, based on Markus Jakobi et al.'s single-bit QPQ protocol, we propose a multi-bit quantum private query protocol, in which the user can get access to several bits within one single query. We also extend the proposed protocol to block queries, using a binary matrix to guard database security. Analysis in this paper shows that our protocol has better communication complexity and implementability, and can achieve a considerable level of security.
Jung, HaRim; Song, MoonBae; Youn, Hee Yong; Kim, Ung Mo
2015-09-18
A content-matched (CM) range monitoring query over moving objects continually retrieves the moving objects (i) whose non-spatial attribute values are matched to given non-spatial query values; and (ii) that are currently located within a given spatial query range. In this paper, we propose a new query indexing structure, called the group-aware query region tree (GQR-tree), for efficient evaluation of CM range monitoring queries. The primary role of the GQR-tree is to help the server leverage the computational capabilities of moving objects in order to improve the system performance in terms of the wireless communication cost and server workload. Through a series of comprehensive simulations, we verify the superiority of the GQR-tree method over the existing methods.
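The two conditions that define a CM range monitoring query can be written down as a simple predicate. The sketch below re-evaluates that predicate over all objects each time it is called; it does not implement the GQR-tree or any client-side computation, and the class names, field names, and taxi data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MovingObject:
    object_id: str
    x: float
    y: float
    attributes: Dict[str, str]  # non-spatial attribute values


@dataclass
class CMRangeQuery:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    required: Dict[str, str]  # non-spatial query values that must match

    def matches(self, obj: MovingObject) -> bool:
        # Condition (i): every required non-spatial value matches.
        content_ok = all(
            obj.attributes.get(key) == value for key, value in self.required.items()
        )
        # Condition (ii): the object is currently inside the query rectangle.
        in_range = (
            self.x_min <= obj.x <= self.x_max and self.y_min <= obj.y <= self.y_max
        )
        return content_ok and in_range


def evaluate(query: CMRangeQuery, objects: List[MovingObject]) -> List[str]:
    """Return the sorted ids of the objects that currently satisfy the query."""
    return sorted(o.object_id for o in objects if query.matches(o))


# Hypothetical example: free taxis inside the square [0, 5] x [0, 5].
taxis = [
    MovingObject("t1", 2.0, 3.0, {"status": "free"}),
    MovingObject("t2", 2.0, 2.0, {"status": "busy"}),
    MovingObject("t3", 9.0, 9.0, {"status": "free"}),
]
query = CMRangeQuery(0.0, 0.0, 5.0, 5.0, {"status": "free"})

before = evaluate(query, taxis)
taxis[2].x, taxis[2].y = 4.0, 4.0  # t3 moves into the query range
after = evaluate(query, taxis)
print(before, after)
```

Re-running the full scan on every location update is the server workload that an index such as the GQR-tree aims to reduce.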
Estimating Missing Features to Improve Multimedia Information Retrieval
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bagherjeiran, A; Love, N S; Kamath, C
Retrieval in a multimedia database usually involves combining information from different modalities of data, such as text and images. However, all modalities of the data may not be available to form the query. The retrieval results from such a partial query are often less than satisfactory. In this paper, we present an approach to complete a partial query by estimating the missing features in the query. Our experiments with a database of images and their associated captions show that, with an initial text-only query, our completion method has similar performance to a full query with both image and text features. In addition, when we use relevance feedback, our approach outperforms the results obtained using a full query.
NASA Astrophysics Data System (ADS)
Liao, S.; Chen, L.; Li, J.; Xiong, W.; Wu, Q.
2015-07-01
Existing spatiotemporal databases support spatiotemporal aggregation queries over massive moving-object datasets. Because of the large amounts of data and the single-thread processing method, the query speed cannot meet application requirements. On the other hand, the query efficiency is more sensitive to spatial variation than to temporal variation. In this paper, we propose a spatiotemporal aggregation query method using a multi-thread parallel technique based on regional division and implement it on the server. Concretely, we divide the spatiotemporal domain into several spatiotemporal cubes, compute the spatiotemporal aggregation on all cubes using multi-thread parallel processing, and then integrate the query results. Tests and analysis on real datasets show that this method improves the query speed significantly.
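The divide, aggregate, and integrate steps described in this abstract can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the cube sizes, the `(x, y, t)` record layout, the COUNT aggregate, and all function names are hypothetical. It uses Python's `concurrent.futures.ThreadPoolExecutor`; in standard CPython, threads will not speed up CPU-bound work like this, so the sketch shows the structure of the method only.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Record = Tuple[float, float, float]  # (x, y, t) sample of a moving object
Window = Tuple[float, float, float, float, float, float]  # x0, y0, x1, y1, t0, t1
CubeKey = Tuple[int, int, int]


def partition(records: List[Record], cell: float, period: float) -> Dict[CubeKey, List[Record]]:
    """Divide the spatiotemporal domain into cubes of size cell x cell x period."""
    cubes: Dict[CubeKey, List[Record]] = defaultdict(list)
    for x, y, t in records:
        cubes[(int(x // cell), int(y // cell), int(t // period))].append((x, y, t))
    return cubes


def count_in_window(records: List[Record], window: Window) -> int:
    """Aggregate (COUNT) the records that fall inside the half-open query window."""
    x0, y0, x1, y1, t0, t1 = window
    return sum(1 for x, y, t in records if x0 <= x < x1 and y0 <= y < y1 and t0 <= t < t1)


def parallel_count(records: List[Record], window: Window,
                   cell: float = 10.0, period: float = 60.0, workers: int = 4) -> int:
    """Compute the aggregate per cube on a thread pool, then integrate the partial results."""
    cubes = partition(records, cell, period)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda recs: count_in_window(recs, window), cubes.values())
        return sum(partials)


# Hypothetical samples and a query window covering [0, 20) x [0, 20) x [0, 60).
samples = [(1, 1, 5), (12, 3, 30), (25, 25, 70), (8, 9, 59), (15, 15, 61)]
window = (0, 0, 20, 20, 0, 60)
print(parallel_count(samples, window))
```

Because every record belongs to exactly one cube, summing the per-cube counts gives the same answer as a single sequential scan, which is what makes the integration step a plain sum for COUNT.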
Virtual screening studies on HIV-1 reverse transcriptase inhibitors to design potent leads.
Vadivelan, S; Deeksha, T N; Arun, S; Machiraju, Pavan Kumar; Gundla, Rambabu; Sinha, Barij Nayan; Jagarlapudi, Sarma A R P
2011-03-01
The purpose of this study is to identify novel and potent inhibitors against HIV-1 reverse transcriptase (RT). The crystal structure of the most active ligand was converted into a feature-shaped query. This query was used to align molecules to generate statistically valid 3D-QSAR (r(2) = 0.873) and Pharmacophore models (HypoGen). The best HypoGen model consists of three Pharmacophore features (one hydrogen bond acceptor, one hydrophobic aliphatic and one ring aromatic) and further validated using known RT inhibitors. The designed novel inhibitors are further subjected to docking studies to reduce the number of false positives. We have identified and proposed some novel and potential lead molecules as reverse transcriptase inhibitors using analog and structure based studies. Copyright © 2011 Elsevier Masson SAS. All rights reserved.
Hauser, Susan E; Demner-Fushman, Dina; Jacobs, Joshua L; Humphrey, Susanne M; Ford, Glenn; Thoma, George R
2007-01-01
To evaluate: (1) the effectiveness of wireless handheld computers for online information retrieval in clinical settings; (2) the role of MEDLINE in answering clinical questions raised at the point of care. A prospective single-cohort study: accompanying medical teams on teaching rounds, five internal medicine residents used and evaluated MD on Tap, an application for handheld computers, to seek answers in real time to clinical questions arising at the point of care. All transactions were stored by an intermediate server. Evaluators recorded clinical scenarios and questions, identified MEDLINE citations that answered the questions, and submitted daily and summative reports of their experience. A senior medical librarian corroborated the relevance of the selected citation to each scenario and question. Evaluators answered 68% of 363 background and foreground clinical questions during rounding sessions using a variety of MD on Tap features in an average session length of less than four minutes. The evaluator, the number and quality of query terms, the total number of citations found for a query, and the use of auto-spellcheck significantly contributed to the probability of query success. Handheld computers with Internet access are useful tools for healthcare providers to access MEDLINE in real time. MEDLINE citations can answer specific clinical questions when several medical terms are used to form a query. The MD on Tap application is an effective interface to MEDLINE in clinical settings, allowing clinicians to quickly find relevant citations.
Hauser, Susan E.; Demner-Fushman, Dina; Jacobs, Joshua L.; Humphrey, Susanne M.; Ford, Glenn; Thoma, George R.
2007-01-01
Objective To evaluate: (1) the effectiveness of wireless handheld computers for online information retrieval in clinical settings; (2) the role of MEDLINE® in answering clinical questions raised at the point of care. Design A prospective single-cohort study: accompanying medical teams on teaching rounds, five internal medicine residents used and evaluated MD on Tap, an application for handheld computers, to seek answers in real time to clinical questions arising at the point of care. Measurements All transactions were stored by an intermediate server. Evaluators recorded clinical scenarios and questions, identified MEDLINE citations that answered the questions, and submitted daily and summative reports of their experience. A senior medical librarian corroborated the relevance of the selected citation to each scenario and question. Results Evaluators answered 68% of 363 background and foreground clinical questions during rounding sessions using a variety of MD on Tap features in an average session length of less than four minutes. The evaluator, the number and quality of query terms, the total number of citations found for a query, and the use of auto-spellcheck significantly contributed to the probability of query success. Conclusion Handheld computers with Internet access are useful tools for healthcare providers to access MEDLINE in real time. MEDLINE citations can answer specific clinical questions when several medical terms are used to form a query. The MD on Tap application is an effective interface to MEDLINE in clinical settings, allowing clinicians to quickly find relevant citations. PMID:17712085
Fu, Lawrence D; Aphinyanaphongs, Yindalon; Wang, Lily; Aliferis, Constantin F
2011-08-01
Evaluating the biomedical literature and health-related websites for quality are challenging information retrieval tasks. Current commonly used methods include impact factor for journals, PubMed's clinical query filters and machine learning-based filter models for articles, and PageRank for websites. Previous work has focused on the average performance of these methods without considering the topic, and it is unknown how performance varies for specific topics or focused searches. Clinicians, researchers, and users should be aware when expected performance is not achieved for specific topics. The present work analyzes the behavior of these methods for a variety of topics. Impact factor, clinical query filters, and PageRank vary widely across different topics while a topic-specific impact factor and machine learning-based filter models are more stable. The results demonstrate that a method may perform excellently on average but struggle when used on a number of narrower topics. Topic-adjusted metrics and other topic robust methods have an advantage in such situations. Users of traditional topic-sensitive metrics should be aware of their limitations. Copyright © 2011 Elsevier Inc. All rights reserved.
Ye, Jay J
2016-01-01
Different methods have been described for data extraction from pathology reports with varying degrees of success. Here a technique for directly extracting data from a relational database is described. Our department uses synoptic reports modified from College of American Pathologists (CAP) Cancer Protocol Templates to report most of our cancer diagnoses. Choosing the melanoma of skin synoptic report as an example, the R scripting language extended with the RODBC package was used to query the pathology information system database. Reports containing the melanoma of skin synoptic report in the past 4 and a half years were retrieved and individual data elements were extracted. Using the retrieved list of cases, the database was queried a second time to retrieve/extract the lymph node staging information in the subsequent reports from the same patients. A total of 426 synoptic reports corresponding to unique lesions of melanoma of skin were retrieved, and data elements of interest were extracted into an R data frame. The distribution of Breslow depth of melanomas grouped by year is used as an example of intra-report data extraction and analysis. When new pN staging information was present in the subsequent reports, 82% (77/94) was precisely retrieved (pN0, pN1, pN2 and pN3). An additional 15% (14/94) was retrieved with some ambiguity (positive or knowing there was an update). The specificity was 100% for both. The relationship between Breslow depth and lymph node status was graphed as an example of lesion-specific multi-report data extraction and analysis. R extended with the RODBC package is a simple and versatile approach well-suited for the above tasks. The success or failure of the retrieval and extraction depended largely on whether the reports were formatted and whether the contents of the elements were consistently phrased. This approach can be easily modified and adopted for other pathology information systems that use a relational database for data management.
Detecting Disease Outbreaks in Mass Gatherings Using Internet Data
Yom-Tov, Elad; Cox, Ingemar J; McKendry, Rachel A
2014-01-01
Background: Mass gatherings, such as music festivals and religious events, pose a health care challenge because of the risk of transmission of communicable diseases. This is exacerbated by the fact that participants disperse soon after the gathering, potentially spreading disease within their communities. The dispersion of participants also poses a challenge for traditional surveillance methods. The ubiquitous use of the Internet may enable the detection of disease outbreaks through analysis of data generated by users during events and shortly thereafter. Objective: The intent of the study was to develop algorithms that can alert to possible outbreaks of communicable diseases from Internet data, specifically Twitter and search engine queries. Methods: We extracted all Twitter postings and queries made to the Bing search engine by users who repeatedly mentioned one of nine major music festivals held in the United Kingdom and one religious event (the Hajj in Mecca) during 2012, for a period of 30 days and after each festival. We analyzed these data using three methods, two of which compared words associated with disease symptoms before and after the time of the festival, and one that compared the frequency of these words with those of other users in the United Kingdom in the days following the festivals. Results: The data comprised, on average, 7.5 million tweets made by 12,163 users, and 32,143 queries made by 1756 users from each festival. Our methods indicated the statistically significant appearance of a disease symptom in two of the nine festivals. For example, cough was detected at higher than expected levels following the Wakestock festival. Statistically significant agreement (chi-square test, P<.01) between methods and across data sources was found where a statistically significant symptom was detected. Anecdotal evidence suggests that symptoms detected are indeed indicative of a disease that some users attributed to being at the festival.
Conclusions: Our work shows the feasibility of creating a public health surveillance system for mass gatherings based on Internet data. The use of multiple data sources and analysis methods was found to be advantageous for rejecting false positives. Further studies are required in order to validate our findings with data from public health authorities. PMID:24943128
Detecting disease outbreaks in mass gatherings using Internet data.
Yom-Tov, Elad; Borsa, Diana; Cox, Ingemar J; McKendry, Rachel A
2014-06-18
Mass gatherings, such as music festivals and religious events, pose a health care challenge because of the risk of transmission of communicable diseases. This is exacerbated by the fact that participants disperse soon after the gathering, potentially spreading disease within their communities. The dispersion of participants also poses a challenge for traditional surveillance methods. The ubiquitous use of the Internet may enable the detection of disease outbreaks through analysis of data generated by users during events and shortly thereafter. The intent of the study was to develop algorithms that can alert to possible outbreaks of communicable diseases from Internet data, specifically Twitter and search engine queries. We extracted all Twitter postings and queries made to the Bing search engine by users who repeatedly mentioned one of nine major music festivals held in the United Kingdom and one religious event (the Hajj in Mecca) during 2012, for a period of 30 days and after each festival. We analyzed these data using three methods, two of which compared words associated with disease symptoms before and after the time of the festival, and one that compared the frequency of these words with those of other users in the United Kingdom in the days following the festivals. The data comprised, on average, 7.5 million tweets made by 12,163 users, and 32,143 queries made by 1756 users from each festival. Our methods indicated the statistically significant appearance of a disease symptom in two of the nine festivals. For example, cough was detected at higher than expected levels following the Wakestock festival. Statistically significant agreement (chi-square test, P<.01) between methods and across data sources was found where a statistically significant symptom was detected. Anecdotal evidence suggests that symptoms detected are indeed indicative of a disease that some users attributed to being at the festival. 
Our work shows the feasibility of creating a public health surveillance system for mass gatherings based on Internet data. The use of multiple data sources and analysis methods was found to be advantageous for rejecting false positives. Further studies are required in order to validate our findings with data from public health authorities.
Huang, Chung-Chi; Lu, Zhiyong
2016-01-01
Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems primarily rely on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological entities (e.g. chemicals, diseases and genes) with little effort on discovering semantic relations. In this work, we aim to discover biomedical semantic relations in PubMed queries in an automated and unsupervised fashion. Specifically, we focus on extracting and understanding the contextual information (or context patterns) that is used by PubMed users to represent semantic relations between entities, such as 'CHEMICAL-1 compared to CHEMICAL-2'. With the advances in automatic named entity recognition, we first tag entities in PubMed queries and then use tagged entities as knowledge to recognize pattern semantics. More specifically, we transform PubMed queries into context patterns involving participating entities, which are subsequently projected to latent topics via latent semantic analysis (LSA) to avoid the data sparseness and specificity issues. Finally, we mine semantically similar contextual patterns or semantic relations based on LSA topic distributions. Our two separate evaluation experiments of chemical-chemical (CC) and chemical-disease (CD) relations show that the proposed approach significantly outperforms a baseline method, which simply measures pattern semantics by similarity in participating entities. The highest performance achieved by our approach is nearly 0.9 and 0.85 respectively for the CC and CD task when compared against the ground truth in terms of normalized discounted cumulative gain (nDCG), a standard measure of ranking quality.
These results suggest that our approach can effectively identify and return related semantic patterns in a ranked order covering diverse bio-entity relations. To assess the potential utility of our automated top-ranked patterns of a given relation in semantic search, we performed a pilot study on frequently sought semantic relations in PubMed and observed improved literature retrieval effectiveness based on post-hoc human relevance evaluation. Further investigation in larger tests and in real-world scenarios is warranted. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
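Since the abstract above reports its results as nDCG, a minimal sketch of that measure may help. The abstract does not say which gain or discount variant was used, so the linear gain and log2 discount below are assumptions, and the function names `dcg` and `ndcg` are illustrative only.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance scores.

    Assumes linear gain and a log2(rank + 1) discount (one common variant).
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking already in ideal order scores 1.0; misordered rankings score lower.
perfect = ndcg([3, 2, 1])
swapped = ndcg([1, 3])
```

A value near 0.9, as reported for the CC task, would mean the returned pattern ranking is close to the ideal ordering under whichever variant the authors used.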
A Framework for WWW Query Processing
NASA Technical Reports Server (NTRS)
Wu, Binghui Helen; Wharton, Stephen (Technical Monitor)
2000-01-01
Query processing is the most common operation in a DBMS. Sophisticated query processing has mainly targeted single-enterprise environments that provide centralized control over data and metadata. Queries submitted by anonymous users on the web differ in that load balancing or DBMS access control becomes the key issue. This paper provides a solution by introducing a framework for WWW query processing. The success of this framework lies in its use of query optimization techniques and an ontological approach. This methodology has proved to be cost effective at the NASA Goddard Space Flight Center Distributed Active Archive Center (GDAAC).
QBIC project: querying images by content, using color, texture, and shape
NASA Astrophysics Data System (ADS)
Niblack, Carlton W.; Barber, Ron; Equitz, Will; Flickner, Myron D.; Glasman, Eduardo H.; Petkovic, Dragutin; Yanker, Peter; Faloutsos, Christos; Taubin, Gabriel
1993-04-01
In the query by image content (QBIC) project we are studying methods to query large on-line image databases using the images' content as the basis of the queries. Examples of the content we use include color, texture, and shape of image objects and regions. Potential applications include medical ('Give me other images that contain a tumor with a texture like this one'), photo-journalism ('Give me images that have blue at the top and red at the bottom'), and many others in art, fashion, cataloging, retailing, and industry. Key issues include derivation and computation of attributes of images and objects that provide useful query functionality, retrieval methods based on similarity as opposed to exact match, query by image example or user drawn image, the user interfaces, query refinement and navigation, high dimensional database indexing, and automatic and semi-automatic database population. We currently have a prototype system written in X/Motif and C running on an RS/6000 that allows a variety of queries, and a test database of over 1000 images and 1000 objects populated from commercially available photo clip art images. In this paper we present the main algorithms for color, texture, shape and sketch query that we use, show example query results, and discuss future directions.
A Novel Approach of Indexing and Retrieving Spatial Polygons for Efficient Spatial Region Queries
NASA Astrophysics Data System (ADS)
Zhao, J. H.; Wang, X. Z.; Wang, F. Y.; Shen, Z. H.; Zhou, Y. C.; Wang, Y. L.
2017-10-01
Spatial region queries are more and more widely used in web-based applications. Mechanisms to provide efficient query processing over geospatial data are essential. However, due to the massive geospatial data volume, heavy geometric computation, and high access concurrency, it is difficult to get a response in real time. Spatial indexes are usually used in this situation. In this paper, based on the k-d tree, we introduce a distributed KD-Tree (DKD-Tree) suitable for polygon data, and a two-step query algorithm. The spatial index construction is recursive and iterative, and the query is an in-memory process. Both the index and query methods can be processed in parallel, and are implemented based on HDFS, Spark and Redis. Experiments on a large volume of remote sensing image metadata have been carried out, and the advantages of our method are investigated by comparing with spatial region queries executed on PostgreSQL and PostGIS. Results show that our approach not only greatly improves the efficiency of spatial region queries, but also has good scalability. Moreover, the two-step spatial range query algorithm can also save cluster resources to support a large number of concurrent queries. Therefore, this method is very useful when building large geographic information systems.
Secure Skyline Queries on Cloud Platform
Liu, Jinfei; Yang, Juncheng; Xiong, Li; Pei, Jian
2017-01-01
Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically-secure encryption. As a key subroutine, we present a new secure dominance protocol, which can be also used as a building block for other queries. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions. PMID:28883710
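For readers unfamiliar with the skyline operator that the abstract above secures, here is a minimal plaintext sketch of dominance and a naive skyline computation. It is not the authors' secure protocol: no encryption is involved, and the "smaller is better in every dimension" convention and the names `dominates` and `skyline` are assumptions made for illustration.

```python
def dominates(p, q):
    """True if p dominates q: no worse in every dimension, strictly better in one.

    Assumes smaller values are preferred in all dimensions.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (price, distance) pairs for a multi-criteria choice.
hotels = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
best = skyline(hotels)  # (3, 3) and (4, 4) are dominated by (2, 2)
```

A secure variant would have to evaluate the comparisons inside `dominates` without revealing the operands or the outcome to the server, which is the role the abstract assigns to its secure dominance protocol.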
NASA Astrophysics Data System (ADS)
Indrayana, I. N. E.; P, N. M. Wirasyanti D.; Sudiartha, I. KG
2018-01-01
Mobile applications allow many users to access data without being limited by space and time. Over time the data population of such an application will increase. Data access time becomes a problem once the data has reached tens of thousands to millions of records. The objective of this research is to maintain the performance of data execution for large data records. One effort to maintain data access time performance is to apply a query optimization method. The optimization used in this research is the heuristic query optimization method. The built application is a mobile-based financial application using a MySQL database with stored procedures. This application is used by more than one business entity in one database, thus enabling rapid data growth. The stored procedure contains a query optimized using the heuristic method. Query optimization is performed on a “Select” query that involves more than one table with multiple clauses. Evaluation is done by calculating the average access time using optimized and unoptimized queries. Access time is also measured as the data population in the database increases. The evaluation results show that data execution time with heuristic query optimization is relatively faster than data execution time without query optimization.
Library Circulation Systems -- An Overview.
ERIC Educational Resources Information Center
Surace, Cecily J.
The model circulation system outlined is an on-line real time system in which the circulation file is created from the shelf list and the terminal inquiry system includes the capability to query and browse through the bibliographic system and the circulation subsystem together to determine the availability for circulation of specific documents, or…
Transcriptome analysis of Pseudomonas syringae identifies new genes, ncRNAs, and antisense activity
USDA-ARS's Scientific Manuscript database
To fully understand how bacteria respond to their environment, it is essential to assess genome-wide transcriptional activity. New high throughput sequencing technologies make it possible to query the transcriptome of an organism in an efficient unbiased manner. We applied a strand-specific method t...
FIAMODEL: Users Guide Version 3.0.
Scott A. Pugh; David D. Reed; Kurt S. Pregitzer; Patrick D. Miles
2002-01-01
FIAMODEL is a geographic information system (GIS) program used to summarize Forest Inventory and Analysis (FIA, USDA Forest Service) data such as volume. The model runs in ArcView and allows users to select FIA plots with heads-up digitizing, overlays of digital map layers, or queries based on specific plot attributes.
A high performance, ad-hoc, fuzzy query processing system for relational databases
NASA Technical Reports Server (NTRS)
Mansfield, William H., Jr.; Fleischman, Robert M.
1992-01-01
Database queries involving imprecise or fuzzy predicates are currently an evolving area of academic and industrial research. Such queries place severe stress on the indexing and I/O subsystems of conventional database environments since they involve the search of large numbers of records. The Datacycle architecture and research prototype is a database environment that uses filtering technology to perform an efficient, exhaustive search of an entire database. It has recently been modified to include fuzzy predicates in its query processing. The approach obviates the need for complex index structures, provides unlimited query throughput, permits the use of ad-hoc fuzzy membership functions, and provides a deterministic response time largely independent of query complexity and load. This paper describes the Datacycle prototype implementation of fuzzy queries and some recent performance results.
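The idea of evaluating an ad-hoc fuzzy predicate by exhaustively filtering every record, rather than through an index, can be sketched as follows. This is only an illustration under stated assumptions: the trapezoidal membership function, the threshold parameter, and the names `trapezoid` and `fuzzy_select` are hypothetical and are not taken from the Datacycle prototype.

```python
def trapezoid(a, b, c, d):
    """Return a trapezoidal membership function: 1 on [b, c], 0 outside (a, d)."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)   # rising edge between a and b
        return (d - x) / (d - c)       # falling edge between c and d
    return mu

def fuzzy_select(records, field, mu, threshold=0.0):
    """Scan every record (no index) and rank matches by membership degree."""
    matches = []
    for record in records:
        degree = mu(record[field])
        if degree > 0.0 and degree >= threshold:
            matches.append((degree, record))
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return matches

# Hypothetical data and an ad-hoc "young" predicate.
people = [{"name": "A", "age": 25}, {"name": "B", "age": 35}, {"name": "C", "age": 50}]
young = trapezoid(0, 0, 30, 40)
ranked = fuzzy_select(people, "age", young, threshold=0.5)
```

Because every record is visited once regardless of the predicate, the cost of such a scan depends on the table size rather than on the shape of the membership function, which is in the spirit of the response-time claim in the abstract.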
Jung, HaRim; Song, MoonBae; Youn, Hee Yong; Kim, Ung Mo
2015-01-01
A content-matched (CM) range monitoring query over moving objects continually retrieves the moving objects (i) whose non-spatial attribute values are matched to given non-spatial query values; and (ii) that are currently located within a given spatial query range. In this paper, we propose a new query indexing structure, called the group-aware query region tree (GQR-tree) for efficient evaluation of CM range monitoring queries. The primary role of the GQR-tree is to help the server leverage the computational capabilities of moving objects in order to improve the system performance in terms of the wireless communication cost and server workload. Through a series of comprehensive simulations, we verify the superiority of the GQR-tree method over the existing methods. PMID:26393613
Systems and methods for an extensible business application framework
NASA Technical Reports Server (NTRS)
Bell, David G. (Inventor); Crawford, Michael (Inventor)
2012-01-01
Method and systems for editing data from a query result include requesting a query result using a unique collection identifier for a collection of individual files and a unique identifier for a configuration file that specifies a data structure for the query result. A query result is generated that contains a plurality of fields as specified by the configuration file, by combining each of the individual files associated with a unique identifier for a collection of individual files. The query result data is displayed with a plurality of labels as specified in the configuration file. Edits can be performed by querying a collection of individual files using the configuration file, editing a portion of the query result, and transmitting only the edited information for storage back into a data repository.
Analysis of Information Needs of Users of MEDLINEplus, 2002 – 2003
Scott-Wright, Alicia; Crowell, Jon; Zeng, Qing; Bates, David W.; Greenes, Robert
2006-01-01
We analyzed query logs from use of MEDLINEplus to answer the questions: Are consumers’ health information needs stable over time? and To what extent do users’ queries change over time? To determine log stability, we assessed an Overlap Rate (OR) defined as the number of unique queries common to two adjacent months divided by the total number of unique queries in those months. All exactly matching queries were considered as one unique query. We measured ORs for the top 10 and 100 unique queries of a month and compared these to ORs for the following month. Over ten months, users submitted 12,234,737 queries; only 2,179,571 (17.8%) were unique and these had a mean word count of 2.73 (S.D., 0.24); 121 of 137 (88.3%) unique queries each comprised of exactly matching search term(s) used at least 5000 times were of only one word. We could predict with 95% confidence that the monthly OR for the top 100 unique queries would lie between 67% – 87% when compared with the top 100 from the previous month. The mean month-to-month OR for top 10 queries was 62% (S.D., 20%) indicating significant variability; the lowest OR of 33% between the top 10 in Mar. compared to Apr. was likely due to “new” interest in information about SARS pneumonia in Apr. 2003. Consumers’ health information needs are relatively stable and the 100 most common unique queries are about 77% the same from month to month. Website sponsors should provide a broad range of information about a relatively stable number of topics. Analyses of log similarity may identify media-induced, cyclical, or seasonal changes in areas of consumer interest. PMID:17238431
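The Overlap Rate defined in the abstract above can be sketched in a few lines. The sketch assumes that "the total number of unique queries in those months" means the union of the two months' unique queries; the abstract does not spell this out, and the function name `overlap_rate` is illustrative.

```python
def overlap_rate(month_a, month_b):
    """Unique queries common to both months divided by all unique queries in either.

    Exactly matching query strings are collapsed into one unique query;
    the union is used as the denominator (an assumption).
    """
    a, b = set(month_a), set(month_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical logs: 2 shared unique queries out of 4 unique queries overall.
march = ["asthma", "diabetes", "lupus", "asthma"]
april = ["asthma", "sars", "diabetes"]
rate = overlap_rate(march, april)
```

Under this reading, two top-10 lists that share 5 queries have an OR of 5/15, or about 33%.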
Big Data and Dysmenorrhea: What Questions Do Women and Men Ask About Menstrual Pain?
Chen, Chen X; Groves, Doyle; Miller, Wendy R; Carpenter, Janet S
2018-04-30
Menstrual pain is highly prevalent among women of reproductive age. As the general public increasingly obtains health information online, Big Data from online platforms provide novel sources to understand the public's perspectives and information needs about menstrual pain. The study's purpose was to describe salient queries about dysmenorrhea using Big Data from a question and answer platform. We performed text-mining of 1.9 billion queries from ChaCha, a United States-based question and answer platform. Dysmenorrhea-related queries were identified by using keyword searching. Each relevant query was split into token words (i.e., meaningful words or phrases) and stop words (i.e., not meaningful functional words). Word Adjacency Graph (WAG) modeling was used to detect clusters of queries and visualize the range of dysmenorrhea-related topics. We constructed two WAG models, one from queries by women of reproductive age and one from queries by men. Salient themes were identified through inspecting clusters of WAG models. We identified two subsets of queries: Subset 1 contained 507,327 queries from women aged 13-50 years. Subset 2 contained 113,888 queries from men aged 13 or above. WAG modeling revealed topic clusters for each subset. Between female and male subsets, topic clusters overlapped on dysmenorrhea symptoms and management. Among female queries, there were distinctive topics on approaching menstrual pain at school and menstrual pain-related conditions, while among male queries, there was a distinctive cluster of queries on menstrual pain from men's perspectives. Big Data mining of the ChaCha® question and answer service revealed a series of information needs among women and men on menstrual pain. Findings may be useful in structuring the content and informing the delivery platform for educational interventions.
Multiple Query Evaluation Based on an Enhanced Genetic Algorithm.
ERIC Educational Resources Information Center
Tamine, Lynda; Chrisment, Claude; Boughanem, Mohand
2003-01-01
Explains the use of genetic algorithms to combine results from multiple query evaluations to improve relevance in information retrieval. Discusses niching techniques, relevance feedback techniques, and evolution heuristics, and compares retrieval results obtained by both genetic multiple query evaluation and classical single query evaluation…
Relational Algebra and SQL: Better Together
ERIC Educational Resources Information Center
McMaster, Kirby; Sambasivam, Samuel; Hadfield, Steven; Wolthuis, Stuart
2013-01-01
In this paper, we describe how database instructors can teach Relational Algebra and Structured Query Language together through programming. Students write query programs consisting of sequences of Relational Algebra operations vs. Structured Query Language SELECT statements. The query programs can then be run interactively, allowing students to…
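To illustrate the contrast the abstract draws between a sequence of Relational Algebra operations and an SQL SELECT statement, here is a minimal sketch. The abstract does not name the programming environment the authors use, so the Python helpers `select` and `project`, the `student` table, and its data are all hypothetical.

```python
import sqlite3

students = [
    {"name": "Ann", "dept": "CS", "gpa": 3.9},
    {"name": "Bob", "dept": "CS", "gpa": 3.2},
    {"name": "Cy", "dept": "Math", "gpa": 3.7},
]

def select(relation, predicate):
    """Relational Algebra selection: keep tuples satisfying the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attrs):
    """Relational Algebra projection: keep the named attributes, drop duplicates."""
    seen, result = set(), []
    for row in relation:
        values = tuple(row[a] for a in attrs)
        if values not in seen:
            seen.add(values)
            result.append(values)
    return result

# Query program as a sequence of Relational Algebra operations.
ra_result = sorted(project(select(students, lambda r: r["gpa"] > 3.5), ["name"]))

# The same query as a single SQL SELECT statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, dept TEXT, gpa REAL)")
conn.executemany("INSERT INTO student VALUES (:name, :dept, :gpa)", students)
sql_result = conn.execute(
    "SELECT DISTINCT name FROM student WHERE gpa > 3.5 ORDER BY name"
).fetchall()
```

Both forms return the same rows here; the algebra version makes the order of operations explicit, while the SQL version leaves it to the query processor.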
A Firefly Algorithm-based Approach for Pseudo-Relevance Feedback: Application to Medical Database.
Khennak, Ilyes; Drias, Habiba
2016-11-01
The difficulty of disambiguating the sense of the incomplete and imprecise keywords that are extensively used in search queries has caused search systems to fail to retrieve the desired information. One of the most powerful and promising methods to overcome this shortcoming and improve the performance of search engines is Query Expansion, whereby the user's original query is augmented by new keywords that best characterize the user's information needs and produce a more useful query. In this paper, a new Firefly Algorithm-based approach is proposed to enhance the retrieval effectiveness of query expansion while maintaining low computational complexity. In contrast to the existing literature, the proposed approach uses a Firefly Algorithm to find the best expanded query among a set of expanded query candidates. Moreover, this new approach allows the determination of the length of the expanded query empirically. Experimental results on MEDLINE, the on-line medical information database, show that our proposed approach is more effective and efficient compared to the state-of-the-art.
RiPPAS: A Ring-Based Privacy-Preserving Aggregation Scheme in Wireless Sensor Networks
Zhang, Kejia; Han, Qilong; Cai, Zhipeng; Yin, Guisheng
2017-01-01
Recently, data privacy in wireless sensor networks (WSNs) has received increased attention. The characteristics of WSNs determine that users’ queries are mainly aggregation queries. In this paper, the problem of processing aggregation queries in WSNs with data privacy preservation is investigated. A Ring-based Privacy-Preserving Aggregation Scheme (RiPPAS) is proposed. RiPPAS adopts a ring structure to perform aggregation. It uses a pseudonym mechanism for anonymous communication and a homomorphic encryption technique to add noise to data that could easily be disclosed. RiPPAS can handle both sum() queries and min()/max() queries, while the existing privacy-preserving aggregation methods can only deal with sum() queries. For processing sum() queries, compared with the existing methods, RiPPAS has advantages in privacy preservation and communication efficiency, as shown by theoretical analysis and simulation results. For processing min()/max() queries, RiPPAS provides effective privacy preservation and has low communication overhead. PMID:28178197
Reusable Software Component Retrieval via Normalized Algebraic Specifications
1991-12-01
outputs. In fact, this method of query is simpler for matching since it relieves the system from the burden of generating a test set. Eichmann [Eich91...September 1991. [Eich91] Eichmann, David A., "Selecting Reusable Components Using Algebraic Specifications", Proceedings of the Second International...Technology Atlanta, Georgia 30332-0800 12. Dr. David Eichmann 1 Department of Statistics and Computer Science Knapp Hall West Virginia University Morgantown, West Virginia 26506 226
Mining the SDSS SkyServer SQL queries log
NASA Astrophysics Data System (ADS)
Hirota, Vitor M.; Santos, Rafael; Raddick, Jordan; Thakar, Ani
2016-05-01
SkyServer, the Internet portal for the Sloan Digital Sky Survey (SDSS) astronomic catalog, provides a set of tools that allows data access for astronomers and scientific education. One of SkyServer's data access interfaces allows users to enter ad-hoc SQL statements to query the catalog. SkyServer also presents some template queries that can be used as a basis for more complex queries. This interface has logged over 330 million queries submitted since 2001. It is expected that analysis of this data can be used to investigate usage patterns, identify potential new classes of queries, find similar queries, etc., and to shed some light on how users interact with the Sloan Digital Sky Survey data and how scientists have adopted the new paradigm of e-Science, which could in turn lead to enhancements of the user interfaces and experience in general. In this paper we review some approaches to SQL query mining, apply the traditional techniques used in the literature and present lessons learned, namely, that the general text mining approach for feature extraction and clustering does not seem to be adequate for this type of data, and, most importantly, we find that this type of analysis can result in very different queries being clustered together.
Shinde, Nagesh; Baad, Rajendra; Nagpal, Deepak Kumar J; Prabhu, Prashant R; Surekha, L Chavan; Karande, Prasad
2012-11-01
People with HIV/HBsAg in India frequently encounter discrimination while seeking and receiving health care services. The knowledge and attitudes of health care workers (HCWs) influence the willingness and ability of people with HIV/HBsAg to access care, and the quality of the care they receive. The objective of this study was to assess HIV/HBsAg-related knowledge, attitudes and risk perception among students and dental HCWs. A cross-sectional survey was conducted on 250 students and 120 dental HCWs in the form of an objective questionnaire. Information was gathered regarding demographic details (age, sex, duration of employment, job category); HIV/HBsAg-related knowledge and attitudes; risk perception; and previous experience caring for HIV-positive patients. The HCWs in this study generally had a positive attitude to caring for people with HIV/HBsAg. However, this was tempered by substantial concerns about providing care, and the fear of occupational infection with HIV/HBsAg. A continuing dental education program was conducted to resolve all the queries found to interfere with providing care to HIV/HBsAg patients. But even after the queries were resolved, the capability to provide care was not attained. These findings show that even with advanced knowledge and facilities, the attitudes of dental HCWs and students require more strategic training with regard to the ethics and moral stigma associated with the dreaded infectious diseases (HIV/HBsAg).
A Research on E-learning Resources Construction Based on Semantic Web
NASA Astrophysics Data System (ADS)
Rui, Liu; Maode, Deng
Traditional e-learning platforms have several flaws: resources are usually difficult to query or locate, and cross-platform sharing and interoperability are hard to realize. In this paper, the semantic web and metadata standards are discussed, and an e-learning system framework based on the semantic web is put forward to address the flaws of traditional e-learning platforms.
Applying Query Structuring in Cross-language Retrieval.
ERIC Educational Resources Information Center
Pirkola, Ari; Puolamaki, Deniz; Jarvelin, Kalervo
2003-01-01
Explores ways to apply query structuring in cross-language information retrieval. Tested were: English queries translated into Finnish using an electronic dictionary, and run in a Finnish newspaper database; effects of compound-based structuring using a proximity operator for translation equivalents of query language compound components; and a…
Querying and Ranking XML Documents.
ERIC Educational Resources Information Center
Schlieder, Torsten; Meuss, Holger
2002-01-01
Discussion of XML, information retrieval, precision, and recall focuses on a retrieval technique that adopts the similarity measure of the vector space model, incorporates the document structure, and supports structured queries. Topics include a query model based on tree matching; structured queries and term-based ranking; and term frequency and…
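The vector space similarity measure mentioned in the abstract above can be illustrated with a minimal sketch. This is plain term-frequency cosine similarity only; it does not model document structure or tree matching, and the function name `cosine_similarity` is an illustrative assumption rather than anything taken from the article.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two term-frequency vectors.

    Illustrative only: raw term counts, no structural weighting.
    """
    q, d = Counter(query_terms), Counter(doc_terms)
    # Dot product over the terms that occur in the query.
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm = norm_q * norm_d
    # An empty query or document has no direction; treat it as no match.
    return dot / norm if norm else 0.0
```

A document sharing more query terms scores closer to 1, while a document with no terms in common scores 0.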
Advanced Query Formulation in Deductive Databases.
ERIC Educational Resources Information Center
Niemi, Timo; Jarvelin, Kalervo
1992-01-01
Discusses deductive databases and database management systems (DBMS) and introduces a framework for advanced query formulation for end users. Recursive processing is described, a sample extensional database is presented, query types are explained, and criteria for advanced query formulation from the end user's viewpoint are examined. (31…
A Semantic Graph Query Language
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kaplan, I L
2006-10-16
Semantic graphs can be used to organize large amounts of information from a number of sources into one unified structure. A semantic query language provides a foundation for extracting information from the semantic graph. The graph query language described here provides a simple, powerful method for querying semantic graphs.
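As a rough illustration of what querying a semantic graph involves, the sketch below matches a single subject-predicate-object pattern against a list of triples. It is not the query language described in the report; the `match` function and the `?name` variable convention are assumptions made for this example.

```python
def match(triples, pattern):
    """Yield variable bindings for one (subject, predicate, object) pattern.

    Terms starting with '?' are variables (an assumed convention);
    all other terms must equal the triple's value exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding
```

For example, the pattern `("alice", "knows", "?who")` returns one binding per person that `alice` is linked to by a `knows` edge.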
Nussbaumer, Thomas; Kugler, Karl G; Schweiger, Wolfgang; Bader, Kai C; Gundlach, Heidrun; Spannagl, Manuel; Poursarebani, Naser; Pfeifer, Matthias; Mayer, Klaus F X
2014-12-10
Over the last years reference genome sequences of several economically and scientifically important cereals and model plants became available. Despite the agricultural significance of these crops only a small number of tools exist that allow users to inspect and visualize the genomic position of genes of interest in an interactive manner. We present chromoWIZ, a web tool that allows visualizing the genomic positions of relevant genes and comparing these data between different plant genomes. Genes can be queried using gene identifiers, functional annotations, or sequence homology in four grass species (Triticum aestivum, Hordeum vulgare, Brachypodium distachyon, Oryza sativa). The distribution of the anchored genes is visualized along the chromosomes by using heat maps. Custom gene expression measurements, differential expression information, and gene-to-group mappings can be uploaded and can be used for further filtering. This tool is mainly designed for breeders and plant researchers, who are interested in the location and the distribution of candidate genes as well as in the syntenic relationships between different grass species. chromoWIZ is freely available and online accessible at http://mips.helmholtz-muenchen.de/plant/chromoWIZ/index.jsp.
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation.
Bolleman, Jerven T; Mungall, Christopher J; Strozzi, Francesco; Baran, Joachim; Dumontier, Michel; Bonnal, Raoul J P; Buels, Robert; Hoehndorf, Robert; Fujisawa, Takatomo; Katayama, Toshiaki; Cock, Peter J A
2016-06-13
Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned "omics" areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe - and potentially merge - sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation
Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco; ...
2016-06-13
Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco
Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.
Harris, Daniel R.; Henderson, Darren W.; Kavuluru, Ramakanth; Stromberg, Arnold J.; Johnson, Todd R.
2015-01-01
We present a custom, Boolean query generator utilizing common-table expressions (CTEs) that is capable of scaling with big datasets. The generator maps user-defined Boolean queries, such as those interactively created in clinical-research and general-purpose healthcare tools, into SQL. We demonstrate the effectiveness of this generator by integrating our work into the Informatics for Integrating Biology and the Bedside (i2b2) query tool and show that it is capable of scaling. Our custom generator replaces and outperforms the default query generator found within the Clinical Research Chart (CRC) cell of i2b2. In our experiments, sixteen different types of i2b2 queries were identified by varying four constraints: date, frequency, exclusion criteria, and whether selected concepts occurred in the same encounter. We generated non-trivial, random Boolean queries based on these 16 types; the corresponding SQL queries produced by both generators were compared by execution times. The CTE-based solution significantly outperformed the default query generator and provided a much more consistent response time across all query types (M=2.03, SD=6.64 vs. M=75.82, SD=238.88 seconds). Without costly hardware upgrades, we provide a scalable solution based on CTEs with very promising empirical results centered on performance gains. The evaluation methodology used for this provides a means of profiling clinical data warehouse performance. PMID:25192572
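To make the idea of mapping a Boolean query onto common-table expressions concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the authors' generator and does not use the i2b2 schema: the `facts` table, its columns, and the function names are all assumptions made for illustration. Each Boolean operand becomes one CTE, and the final SELECT combines them (A AND B AND NOT X).

```python
import sqlite3


def patients_matching(conn, include_a, include_b, exclude):
    """Return patient ids that have concept A AND concept B AND NOT the excluded concept."""
    sql = """
    WITH has_a AS (SELECT DISTINCT patient_id FROM facts WHERE concept = ?),
         has_b AS (SELECT DISTINCT patient_id FROM facts WHERE concept = ?),
         has_x AS (SELECT DISTINCT patient_id FROM facts WHERE concept = ?)
    SELECT a.patient_id
    FROM has_a AS a
    JOIN has_b AS b ON b.patient_id = a.patient_id            -- AND
    WHERE a.patient_id NOT IN (SELECT patient_id FROM has_x)  -- NOT
    ORDER BY a.patient_id
    """
    return [row[0] for row in conn.execute(sql, (include_a, include_b, exclude))]


def demo_connection():
    """Build a tiny in-memory fact table (hypothetical data) for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (patient_id INTEGER, concept TEXT)")
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?)",
        [
            (1, "diabetes"), (1, "hypertension"),
            (2, "diabetes"), (2, "hypertension"), (2, "smoker"),
            (3, "diabetes"),
        ],
    )
    return conn
```

Calling `patients_matching(demo_connection(), "diabetes", "hypertension", "smoker")` selects only the patient who has both included concepts and lacks the excluded one. Constraints such as date, frequency, or same-encounter occurrence, which the abstract mentions, are left out of this sketch.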
RCQ-GA: RDF Chain Query Optimization Using Genetic Algorithms
NASA Astrophysics Data System (ADS)
Hogenboom, Alexander; Milea, Viorel; Frasincar, Flavius; Kaymak, Uzay
The application of Semantic Web technologies in an Electronic Commerce environment implies a need for good support tools. Fast query engines are needed for efficient querying of large amounts of data, usually represented using RDF. We focus on optimizing a special class of SPARQL queries, the so-called RDF chain queries. For this purpose, we devise a genetic algorithm called RCQ-GA that determines the order in which joins need to be performed for an efficient evaluation of RDF chain queries. The approach is benchmarked against a two-phase optimization algorithm, previously proposed in literature. The more complex a query is, the more RCQ-GA outperforms the benchmark in solution quality, execution time needed, and consistency of solution quality. When the algorithms are constrained by a time limit, the overall performance of RCQ-GA compared to the benchmark further improves.
Query Language for Location-Based Services: A Model Checking Approach
NASA Astrophysics Data System (ADS)
Hoareau, Christian; Satoh, Ichiro
We present a model checking approach to the rationale, implementation, and applications of a query language for location-based services. Such query mechanisms are necessary so that users, objects, and/or services can effectively benefit from the location-awareness of their surrounding environment. The underlying data model is founded on a symbolic model of space organized in a tree structure. Once extended to a semantic model for modal logic, we regard location query processing as a model checking problem, and thus define location queries as hybrid logic-based formulas. Our approach differs from existing research because it explores the connection between location models and query processing in ubiquitous computing systems, relies on a sound theoretical basis, and provides modal logic-based query mechanisms for expressive searches over a decentralized data structure. A prototype implementation is also presented and discussed.
Towards Building a High Performance Spatial Query System for Large Scale Medical Imaging Data.
Aji, Ablimit; Wang, Fusheng; Saltz, Joel H
2012-11-06
Support of high performance queries on large volumes of scientific spatial data is becoming increasingly important in many applications. This growth is driven by not only geospatial problems in numerous fields, but also emerging scientific applications that are increasingly data- and compute-intensive. For example, digital pathology imaging has become an emerging field during the past decade, where examination of high resolution images of human tissue specimens enables more effective diagnosis, prediction and treatment of diseases. Systematic analysis of large-scale pathology images generates tremendous amounts of spatially derived quantifications of micro-anatomic objects, such as nuclei, blood vessels, and tissue regions. Analytical pathology imaging provides high potential to support image based computer aided diagnosis. One major requirement for this is effective querying of such enormous amount of data with fast response, which is faced with two major challenges: the "big data" challenge and the high computation complexity. In this paper, we present our work towards building a high performance spatial query system for querying massive spatial data on MapReduce. Our framework takes an on demand index building approach for processing spatial queries and a partition-merge approach for building parallel spatial query pipelines, which fits nicely with the computing model of MapReduce. We demonstrate our framework on supporting multi-way spatial joins for algorithm evaluation and nearest neighbor queries for microanatomic objects. To reduce query response time, we propose cost based query optimization to mitigate the effect of data skew. Our experiments show that the framework can efficiently support complex analytical spatial queries on MapReduce.
Towards Building a High Performance Spatial Query System for Large Scale Medical Imaging Data
Aji, Ablimit; Wang, Fusheng; Saltz, Joel H.
2013-01-01
Support of high performance queries on large volumes of scientific spatial data is becoming increasingly important in many applications. This growth is driven by not only geospatial problems in numerous fields, but also emerging scientific applications that are increasingly data- and compute-intensive. For example, digital pathology imaging has become an emerging field during the past decade, where examination of high resolution images of human tissue specimens enables more effective diagnosis, prediction and treatment of diseases. Systematic analysis of large-scale pathology images generates tremendous amounts of spatially derived quantifications of micro-anatomic objects, such as nuclei, blood vessels, and tissue regions. Analytical pathology imaging provides high potential to support image based computer aided diagnosis. One major requirement for this is effective querying of such enormous amount of data with fast response, which is faced with two major challenges: the “big data” challenge and the high computation complexity. In this paper, we present our work towards building a high performance spatial query system for querying massive spatial data on MapReduce. Our framework takes an on demand index building approach for processing spatial queries and a partition-merge approach for building parallel spatial query pipelines, which fits nicely with the computing model of MapReduce. We demonstrate our framework on supporting multi-way spatial joins for algorithm evaluation and nearest neighbor queries for microanatomic objects. To reduce query response time, we propose cost based query optimization to mitigate the effect of data skew. Our experiments show that the framework can efficiently support complex analytical spatial queries on MapReduce. PMID:24501719
Sujansky, Walter V; Faus, Sam A; Stone, Ethan; Brennan, Patricia Flatley
2010-10-01
Online personal health records (PHRs) enable patients to access, manage, and share certain of their own health information electronically. This capability creates the need for precise access-control mechanisms that restrict the sharing of data to that intended by the patient. The authors describe the design and implementation of an access-control mechanism for PHR repositories that is modeled on the eXtensible Access Control Markup Language (XACML) standard, but intended to reduce the cognitive and computational complexity of XACML. The authors implemented the mechanism entirely in a relational database system using ANSI-standard SQL statements. Based on a set of access-control rules encoded as relational table rows, the mechanism determines via a single SQL query whether a user who accesses patient data from a specific application is authorized to perform a requested operation on a specified data object. Testing of this query on a moderately large database has demonstrated execution times consistently below 100 ms. The authors include the details of the implementation, including algorithms, examples, and a test database as Supplementary materials. Copyright © 2010 Elsevier Inc. All rights reserved.
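The notion of deciding authorization with a single SQL query over rules stored as table rows can be sketched as follows, using Python's built-in `sqlite3` module. This is not the authors' schema or algorithm: the `rules` table, the `'*'` wildcard, and the deny-overrides-permit policy are all assumptions chosen to keep the example small.

```python
import sqlite3


def is_authorized(conn, user, app, operation, data_object):
    """Decide access with one query: at least one permit and no deny must match."""
    sql = """
    SELECT COALESCE(SUM(effect = 'permit'), 0) > 0
       AND COALESCE(SUM(effect = 'deny'), 0) = 0
    FROM rules
    WHERE user_id IN (?, '*')   -- '*' is an assumed wildcard
      AND app_id IN (?, '*')
      AND operation = ?
      AND data_object = ?
    """
    (allowed,) = conn.execute(sql, (user, app, operation, data_object)).fetchone()
    return bool(allowed)


def demo_rules():
    """Create a tiny in-memory rule table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rules (user_id TEXT, app_id TEXT, operation TEXT,"
        " data_object TEXT, effect TEXT)"
    )
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?, ?, ?, ?)",
        [
            ("alice", "portal", "read", "medications", "permit"),
            ("*", "*", "read", "lab_results", "permit"),
            ("bob", "*", "read", "lab_results", "deny"),
        ],
    )
    return conn
```

With these rules, any user may read `lab_results` except `bob`, whose explicit deny row overrides the wildcard permit; requests that match no rule are refused by default.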
Representing and querying now-relative relational medical data.
Anselma, Luca; Piovesan, Luca; Stantic, Bela; Terenziani, Paolo
2018-03-01
Temporal information plays a crucial role in medicine. Patients' clinical records are intrinsically temporal. Thus, in Medical Informatics there is an increasing need to store, support and query temporal data (particularly in relational databases), in order, for instance, to supplement decision-support systems. In this paper, we show that current approaches to relational data have remarkable limitations in the treatment of "now-relative" data (i.e., data holding true at the current time). This can severely compromise their applicability in general, and specifically in the medical context, where "now-relative" data are essential to assess the current status of the patients. We propose a theoretically grounded and application-independent relational approach to cope with now-relative data (which can be paired, e.g., with different decision support systems) overcoming such limitations. We propose a new temporal relational representation, which is the first relational model coping with the temporal indeterminacy intrinsic in now-relative data. We also propose new temporal algebraic operators to query them, supporting the distinction between possible and necessary time, and Allen's temporal relations between data. We exemplify the impact of our approach, and study the theoretical and computational properties of the new representation and algebra. Copyright © 2018 Elsevier B.V. All rights reserved.
Performance Modeling in CUDA Streams - A Means for High-Throughput Data Processing.
Li, Hao; Yu, Di; Kumar, Anand; Tu, Yi-Cheng
2014-10-01
A push-based database management system (DBMS) is a new type of data processing software that streams large volumes of data to concurrent query operators. The high data rate of such systems requires large computing power provided by the query engine. In our previous work, we built a push-based DBMS named G-SDMS to harness the unrivaled computational capabilities of modern GPUs. A major design goal of G-SDMS is to support concurrent processing of heterogeneous query processing operations and enable resource allocation among such operations. Understanding the performance of operations as a result of resource consumption is thus a premise in the design of G-SDMS. With NVIDIA's CUDA framework as the system implementation platform, we present our recent work on performance modeling of CUDA kernels running concurrently under a runtime mechanism named CUDA stream. Specifically, we explore the connection between performance and resource occupancy of compute-bound kernels and develop a model that can predict the performance of such kernels. Furthermore, we provide an in-depth anatomy of the CUDA stream mechanism and summarize the main kernel scheduling disciplines in it. Our models and derived scheduling disciplines are verified by extensive experiments using synthetic and real-world CUDA kernels.
Magnetic Fields for All: The GPIPS Community Web-Access Portal
NASA Astrophysics Data System (ADS)
Carveth, Carol; Clemens, D. P.; Pinnick, A.; Pavel, M.; Jameson, K.; Taylor, B.
2007-12-01
The new GPIPS website portal provides community users with an intuitive and powerful interface to query the data products of the Galactic Plane Infrared Polarization Survey. The website, which was built using PHP for the front end and MySQL for the database back end, allows users to issue queries based on galactic or equatorial coordinates, GPIPS-specific identifiers, polarization information, magnitude information, and several other attributes. The returns are presented in HTML tables, with the added option of either downloading or being emailed an ASCII file including the same or more information from the database. Other functionalities of the website include providing details of the status of the Survey (which fields have been observed or are planned to be observed), techniques involved in data collection and analysis, and descriptions of the database contents and names. For this initial launch of the website, users may access the GPIPS polarization point source catalog and the deep coadd photometric point source catalog. Future planned developments include a graphics-based method for querying the database, as well as tools to combine neighboring GPIPS images into larger image files for both polarimetry and photometry. This work is partially supported by NSF grant AST-0607500.
Element distinctness revisited
NASA Astrophysics Data System (ADS)
Portugal, Renato
2018-07-01
The element distinctness problem is the problem of determining whether the elements of a list are distinct, that is, if x=(x_1,\ldots,x_N) is a list with N elements, we ask whether the elements of x are distinct or not. The solution in a classical computer requires N queries because it uses sorting to check whether there are equal elements. In the quantum case, it is possible to solve the problem in O(N^{2/3}) queries. There is an extension which asks whether there are k colliding elements, known as element k-distinctness problem. This work obtains optimal values of two critical parameters of Ambainis' seminal quantum algorithm (SIAM J Comput 37(1):210-239, 2007). The first critical parameter is the number of repetitions of the algorithm's main block, which inverts the phase of the marked elements and calls a subroutine. The second parameter is the number of quantum walk steps interlaced by oracle queries. We show that, when the optimal values of the parameters are used, the algorithm's success probability is 1-O(N^{1/(k+1)}), quickly approaching 1. The specification of the exact running time and success probability is important in practical applications of this algorithm.
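The classical, sorting-based check described in the abstract above can be written down in a few lines. The sketch below is a classical illustration only and has nothing to do with the quantum algorithm; the function names are assumptions, and the k-collision variant uses counting rather than sorting for brevity.

```python
from collections import Counter


def all_distinct(xs):
    """Classical check: sort, then compare each element with its neighbour."""
    ordered = sorted(xs)
    return all(a != b for a, b in zip(ordered, ordered[1:]))


def has_k_collision(xs, k):
    """Element k-distinctness: is there a value occurring at least k times?"""
    return any(count >= k for count in Counter(xs).values())
```

Both functions read every one of the N elements, which matches the N classical queries mentioned in the abstract.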
Abeijon, Paula; Garcia-Mera, Xerardo; Caamano, Olga; Yanez, Matilde; Lopez-Castro, Edgar; Romero-Duran, Francisco J; Gonzalez-Diaz, Humberto
2017-01-01
Hansch's model is a classic approach to Quantitative Structure-Binding Relationships (QSBR) problems in Pharmacology and Medicinal Chemistry. Hansch QSAR equations are used as input parameters of electronic structure and lipophilicity. In this work, we perform a review on Hansch's analysis. We also developed a new type of PT-QSBR Hansch's model based on Perturbation Theory (PT) and QSBR approach for a large number of drugs reported in ChEMBL. The targets are proteins expressed by the Hippocampus region of the brain of Alzheimer Disease (AD) patients. The model predicted correctly 49312 out of 53783 negative perturbations (Specificity = 91.7%) and 16197 out of 21245 positive perturbations (Sensitivity = 76.2%) in training series. The model also predicted correctly 49312/53783 (91.7%) and 16197/21245 (76.2%) negative or positive perturbations in external validation series. We applied our model in theoretical-experimental studies of organic synthesis, pharmacological assay, and prediction of unmeasured results for a series of compounds similar to Rasagiline (compound of reference) with potential neuroprotection effect. Copyright © Bentham Science Publishers.
RNA Bricks—a database of RNA 3D motifs and their interactions
Chojnowski, Grzegorz; Waleń, Tomasz; Bujnicki, Janusz M.
2014-01-01
The RNA Bricks database (http://iimcb.genesilico.pl/rnabricks), stores information about recurrent RNA 3D motifs and their interactions, found in experimentally determined RNA structures and in RNA–protein complexes. In contrast to other similar tools (RNA 3D Motif Atlas, RNA Frabase, Rloom) RNA motifs, i.e. ‘RNA bricks’ are presented in the molecular environment, in which they were determined, including RNA, protein, metal ions, water molecules and ligands. All nucleotide residues in RNA bricks are annotated with structural quality scores that describe real-space correlation coefficients with the electron density data (if available), backbone geometry and possible steric conflicts, which can be used to identify poorly modeled residues. The database is also equipped with an algorithm for 3D motif search and comparison. The algorithm compares spatial positions of backbone atoms of the user-provided query structure and of stored RNA motifs, without relying on sequence or secondary structure information. This enables the identification of local structural similarities among evolutionarily related and unrelated RNA molecules. Besides, the search utility enables searching ‘RNA bricks’ according to sequence similarity, and makes it possible to identify motifs with modified ribonucleotide residues at specific positions. PMID:24220091
Application of MPEG-7 descriptors for content-based indexing of sports videos
NASA Astrophysics Data System (ADS)
Hoeynck, Michael; Auweiler, Thorsten; Ohm, Jens-Rainer
2003-06-01
The amount of multimedia data available worldwide is increasing every day. There is a vital need to annotate multimedia data in order to allow universal content access and to provide content-based search-and-retrieval functionalities. Since supervised video annotation can be time consuming, an automatic solution is desirable. We review recent approaches to content-based indexing and annotation of videos for different kinds of sports, and present our application for the automatic annotation of equestrian sports videos. In doing so, we concentrate especially on MPEG-7 based feature extraction and content description. We apply different visual descriptors for cut detection. Further, we extract the temporal positions of single obstacles on the course by analyzing MPEG-7 edge information and taking specific domain knowledge into account. Having determined single shot positions as well as the visual highlights, the information is jointly stored together with additional textual information in an MPEG-7 description scheme. Using this information, we generate content summaries which can be utilized in a user front-end in order to provide content-based access to the video stream, as well as further content-based queries and navigation on a video-on-demand streaming server.
Advancing the LSST Operations Simulator
NASA Astrophysics Data System (ADS)
Saha, Abhijit; Ridgway, S. T.; Cook, K. H.; Delgado, F.; Chandrasekharan, S.; Petry, C. E.; Operations Simulator Group
2013-01-01
The Operations Simulator for the Large Synoptic Survey Telescope (LSST; http://lsst.org) allows the planning of LSST observations that obey explicit science driven observing specifications, patterns, schema, and priorities, while optimizing against the constraints placed by design-specific opto-mechanical system performance of the telescope facility, site specific conditions (including weather and seeing), as well as additional scheduled and unscheduled downtime. A simulation run records the characteristics of all observations (e.g., epoch, sky position, seeing, sky brightness) in a MySQL database, which can be queried for any desired purpose. Derivative information digests of the observing history database are made with an analysis package called Simulation Survey Tools for Analysis and Reporting (SSTAR). Merit functions and metrics have been designed to examine how suitable a specific simulation run is for several different science applications. This poster reports recent work which has focussed on an architectural restructuring of the code that will allow us to a) use "look-ahead" strategies that avoid cadence sequences that cannot be completed due to observing constraints; and b) examine alternate optimization strategies, so that the most efficient scheduling algorithm(s) can be identified and used: even few-percent efficiency gains will create substantive scientific opportunity. The enhanced simulator will be used to assess the feasibility of desired observing cadences, study the impact of changing science program priorities, and assist with performance margin investigations of the LSST system.
Query Expansion and Query Translation as Logical Inference.
ERIC Educational Resources Information Center
Nie, Jian-Yun
2003-01-01
Examines query expansion during query translation in cross language information retrieval and develops a general framework for inferential information retrieval in two particular contexts: using fuzzy logic and probability theory. Obtains evaluation formulas that are shown to strongly correspond to those used in other information retrieval models.…
End-User Use of Data Base Query Language: Pros and Cons.
ERIC Educational Resources Information Center
Nicholes, Walter
1988-01-01
Man-machine interface, the concept of a computer "query," a review of database technology, and a description of the use of query languages at Brigham Young University are discussed. The pros and cons of end-user use of database query languages are explored. (Author/MLW)
Information Retrieval Using UMLS-based Structured Queries
Fagan, Lawrence M.; Berrios, Daniel C.; Chan, Albert; Cucina, Russell; Datta, Anupam; Shah, Maulik; Surendran, Sujith
2001-01-01
During the last three years, we have developed and described components of ELBook, a semantically based information-retrieval system [1-4]. Using these components, domain experts can specify a query model, indexers can use the query model to index documents, and end-users can search these documents for instances of indexed queries.
A Relational Algebra Query Language for Programming Relational Databases
ERIC Educational Resources Information Center
McMaster, Kirby; Sambasivam, Samuel; Anderson, Nicole
2011-01-01
In this paper, we describe a Relational Algebra Query Language (RAQL) and Relational Algebra Query (RAQ) software product we have developed that allows database instructors to teach relational algebra through programming. Instead of defining query operations using mathematical notation (the approach commonly taken in database textbooks), students…
Automated population of an i2b2 clinical data warehouse from an openEHR-based data repository.
Haarbrandt, Birger; Tute, Erik; Marschollek, Michael
2016-10-01
Detailed Clinical Model (DCM) approaches have recently seen wider adoption. More specifically, openEHR-based application systems are now used in production in several countries, serving diverse fields of application such as health information exchange, clinical registries and electronic medical record systems. However, approaches to efficiently provide openEHR data to researchers for secondary use have not yet been investigated or established. We developed an approach to automatically load openEHR data instances into the open source clinical data warehouse i2b2. We evaluated query capabilities and the performance of this approach in the context of the Hanover Medical School Translational Research Framework (HaMSTR), an openEHR-based data repository. Automated creation of i2b2 ontologies from archetypes and templates and the integration of openEHR data instances from 903 patients of a paediatric intensive care unit have been achieved. In total, it took an average of ∼2527 s to create 2,311,624 facts from 141,917 XML documents. Using the imported data, we conducted sample queries to compare the performance with two openEHR systems and to investigate whether this representation of data is feasible to support cohort identification and record level data extraction. We found the automated population of an i2b2 clinical data warehouse to be a feasible approach to make openEHR data instances available for secondary use. Such an approach can facilitate timely provision of clinical data to researchers. It complements analytics based on the Archetype Query Language by allowing querying on both legacy clinical data sources and openEHR data instances at the same time and by providing an easy-to-use query interface. However, due to different levels of expressiveness in the data models, not all semantics could be preserved during the ETL process. Copyright © 2016 Elsevier Inc. All rights reserved.
Archetype-based data warehouse environment to enable the reuse of electronic health record data.
Marco-Ruiz, Luis; Moner, David; Maldonado, José A; Kolstrup, Nils; Bellika, Johan G
2015-09-01
The reuse of data captured during health care delivery is essential to satisfy the demands of clinical research and clinical decision support systems. A main barrier to reuse is the existence of legacy data formats and the high granularity of the data stored in an electronic health record (EHR) system. Thus, we need mechanisms to standardize, aggregate, and query data concealed in the EHRs, to allow their reuse whenever they are needed. The objective was to create a data warehouse infrastructure using archetype-based technologies, standards and query languages to enable the interoperability needed for data reuse. The work presented makes use of best-of-breed archetype-based data transformation and storage technologies to create a workflow for the modeling, extraction, transformation and load of EHR proprietary data into standardized data repositories. We converted legacy data and performed patient-centered aggregations via archetype-based transformations. Later, specific-purpose aggregations were performed at the query level for particular use cases. Laboratory test results of a population of 230,000 patients belonging to Troms and Finnmark counties in Norway requested between January 2013 and November 2014 have been standardized. Test record normalization has been performed by defining transformation and aggregation functions between the laboratory records and an archetype. These mappings were used to automatically generate openEHR-compliant data. These data were loaded into an archetype-based data warehouse. Once loaded, we defined indicators linked to the data in the warehouse to monitor test activity of Salmonella and Pertussis using the archetype query language. Archetype-based standards and technologies can be used to create a data warehouse environment that enables data from EHR systems to be reused in clinical research and decision support systems.
With this approach, existing EHR data becomes available in a standardized and interoperable format, thus opening a world of possibilities toward semantic or concept-based reuse, query and communication of clinical data. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Database architectures for Space Telescope Science Institute
NASA Astrophysics Data System (ADS)
Lubow, Stephen
1993-08-01
At STScI nearly all large applications require database support. A general purpose architecture has been developed and is in use that relies upon an extended client-server paradigm. Processing is in general distributed across three processes, each of which generally resides on its own processor. Database queries are evaluated on one such process, called the DBMS server. The DBMS server software is provided by a database vendor. The application issues database queries and is called the application client. This client uses a set of generic DBMS application programming calls through our STDB/NET programming interface. Intermediate between the application client and the DBMS server is the STDB/NET server. This server accepts generic query requests from the application and converts them into the specific requirements of the DBMS server. In addition, it accepts query results from the DBMS server and passes them back to the application. Typically the STDB/NET server is local to the DBMS server, while the application client may be remote. The STDB/NET server provides additional capabilities such as database deadlock restart and performance monitoring. This architecture is currently in use for some major STScI applications, including the ground support system. We are currently investigating means of providing ad hoc query support to users through the above architecture. Such support is critical for providing flexible user interface capabilities. The Universal Relation advocated by Ullman, Kernighan, and others appears to be promising. In this approach, the user sees the entire database as a single table, thereby freeing the user from needing to understand the detailed schema. A software layer provides the translation between the user and detailed schema views of the database. However, many subtle issues arise in making this transformation. We are currently exploring this scheme for use in the Hubble Space Telescope user interface to the data archive system (DADS).
Spatiotemporal conceptual platform for querying archaeological information systems
NASA Astrophysics Data System (ADS)
Partsinevelos, Panagiotis; Sartzetaki, Mary; Sarris, Apostolos
2015-04-01
Spatial and temporal distribution of archaeological sites has been shown to associate with several attributes including marine, water, mineral and food resources, climate conditions, geomorphological features, etc. In this study, archeological settlement attributes are evaluated under various associations in order to provide a specialized query platform in a geographic information system (GIS). Towards this end, a spatial database is designed to include a series of archaeological findings for a secluded geographic area of Crete in Greece. The key categories of the geodatabase include the archaeological type (palace, burial site, village, etc.), temporal information of the habitation/usage period (pre Minoan, Minoan, Byzantine, etc.), and the extracted geographical attributes of the sites (distance to sea, altitude, resources, etc.). Most of the related spatial attributes are extracted with readily available GIS tools. Additionally, a series of conceptual data attributes are estimated, including: Temporal relation of an era to a future one in terms of alteration of the archaeological type, topologic relations of various types and attributes, spatial proximity relations between various types. These complex spatiotemporal relational measures reveal new attributes towards better understanding of site selection for prehistoric and/or historic cultures, yet their potential combinations can become numerous. Therefore, after the quantification of the above mentioned attributes, they are classified as of their importance for archaeological site location modeling. Under this new classification scheme, the user may select a geographic area of interest and extract only the important attributes for a specific archaeological type. These extracted attributes may then be queried against the entire spatial database and provide a location map of possible new archaeological sites. 
This novel type of querying is robust since the user does not have to type a standard SQL query but graphically select an area of interest. In addition, according to the application at hand, novel spatiotemporal attributes and relations can be supported, towards the understanding of historical settlement patterns.
Pathak, Jyotishman; Kiefer, Richard C.; Chute, Christopher G.
2012-01-01
The ability to conduct genome-wide association studies (GWAS) has enabled new exploration of how genetic variations contribute to health and disease etiology. One of the key requirements to perform GWAS is the identification of subject cohorts with accurate classification of disease phenotypes. In this work, we study how emerging Semantic Web technologies can be applied in conjunction with clinical data stored in electronic health records (EHRs) to accurately identify subjects with specific diseases for inclusion in cohort studies. In particular, we demonstrate the role of using Resource Description Framework (RDF) for representing EHR data and enabling federated querying and inferencing via standardized Web protocols for identifying subjects with Diabetes Mellitus. Our study highlights the potential of using Web-scale data federation approaches to execute complex queries. PMID:22779040
X Views and Counting: Interest in Rape-Oriented Pornography as Gendered Microaggression.
Makin, David A; Morczek, Amber L
2016-07-01
Academics and activists called to attention decades prior the importance of identifying, analyzing, and tracking the transmission of attitudes, behaviors, and norms correlated with violence against women. A specific call to attention reflected the media as a mode of transmission. This research builds on prior studies of media, with an emphasis on Internet search queries. Using Google search data, for the period 2004 to 2012, this research provides regional analysis of associated interest in rape-oriented pornography and pornographic hubs. Results indicate minor regional variations in interest, including the use of "BDSM" or "bondage/discipline, dominance/submission, and sadomasochism" as a foundational query for use in trend analysis. Interest in rape-oriented pornography by way of pornographic hubs is discussed in the context of microaggression. © The Author(s) 2015.
Big Data Analytics with Datalog Queries on Spark.
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2016-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.
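The recursion support mentioned in this abstract can be illustrated with a minimal sketch of semi-naive evaluation, a standard technique for evaluating recursive Datalog rules. This is a single-machine illustration in plain Python, not BigDatalog's implementation and not Spark code; the function name transitive_closure and the sample edges are assumptions made for the example.

```python
def transitive_closure(edges):
    """Evaluate the recursive Datalog program
         tc(X, Y) :- edge(X, Y).
         tc(X, Y) :- tc(X, Z), edge(Z, Y).
    using semi-naive evaluation (illustrative sketch only)."""
    # Index edges by source node so the join on Z is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    total = set(edges)  # every tc fact derived so far
    delta = set(edges)  # facts that were new in the previous round
    while delta:
        # Join only the newly derived facts with edge, not the whole relation.
        derived = {(x, y) for (x, z) in delta for y in successors.get(z, ())}
        delta = derived - total  # keep only facts not seen before
        total |= delta
    return total


# Hypothetical sample graph: a -> b -> c -> d
edges = {("a", "b"), ("b", "c"), ("c", "d")}
closure = transitive_closure(edges)
print(sorted(closure))
```

Each round joins only the facts discovered in the previous round, so the loop stops as soon as a round produces nothing new, which also guarantees termination on cyclic graphs.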
Big Data Analytics with Datalog Queries on Spark
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2017-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics. PMID:28626296
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rusthoven, Chad G., E-mail: chad.rusthoven@ucdenver.edu; Carlson, Julie A.; Waxweiler, Timothy V.
2014-04-01
Purpose: To evaluate the survival outcomes for patients with lymph node-positive, nonmetastatic prostate cancer undergoing definitive local therapy (radical prostatectomy [RP], external beam radiation therapy [EBRT], or both) versus no local therapy (NLT) in the US population in the modern prostate specific antigen (PSA) era. Methods and Materials: The Surveillance, Epidemiology, and End Results database was queried for patients with T1-4N1M0 prostate cancer diagnosed from 1995 through 2005. To allow comparisons of equivalent datasets, patients were analyzed in separate clinical (cN+) and pathologically confirmed (pN+) lymph node-positive cohorts. Kaplan-Meier overall survival (OS) and prostate cancer-specific survival (PCSS) estimates were generated, with accompanying univariate log-rank and multivariate Cox proportional hazards comparisons. Results: A total of 796 cN+ and 2991 pN+ patients were evaluable. Among cN+ patients, 43% underwent EBRT and 57% had NLT. Outcomes for cN+ patients favored EBRT, with 10-year OS rates of 45% versus 29% (P<.001) and PCSS rates of 67% versus 53% (P<.001). Among pN+ patients, 78% underwent local therapy (RP 57%, EBRT 10%, or both 11%) and 22% had NLT. Outcomes for pN+ also favored local therapy, with 10-year OS rates of 65% versus 42% (P<.001) and PCSS rates of 78% versus 56% (P<.001). On multivariate analysis, local therapy in both the cN+ and pN+ cohorts remained independently associated with improved OS and PCSS (all P<.001). Local therapy was associated with favorable hazard ratios across subgroups, including patients aged ≥70 years and those with multiple positive lymph nodes. Among pN+ patients, no significant differences in survival were observed between RP versus EBRT and RP with or without adjuvant EBRT. Conclusions: In this large, population-based cohort, definitive local therapy was associated with significantly improved survival in patients with lymph node-positive prostate cancer.
An Ensemble Approach for Expanding Queries
2012-11-01
0.39 pain^0.39 Hospital 15094 0.82 hospital^0.82 Miscarriage 45 3.35 miscarriage ^3.35 Radiotherapy 53 3.28 radiotherapy^3.28 Hypoaldosteronism 3...negated query is the expansion of the original query with negation terms preceding each word. For example, the negated version of “ miscarriage ^3.35...includes “no miscarriage ”^3.35 and “not miscarriage ”^3.35. If a document is the result of both original query and negated query, its score is
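The negation step described in this excerpt can be sketched as follows. This is a minimal illustration assuming the boosted-term notation shown in the excerpt (term^weight); the function names and the choice of "no" and "not" as the only negation words come from the single example in the text, and how the scores of the original and negated queries are combined is not shown here because the excerpt is truncated at that point.

```python
def negate_term(term, boost, negations=("no", "not")):
    """Build the negated phrases for one boosted query term.
    The negation words default to the two shown in the excerpt."""
    return [f'"{neg} {term}"^{boost}' for neg in negations]


def negated_query(weighted_terms):
    """Expand every (term, boost) pair of the original query with negations."""
    expanded = []
    for term, boost in weighted_terms:
        expanded.extend(negate_term(term, boost))
    return expanded


# Terms and weights taken from the excerpt above.
original = [("miscarriage", 3.35), ("hospital", 0.82)]
print(negated_query(original))
```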
Webb, Samuel J; Hanser, Thierry; Howlin, Brendan; Krause, Paul; Vessey, Jonathan D
2014-03-25
A new algorithm has been developed to enable the interpretation of black box models. The developed algorithm is agnostic to learning algorithm and open to all structural based descriptors such as fragments, keys and hashed fingerprints. The algorithm has provided meaningful interpretation of Ames mutagenicity predictions from both random forest and support vector machine models built on a variety of structural fingerprints. A fragmentation algorithm is utilised to investigate the model's behaviour on specific substructures present in the query. An output is formulated summarising causes of activation and deactivation. The algorithm is able to identify multiple causes of activation or deactivation in addition to identifying localised deactivations where the prediction for the query is active overall. No loss in performance is seen as there is no change in the prediction; the interpretation is produced directly on the model's behaviour for the specific query. Models have been built using multiple learning algorithms including support vector machine and random forest. The models were built on public Ames mutagenicity data and a variety of fingerprint descriptors were used. These models produced a good performance in both internal and external validation with accuracies around 82%. The models were used to evaluate the interpretation algorithm. The interpretations produced link closely with understood mechanisms for Ames mutagenicity. This methodology allows for a greater utilisation of the predictions made by black box models and can expedite further study based on the output for a (quantitative) structure activity model. Additionally the algorithm could be utilised for chemical dataset investigation and knowledge extraction/human SAR development.
Implementation of the common phrase index method on the phrase query for information retrieval
NASA Astrophysics Data System (ADS)
Fatmawati, Triyah; Zaman, Badrus; Werdiningsih, Indah
2017-08-01
With advances in technology, finding information in news text has become easier, because news is distributed not only in print media such as newspapers but also in electronic media that can be accessed through a search engine. When searching for relevant documents with a search engine, a phrase is often used as the query. The number of words that make up the phrase query and their positions affect the relevance of the documents returned and, as a result, the accuracy of the information obtained. The purpose of this research was therefore to analyze the implementation of the common phrase index method for information retrieval. The research was conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced. The system is built in stages: pre-processing, indexing, term weighting calculation, and cosine similarity calculation. The system then displays the document search results ranked by cosine similarity. System testing was conducted using 100 documents and 20 queries, and the results were used in the evaluation stage. First, the relevant documents were determined using the kappa statistic. Second, the system success rate was determined using precision, recall, and F-measure. The kappa statistic was 0.71, so the relevance judgments are eligible for the system evaluation. The evaluation yields a precision of 0.37, a recall of 0.50, and an F-measure of 0.43. From these results, it can be said that the success rate of the system in producing relevant documents is low.
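The reported F-measure can be checked against the reported precision and recall. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall, often written F1), which the abstract does not state explicitly; the function name f_measure is chosen for the example.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
score = f_measure(0.37, 0.50)
print(round(score, 2))  # 0.43, matching the reported F-measure
```

Under that assumption the three reported figures are mutually consistent: 2 x 0.37 x 0.50 / (0.37 + 0.50) is approximately 0.425.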
Huang, Jidong; Zheng, Rong; Emery, Sherry
2013-01-01
Background Despite the tremendous economic and health costs imposed on China by tobacco use, China lacks a proactive and systematic tobacco control surveillance and evaluation system, hampering research progress on tobacco-focused surveillance and evaluation studies. Methods This paper uses online search query analyses to investigate changes in online search behavior among Chinese Internet users in response to the adoption of the national indoor public place smoking ban. Baidu Index and Google Trends were used to examine the volume of search queries containing three key search terms “Smoking Ban(s),” “Quit Smoking,” and “Electronic Cigarette(s),” along with the news coverage on the smoking ban, for the period 2009–2011. Findings Our results show that the announcement and adoption of the indoor public place smoking ban in China generated significant increases in news coverage on smoking bans. There was a strong positive correlation between the media coverage of smoking bans and the volume of “Smoking Ban(s)” and “Quit Smoking” related search queries. The volume of search queries related to “Electronic Cigarette(s)” was also correlated with the smoking ban news coverage. Interpretation To the extent it altered smoking-related online searches, our analyses suggest that the smoking ban had a significant effect, at least in the short run, on Chinese Internet users’ smoking-related behaviors. This research introduces a novel analytic tool, which could serve as an alternative tobacco control evaluation and behavior surveillance tool in the absence of timely or comprehensive population surveillance system. This research also highlights the importance of a comprehensive approach to tobacco control in China. PMID:23776504
A novel adaptive Cuckoo search for optimal query plan generation.
Gomathi, Ramalingam; Sharmila, Dhandapani
2014-01-01
The rapid growth in the number of web pages has driven the development of semantic web technology. The Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) standard for storing semantic web data. To reduce the execution time of queries over large RDF graphs, evolving metaheuristic algorithms have become an alternative to traditional query optimization methods. This paper focuses on the problem of query optimization of semantic web data. An efficient algorithm called adaptive Cuckoo search (ACS) for querying and generating optimal query plans for large RDF graphs is designed in this research. Experiments were conducted on different datasets with varying numbers of predicates. The experimental results show that the proposed approach provides significant results in terms of query execution time. The extent to which the algorithm is efficient is tested and the results are documented.
Query-Based Outlier Detection in Heterogeneous Information Networks.
Kuck, Jonathan; Zhuang, Honglei; Yan, Xifeng; Cam, Hasan; Han, Jiawei
2015-03-01
Outlier or anomaly detection in large data sets is a fundamental task in data science, with broad applications. However, in real data sets with high-dimensional space, most outliers are hidden in certain dimensional combinations and are relative to a user's search space and interest. It is often more effective to give power to users and allow them to specify outlier queries flexibly, and the system will then process such mining queries efficiently. In this study, we introduce the concept of query-based outlier in heterogeneous information networks, design a query language to facilitate users to specify such queries flexibly, define a good outlier measure in heterogeneous networks, and study how to process outlier queries efficiently in large data sets. Our experiments on real data sets show that following such a methodology, interesting outliers can be defined and uncovered flexibly and effectively in large heterogeneous networks.
Query-Based Outlier Detection in Heterogeneous Information Networks
Kuck, Jonathan; Zhuang, Honglei; Yan, Xifeng; Cam, Hasan; Han, Jiawei
2015-01-01
Outlier or anomaly detection in large data sets is a fundamental task in data science, with broad applications. However, in real data sets with high-dimensional space, most outliers are hidden in certain dimensional combinations and are relative to a user’s search space and interest. It is often more effective to give power to users and allow them to specify outlier queries flexibly, and the system will then process such mining queries efficiently. In this study, we introduce the concept of query-based outlier in heterogeneous information networks, design a query language to facilitate users to specify such queries flexibly, define a good outlier measure in heterogeneous networks, and study how to process outlier queries efficiently in large data sets. Our experiments on real data sets show that following such a methodology, interesting outliers can be defined and uncovered flexibly and effectively in large heterogeneous networks. PMID:27064397
Querying and Extracting Timeline Information from Road Traffic Sensor Data
Imawan, Ardi; Indikawati, Fitri Indra; Kwon, Joonho; Rao, Praveen
2016-01-01
The escalation of traffic congestion in urban cities has urged many countries to use intelligent transportation system (ITS) centers to collect historical traffic sensor data from multiple heterogeneous sources. By analyzing historical traffic data, we can obtain valuable insights into traffic behavior. Many existing applications have been proposed with limited analysis results because of the inability to cope with several types of analytical queries. In this paper, we propose the QET (querying and extracting timeline information) system—a novel analytical query processing method based on a timeline model for road traffic sensor data. To address query performance, we build a TQ-index (timeline query-index) that exploits spatio-temporal features of timeline modeling. We also propose an intuitive timeline visualization method to display congestion events obtained from specified query parameters. In addition, we demonstrate the benefit of our system through a performance evaluation using a Busan ITS dataset and a Seattle freeway dataset. PMID:27563900
Infant temperament: stability by age, gender, birth order, term status, and socioeconomic status.
Bornstein, Marc H; Putnick, Diane L; Gartstein, Maria A; Hahn, Chun-Shin; Auestad, Nancy; O'Connor, Deborah L
2015-01-01
Two complementary studies focused on stability of infant temperament across the 1st year and considered infant age, gender, birth order, term status, and socioeconomic status (SES) as moderators. Study 1 consisted of 73 mothers of firstborn term girls and boys queried at 2, 5, and 13 months of age. Study 2 consisted of 335 mothers of infants of different gender, birth order, term status, and SES queried at 6 and 12 months. Consistent positive and negative affectivity factors emerged at all time points across both studies. Infant temperament proved stable and robust across gender, birth order, term status, and SES. Stability coefficients for temperament factors and scales were medium to large for shorter (< 9 months) interassessment intervals and small to medium for longer (> 10 months) intervals. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.
Otte, Willem M; van Diessen, Eric; Bell, Gail S; Sander, Josemir W
2013-12-01
In old and modern times and across cultures, recurrent seizures have been attributed to the lunar phase. It is unclear whether this relationship should be classified as a myth or whether a true connection exists between moon phases and seizures. We analyzed the worldwide aggregated search queries related to epilepsy health-seeking behavior between 2005 and 2012. Epilepsy-related Internet searches increased in periods with a high moon illumination. The overall association was weak (r=0.11, 95% confidence interval: 0.07 to 0.14) but seems to be higher than most control search queries not related to epilepsy. Increased sleep deprivation during periods of full moon might explain this positive association and warrants further study into epilepsy-related health-seeking behavior on the Internet, the lunar phase, and its contribution to nocturnal luminance. © 2013.
Policy Compliance of Queries for Private Information Retrieval
2010-11-01
SPARQL, unfortunately, is not in RDF and so we had to develop tools to translate SPARQL queries into RDF to be used by our policy compliance prototype...policy-assurance/sparql2n3.py) that accepts SPARQL queries and returns the translated query in our simplified ontology. An example of a translated
New concepts for building vocabulary for cell image ontologies.
Plant, Anne L; Elliott, John T; Bhat, Talapady N
2011-12-21
There are significant challenges associated with the building of ontologies for cell biology experiments including the large numbers of terms and their synonyms. These challenges make it difficult to simultaneously query data from multiple experiments or ontologies. If vocabulary terms were consistently used and reused across and within ontologies, queries would be possible through shared terms. One approach to achieving this is to strictly control the terms used in ontologies in the form of a pre-defined schema, but this approach limits the individual researcher's ability to create new terms when needed to describe new experiments. Here, we propose the use of a limited number of highly reusable common root terms, and rules for an experimentalist to locally expand terms by adding more specific terms under more general root terms to form specific new vocabulary hierarchies that can be used to build ontologies. We illustrate the application of the method to build vocabularies and a prototype database for cell images that uses a visual data-tree of terms to facilitate sophisticated queries based on experimental parameters. We demonstrate how the terminology might be extended by adding new vocabulary terms into the hierarchy of terms in an evolving process. In this approach, image data and metadata are handled separately, so we also describe a robust file-naming scheme to unambiguously identify image and other files associated with each metadata value. The prototype database http://sbd.nist.gov/ consists of more than 2000 images of cells and benchmark materials, and 163 metadata terms that describe experimental details, including many details about cell culture and handling. Image files of interest can be retrieved, and their data can be compared, by choosing one or more relevant metadata values as search terms. Metadata values for any dataset can be compared with corresponding values of another dataset through logical operations.
Organizing metadata for cell imaging experiments under a framework of rules that include highly reused root terms will facilitate the addition of new terms into a vocabulary hierarchy and encourage the reuse of terms. These vocabulary hierarchies can be converted into XML schema or RDF graphs for displaying and querying, but this is not necessary for using it to annotate cell images. Vocabulary data trees from multiple experiments or laboratories can be aligned at the root terms to facilitate query development. This approach of developing vocabularies is compatible with the major advances in database technology and could be used for building the Semantic Web.
New concepts for building vocabulary for cell image ontologies
2011-01-01
Background There are significant challenges associated with the building of ontologies for cell biology experiments including the large numbers of terms and their synonyms. These challenges make it difficult to simultaneously query data from multiple experiments or ontologies. If vocabulary terms were consistently used and reused across and within ontologies, queries would be possible through shared terms. One approach to achieving this is to strictly control the terms used in ontologies in the form of a pre-defined schema, but this approach limits the individual researcher's ability to create new terms when needed to describe new experiments. Results Here, we propose the use of a limited number of highly reusable common root terms, and rules for an experimentalist to locally expand terms by adding more specific terms under more general root terms to form specific new vocabulary hierarchies that can be used to build ontologies. We illustrate the application of the method to build vocabularies and a prototype database for cell images that uses a visual data-tree of terms to facilitate sophisticated queries based on experimental parameters. We demonstrate how the terminology might be extended by adding new vocabulary terms into the hierarchy of terms in an evolving process. In this approach, image data and metadata are handled separately, so we also describe a robust file-naming scheme to unambiguously identify image and other files associated with each metadata value. The prototype database http://sbd.nist.gov/ consists of more than 2000 images of cells and benchmark materials, and 163 metadata terms that describe experimental details, including many details about cell culture and handling. Image files of interest can be retrieved, and their data can be compared, by choosing one or more relevant metadata values as search terms. Metadata values for any dataset can be compared with corresponding values of another dataset through logical operations.
Conclusions Organizing metadata for cell imaging experiments under a framework of rules that include highly reused root terms will facilitate the addition of new terms into a vocabulary hierarchy and encourage the reuse of terms. These vocabulary hierarchies can be converted into XML schema or RDF graphs for displaying and querying, but this is not necessary for using it to annotate cell images. Vocabulary data trees from multiple experiments or laboratories can be aligned at the root terms to facilitate query development. This approach of developing vocabularies is compatible with the major advances in database technology and could be used for building the Semantic Web. PMID:22188658
The role of economics in the QUERI program: QUERI Series
Smith, Mark W; Barnett, Paul G
2008-01-01
Background The United States (U.S.) Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI) has implemented economic analyses in single-site and multi-site clinical trials. To date, no one has reviewed whether the QUERI Centers are taking an optimal approach to doing so. Consistent with the continuous learning culture of the QUERI Program, this paper provides such a reflection. Methods We present a case study of QUERI as an example of how economic considerations can and should be integrated into implementation research within both single and multi-site studies. We review theoretical and applied cost research in implementation studies outside and within VA. We also present a critique of the use of economic research within the QUERI program. Results Economic evaluation is a key element of implementation research. QUERI has contributed many developments in the field of implementation but has only recently begun multi-site implementation trials across multiple regions within the national VA healthcare system. These trials are unusual in their emphasis on developing detailed costs of implementation, as well as in the use of business case analyses (budget impact analyses). Conclusion Economics appears to play an important role in QUERI implementation studies, only after implementation has reached the stage of multi-site trials. Economic analysis could better inform the choice of which clinical best practices to implement and the choice of implementation interventions to employ. QUERI economics also would benefit from research on costing methods and development of widely accepted international standards for implementation economics. PMID:18430199
The role of economics in the QUERI program: QUERI Series.
Smith, Mark W; Barnett, Paul G
2008-04-22
The United States (U.S.) Department of Veterans Affairs (VA) Quality Enhancement Research Initiative (QUERI) has implemented economic analyses in single-site and multi-site clinical trials. To date, no one has reviewed whether the QUERI Centers are taking an optimal approach to doing so. Consistent with the continuous learning culture of the QUERI Program, this paper provides such a reflection. We present a case study of QUERI as an example of how economic considerations can and should be integrated into implementation research within both single and multi-site studies. We review theoretical and applied cost research in implementation studies outside and within VA. We also present a critique of the use of economic research within the QUERI program. Economic evaluation is a key element of implementation research. QUERI has contributed many developments in the field of implementation but has only recently begun multi-site implementation trials across multiple regions within the national VA healthcare system. These trials are unusual in their emphasis on developing detailed costs of implementation, as well as in the use of business case analyses (budget impact analyses). Economics appears to play an important role in QUERI implementation studies, only after implementation has reached the stage of multi-site trials. Economic analysis could better inform the choice of which clinical best practices to implement and the choice of implementation interventions to employ. QUERI economics also would benefit from research on costing methods and development of widely accepted international standards for implementation economics.
Processing SPARQL queries with regular expressions in RDF databases
2011-01-01
Background As the Resource Description Framework (RDF) data model is widely used for modeling and sharing many online bioinformatics resources such as Uniprot (dev.isb-sib.ch/projects/uniprot-rdf) or Bio2RDF (bio2rdf.org), SPARQL - a W3C-recommended query language for RDF databases - has become an important language for querying bioinformatics knowledge bases. Moreover, due to the diversity of users’ requests for extracting information from the RDF data as well as the lack of users’ knowledge about the exact value of each fact in the RDF databases, it is desirable to use SPARQL queries with regular expression patterns for querying the RDF data. To the best of our knowledge, there is currently no work that efficiently supports regular expression processing in SPARQL over RDF databases. Most of the existing techniques for processing regular expressions are designed for querying a text corpus, or only for supporting the matching over the paths in an RDF graph. Results In this paper, we propose a novel framework for supporting regular expression processing in SPARQL queries. Our contributions can be summarized as follows. 1) We propose an efficient framework for processing SPARQL queries with regular expression patterns in RDF databases. 2) We propose a cost model in order to adapt the proposed framework to existing query optimizers. 3) We build a prototype for the proposed framework in C++ and conduct extensive experiments demonstrating the efficiency and effectiveness of our technique. Conclusions Experiments with a full-blown RDF engine show that our framework outperforms the existing ones by up to two orders of magnitude in processing SPARQL queries with regular expression patterns. PMID:21489225
Processing SPARQL queries with regular expressions in RDF databases.
Lee, Jinsoo; Pham, Minh-Duc; Lee, Jihwan; Han, Wook-Shin; Cho, Hune; Yu, Hwanjo; Lee, Jeong-Hoon
2011-03-29
As the Resource Description Framework (RDF) data model is widely used for modeling and sharing many online bioinformatics resources such as Uniprot (dev.isb-sib.ch/projects/uniprot-rdf) or Bio2RDF (bio2rdf.org), SPARQL - a W3C-recommended query language for RDF databases - has become an important language for querying bioinformatics knowledge bases. Moreover, due to the diversity of users' requests for extracting information from the RDF data as well as the lack of users' knowledge about the exact value of each fact in the RDF databases, it is desirable to use SPARQL queries with regular expression patterns for querying the RDF data. To the best of our knowledge, there is currently no work that efficiently supports regular expression processing in SPARQL over RDF databases. Most of the existing techniques for processing regular expressions are designed for querying a text corpus, or only for supporting the matching over the paths in an RDF graph. In this paper, we propose a novel framework for supporting regular expression processing in SPARQL queries. Our contributions can be summarized as follows. 1) We propose an efficient framework for processing SPARQL queries with regular expression patterns in RDF databases. 2) We propose a cost model in order to adapt the proposed framework to existing query optimizers. 3) We build a prototype for the proposed framework in C++ and conduct extensive experiments demonstrating the efficiency and effectiveness of our technique. Experiments with a full-blown RDF engine show that our framework outperforms the existing ones by up to two orders of magnitude in processing SPARQL queries with regular expression patterns.
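To make the kind of query discussed above concrete, the following is a minimal Python sketch of what a SPARQL `FILTER regex(...)` clause asks for. It is not the framework proposed in the paper: it is a naive linear scan over a toy in-memory triple list, and the triples, prefixes, and the function name `regex_filter` are all hypothetical.

```python
import re

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:P1", "rdfs:label", "Hemoglobin subunit alpha"),
    ("ex:P2", "rdfs:label", "Hemoglobin subunit beta"),
    ("ex:P3", "rdfs:label", "Myoglobin"),
    ("ex:P1", "ex:organism", "Homo sapiens"),
]

# The SPARQL query this sketch imitates (shown for reference, not executed).
SPARQL_TEXT = """
SELECT ?s ?o WHERE {
  ?s rdfs:label ?o .
  FILTER regex(?o, "^hemoglobin", "i")
}
"""

def regex_filter(triples, predicate, pattern, ignore_case=False):
    """Return (subject, object) pairs whose object matches the pattern.

    Naive baseline: every triple with the given predicate is scanned.
    """
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [(s, o) for (s, p, o) in triples
            if p == predicate and compiled.search(o)]

matches = regex_filter(TRIPLES, "rdfs:label", "^hemoglobin", ignore_case=True)
for subject, label in matches:
    print(subject, label)
```

A full scan like this is the cost that an index-aware approach would try to avoid; the sketch only fixes the semantics of the filter.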
Chen, R S; Nadkarni, P; Marenco, L; Levin, F; Erdos, J; Miller, P L
2000-01-01
The entity-attribute-value representation with classes and relationships (EAV/CR) provides a flexible and simple database schema to store heterogeneous biomedical data. In certain circumstances, however, the EAV/CR model is known to retrieve data less efficiently than conventionally based database schemas. The objective was to perform a pilot study that systematically quantifies performance differences for database queries directed at real-world microbiology data modeled with EAV/CR and conventional representations, and to explore the relative merits of different EAV/CR query implementation strategies. Clinical microbiology data obtained over a ten-year period were stored using both database models. Query execution times were compared for four clinically oriented attribute-centered and entity-centered queries operating under varying conditions of database size and system memory. The performance characteristics of three different EAV/CR query strategies were also examined. Performance was similar for entity-centered queries in the two database models. Performance in the EAV/CR model was approximately three to five times less efficient than its conventional counterpart for attribute-centered queries. The differences in query efficiency became slightly greater as database size increased, although they were reduced with the addition of system memory. The authors found that EAV/CR queries formulated using multiple, simple SQL statements executed in batch were more efficient than single, large SQL statements. This paper describes a pilot project to explore issues in and compare query performance for EAV/CR and conventional database representations. Although attribute-centered queries were less efficient in the EAV/CR model, these inefficiencies may be addressable, at least in part, by the use of more powerful hardware or more memory, or both.
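The difference between the two representations can be illustrated with a small sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are hypothetical and far simpler than the EAV/CR schema studied; the point is only that an attribute-centered question needs one self-join per attribute in the EAV layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Conventional layout: one column per attribute (hypothetical schema).
cur.execute("CREATE TABLE culture (id INTEGER PRIMARY KEY, organism TEXT, specimen TEXT)")
# EAV layout: one row per (entity, attribute, value) fact.
cur.execute("CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT)")

rows = [(1, "E. coli", "urine"), (2, "S. aureus", "blood"), (3, "E. coli", "blood")]
cur.executemany("INSERT INTO culture VALUES (?, ?, ?)", rows)
for rid, organism, specimen in rows:
    cur.execute("INSERT INTO eav VALUES (?, 'organism', ?)", (rid, organism))
    cur.execute("INSERT INTO eav VALUES (?, 'specimen', ?)", (rid, specimen))

# Attribute-centered question: which cultures grew E. coli from blood?
conventional = cur.execute(
    "SELECT id FROM culture WHERE organism = ? AND specimen = ? ORDER BY id",
    ("E. coli", "blood"),
).fetchall()

# The same question in EAV form requires joining the fact table to itself.
eav_result = cur.execute(
    """
    SELECT a.entity_id
    FROM eav AS a
    JOIN eav AS b ON a.entity_id = b.entity_id
    WHERE a.attribute = 'organism' AND a.value = ?
      AND b.attribute = 'specimen' AND b.value = ?
    ORDER BY a.entity_id
    """,
    ("E. coli", "blood"),
).fetchall()

print(conventional, eav_result)
```

Both queries return the same entity; the extra join per attribute is one plausible source of the slowdown reported for attribute-centered queries.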
ScotlandsPlaces XML: Bespoke XML or XML Mapping?
ERIC Educational Resources Information Center
Beamer, Ashley; Gillick, Mark
2010-01-01
Purpose: The purpose of this paper is to investigate web services (in the form of parameterised URLs), specifically in the context of the ScotlandsPlaces project. This involves cross-domain querying, data retrieval and display via the development of a bespoke XML standard rather than existing XML formats and mapping between them.…
The Effect of User Characteristics on the Efficiency of Visual Querying
ERIC Educational Resources Information Center
Bak, Peter; Meyer, Joachim
2011-01-01
Information systems increasingly provide options for visually inspecting data during the process of information discovery and exploration. Little research has dealt so far with user interactions with these systems, and specifically with the effects of characteristics of the displayed data and the user on performance with such systems. The study…
ERIC Educational Resources Information Center
Barrera, Arnold; Braley, Richard T.; Slate, John R.
2010-01-01
Teacher mentors of first-year teachers provided insight into those practices they viewed as essential for their success in the mentoring role. Specifically, they were queried about teacher involvement/support, staff development, administrative support and resource materials. Almost all of the mentor teachers believed a teacher mentoring program…
Tracking Holland Interest Codes: The Case of South African Field Guides
ERIC Educational Resources Information Center
Watson, Mark B.; Foxcroft, Cheryl D.; Allen, Lynda J.
2007-01-01
Holland believes that specific personality types seek out matching occupational environments and his theory codes personality and environment according to a six letter interest typology. Since 1985 there have been numerous American studies that have queried the validity of Holland's coding system. Research in South Africa is scarcer, despite…
Mentors' Views of Factors Essential for the Success of Beginning Teachers
ERIC Educational Resources Information Center
Barrera, Arnold; Braley, Richard; Slate, John R.
2008-01-01
The views of 46 mentors of first-year teachers were obtained regarding practices that they viewed as essential for their success in mentoring teachers. Specifically, they were queried about teacher involvement/support, staff development, administrative support, and resource materials. Almost all of the mentor teachers believed a teacher-mentoring…
Seo, Dong-Woo; Sohn, Chang Hwan; Kim, Sung-Hoon; Ryoo, Seung Mok; Lee, Yoon-Seon; Lee, Jae Ho; Kim, Won Young; Lim, Kyoung Soo
2016-01-01
Background Digital surveillance using internet search queries can improve both the sensitivity and timeliness of the detection of a health event, such as an influenza outbreak. While it has recently been estimated that the mobile search volume surpasses the desktop search volume and mobile search patterns differ from desktop search patterns, the previous digital surveillance systems did not distinguish mobile and desktop search queries. The purpose of this study was to compare the performance of mobile and desktop search queries in terms of digital influenza surveillance. Methods and Results The study period was from September 6, 2010 through August 30, 2014, which consisted of four epidemiological years. Influenza-like illness (ILI) and virologic surveillance data from the Korea Centers for Disease Control and Prevention were used. A total of 210 combined queries from our previous survey work were used for this study. Mobile and desktop weekly search data were extracted from Naver, which is the largest search engine in Korea. Spearman’s correlation analysis was used to examine the correlation of the mobile and desktop data with ILI and virologic data in Korea. We also performed lag correlation analysis. We observed that the influenza surveillance performance of mobile search queries matched or exceeded that of desktop search queries over time. The mean correlation coefficients of mobile search queries and the number of queries with an r-value of ≥ 0.7 equaled or became greater than those of desktop searches over the four epidemiological years. A lag correlation analysis of up to two weeks showed similar trends. Conclusion Our study shows that the performance of mobile search queries for influenza surveillance has equaled or even exceeded that of desktop search queries over time. In the future development of influenza surveillance using search queries, recognizing the changing trend in mobile search data could be necessary. PMID:27391028
Shin, Soo-Yong; Kim, Taerim; Seo, Dong-Woo; Sohn, Chang Hwan; Kim, Sung-Hoon; Ryoo, Seung Mok; Lee, Yoon-Seon; Lee, Jae Ho; Kim, Won Young; Lim, Kyoung Soo
2016-01-01
Digital surveillance using internet search queries can improve both the sensitivity and timeliness of the detection of a health event, such as an influenza outbreak. While it has recently been estimated that the mobile search volume surpasses the desktop search volume and mobile search patterns differ from desktop search patterns, the previous digital surveillance systems did not distinguish mobile and desktop search queries. The purpose of this study was to compare the performance of mobile and desktop search queries in terms of digital influenza surveillance. The study period was from September 6, 2010 through August 30, 2014, which consisted of four epidemiological years. Influenza-like illness (ILI) and virologic surveillance data from the Korea Centers for Disease Control and Prevention were used. A total of 210 combined queries from our previous survey work were used for this study. Mobile and desktop weekly search data were extracted from Naver, which is the largest search engine in Korea. Spearman's correlation analysis was used to examine the correlation of the mobile and desktop data with ILI and virologic data in Korea. We also performed lag correlation analysis. We observed that the influenza surveillance performance of mobile search queries matched or exceeded that of desktop search queries over time. The mean correlation coefficients of mobile search queries and the number of queries with an r-value of ≥ 0.7 equaled or became greater than those of desktop searches over the four epidemiological years. A lag correlation analysis of up to two weeks showed similar trends. Our study shows that the performance of mobile search queries for influenza surveillance has equaled or even exceeded that of desktop search queries over time. In the future development of influenza surveillance using search queries, recognizing the changing trend in mobile search data could be necessary.
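The lag correlation analysis mentioned above can be sketched in plain Python as follows. This is an illustration of Spearman's rank correlation applied to a shifted series, not the study's code; the weekly values and the function names are made up for the example, and the lag convention (search volume in week t paired with ILI in week t + lag) is an assumption.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def lagged_spearman(search, ili, lag):
    """Correlate search volume in week t with ILI in week t + lag (lag >= 0)."""
    if lag == 0:
        return spearman(search, ili)
    return spearman(search[:-lag], ili[lag:])

# Made-up weekly series, for illustration only.
search_volume = [10, 20, 15, 40, 30, 50]
ili_rate = [5, 12, 22, 17, 45, 33]

rho_lag0 = lagged_spearman(search_volume, ili_rate, 0)
rho_lag1 = lagged_spearman(search_volume, ili_rate, 1)
print(round(rho_lag0, 3), round(rho_lag1, 3))
```

In this toy data the search series leads the ILI series by one week, so the correlation is higher at a lag of one than at a lag of zero.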
Machiela, Mitchell J; Chanock, Stephen J
2015-11-01
Assessing linkage disequilibrium (LD) across ancestral populations is a powerful approach for investigating population-specific genetic structure as well as functionally mapping regions of disease susceptibility. Here, we present LDlink, a web-based collection of bioinformatic modules that query single nucleotide polymorphisms (SNPs) in population groups of interest to generate haplotype tables and interactive plots. Modules are designed with an emphasis on ease of use, query flexibility, and interactive visualization of results. Phase 3 haplotype data from the 1000 Genomes Project are referenced for calculating pairwise metrics of LD, searching for proxies in high LD, and enumerating all observed haplotypes. LDlink is tailored for investigators interested in mapping common and uncommon disease susceptibility loci by focusing on output linking correlated alleles and highlighting putative functional variants. LDlink is a free and publicly available web tool which can be accessed at http://analysistools.nci.nih.gov/LDlink/. Contact: mitchell.machiela@nih.gov. Published by Oxford University Press 2015. This work is written by US Government employees and is in the public domain in the US.
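As a rough illustration of the pairwise LD metrics mentioned above, the sketch below computes D, D' and r² for two biallelic SNPs from a list of phased haplotypes using the standard textbook definitions. It is not LDlink's code; the haplotype counts are invented and the function name `ld_metrics` is hypothetical.

```python
def ld_metrics(haplotypes, allele1, allele2):
    """Return (D, D', r^2) for two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples, one per
    phased chromosome. allele1 / allele2 are the reference alleles whose
    frequencies are tracked at each SNP.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n
    p2 = sum(h[1] == allele2 for h in haplotypes) / n
    p12 = sum(h[0] == allele1 and h[1] == allele2 for h in haplotypes) / n

    d = p12 - p1 * p2  # deviation from independence
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    r2 = d * d / denom if denom > 0 else float("nan")

    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")
    return d, d_prime, r2

# Invented sample: 10 phased chromosomes.
sample = [("A", "G")] * 4 + [("A", "T")] + [("C", "G")] + [("C", "T")] * 4
d, d_prime, r2 = ld_metrics(sample, "A", "G")
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```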
Georgitsi, Marianthi; Viennas, Emmanouil; Gkantouna, Vassiliki; Christodoulopoulou, Elena; Zagoriti, Zoi; Tafrali, Christina; Ntellos, Fotios; Giannakopoulou, Olga; Boulakou, Athanassia; Vlahopoulou, Panagiota; Kyriacou, Eva; Tsaknakis, John; Tsakalidis, Athanassios; Poulas, Konstantinos; Tzimas, Giannis; Patrinos, George P
2011-01-01
Population and ethnic group-specific allele frequencies of pharmacogenomic markers are poorly documented and not systematically collected in structured data repositories. We developed the Frequency of Inherited Disorders Pharmacogenomics database (FINDbase-PGx), a separate module of the FINDbase, aiming to systematically document pharmacogenomic allele frequencies in various populations and ethnic groups worldwide. We critically collected and curated 214 scientific articles reporting allele frequencies of pharmacogenomic markers in various populations and ethnic groups worldwide. Subsequently, in order to host the curated data and to support data visualization and data mining, we developed a website application utilizing Microsoft™ PivotViewer software. Curated allelic frequency data pertaining to 144 pharmacogenomic markers across 14 genes, representing approximately 87,000 individuals from 150 populations worldwide, are currently included in FINDbase-PGx. A user-friendly query interface allows for easy data querying, based on numerous content criteria, such as population, ethnic group, geographical region, gene, drug and rare allele frequency. FINDbase-PGx is a comprehensive database, which, unlike other pharmacogenomic knowledgebases, fulfills the much-needed requirement to systematically document pharmacogenomic allelic frequencies in various populations and ethnic groups worldwide.
Mabotuwana, Thusitha; Warren, Jim
2010-02-01
Quality audit and feedback to general practice is an important aspect of successful chronic disease management. However, due to the complex temporal relationships associated with the nature of chronic illness, formulating clinically relevant queries within the context of a specific evaluation period is difficult. We abstracted requirements from a set of previously developed criteria to develop a generic criteria model that can be used to formulate queries related to chronic condition management. We implemented and verified the framework, ChronoMedIt, to execute clinical queries within the scope of the criteria model. Our criteria model consists of four broad classes of audit criteria - lapse in indicated therapy, no measurement recording, time to achieve target and measurement contraindicating therapy. Using these criteria classes as a guide, ChronoMedIt has been implemented as an extensible framework. ChronoMedIt can produce criteria reports and has an integrated prescription and measurement timeline visualisation tool. We illustrate the use of the framework by identifying patients on suboptimal therapy based on a range of pre-determined audit criteria using production electronic medical record data from two general medical practices for 607 and 679 patients with hypertension. As the most prominent result, we find that 59% (out of 607) and 34% (out of 679) of patients with hypertension had at least one episode of >30-day lapse in their antihypertensive therapy over a 12-month evaluation period. ChronoMedIt can reliably execute a wide range of clinically useful queries to identify patients whose chronic condition management can be improved.
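The "lapse in indicated therapy" criterion can be sketched as a gap search over prescription coverage intervals. The following Python is only an illustration of that idea, not ChronoMedIt's implementation: the record layout (start date plus days of supply), the rule that any uncovered time from the start of the evaluation period counts as a gap, and the function name `find_lapses` are all assumptions.

```python
from datetime import date, timedelta

def find_lapses(prescriptions, period_start, period_end, min_gap_days=30):
    """Return (gap_start, gap_end) pairs longer than min_gap_days.

    prescriptions: list of (start_date, days_supply) tuples (assumed layout).
    Coverage is assumed to run from start_date for days_supply days.
    """
    lapses = []
    covered_until = period_start  # first day not yet covered by any supply
    for start, days_supply in sorted(prescriptions):
        gap_end = min(start, period_end)
        if (gap_end - covered_until).days > min_gap_days:
            lapses.append((covered_until, gap_end))
        covered_until = max(covered_until, start + timedelta(days=days_supply))
    # Check the tail of the evaluation period after the last supply runs out.
    if (period_end - covered_until).days > min_gap_days:
        lapses.append((covered_until, period_end))
    return lapses

# Invented prescription history for one patient over a 12-month period.
history = [
    (date(2009, 1, 1), 90),
    (date(2009, 5, 15), 90),
    (date(2009, 8, 10), 90),
]
lapses = find_lapses(history, date(2009, 1, 1), date(2009, 12, 31))
for gap_start, gap_end in lapses:
    print(gap_start, gap_end, (gap_end - gap_start).days)
```

A patient with a non-empty result would count toward the "at least one episode of >30-day lapse" figure under these assumptions.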
Brown, Jeffrey S; Holmes, John H; Shah, Kiran; Hall, Ken; Lazarus, Ross; Platt, Richard
2010-06-01
Comparative effectiveness research, medical product safety evaluation, and quality measurement will require the ability to use electronic health data held by multiple organizations. There is no consensus about whether to create regional or national combined (eg, "all payer") databases for these purposes, or distributed data networks that leave most Protected Health Information and proprietary data in the possession of the original data holders. Our objective was to demonstrate functions of a distributed research network that supports research needs and also addresses data holders' concerns about participation. Key design functions included strong local control of data uses and a centralized web-based querying interface. We implemented a pilot distributed research network and evaluated the design considerations, utility for research, and the acceptability to data holders of methods for menu-driven querying. We developed and tested a central, web-based interface with supporting network software. Specific functions assessed include query formation and distribution, query execution and review, and aggregation of results. This pilot successfully evaluated temporal trends in medication use and diagnoses at 5 separate sites, demonstrating some of the possibilities of using a distributed research network. The pilot demonstrated the potential utility of the design, which addressed the major concerns of both users and data holders. No serious obstacles were identified that would prevent development of a fully functional, scalable network. Distributed networks are capable of addressing nearly all anticipated uses of routinely collected electronic healthcare data. Distributed networks would obviate the need for centralized databases, thus avoiding numerous obstacles.
Reyes-Aldasoro, Constantino Carlos
2017-01-01
In this work, the public database of biomedical literature PubMed was mined using queries with combinations of keywords and year restrictions. It was found that the proportion of Cancer-related entries per year in PubMed has risen from around 6% in 1950 to more than 16% in 2016. This increase is not shared by other conditions such as AIDS, Malaria, Tuberculosis, Diabetes, Cardiovascular, Stroke and Infection, some of which have, on the contrary, decreased as a proportion of the total entries per year. Organ-related queries were performed to analyse the variation of some specific cancers. A series of queries related to incidence, funding, and relationship with DNA, Computing and Mathematics, were performed to test correlation between the keywords, with the hope of elucidating the cause behind the rise of Cancer in PubMed. Interestingly, the proportion of Cancer-related entries that contain "DNA", "Computational" or "Mathematical" has increased, which suggests that the impact of these scientific advances on Cancer has been stronger than in other conditions. It is important to highlight that the results obtained with the data mining approach presented here are limited to the presence or absence of the keywords on a single, yet extensive, database. Therefore, results should be interpreted with caution. All the data used for this work is publicly available through PubMed and the UK's Office for National Statistics. All queries and figures were generated with the software platform Matlab and the files are available as supplementary material.
2017-01-01
In this work, the public database of biomedical literature PubMed was mined using queries with combinations of keywords and year restrictions. It was found that the proportion of Cancer-related entries per year in PubMed has risen from around 6% in 1950 to more than 16% in 2016. This increase is not shared by other conditions such as AIDS, Malaria, Tuberculosis, Diabetes, Cardiovascular, Stroke and Infection, some of which have, on the contrary, decreased as a proportion of the total entries per year. Organ-related queries were performed to analyse the variation of some specific cancers. A series of queries related to incidence, funding, and relationship with DNA, Computing and Mathematics, were performed to test correlation between the keywords, with the hope of elucidating the cause behind the rise of Cancer in PubMed. Interestingly, the proportion of Cancer-related entries that contain “DNA”, “Computational” or “Mathematical” has increased, which suggests that the impact of these scientific advances on Cancer has been stronger than in other conditions. It is important to highlight that the results obtained with the data mining approach presented here are limited to the presence or absence of the keywords on a single, yet extensive, database. Therefore, results should be interpreted with caution. All the data used for this work is publicly available through PubMed and the UK’s Office for National Statistics. All queries and figures were generated with the software platform Matlab and the files are available as supplementary material. PMID:28282418
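The keyword-plus-year querying described above can be sketched as follows. The original work used Matlab; this Python version is only an illustration. It builds PubMed search terms using the `[dp]` (date of publication) field tag and computes a yearly proportion from counts that the caller supplies. The counts in the example are placeholders chosen to echo the percentages in the abstract, not real PubMed results, and both function names are hypothetical.

```python
def yearly_term(keyword, year):
    """Build a PubMed search term restricted to one publication year.

    An empty keyword yields the term for all entries of that year.
    """
    if keyword:
        return f"({keyword}) AND {year}[dp]"
    return f"{year}[dp]"

def proportion_by_year(counts):
    """counts maps year -> (keyword_entries, total_entries)."""
    return {year: hits / total for year, (hits, total) in counts.items()}

# Terms that would be submitted for one keyword and the yearly baseline.
terms = [yearly_term("cancer", 2016), yearly_term("", 2016)]
print(terms)

# Placeholder counts (not real data), only to show the arithmetic.
placeholder_counts = {1950: (6, 100), 2016: (16, 100)}
proportions = proportion_by_year(placeholder_counts)
print(proportions)
```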
Identification of Conserved Water Sites in Protein Structures for Drug Design.
Jukič, Marko; Konc, Janez; Gobec, Stanislav; Janežič, Dušanka
2017-12-26
Identification of conserved waters in protein structures is a challenging task with applications in molecular docking and protein stability prediction. As an alternative to computationally demanding simulations of proteins in water, experimental cocrystallized waters in the Protein Data Bank (PDB) in combination with a local structure alignment algorithm can be used for reliable prediction of conserved water sites. We developed the ProBiS H2O approach based on the previously developed ProBiS algorithm, which enables identification of conserved water sites in proteins using experimental protein structures from the PDB or a set of custom protein structures available to the user. With a protein structure, a binding site, or an individual water molecule as a query, ProBiS H2O collects similar proteins from the PDB and performs local or binding site-specific superimpositions of the query structure with similar proteins using the ProBiS algorithm. It collects the experimental water molecules from the similar proteins and transposes them to the query protein. Transposed waters are clustered by their mutual proximity, which enables identification of discrete sites in the query protein with high water conservation. ProBiS H2O is a robust and fast new approach that uses existing experimental structural data to identify conserved water sites on the interfaces of protein complexes, for example protein-small molecule interfaces, and elsewhere on the protein structures. It has been successfully validated in several reported proteins in which conserved water molecules were found to play an important role in ligand binding with applications in drug design.
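The clustering step described above (grouping transposed waters by mutual proximity) can be sketched with a simple single-linkage grouping under a distance cutoff. This is an illustration only: the abstract does not state which clustering algorithm ProBiS H2O uses, and the 1.0 Å cutoff, the coordinates, and the function name `cluster_waters` are assumptions.

```python
import math

def cluster_waters(coords, cutoff=1.0):
    """Group points whose chained pairwise distances are within cutoff.

    coords: list of (x, y, z) water oxygen positions after superimposition.
    Returns lists of indices, largest cluster first.
    """
    parent = list(range(len(coords)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(coords)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)

# Invented coordinates (in angstroms) of waters transposed from similar proteins.
waters = [
    (0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.5, 0.2, 0.0),  # a well-populated site
    (5.0, 5.0, 5.0), (5.2, 5.0, 5.0),                    # a weaker site
    (10.0, 0.0, 0.0),                                    # an isolated water
]
clusters = cluster_waters(waters)
print([len(c) for c in clusters])
```

Under this sketch, the size of each cluster relative to the number of superimposed structures would serve as a crude measure of water conservation at that site.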
2015-01-01
Background In recent years, with advances in techniques for protein structure analysis, the knowledge about protein structure and function has been published in a vast number of articles. A method to search for specific publications from such a large pool of articles is needed. In this paper, we propose a method to search for related articles on protein structure analysis by using an article itself as a query. Results Each article is represented as a set of concepts in the proposed method. Then, by using similarities among concepts formulated from databases such as Gene Ontology, similarities between articles are evaluated. In this framework, the desired search results vary depending on the user's search intention because a variety of information is included in a single article. Therefore, the proposed method provides not only one input article (primary article) but also additional articles related to it as an input query to determine the search intention of the user, based on the relationship between two query articles. In other words, based on the concepts contained in the input article and additional articles, we actualize a relevant literature search that considers user intention by varying the degree of attention given to each concept and modifying the concept hierarchy graph. Conclusions We performed an experiment to retrieve relevant papers from articles on protein structure analysis registered in the Protein Data Bank by using three query datasets. The experimental results yielded search results with better accuracy than when user intention was not considered, confirming the effectiveness of the proposed method. PMID:25952498
Collaborative Supervised Learning for Sensor Networks
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri L.; Rebbapragada, Umaa; Lane, Terran
2011-01-01
Collaboration methods for distributed machine-learning algorithms involve the specification of communication protocols for the learners, which can query other learners and/or broadcast their findings preemptively. Each learner incorporates information from its neighbors into its own training set, and they are thereby able to bootstrap each other to higher performance. Each learner resides at a different node in the sensor network and makes observations (collects data) independently of the other learners. After being seeded with an initial labeled training set, each learner proceeds to learn in an iterative fashion. New data is collected and classified. The learner can then either broadcast its most confident classifications for use by other learners, or can query neighbors for their classifications of its least confident items. As such, collaborative learning combines elements of both passive (broadcast) and active (query) learning. It also uses ideas from ensemble learning to combine the multiple responses to a given query into a single useful label. This approach has been evaluated against current non-collaborative alternatives, including training a single classifier and deploying it at all nodes with no further learning possible, and permitting learners to learn from their own most confident judgments, absent interaction with their neighbors. On several data sets, it has been consistently found that active collaboration is the best strategy for a distributed learner network. The main advantages include the ability for learning to take place autonomously by collaboration rather than by requiring intervention from an oracle (usually human), and also the ability to learn in a distributed environment, permitting decisions to be made in situ and to yield faster response time.
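Two pieces of the scheme described above lend themselves to a short sketch: choosing which items a learner should ask its neighbors about (its least confident ones) and merging the neighbors' answers into one label. The Python below uses a confidence-weighted vote as the ensemble rule, which is an assumption; the abstract does not say how responses are combined, and the function names and values are hypothetical.

```python
from collections import Counter

def items_to_query(confidences, k):
    """Return indices of the k items the learner is least confident about."""
    return sorted(range(len(confidences)), key=lambda i: confidences[i])[:k]

def combine_labels(responses):
    """Merge neighbor responses into one label by confidence-weighted vote.

    responses: list of (label, confidence) pairs from neighboring learners.
    Ties are broken alphabetically so the result is deterministic.
    """
    totals = Counter()
    for label, confidence in responses:
        totals[label] += confidence
    return max(sorted(totals), key=lambda label: totals[label])

# A learner's confidence in its own classifications of four new items.
own_confidence = [0.95, 0.40, 0.80, 0.55]
to_ask = items_to_query(own_confidence, k=2)

# Hypothetical neighbor answers for one queried item.
neighbor_answers = [("rock", 0.9), ("sand", 0.6), ("sand", 0.5)]
merged = combine_labels(neighbor_answers)
print(to_ask, merged)
```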
DASMiner: discovering and integrating data from DAS sources
2009-01-01
Background DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources. Results We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed of a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources. The browser is used to formulate queries and navigate data contained in DAS sources. Users can execute queries against these sources in an intuitive fashion, without needing to know the specific DAS syntax for the particular source. Using the source's metadata provided by the DAS Registry, the browser's layout adapts to expose only the set of commands and coordinate systems supported by the specific source. For this reason, the browser can interrogate any DAS source, independently of the type of data being served. The API component of DASMiner may be used for programmatic access of DAS sources by programs in Matlab. Once the desired data is found during navigation, the query is exported in the format of an API call to be used within any Matlab application. We illustrate the use of DASMiner by creating integrative models of histone modification maps and protein-protein interaction networks. These enriched datasets were built by retrieving and integrating distributed genomic and proteomic DAS sources using the API.
Conclusion Support for the DAS protocol allows hundreds of molecular biology databases to be treated as a federated, online collection of resources. DASMiner enables full exploration of these resources, and can be used to deploy applications and create integrated views of biological systems using the information deposited in DAS repositories. PMID:19919683
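For readers unfamiliar with the "specific DAS syntax" that the browser hides, the sketch below assembles a DAS `features` request URL of the general form `<server>/das/<source>/features?segment=<reference>:<start>,<stop>`. It is written in Python rather than Matlab and is not part of the DASMiner API; the server address, source name, and function name are hypothetical, and a given source may support different commands or coordinate systems.

```python
from urllib.parse import urlencode

def das_features_url(server, source, reference, start, stop):
    """Build a DAS 'features' request for one segment of a reference sequence."""
    # Keep ':' and ',' unescaped so the segment reads reference:start,stop.
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"

# Hypothetical server and source names, for illustration only.
url = das_features_url("http://example.org", "hypothetical_source", "1", 1000, 2000)
print(url)
```

The response to such a request would be an XML document listing the features that overlap the segment, which a client then parses and integrates with data from other sources.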
A Fast Healthcare Interoperability Resources (FHIR) layer implemented over i2b2.
Boussadi, Abdelali; Zapletal, Eric
2017-08-14
Standards and technical specifications have been developed to define how the information contained in Electronic Health Records (EHRs) should be structured, semantically described, and communicated. Current trends rely on differentiating the representation of data instances from the definition of clinical information models. The dual model approach, which combines a reference model (RM) and a clinical information model (CIM), puts this software design pattern into practice. The most recent initiative, proposed by HL7, is called Fast Healthcare Interoperability Resources (FHIR). The aim of our study was to investigate the feasibility of applying the FHIR standard to modeling and exposing EHR data of the Georges Pompidou European Hospital (HEGP) Informatics for Integrating Biology and the Bedside (i2b2) clinical data warehouse (CDW). We implemented a FHIR server over i2b2 to expose EHR data in relation with five FHIR resources: DiagnosticReport, MedicationOrder, Patient, Encounter, and Medication. The architecture of the server combines a Data Access Object design pattern and FHIR resource providers, implemented using the Java HAPI FHIR API. Two types of queries were tested. Query type #1 requests the server to display DiagnosticReport resources for which the diagnosis code is equal to a given ICD-10 code. A total of 80 DiagnosticReport resources, corresponding to 36 patients, were displayed. Query type #2 requests the server to display MedicationOrder resources for which the FHIR Medication identification code is equal to a given code expressed in a French coding system. A total of 503 MedicationOrder resources, corresponding to 290 patients, were displayed. Results were validated by manually comparing the results of each request to the results displayed by an ad-hoc SQL query. We showed the feasibility of implementing a Java layer over the i2b2 database model to expose data of the CDW as a set of FHIR resources.
An important part of this work was the structural and semantic mapping between the i2b2 model and the FHIR RM. To accomplish this, developers must manually browse the specifications of the FHIR standard. Our source code is freely available and can be adapted for use in other i2b2 sites.
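The combination of a Data Access Object and a resource-level mapping described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' Java/HAPI FHIR implementation. The class and function names are hypothetical, the `ICD10:` prefix on `concept_cd` is an assumed local convention, and the output fields are a simplified reading of the DSTU2-era DiagnosticReport resource.

```python
from dataclasses import dataclass
from datetime import date

# Coding-system URI commonly used for ICD-10 codes in FHIR.
ICD10_SYSTEM = "http://hl7.org/fhir/sid/icd-10"


@dataclass(frozen=True)
class ObservationFact:
    """A reduced view of one row of the i2b2 observation_fact table."""
    patient_num: int
    encounter_num: int
    concept_cd: str      # assumed convention: "ICD10:<code>"
    start_date: date


class ObservationFactDao:
    """Hypothetical DAO; an in-memory list stands in for the SQL layer."""

    def __init__(self, rows):
        self._rows = list(rows)

    def find_by_concept(self, concept_cd):
        return [row for row in self._rows if row.concept_cd == concept_cd]


def to_diagnostic_report(fact):
    """Map one fact to a simplified DiagnosticReport-like dictionary."""
    code = fact.concept_cd.split(":", 1)[1]
    return {
        "resourceType": "DiagnosticReport",
        "subject": {"reference": f"Patient/{fact.patient_num}"},
        "encounter": {"reference": f"Encounter/{fact.encounter_num}"},
        "effectiveDateTime": fact.start_date.isoformat(),
        "codedDiagnosis": [
            {"coding": [{"system": ICD10_SYSTEM, "code": code}]}
        ],
    }


def search_diagnostic_reports(dao, icd10_code):
    """Analogue of query type #1: reports whose diagnosis equals a code."""
    facts = dao.find_by_concept(f"ICD10:{icd10_code}")
    return [to_diagnostic_report(fact) for fact in facts]


# Made-up rows, for illustration only.
dao = ObservationFactDao([
    ObservationFact(1, 10, "ICD10:I10", date(2016, 1, 5)),
    ObservationFact(2, 11, "ICD10:E11", date(2016, 2, 7)),
    ObservationFact(1, 12, "ICD10:I10", date(2016, 3, 9)),
])
reports = search_diagnostic_reports(dao, "I10")
patients = {report["subject"]["reference"] for report in reports}
```

Here two reports map to a single patient, which mirrors how the abstract counts resources and patients separately.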
NASA Astrophysics Data System (ADS)
Yang, Z.; Han, W.; Di, L.
2010-12-01
The National Agricultural Statistics Service (NASS) of the USDA produces the Cropland Data Layer (CDL) product, which is a raster-formatted, geo-referenced, U.S. crop specific land cover classification. These digital data layers are widely used for a variety of applications by universities, research institutions, government agencies, and private industry in climate change studies, environmental ecosystem studies, bioenergy production & transportation planning, environmental health research and agricultural production decision making. The CDL is also used internally by NASS for crop acreage and yield estimation. Like most geospatial data products, the CDL product is only available by CD/DVD delivery or online bulk file downloading via the Natural Resources Conservation Service (NRCS) Geospatial Data Gateway (external users) or in a printed paper map format. There is no online geospatial information access and dissemination, no crop visualization & browsing, no geospatial query capability, and no online analytics. To facilitate the application of this data layer and to help disseminate the data, a web-service based CDL interactive map visualization, dissemination, and querying system is proposed. It uses a Web service based service oriented architecture, adopts open standard geospatial information science technology and OGC specifications and standards, and re-uses functions/algorithms from GeoBrain Technology (developed at George Mason University). This system provides capabilities of on-line geospatial crop information access, query and on-line analytics via interactive maps. It disseminates all data to the decision makers and users via real time retrieval, processing and publishing over the web through standards-based geospatial web services. A CDL region of interest can also be exported directly to Google Earth for mashup or downloaded for use with other desktop applications.
This web service based system greatly improves equal-accessibility, interoperability, usability, and data visualization, facilitates crop geospatial information usage, and enables online exploration of US cropland without any client-side software installation. It also greatly reduces the need for paper map and analysis report printing and media usage, and thus enhances low-carbon Agro-geoinformation dissemination for decision support.
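The abstract does not say which OGC services the system exposes. As one illustration of standards-based map retrieval, the sketch below builds an OGC Web Map Service (WMS 1.1.1) GetMap request. The endpoint URL and the layer name are hypothetical placeholders, and only the query parameter names come from the WMS specification.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_getmap_url(endpoint, layer, bbox, width, height,
                     srs="EPSG:4326", image_format="image/png"):
    """Return a WMS 1.1.1 GetMap URL for one layer and one bounding box.

    bbox is (min_x, min_y, max_x, max_y) in the units of `srs`.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",            # empty value requests the default style
        "SRS": srs,
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and layer name, for illustration only.
url = build_getmap_url(
    "https://example.org/cdl/wms",
    layer="cdl_2009",
    bbox=(-96.0, 40.0, -95.0, 41.0),
    width=512,
    height=512,
)

# Parse the query string back to confirm what a server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
```

Because the request is a plain URL, a browser or any HTTP client can retrieve the rendered map, which is consistent with the abstract's point that no client-side software installation is needed.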
Searching for Images: The Analysis of Users' Queries for Image Retrieval in American History.
ERIC Educational Resources Information Center
Choi, Youngok; Rasmussen, Edie M.
2003-01-01
Studied users' queries for visual information in American history to identify the image attributes important for retrieval and the characteristics of users' queries for digital images, based on queries from 38 faculty and graduate students. Results of pre- and post-test questionnaires and interviews suggest principal categories of search terms.…
Searching and Filtering Tweets: CSIRO at the TREC 2012 Microblog Track
2012-11-01
stages. We first evaluate the effect of tweet corpus pre-processing in vanilla runs (no query expansion), and then assess the effect of query expansion...Effect of a vanilla run on D4 index (both realtime and non-real-time), and query expansion methods based on the submitted runs for two sets of queries
System, method and apparatus for conducting a keyterm search
NASA Technical Reports Server (NTRS)
McGreevy, Michael W. (Inventor)
2004-01-01
A keyterm search is a method of searching a database for subsets of the database that are relevant to an input query. First, a number of relational models of subsets of a database are provided. A query is then input. The query can include one or more keyterms. Next, a gleaning model of the query is created. The gleaning model of the query is then compared to each one of the relational models of subsets of the database. The identifiers of the relevant subsets are then output.
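The sequence of steps in this abstract (relational models of subsets, a gleaning model of the query, comparison, output of identifiers) can be sketched in Python under strong simplifying assumptions. Here a relational model is taken to be a weighted set of ordered term pairs that co-occur within a small window, and relevance is the overlap between the query's pairs and a subset's pairs. The abstract does not specify the weighting, the window, or the scoring, so all of them are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def relational_model(text, window=3):
    """Weight each ordered term pair by 1/distance within the window."""
    terms = text.lower().split()
    model = Counter()
    for i, left in enumerate(terms):
        for offset in range(1, window + 1):
            if i + offset < len(terms):
                model[(left, terms[i + offset])] += 1.0 / offset
    return model


def relevance(query, subset_model, window=3):
    """Score one subset model against the gleaning model of the query."""
    terms = query.lower().split()
    if len(terms) == 1:
        # A single keyterm has no pairs: sum the relations that contain it.
        return sum(w for pair, w in subset_model.items() if terms[0] in pair)
    gleaning = relational_model(query, window)
    return sum(min(w, subset_model[pair])
               for pair, w in gleaning.items() if pair in subset_model)


def keyterm_search(query, subset_models, window=3):
    """Return identifiers of relevant subsets, most relevant first."""
    scored = [(relevance(query, model, window), ident)
              for ident, model in subset_models.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [ident for score, ident in ranked if score > 0]


# Made-up subsets, modelled once up front as the abstract describes.
subsets = {
    "r1": "engine fire during takeoff",
    "r2": "fire drill in the cabin",
    "r3": "smooth landing",
}
models = {ident: relational_model(text) for ident, text in subsets.items()}

two_term_hits = keyterm_search("engine fire", models)
one_term_hits = keyterm_search("fire", models)
```

With these toy subsets, the two-keyterm query matches only the subset in which "engine" is followed closely by "fire", while the single-keyterm query matches both subsets that mention "fire".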