Sample records for large text databases

  1. Visualizing the semantic content of large text databases using text maps

    NASA Technical Reports Server (NTRS)

    Combs, Nathan

    1993-01-01

    A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. A set of experiments conducted against the TIPSTER corpus of Wall Street Journal articles is described. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content.

  2. Large-scale feature searches of collections of medical imagery

    NASA Astrophysics Data System (ADS)

    Hedgcock, Marcus W.; Karshat, Walter B.; Levitt, Tod S.; Vosky, D. N.

    1993-09-01

    Large scale feature searches of accumulated collections of medical imagery are required for multiple purposes, including clinical studies, administrative planning, epidemiology, teaching, quality improvement, and research. To perform a feature search of large collections of medical imagery, one can either search text descriptors of the imagery in the collection (usually the interpretation), or (if the imagery is in digital format) the imagery itself. At our institution, text interpretations of medical imagery are all available in our VA Hospital Information System. These are downloaded daily into an off-line computer. The text descriptors of most medical imagery are usually formatted as free text, and so require a user friendly database search tool to make searches quick and easy for any user to design and execute. We are tailoring such a database search tool (Liveview), developed by one of the authors (Karshat). To further facilitate search construction, we are constructing (from our accumulated interpretation data) a dictionary of medical and radiological terms and synonyms. If the imagery database is digital, the imagery which the search discovers is easily retrieved from the computer archive. We describe our database search user interface, with examples, and compare the efficacy of computer assisted imagery searches from a clinical text database with manual searches. Our initial work on direct feature searches of digital medical imagery is outlined.

  3. Task-Driven Dynamic Text Summarization

    ERIC Educational Resources Information Center

    Workman, Terri Elizabeth

    2011-01-01

    The objective of this work is to examine the efficacy of natural language processing (NLP) in summarizing bibliographic text for multiple purposes. Researchers have noted the accelerating growth of bibliographic databases. Information seekers using traditional information retrieval techniques when searching large bibliographic databases are often…

  4. DEXTER: Disease-Expression Relation Extraction from Text.

    PubMed

    Gupta, Samir; Dingerdissen, Hayley; Ross, Karen E; Hu, Yu; Wu, Cathy H; Mazumder, Raja; Vijay-Shanker, K

    2018-01-01

    Gene expression levels affect biological processes and play a key role in many diseases. Characterizing expression profiles is useful for clinical research and for the diagnosis and prognosis of diseases. There are currently several high-quality databases that capture gene expression information, obtained mostly from large-scale studies, such as microarray and next-generation sequencing technologies, in the context of disease. The scientific literature is another rich source of information on gene expression-disease relationships that not only have been captured from large-scale studies but have also been observed in thousands of small-scale studies. Expression information obtained from literature through manual curation can extend expression databases. While many of the existing databases include information from literature, they are limited by the time-consuming nature of manual curation and have difficulty keeping up with the explosion of publications in the biomedical field. In this work, we describe an automated text-mining tool, Disease-Expression Relation Extraction from Text (DEXTER), to extract information from literature on gene and microRNA expression in the context of disease. One of the motivations in developing DEXTER was to extend the BioXpress database, a cancer-focused gene expression database that includes data derived from large-scale experiments and manual curation of publications. The literature-based portion of BioXpress lags behind significantly compared to expression information obtained from large-scale studies and can benefit from our text-mined results. We have conducted two different evaluations to measure the accuracy of our text-mining tool and achieved average F-scores of 88.51% and 81.81% for the two evaluations, respectively. Also, to demonstrate the ability to extract rich expression information in different disease-related scenarios, we used DEXTER to extract differential expression information for 2024 genes in lung cancer, 115 glycosyltransferases in 62 cancers and 826 microRNAs in 171 cancers. All extractions using DEXTER are integrated in the literature-based portion of BioXpress. Database URL: http://biotm.cis.udel.edu/DEXTER.

  5. Teaching Advanced SQL Skills: Text Bulk Loading

    ERIC Educational Resources Information Center

    Olsen, David; Hauser, Karina

    2007-01-01

    Studies show that advanced database skills are important for students to be prepared for today's highly competitive job market. A common task for database administrators is to insert a large amount of data into a database. This paper illustrates how an up-to-date, advanced database topic, namely bulk insert, can be incorporated into a database…
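
    As a rough illustration of the bulk-loading idea this abstract refers to, the sketch below batches rows from a tab-delimited text file into a database using Python's standard sqlite3 module, rather than the SQL Server BULK INSERT statement a database course would more likely target; the file name, table layout and batch size are illustrative assumptions.

      # Bulk-load a large tab-delimited text file into a database in batches,
      # rather than issuing one INSERT per row.  Illustrative sketch only.
      import csv
      import sqlite3

      def bulk_load(path, db="articles.db", batch_size=10_000):
          conn = sqlite3.connect(db)
          conn.execute("CREATE TABLE IF NOT EXISTS docs (id TEXT, title TEXT, body TEXT)")
          with open(path, newline="", encoding="utf-8") as fh:
              reader = csv.reader(fh, delimiter="\t")
              batch = []
              for row in reader:
                  if len(row) < 3:          # skip malformed rows in this sketch
                      continue
                  batch.append(row[:3])
                  if len(batch) >= batch_size:
                      conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", batch)
                      batch.clear()
              if batch:
                  conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", batch)
          conn.commit()
          conn.close()

      # bulk_load("wsj_articles.tsv")   # hypothetical input file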

  6. Developing a Large Lexical Database for Information Retrieval, Parsing, and Text Generation Systems.

    ERIC Educational Resources Information Center

    Conlon, Sumali Pin-Ngern; And Others

    1993-01-01

    Important characteristics of lexical databases and their applications in information retrieval and natural language processing are explained. An ongoing project using various machine-readable sources to build a lexical database is described, and detailed designs of individual entries with examples are included. (Contains 66 references.) (EAM)

  7. Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization

    PubMed Central

    Wei, Chih-Hsuan; Hakala, Kai; Pyysalo, Sampo; Ananiadou, Sophia; Kao, Hung-Yu; Lu, Zhiyong; Salakoski, Tapio; Van de Peer, Yves; Ginter, Filip

    2013-01-01

    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-Share Alike (CC BY-SA) license. PMID:23613707

  8. Generalized entropies and the similarity of texts

    NASA Astrophysics Data System (ADS)

    Altmann, Eduardo G.; Dias, Laércio; Gerlach, Martin

    2017-01-01

    We show how generalized Gibbs-Shannon entropies can provide new insights on the statistical properties of texts. The universal distribution of word frequencies (Zipf’s law) implies that the generalized entropies, computed at the word level, are dominated by words in a specific range of frequencies. Here we show that this is the case not only for the generalized entropies but also for the generalized (Jensen-Shannon) divergences, used to compute the similarity between different texts. This finding allows us to identify the contribution of specific words (and word frequencies) for the different generalized entropies and also to estimate the size of the databases needed to obtain a reliable estimation of the divergences. We test our results in large databases of books (from the google n-gram database) and scientific papers (indexed by Web of Science).
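
    For concreteness, the sketch below computes a generalized (Jensen-Shannon) divergence between the word-frequency distributions of two texts, using the Tsallis/Havrda-Charvát entropy of order alpha as one common generalization; the exact definitions used in the paper may differ.

      # Word-frequency entropies and a generalized Jensen-Shannon divergence
      # between two texts.  Uses the Tsallis/Havrda-Charvat entropy of order
      # alpha as one common generalization (Shannon entropy at alpha = 1).
      from collections import Counter
      import numpy as np

      def word_probs(text):
          counts = Counter(text.lower().split())
          total = sum(counts.values())
          return {w: c / total for w, c in counts.items()}

      def tsallis_entropy(dist, alpha):
          p = np.asarray(list(dist.values()))
          if alpha == 1.0:                               # Shannon limit
              return float(-(p * np.log(p)).sum())
          return float((1.0 - (p ** alpha).sum()) / (alpha - 1.0))

      def generalized_jsd(p, q, alpha):
          # Entropy of the mixture minus the average entropy of the parts.
          vocab = set(p) | set(q)
          m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
          return tsallis_entropy(m, alpha) - 0.5 * (tsallis_entropy(p, alpha) + tsallis_entropy(q, alpha))

      p = word_probs("the cat sat on the mat")
      q = word_probs("the dog sat on the log")
      print(generalized_jsd(p, q, alpha=2.0))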

  9. Mining Quality Phrases from Massive Text Corpora

    PubMed Central

    Liu, Jialu; Shang, Jingbo; Wang, Chi; Ren, Xiang; Han, Jiawei

    2015-01-01

    Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method. PMID:26705375
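
    The sketch below is a drastically simplified stand-in for the phrase-mining step: it scores candidate bigrams by frequency and pointwise mutual information. The actual framework adds phrasal segmentation and a trained phrase-quality classifier, so this only hints at the candidate-scoring idea; names and thresholds are assumptions.

      # Score candidate bigrams by frequency and pointwise mutual information (PMI).
      # A toy proxy for quality-phrase mining; not the paper's framework.
      import math
      from collections import Counter

      def mine_bigrams(docs, min_count=3):
          unigrams, bigrams = Counter(), Counter()
          for doc in docs:
              tokens = doc.lower().split()
              unigrams.update(tokens)
              bigrams.update(zip(tokens, tokens[1:]))
          total = sum(unigrams.values())
          scored = []
          for (w1, w2), c in bigrams.items():
              if c < min_count:
                  continue
              pmi = math.log((c / total) / ((unigrams[w1] / total) * (unigrams[w2] / total)))
              scored.append((pmi, f"{w1} {w2}"))
          return sorted(scored, reverse=True)

      # print(mine_bigrams(corpus)[:20])   # 'corpus' is an iterable of raw documents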

  10. Database citation in supplementary data linked to Europe PubMed Central full text biomedical articles.

    PubMed

    Kafkas, Şenay; Kim, Jee-Hyub; Pi, Xingjun; McEntyre, Johanna R

    2015-01-01

    In this study, we present an analysis of data citation practices in full text research articles and their corresponding supplementary data files, made available in the Open Access set of articles from Europe PubMed Central. Our aim is to investigate whether supplementary data files should be considered as a source of information for integrating the literature with biomolecular databases. Using text-mining methods to identify and extract a variety of core biological database accession numbers, we found that the supplemental data files contain many more database citations than the body of the article, and that those citations often take the form of a relatively small number of articles citing large collections of accession numbers in text-based files. Moreover, citation of value-added databases derived from submission databases (such as Pfam, UniProt or Ensembl) is common, demonstrating the reuse of these resources as datasets in themselves. All the database accession numbers extracted from the supplementary data are publicly accessible from http://dx.doi.org/10.5281/zenodo.11771. Our study suggests that supplementary data should be considered when linking articles with data, in curation pipelines, and in information retrieval tasks in order to make full use of the entire research article. These observations highlight the need to improve the management of supplemental data in general, in order to make this information more discoverable and useful.
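
    A minimal sketch of the accession-number extraction step, assuming a few rough regular expressions for common identifier formats; these patterns are illustrative approximations, not the ones used in the study, and the file name is hypothetical.

      # Simplified extraction of database accession numbers from supplementary
      # text files.  The patterns are rough approximations of common formats.
      import re

      PATTERNS = {
          "GenBank": re.compile(r"\b[A-Z]{1,2}\d{5,8}(?:\.\d+)?\b"),
          "UniProt": re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"),
          "PDB":     re.compile(r"\b[1-9][A-Za-z0-9]{3}\b"),
      }

      def extract_accessions(text):
          hits = {}
          for db, pattern in PATTERNS.items():
              found = sorted(set(pattern.findall(text)))
              if found:
                  hits[db] = found
          return hits

      # with open("supplementary_table_s1.txt") as fh:   # hypothetical file
      #     print(extract_accessions(fh.read()))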

  11. The Clinical Practice Library of Medicine (CPLM): An on-line biomedical computer library. System documentation

    NASA Technical Reports Server (NTRS)

    Grams, R. R.

    1982-01-01

    A system designed to access a large range of available medical textbook information in an online, interactive fashion is described. A high-level, query-type database manager, INQUIRE, is used. Operating instructions, system flow diagrams, database descriptions, text generation, and error messages are discussed. User information is provided.

  12. US Army Research Laboratory Joint Interagency Field Experimentation 15-2 Final Report

    DTIC Science & Technology

    2015-12-01

    February 2015, at Alameda Island, California. Advanced text analytics capabilities were demonstrated in a logically coherent workflow pipeline that... text processing capabilities allowed the targeted use of a persistent imagery sensor for rapid detection of mission-critical events. The creation of... a very large text database from open source data provides a relevant and unclassified foundation for continued development of text-processing

  13. Block-suffix shifting: fast, simultaneous medical concept set identification in large medical record corpora.

    PubMed

    Liu, Ying; Lita, Lucian Vlad; Niculescu, Radu Stefan; Mitra, Prasenjit; Giles, C Lee

    2008-11-06

    Owing to new advances in computer hardware, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we present a new, fast multi-string pattern matching method based on the well-known Aho-Corasick algorithm. Advantages of our algorithm include: the ability to exploit the natural structure of text, the ability to perform significant character shifting, the avoidance of backtracking jumps that are not useful, efficiency in terms of matching time, and the avoidance of typical "sub-string" false-positive errors. Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpus simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with top-of-the-line multi-string matching algorithms.
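
    For reference, the classic Aho-Corasick automaton that the block-suffix shifting (BSS) method builds on can be used for simultaneous multi-concept matching as sketched below; the example assumes the third-party pyahocorasick package is installed and does not implement the paper's character-shifting optimization.

      # Classic Aho-Corasick multi-pattern matching over a clinical note.
      # Assumes pyahocorasick (pip install pyahocorasick) is available.
      import ahocorasick

      def find_concepts(text, concepts):
          automaton = ahocorasick.Automaton()
          for concept in concepts:
              automaton.add_word(concept.lower(), concept)
          automaton.make_automaton()
          matches = []
          for end_index, concept in automaton.iter(text.lower()):
              start = end_index - len(concept) + 1
              matches.append((start, end_index, concept))
          return matches

      note = "Patient denies chest pain; history of diabetes mellitus."
      print(find_concepts(note, ["chest pain", "diabetes mellitus", "hypertension"]))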

  14. An Interactive Iterative Method for Electronic Searching of Large Literature Databases

    ERIC Educational Resources Information Center

    Hernandez, Marco A.

    2013-01-01

    PubMed® is an on-line literature database hosted by the U.S. National Library of Medicine. Containing over 21 million citations for biomedical literature--both abstracts and full text--in the areas of the life sciences, behavioral studies, chemistry, and bioengineering, PubMed® represents an important tool for researchers. PubMed® searches return…

  15. JEnsembl: a version-aware Java API to Ensembl data systems.

    PubMed

    Paterson, Trevor; Law, Andy

    2012-11-01

    The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications nor embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new Bioinformatics tools with which to access, analyse and visualize Ensembl data. The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl datasources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schema to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive) thus facilitating better analysis repeatability and allowing 'through time' comparative analyses to be performed. Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net).

  16. Development of a full-text information retrieval system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Oyama, Keizo; Miyazawa, Akira; Takasu, Atsuhiro; Shibano, Kouji

    The authors have carried out a project to develop a full-text information retrieval system. The system is designed to handle a document database comprising the full text of a large number of documents, such as academic papers. The document structures are utilized in searching and extracting appropriate information. The concept of structure handling and the configuration of the system are described in this paper.

  17. DocCube: Multi-Dimensional Visualization and Exploration of Large Document Sets.

    ERIC Educational Resources Information Center

    Mothe, Josiane; Chrisment, Claude; Dousset, Bernard; Alaux, Joel

    2003-01-01

    Describes a user interface that provides global visualizations of large document sets to help users formulate the query that corresponds to their information needs. Highlights include concept hierarchies that users can browse to specify and refine information needs; knowledge discovery in databases and texts; and multidimensional modeling.…

  18. De-identifying an EHR database - anonymity, correctness and readability of the medical record.

    PubMed

    Pantazos, Kostas; Lauesen, Soren; Lippert, Soren

    2011-01-01

    Electronic health records (EHR) contain a large amount of structured data and free text. Exploring and sharing clinical data can improve healthcare and facilitate the development of medical software. However, revealing confidential information is against ethical principles and laws. We de-identified a Danish EHR database with 437,164 patients. The goal was to generate a version with real medical records, but related to artificial persons. We developed a de-identification algorithm that uses lists of named entities, simple language analysis, and special rules. Our algorithm consists of 3 steps: collect lists of identifiers from the database and external resources, define a replacement for each identifier, and replace identifiers in structured data and free text. Some patient records could not be safely de-identified, so the de-identified database has 323,122 patient records with an acceptable degree of anonymity, readability and correctness (F-measure of 95%). The algorithm has to be adjusted for each culture, language and database.
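
    A minimal sketch of the three steps the abstract lists (collect identifier lists, define a replacement for each identifier, replace identifiers in structured data and free text); real de-identification needs far more care with context rules, partial matches and structured fields, and the names below are invented.

      # Toy de-identification: map identifiers to artificial values and
      # substitute them in free text, longest names first.
      import re

      def build_replacements(identifiers):
          # Steps 1-2: map each real identifier to a stable artificial value.
          return {name: f"PERSON_{i:05d}" for i, name in enumerate(sorted(identifiers))}

      def deidentify(text, replacements):
          # Step 3: replace identifiers in free text, longest names first.
          for name in sorted(replacements, key=len, reverse=True):
              text = re.sub(re.escape(name), replacements[name], text, flags=re.IGNORECASE)
          return text

      repl = build_replacements({"Anna Jensen", "Soren Lauesen"})
      print(deidentify("Anna Jensen was seen by Dr. Soren Lauesen.", repl))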

  19. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing

    PubMed Central

    2013-01-01

    Background A large-scale, highly accurate, machine-understandable drug-disease treatment relationship knowledge base is important for computational approaches to drug repurposing. The large body of published biomedical research articles and clinical case reports available on MEDLINE is a rich source of FDA-approved drug-disease indications as well as drug-repurposing knowledge that is crucial for applying FDA-approved drugs to new diseases. However, much of this information is buried in free text and not captured in any existing databases. The goal of this study is to extract a large number of accurate drug-disease treatment pairs from published literature. Results In this study, we developed a simple but highly accurate pattern-learning approach to extract treatment-specific drug-disease pairs from 20 million biomedical abstracts available on MEDLINE. We extracted a total of 34,305 unique drug-disease treatment pairs, the majority of which are not included in existing structured databases. Our algorithm achieved a precision of 0.904 and a recall of 0.131 in extracting all pairs, and a precision of 0.904 and a recall of 0.842 in extracting frequent pairs. In addition, we have shown that the extracted pairs strongly correlate with both drug target genes and therapeutic classes, and therefore may have high potential in drug discovery. Conclusions We demonstrated that our simple pattern-learning relationship extraction algorithm is able to accurately extract many drug-disease pairs from the free text of biomedical literature that are not captured in structured databases. The large-scale, accurate, machine-understandable drug-disease treatment knowledge base resulting from our study, in combination with pairs from structured databases, will have high potential in computational drug repurposing tasks. PMID:23742147
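
    A toy illustration of pattern-based extraction of drug-disease treatment pairs; the two patterns and the example sentence are assumptions for illustration, not the patterns learned in the study, and the entity boundaries are deliberately naive.

      # Extract (drug, disease) pairs from sentences using hand-written
      # surface patterns.  Illustrative only; boundaries are naive.
      import re

      PATTERNS = [
          re.compile(r"(?P<drug>[A-Za-z][\w-]*) (?:for|in) the treatment of (?P<disease>[^.,;]+)", re.I),
          re.compile(r"(?P<disease>[^.,;]+?) treated with (?P<drug>[A-Za-z][\w-]*)", re.I),
      ]

      def extract_pairs(sentence):
          pairs = set()
          for pattern in PATTERNS:
              for m in pattern.finditer(sentence):
                  pairs.add((m.group("drug").strip().lower(), m.group("disease").strip().lower()))
          return pairs

      print(extract_pairs("Imatinib for the treatment of chronic myeloid leukemia."))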

  20. DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures

    PubMed Central

    Yin, Xu-Cheng; Yang, Chun; Pei, Wei-Yi; Man, Haixia; Zhang, Jun; Learned-Miller, Erik; Yu, Hong

    2015-01-01

    Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/. PMID:25951377

  1. Text mining to decipher free-response consumer complaints: insights from the NHTSA vehicle owner's complaint database.

    PubMed

    Ghazizadeh, Mahtab; McDonald, Anthony D; Lee, John D

    2014-09-01

    This study applies text mining to extract clusters of vehicle problems and associated trends from free-response data in the National Highway Traffic Safety Administration's vehicle owner's complaint database. As the automotive industry adopts new technologies, it is important to systematically assess the effect of these changes on traffic safety. Driving simulators, naturalistic driving data, and crash databases all contribute to a better understanding of how drivers respond to changing vehicle technology, but other approaches, such as automated analysis of incident reports, are needed. Free-response data from incidents representing two severity levels (fatal incidents and incidents involving injury) were analyzed using a text mining approach: latent semantic analysis (LSA). LSA and hierarchical clustering identified clusters of complaints for each severity level, which were compared and analyzed across time. Cluster analysis identified eight clusters of fatal incidents and six clusters of incidents involving injury. Comparisons showed that although the airbag clusters across the two severity levels have the same most frequent terms, the circumstances around the incidents differ. The time trends show clear increases in complaints surrounding the Ford/Firestone tire recall and the Toyota unintended acceleration recall. Increases in complaints may be partially driven by these recall announcements and the associated media attention. Text mining can reveal useful information from free-response databases that would otherwise be prohibitively time-consuming and difficult to summarize manually. Text mining can extend human analysis capabilities for large free-response databases to support earlier detection of problems and more timely safety interventions.
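
    A compact sketch of an LSA-plus-hierarchical-clustering pipeline of the kind described above, using scikit-learn; the variable complaint_texts and all parameter values are placeholders, not the settings of the study.

      # TF-IDF -> truncated SVD (latent semantic analysis) -> agglomerative
      # clustering of free-response complaint narratives.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.cluster import AgglomerativeClustering

      def cluster_complaints(complaints, n_topics=50, n_clusters=8):
          tfidf = TfidfVectorizer(stop_words="english", min_df=2)
          X = tfidf.fit_transform(complaints)
          lsa = TruncatedSVD(n_components=n_topics, random_state=0)
          X_lsa = lsa.fit_transform(X)
          return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_lsa)

      # labels = cluster_complaints(complaint_texts)   # list of complaint narratives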

  2. JEnsembl: a version-aware Java API to Ensembl data systems

    PubMed Central

    Paterson, Trevor; Law, Andy

    2012-01-01

    Motivation: The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications nor embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new Bioinformatics tools with which to access, analyse and visualize Ensembl data. Results: The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl datasources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schema to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive) thus facilitating better analysis repeatability and allowing ‘through time’ comparative analyses to be performed. Availability: Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net). Contact: jensembl-develop@lists.sf.net, andy.law@roslin.ed.ac.uk, trevor.paterson@roslin.ed.ac.uk PMID:22945789

  3. Search extension transforms Wiki into a relational system: a case for flavonoid metabolite database.

    PubMed

    Arita, Masanori; Suwa, Kazuhiro

    2008-09-17

    In computer science, database systems are based on the relational model founded by Edgar Codd in 1970. In biology, on the other hand, the word 'database' often refers to loosely formatted, very large text files. Although such bio-databases may describe conflicts or ambiguities (e.g. a protein pair that does and does not interact, or unknown parameters) in a positive sense, the flexibility of the data format sacrifices a systematic query mechanism equivalent to the widely used SQL. To overcome this disadvantage, we propose embeddable string-search commands on a Wiki-based system and design a half-formatted database. As proof of principle, a database of flavonoids with 6902 molecular structures from over 1687 plant species was implemented on MediaWiki, the background system of Wikipedia. Registered users can describe any information in an arbitrary format. The structured part is subject to text-string searches to realize relational operations. The system was written in PHP as an extension of MediaWiki. All modifications are open source and publicly available. This scheme benefits from both the free-formatted Wiki style and the concise and structured relational-database style. MediaWiki supports multi-user environments for document management, and the cost of database maintenance is alleviated.

  4. Search extension transforms Wiki into a relational system: A case for flavonoid metabolite database

    PubMed Central

    Arita, Masanori; Suwa, Kazuhiro

    2008-01-01

    Background In computer science, database systems are based on the relational model founded by Edgar Codd in 1970. In biology, on the other hand, the word 'database' often refers to loosely formatted, very large text files. Although such bio-databases may describe conflicts or ambiguities (e.g. a protein pair that does and does not interact, or unknown parameters) in a positive sense, the flexibility of the data format sacrifices a systematic query mechanism equivalent to the widely used SQL. Results To overcome this disadvantage, we propose embeddable string-search commands on a Wiki-based system and design a half-formatted database. As proof of principle, a database of flavonoids with 6902 molecular structures from over 1687 plant species was implemented on MediaWiki, the background system of Wikipedia. Registered users can describe any information in an arbitrary format. The structured part is subject to text-string searches to realize relational operations. The system was written in PHP as an extension of MediaWiki. All modifications are open source and publicly available. Conclusion This scheme benefits from both the free-formatted Wiki style and the concise and structured relational-database style. MediaWiki supports multi-user environments for document management, and the cost of database maintenance is alleviated. PMID:18822113

  5. A practical approach for inexpensive searches of radiology report databases.

    PubMed

    Desjardins, Benoit; Hamilton, R Curtis

    2007-06-01

    We present a method to perform full-text searches of radiology reports for the large number of departments that do not have this ability as part of their radiology or hospital information system. A tool written in Microsoft Access (front-end) has been designed to search a server (back-end) containing an indexed weekly backup copy of the full relational database extracted from a radiology information system (RIS). This front-end/back-end approach has been implemented in a large academic radiology department, and is used for teaching, research and administrative purposes. This second weekly backup of the 80 GB, 4-million-record RIS database takes 2 hours. Further indexing of the exported radiology reports takes 6 hours. Individual searches typically take less than 1 minute on the indexed database and 30-60 minutes on the nonindexed database. Guidelines to properly address privacy and institutional review board issues are closely followed by all users. This method has potential to improve teaching, research, and administrative programs within radiology departments that cannot afford more expensive technology.

  6. SureChEMBL: a large-scale, chemically annotated patent document database.

    PubMed

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P

    2016-01-04

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  7. SureChEMBL: a large-scale, chemically annotated patent document database

    PubMed Central

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A.; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P.

    2016-01-01

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. PMID:26582922

  8. How Artificial Intelligence Can Improve Our Understanding of the Genes Associated with Endometriosis: Natural Language Processing of the PubMed Database

    PubMed Central

    Mashiach, R.; Cohen, S.; Kedem, A.; Baron, A.; Zajicek, M.; Feldman, I.; Seidman, D.; Soriano, D.

    2018-01-01

    Endometriosis is a disease characterized by the development of endometrial tissue outside the uterus, but its cause remains largely unknown. Numerous genes have been studied and proposed to help explain its pathogenesis. However, the large number of these candidate genes has made functional validation through experimental methodologies nearly impossible. Computational methods could provide a useful alternative for prioritizing those most likely to be susceptibility genes. Using artificial intelligence applied to text mining, this study analyzed the genes involved in the pathogenesis, development, and progression of endometriosis. The data extraction by text mining of the endometriosis-related genes in the PubMed database was based on natural language processing, and the data were filtered to remove false positives. Using data from the text mining and gene network information as input for the web-based tool, 15,207 endometriosis-related genes were ranked according to their score in the database. Characterization of the filtered gene set through gene ontology, pathway, and network analysis provided information about the numerous mechanisms hypothesized to be responsible for the establishment of ectopic endometrial tissue, as well as the migration, implantation, survival, and proliferation of ectopic endometrial cells. Finally, the human genome was scanned through various databases using filtered genes as a seed to determine novel genes that might also be involved in the pathogenesis of endometriosis but which have not yet been characterized. These genes could be promising candidates to serve as useful diagnostic biomarkers and therapeutic targets in the management of endometriosis. PMID:29750165

  9. How Artificial Intelligence Can Improve Our Understanding of the Genes Associated with Endometriosis: Natural Language Processing of the PubMed Database.

    PubMed

    Bouaziz, J; Mashiach, R; Cohen, S; Kedem, A; Baron, A; Zajicek, M; Feldman, I; Seidman, D; Soriano, D

    2018-01-01

    Endometriosis is a disease characterized by the development of endometrial tissue outside the uterus, but its cause remains largely unknown. Numerous genes have been studied and proposed to help explain its pathogenesis. However, the large number of these candidate genes has made functional validation through experimental methodologies nearly impossible. Computational methods could provide a useful alternative for prioritizing those most likely to be susceptibility genes. Using artificial intelligence applied to text mining, this study analyzed the genes involved in the pathogenesis, development, and progression of endometriosis. The data extraction by text mining of the endometriosis-related genes in the PubMed database was based on natural language processing, and the data were filtered to remove false positives. Using data from the text mining and gene network information as input for the web-based tool, 15,207 endometriosis-related genes were ranked according to their score in the database. Characterization of the filtered gene set through gene ontology, pathway, and network analysis provided information about the numerous mechanisms hypothesized to be responsible for the establishment of ectopic endometrial tissue, as well as the migration, implantation, survival, and proliferation of ectopic endometrial cells. Finally, the human genome was scanned through various databases using filtered genes as a seed to determine novel genes that might also be involved in the pathogenesis of endometriosis but which have not yet been characterized. These genes could be promising candidates to serve as useful diagnostic biomarkers and therapeutic targets in the management of endometriosis.

  10. DDMGD: the database of text-mined associations between genes methylated in diseases from different species.

    PubMed

    Bin Raies, Arwa; Mansour, Hicham; Incitti, Roberto; Bajic, Vladimir B

    2015-01-01

    Gathering information about associations between methylated genes and diseases is important for disease diagnosis and treatment decisions. Recent advancements in epigenetics research allow for large-scale discoveries of associations of genes methylated in diseases in different species. Searching manually for such information is not easy, as it is scattered across a large number of electronic publications and repositories. We therefore developed the DDMGD database (http://www.cbrc.kaust.edu.sa/ddmgd/) to provide a comprehensive repository of information related to genes methylated in diseases that can be found through text mining. DDMGD's scope is not limited to a particular group of genes, diseases or species. Using the text-mining system DEMGD, which we developed earlier, together with additional post-processing, we extracted associations of genes methylated in different diseases from PubMed Central articles and PubMed abstracts. The accuracy of the extracted associations is 82% as estimated on 2500 hand-curated entries. DDMGD provides a user-friendly interface facilitating retrieval of these associations ranked according to confidence scores. Submission of new associations to DDMGD is also supported. A comparison of DDMGD with several other databases focused on genes methylated in diseases shows that DDMGD is comprehensive and includes most of the recent information on genes methylated in diseases. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. A Description of the Text Data Base System TDBS. Stockholm Papers in Library and Information Science.

    ERIC Educational Resources Information Center

    Lofstrom, Mats

    Because experience with large information retrieval (IR) and database management (DBM) systems has shown that they are not adequate for handling textual material, two Swedish companies--Paralog and AU-System Network--have joined in a venture to develop a software package which combines features from IR and DBM systems to form a Text Data…

  12. Full-Text Databases in Medicine.

    ERIC Educational Resources Information Center

    Sievert, MaryEllen C.; And Others

    1995-01-01

    Describes types of full-text databases in medicine; discusses features for searching full-text journal databases available through online vendors; reviews research on full-text databases in medicine; and describes the MEDLINE/Full-Text Research Project at the University of Missouri (Columbia) which investigated precision, recall, and relevancy.…

  13. New baseline correction algorithm for text-line recognition with bidirectional recurrent neural networks

    NASA Astrophysics Data System (ADS)

    Morillot, Olivier; Likforman-Sulem, Laurence; Grosicki, Emmanuèle

    2013-04-01

    Many preprocessing techniques have been proposed for isolated word recognition. However, recently, recognition systems have dealt with text blocks and their compound text lines. In this paper, we propose a new preprocessing approach to efficiently correct baseline skew and fluctuations. Our approach is based on a sliding window within which the vertical position of the baseline is estimated. Segmentation of text lines into subparts is, thus, avoided. Experiments conducted on a large publicly available database (Rimes), with a BLSTM (bidirectional long short-term memory) recurrent neural network recognition system, show that our baseline correction approach highly improves performance.

  14. The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis.

    PubMed

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J; Inzé, Dirk; Van de Peer, Yves

    2013-03-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.

  15. Multi-source and ontology-based retrieval engine for maize mutant phenotypes

    PubMed Central

    Green, Jason M.; Harnsomburana, Jaturon; Schaeffer, Mary L.; Lawrence, Carolyn J.; Shyu, Chi-Ren

    2011-01-01

    Model Organism Databases, including the various plant genome databases, collect and enable access to massive amounts of heterogeneous information, including sequence data, gene product information, images of mutant phenotypes, etc., as well as textual descriptions of many of these entities. While a variety of basic browsing and search capabilities are available to allow researchers to query and peruse the names and attributes of phenotypic data, next-generation search mechanisms that allow querying and ranking of text descriptions are much less common. In addition, the plant community needs an innovative way to leverage the existing links in these databases to search groups of text descriptions simultaneously. Furthermore, though much time and effort have been afforded to the development of plant-related ontologies, the knowledge embedded in these ontologies remains largely unused in available plant search mechanisms. Addressing these issues, we have developed a unique search engine for mutant phenotypes from MaizeGDB. This advanced search mechanism integrates various text description sources in MaizeGDB to aid a user in retrieving desired mutant phenotype information. Currently, descriptions of mutant phenotypes, loci and gene products are utilized collectively for each search, though expansion of the search mechanism to include other sources is straightforward. The retrieval engine, to our knowledge, is the first engine to exploit the content and structure of available domain ontologies, currently the Plant and Gene Ontologies, to expand and enrich retrieval results in major plant genomic databases. Database URL: http://www.PhenomicsWorld.org/QBTA.php PMID:21558151

  16. ORFer--retrieval of protein sequences and open reading frames from GenBank and storage into relational databases or text files.

    PubMed

    Büssow, Konrad; Hoffmann, Steve; Sievert, Volker

    2002-12-19

    Functional genomics involves parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite for the cloning and recombinant expression of these proteins. A Java program was developed for retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include the sequence name, the organism and the completeness of the sequence. The program has a graphical user interface, although it can also be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks its correct translation. ORFer accepts user input in the form of single GenBank GI identifiers or accession numbers, or lists of them. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or exported as text files in FASTA or tab-delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer. The ORFer program allows fast retrieval of DNA sequences, protein sequences and their open reading frames and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported. Such a database can supplement a laboratory information management system (LIMS) with appropriate sequence information.
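
    ORFer itself is a Java tool that parses GenBank XML; the sketch below shows a comparable retrieval route using Biopython's Entrez utilities, where the e-mail address and accession number are placeholders and the GenPept rettype is an assumption about the appropriate download format.

      # Fetch a GenBank protein record and pull out its CDS cross-reference,
      # which points back to the nucleotide open reading frame.
      from Bio import Entrez, SeqIO

      Entrez.email = "you@example.org"   # NCBI requires a contact address; placeholder

      def fetch_protein(accession):
          handle = Entrez.efetch(db="protein", id=accession, rettype="gp", retmode="text")
          record = SeqIO.read(handle, "genbank")
          handle.close()
          # The CDS feature of a GenPept record names the coding sequence
          # via its 'coded_by' qualifier.
          coded_by = [f.qualifiers.get("coded_by") for f in record.features if f.type == "CDS"]
          return record.description, str(record.seq), coded_by

      # desc, seq, coded_by = fetch_protein("NP_414542.1")   # example accession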

  17. Text Mining to Support Gene Ontology Curation and Vice Versa.

    PubMed

    Ruch, Patrick

    2017-01-01

    In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the development of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225% in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA, which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Database workflows must start recording explicitly all the data they curate, and ideally also some of the data they do not curate.

  18. PaperBLAST: Text Mining Papers for Information about Homologs

    DOE PAGES

    Price, Morgan N.; Arkin, Adam P.

    2017-08-15

    Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.

  19. PaperBLAST: Text Mining Papers for Information about Homologs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Price, Morgan N.; Arkin, Adam P.

    Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.

  20. PaperBLAST: Text Mining Papers for Information about Homologs

    PubMed Central

    Price, Morgan N.; Arkin, Adam P.

    2017-01-01

    ABSTRACT Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/. IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions. PMID:28845458

  1. PaperBLAST: Text Mining Papers for Information about Homologs.

    PubMed

    Price, Morgan N; Arkin, Adam P

    2017-01-01

    Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST's database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/. IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins' functions.

  2. Duplicate document detection in DocBrowse

    NASA Astrophysics Data System (ADS)

    Chalana, Vikram; Bruce, Andrew G.; Nguyen, Thien

    1998-04-01

    Duplicate documents are frequently found in large databases of digital documents, such as those found in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithms to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second algorithm is based on wavelet features extracted from the document image itself, and the third algorithm is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images, which is currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that the text-based method has an average 11-point precision of 97.7 percent while the image-based method has an average 11-point precision of 98.9 percent. However, in general, the text-based method performs better when the document contains enough high-quality machine-printed text while the image-based method performs better when the document contains little or no quality machine-readable text.
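
    A bare-bones sketch of the text-content route to duplicate detection: cosine similarity over word-count vectors of the OCR text. DocBrowse's actual text and wavelet features are more elaborate; the threshold and variable names here are assumptions.

      # Flag near-duplicate documents by cosine similarity of word counts.
      import math
      from collections import Counter

      def cosine(a, b):
          common = set(a) & set(b)
          num = sum(a[t] * b[t] for t in common)
          den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
          return num / den if den else 0.0

      def find_duplicates(docs, threshold=0.9):
          vectors = [Counter(d.lower().split()) for d in docs]
          pairs = []
          for i in range(len(vectors)):
              for j in range(i + 1, len(vectors)):
                  if cosine(vectors[i], vectors[j]) >= threshold:
                      pairs.append((i, j))
          return pairs

      # print(find_duplicates(ocr_texts))   # 'ocr_texts' holds OCR output per document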

  3. Overview of Historical Earthquake Document Database in Japan and Future Development

    NASA Astrophysics Data System (ADS)

    Nishiyama, A.; Satake, K.

    2014-12-01

    In Japan, damage and disasters from large historical earthquakes have been documented and preserved. Compilation of historical earthquake documents started in the early 20th century, and 33 volumes of historical document source books (about 27,000 pages) have been published. However, these source books are not effectively utilized by researchers because they are contaminated with low-reliability historical records and are difficult to search by keyword, such as characters and dates. To overcome these problems and to promote historical earthquake studies in Japan, construction of text databases started in the 21st century. For historical earthquakes from the beginning of the 7th century to the early 17th century, the "Online Database of Historical Documents in Japanese Earthquakes and Eruptions in the Ancient and Medieval Ages" (Ishibashi, 2009) has already been constructed. Its compilers investigated the source books or original texts of the historical literature, emended the descriptions, and assigned a reliability to each historical document on the basis of its written age. Another project compiled the historical documents for seven damaging earthquakes that occurred along the Sea of Japan coast in Honshu, central Japan, in the Edo period (from the beginning of the 17th century to the middle of the 19th century) and constructed a text database and a seismic intensity database. These are now available on the web (in Japanese only). However, only about 9% of the earthquake source books have been digitized so far. Therefore, we plan to digitize all of the remaining historical documents under a research program which started in 2014. The specification of the database will be similar to the previous ones. We also plan to combine this database with a liquefaction traces database, which will be constructed by another research program, by adding the location information described in the historical documents. The resulting database will be used to estimate the distributions of seismic intensities and tsunami heights.

  4. Relationship mapping

    NASA Astrophysics Data System (ADS)

    Benachenhou, D.

    2009-04-01

    Information-technology departments in large enterprises spend 40% of their budgets on information integration: combining information from different data sources into a coherent form. IDC, a market-intelligence firm, estimates that the market for data integration and access software (which includes the key enabling technology for information integration) was about 2.5 billion in 2007 and is expected to grow to 3.8 billion in 2012. This is only the cost estimate for structured or traditional database information integration. Just imagine the additional market for transforming text into structured information and subsequently fusing it with traditional databases.

  5. Efficient privacy-preserving string search and an application in genomics.

    PubMed

    Shimizu, Kana; Nuida, Koji; Rätsch, Gunnar

    2016-06-01

    Personal genomes carry inherent privacy risks, and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private, and the server would like to keep the database private. We propose a novel approach that combines efficient string data structures, such as the Burrows-Wheeler transform, with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data can be searched with efficient iterative query operations over a large indexed dictionary, for instance one built from large genome collections using the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer, based on additive homomorphic encryption, to conceal the sequence query and the genomic region of interest in positional queries. We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During the search, the user can only identify the longest match, while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within [Formula: see text] 4.6 s and [Formula: see text] 10.8 s on the client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm. Availability: https://github.com/iskana/PBWT-sec and https://github.com/ratschlab/PBWT-sec. Contact: shimizu-kana@aist.go.jp or Gunnar.Ratsch@ratschlab.org. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
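    As a rough illustration of the string-index building block (without the cryptographic layer), the sketch below performs plain backward search over a Burrows-Wheeler transform; the privacy-preserving protocol in the paper wraps comparable iterative queries in oblivious transfer, which is omitted here.

```python
# Sketch of (non-private) backward search over a Burrows-Wheeler transform,
# counting occurrences of a pattern in the indexed text.
def bwt(text):
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    # C[c] = number of characters in the text that sort before c
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def occ(ch, i):          # occurrences of ch in bwt_str[:i]
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        lo = C.get(ch, 0) + occ(ch, lo)
        hi = C.get(ch, 0) + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo           # number of matches of pattern in the text

index = bwt("GATTACAGATTACA")
print(backward_search(index, "ATTA"))   # -> 2
```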

  6. The Potential of Text Mining in Data Integration and Network Biology for Plant Research: A Case Study on Arabidopsis

    PubMed Central

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J.; Inzé, Dirk; Van de Peer, Yves

    2013-01-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein–protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies. PMID:23532071

  7. Corpus-based Statistical Screening for Phrase Identification

    PubMed Central

    Kim, Won; Wilbur, W. John

    2000-01-01

    Purpose: The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists with a high percentage of useful material. Method: The approach is to develop six different scoring methods that are based on different aspects of phrase occurrence. The emphasis here is not on lexical information or syntactic structure but rather on the statistical properties of word pairs and triples that can be obtained from a large database. Measurements: The Unified Medical Language System (UMLS) incorporates a large list of humanly acceptable phrases in the medical field as a part of its structure. The authors use this list of phrases as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings. Result: The authors find that each of the six scoring methods proves effective in identifying UMLS-quality phrases in a large subset of MEDLINE. These methods are applicable both to word pairs and word triples. All six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases. Conclusion: Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text. PMID:10984469
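    One way to picture a corpus-statistics score of this kind is pointwise mutual information over adjacent word pairs; the sketch below is illustrative and is not one of the six scores defined in the paper.

```python
# Pointwise mutual information for adjacent word pairs, computed from toy
# corpus counts; high-PMI pairs are candidate phrases.
import math
from collections import Counter

sentences = [
    "myocardial infarction was confirmed by ecg",
    "acute myocardial infarction requires urgent care",
    "the patient denied chest pain",
]

unigrams, bigrams = Counter(), Counter()
token_total, bigram_total = 0, 0
for s in sentences:
    tokens = s.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
    token_total += len(tokens)
    bigram_total += max(len(tokens) - 1, 0)

def pmi(w1, w2):
    p_pair = bigrams[(w1, w2)] / bigram_total
    p1, p2 = unigrams[w1] / token_total, unigrams[w2] / token_total
    return math.log2(p_pair / (p1 * p2))

ranked = sorted(bigrams, key=lambda pair: pmi(*pair), reverse=True)
for pair in ranked[:3]:
    print(pair, round(pmi(*pair), 2))
```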

  8. Academic Journal Embargoes and Full Text Databases.

    ERIC Educational Resources Information Center

    Brooks, Sam

    2003-01-01

    Documents the reasons for embargoes of academic journals in full text databases (i.e., publisher-imposed delays on the availability of full text content) and provides insight regarding common misconceptions. Tables present data on selected journals covering a cross-section of subjects and publishers and comparing two full text business databases.…

  9. Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics.

    PubMed

    Chung, Ming-Hua; Wang, Yuping; Tang, Hailin; Zou, Wen; Basinger, John; Xu, Xiaowei; Tong, Weida

    2015-01-01

    The advancement of high-throughput screening technologies facilitates the generation of massive amounts of biological data, a big data phenomenon in biomedical science. Yet researchers still rely heavily on keyword search and/or literature review to navigate the databases, and analyses are often done at a rather small scale. As a result, the rich information in a database has not been fully utilized, particularly the information embedded in the interactions between data points, which is largely ignored and buried. Over the past 10 years, probabilistic topic modeling has been recognized as an effective machine learning approach for annotating the hidden thematic structure of massive collections of documents. The analogy between a text corpus and large-scale genomic data enables the application of text mining tools, such as probabilistic topic models, to explore hidden patterns in genomic data and, by extension, altered biological functions. In this paper, we developed a generalized probabilistic topic model to analyze a toxicogenomics dataset that consists of a large number of gene expression profiles from rat livers treated with drugs at multiple doses and time points. We discovered hidden patterns in gene expression associated with the effects of dose and time point of treatment. Finally, we illustrated the ability of our model to identify evidence supporting a potential reduction in animal use.

  10. Medical Image Databases

    PubMed Central

    Tagare, Hemant D.; Jaffe, C. Carl; Duncan, James

    1997-01-01

    Abstract Information contained in medical images differs considerably from that residing in alphanumeric format. The difference can be attributed to four characteristics: (1) the semantics of medical knowledge extractable from images is imprecise; (2) image information contains form and spatial data, which are not expressible in conventional language; (3) a large part of image information is geometric; (4) diagnostic inferences derived from images rest on an incomplete, continuously evolving model of normality. This paper explores the differentiating characteristics of text versus images and their impact on design of a medical image database intended to allow content-based indexing and retrieval. One strategy for implementing medical image databases is presented, which employs object-oriented iconic queries, semantics by association with prototypes, and a generic schema. PMID:9147338

  11. Full Text Psychology Journals Available from Popular Library Databases

    ERIC Educational Resources Information Center

    Joswick, Kathleen E.

    2006-01-01

    The author identified 433 core journals in psychology and investigated their full text availability in popular databases. While 62 percent of the studied journals were available in at least one database, access from individual databases ranged from 1.4 percent to 38.1 percent of the titles. The full text of influential psychology journals is not…

  12. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research.

    PubMed

    Bravo, Àlex; Piñero, Janet; Queralt-Rosinach, Núria; Rautschka, Michael; Furlong, Laura I

    2015-02-21

    Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text mined data with data curated by experts appears as a suitable approach to both assess data quality and highlight novel and interesting information.
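    A sentence-level co-occurrence baseline conveys the flavor of the extraction task, although BeFree itself relies on morpho-syntactic patterns on top of entity recognition; the dictionaries and sentences below are toy placeholders.

```python
# Sketch of a sentence-level co-occurrence baseline for gene-disease
# associations; the gene and disease dictionaries are illustrative.
import re

GENES = {"SLC6A4", "BDNF", "TP53"}
DISEASES = {"depression", "breast cancer"}

def cooccurrences(abstract):
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        genes = {g for g in GENES if g in sentence}
        diseases = {d for d in DISEASES if d in sentence.lower()}
        pairs.update((g, d) for g in genes for d in diseases)
    return pairs

text = ("Variants in SLC6A4 have been linked to depression. "
        "TP53 mutations are frequent in breast cancer.")
print(cooccurrences(text))
```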

  13. DiMeX: A Text Mining System for Mutation-Disease Association Extraction.

    PubMed

    Mahmood, A S M Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
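    The mutation-mention component can be illustrated with a small regular-expression sketch for two common substitution formats; DiMeX's own grammar covers many more variant notations.

```python
# Regular expressions for simple protein-level substitutions in the
# one-letter (V600E) and three-letter (p.Val600Glu) styles; real systems
# handle many additional mutation formats.
import re

ONE_LETTER = r"\b[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]\b"
THREE_LETTER = r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b"
MUTATION = re.compile(f"{ONE_LETTER}|{THREE_LETTER}")

sentence = ("The BRAF V600E substitution, also written p.Val600Glu, "
            "was associated with melanoma in this cohort.")
print(MUTATION.findall(sentence))   # ['V600E', 'p.Val600Glu']
```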

  14. DiMeX: A Text Mining System for Mutation-Disease Association Extraction

    PubMed Central

    Mahmood, A. S. M. Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K.

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases. PMID:27073839

  15. Why do nursing homes close? An analysis of newspaper articles.

    PubMed

    Fisher, Andrew; Castle, Nicholas

    2012-01-01

    Using Non-numerical Unstructured Data Indexing Searching and Theorizing (NUD'IST) software to extract and examine keywords from text, the authors explored the phenomenon of nursing home closure through an analysis of 30 major-market newspapers over a period of 66 months (January 1, 1999 to June 1, 2005). Newspaper articles typically represent a careful analysis of staff impressions via interviews, managerial perspectives, and financial records review. There is a current reliance on the synthesis of information from large regulatory databases such as the Online Survey Certification And Reporting database, the California Office of Statewide Healthcare Planning and Development database, and Area Resource Files. Although such databases permit the construction of studies capable of revealing some reasons for nursing home closure, they are hampered by the confines of the data entered. Using our analysis of newspaper articles, the authors are able to add further to their understanding of nursing home closures.

  16. High-Reproducibility and High-Accuracy Method for Automated Topic Classification

    NASA Astrophysics Data System (ADS)

    Lancichinetti, Andrea; Sirer, M. Irmak; Wang, Jane X.; Acuna, Daniel; Körding, Konrad; Amaral, Luís A. Nunes

    2015-01-01

    Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.
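    A standard LDA baseline of the kind the paper analyzes can be run in a few lines with scikit-learn (assuming the library is available); the paper's contribution is a more reproducible inference scheme, which this sketch does not implement.

```python
# Minimal LDA topic-modeling baseline with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the striker scored twice in the cup final",
    "parliament debated the new tax bill",
    "the midfielder was transferred for a record fee",
    "the senate vote on the budget was postponed",
]

counts = CountVectorizer(stop_words="english").fit(documents)
doc_term = counts.transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```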

  17. Validating a strategy for psychosocial phenotyping using a large corpus of clinical text.

    PubMed

    Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H

    2013-12-01

    To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles were chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 1.6-0). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype.
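    A toy analogue of lexicon-based concept extraction with a simple negation check is sketched below; the actual pipeline (v3NLP) is a UIMA-AS system with dedicated negation and template annotators, and the lexicon and cue lists here are illustrative.

```python
# Lexicon matching with a crude window-based negation check, as a toy
# analogue of concept extraction from clinical notes.
import re

LEXICON = {"homeless", "unemployed", "incarcerated"}
NEGATION_CUES = {"no", "denies", "not"}

def extract_concepts(note):
    found = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.lower()):
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            word = token.strip(".,;:")
            if word in LEXICON:
                negated = any(t in NEGATION_CUES for t in tokens[max(0, i - 3):i])
                found.append((word, "negated" if negated else "affirmed"))
    return found

note = "Patient is currently homeless. He denies being unemployed."
print(extract_concepts(note))
```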

  18. Validating a strategy for psychosocial phenotyping using a large corpus of clinical text

    PubMed Central

    Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H

    2013-01-01

    Objective To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. Materials and methods From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles were chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. Results A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 1.6–0). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Conclusions Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype. PMID:24169276

  19. Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD).

    PubMed

    Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Hagit

    2017-01-01

    The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. www.informatics.jax.org. © The Author(s) 2017. Published by Oxford University Press.
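    A minimal sketch of a relevance classifier in this spirit combines abstract text with figure captions under a TF-IDF representation; the tiny training set and the specific model below are assumptions for illustration, not the authors' exact feature-selection pipeline.

```python
# Relevance triage sketch: concatenate abstract and caption text, then train
# a TF-IDF + logistic regression classifier on labeled examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def merge(record):
    # Combine title-and-abstract text with figure captions.
    return record["abstract"] + " " + record["captions"]

train = [
    {"abstract": "In situ hybridization shows Shh expression in the mouse limb bud.",
     "captions": "Figure 1: Shh expression at E10.5.", "label": 1},
    {"abstract": "We review the history of laboratory mouse husbandry.",
     "captions": "Figure 1: Timeline of inbred strains.", "label": 0},
]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit([merge(r) for r in train], [r["label"] for r in train])

new_doc = {"abstract": "Expression of Pax6 during embryonic eye development.",
           "captions": "Figure 2: Pax6 in the optic vesicle."}
print(clf.predict([merge(new_doc)]))
```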

  20. PhAST: pharmacophore alignment search tool.

    PubMed

    Hähnke, Volker; Hofmann, Bettina; Grgat, Tomislav; Proschak, Ewgenij; Steinhilber, Dieter; Schneider, Gisbert

    2009-04-15

    We present a ligand-based virtual screening technique (PhAST) for rapid hit and lead structure searching in large compound databases. Molecules are represented as strings encoding the distribution of pharmacophoric features on the molecular graph. In contrast to other text-based methods using SMILES strings, we introduce a new form of text representation that describes the pharmacophore of molecules. This string representation opens the opportunity for revealing functional similarity between molecules by sequence alignment techniques in analogy to homology searching in protein or nucleic acid sequence databases. We favorably compared PhAST with other current ligand-based virtual screening methods in a retrospective analysis using the BEDROC metric. In a prospective application, PhAST identified two novel inhibitors of 5-lipoxygenase product formation with minimal experimental effort. This outcome demonstrates the applicability of PhAST to drug discovery projects and provides an innovative concept of sequence-based compound screening with substantial scaffold hopping potential. 2008 Wiley Periodicals, Inc.

  1. Where Full-Text Is Viable.

    ERIC Educational Resources Information Center

    Cotton, P. L.

    1987-01-01

    Defines two types of online databases: source, referring to those intended to be complete in themselves, whether full-text or abstracts; and bibliographic, meaning those that are not complete. Predictions are made about the future growth rate of these two types of databases, as well as full-text versus abstract databases. (EM)

  2. Text messaging in health care: a systematic review of impact studies.

    PubMed

    Yeager, Valerie A; Menachemi, Nir

    2011-01-01

    Studies suggest text messaging is beneficial to health care; however, no one has synthesized the overall evidence on texting interventions. In response to this need, we conducted a systematic review of the impacts of text messaging in health care. PubMed database searches and subsequent reference list reviews sought English-language, peer-reviewed studies involving text messaging in health care. Commentaries, conference proceedings, and feasibility studies were excluded. Data were extracted using an article coding sheet and entered into a database for analysis. Of the 61 papers reviewed, 50 articles (82%) found text messaging had a positive effect on the primary outcome. Average sample sizes in articles reporting positive findings (n=813) were significantly larger than in those that did not find a positive impact (n=178) on outcomes (p = 0.032). Articles were categorized into focal groups as follows: 27 articles (44.3%) investigated the impact of texting on disease management, 24 articles (39.3%) focused on texting's impact on public health related outcomes, and 10 articles (16.4%) examined texting and its influence on administrative processes. Articles in the focal groups differed by the purpose of the study, the direction of the communication, and where they were published, but not in the likelihood of reporting a positive impact from texting. Current evidence indicates that text messaging health care interventions are largely beneficial clinically, in public health related uses, and in terms of administrative processes. However, despite the promise of these findings, literature gaps exist, especially in primary care settings, across geographic regions, and with vulnerable populations.

  3. SparkText: Biomedical Text Mining on Big Data Framework.

    PubMed

    Ye, Zhan; Tafti, Ahmad P; He, Karen Y; Wang, Kai; He, Max M

    Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.

  4. SparkText: Biomedical Text Mining on Big Data Framework

    PubMed Central

    He, Karen Y.; Wang, Kai

    2016-01-01

    Background Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. Results In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. Conclusions This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research. PMID:27685652

  5. GEMINI: a computationally-efficient search engine for large gene expression datasets.

    PubMed

    DeFreitas, Timothy; Saddiki, Hachem; Flaherty, Patrick

    2016-02-24

    Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number, or metadata tag. However, in this search paradigm the form of the query - a text-based string - is mismatched with the form of the target - a genomic profile. To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and in certain circumstances achieves an [Formula: see text] expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that in practice it achieves a query time that scales as the logarithm of the number of records. In a database with 10^5 samples, GEMINI identifies the nearest neighbor in 0.05 s, compared to a brute-force search time of 0.6 s. GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin, or other metadata.
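    The query-by-profile idea can be sketched with a brute-force nearest-neighbor search over synthetic profiles; GEMINI replaces the linear scan shown here with a vantage-point tree so that the expected query time grows only logarithmically with database size.

```python
# Brute-force nearest-neighbor baseline for query-by-profile search over a
# synthetic expression database (samples x genes).
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(100_000, 50))   # 100k synthetic samples, 50 genes
query = rng.normal(size=50)

distances = np.linalg.norm(database - query, axis=1)
nearest = int(np.argmin(distances))
print(f"nearest sample: {nearest}, distance {distances[nearest]:.3f}")
```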

  6. Database citation in full text biomedical articles.

    PubMed

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.
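    Accession-style citations of the kind counted here can be spotted with simple regular expressions; the patterns below are simplified (note that the UniProt accession in the example also matches the loose nucleotide pattern, which is why production pipelines add context checks to keep precision high).

```python
# Simplified regular expressions for a few accession formats; real
# text-mining pipelines add contextual rules to disambiguate them.
import re

PATTERNS = {
    "ENA/GenBank": r"\b[A-Z]{1,2}\d{5,6}\b",
    "UniProt":     r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b",
    "PDB":         r"\b[0-9][A-Za-z0-9]{3}\b",
}

text = ("The sequence was deposited under accession AB123456; the protein "
        "P04637 corresponds to PDB entry 1TUP.")

for database, pattern in PATTERNS.items():
    print(database, re.findall(pattern, text))
```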

  7. Database Citation in Full Text Biomedical Articles

    PubMed Central

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R.

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176

  8. Keyless Entry: Building a Text Database Using OCR Technology.

    ERIC Educational Resources Information Center

    Grotophorst, Clyde W.

    1989-01-01

    Discusses the use of optical character recognition (OCR) technology to produce an ASCII text database. A tutorial on digital scanning and OCR is provided, and a systems integration project which used the Calera CDP-3000XF scanner and text retrieval software to construct a database of dissertations at George Mason University is described. (four…

  9. A Tutorial in Creating Web-Enabled Databases with Inmagic DB/TextWorks through ODBC.

    ERIC Educational Resources Information Center

    Breeding, Marshall

    2000-01-01

    Explains how to create Web-enabled databases. Highlights include Inmagic's DB/Text WebPublisher product called DB/TextWorks; ODBC (Open Database Connectivity) drivers; Perl programming language; HTML coding; Structured Query Language (SQL); Common Gateway Interface (CGI) programming; and examples of HTML pages and Perl scripts. (LRW)

  10. Database Management Systems: New Homes for Migrating Bibliographic Records.

    ERIC Educational Resources Information Center

    Brooks, Terrence A.; Bierbaum, Esther G.

    1987-01-01

    Assesses bibliographic databases as part of visionary text systems such as hypertext and scholars' workstations. Downloading is discussed in terms of the capability to search records and to maintain unique bibliographic descriptions, and relational database management systems, file managers, and text databases are reviewed as possible hosts for…

  11. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    PubMed

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  12. Preparing College Students To Search Full-Text Databases: Is Instruction Necessary?

    ERIC Educational Resources Information Center

    Riley, Cheryl; Wales, Barbara

    Full-text databases allow Central Missouri State University's clients to access some of the serials that libraries have had to cancel due to escalating subscription costs; EbscoHost, the subject of this study, is one such database. The database is available free to all Missouri residents. A survey was designed consisting of 21 questions intended…

  13. Search Filter Precision Can Be Improved By NOTing Out Irrelevant Content

    PubMed Central

    Wilczynski, Nancy L.; McKibbon, K. Ann; Haynes, R. Brian

    2011-01-01

    Background: Most methodologic search filters developed for use in large electronic databases such as MEDLINE have low precision. One method that has been proposed but not tested for improving precision is NOTing out irrelevant content. Objective: To determine if search filter precision can be improved by NOTing out the text words and index terms assigned to those articles that are retrieved but are off-target. Design: Analytic survey. Methods: NOTing out unique terms in off-target articles and testing search filter performance in the Clinical Hedges Database. Main Outcome Measures: Sensitivity, specificity, precision and number needed to read (NNR). Results: For all purpose categories (diagnosis, prognosis and etiology) except treatment and for all databases (MEDLINE, EMBASE, CINAHL and PsycINFO), constructing search filters that NOTed out irrelevant content resulted in substantive improvements in NNR (over four-fold for some purpose categories and databases). Conclusion: Search filter precision can be improved by NOTing out irrelevant content. PMID:22195215
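    The NOTing-out strategy amounts to appending a NOT clause built from terms found only in retrieved-but-irrelevant records; the term lists and PubMed-style field tags below are illustrative, not the filters evaluated in the study.

```python
# Sketch of building an improved search filter by NOTing out terms drawn
# from off-target articles; query syntax is PubMed-style and illustrative.
base_filter = "(sensitivity[tiab] OR specificity[tiab] OR diagnos*[tiab])"
noise_terms = ["case report", "editorial", "in vitro"]   # from off-target articles

not_clause = " NOT (" + " OR ".join(f'"{t}"[tiab]' for t in noise_terms) + ")"
improved_filter = base_filter + not_clause
print(improved_filter)
```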

  14. Is More Always Better?

    ERIC Educational Resources Information Center

    Bell, Steven J.

    2003-01-01

    Discusses full-text databases and whether existing aggregator databases are meeting user needs. Topics include the need for better search interfaces; concepts of quality research and information retrieval; information overload; full text in electronic journal collections versus aggregator databases; underrepresentation of certain disciplines; and…

  15. High-Performance Secure Database Access Technologies for HEP Grids

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Matthew Vranicar; John Weicher

    2006-04-17

    The Large Hadron Collider (LHC) at the CERN Laboratory will become the largest scientific instrument in the world when it starts operations in 2007. Large Scale Analysis Computer Systems (computational grids) are required to extract rare signals of new physics from petabytes of LHC detector data. In addition to file-based event data, LHC data processing applications require access to large amounts of data in relational databases: detector conditions, calibrations, etc. U.S. high energy physicists demand efficient performance of grid computing applications in LHC physics research where world-wide remote participation is vital to their success. To empower physicists with data-intensive analysis capabilities, a whole hyperinfrastructure of distributed databases cross-cuts a multi-tier hierarchy of computational grids. The crosscutting allows separation of concerns across both the global environment of a federation of computational grids and the local environment of a physicist's computer used for analysis. Very few efforts are on-going in the area of database and grid integration research. Most of these are outside of the U.S. and rely on traditional approaches to secure database access via an extraneous security layer separate from the database system core, preventing efficient data transfers. Our findings are shared by the Database Access and Integration Services Working Group of the Global Grid Forum, which states that "Research and development activities relating to the Grid have generally focused on applications where data is stored in files. However, in many scientific and commercial domains, database management systems have a central role in data storage, access, organization, authorization, etc, for numerous applications." There is a clear opportunity for a technological breakthrough, requiring innovative steps to provide high-performance secure database access technologies for grid computing. We believe that an innovative database architecture in which secure authorization is pushed into the database engine will eliminate inefficient data transfer bottlenecks. Furthermore, traditionally separated database and security layers provide an extra vulnerability, leaving weak clear-text password authorization as the only protection on the database core systems. Due to the legacy limitations of the systems' security models, the allowed passwords often cannot even comply with the DOE password guideline requirements. We see an opportunity for tight integration of the secure authorization layer with the database server engine, resulting in both improved performance and improved security. Phase I has focused on the development of a proof-of-concept prototype using Argonne National Laboratory's (ANL) Argonne Tandem-Linac Accelerator System (ATLAS) project as a test scenario. By developing a grid-security enabled version of the ATLAS project's current relational database solution, MySQL, PIOCON Technologies aims to offer a more efficient solution to secure database access.

  16. Knowledge based word-concept model estimation and refinement for biomedical text mining.

    PubMed

    Jimeno Yepes, Antonio; Berlanga, Rafael

    2015-02-01

    Text mining of scientific literature has been essential for setting up large public biomedical databases, which are widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KBs) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have been devised not for text mining tasks but for human interpretation, so the performance of KB-based methods is usually lower than that of supervised machine learning methods. The disadvantage of supervised methods, though, is that they require labeled training data and are therefore not useful for large-scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method takes into account not only the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches. Copyright © 2014 Elsevier Inc. All rights reserved.

  17. Performance of an open-source heart sound segmentation algorithm on eight independent databases.

    PubMed

    Liu, Chengyu; Springer, David; Clifford, Gari D

    2017-08-01

    Heart sound segmentation is a prerequisite step for the automatic analysis of heart sound signals, facilitating the subsequent identification and classification of pathological events. Recently, hidden Markov model-based algorithms have received increased interest due to their robustness in processing noisy recordings. In this study we aim to evaluate the performance of the recently published logistic regression based hidden semi-Markov model (HSMM) heart sound segmentation method, by using a wider variety of independently acquired data of varying quality. Firstly, we constructed a systematic evaluation scheme based on a new collection of heart sound databases, which we assembled for the PhysioNet/CinC Challenge 2016. This collection includes a total of more than 120 000 s of heart sounds recorded from 1297 subjects (including both healthy subjects and cardiovascular patients) and comprises eight independent heart sound databases sourced from multiple independent research groups around the world. Then, the HSMM-based segmentation method was evaluated using the eight assembled databases. The common evaluation metrics of sensitivity, specificity, and accuracy, as well as the [Formula: see text] measure, were used. In addition, the effect of varying the tolerance window for determining a correct segmentation was evaluated. The results confirm the high accuracy of the HSMM-based algorithm on a separate test dataset comprising 102 306 heart sounds. Average [Formula: see text] scores of 98.5% for segmenting S1 and systole intervals and 97.2% for segmenting S2 and diastole intervals were observed. The [Formula: see text] score was shown to increase with an increase in the tolerance window size, as expected. The high segmentation accuracy of the HSMM-based algorithm on a large database confirmed the algorithm's effectiveness. The described evaluation framework, combined with the largest collection of open access heart sound data, provides essential resources for evaluators who need to test their algorithms with realistic data and share reproducible results.

  18. Word-level recognition of multifont Arabic text using a feature vector matching approach

    NASA Astrophysics Data System (ADS)

    Erlandson, Erik J.; Trenkle, John M.; Vogt, Robert C., III

    1996-03-01

    Many text recognition systems recognize text imagery at the character level and assemble words from the recognized characters. An alternative approach is to recognize text imagery at the word level, without analyzing individual characters. This approach avoids the problem of individual character segmentation, and can overcome local errors in character recognition. A word-level recognition system for machine-printed Arabic text has been implemented. Arabic is a script language, and is therefore difficult to segment at the character level. Character segmentation has been avoided by recognizing text imagery of complete words. The Arabic recognition system computes a vector of image-morphological features on a query word image. This vector is matched against a precomputed database of vectors from a lexicon of Arabic words. Vectors from the database with the highest match score are returned as hypotheses for the unknown image. Several feature vectors may be stored for each word in the database. Database feature vectors generated using multiple fonts and noise models allow the system to be tuned to its input stream. Used in conjunction with database pruning techniques, this Arabic recognition system has obtained promising word recognition rates on low-quality multifont text imagery.

  19. Using Bitmap Indexing Technology for Combined Numerical and Text Queries

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Stockinger, Kurt; Cieslewicz, John; Wu, Kesheng

    2006-10-16

    In this paper, we describe a strategy of using compressed bitmap indices to speed up queries on both numerical data and text documents. By using an efficient compression algorithm, these compressed bitmap indices are compact even for indices with millions of distinct terms. Moreover, bitmap indices can be used very efficiently to answer Boolean queries over text documents involving multiple query terms. Existing inverted indices for text searches are usually inefficient for corpora with a very large number of terms as well as for queries involving a large number of hits. We demonstrate that our compressed bitmap index technology overcomes both of those shortcomings. In a performance comparison against a commonly used database system, our indices answer queries 30 times faster on average. To provide full SQL support, we integrated our indexing software, called FastBit, with MonetDB. The integrated system MonetDB/FastBit provides not only efficient searches on a single table, as FastBit does, but also answers join queries efficiently. Furthermore, MonetDB/FastBit also provides a very efficient retrieval mechanism for result records.
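    The core idea can be sketched with uncompressed bitmaps, one per term, where bit i is set when document i contains the term and bitwise AND/OR answers Boolean queries; FastBit additionally compresses the bitmaps, which this sketch omits.

```python
# Toy bitmap index: Python integers serve as bit sets, one per term.
docs = [
    "shock wave propagation in dense plasma",
    "numerical simulation of shock fronts",
    "database indexing with compressed bitmaps",
]

index = {}
for i, text in enumerate(docs):
    for term in set(text.split()):
        index[term] = index.get(term, 0) | (1 << i)

def matches(bitmap, n_docs):
    return [i for i in range(n_docs) if bitmap >> i & 1]

# Boolean query: documents containing both "shock" AND "propagation"
result = index.get("shock", 0) & index.get("propagation", 0)
print(matches(result, len(docs)))   # -> [0]
```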

  20. Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II.

    PubMed

    Lu, Zhiyong; Hirschman, Lynette

    2012-01-01

    Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used and the current and desired uses of text mining for biocuration. Compared to a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. DATABASE URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/.

  1. [Exploration and construction of the full-text database of acupuncture literature in the Republic of China].

    PubMed

    Fei, Lin; Zhao, Jing; Leng, Jiahao; Zhang, Shujian

    2017-10-12

    The ALIPORC full-text database is a specialized full-text database of acupuncture literature from the Republic of China period. Since its start in 2015, the database has been progressively completed, focusing on acupuncture-related books, articles and advertising documents written or published in the Republic of China. The construction of this database aims to enable the sharing of acupuncture medical literature of the Republic of China through diverse retrieval approaches and accurate content presentation; it facilitates exchange among scholars, reduces the paper damage caused by repeated page-turning, and simplifies the retrieval of rare literature. The authors describe the database in terms of its sources, characteristics and current state of construction, and discuss how to improve the efficiency and integrity of the database and to deepen the study of acupuncture literature of the Republic of China.

  2. Sleep atlas and multimedia database.

    PubMed

    Penzel, T; Kesper, K; Mayer, G; Zulley, J; Peter, J H

    2000-01-01

    The ENN sleep atlas and database was set up on a dedicated server connected to the internet thus providing all services such as WWW, ftp and telnet access. The database serves as a platform to promote the goals of the European Neurological Network, to exchange patient cases for second opinion between experts and to create a case-oriented multimedia sleep atlas with descriptive text, images and video-clips of all known sleep disorders. The sleep atlas consists of a small public and a large private part for members of the consortium. 20 patient cases were collected and presented with educational information similar to published case reports. Case reports are complemented with images, video-clips and biosignal recordings. A Java based viewer for biosignals provided in EDF format was installed in order to move free within the sleep recordings without the need to download the full recording on the client.

  3. β-secretase inhibitors for Alzheimer's disease: identification using pharmacoinformatics.

    PubMed

    Islam, Md Ataul; Pillay, Tahir S

    2018-02-01

    In this study we searched for potential β-site amyloid precursor protein cleaving enzyme 1 (BACE1) inhibitors using pharmacoinformatics. A large dataset containing 7155 known BACE1 inhibitors was evaluated for pharmacophore model generation. The final model (R = 0.950, RMSD = 1.094, Q2 = 0.901, se = 0.332, [Formula: see text] = 0.901, [Formula: see text] = 0.756, sp = 0.468, [Formula: see text] = 0.667) highlighted the importance of the spatial arrangement of hydrogen bond acceptor and donor, hydrophobicity and aromatic ring features. The validated model was then used to search the NCI and InterBioscreen databases for promising BACE1 inhibitors. The initial hits from both databases were sorted using a number of criteria, and finally three molecules from each database were considered for further validation using molecular docking and molecular dynamics studies. Different protonation states of the Asp32 and Asp228 dyad were analysed, and the best protonated form was used for the molecular docking study. The number of binding interactions observed in the molecular docking study supported the potential of these molecules as promising inhibitors. Values of RMSD, RMSF and Rg in the molecular dynamics study, together with the binding energies, clearly indicated that the final screened molecules formed stable complexes inside the receptor cavity of BACE1. Hence, it can be concluded that the six final screened compounds may be potential therapeutic agents for Alzheimer's disease.

  4. Identifying duplicate content using statistically improbable phrases

    PubMed Central

    Errami, Mounir; Sun, Zhaohui; George, Angela C.; Long, Tara C.; Skinner, Michael A.; Wren, Jonathan D.; Garner, Harold R.

    2010-01-01

    Motivation: Document similarity metrics such as PubMed's ‘Find related articles’ feature, which have been primarily used to identify studies with similar topics, can now also be used to detect duplicated or potentially plagiarized papers within literature reference databases. However, the CPU-intensive nature of document comparison has limited MEDLINE text similarity studies to the comparison of abstracts, which constitute only a small fraction of a publication's total text. Extending searches to include text archived by online search engines would drastically increase comparison ability. For large-scale studies, submitting short phrases encased in direct quotes to search engines for exact matches would be optimal for both individual queries and programmatic interfaces. We have derived a method of analyzing statistically improbable phrases (SIPs) for assistance in identifying duplicate content. Results: When applied to MEDLINE citations, this method substantially improves upon previous algorithms in the detection of duplicate citations, yielding a precision and recall of 78.9% (versus 50.3% for eTBLAST) and 99.6% (versus 99.8% for eTBLAST), respectively. Availability: Similar citations identified by this work are freely accessible in the Déjà vu database, under the SIP discovery method category, at http://dejavu.vbi.vt.edu/dejavu/ Contact: merrami@collin.edu PMID:20472545
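
    A rough sense of what a statistically improbable phrase is (this is not the authors' algorithm, only a naive illustration with an invented n-gram length and background corpus) can be given in a few lines of Python: collect a document's n-grams and keep those that are rare or absent in a background collection; each surviving phrase could then be submitted in quotes to a search engine to look for exact matches elsewhere.

        from collections import Counter

        def statistically_improbable_phrases(document, background_docs, n=4, max_background_count=0):
            """Return n-grams of the document that are rare or absent in a background corpus."""
            def ngrams(text):
                words = text.lower().split()
                return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

            background_counts = Counter(g for doc in background_docs for g in ngrams(doc))
            return sorted(g for g in set(ngrams(document))
                          if background_counts[g] <= max_background_count)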

  5. Subject Retrieval from Full-Text Databases in the Humanities

    ERIC Educational Resources Information Center

    East, John W.

    2007-01-01

    This paper examines the problems involved in subject retrieval from full-text databases of secondary materials in the humanities. Ten such databases were studied and their search functionality evaluated, focusing on factors such as Boolean operators, document surrogates, limiting by subject area, proximity operators, phrase searching, wildcards,…

  6. Towards computational improvement of DNA database indexing and short DNA query searching.

    PubMed

    Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska

    2014-09-03

    In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach that identifies all query hits in the database without having to examine all entries in the indexed data structure, while limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported if the database is searched against a query shorter than [Formula: see text] nucleotides, such that [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution to this drawback is also presented.
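
    The authors' indexing equation appears only as formulas in the abstract above, so it is not reproduced here; as a loose illustration of the general approach (index the database's fixed-length words with a mapping function, then look up a query's leading word instead of scanning every entry), a minimal Python sketch might look as follows, with an invented toy sequence and word length.

        from collections import defaultdict

        def build_index(database, word_len):
            """Map every fixed-length word of the database sequence to its start positions."""
            index = defaultdict(list)
            for pos in range(len(database) - word_len + 1):
                index[database[pos:pos + word_len]].append(pos)
            return index

        def search(index, database, query, word_len):
            """Report start positions of exact query hits by looking up only the query's leading word."""
            if len(query) < word_len:
                raise ValueError("query shorter than the indexed word length")
            return [pos for pos in index.get(query[:word_len], [])
                    if database[pos:pos + len(query)] == query]

        db = "ACGTACGTTAGCACGT"
        idx = build_index(db, word_len=4)
        print(search(idx, db, "ACGTT", word_len=4))  # hits of the query in the toy sequence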

  7. Text mining in livestock animal science: introducing the potential of text mining to animal sciences.

    PubMed

    Sahadevan, S; Hofmann-Apitius, M; Schellander, K; Tesfaye, D; Fluck, J; Friedrich, C M

    2012-10-01

    In biological research, establishing the prior art by searching and collecting information already present in the domain is as important as the experiments themselves. To obtain a complete overview of the relevant knowledge, researchers mainly rely on 2 major information sources: i) various biological databases and ii) scientific publications in the field. The major difference between the 2 information sources is that information from databases is readily available and typically well structured and condensed. The information content in scientific literature is vastly unstructured; that is, dispersed among the many different sections of scientific text. The traditional method of information extraction from scientific literature is to generate a list of relevant publications in the field of interest and manually scan these texts for relevant information, which is very time consuming. It is more than likely that in using this "classical" approach the researcher misses some relevant information mentioned in the literature or has to go through biological databases to extract further information. Text mining and named entity recognition methods have already been used in human genomics and related fields as a solution to this problem. These methods can process and extract information from large volumes of scientific text. Text mining is defined as the automatic extraction of previously unknown and potentially useful information from text. Named entity recognition (NER) is defined as the method of identifying named entities (names of real-world objects; for example, gene/protein names, drugs, enzymes) in text. In animal sciences, text mining and related methods have so far been used only briefly in murine genomics and associated fields, leaving behind other fields of animal sciences such as livestock genomics. The aim of this work was to develop an information retrieval platform in the livestock domain focusing on livestock publications and the recognition of relevant data from cattle and pigs. For this purpose, the rather noncomprehensive pig and cattle gene and protein terminologies were enriched with orthologue synonyms and integrated into the NER platform ProMiner, which has been used successfully in the human genomics domain. In the performance tests done, the present system achieved a fair performance with a precision of 0.64, recall of 0.74 and F1 measure of 0.69 in a test scenario based on cattle literature.
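
    ProMiner itself is a dictionary- and rule-based system whose internals are not described in the abstract above; purely as a loose illustration of dictionary-based named entity recognition with synonym (e.g. orthologue) enrichment, the following Python sketch matches a toy terminology against text. All gene symbols, synonyms and the matching strategy are invented for the example.

        import re

        # Toy terminology: canonical gene symbol -> synonyms (orthologue names would be added here).
        terminology = {
            "IGF2": ["IGF2", "insulin-like growth factor 2", "IGF-II"],
            "MSTN": ["MSTN", "myostatin", "GDF8"],
        }

        # One case-insensitive, word-bounded pattern per synonym, longest synonyms first.
        patterns = sorted(
            ((re.compile(r"\b" + re.escape(s) + r"\b", re.IGNORECASE), gene)
             for gene, syns in terminology.items() for s in syns),
            key=lambda p: -len(p[0].pattern),
        )

        def recognize(text):
            """Return (matched text, canonical symbol, character span) for every dictionary hit."""
            hits = []
            for pattern, gene in patterns:
                for m in pattern.finditer(text):
                    hits.append((m.group(0), gene, m.span()))
            return sorted(hits, key=lambda h: h[2])

        print(recognize("Myostatin (GDF8) variants affect muscling in cattle."))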

  8. Text mining facilitates database curation - extraction of mutation-disease associations from Bio-medical literature.

    PubMed

    Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng; Kocher, Jean-Pierre; Liu, Hongfang

    2015-06-06

    Advances in next-generation sequencing technology have accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of this information remains 'locked' in the unstructured text of biomedical publications. There is a substantial lag between publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on sentence-level association extraction with performance evaluation based on gold-standard text annotations specifically prepared for text mining systems. We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse-level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation-disease associations in curated database records. The discourse-level analysis component of MutD contributed a gain of more than 10% in F-measure when compared against sentence-level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for the defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Our quantitative analysis reveals that MutD can effectively extract protein mutation-disease associations when benchmarking based on curated database records. The analysis also demonstrates that incorporating discourse-level analysis significantly improved the performance of extracting the protein-mutation-disease association. Future work includes the extension of MutD to full-text articles.
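
    MutD's pipeline itself is not reproduced here; as a hedged, minimal illustration of one early step any mutation-disease extractor needs (spotting protein point-mutation mentions in text), the regular expressions below cover the common one-letter (e.g. V600E) and three-letter (e.g. p.Val600Glu) notations. The patterns are deliberately simplified and would miss many real-world variant spellings.

        import re

        # One-letter notation, e.g. "V600E"; three-letter HGVS-style notation, e.g. "p.Val600Glu".
        ONE_LETTER = r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b"
        THREE_LETTER = r"\bp\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\b"
        MUTATION = re.compile(f"{ONE_LETTER}|{THREE_LETTER}")

        def find_mutations(sentence):
            """Return raw protein point-mutation mentions found in a sentence."""
            return [m.group(0) for m in MUTATION.finditer(sentence)]

        print(find_mutations("The BRAF V600E mutation (p.Val600Glu) is associated with melanoma."))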

  9. MetReS, an Efficient Database for Genomic Applications.

    PubMed

    Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc

    2018-02-01

    MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study started with a relational database, MySQL, which is the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop, which is widely used for large-scale data processing. We found that this data-handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset and the processing needs increase by several orders of magnitude, as is expected to happen in the near future.

  10. Searching Harvard Business Review Online. . . Lessons in Searching a Full Text Database.

    ERIC Educational Resources Information Center

    Tenopir, Carol

    1985-01-01

    This article examines the Harvard Business Review Online (HBRO) database (bibliographic description fields, abstracts, extracted information, full text, subject descriptors) and reports on 31 sample HBRO searches conducted in Bibliographic Retrieval Services to test differences between searching full text and searching bibliographic record. Sample…

  11. Extracting Databases from Dark Data with DeepDive.

    PubMed

    Zhang, Ce; Shin, Jaeho; Ré, Christopher; Cafarella, Michael; Niu, Feng

    2016-01-01

    DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.

  12. Pedoinformatics Approach to Soil Text Analytics

    NASA Astrophysics Data System (ADS)

    Furey, J.; Seiter, J.; Davis, A.

    2017-12-01

    The several extant schema for the classification of soils rely on differing criteria, but the major soil science taxonomies, including the United States Department of Agriculture (USDA) and the international harmonized World Reference Base for Soil Resources systems, are based principally on inferred pedogenic properties. These taxonomies largely result from compiled individual observations of soil morphologies within soil profiles, and the vast majority of this pedologic information is contained in qualitative text descriptions. We present text mining analyses of hundreds of gigabytes of parsed text and other data in the digitally available USDA soil taxonomy documentation, the Soil Survey Geographic (SSURGO) database, and the National Cooperative Soil Survey (NCSS) soil characterization database. These analyses implemented iPython calls to Gensim modules for topic modelling, with latent semantic indexing completed down to the lowest taxon level (soil series) paragraphs. Via a custom extension of the Natural Language Toolkit (NLTK), approximately one percent of the USDA soil series descriptions were used to train a classifier for the remainder of the documents, essentially by treating soil science words as comprising a novel language. While location-specific descriptors at the soil series level are amenable to geomatics methods, unsupervised clustering of the occurrence of other soil science words did not closely follow the usual hierarchy of soil taxa. We present preliminary phrasal analyses that may account for some of these effects.
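
    The abstract above names the tooling (iPython, Gensim, NLTK) but includes no code; a minimal sketch of the kind of Gensim latent-semantic-indexing workflow it describes (tokenized documents, dictionary, bag-of-words corpus, LSI model) could look like the following. The example paragraphs and the number of topics are placeholders, not the study's actual data or settings.

        from gensim import corpora, models

        # Placeholder soil-series description paragraphs; the real corpus is parsed from
        # the USDA taxonomy documentation, SSURGO and NCSS databases.
        documents = [
            "very deep well drained soils formed in loess on uplands",
            "shallow somewhat excessively drained soils formed in residuum from sandstone",
            "deep poorly drained soils formed in alluvium on flood plains",
        ]
        texts = [doc.lower().split() for doc in documents]

        dictionary = corpora.Dictionary(texts)                 # token -> integer id
        corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

        # Latent semantic indexing with an illustrative number of topics.
        lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
        for topic_id, topic in lsi.print_topics(num_topics=2, num_words=5):
            print(topic_id, topic)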

  13. Scaling laws and fluctuations in the statistics of word frequencies

    NASA Astrophysics Data System (ADS)

    Gerlach, Martin; Altmann, Eduardo G.

    2014-11-01

    In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps' law). Analyzing the fluctuations around this average in three large databases (Google-ngram, English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylor's law), in contrast to the prediction of decaying fluctuations obtained using simple sampling arguments. We explain both scaling laws (Heaps' and Taylor's) by modeling the usage of words using a Poisson process with a fat-tailed distribution of word frequencies (Zipf's law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimations of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties in measurements of the lexical richness of texts with different lengths.
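
    A quick way to see the baseline the paper argues against is to simulate the 'simple sampling' null model: draw tokens independently from a Zipf-distributed vocabulary and watch the number of distinct words grow sublinearly with text length while its spread across realizations stays small. The Python sketch below does that; the vocabulary size, Zipf exponent and text lengths are arbitrary choices, and this is the null model, not the paper's topic-dependent model.

        import numpy as np

        rng = np.random.default_rng(0)

        def vocabulary_size(n_words, n_types=50_000, zipf_exponent=1.0):
            """Sample n_words tokens from a Zipf-like frequency distribution and count distinct word types."""
            ranks = np.arange(1, n_types + 1)
            probs = ranks ** (-zipf_exponent)
            probs /= probs.sum()
            tokens = rng.choice(n_types, size=n_words, p=probs)
            return len(np.unique(tokens))

        # Under independent sampling, the mean vocabulary grows sublinearly with text length
        # (Heaps' law) while fluctuations around that mean remain modest.
        for n in (10_000, 100_000, 1_000_000):
            sizes = [vocabulary_size(n) for _ in range(5)]
            print(n, np.mean(sizes), np.std(sizes))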

  14. AccuNet/AP (Associated Press) Multimedia Archive

    ERIC Educational Resources Information Center

    Young, Terrence E., Jr.

    2004-01-01

    The AccuNet/AP Multimedia Archive is an electronic library containing the AP's current photos and a selection of pictures from their enormous print and negative library, as well as text and graphic material. It is composed of two photo databases as well as graphics, text, and audio databases. The features of this database are briefly described in…

  15. Mobile Phone Interventions for the Secondary Prevention of Cardiovascular Disease.

    PubMed

    Park, Linda G; Beatty, Alexis; Stafford, Zoey; Whooley, Mary A

    2016-01-01

    Mobile health in the form of text messaging and mobile applications provides an innovative and effective approach to promote prevention and management of cardiovascular disease (CVD); however, the magnitude of these effects is unclear. Through a comprehensive search of databases from 2002-2016, we conducted a quantitative systematic review. The selected studies were critically evaluated to extract and summarize pertinent characteristics and outcomes. A large majority of studies (22 of 28, 79%) demonstrated text messaging, mobile applications, and telemonitoring via mobile phones were effective in improving outcomes. Some key factors associated with successful interventions included personalized messages with tailored advice, greater engagement (2-way text messaging, higher frequency of messages), and use of multiple modalities. Overall, text messaging appears more effective than smartphone-based interventions. Incorporating principles of behavioral activation will help promote and sustain healthy lifestyle behaviors in patients with CVD that result in improved clinical outcomes. Copyright © 2016 Elsevier Inc. All rights reserved.

  16. Mobile Phone Interventions for the Secondary Prevention of Cardiovascular Disease

    PubMed Central

    Park, Linda G.; Beatty, Alexis; Stafford, Zoey; Whooley, Mary A.

    2016-01-01

    Mobile health in the form of text messaging and mobile applications provides an innovative and effective approach to promote prevention and management of cardiovascular disease (CVD); however, the magnitude of these effects is unclear. Through a comprehensive search of databases from 2002–2016, we conducted a quantitative systematic review. The selected studies were critically evaluated to extract and summarize pertinent characteristics and outcomes. A large majority of studies (22 of 28, 79%) demonstrated text messaging, mobile applications, and telemonitoring via mobile phones were effective in improving outcomes. Some key factors associated with successful interventions included personalized messages with tailored advice, greater engagement (2-way text messaging, higher frequency of messages), and use of multiple modalities. Overall, text messaging appears more effective than smartphone-based interventions. Incorporating principles of behavioral activation will help promote and sustain healthy lifestyle behaviors in patients with CVD that result in improved clinical outcomes. PMID:27001245

  17. Bilingual Text Messaging Translation: Translating Text Messages From English Into Spanish for the Text4Walking Program

    PubMed Central

    Sandi, Giselle; Ingram, Diana; Welch, Mary Jane; Ocampo, Edith V

    2015-01-01

    Background Hispanic adults in the United States are at particular risk for diabetes and inadequate blood pressure control. Physical activity improves these health problems; however, Hispanic adults also have a low rate of recommended aerobic physical activity. To address physical inactivity, one rapidly growing technology that can be utilized is text messaging (short message service, SMS). A physical activity research team, Text4Walking, had previously developed an initial database of motivational physical activity text messages in English that could be used for physical activity text messaging interventions. However, the team needed to translate these existing English physical activity text messages into Spanish in order to have culturally meaningful and useful text messages for those adults within the Hispanic population who would prefer to receive text messages in Spanish. Objective The aim of this study was to translate a database of English motivational physical activity messages into Spanish and review these text messages with a group of Spanish-speaking adults to inform the use of these text messages in an intervention study. Methods The consent form and study documents, including the existing English physical activity text messages, were translated from English into Spanish, and received translation certification as well as Institutional Review Board approval. The translated text messages were placed into PowerPoint, accompanied by a set of culturally appropriate photos depicting barriers to walking, as well as walking scenarios. At the focus group, eligibility criteria for this study included being an adult between 30 and 65 years old who spoke Spanish as their primary language. After a general group introduction, participants were placed into smaller groups of two or three. Each small group was asked to review a segment of the translated text messages for accuracy and meaningfulness. After the breakout session, the group was brought back together to review the text messages. Results A translation confirmation group met at a church site in an urban community with a large population of Hispanics. Spanish-speaking adults (N=8), with a mean age of 40 (SD 6.3), participated in the study. Participants were engaged in the group and viewed the text messages as culturally appropriate. They also thought that text messages could motivate them to walk more. Twenty-two new text messages were added to the original database of 246 translated text messages. While the text messages were generally understood, specific word preferences were seen related to personal preference, dialect, and level of formality, which resulted in minor revisions to four text messages. Conclusions The English text messages were successfully translated into Spanish by a bilingual research staff and reviewed by Hispanic participants in order to inform the use of these text messages for future intervention studies. These Spanish text messages were recently used in a Text4Walking intervention study. PMID:25947953

  18. GeneView: a comprehensive semantic search engine for PubMed.

    PubMed

    Thomas, Philippe; Starlinger, Johannes; Vowinkel, Alexander; Arzt, Sebastian; Leser, Ulf

    2012-07-01

    Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. This plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces overly large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein-protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271,000 full-text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.

  19. Mynodbcsv: lightweight zero-config database solution for handling very large CSV files.

    PubMed

    Adaszewski, Stanisław

    2014-01-01

    Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, their format is often the first obstacle. Lack of standardized ways of exploring different data layouts requires an effort each time to solve the problem from scratch. Possibility to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL) would offer expressiveness and user-friendliness. Comma-separated values (CSV) are one of the most common data storage formats. Despite its simplicity, with growing file size handling it becomes non-trivial. Importing CSVs into existing databases is time-consuming and troublesome, or even impossible if its horizontal dimension reaches thousands of columns. Most databases are optimized for handling large number of rows rather than columns, therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It's characterized by: "no copy" approach--data stay mostly in the CSV files; "zero configuration"--no need to specify database schema; written in C++, with boost [1], SQLite [2] and Qt [3], doesn't require installation and has very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files ensure efficient plan execution; effortless support for millions of columns; due to per-value typing, using mixed text/numbers data is easy; very simple network protocol provides efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It doesn't need any prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results.
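
    Mynodbcsv's own query-rewriting engine is not shown in the abstract; for contrast, the short Python sketch below illustrates the conventional 'import first, then query' workflow (here into an in-memory SQLite database) whose copying and schema-creation overhead the no-copy design is meant to avoid. The file name, table name and query are illustrative.

        import csv
        import sqlite3

        def query_csv(path, sql):
            """Conventional workflow: copy the CSV into an in-memory SQLite table 't', then run SQL."""
            with open(path, newline="") as f:
                rows = list(csv.reader(f))
            header, data = rows[0], rows[1:]
            con = sqlite3.connect(":memory:")
            columns = ", ".join(f'"{name}"' for name in header)
            placeholders = ", ".join("?" for _ in header)
            con.execute(f"CREATE TABLE t ({columns})")
            con.executemany(f"INSERT INTO t VALUES ({placeholders})", data)
            return con.execute(sql).fetchall()

        # Example (illustrative file and query):
        # print(query_csv("measurements.csv", "SELECT COUNT(*) FROM t"))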

  20. Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files

    PubMed Central

    Adaszewski, Stanisław

    2014-01-01

    Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, their format is often the first obstacle. Lack of standardized ways of exploring different data layouts requires an effort each time to solve the problem from scratch. Possibility to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL) would offer expressiveness and user-friendliness. Comma-separated values (CSV) are one of the most common data storage formats. Despite its simplicity, with growing file size handling it becomes non-trivial. Importing CSVs into existing databases is time-consuming and troublesome, or even impossible if its horizontal dimension reaches thousands of columns. Most databases are optimized for handling large number of rows rather than columns, therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It's characterized by: “no copy” approach – data stay mostly in the CSV files; “zero configuration” – no need to specify database schema; written in C++, with boost [1], SQLite [2] and Qt [3], doesn't require installation and has very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files ensure efficient plan execution; effortless support for millions of columns; due to per-value typing, using mixed text/numbers data is easy; very simple network protocol provides efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It doesn't need any prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results. PMID:25068261

  1. A review on computational systems biology of pathogen–host interactions

    PubMed Central

    Durmuş, Saliha; Çakır, Tunahan; Özgür, Arzucan; Guthke, Reinhard

    2015-01-01

    Pathogens manipulate the cellular mechanisms of host organisms via pathogen–host interactions (PHIs) in order to take advantage of the capabilities of host cells, leading to infections. The crucial role of these interspecies molecular interactions in initiating and sustaining infections necessitates a thorough understanding of the corresponding mechanisms. Unlike the traditional approach of considering the host or pathogen separately, a systems-level approach, considering the PHI system as a whole is indispensable to elucidate the mechanisms of infection. Following the technological advances in the post-genomic era, PHI data have been produced in large-scale within the last decade. Systems biology-based methods for the inference and analysis of PHI regulatory, metabolic, and protein–protein networks to shed light on infection mechanisms are gaining increasing demand thanks to the availability of omics data. The knowledge derived from the PHIs may largely contribute to the identification of new and more efficient therapeutics to prevent or cure infections. There are recent efforts for the detailed documentation of these experimentally verified PHI data through Web-based databases. Despite these advances in data archiving, there are still large amounts of PHI data in the biomedical literature yet to be discovered, and novel text mining methods are in development to unearth such hidden data. Here, we review a collection of recent studies on computational systems biology of PHIs with a special focus on the methods for the inference and analysis of PHI networks, covering also the Web-based databases and text-mining efforts to unravel the data hidden in the literature. PMID:25914674

  2. Application and Effects of Linguistic Functions on Information Retrieval in a German Language Full-Text Database: Comparison between Retrieval in Abstract and Full Text.

    ERIC Educational Resources Information Center

    Tauchert, Wolfgang; And Others

    1991-01-01

    Describes the PADOK-II project in Germany, which was designed to give information on the effects of linguistic algorithms on retrieval in a full-text database, the German Patent Information System (GPI). Relevance assessments are discussed, statistical evaluations are described, and searches are compared for the full-text section versus the…

  3. Metabolome searcher: a high throughput tool for metabolite identification and metabolic pathway mapping directly from mass spectrometry and using genome restriction.

    PubMed

    Dhanasekaran, A Ranjitha; Pearson, Jon L; Ganesan, Balasubramanian; Weimer, Bart C

    2015-02-25

    Mass spectrometric analysis of microbial metabolism provides a long list of possible compounds. Restricting the identification of the possible compounds to those produced by the specific organism would benefit the identification process. Currently, identification of mass spectrometry (MS) data is commonly done using empirically derived compound databases. Unfortunately, most databases contain relatively few compounds, leaving long lists of unidentified molecules. Incorporating genome-encoded metabolism enables MS output identification that may not be included in databases. Using an organism's genome as a database restricts metabolite identification to only those compounds that the organism can produce. To address the challenge of metabolomic analysis from MS data, a web-based application to directly search genome-constructed metabolic databases was developed. The user query returns a genome-restricted list of possible compound identifications along with the putative metabolic pathways based on the name, formula, SMILES structure, and the compound mass as defined by the user. Multiple queries can be done simultaneously by submitting a text file created by the user or obtained from the MS analysis software. The user can also provide parameters specific to the experiment's MS analysis conditions, such as mass deviation, adducts, and detection mode during the query so as to provide additional levels of evidence to produce the tentative identification. The query results are provided as an HTML page and downloadable text file of possible compounds that are restricted to a specific genome. Hyperlinks provided in the HTML file connect the user to the curated metabolic databases housed in ProCyc, a Pathway Tools platform, as well as the KEGG Pathway database for visualization and metabolic pathway analysis. Metabolome Searcher, a web-based tool, facilitates putative compound identification of MS output based on genome-restricted metabolic capability. This enables researchers to rapidly extend the possible identifications of large data sets for metabolites that are not in compound databases. Putative compound names with their associated metabolic pathways from metabolomics data sets are returned to the user for additional biological interpretation and visualization. This novel approach enables compound identification by restricting the possible masses to those encoded in the genome.
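
    The Metabolome Searcher's actual matching logic is exposed only through its web interface; as a schematic of the core idea described above (restrict candidate identifications to a genome-derived compound list and match the observed mass within a user-specified tolerance for a given adduct), consider the following Python sketch. The compound list, adduct handling and tolerance are illustrative, not the tool's implementation.

        # Genome-restricted compound list: only metabolites the organism's genome encodes pathways for.
        # Masses are monoisotopic values in daltons and are illustrative.
        genome_compounds = {
            "L-lactate": 90.0317,
            "pyruvate": 88.0160,
            "succinate": 118.0266,
        }

        PROTON = 1.007276  # mass shift for the [M+H]+ adduct

        def match_mass(observed_mz, adduct_shift=PROTON, ppm_tol=10.0):
            """Return compounds whose adduct mass matches the observed m/z within ppm_tol."""
            hits = []
            for name, mass in genome_compounds.items():
                expected = mass + adduct_shift
                ppm_error = abs(observed_mz - expected) / expected * 1e6
                if ppm_error <= ppm_tol:
                    hits.append((name, round(ppm_error, 2)))
            return hits

        print(match_mass(91.0390))  # an [M+H]+ ion consistent with L-lactate in this toy list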

  4. ELNET--The Electronic Library Database System.

    ERIC Educational Resources Information Center

    King, Shirley V.

    1991-01-01

    ELNET (Electronic Library Network), a Japanese language database, allows searching of index terms and free text terms from articles and stores the full text of the articles on an optical disc system. Users can order fax copies of the text from the optical disc. This article also explains online searching and discusses machine translation. (LRW)

  5. The Weaknesses of Full-Text Searching

    ERIC Educational Resources Information Center

    Beall, Jeffrey

    2008-01-01

    This paper provides a theoretical critique of the deficiencies of full-text searching in academic library databases. Because full-text searching relies on matching words in a search query with words in online resources, it is an inefficient method of finding information in a database. This matching fails to retrieve synonyms, and it also retrieves…

  6. Using text mining to link journal articles to neuroanatomical databases

    PubMed Central

    French, Leon; Pavlidis, Paul

    2013-01-01

    The electronic linking of neuroscience information, including data embedded in the primary literature, would permit powerful queries and analyses driven by structured databases. This task would be facilitated by automated procedures which can identify biological concepts in journals. Here we apply an approach for automatically mapping formal identifiers of neuroanatomical regions to text found in journal abstracts, and apply it to a large body of abstracts from the Journal of Comparative Neurology (JCN). The analyses yield over one hundred thousand brain region mentions which we map to 8,225 brain region concepts in multiple organisms. Based on the analysis of a manually annotated corpus, we estimate mentions are mapped at 95% precision and 63% recall. Our results provide insights into the patterns of publication on brain regions and species of study in the Journal, but also point to important challenges in the standardization of neuroanatomical nomenclatures. We find that many terms in the formal terminologies never appear in a JCN abstract, while conversely, many terms authors use are not reflected in the terminologies. To improve the terminologies we deposited 136 unrecognized brain regions into the Neuroscience Lexicon (NeuroLex). The training data, terminologies, normalizations, evaluations and annotated journal abstracts are freely available at http://www.chibi.ubc.ca/WhiteText/. PMID:22120205

  7. Database tomography for commercial application

    NASA Technical Reports Server (NTRS)

    Kostoff, Ronald N.; Eberhart, Henry J.

    1994-01-01

    Database tomography is a method for extracting themes and their relationships from text. The algorithms employed begin with word frequency and word proximity analysis and build upon these results. When the word 'database' is used, think of medical or police records, patents, journals, or papers (any text information that can be stored on a computer). Database tomography features a full-text, user-interactive technique enabling the user to identify areas of interest, establish relationships, and map trends for a deeper understanding of an area of interest. Database tomography concepts and applications have been reported in journals and presented at conferences. One important feature of the database tomography algorithm is that it can be used on a database of any size and will facilitate the user's ability to understand the volume of content therein. While employing the process to identify research opportunities it became obvious that this promising technology has potential applications for business, science, engineering, law, and academe. Examples include evaluating marketing trends, strategies, relationships and associations. The database tomography process would also be a powerful component in the areas of competitive intelligence, national security intelligence and patent analysis. The importance of user interests and involvement cannot be overemphasized.
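
    The report gives no implementation details, so the sketch below is only a rough Python illustration of the two first steps it names, word frequency and word proximity analysis, using a sliding co-occurrence window; the window size and sample sentence are arbitrary.

        from collections import Counter

        def frequency_and_proximity(text, window=5):
            """Count word frequencies and co-occurrences of word pairs within a sliding window."""
            words = text.lower().split()
            freq = Counter(words)
            proximity = Counter()
            for i in range(len(words)):
                for j in range(i + 1, min(i + window, len(words))):
                    proximity[tuple(sorted((words[i], words[j])))] += 1
            return freq, proximity

        freq, prox = frequency_and_proximity(
            "database tomography extracts themes from text by word frequency and word proximity analysis"
        )
        print(freq.most_common(3))
        print(prox.most_common(3))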

  8. Full-Text Linking: Affiliated versus Nonaffiliated Access in a Free Database.

    ERIC Educational Resources Information Center

    Grogg, Jill E.; Andreadis, Debra K.; Kirk, Rachel A.

    2002-01-01

    Presents a comparison of access to full-text articles from a free bibliographic database (PubSCIENCE) for affiliated and unaffiliated users. Found that affiliated users had access to more full-text articles than unaffiliated users had, and that both types of users could increase their level of access through additional searching and greater…

  9. Cataloging the biomedical world of pain through semi-automated curation of molecular interactions

    PubMed Central

    Jamieson, Daniel G.; Roberts, Phoebe M.; Robertson, David L.; Sidders, Ben; Nenadic, Goran

    2013-01-01

    The vast collection of biomedical literature and its continued expansion has presented a number of challenges to researchers who require structured findings to stay abreast of and analyze molecular mechanisms relevant to their domain of interest. By structuring literature content into topic-specific machine-readable databases, the aggregate data from multiple articles can be used to infer trends that can be compared and contrasted with similar findings from topic-independent resources. Our study presents a generalized procedure for semi-automatically creating a custom topic-specific molecular interaction database through the use of text mining to assist manual curation. We apply the procedure to capture molecular events that underlie ‘pain’, a complex phenomenon with a large societal burden and unmet medical need. We describe how existing text mining solutions are used to build a pain-specific corpus, extract molecular events from it, add context to the extracted events and assess their relevance. The pain-specific corpus contains 765 692 documents from Medline and PubMed Central, from which we extracted 356 499 unique normalized molecular events, with 261 438 single protein events and 93 271 molecular interactions supplied by BioContext. Event chains are annotated with negation, speculation, anatomy, Gene Ontology terms, mutations, pain and disease relevance, which collectively provide detailed insight into how that event chain is associated with pain. The extracted relations are visualized in a wiki platform (wiki-pain.org) that enables efficient manual curation and exploration of the molecular mechanisms that underlie pain. Curation of 1500 grouped event chains ranked by pain relevance revealed 613 accurately extracted unique molecular interactions that in the future can be used to study the underlying mechanisms involved in pain. Our approach demonstrates that combining existing text mining tools with domain-specific terms and wiki-based visualization can facilitate rapid curation of molecular interactions to create a custom database. Database URL: ••• PMID:23707966

  10. Machine learning approaches to analysing textual injury surveillance data: a systematic review.

    PubMed

    Vallmuur, Kirsten

    2015-06-01

    To synthesise recent research on the use of machine learning approaches to mining textual injury surveillance data. Systematic review. The electronic databases which were searched included PubMed, Cinahl, Medline, Google Scholar, and Proquest. The bibliography of all relevant articles was examined and associated articles were identified using a snowballing technique. For inclusion, articles were required to meet the following criteria: (a) used a health-related database, (b) focused on injury-related cases, and (c) used machine learning approaches to analyse textual data. The papers identified through the search were screened, resulting in 16 papers being selected for review. Articles were reviewed to describe the databases and methodology used, the strengths and limitations of different techniques, and quality assurance approaches used. Due to heterogeneity between studies, meta-analysis was not performed. Occupational injuries were the focus of half of the machine learning studies, and the most common methods described were Bayesian probability or Bayesian network-based methods to either predict injury categories or extract common injury scenarios. Models were evaluated through comparison with gold-standard data, content expert evaluation, or statistical measures of quality. Machine learning was found to provide high precision and accuracy when predicting a small number of categories, and was valuable for visualisation of injury patterns and prediction of future outcomes. However, difficulties related to generalizability, source data quality, complexity of models and integration of content and technical knowledge were discussed. The use of narrative text for injury surveillance has grown in popularity, complexity and quality over recent years. With advances in data mining techniques, increased capacity for analysis of large databases, and involvement of computer scientists in the injury prevention field, along with more comprehensive use and description of quality assurance methods in text mining approaches, it is likely that we will see a continued growth and advancement in knowledge of text mining in the injury field. Copyright © 2015 Elsevier Ltd. All rights reserved.

  11. Cataloging the biomedical world of pain through semi-automated curation of molecular interactions.

    PubMed

    Jamieson, Daniel G; Roberts, Phoebe M; Robertson, David L; Sidders, Ben; Nenadic, Goran

    2013-01-01

    The vast collection of biomedical literature and its continued expansion has presented a number of challenges to researchers who require structured findings to stay abreast of and analyze molecular mechanisms relevant to their domain of interest. By structuring literature content into topic-specific machine-readable databases, the aggregate data from multiple articles can be used to infer trends that can be compared and contrasted with similar findings from topic-independent resources. Our study presents a generalized procedure for semi-automatically creating a custom topic-specific molecular interaction database through the use of text mining to assist manual curation. We apply the procedure to capture molecular events that underlie 'pain', a complex phenomenon with a large societal burden and unmet medical need. We describe how existing text mining solutions are used to build a pain-specific corpus, extract molecular events from it, add context to the extracted events and assess their relevance. The pain-specific corpus contains 765 692 documents from Medline and PubMed Central, from which we extracted 356 499 unique normalized molecular events, with 261 438 single protein events and 93 271 molecular interactions supplied by BioContext. Event chains are annotated with negation, speculation, anatomy, Gene Ontology terms, mutations, pain and disease relevance, which collectively provide detailed insight into how that event chain is associated with pain. The extracted relations are visualized in a wiki platform (wiki-pain.org) that enables efficient manual curation and exploration of the molecular mechanisms that underlie pain. Curation of 1500 grouped event chains ranked by pain relevance revealed 613 accurately extracted unique molecular interactions that in the future can be used to study the underlying mechanisms involved in pain. Our approach demonstrates that combining existing text mining tools with domain-specific terms and wiki-based visualization can facilitate rapid curation of molecular interactions to create a custom database. Database URL: •••

  12. The Molecule Pages database

    PubMed Central

    Saunders, Brian; Lyon, Stephen; Day, Matthew; Riley, Brenda; Chenette, Emily; Subramaniam, Shankar

    2008-01-01

    The UCSD-Nature Signaling Gateway Molecule Pages (http://www.signaling-gateway.org/molecule) provides essential information on more than 3800 mammalian proteins involved in cellular signaling. The Molecule Pages contain expert-authored and peer-reviewed information based on the published literature, complemented by regularly updated information derived from public data source references and sequence analysis. The expert-authored data includes both a full-text review about the molecule, with citations, and highly structured data for bioinformatics interrogation, including information on protein interactions and states, transitions between states and protein function. The expert-authored pages are anonymously peer reviewed by the Nature Publishing Group. The Molecule Pages data is present in an object-relational database format and is freely accessible to the authors, the reviewers and the public from a web browser that serves as a presentation layer. The Molecule Pages are supported by several applications that along with the database and the interfaces form a multi-tier architecture. The Molecule Pages and the Signaling Gateway are routinely accessed by a very large research community. PMID:17965093

  13. The Molecule Pages database.

    PubMed

    Saunders, Brian; Lyon, Stephen; Day, Matthew; Riley, Brenda; Chenette, Emily; Subramaniam, Shankar; Vadivelu, Ilango

    2008-01-01

    The UCSD-Nature Signaling Gateway Molecule Pages (http://www.signaling-gateway.org/molecule) provides essential information on more than 3800 mammalian proteins involved in cellular signaling. The Molecule Pages contain expert-authored and peer-reviewed information based on the published literature, complemented by regularly updated information derived from public data source references and sequence analysis. The expert-authored data includes both a full-text review about the molecule, with citations, and highly structured data for bioinformatics interrogation, including information on protein interactions and states, transitions between states and protein function. The expert-authored pages are anonymously peer reviewed by the Nature Publishing Group. The Molecule Pages data is present in an object-relational database format and is freely accessible to the authors, the reviewers and the public from a web browser that serves as a presentation layer. The Molecule Pages are supported by several applications that along with the database and the interfaces form a multi-tier architecture. The Molecule Pages and the Signaling Gateway are routinely accessed by a very large research community.

  14. Full-text, Downloading, & Other Issues.

    ERIC Educational Resources Information Center

    Tenopir, Carol

    1983-01-01

    Issues having a possible impact on online search services in libraries are discussed including full text databases, front-end processors which translate user's input into the command language of an appropriate system, downloading to create personal files from commercial databases, and pricing. (EJS)

  15. NOAA Data Rescue of Key Solar Databases and Digitization of Historical Solar Images

    NASA Astrophysics Data System (ADS)

    Coffey, H. E.

    2006-08-01

    Over a number of years, the staff at NOAA National Geophysical Data Center (NGDC) has worked to rescue key solar databases by converting them to digital format and making them available via the World Wide Web. NOAA has had several data rescue programs where staff compete for funds to rescue important and critical historical data that are languishing in archives and at risk of being lost due to deteriorating condition, loss of any metadata or descriptive text that describe the databases, lack of interest or funding in maintaining databases, etc. The Solar-Terrestrial Physics Division at NGDC was able to obtain funds to key in some critical historical tabular databases. Recently the NOAA Climate Database Modernization Program (CDMP) funded a project to digitize historical solar images, producing a large online database of historical daily full disk solar images. The images include the wavelengths Calcium K, Hydrogen Alpha, and white light photos, as well as sunspot drawings and the comprehensive drawings of a multitude of solar phenomena on one daily map (Fraunhofer maps and Wendelstein drawings). Included in the digitization are high resolution solar H-alpha images taken at the Boulder Solar Observatory 1967-1984. The scanned daily images document many phases of solar activity, from decadal variation to rotational variation to daily changes. Smaller versions are available online. Larger versions are available by request. See http://www.ngdc.noaa.gov/stp/SOLAR/ftpsolarimages.html. The tabular listings and solar imagery will be discussed.

  16. Using statistical text classification to identify health information technology incidents

    PubMed Central

    Chai, Kevin E K; Anthony, Stephen; Coiera, Enrico; Magrabi, Farah

    2013-01-01

    Objective To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements κ statistic, F1 score, precision and recall. Results Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation. PMID:23666777
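
    The MAUDE dataset and the exact feature pipeline above are not reproduced here; as a minimal, hedged sketch of the general approach the study evaluates (bag-of-words style features with regularized logistic regression, here via scikit-learn's TF-IDF vectorizer), consider the following, with toy incident narratives standing in for the real reports.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy stand-ins for incident narratives (1 = health IT incident, 0 = other device incident).
        reports = [
            "interface froze and medication order was lost",
            "software upgrade corrupted the patient record display",
            "catheter balloon ruptured during insertion",
            "infusion pump tubing kinked and leaked",
        ]
        labels = [1, 1, 0, 0]

        # Regularized logistic regression over TF-IDF features; C controls regularization strength.
        classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression(C=1.0))
        classifier.fit(reports, labels)

        print(classifier.predict(["screen froze while entering the order"]))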

  17. Very large database of lipids: rationale and design.

    PubMed

    Martin, Seth S; Blaha, Michael J; Toth, Peter P; Joshi, Parag H; McEvoy, John W; Ahmed, Haitham M; Elshazly, Mohamed B; Swiger, Kristopher J; Michos, Erin D; Kwiterovich, Peter O; Kulkarni, Krishnaji R; Chimera, Joseph; Cannon, Christopher P; Blumenthal, Roger S; Jones, Steven R

    2013-11-01

    Blood lipids have major cardiovascular and public health implications. Lipid-lowering drugs are prescribed based in part on categorization of patients into normal or abnormal lipid metabolism, yet relatively little emphasis has been placed on: (1) the accuracy of current lipid measures used in clinical practice, (2) the reliability of current categorizations of dyslipidemia states, and (3) the relationship of advanced lipid characterization to other cardiovascular disease biomarkers. To these ends, we developed the Very Large Database of Lipids (NCT01698489), an ongoing database protocol that harnesses deidentified data from the daily operations of a commercial lipid laboratory. The database includes individuals who were referred for clinical purposes for a Vertical Auto Profile (Atherotech Inc., Birmingham, AL), which directly measures cholesterol concentrations of low-density lipoprotein, very low-density lipoprotein, intermediate-density lipoprotein, high-density lipoprotein, their subclasses, and lipoprotein(a). Individual Very Large Database of Lipids studies, ranging from studies of measurement accuracy, to dyslipidemia categorization, to biomarker associations, to characterization of rare lipid disorders, are investigator-initiated and utilize peer-reviewed statistical analysis plans to address a priori hypotheses/aims. In the first database harvest (Very Large Database of Lipids 1.0) from 2009 to 2011, there were 1 340 614 adult and 10 294 pediatric patients; the adult sample had a median age of 59 years (interquartile range, 49-70 years) with even representation by sex. Lipid distributions closely matched those from the population-representative National Health and Nutrition Examination Survey. The second harvest of the database (Very Large Database of Lipids 2.0) is underway. Overall, the Very Large Database of Lipids database provides an opportunity for collaboration and new knowledge generation through careful examination of granular lipid data on a large scale. © 2013 Wiley Periodicals, Inc.

  18. An Online Database Producer's Memoirs and Memories of an Online Pioneer and The Database Industry: Looking into the Future.

    ERIC Educational Resources Information Center

    Kollegger, James G.; And Others

    1988-01-01

    In the first of three articles, the producer of Energyline, Energynet, and Tele/Scope recalls the development of the databases and database business strategies. The second describes the development of biomedical online databases, and the third discusses future developments, including full text databases, database producers as online host, and…

  19. Extracting Databases from Dark Data with DeepDive

    PubMed Central

    Zhang, Ce; Shin, Jaeho; Ré, Christopher; Cafarella, Michael; Niu, Feng

    2016-01-01

    DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data — scientific papers, Web classified ads, customer service notes, and so on — were instead in a relational database, it would give analysts a massive and valuable new set of “big data.” DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontologists, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference. PMID:28316365

  20. The Database Business: Managing Today--Planning for Tomorrow. Quality Assurance of Text and Image Databases at the U.S. Patent and Trademark Office.

    ERIC Educational Resources Information Center

    Grooms, David W.

    1988-01-01

    Discusses the quality controls imposed on text and image data that is currently being converted from paper to digital images by the Patent and Trademark Office. The methods of inspection used on text and on images are described, and the quality of the data delivered thus far is discussed. (CLB)

  1. High-performance information search filters for acute kidney injury content in PubMed, Ovid Medline and Embase.

    PubMed

    Hildebrand, Ainslie M; Iansavichus, Arthur V; Haynes, R Brian; Wilczynski, Nancy L; Mehta, Ravindra L; Parikh, Chirag R; Garg, Amit X

    2014-04-01

    We frequently fail to identify articles relevant to the subject of acute kidney injury (AKI) when searching the large bibliographic databases such as PubMed, Ovid Medline or Embase. To address this issue, we used computer automation to create information search filters to better identify articles relevant to AKI in these databases. We first manually reviewed a sample of 22 992 full-text articles and used prespecified criteria to determine whether each article contained AKI content or not. In the development phase (two-thirds of the sample), we developed and tested the performance of >1.3 million unique filters. Filters with high sensitivity and high specificity for the identification of AKI articles were then retested in the validation phase (remaining third of the sample). We succeeded in developing and validating high-performance AKI search filters for each bibliographic database with sensitivities and specificities in excess of 90%. Filters optimized for sensitivity reached at least 97.2% sensitivity, and filters optimized for specificity reached at least 99.5% specificity. The filters were complex; for example, one PubMed filter included >140 terms used in combination, including 'acute kidney injury', 'tubular necrosis', 'azotemia' and 'ischemic injury'. In proof-of-concept searches, physicians found more articles relevant to topics in AKI with the use of the filters. PubMed, Ovid Medline and Embase can be filtered for articles relevant to AKI in a reliable manner. These high-performance information filters are now available online and can be used to better identify AKI content in large bibliographic databases.
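
    As a rough illustration of how such a validated filter is applied in practice, the sketch below (Python, using Biopython's Entrez module) appends a short, purely illustrative stand-in for an AKI filter to a topic query and submits it to PubMed; the e-mail address, the topic, and the filter fragment are placeholders rather than the published filter.

        from Bio import Entrez  # Biopython wrapper around the NCBI E-utilities

        Entrez.email = "researcher@example.org"  # placeholder; NCBI asks for a contact address

        # Illustrative fragment only: the published sensitivity-optimized filters
        # combine >140 terms, so this stand-in is far shorter than the real thing.
        aki_filter = ('"acute kidney injury"[tiab] OR "tubular necrosis"[tiab] '
                      'OR azotemia[tiab] OR "ischemic injury"[tiab]')

        topic = '"contrast media"[tiab]'          # hypothetical clinical question
        query = f'({topic}) AND ({aki_filter})'   # the filter narrows the topic search

        handle = Entrez.esearch(db="pubmed", term=query, retmax=20)
        record = Entrez.read(handle)
        handle.close()
        print(record["Count"], record["IdList"])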

  2. International comparison CCQM-K119 liquefied petroleum gas

    NASA Astrophysics Data System (ADS)

    Brewer, P. J.; Downey, M. L.; Atkins, E.; Brown, R. J. C.; Brown, A. S.; Zalewska, E. T.; van der Veen, A. M. H.; Smeulders, D. E.; McCallum, J. B.; Satumba, R. T.; Kim, Y. D.; Kang, N.; Bae, H. K.; Woo, J. C.; Konopelko, L. A.; Popova, T. A.; Meshkov, A. V.; Efremova, O. V.; Kustikov, Y.

    2018-01-01

    Liquefied hydrocarbon mixtures with traceable composition are required in order to underpin measurements of the composition and other physical properties of LPG (liquefied petroleum gas), thus meeting the needs of an increasingly large industrial market. This comparison aims to assess the analytical capabilities of laboratories for measuring the composition of an LPG mixture when sampled in the liquid phase from a constant-pressure cylinder. Mixtures contained ethane, propane, propene, i-butane, n-butane, but-1-ene and i-pentane with nominal amount fractions of 2, 71, 9, 4, 10, 3 and 1 cmol mol-1, respectively. Main text: To reach the main text of this paper, click on Final Report. Note that this text is that which appears in Appendix B of the BIPM key comparison database kcdb.bipm.org/. The final report has been peer-reviewed and approved for publication by the CCQM, according to the provisions of the CIPM Mutual Recognition Arrangement (CIPM MRA).

  3. Generation of Natural-Language Textual Summaries from Longitudinal Clinical Records.

    PubMed

    Goldstein, Ayelet; Shahar, Yuval

    2015-01-01

    In their daily tasks, physicians are required to interpret, abstract, and present large amounts of clinical data in free text. This is especially true in chronic-disease domains, but also holds in other clinical domains. We have recently developed a prototype system, CliniText, which, given a time-oriented clinical database, and appropriate formal abstraction and summarization knowledge, combines the computational mechanisms of knowledge-based temporal data abstraction, textual summarization, abduction, and natural-language generation techniques, to generate an intelligent textual summary of longitudinal clinical data. We demonstrate our methodology, and the feasibility of providing a free-text summary of longitudinal electronic patient records, by generating summaries in two very different domains: diabetes management and cardiothoracic surgery. In particular, we explain the process of generating a discharge summary of a patient who had undergone a Coronary Artery Bypass Graft operation, and a brief summary of the treatment of a diabetes patient for five years.

  4. Free text databases in an Integrated Academic Information System (IAIMS) at Columbia Presbyterian Medical Center.

    PubMed Central

    Clark, A. S.; Shea, S.

    1991-01-01

    The use of Folio Views, a PC DOS based product for free text databases, is explored in three applications in an Integrated Academic Information System (IAIMS): (1) a telephone directory, (2) a grants and contracts newsletter, and (3) nursing care plans. PMID:1666967

  5. The Impact of Online Bibliographic Databases on Teaching and Research in Political Science.

    ERIC Educational Resources Information Center

    Reichel, Mary

    The availability of online bibliographic databases greatly facilitates literature searching in political science. The advantages to searching databases online include combination of concepts, comprehensiveness, multiple database searching, free-text searching, currency, current awareness services, document delivery service, and convenience.…

  6. Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification.

    PubMed

    Mehryary, Farrokh; Kaewphan, Suwisa; Hakala, Kai; Ginter, Filip

    2016-01-01

    Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can thus be used to improve the quality of any event extraction system or database. The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
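
    The clustering-and-pruning step can be sketched as follows (Python with SciPy); the word list, the random vectors standing in for literature-scale embeddings, and the pruning decision are all hypothetical, so this only illustrates the shape of the approach rather than the published pipeline.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage

        # Stand-ins: in the paper the vectors are word embeddings trained on the
        # whole biomedical literature; random vectors are used here for illustration.
        trigger_words = ["expression", "binding", "cell", "mouse", "phosphorylation", "protein"]
        vectors = np.random.default_rng(0).normal(size=(len(trigger_words), 50))

        # Hierarchically cluster candidate trigger words (cosine distance, average linkage).
        tree = linkage(vectors, method="average", metric="cosine")
        labels = fcluster(tree, t=0.8, criterion="distance")

        # Hypothetical manual judgement: clusters judged never to act as event
        # triggers (e.g. generic nouns) are pruned from the event database.
        never_triggers = {labels[trigger_words.index("cell")]}
        kept = [w for w, c in zip(trigger_words, labels) if c not in never_triggers]
        print(kept)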

  7. Image Engine: an object-oriented multimedia database for storing, retrieving and sharing medical images and text.

    PubMed Central

    Lowe, H. J.

    1993-01-01

    This paper describes Image Engine, an object-oriented, microcomputer-based, multimedia database designed to facilitate the storage and retrieval of digitized biomedical still images, video, and text using inexpensive desktop computers. The current prototype runs on Apple Macintosh computers and allows network database access via peer to peer file sharing protocols. Image Engine supports both free text and controlled vocabulary indexing of multimedia objects. The latter is implemented using the TView thesaurus model developed by the author. The current prototype of Image Engine uses the National Library of Medicine's Medical Subject Headings (MeSH) vocabulary (with UMLS Meta-1 extensions) as its indexing thesaurus. PMID:8130596

  8. BDVC (Bimodal Database of Violent Content): A database of violent audio and video

    NASA Astrophysics Data System (ADS)

    Rivera Martínez, Jose Luis; Mijes Cruz, Mario Humberto; Rodríguez Vázqu, Manuel Antonio; Rodríguez Espejo, Luis; Montoya Obeso, Abraham; García Vázquez, Mireya Saraí; Ramírez Acosta, Alejandro Álvaro

    2017-09-01

    Current multimedia content description, organization, and retrieval applications tend to rely on unimodal databases of a single type of content, such as text, voice, or images; bimodal databases, in contrast, semantically associate two different types of content, such as audio-video or image-text. Generating a bimodal audio-video database implies creating a connection between the multimedia contents through a semantic relation that associates the actions in both types of information. This paper describes in detail the characteristics and methodology used to create the bimodal database of violent content; the semantic relationship is established by the proposed concepts that describe the audiovisual information. Using bimodal databases in applications that process audiovisual content improves semantic performance if and only if those applications process both types of content. This bimodal database contains 580 annotated audiovisual segments, with a total duration of 28 minutes, divided into 41 classes. Bimodal databases are a tool for building applications for the semantic web.

  9. Protocol for developing a Database of Zoonotic disease Research in India (DoZooRI).

    PubMed

    Chatterjee, Pranab; Bhaumik, Soumyadeep; Chauhan, Abhimanyu Singh; Kakkar, Manish

    2017-12-10

    Zoonotic and emerging infectious diseases (EIDs) represent a public health threat that has been acknowledged only recently, although such diseases have been on the rise for the past several decades. On average, one pathogen has emerged or re-emerged on a global scale every year since the Second World War. Low/middle-income countries such as India bear a significant burden of zoonotic and EIDs. We propose that the creation of a database of published, peer-reviewed research will open up avenues for evidence-based policymaking for targeted prevention and control of zoonoses. A large-scale systematic mapping of the published peer-reviewed research conducted in India will be undertaken. All published research will be included in the database, without screening on quality, to broaden the scope of included studies. Structured search strategies will be developed for priority zoonotic diseases (leptospirosis, rabies, anthrax, brucellosis, cysticercosis, salmonellosis, bovine tuberculosis, Japanese encephalitis and rickettsial infections), and multiple databases will be searched for studies conducted in India. The database will be managed and hosted on a cloud-based platform called Rayyan. Individual studies will be tagged based on key preidentified parameters (disease, study design, study type, location, randomisation status and interventions, host involvement and others, as applicable). The database will incorporate already published studies, obviating the need for additional ethical clearances. The database will be made available online, and in collaboration with multisectoral teams, domains of enquiry will be identified and subsequent research questions will be raised. The database will be queried for these and the resulting evidence will be analysed and published in peer-reviewed journals. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  10. A Practical Guide to Interpretation of Large Collections of Incident Narratives Using the QUORUM Method

    NASA Technical Reports Server (NTRS)

    McGreevy, Michael W.

    1997-01-01

    Analysis of incident reports plays an important role in aviation safety. Typically, a narrative description, written by a participant, is a central part of an incident report. Because there are so many reports, and the narratives contain so much detail, it can be difficult to efficiently and effectively recognize patterns among them. Recognizing and addressing recurring problems, however, is vital to continuing safety in commercial aviation operations. A practical way to interpret large collections of incident narratives is to apply the QUORUM method of text analysis, modeling, and relevance ranking. In this paper, QUORUM text analysis and modeling are surveyed, and QUORUM relevance ranking is described in detail with many examples. The examples are based on several large collections of reports from the Aviation Safety Reporting System (ASRS) database, and a collection of news stories describing the disaster of TWA Flight 800, the Boeing 747 which exploded in mid-air and crashed near Long Island, New York, on July 17, 1996. Reader familiarity with this disaster should make the relevance-ranking examples more understandable. The ASRS examples illustrate the practical application of QUORUM relevance ranking.

  11. Using Large Diabetes Databases for Research.

    PubMed

    Wild, Sarah; Fischbacher, Colin; McKnight, John

    2016-09-01

    There are an increasing number of clinical, administrative and trial databases that can be used for research. These are particularly valuable if there are opportunities for linkage to other databases. This paper describes examples of the use of large diabetes databases for research. It reviews the advantages and disadvantages of using large diabetes databases for research and suggests solutions for some challenges. Large, high-quality databases offer potential sources of information for research at relatively low cost. Fundamental issues for using databases for research are the completeness of capture of cases within the population and time period of interest and the accuracy of the diagnosis of diabetes and outcomes of interest. The extent to which people included in the database are representative should be considered if the database is not population based and there is the intention to extrapolate findings to the wider diabetes population. Information on key variables such as date of diagnosis or duration of diabetes may not be available at all, may be inaccurate or may contain a large amount of missing data. Information on key confounding factors is rarely available for the nondiabetic or general population, limiting comparisons with the population of people with diabetes. However, comparisons that allow for differences in distribution of important demographic factors may be feasible using data for the whole population or a matched cohort study design. In summary, diabetes databases can be used to address important research questions. Understanding the strengths and limitations of this approach is crucial to interpret the findings appropriately. © 2016 Diabetes Technology Society.

  12. Textpresso site-specific recombinases: A text-mining server for the recombinase literature including Cre mice and conditional alleles.

    PubMed

    Urbanski, William M; Condie, Brian G

    2009-12-01

    Textpresso Site Specific Recombinases (http://ssrc.genetics.uga.edu/) is a text-mining web server for searching a database of more than 9,000 full-text publications. The papers and abstracts in this database represent a wide range of topics related to site-specific recombinase (SSR) research tools. Included in the database are most of the papers that report the characterization or use of mouse strains that express Cre recombinase as well as papers that describe or analyze mouse lines that carry conditional (floxed) alleles or SSR-activated transgenes/knockins. The database also includes reports describing SSR-based cloning methods such as the Gateway or the Creator systems, papers reporting the development or use of SSR-based tools in systems such as Drosophila, bacteria, parasites, stem cells, yeast, plants, zebrafish, and Xenopus as well as publications that describe the biochemistry, genetics, or molecular structure of the SSRs themselves. Textpresso Site Specific Recombinases is the only comprehensive text-mining resource available for the literature describing the biology and technical applications of SSRs. (c) 2009 Wiley-Liss, Inc.

  13. Text mining for metabolic pathways, signaling cascades, and protein networks.

    PubMed

    Hoffmann, Robert; Krallinger, Martin; Andres, Eduardo; Tamames, Javier; Blaschke, Christian; Valencia, Alfonso

    2005-05-10

    The complexity of the information stored in databases and publications on metabolic and signaling pathways, the high throughput of experimental data, and the growing number of publications make it imperative to provide systems to help the researcher navigate through these interrelated information resources. Text-mining methods have started to play a key role in the creation and maintenance of links between the information stored in biological databases and its original sources in the literature. These links will be extremely useful for database updating and curation, especially if a number of technical problems can be solved satisfactorily, including the identification of protein and gene names (entities in general) and the characterization of their types of interactions. The first generation of openly accessible text-mining systems, such as iHOP (Information Hyperlinked over Proteins), provides additional functions to facilitate the reconstruction of protein interaction networks, combine database and text information, and support the scientist in the formulation of novel hypotheses. The next challenge is the generation of comprehensive information regarding the general function of signaling pathways and protein interaction networks.

  14. Drowning in Data: Sorting through CD ROM and Computer Databases.

    ERIC Educational Resources Information Center

    Cates, Carl M.; Kaye, Barbara K.

    This paper identifies the bibliographic and numeric databases on CD-ROM and computer diskette that should be most useful for investigators in communication, marketing, and communication education. Bibliographic databases are usually found in three formats: citations only, citations and abstracts, and full-text articles. Numeric databases are…

  15. Semantic Annotation of Complex Text Structures in Problem Reports

    NASA Technical Reports Server (NTRS)

    Malin, Jane T.; Throop, David R.; Fleming, Land D.

    2011-01-01

    Text analysis is important for effective information retrieval from databases where the critical information is embedded in text fields. Aerospace safety depends on effective retrieval of relevant and related problem reports for the purpose of trend analysis. The complex text syntax in problem descriptions has limited statistical text mining of problem reports. The presentation describes an intelligent tagging approach that applies syntactic and then semantic analysis to overcome this problem. The tags identify types of problems and equipment that are embedded in the text descriptions. The power of these tags is illustrated in a faceted searching and browsing interface for problem report trending that combines automatically generated tags with database code fields and temporal information.

  16. Reducing process delays for real-time earthquake parameter estimation - An application of KD tree to large databases for Earthquake Early Warning

    NASA Astrophysics Data System (ADS)

    Yin, Lucy; Andrews, Jennifer; Heaton, Thomas

    2018-05-01

    Earthquake parameter estimation using nearest neighbor searching among a large database of observations can lead to reliable prediction results. However, in the real-time application of Earthquake Early Warning (EEW) systems, the accurate prediction using a large database is penalized by a significant delay in processing time. We propose to use a multidimensional binary search tree (KD tree) data structure to organize large seismic databases, reducing the processing time of nearest neighbor searches for predictions. We evaluated the performance of the KD tree on the Gutenberg Algorithm, a database-searching algorithm for EEW. We constructed an offline test to predict peak ground motions using a database with feature sets of waveform filter-bank characteristics, and compared the results with the observed seismic parameters. We concluded that a large database provides more accurate predictions of the ground motion information, such as peak ground acceleration, velocity, and displacement (PGA, PGV, PGD), than source parameters, such as hypocenter distance. Application of the KD tree search to organize the database reduced the average search time by 85% compared with the exhaustive method, making the approach feasible for real-time implementation. The algorithm is straightforward and the results will reduce the overall time of warning delivery for EEW.
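
    A minimal sketch of the idea, assuming SciPy and a toy set of filter-bank feature vectors with their eventually observed peak ground accelerations (all values here are synthetic), shows how a KD tree replaces an exhaustive scan with a fast k-nearest-neighbor lookup:

        import numpy as np
        from scipy.spatial import cKDTree

        # Synthetic training set: filter-bank feature vectors for past waveform
        # snippets and the peak ground acceleration (PGA) eventually observed.
        rng = np.random.default_rng(1)
        features = rng.normal(size=(100_000, 12))   # 12 filter-bank amplitudes per record
        observed_pga = rng.lognormal(mean=-2.0, sigma=1.0, size=100_000)

        tree = cKDTree(features)                    # built once, offline

        # Real-time step: query the k nearest stored records for an incoming
        # waveform and predict PGA from them, instead of scanning the whole database.
        incoming = rng.normal(size=12)
        _, idx = tree.query(incoming, k=30)
        print(f"predicted PGA: {observed_pga[idx].mean():.3f}")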

  17. Application of kernel functions for accurate similarity search in large chemical databases.

    PubMed

    Wang, Xiaohong; Huan, Jun; Smalter, Aaron; Lushington, Gerald H

    2010-04-29

    Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening among others. It is widely believed that structure-based methods provide an efficient way to do the query. Recently various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to the high computational complexity and the difficulties in indexing similarity search for large databases. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed by our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support the new graph kernel function definition, efficient storage and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with a smaller index size and faster query processing time compared with state-of-the-art indexing methods such as Daylight fingerprints, C-tree and GraphGrep. Efficient similarity query processing for large chemical databases is challenging, since running time efficiency must be balanced against similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. Experimental study validates the utility of G-hash in chemical databases.
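
    The general idea of hashing local graph features so that similar compounds land in shared buckets can be sketched as below; the toy molecules, node descriptors, and Jaccard-style ranking are invented for illustration and are not the published G-hash kernel.

        from collections import defaultdict

        # Toy molecules: atom id -> (element, elements of its neighbors). These are
        # not the published G-hash descriptors, only the idea of hashing local node
        # environments so that similar graphs share hash-table buckets.
        def node_keys(mol):
            return [f"{element}|{'.'.join(sorted(neighbors))}"
                    for element, neighbors in mol.values()]

        database = {
            "mol_a": {0: ("C", ["C", "O"]), 1: ("C", ["C"]), 2: ("O", ["C"])},
            "mol_b": {0: ("C", ["C", "O"]), 1: ("O", ["C"])},
            "mol_c": {0: ("N", ["C"]), 1: ("C", ["N"])},
        }

        # Index built once, offline: node-environment key -> molecule ids.
        index = defaultdict(set)
        for mol_id, mol in database.items():
            for key in node_keys(mol):
                index[key].add(mol_id)

        # Query: shortlist molecules sharing keys, then rank by a Jaccard-style score.
        query = {0: ("C", ["C", "O"]), 1: ("O", ["C"])}
        q_keys = set(node_keys(query))
        candidates = set().union(*(index[k] for k in q_keys if k in index))

        def score(mol_id):
            keys = set(node_keys(database[mol_id]))
            return len(q_keys & keys) / len(q_keys | keys)

        print(sorted(candidates, key=score, reverse=True))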

  18. Working with Data: Discovering Knowledge through Mining and Analysis; Systematic Knowledge Management and Knowledge Discovery; Text Mining; Methodological Approach in Discovering User Search Patterns through Web Log Analysis; Knowledge Discovery in Databases Using Formal Concept Analysis; Knowledge Discovery with a Little Perspective.

    ERIC Educational Resources Information Center

    Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.

    2000-01-01

    These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)

  19. Automated compound classification using a chemical ontology.

    PubMed

    Bobach, Claudia; Böhme, Timo; Laube, Ulf; Püschel, Anett; Weber, Lutz

    2012-12-29

    Classification of chemical compounds into compound classes by using structure-derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error-prone and time-consuming manual classification of compounds. In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships. A proposal for a rule-based definition of chemical classes has been made that allows chemical compound classes to be defined more precisely than before. The proposed structure-based reasoning logic allows chemistry expert knowledge to be translated into a computer-interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.
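
    A minimal sketch of such a rule-based class definition, written with the open-source RDKit toolkit rather than the engine used in the study, combines SMARTS patterns with AND/OR/NOT logic; the "aliphatic alcohol that is not a carboxylic acid" class and its SMARTS are invented for illustration.

        from rdkit import Chem  # assumes RDKit is installed; not the engine used in the paper

        # Hypothetical class: aliphatic alcohol that is not a carboxylic acid.
        RULE = {
            "all_of":  ["[CX4][OX2H]"],      # an sp3 carbon bearing an -OH group
            "any_of":  [],                   # no alternative requirement in this toy class
            "none_of": ["C(=O)[OX2H1]"],     # exclude carboxylic acids
        }

        def in_class(smiles, rule):
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:
                return False
            hit = lambda smarts: mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))
            return (all(hit(s) for s in rule["all_of"])
                    and (not rule["any_of"] or any(hit(s) for s in rule["any_of"]))
                    and not any(hit(s) for s in rule["none_of"]))

        for smi in ["CCO", "CC(=O)O", "OCC(=O)O"]:   # ethanol, acetic acid, glycolic acid
            print(smi, in_class(smi, RULE))

    Chaining many such rules into a hierarchy of concept nodes, as the abstract describes, then reduces automated classification to evaluating each compound against the rule attached to every node.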

  20. Automated compound classification using a chemical ontology

    PubMed Central

    2012-01-01

    Background Classification of chemical compounds into compound classes by using structure-derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error-prone and time-consuming manual classification of compounds. Results In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships. Conclusions A proposal for a rule-based definition of chemical classes has been made that allows chemical compound classes to be defined more precisely than before. The proposed structure-based reasoning logic allows chemistry expert knowledge to be translated into a computer-interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated. PMID:23273256

  1. Using external data sources to improve audit trail analysis.

    PubMed

    Herting, R L; Asaro, P V; Roth, A C; Barnes, M R

    1999-01-01

    Audit trail analysis is the primary means of detecting inappropriate use of the medical record. While audit logs contain large amounts of information, the information required to determine useful user-patient relationships is often not present, because most audit trail analysis systems rely on the limited information available within the medical record system. We report a feature of the STAR (System for Text Archive and Retrieval) audit analysis system where information available in the medical record is augmented with external information sources such as database sources, Lightweight Directory Access Protocol (LDAP) server sources, and World Wide Web (WWW) database sources. We discuss several issues that arise when combining the information from each of these disparate information sources. Furthermore, we explain how the enhanced person-specific information obtained can be used to determine user-patient relationships that might signify a motive for inappropriately accessing a patient's medical record.

  2. Automatic layout of structured hierarchical reports.

    PubMed

    Bakke, Eirik; Karger, David R; Miller, Robert C

    2013-12-01

    Domain-specific database applications tend to contain a sizable number of table-, form-, and report-style views that must each be designed and maintained by a software developer. A significant part of this job is the necessary tweaking of low-level presentation details such as label placements, text field dimensions, list or table styles, and so on. In this paper, we present a horizontally constrained layout management algorithm that automates the display of structured hierarchical data using the traditional visual idioms of hand-designed database UIs: tables, multi-column forms, and outline-style indented lists. We compare our system with pure outline and nested table layouts with respect to space efficiency and readability, the latter with an online user study on 27 subjects. Our layouts are 3.9 and 1.6 times more compact on average than outline layouts and horizontally unconstrained table layouts, respectively, and are as readable as table layouts even for large datasets.

  3. Recently active traces of the Bartlett Springs Fault, California: a digital database

    USGS Publications Warehouse

    Lienkaemper, James J.

    2010-01-01

    The purpose of this map is to show the location of and evidence for recent movement on active fault traces within the Bartlett Springs Fault Zone, California. The location and recency of the mapped traces is primarily based on geomorphic expression of the fault as interpreted from large-scale aerial photography. In a few places, evidence of fault creep and offset Holocene strata in trenches and natural exposures has confirmed the activity of some of these traces. This publication is formatted both as a digital database for use within a geographic information system (GIS) and, for broader public access, as map images that may be browsed on-line or downloaded as a summary map. The report text describes the types of scientific observations used to make the map, gives references pertaining to the fault and the evidence of faulting, and provides guidance for the use and limitations of the map.

  4. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database.

    PubMed

    Carver, Tim; Berriman, Matthew; Tivey, Adrian; Patel, Chinmay; Böhme, Ulrike; Barrell, Barclay G; Parkhill, Julian; Rajandream, Marie-Adèle

    2008-12-01

    Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/

  5. A User's Applications of Imaging Techniques: The University of Maryland Historic Textile Database.

    ERIC Educational Resources Information Center

    Anderson, Clarita S.

    1991-01-01

    Describes the incorporation of textile images into the University of Maryland Historic Textile Database by a computer user rather than a computer expert. Selection of a database management system is discussed, and PICTUREPOWER, a system that integrates photographic quality images with text and numeric information in databases, is described. (three…

  6. National Databases for Neurosurgical Outcomes Research: Options, Strengths, and Limitations.

    PubMed

    Karhade, Aditya V; Larsen, Alexandra M G; Cote, David J; Dubois, Heloise M; Smith, Timothy R

    2017-08-05

    Quality improvement, value-based care delivery, and personalized patient care depend on robust clinical, financial, and demographic data streams of neurosurgical outcomes. The neurosurgical literature lacks a comprehensive review of large national databases. To assess the strengths and limitations of various resources for outcomes research in neurosurgery. A review of the literature was conducted to identify surgical outcomes studies using national data sets. The databases were assessed for the availability of patient demographics and clinical variables, longitudinal follow-up of patients, strengths, and limitations. The number of unique patients contained within each data set ranged from thousands (Quality Outcomes Database [QOD]) to hundreds of millions (MarketScan). Databases with both clinical and financial data included PearlDiver, Premier Healthcare Database, Vizient Clinical Data Base and Resource Manager, and the National Inpatient Sample. Outcomes collected by databases included patient-reported outcomes (QOD); 30-day morbidity, readmissions, and reoperations (National Surgical Quality Improvement Program); and disease incidence and disease-specific survival (Surveillance, Epidemiology, and End Results-Medicare). The strengths of large databases included large numbers of rare pathologies and multi-institutional nationally representative sampling; the limitations of these databases included variable data veracity, variable data completeness, and missing disease-specific variables. The improvement of existing large national databases and the establishment of new registries will be crucial to the future of neurosurgical outcomes research. Copyright © 2017 by the Congress of Neurological Surgeons

  7. Dictionary as Database.

    ERIC Educational Resources Information Center

    Painter, Derrick

    1996-01-01

    Discussion of dictionaries as databases focuses on the digitizing of The Oxford English dictionary (OED) and the use of Standard Generalized Mark-Up Language (SGML). Topics include the creation of a consortium to digitize the OED, document structure, relational databases, text forms, sequence, and discourse. (LRW)

  8. Robust Optical Recognition of Cursive Pashto Script Using Scale, Rotation and Location Invariant Approach

    PubMed Central

    Ahmad, Riaz; Naz, Saeeda; Afzal, Muhammad Zeshan; Amin, Sayed Hassan; Breuel, Thomas

    2015-01-01

    The presence of a large number of unique shapes called ligatures in cursive languages, along with variations due to scaling, orientation and location, provides one of the most challenging pattern recognition problems. Recognition of the large number of ligatures is often a complicated task in oriental languages such as Pashto, Urdu, Persian and Arabic. Research on cursive script recognition often ignores the fact that scaling, orientation, location and font variations are common in printed cursive text. Therefore, these variations are not included in image databases and in experimental evaluations. This research uncovers challenges faced by Arabic cursive script recognition in a holistic framework by considering Pashto as a test case, because the Pashto language has a larger alphabet set than Arabic, Persian and Urdu. A database containing 8000 images of 1000 unique ligatures with scaling, orientation and location variations is introduced. In this article, a feature space based on the scale-invariant feature transform (SIFT), along with a segmentation framework, is proposed to overcome the above-mentioned challenges. The experimental results show a significantly improved performance of the proposed scheme over traditional feature extraction techniques such as principal component analysis (PCA). PMID:26368566
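
    A brief sketch of the underlying idea, assuming OpenCV (cv2) and two hypothetical image files of the same ligature at different scale and rotation, extracts SIFT descriptors from both and counts matches that survive Lowe's ratio test; the file names and the 0.75 ratio threshold are illustrative choices, not values from the paper.

        import cv2

        # Hypothetical files: the same ligature imaged at different scale/rotation.
        reference = cv2.imread("ligature_reference.png", cv2.IMREAD_GRAYSCALE)
        distorted = cv2.imread("ligature_rotated_scaled.png", cv2.IMREAD_GRAYSCALE)

        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(reference, None)
        kp2, des2 = sift.detectAndCompute(distorted, None)

        # Keep only descriptor matches that pass Lowe's ratio test; because SIFT is
        # scale- and rotation-invariant, many keypoints should still match.
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
                if m.distance < 0.75 * n.distance]
        print(f"{len(good)} stable matches out of {len(kp1)} reference keypoints")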

  9. Chyawanprash: A review of therapeutic benefits as in authoritative texts and documented clinical literature.

    PubMed

    Narayana, D B Anantha; Durg, Sharanbasappa; Manohar, P Ram; Mahapatra, Anita; Aramya, A R

    2017-02-02

    Chyawanprash (CP), a traditional immune booster recipe, has a long history of ethnic origin, development, household preparation and usage. There are even mythological stories about the origin of this recipe, including its nomenclature. In the last six decades, CP, because of the entrepreneurial actions of some research Vaidyas (traditional doctors), has grown to industrial production and marketing in packed forms to a large number of consumers/patients like any food or health care product. Currently, CP has acquired a large accepted user base in India and in a few countries outside India. Authoritative texts, recognized by the Drugs and Cosmetics Act of India, describe CP as an immunity enhancer and strength giver meant for improving lung functions in diseases with compromised immunity. This review focuses on published clinical efficacy and safety studies of CP for correlation with health benefits as documented in the authoritative texts, and also briefs on its recipes and processes. Authoritative texts were searched for recipes, processes, and other technical details of CP. Labels of marketed Indian CP products were studied for health claims. An electronic search for efficacy and safety studies of CP was performed in PubMed/MEDLINE and DHARA (Digital Helpline for Ayurveda Research Articles), and Ayurvedic books were also searched for clinical studies. The documented clinical studies from electronic databases and Ayurvedic books evidenced that individuals who consume CP regularly for a definite period of time showed improvement in overall health status and immunity. However, most of the clinical studies in this review are of smaller sample size and short duration. Further, limitations in accessing and reviewing significant data on traditional products like CP in electronic databases were noted. Randomized controlled trials of high quality with larger sample size and longer follow-up are needed to provide significant evidence on the clinical use of CP as an immunity booster. Additional studies involving measurement of current biomarkers of immunity pre- and post-consumption of the product, as well as benefits accruing with the use of CP as an adjuvant, are suggested. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  10. The IBM PC at NASA Ames

    NASA Technical Reports Server (NTRS)

    Peredo, James P.

    1988-01-01

    Like many large companies, Ames relies very much on its computing power to get work done. And, like many other large companies, finding the IBM PC a reliable tool, Ames uses it for many of the same types of functions as other companies. Presentation and clarification needs demand much of graphics packages. Programming and text editing needs require simpler, more powerful packages. The storage space needed by NASA's scientists and users for the monumental amounts of data that Ames needs to keep demands the best database packages that are large and easy to use. Access to the Micom Switching Network combines the powers of the IBM PC with the capabilities of other computers and mainframes and allows users to communicate electronically. These four primary capabilities of the PC are vital to the needs of NASA's users and help to continue and support the vast amounts of work done by the NASA employees.

  11. Connection of European particle therapy centers and generation of a common particle database system within the European ULICE-framework

    PubMed Central

    2012-01-01

    Background To establish a common database on particle therapy for the evaluation of clinical studies integrating a large variety of voluminous datasets, different documentation styles, and various information systems, especially in the field of radiation oncology. Methods We developed a web-based documentation system for transnational and multicenter clinical studies in particle therapy. 560 patients have been treated from November 2009 to September 2011. Protons, carbon ions or a combination of both, as well as a combination with photons, were applied. To date, 12 studies have been initiated and more are in preparation. Results It is possible to immediately access all patient information and to exchange, store, process, and visualize text data, any DICOM images and multimedia data. Accessing the system and submitting clinical data is possible for internal and external users. Integrated into the hospital environment, data are imported both manually and automatically. Security and privacy protection as well as data validation and verification are ensured. Studies can be designed to fit individual needs. Conclusions The described database provides a basis for the documentation of large patient groups with specific and specialized questions to be answered. Since electronic documentation began recently, it has become apparent that the benefits lie in the user-friendly and timely documentation workflow. The ultimate goal is a simplification of research work, better quality of study analyses and, eventually, the improvement of treatment concepts by evaluating the effectiveness of particle therapy. PMID:22828013

  12. Creation of a Genome-Wide Metabolic Pathway Database for Populus trichocarpa Using a New Approach for Reconstruction and Curation of Metabolic Pathways for Plants

    PubMed Central

    Zhang, Peifen; Dreher, Kate; Karthikeyan, A.; Chi, Anjo; Pujar, Anuradha; Caspi, Ron; Karp, Peter; Kirkup, Vanessa; Latendresse, Mario; Lee, Cynthia; Mueller, Lukas A.; Muller, Robert; Rhee, Seung Yon

    2010-01-01

    Metabolic networks reconstructed from sequenced genomes or transcriptomes can help visualize and analyze large-scale experimental data, predict metabolic phenotypes, discover enzymes, engineer metabolic pathways, and study metabolic pathway evolution. We developed a general approach for reconstructing metabolic pathway complements of plant genomes. Two new reference databases were created and added to the core of the infrastructure: a comprehensive, all-plant reference pathway database, PlantCyc, and a reference enzyme sequence database, RESD, for annotating metabolic functions of protein sequences. PlantCyc (version 3.0) includes 714 metabolic pathways and 2,619 reactions from over 300 species. RESD (version 1.0) contains 14,187 literature-supported enzyme sequences from across all kingdoms. We used RESD, PlantCyc, and MetaCyc (an all-species reference metabolic pathway database), in conjunction with the pathway prediction software Pathway Tools, to reconstruct a metabolic pathway database, PoplarCyc, from the recently sequenced genome of Populus trichocarpa. PoplarCyc (version 1.0) contains 321 pathways with 1,807 assigned enzymes. Comparing PoplarCyc (version 1.0) with AraCyc (version 6.0, Arabidopsis [Arabidopsis thaliana]) showed comparable numbers of pathways distributed across all domains of metabolism in both databases, except for a higher number of AraCyc pathways in secondary metabolism and a 1.5-fold increase in carbohydrate metabolic enzymes in PoplarCyc. Here, we introduce these new resources and demonstrate the feasibility of using them to identify candidate enzymes for specific pathways and to analyze metabolite profiling data through concrete examples. These resources can be searched by text or BLAST, browsed, and downloaded from our project Web site (http://plantcyc.org). PMID:20522724

  13. Cross-checking of Large Evaluated and Experimental Nuclear Reaction Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zeydina, O.; Koning, A.J.; Soppera, N.

    2014-06-15

    Automated methods are presented for the verification of large experimental and evaluated nuclear reaction databases (e.g. EXFOR, JEFF, TENDL). These methods allow an assessment of the overall consistency of the data and detect aberrant values in both evaluated and experimental databases.

  14. Using relational databases for improved sequence similarity searching and large-scale genomic analyses.

    PubMed

    Mackey, Aaron J; Pearson, William R

    2004-10-01

    Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, extension of seqdb_demo to store sequence similarity search results, and use of various kinds of stored search results to address aspects of comparative genomic analysis.
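
    The flavor of the approach can be sketched with SQLite from Python; the two-table schema, accessions, and taxa below are invented stand-ins for the unit's seqdb_demo database, showing only how a subset library is carved out and how stored search results can then be queried.

        import sqlite3

        # Hypothetical schema in the spirit of seqdb_demo; not the published schema.
        con = sqlite3.connect(":memory:")
        con.executescript("""
        CREATE TABLE protein (acc TEXT PRIMARY KEY, taxon TEXT, seq TEXT);
        CREATE TABLE hit (query_acc TEXT, subject_acc TEXT, evalue REAL);
        """)
        con.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
            ("P1", "Homo sapiens", "MSTA..."),
            ("P2", "Mus musculus", "MSTV..."),
            ("P3", "Escherichia coli", "MKKL..."),
        ])

        # Subset library: restrict the search space to mammalian sequences, which
        # can sharpen the statistics of a similarity search.
        mammals = con.execute(
            "SELECT acc, seq FROM protein WHERE taxon IN ('Homo sapiens', 'Mus musculus')"
        ).fetchall()
        print("subset library:", [acc for acc, _ in mammals])

        # Store hits from an external search program, then ask genome-scale questions in SQL.
        con.execute("INSERT INTO hit VALUES ('P1', 'P2', 1e-30)")
        best = con.execute(
            "SELECT subject_acc, MIN(evalue) FROM hit WHERE query_acc='P1'"
        ).fetchone()
        print("best hit for P1:", best)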

  15. Surgical research using national databases

    PubMed Central

    Leland, Hyuma; Heckmann, Nathanael

    2016-01-01

    Recent changes in healthcare and advances in technology have increased the use of large-volume national databases in surgical research. These databases have been used to develop perioperative risk stratification tools, assess postoperative complications, calculate costs, and investigate numerous other topics across multiple surgical specialties. The results of these studies contain variable information but are subject to unique limitations. The use of large-volume national databases is increasing in popularity, and thorough understanding of these databases will allow for a more sophisticated and better educated interpretation of studies that utilize such databases. This review will highlight the composition, strengths, and weaknesses of commonly used national databases in surgical research. PMID:27867945

  16. Surgical research using national databases.

    PubMed

    Alluri, Ram K; Leland, Hyuma; Heckmann, Nathanael

    2016-10-01

    Recent changes in healthcare and advances in technology have increased the use of large-volume national databases in surgical research. These databases have been used to develop perioperative risk stratification tools, assess postoperative complications, calculate costs, and investigate numerous other topics across multiple surgical specialties. The results of these studies contain variable information but are subject to unique limitations. The use of large-volume national databases is increasing in popularity, and thorough understanding of these databases will allow for a more sophisticated and better educated interpretation of studies that utilize such databases. This review will highlight the composition, strengths, and weaknesses of commonly used national databases in surgical research.

  17. The National State Policy Database. Quick Turn Around (QTA).

    ERIC Educational Resources Information Center

    Ahearn, Eileen; Jackson, Terry

    This paper describes the National State Policy Database (NSPD), a full-text searchable database of state and federal education regulations for special education. It summarizes the history of the NSPD and reports on a survey of state directors or their designees as to their use of the database and their suggestions for its future expansion. The…

  18. Selecting Data-Base Management Software for Microcomputers in Libraries and Information Units.

    ERIC Educational Resources Information Center

    Pieska, K. A. O.

    1986-01-01

    Presents a model for the evaluation of database management systems software from the viewpoint of librarians and information specialists. The properties of data management systems, database management systems, and text retrieval systems are outlined and compared. (10 references) (CLB)

  19. ASM Based Synthesis of Handwritten Arabic Text Pages

    PubMed Central

    Al-Hamadi, Ayoub; Elzobi, Moftah; El-etriby, Sherif; Ghoneim, Ahmed

    2015-01-01

    Document analysis tasks such as text recognition, word spotting, or segmentation are highly dependent on comprehensive and suitable databases for training and validation. However, their generation is expensive in terms of labor and time. As a matter of fact, there is a lack of such databases, which complicates research and development. This is especially true for the case of Arabic handwriting recognition, which involves different preprocessing, segmentation, and recognition methods that have individual demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents and detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis, and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step, ASM-based representations are composed into words and text pages, smoothed by B-Spline interpolation and rendered considering writing speed and pen characteristics. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database encourages training and testing document analysis methods on synthetic samples whenever sufficient natural ground-truthed data is not available. PMID:26295059
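
    The B-spline smoothing step can be illustrated with SciPy; the control points below are hypothetical stand-ins for the joined ASM-based character contours described in the abstract.

        import numpy as np
        from scipy.interpolate import splev, splprep

        # Hypothetical control points for a composed word contour; in the system
        # these come from ASM-based character representations joined into words.
        x = np.array([0.0, 1.0, 2.2, 3.0, 4.1, 5.0])
        y = np.array([0.0, 0.8, 0.3, 1.1, 0.4, 0.9])

        # Fit a parametric B-spline through the points and resample it densely,
        # smoothing the joins before the stroke is rendered.
        tck, _ = splprep([x, y], s=0.05)
        u = np.linspace(0.0, 1.0, 200)
        x_smooth, y_smooth = splev(u, tck)
        print(len(x_smooth), "rendered points")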

  20. ASM Based Synthesis of Handwritten Arabic Text Pages.

    PubMed

    Dinges, Laslo; Al-Hamadi, Ayoub; Elzobi, Moftah; El-Etriby, Sherif; Ghoneim, Ahmed

    2015-01-01

    Document analysis tasks such as text recognition, word spotting, or segmentation are highly dependent on comprehensive and suitable databases for training and validation. However, their generation is expensive in terms of labor and time. As a matter of fact, there is a lack of such databases, which complicates research and development. This is especially true for the case of Arabic handwriting recognition, which involves different preprocessing, segmentation, and recognition methods that have individual demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents and detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis, and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step, ASM-based representations are composed into words and text pages, smoothed by B-Spline interpolation and rendered considering writing speed and pen characteristics. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database encourages training and testing document analysis methods on synthetic samples whenever sufficient natural ground-truthed data is not available.

  1. Image query and indexing for digital x rays

    NASA Astrophysics Data System (ADS)

    Long, L. Rodney; Thoma, George R.

    1998-12-01

    The web-based medical information retrieval system (WebMIRS) allows internet access to databases containing 17,000 digitized x-ray spine images and associated text data from the National Health and Nutrition Examination Surveys (NHANES). WebMIRS allows SQL query of the text, and viewing of the returned text records and images using a standard browser. We are now working (1) to determine the utility of data directly derived from the images in our databases, and (2) to investigate the feasibility of computer-assisted or automated indexing of the images to support retrieval of images of interest to biomedical researchers in the field of osteoarthritis. To build an initial database based on image data, we are manually segmenting a subset of the vertebrae, using techniques from vertebral morphometry. From this, we will derive vertebral features and add them to the database. This image-derived data will enhance the user's data access capability by enabling the creation of combined SQL/image-content queries.
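
    A combined SQL/image-content query of the kind envisioned could look like the following SQLite sketch; the table layout, field names, and threshold are hypothetical and simply stand in for NHANES survey fields joined to morphometry-derived vertebral measurements.

        import sqlite3

        # Hypothetical tables: 'exam' holds survey text/coded fields and 'vertebra'
        # holds measurements derived from the x-ray images by morphometry.
        con = sqlite3.connect(":memory:")
        con.executescript("""
        CREATE TABLE exam (subject_id INTEGER PRIMARY KEY, age INTEGER, back_pain TEXT);
        CREATE TABLE vertebra (subject_id INTEGER, level TEXT, anterior_height_mm REAL);
        INSERT INTO exam VALUES (1, 67, 'yes'), (2, 54, 'no');
        INSERT INTO vertebra VALUES (1, 'L3', 21.5), (2, 'L3', 27.0);
        """)

        # Combined query: text/survey criteria plus an image-derived feature.
        rows = con.execute("""
            SELECT e.subject_id, v.anterior_height_mm
            FROM exam e JOIN vertebra v ON v.subject_id = e.subject_id
            WHERE e.back_pain = 'yes' AND v.level = 'L3' AND v.anterior_height_mm < 24
        """).fetchall()
        print(rows)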

  2. Text processing through Web services: calling Whatizit.

    PubMed

    Rebholz-Schuhmann, Dietrich; Arregui, Miguel; Gaudan, Sylvain; Kirsch, Harald; Jimeno, Antonio

    2008-01-15

    Text-mining (TM) solutions are developing into efficient services to researchers in the biomedical research community. Such solutions have to scale with the growing number and size of resources (e.g. available controlled vocabularies), with the amount of literature to be processed (e.g. about 17 million documents in PubMed) and with the demands of the user community (e.g. different methods for fact extraction). These demands motivated the development of a server-based solution for literature analysis. Whatizit is a suite of modules that analyse text for contained information, e.g. any scientific publication or Medline abstracts. Special modules identify terms and then link them to the corresponding entries in bioinformatics databases such as UniProtKb/Swiss-Prot data entries and gene ontology concepts. Other modules identify a set of selected annotation types like the set produced by the EBIMed analysis pipeline for proteins. In the case of Medline abstracts, Whatizit offers access to EBI's in-house installation via PMID or term query. For large quantities of the user's own text, the server can be operated in a streaming mode (http://www.ebi.ac.uk/webservices/whatizit).

  3. Geospatial Data Management Platform for Urban Groundwater

    NASA Astrophysics Data System (ADS)

    Gaitanaru, D.; Priceputu, A.; Gogu, C. R.

    2012-04-01

    Due to the large number of civil works projects and research studies, large quantities of geo-data are produced for urban environments. These data are often redundant and are spread across different institutions and private companies. Time-consuming operations such as data processing and information harmonisation are the main reasons the re-use of data is systematically avoided. Urban groundwater data show the same complex situation. Underground structures (subway lines, deep foundations, underground parking, and others), urban facility networks (sewer systems, water supply networks, heating conduits, etc.), drainage systems, surface water works and many others change continuously. As a consequence, their influence on groundwater changes systematically. These activities nevertheless provide a large quantity of data, so aquifer modelling and behaviour prediction can be done using monitored quantitative and qualitative parameters. Due to the rapid evolution of technology in the past few years, transferring large amounts of information through the internet has become a feasible solution for sharing geoscience data. Furthermore, standard platform-independent means to do this have been developed (specific mark-up languages like GML, GeoSciML, WaterML, GWML and CityML). They allow large geospatial databases to be easily updated and shared through the internet, even between different companies or research centres that do not necessarily use the same database structures. For Bucharest City (Romania), an integrated platform for groundwater geospatial data management is being developed under the framework of a national research project - "Sedimentary media modeling platform for groundwater management in urban areas" (SIMPA) - financed by the National Authority for Scientific Research of Romania. The platform architecture is based on three components: a geospatial database, a desktop application (a complex set of hydrogeological and geological analysis tools) and a front-end geoportal service. The SIMPA platform makes use of mark-up transfer standards to provide a user-friendly application that can be accessed through the internet to query, analyse, and visualise geospatial data related to urban groundwater. The platform holds the information within local groundwater geospatial databases, and the user is able to access these data through a geoportal service. The database architecture allows storing accurate and very detailed geological, hydrogeological, and infrastructure information that can be straightforwardly generalized and further upscaled. The geoportal service offers the possibility of querying a dataset from the spatial database. The query is coded in a standard mark-up language and sent to the server through the standard Hypertext Transfer Protocol (HTTP) to be processed by the local application. After validation of the query, the results are sent back to the user to be displayed by the geoportal application. The main advantage of the SIMPA platform is that it offers the user the possibility to make a primary multi-criteria query, which results in a smaller set of data to be analysed afterwards. This improves both the transfer process parameters and the user's means of creating the desired query.

  4. Some Reliability Issues in Very Large Databases.

    ERIC Educational Resources Information Center

    Lynch, Clifford A.

    1988-01-01

    Describes the unique reliability problems of very large databases that necessitate specialized techniques for hardware problem management. The discussion covers the use of controlled partial redundancy to improve reliability, issues in operating systems and database management systems design, and the impact of disk technology on very large…

  5. Use of large healthcare databases for rheumatology clinical research.

    PubMed

    Desai, Rishi J; Solomon, Daniel H

    2017-03-01

    Large healthcare databases, which contain data collected during routinely delivered healthcare to patients, can serve as a valuable resource for generating actionable evidence to assist medical and healthcare policy decision-making. In this review, we summarize the use of large healthcare databases in rheumatology clinical research. Large healthcare data are critical to evaluate medication safety and effectiveness in patients with rheumatologic conditions. Three major sources of large healthcare data are: first, electronic medical records; second, health insurance claims; and third, patient registries. Each of these sources offers unique advantages, but also has some inherent limitations. To address some of these limitations and maximize the utility of these data sources for evidence generation, recent efforts have focused on linking different data sources. Innovations such as randomized registry trials, which aim to facilitate the design of low-cost randomized controlled trials built on existing infrastructure provided by large healthcare databases, are likely to make clinical research more efficient in coming years. Harnessing the power of information contained in large healthcare databases, while paying close attention to their inherent limitations, is critical to generating a rigorous evidence base for medical decision-making and ultimately enhancing patient care.

  6. An algorithm of discovering signatures from DNA databases on a computer cluster.

    PubMed

    Lee, Hsiao Ping; Sheu, Tzu-Fang

    2014-10-05

    Signatures are short sequences that are unique and not similar to any other sequence in a database; they can be used as the basis for identifying different species. Although several signature discovery algorithms have been proposed in the past, these algorithms require the entire database to be loaded into memory, which restricts the amount of data they can process and makes them unable to handle databases with large amounts of data. These algorithms also use sequential models and have slower discovery speeds, so their efficiency can be improved. In this research, we introduce a divide-and-conquer strategy for signature discovery and propose a parallel signature discovery algorithm for a computer cluster. The algorithm applies the divide-and-conquer strategy to overcome the existing algorithms' inability to process large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with only the memory of regular personal computers, the algorithm can still process large databases, such as the human whole-genome EST database, which the existing algorithms were previously unable to process. The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large-database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
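
    The divide-and-conquer idea can be sketched in a few lines. The sketch below assumes a deliberately simplified definition of a signature (a fixed-length substring that occurs exactly once in the whole database) and processes the database in chunks so that no single step needs the entire database at once; it is an illustration of the strategy, not the DDCSD implementation.

```python
# Minimal divide-and-conquer sketch: split the database into chunks, count k-mers
# per chunk, merge the partial counts, and keep k-mers seen exactly once overall.
from collections import Counter

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def count_chunk(sequences, k):
    """Count k-mer occurrences within one chunk (one worker's share of the database)."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return counts

def discover_signatures(database, k, n_chunks=4):
    # Divide: split the database so each chunk fits in a worker's memory.
    chunks = [database[i::n_chunks] for i in range(n_chunks)]
    # Conquer: count k-mers per chunk (in a real system, one process/node per chunk).
    partial = [count_chunk(chunk, k) for chunk in chunks]
    # Combine: merge partial counts and keep k-mers that occur exactly once overall.
    total = Counter()
    for c in partial:
        total.update(c)
    return [kmer for kmer, n in total.items() if n == 1]

db = ["ACGTACGTTT", "TTTACGAACG", "GGGACGTACG"]  # toy sequences
print(discover_signatures(db, k=5))
```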

  7. The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows. The DBCLS BioHackathon Consortium*.

    PubMed

    Katayama, Toshiaki; Arakawa, Kazuharu; Nakao, Mitsuteru; Ono, Keiichiro; Aoki-Kinoshita, Kiyoko F; Yamamoto, Yasunori; Yamaguchi, Atsuko; Kawashima, Shuichi; Chun, Hong-Woo; Aerts, Jan; Aranda, Bruno; Barboza, Lord Hendrix; Bonnal, Raoul Jp; Bruskiewich, Richard; Bryne, Jan C; Fernández, José M; Funahashi, Akira; Gordon, Paul Mk; Goto, Naohisa; Groscurth, Andreas; Gutteridge, Alex; Holland, Richard; Kano, Yoshinobu; Kawas, Edward A; Kerhornou, Arnaud; Kibukawa, Eri; Kinjo, Akira R; Kuhn, Michael; Lapp, Hilmar; Lehvaslaiho, Heikki; Nakamura, Hiroyuki; Nakamura, Yasukazu; Nishizawa, Tatsuya; Nobata, Chikashi; Noguchi, Tamotsu; Oinn, Thomas M; Okamoto, Shinobu; Owen, Stuart; Pafilis, Evangelos; Pocock, Matthew; Prins, Pjotr; Ranzinger, René; Reisinger, Florian; Salwinski, Lukasz; Schreiber, Mark; Senger, Martin; Shigemoto, Yasumasa; Standley, Daron M; Sugawara, Hideaki; Tashiro, Toshiyuki; Trelles, Oswaldo; Vos, Rutger A; Wilkinson, Mark D; York, William; Zmasek, Christian M; Asai, Kiyoshi; Takagi, Toshihisa

    2010-08-21

    Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems that do not need to transfer entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate these into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, Open Bio* projects, the BioMoby project and researchers of emerging areas where a standard exchange data format is not well established, for an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and the Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues arising from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security, are discussed. Consequently, we improved the interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for an effective advance in bioinformatics web service technologies.

  8. The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows. The DBCLS BioHackathon Consortium*

    PubMed Central

    2010-01-01

    Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems that do not need to transfer entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate these into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, Open Bio* projects, the BioMoby project and researchers of emerging areas where a standard exchange data format is not well established, for an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and the Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues arising from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security, are discussed. Consequently, we improved the interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for an effective advance in bioinformatics web service technologies. PMID:20727200

  9. Identifying sports videos using replay, text, and camera motion features

    NASA Astrophysics Data System (ADS)

    Kobla, Vikrant; DeMenthon, Daniel; Doermann, David S.

    1999-12-01

    Automated classification of digital video is emerging as an important piece of the puzzle in the design of content management systems for digital libraries. The ability to classify videos into various classes such as sports, news, movies, or documentaries increases the efficiency of indexing, browsing, and retrieval of video in large databases. In this paper, we discuss the extraction of features that enable identification of sports videos directly from the compressed domain of MPEG video. These features include detecting the presence of action replays, determining the amount of scene text in video, and calculating various statistics on camera and/or object motion. The features are derived from the macroblock, motion, and bit-rate information that is readily accessible from MPEG video with very minimal decoding, leading to substantial gains in processing speed. Full decoding of selective frames is required only for text analysis. A decision tree classifier built using these features is able to identify sports clips with an accuracy of about 93 percent.

  10. Spelling ability selectively predicts the magnitude of disruption in unspaced text reading.

    PubMed

    Veldre, Aaron; Drieghe, Denis; Andrews, Sally

    2017-09-01

    We examined the effect of individual differences in written language proficiency on unspaced text reading in a large sample of skilled adult readers who were assessed on reading comprehension and spelling ability. Participants' eye movements were recorded as they read sentences containing a low or high frequency target word, presented with standard interword spacing, or in one of three unsegmented text conditions that either preserved or eliminated word boundary information. The average data replicated previous studies: unspaced text reading was associated with increased fixation durations, a higher number of fixations, more regressions, reduced saccade length, and an inflation of the word frequency effect. The individual differences results provided insight into the mechanisms contributing to these effects. Higher reading ability was associated with greater overall reading speed and fluency in all conditions. In contrast, spelling ability selectively modulated the effect of interword spacing with poorer spelling ability predicting greater difficulty across the majority of sentence- and word-level measures. These results suggest that high quality lexical representations allowed better spellers to extract lexical units from unfamiliar text forms, inoculating them against the disruptive effects of being deprived of spacing information. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  11. Does filler database size influence identification accuracy?

    PubMed

    Bergold, Amanda N; Heaton, Paul

    2018-06-01

    Police departments increasingly use large photo databases to select lineup fillers using facial recognition software, but this technological shift's implications have been largely unexplored in eyewitness research. Database use, particularly if coupled with facial matching software, could enable lineup constructors to increase filler-suspect similarity and thus enhance eyewitness accuracy (Fitzgerald, Oriet, Price, & Charman, 2013). However, with a large pool of potential fillers, such technologies might theoretically produce lineup fillers too similar to the suspect (Fitzgerald, Oriet, & Price, 2015; Luus & Wells, 1991; Wells, Rydell, & Seelau, 1993). This research proposes a new factor, filler database size, as a lineup feature affecting eyewitness accuracy. In a facial recognition experiment, we select lineup fillers in a legally realistic manner using facial matching software applied to filler databases of 5,000, 25,000, and 125,000 photos, and find that larger databases are associated with a higher objective similarity rating between suspects and fillers and lower overall identification accuracy. In target-present lineups, witnesses viewing lineups created from the larger databases were less likely to make correct identifications and more likely to select known innocent fillers. When the target was absent, database size was associated with a lower rate of correct rejections and a higher rate of filler identifications. Higher algorithmic similarity ratings were also associated with decreases in eyewitness identification accuracy. The results suggest that using facial matching software to select fillers from large photograph databases may reduce identification accuracy, and provide support for filler database size as a meaningful system variable. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  12. Speeding Up Chemical Searches Using the Inverted Index: the Convergence of Chemoinformatics and Text Search Methods

    PubMed Central

    Nasr, Ramzi; Vernica, Rares; Li, Chen; Baldi, Pierre

    2012-01-01

    In ligand-based screening, retrosynthesis, and other chemoinformatics applications, one often seeks to search large databases of molecules in order to retrieve molecules that are similar to a given query. With the expanding size of molecular databases, the efficiency and scalability of data structures and algorithms for chemical searches are becoming increasingly important. Remarkably, both the chemoinformatics and information retrieval communities have converged on similar solutions whereby molecules or documents are represented by binary vectors, or fingerprints, indexing their substructures such as labeled paths for molecules and n-grams for text, with the same Jaccard-Tanimoto similarity measure. As a result, similarity search methods from one field can be adapted to the other. Here we adapt recent, state-of-the-art, inverted index methods from information retrieval to speed up similarity searches in chemoinformatics. Our results show a several-fold speed-up over previous methods for both threshold searches and top-K searches. We also provide a mathematical analysis that allows one to predict the level of pruning achieved by the inverted index approach, and validate the quality of these predictions through simulation experiments. All results can be replicated using data freely downloadable from http://cdb.ics.uci.edu/. PMID:22462644
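
    A minimal sketch of the inverted-index idea follows. It indexes which molecules contain each fingerprint bit so that a threshold search only scores candidates sharing at least one feature with the query; the fingerprints are toy examples, and the pruning shown is the simplest variant rather than the tighter bounds analysed in the paper.

```python
# Inverted index over binary fingerprints with a Jaccard-Tanimoto threshold search.
from collections import defaultdict

# Each molecule is represented by the set of "on" bits of its fingerprint (toy data).
fingerprints = {
    "mol_A": {1, 4, 7, 9},
    "mol_B": {1, 4, 8},
    "mol_C": {2, 3, 5},
}

# Build the inverted index: feature (bit) -> molecules containing it.
index = defaultdict(set)
for mol, bits in fingerprints.items():
    for b in bits:
        index[b].add(mol)

def tanimoto(a, b):
    return len(a & b) / len(a | b)

def threshold_search(query_bits, threshold):
    # Candidate generation via the index: only molecules sharing a feature are scored.
    candidates = set().union(*(index[b] for b in query_bits if b in index))
    return sorted(
        (mol, tanimoto(query_bits, fingerprints[mol]))
        for mol in candidates
        if tanimoto(query_bits, fingerprints[mol]) >= threshold
    )

print(threshold_search({1, 4, 9}, threshold=0.5))
```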

  13. Automating document classification for the Immune Epitope Database

    PubMed Central

    Wang, Peng; Morgan, Alexander A; Zhang, Qing; Sette, Alessandro; Peters, Bjoern

    2007-01-01

    Background The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. Like similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose. Results We here report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. Improvements on the basic classifier performance were made by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, for example, identify peptide sequences. We have integrated the classifier into the curation process, determining whether abstracts are clearly relevant, clearly irrelevant, or cannot be classified with certainty, in which case the abstracts are classified manually. Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified. Conclusion By implementing text classification, we have sped up the reference selection process without sacrificing the sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools and a large dataset which can serve as a benchmark for tool developers. PMID:17655769
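
    The three-way triage described above can be sketched as follows, with scikit-learn standing in for the authors' classifier: abstracts with a very high or very low predicted probability of relevance are auto-classified, and the uncertain middle band is routed to manual curation. The toy abstracts and the probability cut-offs are illustrative, not the IEDB's.

```python
# Naive Bayes triage: auto-classify confident predictions, defer the rest to curators.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "epitope mapping of a viral peptide recognized by CD8 T cells",
    "MHC class I binding peptides identified by ELISPOT",
    "soil microbial diversity in agricultural fields",
    "thermal properties of novel polymer composites",
]
train_labels = [1, 1, 0, 0]  # 1 = relevant to epitope curation, 0 = irrelevant

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

def triage(abstract, hi=0.95, lo=0.05):
    # Probability of relevance; the hi/lo cut-offs are illustrative assumptions.
    p = clf.predict_proba([abstract])[0][1]
    if p >= hi:
        return "auto: relevant"
    if p <= lo:
        return "auto: irrelevant"
    return "manual review"

print(triage("peptide epitopes binding MHC molecules"))
```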

  14. Meta-Storms: efficient search for similar microbial communities based on a novel indexing scheme and similarity score for metagenomic data.

    PubMed

    Su, Xiaoquan; Xu, Jian; Ning, Kang

    2012-10-01

    It has long intrigued scientists to effectively compare different microbial communities (also referred to as 'metagenomic samples' here) on a large scale: given a set of unknown samples, find similar metagenomic samples from a large repository and examine how similar these samples are. With the metagenomic samples accumulated to date, it is possible to build a database of metagenomic samples of interest. Any metagenomic sample could then be searched against this database to find the most similar metagenomic sample(s). However, on one hand, current databases with a large number of metagenomic samples mostly serve as data repositories that offer few functionalities for analysis; and on the other hand, methods to measure the similarity of metagenomic data work well only for small sets of samples by pairwise comparison. It is not yet clear how to efficiently search for metagenomic samples against a large metagenomic database. In this study, we have proposed a novel method, Meta-Storms, that can systematically and efficiently organize and search metagenomic data. It includes the following components: (i) creating a database of metagenomic samples based on their taxonomical annotations, (ii) efficient indexing of samples in the database based on a hierarchical taxonomy indexing strategy, (iii) searching for a metagenomic sample against the database with a fast scoring function based on quantitative phylogeny and (iv) managing the database by index export, index import, data insertion, data deletion and database merging. We have collected more than 1300 metagenomic datasets from the public domain and in-house facilities, and tested the Meta-Storms method on these datasets. Our experimental results show that Meta-Storms is capable of database creation and effective searching for a large number of metagenomic samples, and it achieves accuracies similar to the current popular significance testing-based methods. The Meta-Storms method would serve as a suitable database management and search system to quickly identify similar metagenomic samples from a large pool of samples. Contact: ningkang@qibebt.ac.cn. Supplementary data are available at Bioinformatics online.
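
    The hierarchical, taxonomy-aware scoring idea can be illustrated with a small sketch. Samples are represented as relative abundances of taxa given as lineage tuples, and shared abundance found deeper in the taxonomy is weighted more heavily; this is an illustration of the concept, not the exact Meta-Storms scoring function.

```python
# Toy hierarchical similarity between two metagenomic samples; weights are assumptions.
def hierarchical_similarity(sample_a, sample_b, level_weights=(0.1, 0.3, 0.6)):
    score = 0.0
    for depth, weight in enumerate(level_weights, start=1):
        # Aggregate abundance at this taxonomic depth (e.g. phylum, family, genus).
        agg_a, agg_b = {}, {}
        for lineage, abundance in sample_a.items():
            agg_a[lineage[:depth]] = agg_a.get(lineage[:depth], 0.0) + abundance
        for lineage, abundance in sample_b.items():
            agg_b[lineage[:depth]] = agg_b.get(lineage[:depth], 0.0) + abundance
        # Shared abundance at this depth contributes with the depth's weight.
        shared = sum(min(agg_a[t], agg_b.get(t, 0.0)) for t in agg_a)
        score += weight * shared
    return score

sample_1 = {("Firmicutes", "Lactobacillaceae", "Lactobacillus"): 0.7,
            ("Bacteroidetes", "Bacteroidaceae", "Bacteroides"): 0.3}
sample_2 = {("Firmicutes", "Lactobacillaceae", "Lactobacillus"): 0.5,
            ("Firmicutes", "Streptococcaceae", "Streptococcus"): 0.5}

print(round(hierarchical_similarity(sample_1, sample_2), 3))
```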

  15. Library Instruction and Online Database Searching.

    ERIC Educational Resources Information Center

    Mercado, Heidi

    1999-01-01

    Reviews changes in online database searching in academic libraries. Topics include librarians conducting all searches; the advent of end-user searching and the need for user instruction; compact disk technology; online public catalogs; the Internet; full text databases; electronic information literacy; user education and the remote library user;…

  16. Using SQL Databases for Sequence Similarity Searching and Analysis.

    PubMed

    Pearson, William R; Mackey, Aaron J

    2017-09-13

    Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.
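
    The workflow of loading similarity search results into a relational database and summarizing them with SQL can be sketched as follows. The table layout and the toy rows are hypothetical and much simpler than the seqdb_demo/search_demo schemas described in the unit.

```python
import sqlite3

# Hypothetical table of BLAST/FASTA-style hits loaded into SQLite (toy rows).
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE hits (
    query_acc   TEXT,
    subject_acc TEXT,
    taxon       TEXT,
    evalue      REAL,
    identity    REAL
)""")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
    [("query_1", "subj_A", "Gammaproteobacteria", 0.0,   100.0),
     ("query_1", "subj_B", "Thermotogae",         1e-40,  48.2),
     ("query_1", "subj_C", "Aquificae",           1e-35,  44.1)],
)

# Summarize homologs per taxonomic group for one query protein.
for taxon, n, best in conn.execute("""
    SELECT taxon, COUNT(*) AS n, MIN(evalue) AS best
    FROM hits
    WHERE query_acc = 'query_1' AND evalue < 1e-10
    GROUP BY taxon
    ORDER BY best
"""):
    print(taxon, n, best)
```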

  17. Can we replace curation with information extraction software?

    PubMed

    Karp, Peter D

    2016-01-01

    Can we use programs for automated or semi-automated information extraction from scientific texts as practical alternatives to professional curation? I show that error rates of current information extraction programs are too high to replace professional curation today. Furthermore, current IEP programs extract single narrow slivers of information, such as individual protein interactions; they cannot extract the large breadth of information extracted by professional curators for databases such as EcoCyc. They also cannot arbitrate among conflicting statements in the literature as curators can. Therefore, funding agencies should not hobble the curation efforts of existing databases on the assumption that a problem that has stymied Artificial Intelligence researchers for more than 60 years will be solved tomorrow. Semi-automated extraction techniques appear to have significantly more potential based on a review of recent tools that enhance curator productivity. But a full cost-benefit analysis for these tools is lacking. Without such analysis it is possible to expend significant effort developing information-extraction tools that automate small parts of the overall curation workflow without achieving a significant decrease in curation costs.Database URL. © The Author(s) 2016. Published by Oxford University Press.

  18. Information Extraction for Clinical Data Mining: A Mammography Case Study

    PubMed Central

    Nassif, Houssam; Woods, Ryan; Burnside, Elizabeth; Ayvaci, Mehmet; Shavlik, Jude; Page, David

    2013-01-01

    Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon (BI-RADS) to describe and report their findings. Mammography records are then stored in a well-defined database format (NMD). Lately, researchers have applied data mining and machine learning techniques to these databases. They successfully built breast cancer classifiers that can help in early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format and/or solely consist of text reports. BI-RADS feature extraction from free text and consistency checks between recorded predictive variables and text reports are crucial to addressing this problem. We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS features extraction algorithm for clinical data mining. It consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar based on the BI-RADS lexicon and the experts’ input. It parses sentences detecting BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text, filtering out ultrasound concepts. On our dataset, our algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level. PMID:23765123
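
    The concept finder and negation detector described above can be illustrated with a small regex-based sketch. The mini-lexicon and negation cues below are illustrative and far smaller than the real BI-RADS vocabulary and the authors' semantic grammar.

```python
# Sentence splitting, lexicon-based concept finding, and simple negation detection.
import re

LEXICON = {
    "mass": "Mass",
    "calcification": "Calcification",
    "architectural distortion": "Architectural distortion",
}
NEGATIONS = re.compile(r"\b(no|without|negative for)\b", re.IGNORECASE)

def extract_concepts(report):
    findings = []
    # Syntax analysis: split the report into individual sentences.
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        for phrase, concept in LEXICON.items():
            if re.search(r"\b" + re.escape(phrase) + r"s?\b", sentence, re.IGNORECASE):
                negated = bool(NEGATIONS.search(sentence))
                findings.append((concept, "absent" if negated else "present"))
    return findings

report = "There is a spiculated mass in the upper outer quadrant. No calcifications are seen."
print(extract_concepts(report))
```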

  19. Information Extraction for Clinical Data Mining: A Mammography Case Study.

    PubMed

    Nassif, Houssam; Woods, Ryan; Burnside, Elizabeth; Ayvaci, Mehmet; Shavlik, Jude; Page, David

    2009-01-01

    Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon (BI-RADS) to describe and report their findings. Mammography records are then stored in a well-defined database format (NMD). Lately, researchers have applied data mining and machine learning techniques to these databases. They successfully built breast cancer classifiers that can help in early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format and/or solely consist of text reports. BI-RADS feature extraction from free text and consistency checks between recorded predictive variables and text reports are crucial to addressing this problem. We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS features extraction algorithm for clinical data mining. It consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar based on the BI-RADS lexicon and the experts' input. It parses sentences detecting BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text, filtering out ultrasound concepts. On our dataset, our algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level.

  20. Classifying injury narratives of large administrative databases for surveillance-A practical approach combining machine learning ensembles and human review.

    PubMed

    Marucci-Wellman, Helen R; Corns, Helen L; Lehto, Mark R

    2017-01-01

    Injury narratives are now available in real time and include useful information for injury surveillance and prevention. However, manual classification of the cause or events leading to injury found in large batches of narratives, such as workers compensation claims databases, can be prohibitive. In this study we compare the utility of four machine learning algorithms (Naïve Bayes single-word and bi-gram models, Support Vector Machine and Logistic Regression) for classifying narratives into Bureau of Labor Statistics Occupational Injury and Illness event leading to injury classifications for a large workers compensation database. These algorithms are known to do well classifying narrative text and are fairly easy to implement with off-the-shelf software packages such as Python. We propose human-machine learning ensemble approaches which maximize the power and accuracy of the algorithms for machine-assigned codes and allow for strategic filtering of rare, emerging or ambiguous narratives for manual review. We compare human-machine approaches based on filtering on the prediction strength of the classifier vs. agreement between algorithms. Regularized Logistic Regression (LR) was the best performing algorithm alone. Using this algorithm and filtering out the bottom 30% of predictions for manual review resulted in high accuracy (overall sensitivity/positive predictive value of 0.89) of the final machine-human coded dataset. The best pairings of algorithms included Naïve Bayes with Support Vector Machine, whereby the triple ensemble NB(single-word) = NB(bi-gram) = SVM had very high performance (0.93 overall sensitivity/positive predictive value) and high accuracy (i.e. high sensitivity and positive predictive values) across both large and small categories, leaving 41% of the narratives for manual review. Integrating LR into this ensemble mix improved performance only slightly. For large administrative datasets we propose incorporation of methods based on human-machine pairings such as we have done here, utilizing readily available off-the-shelf machine learning techniques and resulting in only a fraction of narratives that require manual review. Human-machine ensemble methods are likely to improve performance over total manual coding. Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.
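
    The agreement-based human-machine ensemble idea can be sketched as follows: two off-the-shelf classifiers code a narrative automatically only when they agree, and disagreements are queued for manual review. The narratives and event codes are toy examples, not BLS OIICS categories, and scikit-learn stands in for the authors' implementations.

```python
# Agreement filtering between a Naive Bayes and an SVM text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = [
    "slipped on wet floor and fell",
    "fell from ladder while painting",
    "cut hand on box cutter",
    "laceration from sharp metal edge",
]
train_labels = ["fall", "fall", "cut", "cut"]

nb = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(train_texts, train_labels)
svm = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(train_texts, train_labels)

def code_narrative(text):
    # Auto-assign the code only when both classifiers agree; otherwise defer to a human.
    a, b = nb.predict([text])[0], svm.predict([text])[0]
    return ("auto", a) if a == b else ("manual review", None)

print(code_narrative("worker fell off a step ladder"))
print(code_narrative("struck by forklift in warehouse"))
```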

  1. Enhancing navigation in biomedical databases by community voting and database-driven text classification

    PubMed Central

    Duchrow, Timo; Shtatland, Timur; Guettler, Daniel; Pivovarov, Misha; Kramer, Stefan; Weissleder, Ralph

    2009-01-01

    Background The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them. Results Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly. Conclusion Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at . PMID:19799796
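
    The use of bagged decision trees with class-probability estimates can be illustrated with a short sketch; the fraction of trees voting for each class is the kind of confidence score that could drive the heat map described above. The toy abstracts and categories are illustrative, not PepBank's.

```python
# Bagged decision trees over TF-IDF features, exposing class probabilities.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

train_texts = [
    "peptide inhibits tumor angiogenesis in xenograft model",
    "anti-angiogenic peptide blocks VEGF signalling",
    "fluorescent peptide probe for in vivo molecular imaging",
    "radiolabelled peptide tracer for PET imaging",
]
train_labels = ["angiogenesis", "angiogenesis", "imaging", "imaging"]

model = make_pipeline(
    TfidfVectorizer(),
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
)
model.fit(train_texts, train_labels)

# Class probabilities (the fraction of trees voting for each class) can be rendered
# as a confidence heat map next to each search result.
proba = model.predict_proba(["peptide probe for optical imaging of tumors"])[0]
print(dict(zip(model.classes_, proba.round(2))))
```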

  2. Generating descriptive visual words and visual phrases for large-scale image applications.

    PubMed

    Zhang, Shiliang; Tian, Qi; Hua, Gang; Huang, Qingming; Gao, Wen

    2011-09-01

    Bag-of-visual Words (BoWs) representation has been applied to various problems in the fields of multimedia and computer vision. The basic idea is to represent images as visual documents composed of repeatable and distinctive visual elements, which are comparable to text words. Notwithstanding its great success and wide adoption, a visual vocabulary created from single-image local descriptors is often shown to be not as effective as desired. In this paper, descriptive visual words (DVWs) and descriptive visual phrases (DVPs) are proposed as the visual correspondences to text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs. Since images are the carriers of visual objects and scenes, a descriptive visual element set can be composed of the visual words and their combinations which are effective in representing certain visual objects or scenes. Based on this idea, a general framework is proposed for generating DVWs and DVPs for image applications. In a large-scale image database containing 1506 object and scene categories, the visual words and visual word pairs descriptive of certain objects or scenes are identified and collected as the DVWs and DVPs. Experiments show that the DVWs and DVPs are informative and descriptive and, thus, are more comparable with text words than the classic visual words. We apply the identified DVWs and DVPs in several applications including large-scale near-duplicated image retrieval, image search re-ranking, and object recognition. The combination of DVW and DVP performs better than the state of the art in large-scale near-duplicated image retrieval in terms of accuracy, efficiency and memory consumption. The proposed image search re-ranking algorithm, DWPRank, outperforms the state-of-the-art algorithm by 12.4% in mean average precision and is about 11 times faster.

  3. Text mining for the biocuration workflow

    PubMed Central

    Hirschman, Lynette; Burns, Gully A. P. C; Krallinger, Martin; Arighi, Cecilia; Cohen, K. Bretonnel; Valencia, Alfonso; Wu, Cathy H.; Chatr-Aryamontri, Andrew; Dowell, Karen G.; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129

  4. Text mining for the biocuration workflow.

    PubMed

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.

  5. Design and deployment of a large brain-image database for clinical and nonclinical research

    NASA Astrophysics Data System (ADS)

    Yang, Guo Liang; Lim, Choie Cheio Tchoyoson; Banukumar, Narayanaswami; Aziz, Aamer; Hui, Francis; Nowinski, Wieslaw L.

    2004-04-01

    An efficient database is an essential component of organizing diverse information on image metadata and patient information for research in medical imaging. This paper describes the design, development and deployment of a large database system serving as a brain image repository that can be used across different platforms in various medical research projects. It forms the infrastructure that links hospitals and institutions together and shares data among them. The database contains patient-, pathology-, image-, research- and management-specific data. The functionalities of the database system include image uploading, storage, indexing, downloading and sharing, as well as database querying and management, with security and data anonymization concerns well taken care of. The structure of the database is a multi-tier client-server architecture with a Relational Database Management System, Security Layer, Application Layer and User Interface. An image source adapter has been developed to handle most of the popular image formats. The database has a user interface based on web browsers and is easy to handle. We have used the Java programming language for its platform independence and vast function libraries. The brain image database can sort data according to clinically relevant information. This can be used effectively in research from the clinicians' point of view. The database is suitable for validation of algorithms on large populations of cases. Medical images for processing can be identified and organized based on information in image metadata. Clinical research in various pathologies can thus be performed with greater efficiency, and large image repositories can be managed more effectively. A prototype of the system has been installed in a few hospitals and is working to the satisfaction of the clinicians.

  6. Biblio-MetReS for user-friendly mining of genes and biological processes in scientific documents.

    PubMed

    Usie, Anabel; Karathia, Hiren; Teixidó, Ivan; Alves, Rui; Solsona, Francesc

    2014-01-01

    One way to initiate the reconstruction of molecular circuits is by using automated text-mining techniques. Developing more efficient methods for such reconstruction is a topic of active research, and those methods are typically included by bioinformaticians in pipelines used to mine and curate large literature datasets. Nevertheless, experimental biologists have a limited number of available user-friendly tools that use text mining for network reconstruction and require no programming skills. One of these tools is Biblio-MetReS. Originally, this tool performed on-the-fly analysis of documents contained in a number of web-based literature databases to identify co-occurrences of proteins/genes. This approach ensured results that were always up to date with the latest live version of the databases. However, this 'up-to-dateness' came at the cost of long execution times. Here we report an evolution of the application Biblio-MetReS that permits constructing co-occurrence networks for genes, GO processes, pathways, or any combination of the three types of entities, and graphically representing those entities. We show that the performance of Biblio-MetReS in identifying gene co-occurrence is at least as good as that of other comparable applications (STRING and iHOP). In addition, we also show that the identification of GO processes is on par with that reported in the latest BioCreAtIvE challenge. Finally, we also report the implementation of a new strategy that combines on-the-fly analysis of new documents with preprocessed information from documents encountered in previous analyses. This combination simultaneously decreases program run time and maintains the 'up-to-dateness' of the results. http://metres.udl.cat/index.php/downloads, metres.cmb@gmail.com.

  7. Large-scale extraction of brain connectivity from the neuroscientific literature

    PubMed Central

    Richardet, Renaud; Chappelier, Jean-Cédric; Telefont, Martin; Hill, Sean

    2015-01-01

    Motivation: In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles. One challenge for modern neuroinformatics is finding methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and the integration of such data into computational models. A key example of this is metascale brain connectivity, where results are not reported in a normalized repository. Instead, these experimental results are published in natural language, scattered among individual scientific publications. This lack of normalization and centralization hinders the large-scale integration of brain connectivity results. In this article, we present text-mining models to extract and aggregate brain connectivity results from 13.2 million PubMed abstracts and 630 216 full-text publications related to neuroscience. The brain regions are identified with three different named entity recognizers (NERs) and then normalized against two atlases: the Allen Brain Atlas (ABA) and the atlas from the Brain Architecture Management System (BAMS). We then use three different extractors to assess inter-region connectivity. Results: NERs and connectivity extractors are evaluated against a manually annotated corpus. The complete in litero extraction models are also evaluated against in vivo connectivity data from ABA with an estimated precision of 78%. The resulting database contains over 4 million brain region mentions and over 100 000 (ABA) and 122 000 (BAMS) potential brain region connections. This database drastically accelerates connectivity literature review, by providing a centralized repository of connectivity data to neuroscientists. Availability and implementation: The resulting models are publicly available at github.com/BlueBrain/bluima. Contact: renaud.richardet@epfl.ch Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25609795

  8. GarlicESTdb: an online database and mining tool for garlic EST sequences.

    PubMed

    Kim, Dae-Won; Jung, Tae-Sung; Nam, Seong-Hyeuk; Kwon, Hyuk-Ryul; Kim, Aeri; Chae, Sung-Hwa; Choi, Sang-Haeng; Kim, Dong-Wook; Kim, Ryong Nam; Park, Hong-Seog

    2009-05-18

    Allium sativum, commonly known as garlic, is a species in the onion genus (Allium), which is a large and diverse genus containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary and medicinal purposes and for its health benefits. Currently, interest in garlic is increasing rapidly due to its nutritional and pharmaceutical value for conditions including high blood pressure and cholesterol, atherosclerosis and cancer. Despite this, there are no comprehensive databases of garlic Expressed Sequence Tags (ESTs) available for gene discovery and future genome annotation efforts. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in Java and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information in MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/Java Servlet) for browsing and querying the database and for creating dynamic web pages on the client side; for mapping annotated enzymes to KEGG pathways, the AJAX framework was also used in part. The online resources, such as putative annotations, single nucleotide polymorphism (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. The GarlicESTdb web application is freely available at http://garlicdb.kribb.re.kr. GarlicESTdb is the first integrated online information database of EST sequences isolated from garlic that can be freely accessed and downloaded. It has many useful features for interactive mining of EST contigs and datasets from each library, including curation of annotated information, expression profiling, information retrieval, and summary statistics of functional annotation. Consequently, the development of GarlicESTdb will provide a crucial contribution to biologists for data mining and more efficient experimental studies.

  9. A Data Analysis Expert System For Large Established Distributed Databases

    NASA Astrophysics Data System (ADS)

    Gnacek, Anne-Marie; An, Y. Kim; Ryan, J. Patrick

    1987-05-01

    The purpose of this work is to analyze the applicability of artificial intelligence techniques for developing a user-friendly, parallel interface to large isolated, incompatible NASA databases for the purpose of assisting the management decision process. To carry out this work, a survey was conducted to establish the data access requirements of several key NASA user groups. In addition, current NASA database access methods were evaluated. The results of this work are presented in the form of a design for a natural language database interface system, called the Deductively Augmented NASA Management Decision Support System (DANMDS). This design is feasible principally because of recently announced commercial hardware and software product developments which allow cross-vendor compatibility. The goal of the DANMDS system is commensurate with the central dilemma confronting most large companies and institutions in America, the retrieval of information from large, established, incompatible database systems. The DANMDS system implementation would represent a significant first step toward this problem's resolution.

  10. Use of Patient Registries and Administrative Datasets for the Study of Pediatric Cancer

    PubMed Central

    Rice, Henry E.; Englum, Brian R.; Gulack, Brian C.; Adibe, Obinna O.; Tracy, Elizabeth T.; Kreissman, Susan G.; Routh, Jonathan C.

    2015-01-01

    Analysis of data from large administrative databases and patient registries is increasingly being used to study childhood cancer care, although the value of these data sources remains unclear to many clinicians. Interpretation of large databases requires a thorough understanding of how the dataset was designed, how data were collected, and how to assess data quality. This review will detail the role of administrative databases and registry databases for the study of childhood cancer, tools to maximize information from these datasets, and recommendations to improve the use of these databases for the study of pediatric oncology. PMID:25807938

  11. Recognizing References to Physical Places Inside Short Texts by Using Patterns as a Sequence of Grammatical Categories

    NASA Astrophysics Data System (ADS)

    Romero, A.; Sol, D.

    2017-09-01

    Collecting data by crowdsourcing is a well-explored trend for supporting database population and updating. This kind of data is unstructured and comes from text, in particular text in social networks. A geographic database is a particular case of a database that can be populated by crowdsourcing, which can happen when people report an urban event in a social network by writing a short message. An event can describe an accident or a non-functioning device in the urban area. The authorities then need to read and interpret the message to provide help for injured people or to fix a problem with a device installed in the urban area, such as a street light, or a problem on the road. Our main interest lies in working with short messages organized in a collection. Most of the messages do not have geographical coordinates. The messages can, however, be classified by text patterns describing a location; in fact, people use text patterns to describe an urban location. Our work tries to identify such patterns inside a short text and to indicate when it describes a location. When a pattern is identified, our approach attempts to describe the place where the event is located. The source messages used are tweets reporting events from several Mexican cities.

  12. Creating databases for biological information: an introduction.

    PubMed

    Stein, Lincoln

    2013-06-01

    The essence of bioinformatics is dealing with large quantities of information. Whether it be sequencing data, microarray data files, mass spectrometric data (e.g., fingerprints), the catalog of strains arising from an insertional mutagenesis project, or even large numbers of PDF files, there inevitably comes a time when the information can simply no longer be managed with files and directories. This is where databases come into play. This unit briefly reviews the characteristics of several database management systems, including flat file, indexed file, relational databases, and NoSQL databases. It compares their strengths and weaknesses and offers some general guidelines for selecting an appropriate database management system. Copyright 2013 by John Wiley & Sons, Inc.

  13. Geologic map and map database of parts of Marin, San Francisco, Alameda, Contra Costa, and Sonoma counties, California

    USGS Publications Warehouse

    Blake, M.C.; Jones, D.L.; Graymer, R.W.; digital database by Soule, Adam

    2000-01-01

    This digital map database, compiled from previously published and unpublished data, and new mapping by the authors, represents the general distribution of bedrock and surficial deposits in the mapped area. Together with the accompanying text file (mageo.txt, mageo.pdf, or mageo.ps), it provides current information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The scale of the source maps limits the spatial resolution (scale) of the database to 1:62,500 or smaller.

  14. Improving semi-text-independent method of writer verification using difference vector

    NASA Astrophysics Data System (ADS)

    Li, Xin; Ding, Xiaoqing

    2009-01-01

    The semi-text-independent method of writer verification based on the linear framework is a method that can use all characters of two handwritings to discriminate between writers when the text contents are known. The handwritings are allowed to share only a small number of characters, or even totally different characters. This fills the gap between the classical text-dependent methods and the text-independent methods of writer verification. Moreover, the identity of each character is used in the semi-text-independent method in this paper. Two types of standard templates, generated from many writer-unknown handwritten samples and printed samples of each character, are introduced to represent the content information of each character. The difference vectors of the character samples are obtained by subtracting the standard templates from the original feature vectors and are used to replace the original vectors in the process of writer verification. By removing a large amount of content information and retaining the style information, the verification accuracy of the semi-text-independent method is improved. On a handwriting database involving 30 writers, when the query handwriting and the reference handwriting are each composed of 30 distinct characters, the average equal error rate (EER) of writer verification reaches 9.96%. When the handwritings contain 50 characters, the average EER falls to 6.34%, which is 23.9% lower than the EER obtained without using the difference vectors.
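
    The difference-vector idea can be sketched with a few lines of numpy: subtracting a character's standard (writer-independent) template from a sample's feature vector removes content information and keeps style information. The feature dimensions and values below are illustrative, not the features used in the paper.

```python
import numpy as np

# Standard templates: mean feature vector of each character over many writers (toy values).
templates = {
    "a": np.array([0.52, 0.31, 0.77, 0.40]),
    "b": np.array([0.18, 0.66, 0.25, 0.59]),
}

def difference_vectors(samples):
    """samples: list of (character_label, feature_vector) from one handwriting."""
    return np.array([vec - templates[ch] for ch, vec in samples])

def writer_distance(handwriting_1, handwriting_2):
    # Compare writing style via the distance between mean difference vectors.
    d1 = difference_vectors(handwriting_1).mean(axis=0)
    d2 = difference_vectors(handwriting_2).mean(axis=0)
    return float(np.linalg.norm(d1 - d2))

query = [("a", np.array([0.55, 0.30, 0.80, 0.42])), ("b", np.array([0.20, 0.70, 0.28, 0.60]))]
reference = [("a", np.array([0.50, 0.33, 0.74, 0.38]))]
print(round(writer_distance(query, reference), 3))
```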

  15. DrugQuest - a text mining workflow for drug association discovery.

    PubMed

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Vizirianakis, Ioannis S; Iliopoulos, Ioannis

    2016-06-06

    Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of biomedical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated to mining other types of repositories such as chemical databases. Herein, we apply a text mining approach to the DrugBank database in order to explore drug associations based on the DrugBank "Description", "Indication", "Pharmacodynamics" and "Mechanism of Action" text fields. We apply Named Entity Recognition (NER) techniques on these fields to identify chemicals, proteins, genes, pathways and diseases, and we utilize the TextQuest algorithm to find additional biologically significant words. Using a plethora of similarity and partitional clustering techniques, we group the DrugBank records based on their common terms and investigate possible reasons why these records are clustered together. Different views, such as clusters of chemicals based on their textual information and tag clouds of significant terms alongside the terms used for clustering, are delivered to the user through a user-friendly web interface. DrugQuest is a text mining tool for knowledge discovery: it is designed to cluster DrugBank records based on text attributes in order to find new associations between drugs. The service is freely available at http://bioinformatics.med.uoc.gr/drugquest .
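
    The following is a rough, hedged sketch of the kind of text-field clustering DrugQuest describes, using TF-IDF vectors and k-means from scikit-learn; the record texts and the choice of two clusters are invented, and the actual DrugQuest pipeline (NER, TextQuest term selection, several similarity and partitional clustering techniques) is considerably richer.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical DrugBank-style records: drug name -> concatenated text fields.
    records = {
        "aspirin":   "inhibits cyclooxygenase, reducing prostaglandin synthesis and inflammation",
        "ibuprofen": "nonsteroidal anti-inflammatory drug inhibiting cyclooxygenase enzymes",
        "metformin": "decreases hepatic glucose production in type 2 diabetes mellitus",
    }

    names = list(records)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(records.values())

    # Group records by their common terms and print the resulting cluster labels.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    for name, label in zip(names, km.labels_):
        print(label, name)
    ```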

  16. IntPath--an integrated pathway gene relationship database for model organisms and important pathogens.

    PubMed

    Zhou, Hufeng; Jin, Jingjing; Zhang, Haojun; Yi, Bo; Wozniak, Michal; Wong, Limsoon

    2012-01-01

    Pathway data are important for understanding the relationship between genes, proteins and many other molecules in living organisms. Pathway gene relationships are crucial information for guidance, prediction, reference and assessment in biochemistry, computational biology, and medicine. Many well-established databases--e.g., KEGG, WikiPathways, and BioCyc--are dedicated to collecting pathway data for public access. However, the effectiveness of these databases is hindered by issues such as incompatible data formats, inconsistent molecular representations, inconsistent molecular relationship representations, inconsistent referrals to pathway names, and incomplete coverage across databases. In this paper, we overcome these issues through extraction, normalization and integration of pathway data from several major public databases (KEGG, WikiPathways, BioCyc, etc.). We build a database that not only hosts our integrated pathway gene relationship data for public access but also maintains the necessary updates in the long run. This public repository is named IntPath (Integrated Pathway gene relationship database for model organisms and important pathogens). Four organisms--S. cerevisiae, M. tuberculosis H37Rv, H. sapiens and M. musculus--are included in this version (V2.0) of IntPath. IntPath uses the "full unification" approach to ensure that nothing is deleted and no noise is introduced in this process. Therefore, IntPath contains much richer pathway-gene and pathway-gene pair relationships and a much larger number of non-redundant genes and gene pairs than any of the single-source databases. The gene relationships of each gene (measured by average node degree) per pathway are significantly richer. The gene relationships in each pathway (measured by the average number of gene pairs per pathway) are also considerably richer in the integrated pathways. Moderate manual curation is involved to remove errors and noise from the source data (e.g., the gene ID errors in WikiPathways and relationship errors in KEGG). We turn complicated and incompatible XML data formats and inconsistent gene and gene relationship representations from different source databases into normalized and unified pathway-gene and pathway-gene pair relationships neatly recorded in simple tab-delimited text format and MySQL tables, which facilitates convenient automatic computation and large-scale referencing in many related studies. IntPath data can be downloaded in text format or as a MySQL dump. IntPath data can also be retrieved and analyzed conveniently through a web service by local programs or through a web interface by mouse clicks. Several useful analysis tools are also provided in IntPath. We have overcome in IntPath the issues of compatibility, consistency, and comprehensiveness that often hamper effective use of pathway databases. We have included four organisms in the current release of IntPath. Our methodology and programs described in this work can be easily applied to other organisms, and we will include more model organisms and important pathogens in future releases of IntPath. IntPath maintains regular updates and is freely available at http://compbio.ddns.comp.nus.edu.sg:8080/IntPath.

  17. Very Large Data Volumes Analysis of Collaborative Systems with Finite Number of States

    ERIC Educational Resources Information Center

    Ivan, Ion; Ciurea, Cristian; Pavel, Sorin

    2010-01-01

    The collaborative system with finite number of states is defined. A very large database is structured. Operations on large databases are identified. Repetitive procedures for collaborative systems operations are derived. The efficiency of such procedures is analyzed. (Contains 6 tables, 5 footnotes and 3 figures.)

  18. WebCSD: the online portal to the Cambridge Structural Database

    PubMed Central

    Thomas, Ian R.; Bruno, Ian J.; Cole, Jason C.; Macrae, Clare F.; Pidcock, Elna; Wood, Peter A.

    2010-01-01

    WebCSD, a new web-based application developed by the Cambridge Crystallographic Data Centre, offers fast searching of the Cambridge Structural Database using only a standard internet browser. Search facilities include two-dimensional substructure, molecular similarity, text/numeric and reduced cell searching. Text, chemical diagrams and three-dimensional structural information can all be studied in the results browser using the efficient entry summaries and embedded three-dimensional viewer. PMID:22477776

  19. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

    The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374

  20. Quaternary Geology and Liquefaction Susceptibility, San Francisco, California 1:100,000 Quadrangle: A Digital Database

    USGS Publications Warehouse

    Knudsen, Keith L.; Noller, Jay S.; Sowers, Janet M.; Lettis, William R.

    1997-01-01

    This Open-File report is a digital geologic map database. This pamphlet serves to introduce and describe the digital data. There are no paper maps included in the Open-File report. The report does include, however, PostScript plot files containing the images of the geologic map sheets with explanations, as well as the accompanying text describing the geology of the area. For those interested in a paper plot of information contained in the database or in obtaining the PostScript plot files, please see the section entitled 'For Those Who Aren't Familiar With Digital Geologic Map Databases' below. This digital map database, compiled from previously unpublished data, and new mapping by the authors, represents the general distribution of surficial deposits in the San Francisco bay region. Together with the accompanying text file (sf_geo.txt or sf_geo.pdf), it provides current information on Quaternary geology and liquefaction susceptibility of the San Francisco, California, 1:100,000 quadrangle. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The scale of the source maps limits the spatial resolution (scale) of the database to 1:100,000 or smaller. The content and character of the database, as well as three methods of obtaining the database, are described below.

  1. Classroom Laboratory Report: Using an Image Database System in Engineering Education.

    ERIC Educational Resources Information Center

    Alam, Javed; And Others

    1991-01-01

    Describes an image database system assembled using separate computer components that was developed to overcome text-only computer hardware storage and retrieval limitations for a pavement design class. (JJK)

  2. A general temporal data model and the structured population event history register

    PubMed Central

    Clark, Samuel J.

    2010-01-01

    At this time there are 37 demographic surveillance system sites active in sub-Saharan Africa, Asia and Central America, and this number is growing continuously. These sites and other longitudinal population and health research projects generate large quantities of complex temporal data in order to describe, explain and investigate the event histories of individuals and the populations they constitute. This article presents possible solutions to some of the key data management challenges associated with those data. The fundamental components of a temporal system are identified and both they and their relationships to each other are given simple, standardized definitions. Further, a metadata framework is proposed to endow this abstract generalization with specific meaning and to bind the definitions of the data to the data themselves. The result is a temporal data model that is generalized, conceptually tractable, and inherently contains a full description of the primary data it organizes. Individual databases utilizing this temporal data model can be customized to suit the needs of their operators without modifying the underlying design of the database or sacrificing the potential to transparently share compatible subsets of their data with other similar databases. A practical working relational database design based on this general temporal data model is presented and demonstrated. This work has arisen out of experience with demographic surveillance in the developing world, and although the challenges and their solutions are more general, the discussion is organized around applications in demographic surveillance. An appendix contains detailed examples and working prototype databases that implement the examples discussed in the text. PMID:20396614

  3. Teaching Case: Adapting the Access Northwind Database to Support a Database Course

    ERIC Educational Resources Information Center

    Dyer, John N.; Rogers, Camille

    2015-01-01

    A common problem encountered when teaching database courses is that few large illustrative databases exist to support teaching and learning. Most database textbooks have small "toy" databases that are chapter objective specific, and thus do not support application over the complete domain of design, implementation and management concepts…

  4. Large-Scale 1:1 Computing Initiatives: An Open Access Database

    ERIC Educational Resources Information Center

    Richardson, Jayson W.; McLeod, Scott; Flora, Kevin; Sauers, Nick J.; Kannan, Sathiamoorthy; Sincar, Mehmet

    2013-01-01

    This article details the spread and scope of large-scale 1:1 computing initiatives around the world. What follows is a review of the existing literature around 1:1 programs followed by a description of the large-scale 1:1 database. Main findings include: 1) the XO and the Classmate PC dominate large-scale 1:1 initiatives; 2) if professional…

  5. A Database as a Service for the Healthcare System to Store Physiological Signal Data.

    PubMed

    Chang, Hsien-Tsung; Lin, Tsai-Huei

    2016-01-01

    Wearable devices that measure physiological signals to help develop self-health management habits have become increasingly popular in recent years. These records are conducive to follow-up health and medical care. In this study, based on the characteristics of the observed physiological signal records--1) a large number of users, 2) a large amount of data, 3) low information variability, 4) data privacy authorization, and 5) data access by designated users--we wish to resolve physiological signal record-relevant issues utilizing the advantages of the Database as a Service (DaaS) model. Storing a large amount of data using file patterns can reduce database load, allowing users to access data efficiently; the privacy control settings allow users to store data securely. The results of the experiment show that the proposed system has better database access performance than a traditional relational database, with a small difference in database volume, thus proving that the proposed system can improve data storage performance.

  6. A Database as a Service for the Healthcare System to Store Physiological Signal Data

    PubMed Central

    Lin, Tsai-Huei

    2016-01-01

    Wearable devices that measure physiological signals to help develop self-health management habits have become increasingly popular in recent years. These records are conducive for follow-up health and medical care. In this study, based on the characteristics of the observed physiological signal records– 1) a large number of users, 2) a large amount of data, 3) low information variability, 4) data privacy authorization, and 5) data access by designated users—we wish to resolve physiological signal record-relevant issues utilizing the advantages of the Database as a Service (DaaS) model. Storing a large amount of data using file patterns can reduce database load, allowing users to access data efficiently; the privacy control settings allow users to store data securely. The results of the experiment show that the proposed system has better database access performance than a traditional relational database, with a small difference in database volume, thus proving that the proposed system can improve data storage performance. PMID:28033415

  7. Development and evaluation of a Naïve Bayesian model for coding causation of workers' compensation claims.

    PubMed

    Bertke, S J; Meyers, A R; Wurzelbacher, S J; Bell, J; Lampl, M L; Robins, D

    2012-12-01

    Tracking and trending rates of injuries and illnesses classified as musculoskeletal disorders (MSDs) caused by ergonomic risk factors such as overexertion and repetitive motion, and as slips, trips, or falls (STFs), in different industry sectors is of high interest to many researchers. Unfortunately, identifying the cause of injuries and illnesses in large datasets such as workers' compensation systems often requires reading and coding the free-form accident text narrative for potentially millions of records. To alleviate the need for manual coding, this paper describes and evaluates a computer auto-coding algorithm that demonstrated the ability to code millions of claims quickly and accurately by learning from a set of previously manually coded claims. The auto-coding program was able to code claims as an MSD, STF, or other with approximately 90% accuracy. The program developed and discussed in this paper provides an accurate and efficient method for identifying the causation of workers' compensation claims as an STF or MSD in a large database based on the unstructured text narrative and resulting injury diagnoses. The program coded thousands of claims in minutes. The method described in this paper can be used by researchers and practitioners to relieve the manual burden of reading and identifying the causation of claims as an STF or MSD. Furthermore, the method can be easily generalized to code/classify other unstructured text narratives. Published by Elsevier Ltd.
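
    A minimal sketch of the auto-coding idea described above: train a Naive Bayes text classifier on manually coded claim narratives and use it to label new claims. The narratives, labels, and features below are invented for illustration and do not reproduce the published model or its reported accuracy.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical manually coded claim narratives (free text -> cause category).
    train_texts = [
        "employee slipped on wet floor and fell",
        "worker tripped over cable and fell down stairs",
        "repetitive lifting of boxes caused lower back strain",
        "overexertion while pulling a pallet led to shoulder pain",
        "cut finger on box cutter while opening cartons",
    ]
    train_labels = ["STF", "STF", "MSD", "MSD", "OTHER"]

    # Bag-of-words (with bigrams) feeding a multinomial Naive Bayes model.
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(model.predict(["back strain from lifting heavy crates",
                         "fell on icy sidewalk in the parking lot"]))
    ```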

  8. Design and implementation of a distributed large-scale spatial database system based on J2EE

    NASA Astrophysics Data System (ADS)

    Gong, Jianya; Chen, Nengcheng; Zhu, Xinyan; Zhang, Xia

    2003-03-01

    With the increasing maturity of distributed object technology, CORBA, .NET and EJB are universally used in the traditional IT field. However, theories and practices of distributed spatial databases need further improvement owing to the tension between large-scale spatial data and limited network bandwidth, and between transitory sessions and long transaction processing. Differences and trends among CORBA, .NET and EJB are discussed in detail, and the concept, architecture and characteristics of a distributed large-scale seamless spatial database system based on J2EE are then presented; the system comprises a GIS client application, a web server, a GIS application server and a spatial data server. Moreover, the design and implementation of the components are explained: the GIS client application based on JavaBeans, the GIS engine based on servlets, and the GIS application server based on GIS enterprise JavaBeans (containing session beans and entity beans). In addition, experiments on the relationship between spatial data and response time under different conditions were conducted, which show that a distributed spatial database system based on J2EE can be used to manage, distribute and share large-scale spatial data on the Internet. Lastly, a distributed large-scale seamless image database based on the Internet is presented.

  9. Data Compression in Full-Text Retrieval Systems.

    ERIC Educational Resources Information Center

    Bell, Timothy C.; And Others

    1993-01-01

    Describes compression methods for components of full-text systems such as text databases on CD-ROM. Topics discussed include storage media; structures for full-text retrieval, including indexes, inverted files, and bitmaps; compression tools; memory requirements during retrieval; and ranking and information retrieval. (Contains 53 references.)…
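
    The record above names inverted files as one of the core structures for full-text retrieval. The toy sketch below, written only for illustration and not taken from the cited work, builds a minimal inverted index over a few invented documents and answers a conjunctive (AND) query.

    ```python
    from collections import defaultdict

    docs = {
        1: "compression of text databases on cd rom",
        2: "inverted files and bitmaps for full text retrieval",
        3: "ranking and information retrieval in large text collections",
    }

    # Inverted index: term -> set of document ids containing that term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(*terms):
        """Return ids of documents containing all query terms."""
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    print(search("text", "retrieval"))  # {2, 3}
    ```

    Real systems store compressed posting lists on disk and add ranking, which is exactly the trade-off between index size and retrieval speed the record discusses.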

  10. TabSQL: a MySQL tool to facilitate mapping user data to public databases.

    PubMed

    Xia, Xiao-Qin; McClelland, Michael; Wang, Yipeng

    2010-06-23

    With advances in high-throughput genomics and proteomics, it is challenging for biologists to deal with large data files and to map their data to annotations in public databases. We developed TabSQL, a MySQL-based application tool, for viewing, filtering and querying data files with large numbers of rows. TabSQL provides functions for downloading and installing table files from public databases including the Gene Ontology database (GO), the Ensembl databases, and genome databases from the UCSC genome bioinformatics site. Any other database that provides tab-delimited flat files can also be imported. The downloaded gene annotation tables can be queried together with users' data in TabSQL using either a graphic interface or command line. TabSQL allows queries across the user's data and public databases without programming. It is a convenient tool for biologists to annotate and enrich their data.
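
    TabSQL itself is MySQL-based; the sketch below only illustrates the general idea of loading a tab-delimited annotation table into a relational store and querying it together with the user's own data, using SQLite so the example is self-contained. The table names, columns, and values are invented.

    ```python
    import csv
    import io
    import sqlite3

    # Stand-in for a downloaded tab-delimited annotation file (columns invented).
    annotation_tsv = (
        "gene_id\tsymbol\tdescription\n"
        "G1\tTP53\ttumor protein p53\n"
        "G2\tBRCA1\tbreast cancer 1\n"
    )

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annotation (gene_id TEXT, symbol TEXT, description TEXT)")
    rows = list(csv.reader(io.StringIO(annotation_tsv), delimiter="\t"))[1:]
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)

    # The user's own data: gene ids with measured fold changes.
    conn.execute("CREATE TABLE user_data (gene_id TEXT, fold_change REAL)")
    conn.executemany("INSERT INTO user_data VALUES (?, ?)", [("G1", 2.4), ("G2", 0.5)])

    # Query across the user's data and the annotation table with a single join.
    for row in conn.execute(
            "SELECT u.gene_id, a.symbol, u.fold_change "
            "FROM user_data u JOIN annotation a USING (gene_id)"):
        print(row)
    ```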

  11. TabSQL: a MySQL tool to facilitate mapping user data to public databases

    PubMed Central

    2010-01-01

    Background With advances in high-throughput genomics and proteomics, it is challenging for biologists to deal with large data files and to map their data to annotations in public databases. Results We developed TabSQL, a MySQL-based application tool, for viewing, filtering and querying data files with large numbers of rows. TabSQL provides functions for downloading and installing table files from public databases including the Gene Ontology database (GO), the Ensembl databases, and genome databases from the UCSC genome bioinformatics site. Any other database that provides tab-delimited flat files can also be imported. The downloaded gene annotation tables can be queried together with users' data in TabSQL using either a graphic interface or command line. Conclusions TabSQL allows queries across the user's data and public databases without programming. It is a convenient tool for biologists to annotate and enrich their data. PMID:20573251

  12. OSTMED.DR®, an Osteopathic Medicine Digital Library.

    PubMed

    Fitterling, Lori; Powers, Elaine; Vardell, Emily

    2018-01-01

    The OSTMED.DR® database provides access to both citation and full-text osteopathic literature, including the Journal of the American Osteopathic Association. Currently, it is a free database searchable using basic and advanced search features.

  13. An Update on Electronic Information Sources.

    ERIC Educational Resources Information Center

    Ackerman, Katherine

    1987-01-01

    This review of new developments and products in online services discusses trends in travel related services; full text databases; statistical source databases; an emphasis on regional and international business news; and user friendly systems. (Author/CLB)

  14. Creating databases for biological information: an introduction.

    PubMed

    Stein, Lincoln

    2002-08-01

    The essence of bioinformatics is dealing with large quantities of information. Whether it be sequencing data, microarray data files, mass spectrometric data (e.g., fingerprints), the catalog of strains arising from an insertional mutagenesis project, or even large numbers of PDF files, there inevitably comes a time when the information can simply no longer be managed with files and directories. This is where databases come into play. This unit briefly reviews the characteristics of several database management systems, including flat file, indexed file, and relational databases, as well as ACeDB. It compares their strengths and weaknesses and offers some general guidelines for selecting an appropriate database management system.

  15. Computer Cataloging of Electronic Journals in Unstable Aggregator Databases: The Hong Kong Baptist University Library Experience.

    ERIC Educational Resources Information Center

    Li, Yiu-On; Leung, Shirley W.

    2001-01-01

    Discussion of aggregator databases focuses on a project at the Hong Kong Baptist University library to integrate full-text electronic journal titles from three unstable aggregator databases into its online public access catalog (OPAC). Explains the development of the electronic journal computer program (EJCOP) to generate MARC records for…

  16. bioNerDS: exploring bioinformatics’ database and software use through literature mining

    PubMed Central

    2013-01-01

    Background Biology-focused databases and software define bioinformatics and their use is central to computational biology. In such a complex and dynamic field, it is of interest to understand what resources are available, which are used, how much they are used, and for what they are used. While scholarly literature surveys can provide some insights, large-scale computer-based approaches to identify mentions of bioinformatics databases and software from primary literature would automate systematic cataloguing, facilitate the monitoring of usage, and provide the foundations for the recovery of computational methods for analysing biological data, with the long-term aim of identifying best/common practice in different areas of biology. Results We have developed bioNerDS, a named entity recogniser for the recovery of bioinformatics databases and software from primary literature. We identify such entities with an F-measure ranging from 63% to 91% at the mention level and 63-78% at the document level, depending on corpus. Not attaining a higher F-measure is mostly due to high ambiguity in resource naming, which is compounded by the on-going introduction of new resources. To demonstrate the software, we applied bioNerDS to full-text articles from BMC Bioinformatics and Genome Biology. General mention patterns reflect the remit of these journals, highlighting BMC Bioinformatics’s emphasis on new tools and Genome Biology’s greater emphasis on data analysis. The data also illustrates some shifts in resource usage: for example, the past decade has seen R and the Gene Ontology join BLAST and GenBank as the main components in bioinformatics processing. Conclusions We demonstrate the feasibility of automatically identifying resource names on a large-scale from the scientific literature and show that the generated data can be used for exploration of bioinformatics database and software usage. For example, our results help to investigate the rate of change in resource usage and corroborate the suspicion that a vast majority of resources are created, but rarely (if ever) used thereafter. bioNerDS is available at http://bionerds.sourceforge.net/. PMID:23768135

  17. Intelligent Text Retrieval and Knowledge Acquisition from Texts for NASA Applications: Preprocessing Issues

    NASA Technical Reports Server (NTRS)

    2002-01-01

    A system that retrieves problem reports from a NASA database is described. The database is queried with natural language questions. Part-of-speech tags are first assigned to each word in the question using a rule-based tagger. A partial parse of the question is then produced with independent sets of deterministic finite state automata. Using partial parse information, a look-up strategy searches the database for problem reports relevant to the question. A bigram stemmer and irregular verb conjugates have been incorporated into the system to improve accuracy. The system is evaluated on a set of fifty-five questions posed by NASA engineers. A discussion of future research is also presented.

  18. Cell line name recognition in support of the identification of synthetic lethality in cancer from text

    PubMed Central

    Kaewphan, Suwisa; Van Landeghem, Sofie; Ohta, Tomoko; Van de Peer, Yves; Ginter, Filip; Pyysalo, Sampo

    2016-01-01

    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers. Availability and implementation: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/. Contact: sukaew@utu.fi PMID:26428294
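
    The tagger described above couples a CRF model (NERsuite) with a dictionary of cell line names; as a rough stand-in for the dictionary component only, the sketch below marks exact, case-insensitive dictionary matches in text. The miniature dictionary and sentence are invented, and this is not the published system.

    ```python
    import re

    # Invented miniature dictionary; the published dictionary is far larger.
    cell_lines = {"hela", "k562", "mcf-7"}

    def tag_cell_lines(text):
        """Yield (start, end, surface form) for tokens found in the dictionary."""
        for match in re.finditer(r"[A-Za-z0-9\-]+", text):
            if match.group().lower() in cell_lines:
                yield match.start(), match.end(), match.group()

    sentence = "Knockdown in HeLa and K562 cells reduced viability."
    print(list(tag_cell_lines(sentence)))
    ```

    A dictionary pass like this misses variant spellings and ambiguous names, which is why the evaluated system adds a trained CRF on top of the dictionary.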

  19. The structure of Diagnostic and Statistical Manual of Mental Disorders (4th edition, text revision) personality disorder symptoms in a large national sample.

    PubMed

    Trull, Timothy J; Vergés, Alvaro; Wood, Phillip K; Jahng, Seungmin; Sher, Kenneth J

    2012-10-01

    We examined the latent structure underlying the criteria for DSM-IV-TR (American Psychiatric Association, 2000, Diagnostic and statistical manual of mental disorders (4th ed., text revision). Washington, DC: Author.) personality disorders in a large nationally representative sample of U.S. adults. Personality disorder symptom data were collected using a structured diagnostic interview from approximately 35,000 adults assessed over two waves of data collection in the National Epidemiologic Survey on Alcohol and Related Conditions. Our analyses suggested that a seven-factor solution provided the best fit for the data, and these factors were marked primarily by one or at most two personality disorder criteria sets. A series of regression analyses that used external validators tapping Axis I psychopathology, treatment for mental health problems, functioning scores, interpersonal conflict, and suicidal ideation and behavior provided support for the seven-factor solution. We discuss these findings in the context of previous studies that have examined the structure underlying the personality disorder criteria as well as the current proposals for DSM-5 personality disorders. (PsycINFO Database Record (c) 2012 APA, all rights reserved).

  20. How to locate and appraise qualitative research in complementary and alternative medicine

    PubMed Central

    2013-01-01

    Background The aim of this publication is to present a case study of how to locate and appraise qualitative studies for the conduct of a meta-ethnography in the field of complementary and alternative medicine (CAM). CAM is commonly associated with individualized medicine. However, one established scientific approach to the individual, qualitative research, thus far has been explicitly used very rarely. This article demonstrates a case example of how qualitative research in the field of CAM studies was identified and critically appraised. Methods Several search terms and techniques were tested for the identification and appraisal of qualitative CAM research in the conduct of a meta-ethnography. Sixty-seven electronic databases were searched for the identification of qualitative CAM trials, including CAM databases, nursing, nutrition, psychological, social, medical databases, the Cochrane Library and DIMDI. Results 9578 citations were screened, 223 articles met the pre-specified inclusion criteria, 63 full text publications were reviewed, 38 articles were appraised qualitatively and 30 articles were included. The search began with PubMed, yielding 87% of the included publications of all databases with few additional relevant findings in the specific databases. CINHAL and DIMDI also revealed a high number of precise hits. Although CAMbase and CAM-QUEST® focus on CAM research only, almost no hits of qualitative trials were found there. Searching with broad text terms was the most effective search strategy in all databases. Conclusions This publication presents a case study on how to locate and appraise qualitative studies in the field of CAM. The example shows that the literature search for qualitative studies in the field of CAM is most effective when the search is begun in PubMed followed by CINHAL or DIMDI using broad text terms. Exclusive CAM databases delivered no additional findings to locate qualitative CAM studies. PMID:23731997

  1. How to locate and appraise qualitative research in complementary and alternative medicine.

    PubMed

    Franzel, Brigitte; Schwiegershausen, Martina; Heusser, Peter; Berger, Bettina

    2013-06-03

    The aim of this publication is to present a case study of how to locate and appraise qualitative studies for the conduct of a meta-ethnography in the field of complementary and alternative medicine (CAM). CAM is commonly associated with individualized medicine. However, one established scientific approach to the individual, qualitative research, thus far has been explicitly used very rarely. This article demonstrates a case example of how qualitative research in the field of CAM studies was identified and critically appraised. Several search terms and techniques were tested for the identification and appraisal of qualitative CAM research in the conduct of a meta-ethnography. Sixty-seven electronic databases were searched for the identification of qualitative CAM trials, including CAM databases, nursing, nutrition, psychological, social, medical databases, the Cochrane Library and DIMDI. 9578 citations were screened, 223 articles met the pre-specified inclusion criteria, 63 full text publications were reviewed, 38 articles were appraised qualitatively and 30 articles were included. The search began with PubMed, yielding 87% of the included publications of all databases with few additional relevant findings in the specific databases. CINHAL and DIMDI also revealed a high number of precise hits. Although CAMbase and CAM-QUEST® focus on CAM research only, almost no hits of qualitative trials were found there. Searching with broad text terms was the most effective search strategy in all databases. This publication presents a case study on how to locate and appraise qualitative studies in the field of CAM. The example shows that the literature search for qualitative studies in the field of CAM is most effective when the search is begun in PubMed followed by CINHAL or DIMDI using broad text terms. Exclusive CAM databases delivered no additional findings to locate qualitative CAM studies.

  2. [Privacy and public benefit in using large scale health databases].

    PubMed

    Yamamoto, Ryuichi

    2014-01-01

    In Japan, large-scale health databases have been constructed over the past few years, such as the National Claim insurance and health checkup database (NDB) and the Japanese Sentinel project. But there are some legal issues in striking an adequate balance between privacy and public benefit when using such databases. The NDB is operated based on the act for elderly persons' health care, but this act says nothing about using the database for general public benefit. Therefore, researchers who use this database are forced to pay a great deal of attention to anonymization and information security, which may disturb the research work itself. The Japanese Sentinel project is a national project for detecting adverse drug reactions using large-scale distributed clinical databases of large hospitals. Although patients give broad future consent for such public-good purposes, the use of insufficiently anonymized data is still under discussion. Generally speaking, researchers conducting studies for public benefit will not infringe patients' privacy, but vague and complex legislative requirements about personal data protection may disturb such research. Medical science does not progress without using clinical information; therefore, adequate legislation that is simple and clear for both researchers and patients is strongly required. In Japan, a specific act for balancing privacy and public benefit is now under discussion. The author recommends that researchers, including those in the field of pharmacology, pay attention to, participate in the discussion of, and make suggestions regarding such acts or regulations.

  3. High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joslyn, Cliff A.; Adolf, Robert D.; Al-Saffar, Sinan

    2010-10-04

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture and present the results of deploying it for the analysis of the Billion Triple dataset with respect to its semantic factors.

  4. BioC: a minimalist approach to interoperability for biomedical text processing

    PubMed Central

    Comeau, Donald C.; Islamaj Doğan, Rezarta; Ciccarese, Paolo; Cohen, Kevin Bretonnel; Krallinger, Martin; Leitner, Florian; Lu, Zhiyong; Peng, Yifan; Rinaldi, Fabio; Torii, Manabu; Valencia, Alfonso; Verspoor, Karin; Wiegers, Thomas C.; Wu, Cathy H.; Wilbur, W. John

    2013-01-01

    A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/ PMID:24048470

  5. National Water Quality Standards Database (NWQSD)

    EPA Pesticide Factsheets

    The National Water Quality Standards Database (WQSDB) provides access to EPA and state water quality standards (WQS) information in text, tables, and maps. This data source was last updated in December 2007 and will no longer be updated.

  6. A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies.

    PubMed

    Jagtap, Pratik; Goslinga, Jill; Kooren, Joel A; McGowan, Thomas; Wroblewski, Matthew S; Seymour, Sean L; Griffin, Timothy J

    2013-04-01

    Large databases (>10^6 sequences) used in metaproteomic and proteogenomic studies present challenges in matching peptide sequences to MS/MS data using database-search programs. Most notably, strict filtering to avoid false-positive matches leads to more false negatives, thus constraining the number of peptide matches. To address this challenge, we developed a two-step method wherein matches derived from a primary search against a large database were used to create a smaller subset database. The second search was performed against a target-decoy version of this subset database merged with a host database. High confidence peptide sequence matches were then used to infer protein identities. Applying our two-step method for both metaproteomic and proteogenomic analysis resulted in twice the number of high confidence peptide sequence matches in each case, as compared to the conventional one-step method. The two-step method captured almost all of the same peptides matched by the one-step method, with a majority of the additional matches being false negatives from the one-step method. Furthermore, the two-step method improved results regardless of the database search program used. Our results show that our two-step method maximizes the peptide matching sensitivity for applications requiring large databases, especially valuable for proteogenomics and metaproteomics studies. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
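
    A highly simplified sketch of the two-step idea: take the protein accessions matched in a first-pass search and build a smaller subset FASTA database from them, which would then be merged with a host database and searched again with target-decoy filtering. File names, accessions, and the FASTA parsing below are invented placeholders; the actual workflow relies on real database-search engines.

    ```python
    def read_fasta(path):
        """Minimal FASTA reader yielding (header, sequence) pairs."""
        header, seq = None, []
        with open(path) as handle:
            for line in handle:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line[1:], []
                else:
                    seq.append(line)
            if header is not None:
                yield header, "".join(seq)

    def write_subset_db(large_db, first_pass_hits, subset_path):
        """Step 1 -> step 2: keep only proteins matched in the primary search."""
        with open(subset_path, "w") as out:
            for header, seq in read_fasta(large_db):
                accession = header.split()[0]
                if accession in first_pass_hits:
                    out.write(f">{header}\n{seq}\n")

    # Usage with placeholder paths and accessions:
    # hits = {"sp|P12345|EXAMPLE1", "sp|Q67890|EXAMPLE2"}
    # write_subset_db("metaproteome.fasta", hits, "subset.fasta")
    ```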

  7. Geology of Point Reyes National Seashore and vicinity, California: a digital database

    USGS Publications Warehouse

    Clark, Jospeh C.; Brabb, Earl E.

    1997-01-01

    This Open-File report is a digital geologic map database. This pamphlet serves to introduce and describe the digital data. There is no paper map included in the Open-File report. The report does include, however, a PostScript plot file containing an image of the geologic map sheet with explanation, as well as the accompanying text describing the geology of the area. For those interested in a paper plot of information contained in the database or in obtaining the PostScript plot files, please see the section entitled 'For Those Who Aren't Familiar With Digital Geologic Map Databases' below. This digital map database, compiled from previously published and unpublished data and new mapping by the authors, represents the general distribution of surficial deposits and rock units in Point Reyes and surrounding areas. Together with the accompanying text file (pr-geo.txt or pr-geo.ps), it provides current information on the stratigraphy and structural geology of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The scale of the source maps limits the spatial resolution (scale) of the database to 1:48,000 or smaller.

  8. Evaluation of five full-text drug databases by pharmacy students, faculty, and librarians: do the groups agree?

    PubMed

    Kupferberg, Natalie; Jones Hartel, Lynda

    2004-01-01

    The purpose of this study is to assess the usefulness of five full-text drug databases as evaluated by medical librarians, pharmacy faculty, and pharmacy students at an academic health center. Study findings and recommendations are offered as guidance to librarians responsible for purchasing decisions. Four pharmacy students, four pharmacy faculty members, and four medical librarians answered ten drug information questions using the databases AHFS Drug Information (STAT!Ref); DRUGDEX (Micromedex); eFacts (Drug Facts and Comparisons); Lexi-Drugs Online (Lexi-Comp); and the PDR Electronic Library (Micromedex). Participants noted whether each database contained answers to the questions and evaluated each database on ease of navigation, screen readability, overall satisfaction, and product recommendation. While each study group found that DRUGDEX provided the most direct answers to the ten questions, faculty members gave Lexi-Drugs the highest overall rating. Students favored eFacts. The faculty and students found the PDR least useful. Librarians ranked DRUGDEX the highest and AHFS the lowest. The comments of pharmacy faculty and students show that these groups preferred concise, easy-to-use sources; librarians focused on the comprehensiveness, layout, and supporting references of the databases. This study demonstrates the importance of consulting with primary clientele before purchasing databases. Although there are many online drug databases to consider, present findings offer strong support for eFacts, Lexi-Drugs, and DRUGDEX.

  9. Evaluation of five full-text drug databases by pharmacy students, faculty, and librarians: do the groups agree?

    PubMed Central

    Kupferberg, Natalie; Hartel, Lynda Jones

    2004-01-01

    Objectives: The purpose of this study is to assess the usefulness of five full-text drug databases as evaluated by medical librarians, pharmacy faculty, and pharmacy students at an academic health center. Study findings and recommendations are offered as guidance to librarians responsible for purchasing decisions. Methods: Four pharmacy students, four pharmacy faculty members, and four medical librarians answered ten drug information questions using the databases AHFS Drug Information (STAT!Ref); DRUGDEX (Micromedex); eFacts (Drug Facts and Comparisons); Lexi-Drugs Online (Lexi-Comp); and the PDR Electronic Library (Micromedex). Participants noted whether each database contained answers to the questions and evaluated each database on ease of navigation, screen readability, overall satisfaction, and product recommendation. Results: While each study group found that DRUGDEX provided the most direct answers to the ten questions, faculty members gave Lexi-Drugs the highest overall rating. Students favored eFacts. The faculty and students found the PDR least useful. Librarians ranked DRUGDEX the highest and AHFS the lowest. The comments of pharmacy faculty and students show that these groups preferred concise, easy-to-use sources; librarians focused on the comprehensiveness, layout, and supporting references of the databases. Conclusion: This study demonstrates the importance of consulting with primary clientele before purchasing databases. Although there are many online drug databases to consider, present findings offer strong support for eFacts, Lexi-Drugs, and DRUGDEX. PMID:14762464

  10. PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more.

    PubMed

    Liu, Yifeng; Liang, Yongjie; Wishart, David

    2015-07-01

    PolySearch2 (http://polysearch.ca) is an online text-mining system for identifying relationships between biomedical entities such as human diseases, genes, SNPs, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies. PolySearch2 supports a generalized 'Given X, find all associated Ys' query, where X and Y can be selected from the aforementioned biomedical entities. An example query might be: 'Find all diseases associated with Bisphenol A'. To find its answers, PolySearch2 searches for associations against comprehensive collections of free-text collections, including local versions of MEDLINE abstracts, PubMed Central full-text articles, Wikipedia full-text articles and US Patent application abstracts. PolySearch2 also searches 14 widely used, text-rich biological databases such as UniProt, DrugBank and Human Metabolome Database to improve its accuracy and coverage. PolySearch2 maintains an extensive thesaurus of biological terms and exploits the latest search engine technology to rapidly retrieve relevant articles and database records. PolySearch2 also generates, ranks and annotates associative candidates and presents results with relevancy statistics and highlighted key sentences to facilitate user interpretation. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
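
    Not PolySearch2's actual scoring, but a toy illustration of the underlying 'Given X, find all associated Ys' idea: count how often a query entity co-occurs with candidate entities across a small set of texts and rank the candidates. The corpus, query, and candidate lists are invented.

    ```python
    from collections import Counter

    abstracts = [
        "bisphenol a exposure was associated with obesity in this cohort",
        "bisphenol a has been linked to diabetes and insulin resistance",
        "no association between caffeine and diabetes was observed",
    ]
    query = "bisphenol a"
    candidate_diseases = ["obesity", "diabetes", "insulin resistance", "asthma"]

    # Count co-mentions of the query X with each candidate Y.
    counts = Counter()
    for text in abstracts:
        if query in text:
            for disease in candidate_diseases:
                if disease in text:
                    counts[disease] += 1

    for disease, n in counts.most_common():
        print(disease, n)
    ```

    The real system relies on thesaurus-driven term recognition, relevancy statistics and key-sentence evidence rather than the raw substring co-occurrence used here.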

  11. PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more

    PubMed Central

    Liu, Yifeng; Liang, Yongjie; Wishart, David

    2015-01-01

    PolySearch2 (http://polysearch.ca) is an online text-mining system for identifying relationships between biomedical entities such as human diseases, genes, SNPs, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies. PolySearch2 supports a generalized ‘Given X, find all associated Ys’ query, where X and Y can be selected from the aforementioned biomedical entities. An example query might be: ‘Find all diseases associated with Bisphenol A’. To find its answers, PolySearch2 searches for associations against comprehensive collections of free-text collections, including local versions of MEDLINE abstracts, PubMed Central full-text articles, Wikipedia full-text articles and US Patent application abstracts. PolySearch2 also searches 14 widely used, text-rich biological databases such as UniProt, DrugBank and Human Metabolome Database to improve its accuracy and coverage. PolySearch2 maintains an extensive thesaurus of biological terms and exploits the latest search engine technology to rapidly retrieve relevant articles and database records. PolySearch2 also generates, ranks and annotates associative candidates and presents results with relevancy statistics and highlighted key sentences to facilitate user interpretation. PMID:25925572

  12. Orthographic and Phonological Neighborhood Databases across Multiple Languages.

    PubMed

    Marian, Viorica

    2017-01-01

    The increased globalization of science and technology and the growing number of bilinguals and multilinguals in the world have made research with multiple languages a mainstay for scholars who study human function and especially those who focus on language, cognition, and the brain. Such research can benefit from large-scale databases and online resources that describe and measure lexical, phonological, orthographic, and semantic information. The present paper discusses currently-available resources and underscores the need for tools that enable measurements both within and across multiple languages. A general review of language databases is followed by a targeted introduction to databases of orthographic and phonological neighborhoods. A specific focus on CLEARPOND illustrates how databases can be used to assess and compare neighborhood information across languages, to develop research materials, and to provide insight into broad questions about language. As an example of how using large-scale databases can answer questions about language, a closer look at neighborhood effects on lexical access reveals that not only orthographic, but also phonological neighborhoods can influence visual lexical access both within and across languages. We conclude that capitalizing upon large-scale linguistic databases can advance, refine, and accelerate scientific discoveries about the human linguistic capacity.
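
    As a concrete illustration of the orthographic-neighborhood measures such databases report, the sketch below counts, for a target word, the words in a small invented lexicon that have the same length and differ by exactly one letter (a common substitution-only definition of orthographic neighbors); CLEARPOND itself covers large lexicons across several languages and also provides phonological and cross-language neighborhood measures.

    ```python
    def is_orthographic_neighbor(w1, w2):
        """Same length and exactly one differing letter (substitution only)."""
        return (len(w1) == len(w2) and w1 != w2 and
                sum(a != b for a, b in zip(w1, w2)) == 1)

    lexicon = ["cat", "bat", "hat", "car", "cart", "can", "dog"]

    def neighborhood(word, words):
        return [w for w in words if is_orthographic_neighbor(word, w)]

    print(neighborhood("cat", lexicon))  # ['bat', 'hat', 'car', 'can']
    ```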

  13. Large Scale Landslide Database System Established for the Reservoirs in Southern Taiwan

    NASA Astrophysics Data System (ADS)

    Tsai, Tsai-Tsung; Tsai, Kuang-Jung; Shieh, Chjeng-Lun

    2017-04-01

    Typhoon Morakot's severe impact on southern Taiwan awakened public awareness of large-scale landslide disasters. Large-scale landslide disasters produce large quantities of sediment, which negatively affects the operating functions of reservoirs. In order to reduce the risk of these disasters within the study area, the establishment of a database for hazard mitigation / disaster prevention is necessary. Real-time data and numerous archives of engineering data, environmental information, photos, and videos will not only help people make appropriate decisions, but also pose the major challenge of processing and adding value to these data. The study tried to define some basic data formats / standards from the various types of data collected about these reservoirs and then provide a management platform based on these formats / standards. Meanwhile, in order to satisfy practicality and convenience, the large-scale landslide disaster database system is built with abilities both to provide and to receive information, so that users can use it on different types of devices. IT technology progresses extremely quickly, and even the most modern system might become out of date at any time. In order to provide long-term service, the system reserves the possibility of user-defined data formats / standards and a user-defined system structure. The system established by this study is based on the HTML5 standard language and uses responsive web design technology, so that users can easily operate and further develop this large-scale landslide disaster database system.

  14. Machine Learning and Decision Support in Critical Care

    PubMed Central

    Johnson, Alistair E. W.; Ghassemi, Mohammad M.; Nemati, Shamim; Niehaus, Katherine E.; Clifton, David A.; Clifford, Gari D.

    2016-01-01

    Clinical data management systems typically provide caregiver teams with useful information, derived from large, sometimes highly heterogeneous, data sources that are often changing dynamically. Over the last decade there has been a significant surge in interest in using these data sources, from simply re-using the standard clinical databases for event prediction or decision support, to including dynamic and patient-specific information into clinical monitoring and prediction problems. However, in most cases, commercial clinical databases have been designed to document clinical activity for reporting, liability and billing reasons, rather than for developing new algorithms. With increasing excitement surrounding “secondary use of medical records” and “Big Data” analytics, it is important to understand the limitations of current databases and what needs to change in order to enter an era of “precision medicine.” This review article covers many of the issues involved in the collection and preprocessing of critical care data. The three challenges in critical care are considered: compartmentalization, corruption, and complexity. A range of applications addressing these issues are covered, including the modernization of static acuity scoring; on-line patient tracking; personalized prediction and risk assessment; artifact detection; state estimation; and incorporation of multimodal data sources such as genomic and free text data. PMID:27765959

  15. EnsMart: A Generic System for Fast and Flexible Access to Biological Data

    PubMed Central

    Kasprzyk, Arek; Keefe, Damian; Smedley, Damian; London, Darin; Spooner, William; Melsopp, Craig; Hammond, Martin; Rocca-Serra, Philippe; Cox, Tony; Birney, Ewan

    2004-01-01

    The EnsMart system (www.ensembl.org/EnsMart) provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to `non-Ensembl' data sets. PMID:14707178

  16. Large-scale annotation of small-molecule libraries using public databases.

    PubMed

    Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A

    2007-01-01

    While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to encompass an annotation interface for large numbers of compounds and tend to be cost prohibitive to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern day high-throughput screening (HTS) campaign presently occurs only under a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that potentially could improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in such databases as PubChem and the World Drug Index (WDI) as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, the exact structure match analysis showed 32% of GNF compounds can be linked to third party databases via PubChem. We also showed annotations such as MeSH (medical subject headings) terms can be applied to in-house HTS databases in identifying signature biological inhibition profiles of interest as well as expediting the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision making process.

  17. Comparison of the Frontier Distributed Database Caching System to NoSQL Databases

    NASA Astrophysics Data System (ADS)

    Dykstra, Dave

    2012-12-01

    One of the main attractions of non-relational “NoSQL” databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide-area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.

  18. Comparison of the Frontier Distributed Database Caching System to NoSQL Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dykstra, Dave

    One of the main attractions of non-relational NoSQL databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide-area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.

  19. Graphics-based intelligent search and abstracting using Data Modeling

    NASA Astrophysics Data System (ADS)

    Jaenisch, Holger M.; Handley, James W.; Case, Carl T.; Songy, Claude G.

    2002-11-01

    This paper presents an autonomous text and context-mining algorithm that converts text documents into point clouds for visual search cues. This algorithm is applied to the task of data-mining a scriptural database comprised of the Old and New Testaments from the Bible and the Book of Mormon, Doctrine and Covenants, and the Pearl of Great Price. Results are generated which graphically show the scripture that represents the average concept of the database and the mining of the documents down to the verse level.

  20. The Flip Sides of Full-Text: Superindex and the Harvard Business Review/Online.

    ERIC Educational Resources Information Center

    Dadlez, Eva M.

    1984-01-01

    This article illustrates similarities between two different types of full-text databases--Superindex, Harvard Business Review/Online--and uses them as an arena to demonstrate search and display applications of full-text. The selection of logical operators, full-text search strategies, and keywords and Bibliographic Retrieval Service's Occurrence…

  1. Interactome of the hepatitis C virus: Literature mining with ANDSystem.

    PubMed

    Saik, Olga V; Ivanisenko, Timofey V; Demenkov, Pavel S; Ivanisenko, Vladimir A

    2016-06-15

    A study of the molecular genetic mechanisms of host-pathogen interactions is of paramount importance in developing drugs against viral diseases. Currently, the literature contains a huge amount of information that describes interactions between HCV and human proteins. In addition, there are many factual databases that contain experimentally verified data on HCV-host interactions. The sources of such data are the original data along with the data manually extracted from the literature. However, the manual analysis of scientific publications is time consuming and, because of this, databases created with such an approach often do not have complete information. One of the most promising methods for keeping information current and complete is text mining. Here, with the use of a method previously developed by the authors on the basis of ANDSystem, an automated extraction of information on the interactions between HCV and human proteins was conducted. As a data source for the text mining approach, PubMed abstracts and full text articles were used. Additionally, external factual databases were analyzed. On the basis of this analysis, a special version of ANDSystem, extended with the HCV interactome, was created. The HCV interactome contains information about the interactions between 969 human and 11 HCV proteins. Among the 969 proteins, 153 'new' proteins were found that had not previously been referred to in any external databases of protein-protein interactions for HCV-host interactions. Thus, the extended ANDSystem provides a more comprehensive detailing of HCV-host interactions than other existing databases. Interestingly, HCV proteins preferentially interact with human proteins that are already involved in a large number of protein-protein interactions as well as with those associated with many diseases. Among human proteins of the HCV interactome, there were a large number of proteins regulated by microRNAs. It turned out that the results obtained for protein-protein interactions and microRNA regulation did not depend on how well the proteins were studied, while protein-disease interactions appeared to be dependent on the level of study. In particular, the mean number of diseases linked to well-studied proteins (proteins were considered well-studied if they were mentioned in 50 or more PubMed publications) from the HCV interactome was 20.8, significantly exceeding the mean number of associations with diseases (10.1) for the total set of well-studied human proteins present in ANDSystem. For poorly-studied proteins from the HCV interactome (each referred to in fewer than 50 publications), the distribution of the number of diseases associated with them had no statistically significant differences from the distribution of the number of diseases associated with poorly-studied proteins in the total set of human proteins stored in ANDSystem. Here, the average numbers of associations with diseases for the HCV interactome and the total set of human proteins were 0.3 and 0.2, respectively. Thus, ANDSystem, extended with the HCV interactome, can be helpful in a wide range of issues related to analyzing HCV-host interactions in the search for anti-HCV drug targets.
The demo version of the extended ANDSystem covered here containing only interactions between human proteins, genes, metabolites, diseases, miRNAs and molecular-genetic pathways, as well as interactions between human proteins/genes and HCV proteins, is freely available at the following web address: http://www-bionet.sscc.ru/psd/andhcv/. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
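
    The stratified comparison reported above (well-studied versus poorly-studied proteins, split at 50 PubMed mentions) can be reproduced with a short script once publication and disease counts per protein are available; the data structure below is a hypothetical placeholder, not the ANDSystem API.

    ```python
    # Hypothetical sketch: compare mean numbers of disease associations for
    # well-studied (>= 50 publications) versus poorly-studied proteins.
    def mean_disease_links(proteins, min_pubs=50):
        well = [p["diseases"] for p in proteins if p["pubs"] >= min_pubs]
        poor = [p["diseases"] for p in proteins if p["pubs"] < min_pubs]
        mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
        return mean(well), mean(poor)

    interactome = [
        {"name": "P1", "pubs": 120, "diseases": 25},
        {"name": "P2", "pubs": 10,  "diseases": 0},
        {"name": "P3", "pubs": 65,  "diseases": 17},
    ]
    print(mean_disease_links(interactome))  # (21.0, 0.0)
    ```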

  2. High performance semantic factoring of giga-scale semantic graph databases.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    al-Saffar, Sinan; Adolf, Bob; Haglin, David

    2010-10-01

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of deploying it for the analysis of the Billion Triple dataset with respect to its semantic factors, including basic properties, connected components, namespace interaction, and typed paths.

  3. Native Language Processing using Exegy Text Miner

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Compton, J

    2007-10-18

    Lawrence Livermore National Laboratory's New Architectures Testbed recently evaluated Exegy's Text Miner appliance to assess its applicability to high-performance, automated native language analysis. The evaluation was performed with support from the Computing Applications and Research Department in close collaboration with Global Security programs, and institutional activities in native language analysis. The Exegy Text Miner is a special-purpose device for detecting and flagging user-supplied patterns of characters, whether in streaming text or in collections of documents, at very high rates. Patterns may consist of simple lists of words or complex expressions with sub-patterns linked by logical operators. These searches are accomplished through a combination of specialized hardware (i.e., one or more field-programmable gate arrays in addition to general-purpose processors) and proprietary software that exploits these individual components in an optimal manner (through parallelism and pipelining). For this application the Text Miner has performed accurately and reproducibly at high speeds approaching those documented by Exegy in its technical specifications. The Exegy Text Miner is primarily intended for the single-byte ASCII characters used in English, but at a technical level its capabilities are language-neutral and can be applied to multi-byte character sets such as those found in Arabic and Chinese. The system is used for searching databases or tracking streaming text with respect to one or more lexicons. In a real operational environment it is likely that data would need to be processed separately for each lexicon or search technique. However, the searches would be so fast that multiple passes should not be considered as a limitation a priori. Indeed, it is conceivable that large databases could be searched as often as necessary if new queries were deemed worthwhile. This project is concerned with evaluating the Exegy Text Miner installed in the New Architectures Testbed running under software version 2.0. The concrete goals of the evaluation were to test the speed and accuracy of the Exegy and explore ways that it could be employed in current or future text-processing projects at Lawrence Livermore National Laboratory (LLNL). The study extended beyond these goals to evaluate its suitability for processing foreign language sources. The scope of this study was limited to the capabilities of the Exegy Text Miner in the file search mode and does not attempt to simulate the streaming mode. Since the capabilities of the machine are invariant to the choice of input mode and since timing should not depend on this choice, it was felt that the added effort was not necessary for this restricted study.
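
    The Text Miner itself is proprietary hardware, but the core operation it accelerates, scanning text against one or more lexicons of patterns, can be sketched in software. The snippet below is an illustrative substitute (a compiled regular-expression alternation standing in for the hardware automaton), not Exegy's implementation.

    ```python
    import re

    # Hypothetical sketch: flag lexicon hits in a collection of documents.
    def compile_lexicon(terms):
        # Longest terms first so overlapping patterns prefer the longer match.
        ordered = sorted(terms, key=len, reverse=True)
        return re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)

    def scan(documents, lexicons):
        """documents: iterable of (doc_id, text); lexicons: dict name -> compiled pattern.
        Yields (doc_id, lexicon_name, matched_text)."""
        for doc_id, text in documents:
            for name, pattern in lexicons.items():
                for m in pattern.finditer(text):
                    yield doc_id, name, m.group(0)

    lexicons = {"chem": compile_lexicon(["benzene", "toluene"])}
    docs = [("doc1", "Traces of benzene and Toluene were reported.")]
    print(list(scan(docs, lexicons)))
    ```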

  4. [Benefits of large healthcare databases for drug risk research].

    PubMed

    Garbe, Edeltraut; Pigeot, Iris

    2015-08-01

    Large electronic healthcare databases have become an important worldwide data resource for drug safety research after approval. Signal generation methods and drug safety studies based on these data facilitate the prospective monitoring of drug safety after approval, as has been recently required by EU law and the German Medicines Act. Despite its large size, a single healthcare database may include too few patients for the study of drugs with very small numbers of exposed patients or for the investigation of very rare drug risks. For that reason, in the United States, efforts have been made to work on models that provide the linkage of data from different electronic healthcare databases for monitoring the safety of medicines after authorization in (i) the Sentinel Initiative and (ii) the Observational Medical Outcomes Partnership (OMOP). In July 2014, the pilot project Mini-Sentinel included a total of 178 million people from 18 different US databases. The merging of the data is based on a distributed data network with a common data model. In the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCEPP) there has been no comparable merging of data from different databases; however, initial experience has been gained in various EU drug safety projects. In Germany, the data of the statutory health insurance providers constitute the most important resource for establishing a large healthcare database. Their use for this purpose has so far been severely restricted by the Code of Social Law (Section 75, Book 10). Therefore, a reform of this section is absolutely necessary.

  5. LMSD: LIPID MAPS structure database

    PubMed Central

    Sud, Manish; Fahy, Eoin; Cotter, Dawn; Brown, Alex; Dennis, Edward A.; Glass, Christopher K.; Merrill, Alfred H.; Murphy, Robert C.; Raetz, Christian R. H.; Russell, David W.; Subramaniam, Shankar

    2007-01-01

    The LIPID MAPS Structure Database (LMSD) is a relational database encompassing structures and annotations of biologically relevant lipids. Structures of lipids in the database come from four sources: (i) LIPID MAPS Consortium's core laboratories and partners; (ii) lipids identified by LIPID MAPS experiments; (iii) computationally generated structures for appropriate lipid classes; (iv) biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public sources. All the lipid structures in LMSD are drawn in a consistent fashion. In addition to a classification-based retrieval of lipids, users can search LMSD using either text-based or structure-based search options. The text-based search implementation supports data retrieval by any combination of these data fields: LIPID MAPS ID, systematic or common name, mass, formula, category, main class, and subclass data fields. The structure-based search, in conjunction with optional data fields, provides the capability to perform a substructure search or exact match for the structure drawn by the user. Search results, in addition to structure and annotations, also include relevant links to external databases. The LMSD is publicly available at PMID:17098933

  6. Graph Learning in Knowledge Bases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Goldberg, Sean; Wang, Daisy Zhe

    The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text annotations in a database. The current state-of-the-art structured prediction methods, however, are likely to contain errors, and it is important to be able to manage the overall uncertainty of the database. On the other hand, the advent of crowdsourcing has enabled humans to aid machine algorithms at scale. As part of this project we introduced pi-CASTLE, a system that optimizes and integrates human and machine computing as applied to a complex structured prediction problem involving conditional random fields (CRFs). We proposed strategies grounded in information theory to select a token subset, formulate questions for the crowd to label, and integrate these labelings back into the database using a method of constrained inference. On both a text segmentation task over academic citations and a named entity recognition task over tweets we showed an order of magnitude improvement in accuracy gain over baseline methods.
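
    A minimal sketch of the information-theoretic selection step described above: tokens whose marginal label distributions (for example from CRF inference) have the highest Shannon entropy are routed to the crowd for labeling. This illustrates the general idea with invented data, not the pi-CASTLE code.

    ```python
    import math

    # Hypothetical sketch: pick the k most uncertain tokens for crowd labeling.
    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist if p > 0)

    def select_for_crowd(marginals, k=2):
        """marginals: dict token_id -> label probability distribution."""
        ranked = sorted(marginals.items(), key=lambda kv: entropy(kv[1]), reverse=True)
        return [token for token, _ in ranked[:k]]

    marginals = {
        "tok1": [0.98, 0.01, 0.01],   # confident -> keep the machine label
        "tok2": [0.40, 0.35, 0.25],   # uncertain -> ask the crowd
        "tok3": [0.55, 0.30, 0.15],
    }
    print(select_for_crowd(marginals, k=2))  # ['tok2', 'tok3']
    ```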

  7. The EpiSLI Database: A Publicly Available Database on Speech and Language

    ERIC Educational Resources Information Center

    Tomblin, J. Bruce

    2010-01-01

    Purpose: This article describes a database that was created in the process of conducting a large-scale epidemiologic study of specific language impairment (SLI). As such, this database will be referred to as the EpiSLI database. Children with SLI have unexpected and unexplained difficulties learning and using spoken language. Although there is no…

  8. Online Patent Searching: The Realities.

    ERIC Educational Resources Information Center

    Kaback, Stuart M.

    1983-01-01

    Considers patent subject searching capabilities of major online databases, noting patent claims, "deep-indexed" files, test searches, retrieval of related references, multi-database searching, improvements needed in indexing of chemical structures, full text searching, improvements needed in handling numerical data, and augmenting a…

  9. Online Databases. Ingenta Grows in the U.S. Market.

    ERIC Educational Resources Information Center

    Tenopir, Carol

    2002-01-01

    Focuses on the growth of the United Kingdom company Ingenta in the United States. Highlights include a new partnership with Gale Group related to their InfoTrac databases; indexing and full text coverage; standards; and other Ingenta acquisitions. (LRW)

  10. IntPath--an integrated pathway gene relationship database for model organisms and important pathogens

    PubMed Central

    2012-01-01

    Background Pathway data are important for understanding the relationship between genes, proteins and many other molecules in living organisms. Pathway gene relationships are crucial information for guidance, prediction, reference and assessment in biochemistry, computational biology, and medicine. Many well-established databases--e.g., KEGG, WikiPathways, and BioCyc--are dedicated to collecting pathway data for public access. However, the effectiveness of these databases is hindered by issues such as incompatible data formats, inconsistent molecular representations, inconsistent molecular relationship representations, inconsistent referrals to pathway names, and incomplete coverage across different databases. Results In this paper, we overcome these issues through extraction, normalization and integration of pathway data from several major public databases (KEGG, WikiPathways, BioCyc, etc). We build a database that not only hosts our integrated pathway gene relationship data for public access but also maintains the necessary updates in the long run. This public repository is named IntPath (Integrated Pathway gene relationship database for model organisms and important pathogens). Four organisms--S. cerevisiae, M. tuberculosis H37Rv, H. sapiens and M. musculus--are included in this version (V2.0) of IntPath. IntPath uses the "full unification" approach to ensure no deletion and no introduced noise in this process. Therefore, IntPath contains much richer pathway-gene and pathway-gene pair relationships and a much larger number of non-redundant genes and gene pairs than any of the single-source databases. The gene relationships of each gene (measured by average node degree) per pathway are significantly richer. The gene relationships in each pathway (measured by average number of gene pairs per pathway) are also considerably richer in the integrated pathways. Moderate manual curation was involved to remove errors and noise from the source data (e.g., the gene ID errors in WikiPathways and relationship errors in KEGG). We turn complicated and incompatible XML data formats and inconsistent gene and gene relationship representations from different source databases into normalized and unified pathway-gene and pathway-gene pair relationships neatly recorded in simple tab-delimited text format and MySQL tables, which facilitates convenient automatic computation and large-scale referencing in many related studies. IntPath data can be downloaded in text format or MySQL dump. IntPath data can also be retrieved and analyzed conveniently through web service by local programs or through a web interface by mouse clicks. Several useful analysis tools are also provided in IntPath. Conclusions We have overcome in IntPath the issues of compatibility, consistency, and comprehensiveness that often hamper effective use of pathway databases. We have included four organisms in the current release of IntPath. Our methodology and programs described in this work can be easily applied to other organisms; and we will include more model organisms and important pathogens in future releases of IntPath. IntPath maintains regular updates and is freely available at http://compbio.ddns.comp.nus.edu.sg:8080/IntPath. PMID:23282057
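
    Because IntPath distributes pathway-gene pair relationships as tab-delimited text, richness statistics such as the average number of gene pairs per pathway are easy to recompute locally. A minimal sketch, assuming a hypothetical three-column file (pathway_id, gene_a, gene_b); the file name and column order are illustrative, not the published format specification.

    ```python
    from collections import defaultdict

    # Hypothetical sketch: average number of distinct gene pairs per pathway
    # from a tab-delimited file with columns: pathway_id, gene_a, gene_b.
    def average_pairs_per_pathway(path):
        pairs = defaultdict(set)
        with open(path) as fh:
            for line in fh:
                pathway, gene_a, gene_b = line.rstrip("\n").split("\t")[:3]
                pairs[pathway].add(frozenset((gene_a, gene_b)))
        if not pairs:
            return 0.0
        return sum(len(s) for s in pairs.values()) / len(pairs)

    # print(average_pairs_per_pathway("intpath_gene_pairs.tsv"))  # hypothetical file
    ```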

  11. Tomato functional genomics database (TFGD): a comprehensive collection and analysis package for tomato functional genomics

    USDA-ARS?s Scientific Manuscript database

    Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...

  12. Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency.

    PubMed

    Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio

    2015-01-01

    Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them refers to management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. To find an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB.
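
    A minimal sketch of the kind of write and read path evaluated in such a comparison, using the open-source DataStax Python driver for Cassandra; the keyspace, table, and columns are hypothetical and are not taken from the study.

    ```python
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    # Hypothetical sketch: persist processed sequence records in Cassandra.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS genomics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS genomics.reads (
            sample_id text, read_id text, sequence text, quality text,
            PRIMARY KEY (sample_id, read_id)
        )
    """)
    insert = session.prepare(
        "INSERT INTO genomics.reads (sample_id, read_id, sequence, quality) VALUES (?, ?, ?, ?)")
    session.execute(insert, ("S1", "r0001", "ACGTACGT", "IIIIIIII"))
    row = session.execute(
        "SELECT sequence FROM genomics.reads WHERE sample_id = 'S1' AND read_id = 'r0001'").one()
    print(row.sequence)
    ```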

  13. Reporting to Improve Reproducibility and Facilitate Validity Assessment for Healthcare Database Studies V1.0.

    PubMed

    Wang, Shirley V; Schneeweiss, Sebastian; Berger, Marc L; Brown, Jeffrey; de Vries, Frank; Douglas, Ian; Gagne, Joshua J; Gini, Rosa; Klungel, Olaf; Mullins, C Daniel; Nguyen, Michael D; Rassen, Jeremy A; Smeeth, Liam; Sturkenboom, Miriam

    2017-09-01

    Defining a study population and creating an analytic dataset from longitudinal healthcare databases involves many decisions. Our objective was to catalogue scientific decisions underpinning study execution that should be reported to facilitate replication and enable assessment of validity of studies conducted in large healthcare databases. We reviewed key investigator decisions required to operate a sample of macros and software tools designed to create and analyze analytic cohorts from longitudinal streams of healthcare data. A panel of academic, regulatory, and industry experts in healthcare database analytics discussed and added to this list. Evidence generated from large healthcare encounter and reimbursement databases is increasingly being sought by decision-makers. Varied terminology is used around the world for the same concepts. Agreeing on terminology and which parameters from a large catalogue are the most essential to report for replicable research would improve transparency and facilitate assessment of validity. At a minimum, reporting for a database study should provide clarity regarding operational definitions for key temporal anchors and their relation to each other when creating the analytic dataset, accompanied by an attrition table and a design diagram. A substantial improvement in reproducibility, rigor and confidence in real world evidence generated from healthcare databases could be achieved with greater transparency about operational study parameters used to create analytic datasets from longitudinal healthcare databases. © 2017 The Authors. Pharmacoepidemiology & Drug Safety Published by John Wiley & Sons Ltd.

  14. Linguistic measures of chemical diversity and the "keywords" of molecular collections.

    PubMed

    Woźniak, Michał; Wołos, Agnieszka; Modrzyk, Urszula; Górski, Rafał L; Winkowski, Jan; Bajczyk, Michał; Szymkuć, Sara; Grzybowski, Bartosz A; Eder, Maciej

    2018-05-15

    Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections ("corpora"), including those deposited on the Internet - indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic "chemical words" that span more than traditional functional groups and, instead, look at common structural fragments molecules share. Using these words, it is possible to quantify the diversity of chemical collections/databases in new ways and to define molecular "keywords" by which such collections are best characterized and annotated.
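
    The corpus-linguistic analogy can be illustrated with a tf-idf-style score: treat each collection as a "document" whose "words" are structural fragments, and rank fragments to obtain collection "keywords". The fragment names and the scoring below are hypothetical illustrations, not the authors' actual scheme.

    ```python
    import math
    from collections import Counter

    # Hypothetical sketch: tf-idf style "keywords" for chemical collections whose
    # molecules have already been decomposed into fragment "words" upstream.
    def keywords(collections, top_n=2):
        """collections: dict name -> list of fragment strings (one entry per occurrence)."""
        doc_freq = Counter()
        for frags in collections.values():
            doc_freq.update(set(frags))
        n_docs = len(collections)
        result = {}
        for name, frags in collections.items():
            tf = Counter(frags)
            scores = {f: (c / len(frags)) * math.log(n_docs / doc_freq[f])
                      for f, c in tf.items()}
            result[name] = sorted(scores, key=scores.get, reverse=True)[:top_n]
        return result

    collections = {
        "drug_like": ["amide", "amide", "pyridine", "phenyl"],
        "natural_products": ["lactone", "phenyl", "glycoside", "lactone"],
    }
    print(keywords(collections))
    ```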

  15. AphidBase: A centralized bioinformatic resource for annotation of the pea aphid genome

    PubMed Central

    Legeai, Fabrice; Shigenobu, Shuji; Gauthier, Jean-Pierre; Colbourne, John; Rispe, Claude; Collin, Olivier; Richards, Stephen; Wilson, Alex C. C.; Tagu, Denis

    2015-01-01

    AphidBase is a centralized bioinformatic resource that was developed to facilitate community annotation of the pea aphid genome by the International Aphid Genomics Consortium (IAGC). The AphidBase Information System designed to organize and distribute genomic data and annotations for a large international community was constructed using open source software tools from the Generic Model Organism Database (GMOD). The system includes Apollo and GBrowse utilities as well as a wiki, blast search capabilities and a full text search engine. AphidBase strongly supported community cooperation and coordination in the curation of gene models during community annotation of the pea aphid genome. AphidBase can be accessed at http://www.aphidbase.com. PMID:20482635

  16. Data Mining.

    ERIC Educational Resources Information Center

    Benoit, Gerald

    2002-01-01

    Discusses data mining (DM) and knowledge discovery in databases (KDD), taking the view that KDD is the larger view of the entire process, with DM emphasizing the cleaning, warehousing, mining, and visualization of knowledge discovery in databases. Highlights include algorithms; users; the Internet; text mining; and information extraction.…

  17. The Era of the Large Databases: Outcomes After Gastroesophageal Surgery According to NSQIP, NIS, and NCDB Databases. Systematic Literature Review.

    PubMed

    Batista Rodríguez, Gabriela; Balla, Andrea; Fernández-Ananín, Sonia; Balagué, Carmen; Targarona, Eduard M

    2018-05-01

    The term big data refers to databases that include large amounts of information used in various areas of knowledge. Currently, there are large databases that allow the evaluation of postoperative evolution, such as the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP), the Healthcare Cost and Utilization Project (HCUP) National Inpatient Sample (NIS), and the National Cancer Database (NCDB). The aim of this review was to evaluate the clinical impact of information obtained from these registries regarding gastroesophageal surgery. A systematic review using the Meta-analysis of Observational Studies in Epidemiology guidelines was performed. The research was carried out using the PubMed database, identifying 251 articles. All outcomes related to gastroesophageal surgery were analyzed. A total of 34 articles published between January 2007 and July 2017 were included, for a total of 345 697 patients. Studies were analyzed and divided according to the type of surgery and main theme into (1) esophageal surgery and (2) gastric surgery. The information provided by these databases is an effective way to obtain levels of evidence not obtainable by conventional methods. Furthermore, this information is useful for the external validation of previous studies, to establish benchmarks that allow comparisons between centers and have a positive impact on the quality of care.

  18. Development and Operation of a Database Machine for Online Access and Update of a Large Database.

    ERIC Educational Resources Information Center

    Rush, James E.

    1980-01-01

    Reviews the development of a fault tolerant database processor system which replaced OCLC's conventional file system. A general introduction to database management systems and the operating environment is followed by a description of the hardware selection, software processes, and system characteristics. (SW)

  19. Hypersonic and Supersonic Flow Roadmaps Using Bibliometrics and Database Tomography.

    ERIC Educational Resources Information Center

    Kostoff, R. N.; Eberhart, Henry J.; Toothman, Darrell Ray

    1999-01-01

    Database Tomography (DT) is a textual database-analysis system consisting of algorithms for extracting multiword phrase frequencies and proximities from a large textual database, to augment interpretative capabilities of the expert human analyst. Describes use of the DT process, supplemented by literature bibliometric analyses, to derive technical…

  20. Comparison of the NCI open database with seven large chemical structural databases.

    PubMed

    Voigt, J H; Bienfait, B; Wang, S; Nicklaus, M C

    2001-01-01

    Eight large chemical databases have been analyzed and compared to each other. Central to this comparison is the open National Cancer Institute (NCI) database, consisting of approximately 250 000 structures. The other databases analyzed are the Available Chemicals Directory ("ACD," from MDL, release 1.99, 3D-version); the ChemACX ("ACX," from CamSoft, Version 4.5); the Maybridge Catalog and the Asinex database (both as distributed by CamSoft as part of ChemInfo 4.5); the Sigma-Aldrich Catalog (CD-ROM, 1999 Version); the World Drug Index ("WDI," Derwent, version 1999.03); and the organic part of the Cambridge Crystallographic Database ("CSD," from Cambridge Crystallographic Data Center, 1999 Version 5.18). The database properties analyzed are internal duplication rates; compounds unique to each database; cumulative occurrence of compounds in an increasing number of databases; overlap of identical compounds between two databases; similarity overlap; diversity; and others. The crystallographic database CSD and the WDI show somewhat less overlap with the other databases than those databases show with each other. In particular the collections of commercial compounds and compilations of vendor catalogs have a substantial degree of overlap among each other. Still, no database is completely a subset of any other, and each appears to have its own niche and thus "raison d'être". The NCI database has by far the highest number of compounds that are unique to it. Approximately 200 000 of the NCI structures were not found in any of the other analyzed databases.
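
    A minimal sketch of the pairwise-overlap and uniqueness analysis described above, assuming every database has already been reduced to a set of canonical structure identifiers; the database contents here are toy placeholders.

    ```python
    from itertools import combinations

    # Hypothetical sketch: pairwise overlap and per-database unique counts for
    # chemical databases represented as sets of canonical structure identifiers.
    def overlap_report(databases):
        for (a, sa), (b, sb) in combinations(databases.items(), 2):
            print(f"{a} vs {b}: {len(sa & sb)} shared structures")
        for name, s in databases.items():
            others = set().union(*(v for k, v in databases.items() if k != name))
            print(f"{name}: {len(s - others)} structures unique to it")

    databases = {
        "NCI": {"c1", "c2", "c3", "c4"},
        "WDI": {"c2", "c5"},
        "CSD": {"c3", "c6"},
    }
    overlap_report(databases)
    ```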

  1. A Database of Supercooled Large Droplet Ice Accretions [Supplement

    NASA Technical Reports Server (NTRS)

    VanZante, Judith Foss

    2007-01-01

    A unique, publicly available database regarding supercooled large droplet (SLD) ice accretions has been developed in NASA Glenn's Icing Research Tunnel. Identical cloud and flight conditions were generated for five different airfoil models. The models chosen represent a variety of aircraft types from the horizontal stabilizer of a large transport aircraft to the wings of regional, business, and general aviation aircraft. In addition to the standard documentation methods of 2D ice shape tracing and imagery, ice mass measurements were also taken. This database will also be used to validate and verify the extension of the ice accretion code, LEWICE, into the SLD realm.

  2. A Database of Supercooled Large Droplet Ice Accretions

    NASA Technical Reports Server (NTRS)

    VanZante, Judith Foss

    2007-01-01

    A unique, publicly available database regarding supercooled large droplet ice accretions has been developed in NASA Glenn's Icing Research Tunnel. Identical cloud and flight conditions were generated for five different airfoil models. The models chosen represent a variety of aircraft types from the horizontal stabilizer of a large transport aircraft to the wings of regional, business, and general aviation aircraft. In addition to the standard documentation methods of 2D ice shape tracing and imagery, ice mass measurements were also taken. This database will also be used to validate and verify the extension of the ice accretion code, LEWICE, into the SLD realm.

  3. Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy.

    PubMed

    Bekhuis, Tanja

    2006-04-03

    Innovative biomedical librarians and information specialists who want to expand their roles as expert searchers need to know about profound changes in biology and parallel trends in text mining. In recent years, conceptual biology has emerged as a complement to empirical biology. This is partly in response to the availability of massive digital resources such as the network of databases for molecular biologists at the National Center for Biotechnology Information. Developments in text mining and hypothesis discovery systems based on the early work of Swanson, a mathematician and information scientist, are coincident with the emergence of conceptual biology. Very little has been written to introduce biomedical digital librarians to these new trends. In this paper, background for data and text mining, as well as for knowledge discovery in databases (KDD) and in text (KDT) is presented, then a brief review of Swanson's ideas, followed by a discussion of recent approaches to hypothesis discovery and testing. 'Testing' in the context of text mining involves partially automated methods for finding evidence in the literature to support hypothetical relationships. Concluding remarks follow regarding (a) the limits of current strategies for evaluation of hypothesis discovery systems and (b) the role of literature-based discovery in concert with empirical research. Report of an informatics-driven literature review for biomarkers of systemic lupus erythematosus is mentioned. Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians.
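
    Swanson-style "ABC" discovery can be sketched directly: if concept A co-occurs with B in some documents and B co-occurs with C in others, but A and C never co-occur, then A-C is a candidate hidden relationship. The toy documents below echo Swanson's well-known fish oil and Raynaud's disease example; the code is an illustration of the idea, not a hypothesis discovery system.

    ```python
    # Hypothetical sketch of Swanson's ABC model of literature-based discovery.
    def abc_candidates(doc_terms, a_term):
        """doc_terms: list of sets of concepts, one set per document.
        Returns candidate C terms linked to a_term only indirectly via shared B terms."""
        b_terms = set()
        for terms in doc_terms:
            if a_term in terms:
                b_terms |= terms - {a_term}
        candidates = set()
        for terms in doc_terms:
            if a_term not in terms and terms & b_terms:
                candidates |= terms - b_terms
        # Drop anything that already co-occurs directly with A.
        direct = set().union(*(t for t in doc_terms if a_term in t))
        return candidates - direct

    docs = [
        {"fish oil", "blood viscosity"},
        {"blood viscosity", "Raynaud's disease"},
        {"fish oil", "platelet aggregation"},
    ]
    print(abc_candidates(docs, "fish oil"))  # {"Raynaud's disease"}
    ```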

  4. Development and evaluation of a Naïve Bayesian model for coding causation of workers’ compensation claims

    PubMed Central

    Bertke, S. J.; Meyers, A. R.; Wurzelbacher, S. J.; Bell, J.; Lampl, M. L.; Robins, D.

    2015-01-01

    Introduction Tracking and trending rates of injuries and illnesses classified as musculoskeletal disorders caused by ergonomic risk factors such as overexertion and repetitive motion (MSDs) and slips, trips, or falls (STFs) in different industry sectors is of high interest to many researchers. Unfortunately, identifying the cause of injuries and illnesses in large datasets such as workers’ compensation systems often requires reading and coding the free form accident text narrative for potentially millions of records. Method To alleviate the need for manual coding, this paper describes and evaluates a computer auto-coding algorithm that demonstrated the ability to code millions of claims quickly and accurately by learning from a set of previously manually coded claims. Conclusions The auto-coding program was able to code claims as a musculoskeletal disorders, STF or other with approximately 90% accuracy. Impact on industry The program developed and discussed in this paper provides an accurate and efficient method for identifying the causation of workers’ compensation claims as a STF or MSD in a large database based on the unstructured text narrative and resulting injury diagnoses. The program coded thousands of claims in minutes. The method described in this paper can be used by researchers and practitioners to relieve the manual burden of reading and identifying the causation of claims as a STF or MSD. Furthermore, the method can be easily generalized to code/classify other unstructured text narratives. PMID:23206504
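
    A minimal sketch of the kind of Naïve Bayes auto-coder described above, using scikit-learn with bag-of-words features over the claim narrative; the narratives and labels are invented for illustration, and the published program's preprocessing and feature set may differ.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical sketch: code free-text claim narratives as MSD, STF, or OTHER.
    train_text = [
        "worker slipped on wet floor and fell",
        "tripped over cable and fell down stairs",
        "repetitive lifting caused lower back strain",
        "overexertion while moving boxes, shoulder pain",
        "cut finger on sheet metal edge",
    ]
    train_labels = ["STF", "STF", "MSD", "MSD", "OTHER"]

    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(train_text, train_labels)
    print(model.predict(["strained back lifting heavy pallet",
                         "fell on icy walkway"]))  # expected: ['MSD' 'STF']
    ```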

  5. A Review of Stellar Abundance Databases and the Hypatia Catalog Database

    NASA Astrophysics Data System (ADS)

    Hinkel, Natalie Rose

    2018-01-01

    The astronomical community is interested in elements from lithium to thorium, from solar twins to peculiarities of stellar evolution, because they give insight into different regimes of star formation and evolution. However, while some trends between elements and other stellar or planetary properties are well known, many other trends are not as obvious and are a point of conflict. For example, stars that host giant planets are found to be consistently enriched in iron, but the same cannot be definitively said for any other element. Therefore, it is time to take advantage of large stellar abundance databases in order to better understand not only the large-scale patterns, but also the more subtle, small-scale trends within the data. In this overview of the special session, I will present a review of large stellar abundance databases that are both currently available (i.e. RAVE, APOGEE) and those that will soon be online (i.e. Gaia-ESO, GALAH). Additionally, I will discuss the Hypatia Catalog Database (www.hypatiacatalog.com) -- which includes abundances from individual literature sources that observed stars within 150 pc. The Hypatia Catalog currently contains 72 elements as measured within ~6000 stars, with a total of ~240,000 unique abundance determinations. The online database offers a variety of solar normalizations, stellar properties, and planetary properties (where applicable) that can all be viewed through multiple interactive plotting interfaces as well as in a tabular format. By analyzing stellar abundances for large populations of stars and from a variety of different perspectives, a wealth of information can be revealed on both large and small scales.

  6. DataHub knowledge based assistance for science visualization and analysis using large distributed databases

    NASA Technical Reports Server (NTRS)

    Handley, Thomas H., Jr.; Collins, Donald J.; Doyle, Richard J.; Jacobson, Allan S.

    1991-01-01

    Viewgraphs on DataHub knowledge based assistance for science visualization and analysis using large distributed databases. Topics covered include: DataHub functional architecture; data representation; logical access methods; preliminary software architecture; LinkWinds; data knowledge issues; expert systems; and data management.

  7. pGenN, a gene normalization tool for plant genes and proteins in scientific literature.

    PubMed

    Ding, Ruoyao; Arighi, Cecilia N; Lee, Jung-Youn; Wu, Cathy H; Vijay-Shanker, K

    2015-01-01

    Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these phases. We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).

  8. Effectiveness of Acupuncture and Electroacupuncture for Chronic Neck Pain: A Systematic Review and Meta-Analysis.

    PubMed

    Seo, See Yoon; Lee, Ki-Beom; Shin, Joon-Shik; Lee, Jinho; Kim, Me-Riong; Ha, In-Hyuk; Ko, Youme; Lee, Yoon Jae

    2017-01-01

    The aim of this systematic review was to assess evidence from randomized controlled trials (RCTs) on the effectiveness and safety of acupuncture and electroacupuncture in patients with chronic neck pain. We searched nine databases including Chinese, Japanese and Korean databases through 30 July 2016. The participants were adults with chronic neck pain and were treated with acupuncture or electroacupuncture. Eligible trials were those with intervention groups receiving acupuncture and electroacupuncture with or without active control, and control groups receiving other conventional treatments such as physical therapy or medication. Outcomes included pain intensity, disability, quality of life (QoL) and adverse effects. For statistical pooling, the standardized mean difference (SMD) and its 95% confidence interval (CI) were calculated using a fixed-effects model. Sixteen RCTs were selected. The comparison of the sole acupuncture group and the active control group did not show a significant difference in pain (SMD 0.24, 95% CI -0.27 to 0.75), disability (SMD 0.51, 95% CI -0.01 to 1.02), or QoL (SMD -0.37, 95% CI -1.09 to 0.35), indicating a similar effectiveness of acupuncture and active control. When acupuncture was added onto the control treatment, the acupuncture add-on group showed significantly greater relief of pain in studies with unclear allocation concealment (SMD -1.78, 95% CI -2.08 to -1.48), but did not show significant relief of pain in studies with good allocation concealment (SMD -0.07, 95% CI -0.26 to 0.12). Significant relief of pain was observed when the sole electroacupuncture group was compared to the control group or when electroacupuncture was added onto the active control group, but many of the results were judged to have a low level of evidence, making it difficult to draw clear conclusions. In the results reporting adverse effects, no serious adverse events were confirmed. Acupuncture and conventional medicine for chronic neck pain have similar effectiveness on pain and disability when compared directly with each other. When acupuncture was added onto conventional treatment it relieved pain better, and electroacupuncture relieved pain even more. It is difficult to draw conclusions because the included studies have a high risk of bias and imprecision. Therefore, better designed large-scale studies are needed in the future.
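
    For reference, the fixed-effects pooling of standardized mean differences mentioned above follows the standard inverse-variance form; this is a generic statement of the method, not a recalculation of the review's results.

    ```latex
    % Inverse-variance fixed-effect pooling of k standardized mean differences
    \hat{\theta}_{\mathrm{FE}} = \frac{\sum_{i=1}^{k} w_i \,\mathrm{SMD}_i}{\sum_{i=1}^{k} w_i},
    \qquad w_i = \frac{1}{\widehat{\mathrm{Var}}(\mathrm{SMD}_i)},
    \qquad 95\%\ \mathrm{CI} = \hat{\theta}_{\mathrm{FE}} \pm 1.96 \left(\sum_{i=1}^{k} w_i\right)^{-1/2}
    ```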

  9. Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets.

    PubMed

    Brown, Adrian P; Borgs, Christian; Randall, Sean M; Schnell, Rainer

    2017-06-08

    Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. Real-world performance of these techniques using large-scale data has been unknown until now. Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold-standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated. Clear text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine tuning of parameters. We argue that the increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
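
    A minimal sketch of the cryptographic long-term key (CLK) idea: character bigrams from several identifying fields are hashed with keyed HMACs into one Bloom filter per record, and encoded records are compared with a set similarity such as the Dice coefficient. The filter length, number of hash functions, and secret key below are illustrative choices, not the parameters used in the study.

    ```python
    import hashlib
    import hmac

    # Hypothetical sketch: encode record identifiers into a composite Bloom filter (CLK).
    FILTER_BITS, NUM_HASHES, SECRET = 1024, 10, b"shared-secret-key"

    def bigrams(value):
        padded = f" {value.lower().strip()} "
        return [padded[i:i + 2] for i in range(len(padded) - 1)]

    def clk(fields):
        bits = [0] * FILTER_BITS
        for field in fields:
            for gram in bigrams(field):
                for seed in range(NUM_HASHES):
                    digest = hmac.new(SECRET, f"{seed}|{gram}".encode(), hashlib.sha256).digest()
                    bits[int.from_bytes(digest[:4], "big") % FILTER_BITS] = 1
        return bits

    def dice(a, b):
        intersection = sum(x & y for x, y in zip(a, b))
        return 2 * intersection / (sum(a) + sum(b))

    rec1 = clk(["john", "smith", "1970-01-01"])
    rec2 = clk(["jon", "smith", "1970-01-01"])
    print(f"Dice similarity: {dice(rec1, rec2):.2f}")  # high for likely matches
    ```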

  10. DOE Research and Development Accomplishments Help

    Science.gov Websites

    The DOE Research and Development Accomplishments database can be used to search, locate, access, and electronically download full-text research and development (R&D) documents. The help pages cover browsing; downloading, viewing, and/or searching full-text documents and pages; and searching the database. The search features allow you to search the OCRed full-text of a document as well as its bibliographic information.

  11. The Effects of Elaboration on Self-Learning Procedures from Text.

    ERIC Educational Resources Information Center

    Yang, Fu-mei

    This study investigated the effects of augmenting and deleting elaborations in an existing self-instructional text for a micro-computer database application, "Microsoft Works User's Manual." A total of 60 undergraduate students were randomly assigned to the original, elaborated, or unelaborated text versions. The elaborated version…

  12. A Text Knowledge Base from the AI Handbook.

    ERIC Educational Resources Information Center

    Simmons, Robert F.

    1987-01-01

    Describes a prototype natural language text knowledge system (TKS) that was used to organize 50 pages of a handbook on artificial intelligence as an inferential knowledge base with natural language query and command capabilities. Representation of text, database navigation, query systems, discourse structuring, and future research needs are…

  13. The Relationship between Searches Performed in Online Databases and the Number of Full-Text Articles Accessed: Measuring the Interaction between Database and E-Journal Collections

    ERIC Educational Resources Information Center

    Lamothe, Alain R.

    2011-01-01

    The purpose of this paper is to report the results of a quantitative analysis exploring the interaction and relationship between the online database and electronic journal collections at the J. N. Desmarais Library of Laurentian University. A very strong relationship exists between the number of searches and the size of the online database…

  14. Worldwide Report, Telecommunications Policy, Research and Development

    DTIC Science & Technology

    1985-12-31

    From the table of contents: "Hong Kong Database Signs Contract With PRC" (K. Gopinath; Hong Kong HONGKONG STANDARD, 18 Oct 85); briefs on a Hong Kong-London data link. The article by K. Gopinath (HONGKONG STANDARD, English Supplement, 18 Oct 85, p 1) reports that the agreement between the state-owned China Hua Yang Technology and Trade Corporation and DataBase Asia of Hong Kong authorises…

  15. DR HAGIS-a fundus image database for the automatic extraction of retinal surface vessels from diabetic patients.

    PubMed

    Holm, Sven; Russell, Greg; Nourrit, Vincent; McLoughlin, Niall

    2017-01-01

    A database of retinal fundus images, the DR HAGIS database, is presented. This database consists of 39 high-resolution color fundus images obtained from a diabetic retinopathy screening program in the UK. The NHS screening program uses service providers that employ different fundus and digital cameras. This results in a range of different image sizes and resolutions. Furthermore, patients enrolled in such programs often display other comorbidities in addition to diabetes. Therefore, in an effort to replicate the normal range of images examined by grading experts during screening, the DR HAGIS database consists of images of varying sizes and resolutions and four comorbidity subgroups, collectively defined as the Diabetic Retinopathy, Hypertension, Age-related macular degeneration, and Glaucoma Image Set (DR HAGIS). For each image, the vasculature has been manually segmented to provide a realistic set of images on which to test automatic vessel extraction algorithms. Modified versions of two previously published vessel extraction algorithms were applied to this database to provide some baseline measurements. A method based purely on the intensity of image pixels resulted in a mean segmentation accuracy of 95.83% ([Formula: see text]), whereas an algorithm based on Gabor filters generated an accuracy of 95.71% ([Formula: see text]).

  16. SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents

    PubMed Central

    Heifets, Abraham; Jurisica, Igor

    2012-01-01

    The patent literature is a rich catalog of biologically relevant chemicals; many public and commercial molecular databases contain the structures disclosed in patent claims. However, patents are an equally rich source of metadata about bioactive molecules, including mechanism of action, disease class, homologous experimental series, structural alternatives, or the synthetic pathways used to produce molecules of interest. Unfortunately, this metadata is discarded when chemical structures are deposited separately in databases. SCRIPDB is a chemical structure database designed to make this metadata accessible. SCRIPDB provides the full original patent text, reactions and relationships described within any individual patent, in addition to the molecular files common to structural databases. We discuss how such information is valuable in medical text mining, chemical image analysis, reaction extraction and in silico pharmaceutical lead optimization. SCRIPDB may be searched by exact chemical structure, substructure or molecular similarity and the results may be restricted to patents describing synthetic routes. SCRIPDB is available at http://dcv.uhnres.utoronto.ca/SCRIPDB. PMID:22067445

  17. PubstractHelper: A Web-based Text-Mining Tool for Marking Sentences in Abstracts from PubMed Using Multiple User-Defined Keywords.

    PubMed

    Chen, Chou-Cheng; Ho, Chung-Liang

    2014-01-01

    While a huge amount of information about biological literature can be obtained by searching the PubMed database, reading through all the titles and abstracts resulting from such a search for useful information is inefficient. Text mining makes it possible to increase this efficiency. Some websites use text mining to gather information from the PubMed database; however, they are database-oriented, using pre-defined search keywords while lacking a query interface for user-defined search inputs. We present the PubMed Abstract Reading Helper (PubstractHelper) website which combines text mining and reading assistance for an efficient PubMed search. PubstractHelper can accept a maximum of ten groups of keywords, with each group containing up to ten keywords. The principle behind the text-mining function of PubstractHelper is that keywords contained in the same sentence are likely to be related. PubstractHelper highlights sentences with co-occurring keywords in different colors. The user can download the PMID and the abstracts with color markings to be reviewed later. The PubstractHelper website can help users to identify relevant publications based on the presence of related keywords, which should be a handy tool for their research. http://bio.yungyun.com.tw/ATM/PubstractHelper.aspx and http://holab.med.ncku.edu.tw/ATM/PubstractHelper.aspx.
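
    A minimal sketch of the sentence-level co-occurrence principle described above: split an abstract into sentences and keep those containing keywords from more than one user-defined group. The web tool adds per-group color highlighting, which is omitted here; the keyword groups and the example abstract are invented.

    ```python
    import re

    # Hypothetical sketch: flag sentences where keywords from different groups co-occur.
    def cooccurring_sentences(abstract, keyword_groups):
        sentences = re.split(r"(?<=[.!?])\s+", abstract)
        hits = []
        for sentence in sentences:
            lower = sentence.lower()
            matched = [name for name, words in keyword_groups.items()
                       if any(w.lower() in lower for w in words)]
            if len(matched) >= 2:
                hits.append((sentence, matched))
        return hits

    groups = {"gene": ["TP53", "p53"], "disease": ["carcinoma", "tumor"]}
    abstract = ("TP53 mutations were frequent. "
                "Loss of p53 function promoted tumor growth in this model.")
    print(cooccurring_sentences(abstract, groups))
    ```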

  18. A bibliography of IRIS-related publications, 2000-2011

    NASA Astrophysics Data System (ADS)

    Muco, B.

    2012-12-01

    Citations and acknowledgements in scientific journals can be an indicator of the role an organization plays in the research of that field. Since its formation and incorporation in May 1984, the IRIS Consortium (Incorporated Research Institutions for Seismology) has been mentioned more and more as a valuable source of data, instruments and programs in the literature of earth sciences. As a large organization with more than 100 member domestic institutes and about 40 international affiliates, IRIS obviously has a direct impact on the earth sciences through all its programs, projects, workshops, symposia, and newsletters and as a lively forum for exchanging ideas. In order to maintain support from the National Science Foundation (NSF) and the research community, it is important to document the continued use of IRIS facilities in basic research programs. IRIS maintains a database of articles that are based on the use of IRIS facilities or which reference use of IRIS data and resources. Articles in this database have either been provided to IRIS by the authors or been selected through an annual search of a number of prominent journals. A text version of the full bibliographic database is available on the IRIS website and a version in EndNote format is also provided. To provide a more complete bibliography and a consistent evaluation of temporal trends in publications, a special annual search began in 2000 which focused on a subset of key seismology and Earth science journals: Bulletin of Seismological Society of America, Journal of Geophysical Research, Seismological Research Letters, Geophysical Research Letters, Earth and Planetary Science Letters, Physics of the Earth and Planetary Interiors, Tectonophysics, Geophysical Journal International, Nature, Science, Geology and EOS. Using different search engines such as Scirus, ScienceDirect, GeoRef, OCLC First Search, EASI Search, NASA Abstract Service etc. for online journals and publishers' databases, we searched for key words (IRIS, GSN, DMS, PASSCAL, USArray etc) in titles, abstracts and text. Most of the selections found by this method were confirmed by reading through online texts or original journals. This bibliography of peer-reviewed articles (excluding abstracts) identified in these key journals for 2000-2011 includes approximately 1800 entries. As for American Geophysical Union (AGU) transactions, the bibliography of IRIS-related abstracts for the abovementioned period includes approximately 1400 abstracts. This study is a clear indicator of the intensive use the seismological community makes of the resources that IRIS provides, and of the paramount importance of this organization in the advancement of seismological research worldwide.

  19. Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency

    PubMed Central

    Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio

    2015-01-01

    Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them refers to management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. To find an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB. PMID:26558254

  20. Recently Active Traces of the Berryessa Fault, California: A Digital Database

    USGS Publications Warehouse

    Lienkaemper, James J.

    2012-01-01

    The purpose of this map is to show the location of and evidence for recent movement on active fault traces within the Berryessa section and parts of adjacent sections of the Green Valley Fault Zone, California. The location and recency of the mapped traces are primarily based on geomorphic expression of the fault as interpreted from large-scale 2010 aerial photography and from 2007 and 2011 0.5 and 1.0 meter bare-earth LiDAR imagery (that is, high-resolution topographic data). In a few places, evidence of fault creep and offset Holocene strata in trenches and natural exposures has confirmed the activity of some of these traces. This publication is formatted both as a digital database for use within a geographic information system (GIS) and, for broader public access, as map images that may be browsed on-line or downloaded as a summary map. The report text describes the types of scientific observations used to make the map, gives references pertaining to the fault and the evidence of faulting, and provides guidance for use of and limitations of the map.

  1. Managing Attribute-Value Clinical Trials Data Using the ACT/DB Client-Server Database System

    PubMed Central

    Nadkarni, Prakash M.; Brandt, Cynthia; Frawley, Sandra; Sayward, Frederick G.; Einbinder, Robin; Zelterman, Daniel; Schacter, Lee; Miller, Perry L.

    1998-01-01

    ACT/DB is a client-server database application for storing clinical trials and outcomes data, which is currently undergoing initial pilot use. It stores most of its data in entity-attribute-value form. Such data are segregated according to data type to allow indexing by value when possible, and binary large object data are managed in the same way as other data. ACT/DB lets an investigator design a study rapidly by defining the parameters (or attributes) that are to be gathered, as well as their logical grouping for purposes of display and data entry. ACT/DB generates customizable data entry screens. The data can be viewed through several standard reports as well as exported as text to external analysis programs. ACT/DB is designed to encourage reuse of parameters across multiple studies and has facilities for dictionary search and maintenance. It uses a Microsoft Access client running on Windows 95 machines, which communicates with an Oracle server running on a UNIX platform. ACT/DB is being used to manage the data for seven studies in its initial deployment. PMID:9524347
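
    A minimal sketch of the entity-attribute-value layout the abstract describes, segregated by data type so numeric values can be indexed and compared; the SQLite tables and column names are my own illustration, not ACT/DB's actual schema.

```python
# EAV sketch in SQLite: one row per (patient, attribute, value) triple,
# split by data type so numeric values remain indexable.  Table and column
# names are illustrative, not ACT/DB's actual schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE attribute (attr_id INTEGER PRIMARY KEY, name TEXT, datatype TEXT);
    CREATE TABLE fact_numeric (patient_id INTEGER, attr_id INTEGER, value REAL);
    CREATE TABLE fact_text    (patient_id INTEGER, attr_id INTEGER, value TEXT);
    CREATE INDEX idx_num ON fact_numeric (attr_id, value);
""")
con.execute("INSERT INTO attribute VALUES (1, 'systolic_bp', 'numeric')")
con.execute("INSERT INTO fact_numeric VALUES (101, 1, 142.0)")
con.execute("INSERT INTO fact_numeric VALUES (102, 1, 118.0)")

# Attribute-centric query: all patients with systolic BP above 140
rows = con.execute("""
    SELECT f.patient_id, f.value
    FROM fact_numeric f JOIN attribute a ON a.attr_id = f.attr_id
    WHERE a.name = 'systolic_bp' AND f.value > 140
""").fetchall()
print(rows)   # [(101, 142.0)]
```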

  2. Brief Report: The Negev Hospital-University-Based (HUB) Autism Database

    ERIC Educational Resources Information Center

    Meiri, Gal; Dinstein, Ilan; Michaelowski, Analya; Flusser, Hagit; Ilan, Michal; Faroy, Michal; Bar-Sinai, Asif; Manelis, Liora; Stolowicz, Dana; Yosef, Lili Lea; Davidovitch, Nadav; Golan, Hava; Arbelle, Shosh; Menashe, Idan

    2017-01-01

    Elucidating the heterogeneous etiologies of autism will require investment in comprehensive longitudinal data acquisition from large community based cohorts. With this in mind, we have established a hospital-university-based (HUB) database of autism which incorporates prospective and retrospective data from a large and ethnically diverse…

  3. Improving Decisions with Data

    ERIC Educational Resources Information Center

    Johnson, Doug

    2004-01-01

    Schools gather, store and use an increasingly large amount of data. Keeping track of everything from bus routes to building access codes to test scores to sports equipment is done with the help of electronic database programs. Large databases designed for budgeting and student record keeping have long been an integral part of the educational…

  4. Sagace: A web-based search engine for biomedical databases in Japan

    PubMed Central

    2012-01-01

    Background In the big data era, biomedical research continues to generate a large amount of data, and the generated information is often stored in a database and made publicly available. Although combining data from multiple databases should accelerate further studies, the current number of life sciences databases is too large for researchers to grasp the features and contents of each one. Findings We have developed Sagace, a web-based search engine that enables users to retrieve information from a range of biological databases (such as gene expression profiles and proteomics data) and biological resource banks (such as mouse models of disease and cell lines). With Sagace, users can search more than 300 databases in Japan. Sagace offers features tailored to biomedical research, including manually tuned ranking, faceted navigation to refine search results, and rich snippets constructed with retrieved metadata for each database entry. Conclusions Sagace will be valuable for experts who are involved in biomedical research and drug development in both academia and industry. Sagace is freely available at http://sagace.nibio.go.jp/en/. PMID:23110816

  5. A high performance, ad-hoc, fuzzy query processing system for relational databases

    NASA Technical Reports Server (NTRS)

    Mansfield, William H., Jr.; Fleischman, Robert M.

    1992-01-01

    Database queries involving imprecise or fuzzy predicates are currently an evolving area of academic and industrial research. Such queries place severe stress on the indexing and I/O subsystems of conventional database environments since they involve the search of large numbers of records. The Datacycle architecture and research prototype is a database environment that uses filtering technology to perform an efficient, exhaustive search of an entire database. It has recently been modified to include fuzzy predicates in its query processing. The approach obviates the need for complex index structures, provides unlimited query throughput, permits the use of ad-hoc fuzzy membership functions, and provides a deterministic response time largely independent of query complexity and load. This paper describes the Datacycle prototype implementation of fuzzy queries and some recent performance results.
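
    A small sketch of what a fuzzy predicate evaluated by an exhaustive scan might look like; the trapezoidal membership function and the sample records are invented for illustration and do not reproduce the Datacycle filtering implementation.

```python
# Sketch of a fuzzy predicate evaluated by exhaustive scan.  The membership
# function and the data are invented; they do not reproduce Datacycle.
def cheap(price, full=20.0, zero=60.0):
    """Degree (0..1) to which a price is considered 'cheap'."""
    if price <= full:
        return 1.0
    if price >= zero:
        return 0.0
    return (zero - price) / (zero - full)

records = [("widget", 15.0), ("gadget", 35.0), ("gizmo", 80.0)]

# "SELECT * WHERE price IS cheap": rank every record by membership degree
ranked = sorted(((cheap(p), name) for name, p in records), reverse=True)
for degree, name in ranked:
    print(f"{name}: membership {degree:.2f}")
```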

  6. Integration of NASA/GSFC and USGS Rock Magnetic Databases.

    NASA Astrophysics Data System (ADS)

    Nazarova, K. A.; Glen, J. M.

    2004-05-01

    A global Magnetic Petrology Database (MPDB) was developed and continues to be updated at NASA/Goddard Space Flight Center. The purpose of this database is to provide the geomagnetic community with a comprehensive and user-friendly method of accessing magnetic petrology data via the Internet for a more realistic interpretation of satellite (as well as aeromagnetic and ground) lithospheric magnetic anomalies. The MPDB contains data on rocks from localities around the world (about 19,000 samples) including the Ukrainian and Baltic Shields, Kamchatka, Iceland, the Ural Mountains, etc. The MPDB is designed, managed and presented on the web as a research-oriented database. Several database applications have been specifically developed for data manipulation and analysis of the MPDB. The geophysics unit at the USGS in Menlo Park holds over 17,000 rock-property records, largely from sites within the western U.S. This database contains rock-density and rock-magnetic parameters collected for use in gravity and magnetic field modeling, and paleomagnetic studies. Most of these data were taken from surface outcrops and together they span a broad range of rock types. Measurements were made either in-situ at the outcrop, or in the laboratory on hand samples and paleomagnetic cores acquired in the field. The USGS and NASA/GSFC data will be integrated as part of an effort to provide public access to a single, uniformly maintained database. Due to the large number of records and the very large area sampled, the database can yield rock-property statistics on a broad range of rock types; it is thus applicable to study areas beyond the geographic scope of the database. The intent of this effort is to provide incentive for others to further contribute to the database, and a tool with which the geophysical community can entertain studies formerly precluded.

  7. Compressing DNA sequence databases with coil.

    PubMed

    White, W Timothy J; Hendy, Michael D

    2008-05-20

    Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.

  8. Compressing DNA sequence databases with coil

    PubMed Central

    White, W Timothy J; Hendy, Michael D

    2008-01-01

    Background Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work. PMID:18489794

  9. Estimating Missing Features to Improve Multimedia Information Retrieval

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bagherjeiran, A; Love, N S; Kamath, C

    Retrieval in a multimedia database usually involves combining information from different modalities of data, such as text and images. However, all modalities of the data may not be available to form the query. The retrieval results from such a partial query are often less than satisfactory. In this paper, we present an approach to complete a partial query by estimating the missing features in the query. Our experiments with a database of images and their associated captions show that, with an initial text-only query, our completion method has similar performance to a full query with both image and text features. In addition, when we use relevance feedback, our approach outperforms the results obtained using a full query.

  10. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

    PubMed Central

    2013-01-01

    Background Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. Results We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text. Conclusions We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts. PMID:23631733

  11. BioMart: a data federation framework for large collaborative projects.

    PubMed

    Zhang, Junjun; Haider, Syed; Baran, Joachim; Cros, Anthony; Guberman, Jonathan M; Hsu, Jack; Liang, Yong; Yao, Long; Kasprzyk, Arek

    2011-01-01

    BioMart is a freely available, open source, federated database system that provides a unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart allows databases hosted on different servers to be presented seamlessly to users, facilitating collaborative projects between different research groups. BioMart contains several levels of query optimization to efficiently manage large data sets and offers a diverse selection of graphical user interfaces and application programming interfaces to ensure that queries can be performed in whatever manner is most convenient for the user. The software has now been adopted by a large number of different biological databases spanning a wide range of data types and providing a rich source of annotation available to bioinformaticians and biologists alike.

  12. HMDB 4.0: the human metabolome database for 2018

    PubMed Central

    Feunang, Yannick Djoumbou; Marcu, Ana; Guo, An Chi; Liang, Kevin; Vázquez-Fresno, Rosa; Sajed, Tanvir; Johnson, Daniel; Li, Carin; Karu, Naama; Sayeeda, Zinat; Lo, Elvis; Assempour, Nazanin; Berjanskii, Mark; Singhal, Sandeep; Arndt, David; Liang, Yonjie; Badran, Hasan; Grant, Jason; Serra-Cayuela, Arnau; Liu, Yifeng; Mandal, Rupa; Neveu, Vanessa; Pon, Allison; Knox, Craig; Wilson, Michael; Manach, Claudine; Scalbert, Augustin

    2018-01-01

    Abstract The Human Metabolome Database or HMDB (www.hmdb.ca) is a web-enabled metabolomic database containing comprehensive information about human metabolites along with their biological roles, physiological concentrations, disease associations, chemical reactions, metabolic pathways, and reference spectra. First described in 2007, the HMDB is now considered the standard metabolomic resource for human metabolic studies. Over the past decade the HMDB has continued to grow and evolve in response to emerging needs for metabolomics researchers and continuing changes in web standards. This year's update, HMDB 4.0, represents the most significant upgrade to the database in its history. For instance, the number of fully annotated metabolites has increased by nearly threefold, the number of experimental spectra has grown by almost fourfold and the number of illustrated metabolic pathways has grown by a factor of almost 60. Significant improvements have also been made to the HMDB’s chemical taxonomy, chemical ontology, spectral viewing, and spectral/text searching tools. A great deal of brand new data has also been added to HMDB 4.0. This includes large quantities of predicted MS/MS and GC–MS reference spectral data as well as predicted (physiologically feasible) metabolite structures to facilitate novel metabolite identification. Additional information on metabolite-SNP interactions and the influence of drugs on metabolite levels (pharmacometabolomics) has also been added. Many other important improvements in the content, the interface, and the performance of the HMDB website have been made and these should greatly enhance its ease of use and its potential applications in nutrition, biochemistry, clinical chemistry, clinical genetics, medicine, and metabolomics science. PMID:29140435

  13. Using database reports to reduce workplace violence: Perceptions of hospital stakeholders

    PubMed Central

    Arnetz, Judith E.; Hamblin, Lydia; Ager, Joel; Aranyos, Deanna; Essenmacher, Lynnette; Upfal, Mark J.; Luborsky, Mark

    2016-01-01

    BACKGROUND Documented incidents of violence provide the foundation for any workplace violence prevention program. However, no published research to date has examined stakeholders’ preferences for workplace violence data reports in healthcare settings. If relevant data are not readily available and effectively summarized and presented, the likelihood is low that they will be utilized by stakeholders in targeted efforts to reduce violence. OBJECTIVE To discover and describe hospital system stakeholders’ perceptions of database-generated workplace violence data reports. PARTICIPANTS Eight hospital system stakeholders representing Human Resources, Security, Occupational Health Services, Quality and Safety, and Labor in a large, metropolitan hospital system. METHODS The hospital system utilizes a central database for reporting adverse workplace events, including incidents of violence. A focus group was conducted to identify stakeholders’ preferences and specifications for standardized, computerized reports of workplace violence data to be generated by the central database. The discussion was audio-taped, transcribed verbatim, processed as text, and analyzed using stepwise content analysis. RESULTS Five distinct themes emerged from participant responses: Concerns, Etiology, Customization, Use, and Outcomes. In general, stakeholders wanted data reports to provide “the big picture,” i.e., rates of occurrence; reasons for and details regarding incident occurrence; consequences for the individual employee and/or the workplace; and organizational efforts that were employed to deal with the incident. CONCLUSIONS Exploring stakeholder views regarding workplace violence summary reports provided concrete information on the preferred content, format, and use of workplace violence data. Participants desired both epidemiological and incident-specific data in order to better understand and work to prevent the workplace violence occurring in their hospital system. PMID:25059315

  14. Using database reports to reduce workplace violence: Perceptions of hospital stakeholders.

    PubMed

    Arnetz, Judith E; Hamblin, Lydia; Ager, Joel; Aranyos, Deanna; Essenmacher, Lynnette; Upfal, Mark J; Luborsky, Mark

    2015-01-01

    Documented incidents of violence provide the foundation for any workplace violence prevention program. However, no published research to date has examined stakeholders' preferences for workplace violence data reports in healthcare settings. If relevant data are not readily available and effectively summarized and presented, the likelihood is low that they will be utilized by stakeholders in targeted efforts to reduce violence. To discover and describe hospital system stakeholders' perceptions of database-generated workplace violence data reports. Eight hospital system stakeholders representing Human Resources, Security, Occupational Health Services, Quality and Safety, and Labor in a large, metropolitan hospital system. The hospital system utilizes a central database for reporting adverse workplace events, including incidents of violence. A focus group was conducted to identify stakeholders' preferences and specifications for standardized, computerized reports of workplace violence data to be generated by the central database. The discussion was audio-taped, transcribed verbatim, processed as text, and analyzed using stepwise content analysis. Five distinct themes emerged from participant responses: Concerns, Etiology, Customization, Use, and Outcomes. In general, stakeholders wanted data reports to provide "the big picture," i.e., rates of occurrence; reasons for and details regarding incident occurrence; consequences for the individual employee and/or the workplace; and organizational efforts that were employed to deal with the incident. Exploring stakeholder views regarding workplace violence summary reports provided concrete information on the preferred content, format, and use of workplace violence data. Participants desired both epidemiological and incident-specific data in order to better understand and work to prevent the workplace violence occurring in their hospital system.

  15. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

    PubMed Central

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. PMID:25336595
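
    A toy sketch of the simple dictionary-based term recognition the abstract contrasts with full NLP; the term dictionary, concept identifiers, and note text are invented examples, not data from STRIDE or the i2b2 Obesity Challenge.

```python
# Toy dictionary-based term recognition: scan note text for known terms.
# The dictionary, concept IDs, and note are invented examples.
import re

term_dictionary = {
    "myocardial infarction": "C0027051",
    "obesity": "C0028754",
    "metformin": "C0025598",
}

def annotate(note, dictionary):
    hits = []
    lowered = note.lower()
    for term, concept_id in dictionary.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((match.start(), term, concept_id))
    return sorted(hits)

note = "Patient with obesity, started on metformin; no myocardial infarction."
print(annotate(note, term_dictionary))
```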

  16. Human interface to large multimedia databases

    NASA Astrophysics Data System (ADS)

    Davis, Ben; Marks, Linn; Collins, Dave; Mack, Robert; Malkin, Peter; Nguyen, Tam

    1994-04-01

    The emergence of high-speed networking for multimedia will have the effect of turning the computer screen into a window on a very large information space. As this information space increases in size and complexity, providing users with easy and intuitive means of accessing information will become increasingly important. Providing access to large amounts of text has been the focus of work for hundreds of years and has resulted in the evolution of a set of standards, from the Dewey Decimal System for libraries to the recently proposed ANSI standards for representing information on-line: KIF, Knowledge Interchange Format, and CGs, Conceptual Graphs. Certain problems remain unsolved by these efforts, though: how to let users know the contents of the information space, so that they know whether or not they want to search it in the first place, how to facilitate browsing, and, more specifically, how to facilitate visual browsing. These issues are particularly important for users in educational contexts and have been the focus of much of our recent work. In this paper we discuss some of the solutions we have prototyped: specifically, visual means, visual browsers, and visual definitional sequences.

  17. Application of Large-Scale Database-Based Online Modeling to Plant State Long-Term Estimation

    NASA Astrophysics Data System (ADS)

    Ogawa, Masatoshi; Ogai, Harutoshi

    Recently, attention has been drawn to local modeling techniques based on a new idea called "Just-In-Time (JIT) modeling". To apply "JIT modeling" online to a large database, "Large-scale database-based Online Modeling (LOM)" has been proposed. LOM is a technique that makes the retrieval of neighboring data more efficient by using both "stepwise selection" and quantization. In order to predict the long-term state of the plant without using future data of manipulated variables, an Extended Sequential Prediction method of LOM (ESP-LOM) has been proposed. In this paper, the LOM and the ESP-LOM are introduced.
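
    A minimal sketch of the just-in-time (local) modeling idea: for each query point, retrieve the nearest stored samples and fit a small local model on just those. The data and neighbourhood size are illustrative, and LOM's stepwise selection and quantization steps are omitted.

```python
# Just-in-time (local) modeling sketch: retrieve the k nearest stored samples
# to each query and fit a local linear model on them.  Data and k are
# illustrative; LOM's stepwise selection and quantization are omitted.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))              # stored "plant" data
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.05, 500)

def jit_predict(x_query, X, y, k=25):
    dist = np.linalg.norm(X - x_query, axis=1)
    idx = np.argsort(dist)[:k]                     # neighbouring data
    A = np.hstack([X[idx], np.ones((k, 1))])       # local linear model
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    return np.append(x_query, 1.0) @ coef

print(jit_predict(np.array([3.0, 5.0]), X, y))
```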

  18. Integrating query of relational and textual data in clinical databases: a case study.

    PubMed

    Fisk, John M; Mutalik, Pradeep; Levin, Forrest W; Erdos, Joseph; Taylor, Caroline; Nadkarni, Prakash

    2003-01-01

    The authors designed and implemented a clinical data mart composed of an integrated information retrieval (IR) and relational database management system (RDBMS). Using commodity software, which supports interactive, attribute-centric text and relational searches, the mart houses 2.8 million documents that span a five-year period and supports basic IR features such as Boolean searches, stemming, and proximity and fuzzy searching. Results are relevance-ranked using either "total documents per patient" or "report type weighting." Non-curated medical text has a significant degree of malformation with respect to spelling and punctuation, which creates difficulties for text indexing and searching. Presently, the IR facilities of RDBMS packages lack the features necessary to handle such malformed text adequately. A robust IR+RDBMS system can be developed, but it requires integrating RDBMSs with third-party IR software. RDBMS vendors need to make their IR offerings more accessible to non-programmers.
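
    A sketch of combining an attribute-centric relational filter with a full-text match, here using SQLite's FTS5 extension; this is not the commodity IR+RDBMS stack from the case study, and the tables and report text are invented. It assumes a Python/SQLite build with FTS5 enabled.

```python
# Integrated attribute + text query sketch using SQLite FTS5 (assumed to be
# compiled in).  Tables and report text are invented examples.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE report (id INTEGER PRIMARY KEY, patient_id INTEGER, report_type TEXT);
    CREATE VIRTUAL TABLE report_text USING fts5(body);
""")
con.execute("INSERT INTO report VALUES (1, 7, 'radiology')")
con.execute("INSERT INTO report_text (rowid, body) VALUES (1, 'chest radiograph shows clear lungs')")
con.execute("INSERT INTO report VALUES (2, 7, 'discharge')")
con.execute("INSERT INTO report_text (rowid, body) VALUES (2, 'treated for pneumonia and discharged')")

# Attribute-centric filter (patient_id) combined with a text match
rows = con.execute("""
    SELECT report.id, report.report_type
    FROM report JOIN report_text ON report_text.rowid = report.id
    WHERE report.patient_id = 7 AND report_text MATCH 'pneumonia'
""").fetchall()
print(rows)   # [(2, 'discharge')]
```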

  19. System, method and apparatus for generating phrases from a database

    NASA Technical Reports Server (NTRS)

    McGreevy, Michael W. (Inventor)

    2004-01-01

    Phrase generation is a method of generating sequences of terms, such as phrases, that may occur within a database of subsets containing sequences of terms, such as text. A database is provided and a relational model of the database is created. A query is then input. The query includes a term or a sequence of terms or multiple individual terms or multiple sequences of terms or combinations thereof. Next, several sequences of terms that are contextually related to the query are assembled from contextual relations in the model of the database. The sequences of terms are then sorted and output. Phrase generation can also be an iterative process used to produce sequences of terms from a relational model of a database.
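
    A toy sketch of one way to assemble contextually related term sequences: build adjacency counts from a small text collection and greedily extend a query term into a frequent phrase. This illustrates the general idea only; it is not the patented method itself, and the corpus is invented.

```python
# Toy phrase generation: bigram adjacency counts, then greedy extension of a
# query term.  Illustration only; not the patented method.
from collections import Counter, defaultdict

corpus = [
    "engine failure during takeoff",
    "engine failure on final approach",
    "hydraulic failure during climb",
]

following = defaultdict(Counter)
for doc in corpus:
    tokens = doc.split()
    for a, b in zip(tokens, tokens[1:]):
        following[a][b] += 1

def generate_phrase(term, max_len=4):
    phrase = [term]
    while len(phrase) < max_len and following[phrase[-1]]:
        phrase.append(following[phrase[-1]].most_common(1)[0][0])
    return " ".join(phrase)

print(generate_phrase("engine"))   # e.g. "engine failure during takeoff"
```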

  20. Manual versus automated coding of free-text self-reported medication data in the 45 and Up Study: a validation study.

    PubMed

    Gnjidic, Danijela; Pearson, Sallie-Anne; Hilmer, Sarah N; Basilakis, Jim; Schaffer, Andrea L; Blyth, Fiona M; Banks, Emily

    2015-03-30

    Increasingly, automated methods are being used to code free-text medication data, but evidence on the validity of these methods is limited. To examine the accuracy of automated coding of previously keyed in free-text medication data compared with manual coding of original handwritten free-text responses (the 'gold standard'). A random sample of 500 participants (475 with and 25 without medication data in the free-text box) enrolled in the 45 and Up Study was selected. Manual coding involved medication experts keying in free-text responses and coding using Anatomical Therapeutic Chemical (ATC) codes (i.e. chemical substance 7-digit level; chemical subgroup 5-digit; pharmacological subgroup 4-digit; therapeutic subgroup 3-digit). Using keyed-in free-text responses entered by non-experts, the automated approach coded entries using the Australian Medicines Terminology database and assigned corresponding ATC codes. Based on manual coding, 1377 free-text entries were recorded and, of these, 1282 medications were coded to ATCs manually. The sensitivity of automated coding compared with manual coding was 79% (n = 1014) for entries coded at the exact ATC level, and 81.6% (n = 1046), 83.0% (n = 1064) and 83.8% (n = 1074) at the 5, 4 and 3-digit ATC levels, respectively. The sensitivity of automated coding for blank responses was 100% compared with manual coding. Sensitivity of automated coding was highest for prescription medications and lowest for vitamins and supplements, compared with the manual approach. Positive predictive values for automated coding were above 95% for 34 of the 38 individual prescription medications examined. Automated coding for free-text prescription medication data shows very high to excellent sensitivity and positive predictive values, indicating that automated methods can potentially be useful for large-scale, medication-related research.
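
    A small sketch of automated coding of free-text medication entries to ATC codes by normalized dictionary lookup, including truncation to the 5-character subgroup level; the lookup table is a tiny illustration and the code values should be verified against the official ATC index before any real use.

```python
# Automated ATC coding sketch: normalized dictionary lookup plus truncation
# to the ATC subgroup level.  Lookup values are illustrative only and should
# be checked against the official ATC index.
atc_lookup = {
    "metformin": "A10BA02",
    "atorvastatin": "C10AA05",
    "paracetamol": "N02BE01",
}

def code_entry(free_text):
    token = free_text.strip().lower().split()[0]   # crude normalization
    code = atc_lookup.get(token)
    return code, code[:5] if code else None        # full code + subgroup level

for entry in ["Metformin 500mg twice daily", "fish oil capsules"]:
    print(entry, "->", code_entry(entry))
```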

  1. Mining the Galaxy Zoo Database: Machine Learning Applications

    NASA Astrophysics Data System (ADS)

    Borne, Kirk D.; Wallin, J.; Vedachalam, A.; Baehr, S.; Lintott, C.; Darg, D.; Smith, A.; Fortson, L.

    2010-01-01

    The new Zooniverse initiative is addressing the data flood in the sciences through a transformative partnership between professional scientists, volunteer citizen scientists, and machines. As part of this project, we are exploring the application of machine learning techniques to data mining problems associated with the large and growing database of volunteer science results gathered by the Galaxy Zoo citizen science project. We will describe the basic challenge, some machine learning approaches, and early results. One of the motivators for this study is the acquisition (through the Galaxy Zoo results database) of approximately 100 million classification labels for roughly one million galaxies, yielding a tremendously large and rich set of training examples for improving automated galaxy morphological classification algorithms. In our first case study, the goal is to learn which morphological and photometric features in the Sloan Digital Sky Survey (SDSS) database correlate most strongly with user-selected galaxy morphological class. As a corollary to this study, we are also aiming to identify which galaxy parameters in the SDSS database correspond to galaxies that have been the most difficult to classify (based upon large dispersion in their volunteer-provided classifications). Our second case study will focus on similar data mining analyses and machine learning algorithms applied to the Galaxy Zoo catalog of merging and interacting galaxies. The outcomes of this project will have applications in future large sky surveys, such as the LSST (Large Synoptic Survey Telescope) project, which will generate a catalog of 20 billion galaxies and will produce an additional astronomical alert database of approximately 100 thousand events each night for 10 years -- the capabilities and algorithms that we are exploring will assist in the rapid characterization and classification of such massive data streams. This research has been supported in part through NSF award #0941610.
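
    A minimal sketch of the kind of supervised learning described in the first case study: fit a classifier on catalog features to predict volunteer-assigned morphology and inspect feature importances. The features and labels below are synthetic stand-ins, not SDSS or Galaxy Zoo data.

```python
# Sketch: learn which catalog features predict volunteer-assigned morphology.
# Features and labels are synthetic stand-ins, not SDSS or Galaxy Zoo data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 2000
concentration = rng.normal(2.5, 0.6, n)     # concentration-index-like feature
color_gr = rng.normal(0.6, 0.3, n)          # g-r colour-like feature
# Synthetic rule: concentrated, red galaxies tend to be labelled "elliptical"
label = ((concentration + color_gr + rng.normal(0, 0.4, n)) > 3.2).astype(int)

X = np.column_stack([concentration, color_gr])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, label)
print(dict(zip(["concentration", "g-r"], clf.feature_importances_.round(3))))
```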

  2. Complete Decoding and Reporting of Aviation Routine Weather Reports (METARs)

    NASA Technical Reports Server (NTRS)

    Lui, Man-Cheung Max

    2014-01-01

    Aviation Routine Weather Report (METAR) provides surface weather information at and around observation stations, including airport terminals. These weather observations are used by pilots for flight planning and by air traffic service providers for managing departure and arrival flights. The METARs are also an important source of weather data for Air Traffic Management (ATM) analysts and researchers at NASA and elsewhere. These researchers use METAR to correlate severe weather events with local or national air traffic actions that restrict air traffic, as one example. A METAR is made up of multiple groups of coded text, each with a specific standard coding format. These groups of coded text are located in two sections of a report: Body and Remarks. The coded text groups in a U.S. METAR are intended to follow the coding standards set by the National Oceanic and Atmospheric Administration (NOAA). However, manual data entry and edits made by a human report observer may result in coded text elements that do not follow the standards, especially in the Remarks section. And contrary to the standards, some significant weather observations are noted only in the Remarks section and not in the Body section of the reports. While human readers can infer the intended meaning of non-standard coding of weather conditions, doing so with a computer program is far more challenging. However, such programmatic pre-processing is necessary to enable efficient and faster database queries when researchers need to perform any significant historical weather analysis. Therefore, to support such analysis, a computer algorithm was developed to identify groups of coded text anywhere in a report and to perform subsequent decoding in software. The algorithm considers common deviations from the standards and data entry mistakes made by observers. The implemented software was tested on 12 million reports, and the decoding process was able to completely interpret 99.93% of them. This document presents the deviations from the standards and the decoding algorithm. Storing all decoded data in a database allows users to quickly query a large amount of data and to perform data mining on the data. Users can specify complex query criteria not only on date or airport but also on weather condition. This document also describes the design of a database schema for storing the decoded data, and a Data Warehouse web application that allows users to perform reporting and analysis on the decoded data. Finally, this document presents a case study correlating dust storms reported in METARs from the Phoenix International airport with Ground Stops issued by the Air Traffic Control System Command Center (ATCSCC). Blowing widespread dust is one of the weather conditions reported when a dust storm occurs. By querying the database, 294 METARs were found to report blowing widespread dust at the Phoenix airport, and 41 of them reported such a condition only in the Remarks section of the reports. When METAR is a data source for ATM research, it is important to include weather conditions not only from the Body section but also from the Remarks section of METARs.
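
    A small regex-based sketch of decoding one group from a METAR body (the wind group); the sample report is illustrative, and a real decoder of the kind described handles many more groups, the Remarks section, and observer deviations from the standard.

```python
# Sketch: decode the wind group of a METAR with a regular expression.
# The sample report is illustrative; a full decoder covers many more groups.
import re

WIND = re.compile(r"\b(?P<dir>\d{3}|VRB)(?P<spd>\d{2,3})(G(?P<gust>\d{2,3}))?KT\b")

def decode_wind(metar):
    m = WIND.search(metar)
    if not m:
        return None
    return {
        "direction_deg": None if m.group("dir") == "VRB" else int(m.group("dir")),
        "speed_kt": int(m.group("spd")),
        "gust_kt": int(m.group("gust")) if m.group("gust") else None,
    }

report = "KPHX 051651Z 24012G20KT 10SM FEW120 SCT250 38/07 A2992 RMK AO2"
print(decode_wind(report))   # {'direction_deg': 240, 'speed_kt': 12, 'gust_kt': 20}
```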

  3. Full-Text Searching on Major Supermarket Systems: Dialog, Data-Star, and Nexis.

    ERIC Educational Resources Information Center

    Tenopir, Carol; Berglund, Sharon

    1993-01-01

    Examines the similarities, differences, and full-text features of the three most-used online systems for full-text searching in general libraries: DIALOG, Data-Star, and NEXIS. Overlapping databases, unique sources, search features, proximity operators, set building, language enhancement and word equivalencies, and display features are discussed.…

  4. Clinical records anonymisation and text extraction (CRATE): an open-source software system.

    PubMed

    Cardinal, Rudolf N

    2017-04-26

    Electronic medical records contain information of value for research, but contain identifiable and often highly sensitive confidential information. Patient-identifiable information cannot in general be shared outside clinical care teams without explicit consent, but anonymisation/de-identification allows research uses of clinical data without explicit consent. This article presents CRATE (Clinical Records Anonymisation and Text Extraction), an open-source software system with separable functions: (1) it anonymises or de-identifies arbitrary relational databases, with sensitivity and precision similar to previous comparable systems; (2) it uses public secure cryptographic methods to map patient identifiers to research identifiers (pseudonyms); (3) it connects relational databases to external tools for natural language processing; (4) it provides a web front end for research and administrative functions; and (5) it supports a specific model through which patients may consent to be contacted about research. Creation and management of a research database from sensitive clinical records with secure pseudonym generation, full-text indexing, and a consent-to-contact process is possible and practical using entirely free and open-source software.
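
    A minimal sketch of mapping patient identifiers to research pseudonyms with a keyed hash (HMAC-SHA256); the key handling, identifier format, and pseudonym layout here are illustrative only and do not reproduce CRATE's actual implementation.

```python
# Pseudonym sketch: keyed hash (HMAC-SHA256) of a patient identifier.
# Key management and identifier format are illustrative, not CRATE's own.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-long-random-secret"   # held only by the trusted party

def pseudonymise(patient_id: str) -> str:
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()
    return "RID-" + digest[:16]          # shortened research identifier

print(pseudonymise("NHS1234567"))        # same input always yields the same pseudonym
```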

  5. Database for the Quaternary and Pliocene Yellowstone Plateau volcanic field of Wyoming, Idaho, and Montana (Database for Professional Paper 729-G)

    USGS Publications Warehouse

    Koch, Richard D.; Ramsey, David W.; Christiansen, Robert L.

    2011-01-01

    The superlative hot springs, geysers, and fumarole fields of Yellowstone National Park are vivid reminders of a recent volcanic past. Volcanism on an immense scale largely shaped the unique landscape of central and western Yellowstone Park, and intimately related tectonism and seismicity continue even now. Furthermore, the volcanism that gave rise to Yellowstone's hydrothermal displays was only part of a long history of late Cenozoic eruptions in southern and eastern Idaho, northwestern Wyoming, and southwestern Montana. The late Cenozoic volcanism of Yellowstone National Park, although long believed to have occurred in late Tertiary time, is now known to have been of latest Pliocene and Pleistocene age. The eruptions formed a complex plateau of voluminous rhyolitic ash-flow tuffs and lavas, but basaltic lavas too have erupted intermittently around the margins of the rhyolite plateau. Volcanism almost certainly will recur in the Yellowstone National Park region. This digital release contains all the information used to produce the geologic maps published as plates in U.S. Geological Survey Professional Paper 729-G (Christiansen, 2001). The main component of this digital release is a geologic map database prepared using geographic information systems (GIS) applications. This release also contains files to view or print the geologic maps and main report text from Professional Paper 729-G.

  6. Clinical results of HIS, RIS, PACS integration using data integration CASE tools

    NASA Astrophysics Data System (ADS)

    Taira, Ricky K.; Chan, Hing-Ming; Breant, Claudine M.; Huang, Lu J.; Valentino, Daniel J.

    1995-05-01

    Current infrastructure research in PACS is dominated by the development of communication networks (local area networks, teleradiology, ATM networks, etc.), multimedia display workstations, and hierarchical image storage architectures. However, limited work has been performed on developing flexible, expansible, and intelligent information processing architectures for the vast decentralized image and text data repositories prevalent in healthcare environments. Patient information is often distributed among multiple data management systems. Current large-scale efforts to integrate medical information and knowledge sources have been costly with limited retrieval functionality. Software integration strategies to unify distributed data and knowledge sources are still lacking commercially. Systems heterogeneity (i.e., differences in hardware platforms, communication protocols, database management software, nomenclature, etc.) is at the heart of the problem and is unlikely to be standardized in the near future. In this paper, we demonstrate the use of newly available CASE (computer-aided software engineering) tools to rapidly integrate HIS, RIS, and PACS information systems. The advantages of these tools include fast development time (low-level code is generated from graphical specifications) and easy system maintenance (excellent documentation, easy to perform changes, and a centralized code repository in an object-oriented database). The CASE tools are used to develop and manage the 'middleware' in our client-mediator-server architecture for systems integration. Our architecture is scalable and can accommodate heterogeneous database and communication protocols.

  7. Data Sources for Trait Databases: Comparing the Phenomic Content of Monographs and Evolutionary Matrices.

    PubMed

    Dececchi, T Alex; Mabee, Paula M; Blackburn, David C

    2016-01-01

    Databases of organismal traits that aggregate information from one or multiple sources can be leveraged for large-scale analyses in biology. Yet the differences among these data streams and how well they capture trait diversity have never been explored. We present the first analysis of the differences between phenotypes captured in free text of descriptive publications ('monographs') and those used in phylogenetic analyses ('matrices'). We focus our analysis on osteological phenotypes of the limbs of four extinct vertebrate taxa critical to our understanding of the fin-to-limb transition. We find that there is low overlap between the anatomical entities used in these two sources of phenotype data, indicating that phenotypes represented in matrices are not simply a subset of those found in monographic descriptions. Perhaps as expected, compared to characters found in matrices, phenotypes in monographs tend to emphasize descriptive and positional morphology, be somewhat more complex, and relate to fewer additional taxa. While based on a small set of focal taxa, these qualitative and quantitative data suggest that either source of phenotypes alone will result in incomplete knowledge of variation for a given taxon. As a broader community develops to use and expand databases characterizing organismal trait diversity, it is important to recognize the limitations of the data sources and develop strategies to more fully characterize variation both within species and across the tree of life.

  8. Data Sources for Trait Databases: Comparing the Phenomic Content of Monographs and Evolutionary Matrices

    PubMed Central

    Dececchi, T. Alex; Mabee, Paula M.; Blackburn, David C.

    2016-01-01

    Databases of organismal traits that aggregate information from one or multiple sources can be leveraged for large-scale analyses in biology. Yet the differences among these data streams and how well they capture trait diversity have never been explored. We present the first analysis of the differences between phenotypes captured in free text of descriptive publications (‘monographs’) and those used in phylogenetic analyses (‘matrices’). We focus our analysis on osteological phenotypes of the limbs of four extinct vertebrate taxa critical to our understanding of the fin-to-limb transition. We find that there is low overlap between the anatomical entities used in these two sources of phenotype data, indicating that phenotypes represented in matrices are not simply a subset of those found in monographic descriptions. Perhaps as expected, compared to characters found in matrices, phenotypes in monographs tend to emphasize descriptive and positional morphology, be somewhat more complex, and relate to fewer additional taxa. While based on a small set of focal taxa, these qualitative and quantitative data suggest that either source of phenotypes alone will result in incomplete knowledge of variation for a given taxon. As a broader community develops to use and expand databases characterizing organismal trait diversity, it is important to recognize the limitations of the data sources and develop strategies to more fully characterize variation both within species and across the tree of life. PMID:27191170

  9. UKPMC: a full text article resource for the life sciences.

    PubMed

    McEntyre, Johanna R; Ananiadou, Sophia; Andrews, Stephen; Black, William J; Boulderstone, Richard; Buttery, Paula; Chaplin, David; Chevuru, Sandeepreddy; Cobley, Norman; Coleman, Lee-Ann; Davey, Paul; Gupta, Bharti; Haji-Gholam, Lesley; Hawkins, Craig; Horne, Alan; Hubbard, Simon J; Kim, Jee-Hyub; Lewin, Ian; Lyte, Vic; MacIntyre, Ross; Mansoor, Sami; Mason, Linda; McNaught, John; Newbold, Elizabeth; Nobata, Chikashi; Ong, Ernest; Pillai, Sharmila; Rebholz-Schuhmann, Dietrich; Rosie, Heather; Rowbotham, Rob; Rupp, C J; Stoehr, Peter; Vaughan, Philip

    2011-01-01

    UK PubMed Central (UKPMC) is a full-text article database that extends the functionality of the original PubMed Central (PMC) repository. The UKPMC project was launched as the first 'mirror' site to PMC, which in analogy to the International Nucleotide Sequence Database Collaboration, aims to provide international preservation of the open and free-access biomedical literature. UKPMC (http://ukpmc.ac.uk) has undergone considerable development since its inception in 2007 and now includes both a UKPMC and PubMed search, as well as access to other records such as Agricola, Patents and recent biomedical theses. UKPMC also differs from PubMed/PMC in that the full text and abstract information can be searched in an integrated manner from one input box. Furthermore, UKPMC contains 'Cited By' information as an alternative way to navigate the literature and has incorporated text-mining approaches to semantically enrich content and integrate it with related database resources. Finally, UKPMC also offers added-value services (UKPMC+) that enable grantees to deposit manuscripts, link papers to grants, publish online portfolios and view citation information on their papers. Here we describe UKPMC and clarify the relationship between PMC and UKPMC, providing historical context and future directions, 10 years on from when PMC was first launched.

  10. UKPMC: a full text article resource for the life sciences

    PubMed Central

    McEntyre, Johanna R.; Ananiadou, Sophia; Andrews, Stephen; Black, William J.; Boulderstone, Richard; Buttery, Paula; Chaplin, David; Chevuru, Sandeepreddy; Cobley, Norman; Coleman, Lee-Ann; Davey, Paul; Gupta, Bharti; Haji-Gholam, Lesley; Hawkins, Craig; Horne, Alan; Hubbard, Simon J.; Kim, Jee-Hyub; Lewin, Ian; Lyte, Vic; MacIntyre, Ross; Mansoor, Sami; Mason, Linda; McNaught, John; Newbold, Elizabeth; Nobata, Chikashi; Ong, Ernest; Pillai, Sharmila; Rebholz-Schuhmann, Dietrich; Rosie, Heather; Rowbotham, Rob; Rupp, C. J.; Stoehr, Peter; Vaughan, Philip

    2011-01-01

    UK PubMed Central (UKPMC) is a full-text article database that extends the functionality of the original PubMed Central (PMC) repository. The UKPMC project was launched as the first ‘mirror’ site to PMC, which in analogy to the International Nucleotide Sequence Database Collaboration, aims to provide international preservation of the open and free-access biomedical literature. UKPMC (http://ukpmc.ac.uk) has undergone considerable development since its inception in 2007 and now includes both a UKPMC and PubMed search, as well as access to other records such as Agricola, Patents and recent biomedical theses. UKPMC also differs from PubMed/PMC in that the full text and abstract information can be searched in an integrated manner from one input box. Furthermore, UKPMC contains ‘Cited By’ information as an alternative way to navigate the literature and has incorporated text-mining approaches to semantically enrich content and integrate it with related database resources. Finally, UKPMC also offers added-value services (UKPMC+) that enable grantees to deposit manuscripts, link papers to grants, publish online portfolios and view citation information on their papers. Here we describe UKPMC and clarify the relationship between PMC and UKPMC, providing historical context and future directions, 10 years on from when PMC was first launched. PMID:21062818

  11. Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications.

    PubMed

    Chen, Hongyu; Martin, Bronwen; Daimon, Caitlin M; Maudsley, Stuart

    2013-01-01

    Text mining is rapidly becoming an essential technique for the annotation and analysis of large biological data sets. Biomedical literature currently increases at a rate of several thousand papers per week, making automated information retrieval methods the only feasible method of managing this expanding corpus. With the increasing prevalence of open-access journals and constant growth of publicly-available repositories of biomedical literature, literature mining has become much more effective with respect to the extraction of biomedically-relevant data. In recent years, text mining of popular databases such as MEDLINE has evolved from basic term-searches to more sophisticated natural language processing techniques, indexing and retrieval methods, structural analysis and integration of literature with associated metadata. In this review, we will focus on Latent Semantic Indexing (LSI), a computational linguistics technique increasingly used for a variety of biological purposes. It is noted for its ability to consistently outperform benchmark Boolean text searches and co-occurrence models at information retrieval and its power to extract indirect relationships within a data set. LSI has been used successfully to formulate new hypotheses, generate novel connections from existing data, and validate empirical data.
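
    A compact sketch of latent semantic indexing with scikit-learn: a TF-IDF term-document matrix reduced by truncated SVD, with query similarity computed in the reduced space. The documents below are toy examples, not MEDLINE records.

```python
# Compact LSI sketch: TF-IDF matrix, truncated SVD to a low-rank "semantic"
# space, then cosine similarity of a query in that space.  Toy documents only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "insulin regulates glucose metabolism",
    "glucose uptake is impaired in diabetes",
    "the telescope observed a distant galaxy",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

lsi = TruncatedSVD(n_components=2, random_state=0)
X_lsi = lsi.fit_transform(X)

query = tfidf.transform(["diabetes and insulin"])
q_lsi = lsi.transform(query)
print(cosine_similarity(q_lsi, X_lsi))   # biomedical docs score higher than the astronomy doc
```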

  12. Using Large-Scale Databases in Evaluation: Advances, Opportunities, and Challenges

    ERIC Educational Resources Information Center

    Penuel, William R.; Means, Barbara

    2011-01-01

    Major advances in the number, capabilities, and quality of state, national, and transnational databases have opened up new opportunities for evaluators. Both large-scale data sets collected for administrative purposes and those collected by other researchers can provide data for a variety of evaluation-related activities. These include (a)…

  13. Improving the Scalability of an Exact Approach for Frequent Item Set Hiding

    ERIC Educational Resources Information Center

    LaMacchia, Carolyn

    2013-01-01

    Technological advances have led to the generation of large databases of organizational data recognized as an information-rich, strategic asset for internal analysis and sharing with trading partners. Data mining techniques can discover patterns in large databases including relationships considered strategically relevant to the owner of the data.…

  14. Reflections on CD-ROM: Bridging the Gap between Technology and Purpose.

    ERIC Educational Resources Information Center

    Saviers, Shannon Smith

    1987-01-01

    Provides a technological overview of CD-ROM (Compact Disc-Read Only Memory), an optically-based medium for data storage offering large storage capacity, computer-based delivery system, read-only medium, and economic mass production. CD-ROM database attributes appropriate for information delivery are also reviewed, including large database size,…

  15. Relational Databases: A Transparent Framework for Encouraging Biology Students to Think Informatically

    ERIC Educational Resources Information Center

    Rice, Michael; Gladstone, William; Weir, Michael

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a…

  16. Cost and cost-effectiveness studies in urologic oncology using large administrative databases.

    PubMed

    Wang, Ye; Mossanen, Matthew; Chang, Steven L

    2018-04-01

    Urologic cancers are not only among the most common types of cancers, but also among the most expensive cancers to treat in the United States. This study aimed to review the use of cost-effectiveness analyses (CEAs) and other cost analyses in urologic oncology using large databases to better understand the value of management strategies for these cancers. A literature review on CEAs and other cost analyses in urologic oncology using large databases. The options for and costs of diagnosing, treating, and following patients with urologic cancers can be expected to rise in the coming years. There are numerous opportunities in each urologic cancer to use CEAs to both lower costs and provide high-quality services. Improved cancer care must balance the integration of novelty with ensuring reasonable costs to patients and the health care system. With the increasing focus on cost containment, appreciating the value of competing strategies in caring for our patients is pivotal. Leveraging methods such as CEAs and harnessing large databases may help evaluate the merit of established or emerging strategies. Copyright © 2018 Elsevier Inc. All rights reserved.

  17. NASA scientific and technical information for the 1990s

    NASA Technical Reports Server (NTRS)

    Cotter, Gladys A.

    1990-01-01

    Projections for NASA scientific and technical information (STI) in the 1990s are outlined. NASA STI for the 1990s will maintain a quality bibliographic and full-text database, emphasizing electronic input and products supplemented by networked access to a wide variety of sources, particularly numeric databases.

  18. The Missing Link: Context Loss in Online Databases

    ERIC Educational Resources Information Center

    Mi, Jia; Nesta, Frederick

    2005-01-01

    Full-text databases do not allow for the complexity of the interaction of the human eye and brain with printed matter. As a result, both content and context may be lost. The authors propose additional indexing fields that would maintain the content and context of print in electronic formats.

  19. Redefining Information Access to Serials Information.

    ERIC Educational Resources Information Center

    Chen, Ching-chih

    1992-01-01

    Describes full-text document delivery services that have been introduced in conjunction with available databases in response to economic and technological changes affecting libraries: (1) CARL System's UnCover database and UnCover2 service; (2) Research Libraries Group's CitaDel delivery service; and (3) Faxon Research Service's Faxon Finder and…

  20. Conducting a Web Search.

    ERIC Educational Resources Information Center

    Miller-Whitehead, Marie

    Keyword and text string searches of online library catalogs often provide different results according to library and database used and depending upon how books and journals are indexed. For this reason, online databases such as ERIC often provide tutorials and recommendations for searching their site, such as how to use Boolean search strategies.…

  1. LSD: Large Survey Database framework

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2012-09-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures.

  2. Databases for LDEF results

    NASA Technical Reports Server (NTRS)

    Bohnhoff-Hlavacek, Gail

    1992-01-01

    One of the objectives of the team supporting the LDEF Systems and Materials Special Investigative Groups is to develop databases of experimental findings. These databases identify the hardware flown, summarize results and conclusions, and provide a system for acknowledging investigators, tracing sources of data, and recording future design suggestions. To date, databases covering the optical experiments and thermal control materials (chromic acid anodized aluminum, silverized Teflon blankets, and paints) have been developed at Boeing. We used the Filemaker Pro software, the database manager for the Macintosh computer produced by the Claris Corporation. It is a flat, text-retrievable database that provides access to the data via an intuitive user interface, without tedious programming. Though this software is available only for the Macintosh computer at this time, copies of the databases can be saved to a format that is readable on a personal computer as well. Further, the data can be exported to more powerful relational databases. This report describes the contents, capabilities, and use of the LDEF databases and explains how to obtain copies of the databases for your own research.

  3. Healthcare resource utilisation by patients with coronary heart disease receiving a lifestyle-focused text message support program: an analysis from the TEXT ME study.

    PubMed

    Thakkar, Jay; Redfern, Julie; Khan, Ehsan; Atkins, Emily; Ha, Jeffrey; Vo, Kha; Thiagalingam, Aravinda; Chow, Clara K

    2018-05-23

    The 'Tobacco, Exercise and Diet Messages' (TEXT ME) study was a 6-month, single-centre randomised clinical trial (RCT) that found a text message support program improved levels of cardiovascular risk factors in patients with coronary heart disease (CHD). The current analyses examined whether receipt of text messages influenced participants' engagement with conventional healthcare resources. The TEXT ME study database (N=710) was linked with routinely collected health department databases. Numbers of doctor consultations, investigations and cardiac medication prescriptions in the two study groups were compared. The most frequently accessed health service was consultations with a General Practitioner (mean 7.1, s.d. 5.4). The numbers of medical consultations, biochemical tests or cardiac-specific investigations were similar between the study groups. There was at least one prescription registered for statin, ACEI/ARBs and β-blockers in 79, 66 and 50% of patients respectively, with similar refill rates in both the study groups. The study identified that the TEXT ME text messaging program did not increase use of Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme (PBS) captured healthcare services. The observed benefits of TEXT ME therefore reflect direct effects of the intervention, independent of conventional healthcare resource engagement.

  4. Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE.

    PubMed

    Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong

    2012-01-01

    High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation. Database URLs: http://www.ncbi.nlm.nih.gov/PubMed, http://www.ncbi.nlm.nih.gov/geo/, http://www.rcsb.org/pdb/

  5. Measuring use patterns of online journals and databases

    PubMed Central

    De Groote, Sandra L.; Dorsch, Josephine L.

    2003-01-01

    Purpose: This research sought to determine use of online biomedical journals and databases and to assess current user characteristics associated with the use of online resources in an academic health sciences center. Setting: The Library of the Health Sciences–Peoria is a regional site of the University of Illinois at Chicago (UIC) Library with 350 print journals, more than 4,000 online journals, and multiple online databases. Methodology: A survey was designed to assess online journal use, print journal use, database use, computer literacy levels, and other library user characteristics. A survey was sent through campus mail to all (471) UIC Peoria faculty, residents, and students. Results: Forty-one percent (188) of the surveys were returned. Ninety-eight percent of the students, faculty, and residents reported having convenient access to a computer connected to the Internet. While 53% of the users indicated they searched MEDLINE at least once a week, other databases showed much lower usage. Overall, 71% of respondents indicated a preference for online over print journals when possible. Conclusions: Users prefer online resources to print, and many choose to access these online resources remotely. Convenience and full-text availability appear to play roles in selecting online resources. The findings of this study suggest that databases without links to full text and online journal collections without links from bibliographic databases will have lower use. These findings have implications for collection development, promotion of library resources, and end-user training. PMID:12883574

  6. Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE

    PubMed Central

    Névéol, Aurélie; Wilbur, W. John; Lu, Zhiyong

    2012-01-01

    High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation. Database URLs: http://www.ncbi.nlm.nih.gov/PubMed, http://www.ncbi.nlm.nih.gov/geo/, http://www.rcsb.org/pdb/ PMID:22685160

  7. Topic models: a novel method for modeling couple and family text data.

    PubMed

    Atkins, David C; Rubin, Timothy N; Steyvers, Mark; Doeden, Michelle A; Baucom, Brian R; Christensen, Andrew

    2012-10-01

    Couple and family researchers often collect open-ended linguistic data-either through free-response questionnaire items, or transcripts of interviews or therapy sessions. Because participants' responses are not forced into a set number of categories, text-based data can be very rich and revealing of psychological processes. At the same time, it is highly unstructured and challenging to analyze. Within family psychology, analyzing text data typically means applying a coding system, which can quantify text data but also has several limitations, including the time needed for coding, difficulties with interrater reliability, and defining a priori what should be coded. The current article presents an alternative method for analyzing text data called topic models (Steyvers & Griffiths, 2006), which has not yet been applied within couple and family psychology. Topic models have similarities to factor analysis and cluster analysis in that they identify underlying clusters of words with semantic similarities (i.e., the "topics"). In the present article, a nontechnical introduction to topic models is provided, highlighting how these models can be used for text exploration and indexing (e.g., quickly locating text passages that share semantic meaning) and how output from topic models can be used to predict behavioral codes or other types of outcomes. Throughout the article, a collection of transcripts from a large couple-therapy trial (Christensen et al., 2004) is used as example data to highlight potential applications. Practical resources for learning more about topic models and how to apply them are discussed. (PsycINFO Database Record (c) 2012 APA, all rights reserved).
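
    A minimal sketch of the kind of analysis described above, assuming scikit-learn (a recent version providing get_feature_names_out) is available; the toy utterances, the choice of two topics and all parameters are illustrative assumptions, not the authors' actual pipeline. It fits a latent Dirichlet allocation (LDA) topic model to a handful of free-response texts and prints the top words per inferred topic.

        # Sketch: fitting a topic model (LDA) to open-ended text responses.
        # The utterances below are invented stand-in data.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        utterances = [
            "we argued about money and the bills again this week",
            "I felt closer to my partner after we talked about the kids",
            "the budget and spending caused another fight",
            "talking about parenting the kids made us feel like a team",
        ]

        # Bag-of-words counts, dropping very common English words.
        vectorizer = CountVectorizer(stop_words="english")
        counts = vectorizer.fit_transform(utterances)

        # Two topics is an arbitrary choice for this toy corpus.
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topics = lda.fit_transform(counts)  # document-by-topic proportions

        # Show the highest-weight words for each inferred topic.
        vocab = vectorizer.get_feature_names_out()
        for k, weights in enumerate(lda.components_):
            top = weights.argsort()[::-1][:4]
            print(f"topic {k}:", ", ".join(vocab[i] for i in top))

    The document-by-topic proportions returned by fit_transform are the kind of output that, as the abstract notes, can then be used as predictors of behavioral codes or other outcomes.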

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Abbott, Jennifer; Sandberg, Tami

    The Wind-Wildlife Impacts Literature Database (WILD), formerly known as the Avian Literature Database, was created in 1997. The goal of the database was to begin tracking the research that detailed the potential impact of wind energy development on birds. The Avian Literature Database was originally housed on a proprietary platform called Livelink ECM from OpenText and maintained by in-house technical staff. The initial set of records was added by library staff. A vital part of the newly launched Drupal-based WILD database is the Bibliography module. Many of the resources included in the database have digital object identifiers (DOI). The bibliographic information for any item that has a DOI can be imported into the database using this module. This greatly reduces the amount of manual data entry required to add records to the database. The content available in WILD is international in scope, which can be easily discerned by looking at the tags available in the browse menu.

  9. Text-mining and information-retrieval services for molecular biology

    PubMed Central

    Krallinger, Martin; Valencia, Alfonso

    2005-01-01

    Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators. PMID:15998455

  10. Assignment to database industry

    NASA Astrophysics Data System (ADS)

    Abe, Kohichiroh

    Various kinds of databases are considered to be an essential part of future large-sized systems. Information provision by databases alone is also expected to grow as the market matures. This paper discusses how these circumstances have developed and how they are expected to evolve.

  11. Muscle Logic: New Knowledge Resource for Anatomy Enables Comprehensive Searches of the Literature on the Feeding Muscles of Mammals.

    PubMed

    Druzinsky, Robert E; Balhoff, James P; Crompton, Alfred W; Done, James; German, Rebecca Z; Haendel, Melissa A; Herrel, Anthony; Herring, Susan W; Lapp, Hilmar; Mabee, Paula M; Muller, Hans-Michael; Mungall, Christopher J; Sternberg, Paul W; Van Auken, Kimberly; Vinyard, Christopher J; Williams, Susan H; Wall, Christine E

    2016-01-01

    In recent years large bibliographic databases have made much of the published literature of biology available for searches. However, the capabilities of the search engines integrated into these databases for text-based bibliographic searches are limited. To enable searches that deliver the results expected by comparative anatomists, an underlying logical structure known as an ontology is required. Here we present the Mammalian Feeding Muscle Ontology (MFMO), a multi-species ontology focused on anatomical structures that participate in feeding and other oral/pharyngeal behaviors. A unique feature of the MFMO is that a simple, computable, definition of each muscle, which includes its attachments and innervation, is true across mammals. This construction mirrors the logical foundation of comparative anatomy and permits searches using language familiar to biologists. Further, it provides a template for muscles that will be useful in extending any anatomy ontology. The MFMO is developed to support the Feeding Experiments End-User Database Project (FEED, https://feedexp.org/), a publicly-available, online repository for physiological data collected from in vivo studies of feeding (e.g., mastication, biting, swallowing) in mammals. Currently the MFMO is integrated into FEED and also into two literature-specific implementations of Textpresso, a text-mining system that facilitates powerful searches of a corpus of scientific publications. We evaluate the MFMO by asking questions that test the ability of the ontology to return appropriate answers (competency questions). We compare the results of queries of the MFMO to results from similar searches in PubMed and Google Scholar. Our tests demonstrate that the MFMO is competent to answer queries formed in the common language of comparative anatomy, but PubMed and Google Scholar are not. Overall, our results show that by incorporating anatomical ontologies into searches, an expanded and anatomically comprehensive set of results can be obtained. The broader scientific and publishing communities should consider taking up the challenge of semantically enabled search capabilities.

  12. Signature Verification Based on Handwritten Text Recognition

    NASA Astrophysics Data System (ADS)

    Viriri, Serestina; Tapamo, Jules-R.

    Signatures continue to be an important biometric trait because they remain widely used, primarily for authenticating the identity of human beings. This paper presents an efficient text-based directional signature recognition algorithm which verifies signatures, even when they are composed of special unconstrained cursive characters which are superimposed and embellished. This algorithm extends the character-based signature verification technique. The experiments carried out on the GPDS signature database and an additional database created from signatures captured using the ePadInk tablet show that the approach is effective and efficient, with a positive verification rate of 94.95%.

  13. pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature

    PubMed Central

    Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.

    2015-01-01

    Background Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. Methods In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these phases. Results We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/). PMID:26258475
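
    As a rough illustration of the first step named above, dictionary-based mention detection, here is a sketch assuming an in-memory synonym table; the synonyms, identifiers and the simple longest-match scan are illustrative assumptions, not pGenN's actual implementation (punctuation handling and case normalisation are deliberately simplified).

        # Sketch: dictionary-based gene mention detection via longest-match lookup.
        # The synonym table below is invented stand-in data.
        synonyms = {
            "phyb": "AT2G18790",           # Arabidopsis PHYTOCHROME B
            "phytochrome b": "AT2G18790",
            "fls2": "AT5G46330",
        }

        def detect_mentions(text, dictionary, max_len=4):
            """Return (start, end, gene_id) token spans for dictionary hits."""
            tokens = text.lower().split()
            hits, i = [], 0
            while i < len(tokens):
                match = None
                # Try the longest phrase first, shrinking to a single token.
                for n in range(min(max_len, len(tokens) - i), 0, -1):
                    phrase = " ".join(tokens[i:i + n])
                    if phrase in dictionary:
                        match = (i, i + n, dictionary[phrase])
                        break
                if match:
                    hits.append(match)
                    i = match[1]  # skip past the matched phrase
                else:
                    i += 1
            return hits

        print(detect_mentions("Phytochrome B and FLS2 were upregulated", synonyms))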

  14. MannDB: A microbial annotation database for protein characterization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhou, C; Lam, M; Smith, J

    2006-05-19

    MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high-priority agents on the websites of several governmental organizations concerned with bio-terrorism. MannDB provides the user with a BLAST interface for comparison of native and non-native sequences and a query tool for conveniently selecting proteins of interest. In addition, the user has access to a web-based browser that compiles comprehensive and extensive reports.

  15. From atomic structure to excess entropy: a neutron diffraction and density functional theory study of CaO-Al2O3-SiO2 melts

    NASA Astrophysics Data System (ADS)

    Liu, Maoyuan; Jacob, Aurélie; Schmetterer, Clemens; Masset, Patrick J.; Hennet, Louis; Fischer, Henry E.; Kozaily, Jad; Jahn, Sandro; Gray-Weale, Angus

    2016-04-01

    Calcium aluminosilicate CaO-Al2O3-SiO2 (CAS) melts with compositions (CaO-SiO2)x(Al2O3)1-x for x < 0.5 and (Al2O3)x(SiO2)1-x for x ≥ 0.5 are studied using neutron diffraction with aerodynamic levitation and density functional theory molecular dynamics modelling. Simulated structure factors are found to be in good agreement with experimental structure factors. Local atomic structures from simulations reveal the role of calcium cations as a network modifier, and aluminium cations as a non-tetrahedral network former. Distributions of tetrahedral order show that an increasing concentration of the network former Al increases entropy, while an increasing concentration of the network modifier Ca decreases entropy. This trend is opposite to the conventional understanding that increasing amounts of network former should increase order in the network liquid, and so decrease entropy. The two-body correlation entropy S2 is found to not correlate with the excess entropy values obtained from thermochemical databases, while entropies including higher-order correlations such as tetrahedral order, O-M-O or M-O-M bond angles and Q^N environments show a clear linear correlation between computed entropy and database excess entropy. The possible relationship between atomic structures and excess entropy is discussed.
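
    For reference, a conventional definition of the two-body correlation entropy S2 mentioned above, for a one-component liquid of number density ρ with pair distribution function g(r), is the following (per particle, in units of k_B); the multi-component form needed for CAS melts sums composition-weighted partial pair correlations. This standard expression is stated here as an assumption about the definition used, not a quotation from the paper:

        \frac{S_2}{N k_B} = -2\pi\rho \int_0^\infty \left[ g(r)\,\ln g(r) - g(r) + 1 \right] r^2 \, dr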

  16. Description of 'REQUEST-KYUSHYU' for KYUKEICHO regional data base

    NASA Astrophysics Data System (ADS)

    Takimoto, Shin'ichi

    The Kyushu Economic Research Association (a foundational juridical person) recently initiated the regional database service 'REQUEST-Kyushu'. It is a full-scale database compiled from the information and know-how the Association has accumulated over forty years. It comprises a regional information database of journal and newspaper articles and a statistical information database of economic statistics. The former is searched on a personal computer and the search result (original text) is then sent by facsimile. The latter is also searched on a personal computer, where the data can be processed, edited or downloaded. This paper describes the characteristics, content and system outline of 'REQUEST-Kyushu'.

  17. Ambiguity and variability of database and software names in bioinformatics.

    PubMed

    Duck, Geraint; Kovacevic, Aleksandar; Robertson, David L; Stevens, Robert; Nenadic, Goran

    2015-01-01

    There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification. Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature. Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.

  18. Inconsistencies in the red blood cell membrane proteome analysis: generation of a database for research and diagnostic applications

    PubMed Central

    Hegedűs, Tamás; Chaubey, Pururawa Mayank; Várady, György; Szabó, Edit; Sarankó, Hajnalka; Hofstetter, Lia; Roschitzki, Bernd; Sarkadi, Balázs

    2015-01-01

    Based on recent results, the determination of the easily accessible red blood cell (RBC) membrane proteins may provide new diagnostic possibilities for assessing mutations, polymorphisms or regulatory alterations in diseases. However, the analysis of the current mass spectrometry-based proteomics datasets and other major databases indicates inconsistencies—the results show large scattering and only a limited overlap for the identified RBC membrane proteins. Here, we applied membrane-specific proteomics studies in human RBC, compared these results with the data in the literature, and generated a comprehensive and expandable database using all available data sources. The integrated web database now refers to proteomic, genetic and medical databases as well, and contains an unexpected large number of validated membrane proteins previously thought to be specific for other tissues and/or related to major human diseases. Since the determination of protein expression in RBC provides a method to indicate pathological alterations, our database should facilitate the development of RBC membrane biomarker platforms and provide a unique resource to aid related further research and diagnostics. Database URL: http://rbcc.hegelab.org PMID:26078478

  19. A blue carbon soil database: Tidal wetland stocks for the US National Greenhouse Gas Inventory

    NASA Astrophysics Data System (ADS)

    Feagin, R. A.; Eriksson, M.; Hinson, A.; Najjar, R. G.; Kroeger, K. D.; Herrmann, M.; Holmquist, J. R.; Windham-Myers, L.; MacDonald, G. M.; Brown, L. N.; Bianchi, T. S.

    2015-12-01

    Coastal wetlands contain large reservoirs of carbon, and in 2015 the US National Greenhouse Gas Inventory began the work of placing blue carbon within the national regulatory context. The potential value of a wetland carbon stock, in relation to its location, soon could be influential in determining governmental policy and management activities, or in stimulating market-based CO2 sequestration projects. To meet the national need for high-resolution maps, a blue carbon stock database was developed linking National Wetlands Inventory datasets with the USDA Soil Survey Geographic Database. Users of the database can identify the economic potential for carbon conservation or restoration projects within specific estuarine basins, states, wetland types, physical parameters, and land management activities. The database is geared towards both national-level assessments and local-level inquiries. Spatial analysis of the stocks shows high variance within individual estuarine basins, largely dependent on geomorphic position on the landscape, though there are continental scale trends to the carbon distribution as well. Future plans include linking this database with a sedimentary accretion database to predict carbon flux in US tidal wetlands.

  20. Query-oriented evidence extraction to support evidence-based medicine practice.

    PubMed

    Sarker, Abeed; Mollá, Diego; Paris, Cecile

    2016-02-01

    Evidence-based medicine practice requires medical practitioners to rely on the best available evidence, in addition to their expertise, when making clinical decisions. The medical domain boasts a large amount of published medical research data, indexed in various medical databases such as MEDLINE. As the size of this data grows, practitioners increasingly face the problem of information overload, and past research has established the time-associated obstacles faced by evidence-based medicine practitioners. In this paper, we focus on the problem of automatic text summarisation to help practitioners quickly find query-focused information from relevant documents. We utilise an annotated corpus that is specialised for the task of evidence-based summarisation of text. In contrast to past summarisation approaches, which mostly rely on surface level features to identify salient pieces of texts that form the summaries, our approach focuses on the use of corpus-based statistics, and domain-specific lexical knowledge for the identification of summary contents. We also apply a target-sentence-specific summarisation technique that reduces the problem of underfitting that persists in generic summarisation models. In automatic evaluations run over a large number of annotated summaries, our extractive summarisation technique statistically outperforms various baseline and benchmark summarisation models with a percentile rank of 96.8%. A manual evaluation shows that our extractive summarisation approach is capable of selecting content with high recall and precision, and may thus be used to generate bottom-line answers to practitioners' queries. Our research shows that the incorporation of specialised data and domain-specific knowledge can significantly improve text summarisation performance in the medical domain. Due to the vast amounts of medical text available, and the high growth of this form of data, we suspect that such summarisation techniques will address the time-related obstacles associated with evidence-based medicine. Copyright © 2015 Elsevier Inc. All rights reserved.

  1. Canadian pharmacy practice residents' projects: publication rates and study characteristics.

    PubMed

    Hung, Michelle; Duffett, Mark

    2013-03-01

    Research projects are a key component of pharmacy residents' education. Projects represent both a large investment of effort for each resident (up to 10 weeks over the residency year) and a large body of research (given that there are currently over 150 residency positions in Canada annually). Publication of results is a vital part of the dissemination of information gleaned from these projects. To determine the publication rate for research projects performed under the auspices of accredited English-language hospital pharmacy residency programs in Canada and to describe the study characteristics of residency projects performed in Ontario from 1999/2000 to 2008/2009. Lists of residents and project titles for the period of interest were obtained from residency coordinators. PubMed, CINAHL, the Canadian Journal of Hospital Pharmacy, and Google were searched for evidence of publication of each project identified, as an abstract or presentation at a meeting, a letter to the editor, or a full-text manuscript. The library holdings of the University of Toronto were reviewed to determine study characteristics of the Ontario residency projects. For the objective of this study relating to publication rate, 518 projects were included. The overall publication rate was 32.2% (60 [35.9%] as abstracts and 107 [64.1%] as full-text manuscripts). Publication in pharmacy-specific journals (66 [61.7%] of 107 full-text manuscripts) was more frequent than publication in non-pharmacy-specific journals. The publication rate of projects as full-text manuscripts remained stable over time. Of the 202 Ontario residency projects archived in the University of Toronto's library, most were cohort studies (83 [41.1%]), and the most common topic was efficacy and/or safety of a medication (46 [22.8%]). Most hospital pharmacy residents' projects were unpublished, and the publication rate of projects as full-text manuscripts has not increased over time. Most projects were observational studies. Increasing publication rates and creating a central database or repository of residency projects would increase the dissemination and accessibility of residents' research.

  2. Therapeutic hypnosis, psychotherapy, and the digital humanities: the narratives and culturomics of hypnosis, 1800-2008.

    PubMed

    Rossi, Ernest; Mortimer, Jane; Rossi, Kathryn

    2013-04-01

    Culturomics is a new scientific discipline of the digital humanities-the use of computer algorithms to search for meaning in large databases of text and media. This new digital discipline is used to explore 200 years of the history of hypnosis and psychotherapy in over five million digitized books from more than 40 university libraries around the world. It graphically compares the frequencies of English words about hypnosis, hypnotherapy, psychoanalysis, psychotherapy, and their founders from 1800 to 2008. This new perspective explores issues such as: Who were the major innovators in the history of therapeutic hypnosis, psychoanalysis, and psychotherapy? How well does this new digital approach to the humanities correspond to traditional histories of hypnosis and psychotherapy?

  3. Texture for script identification.

    PubMed

    Busch, Andrew; Boles, Wageeh W; Sridharan, Sridha

    2005-11-01

    The problem of determining the script and language of a document image has a number of important applications in the field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task. Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.

  4. ECO: A Framework for Entity Co-Occurrence Exploration with Faceted Navigation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Halliday, K. D.

    2010-08-20

    Even as highly structured databases and semantic knowledge bases become more prevalent, a substantial amount of human knowledge is reported as written prose. Typical textual reports, such as news articles, contain information about entities (people, organizations, and locations) and their relationships. Automatically extracting such relationships from large text corpora is a key component of corporate and government knowledge bases. The primary goal of the ECO project is to develop a scalable framework for extracting and presenting these relationships for exploration using an easily navigable faceted user interface. ECO uses entity co-occurrence relationships to identify related entities. The system aggregates and indexes information on each entity pair, allowing the user to rapidly discover and mine relational information.

  5. Semantic Technologies for Re-Use of Clinical Routine Data.

    PubMed

    Kreuzthaler, Markus; Martínez-Costa, Catalina; Kaiser, Peter; Schulz, Stefan

    2017-01-01

    Routine patient data in electronic patient records are only partly structured, and an even smaller segment is coded, mainly for administrative purposes. Large parts are only available as free text. Transforming this content into a structured and semantically explicit form is a prerequisite for querying and information extraction. The core of the system architecture presented in this paper is based on SAP HANA in-memory database technology using the SAP Connected Health platform for data integration as well as for clinical data warehousing. A natural language processing pipeline analyses unstructured content and maps it to a standardized vocabulary within a well-defined information model. The resulting semantically standardized patient profiles are used for a broad range of clinical and research application scenarios.

  6. New Resources for Computer-Aided Legal Research: An Assessment of the Usefulness of the DIALOG System in Securities Regulation Studies.

    ERIC Educational Resources Information Center

    Gruner, Richard; Heron, Carol E.

    1984-01-01

    Examines usefulness of DIALOG as legal research tool through use of DIALOG's DIALINDEX database to identify those databases among almost 200 available that contain large numbers of records related to federal securities regulation. Eight databases selected for further study are detailed. Twenty-six footnotes, database statistics, and samples are…

  7. Recent advances on terrain database correlation testing

    NASA Astrophysics Data System (ADS)

    Sakude, Milton T.; Schiavone, Guy A.; Morelos-Borja, Hector; Martin, Glenn; Cortes, Art

    1998-08-01

    Terrain database correlation is a major requirement for interoperability in distributed simulation. There are numerous situations in which terrain database correlation problems can occur that, in turn, lead to lack of interoperability in distributed training simulations. Examples are the use of different run-time terrain databases derived from inconsistent source data, the use of different resolutions, and the use of different data models between databases for both terrain and culture data. IST has been developing a suite of software tools, named ZCAP, to address terrain database interoperability issues. In this paper we discuss recent enhancements made to this suite, including improved algorithms for sampling and calculating line-of-sight, an improved method for measuring terrain roughness, and the application of a sparse matrix method to the terrain remediation solution developed at the Visual Systems Lab of the Institute for Simulation and Training. We review the application of some of these new algorithms to the terrain correlation measurement processes. The application of these new algorithms improves our support for very large terrain databases, and provides the capability for performing test replications to estimate the sampling error of the tests. With this set of tools, a user can quantitatively assess the degree of correlation between large terrain databases.

  8. Multiresource inventories incorporating GIS, GPS, and database management systems

    Treesearch

    Loukas G. Arvanitis; Balaji Ramachandran; Daniel P. Brackett; Hesham Abd-El Rasol; Xuesong Du

    2000-01-01

    Large-scale natural resource inventories generate enormous data sets. Their effective handling requires a sophisticated database management system. Such a system must be robust enough to efficiently store large amounts of data and flexible enough to allow users to manipulate a wide variety of information. In a pilot project, related to a multiresource inventory of the...

  9. Data-driven indexing mechanism for the recognition of polyhedral objects

    NASA Astrophysics Data System (ADS)

    McLean, Stewart; Horan, Peter; Caelli, Terry M.

    1992-02-01

    This paper is concerned with the problem of searching large model databases. To date, most object recognition systems have concentrated on the problem of matching using simple searching algorithms. This is quite acceptable when the number of object models is small. However, in the future, general purpose computer vision systems will be required to recognize hundreds or perhaps thousands of objects and, in such circumstances, efficient searching algorithms will be needed. The problem of searching a large model database is one which must be addressed if future computer vision systems are to be at all effective. In this paper we present a method we call data-driven feature-indexed hypothesis generation as one solution to the problem of searching large model databases.

  10. Spectral signature verification using statistical analysis and text mining

    NASA Astrophysics Data System (ADS)

    DeCoster, Mallory E.; Firpi, Alexe H.; Jacobs, Samantha K.; Cone, Shelli R.; Tzeng, Nigel H.; Rodriguez, Benjamin M.

    2016-05-01

    In the spectral science community, numerous spectral signatures are stored in databases representative of many sample materials collected from a variety of spectrometers and spectroscopists. Due to the variety and variability of the spectra that comprise many spectral databases, it is necessary to establish a metric for validating the quality of spectral signatures. This has been an area of great discussion and debate in the spectral science community. This paper discusses a method that independently validates two different aspects of a spectral signature to arrive at a final qualitative assessment: the textual meta-data and the numerical spectral data. Results associated with the spectral data stored in the Signature Database1 (SigDB) are proposed. The numerical data comprising a sample material's spectrum is validated based on statistical properties derived from an ideal population set. The quality of the test spectrum is ranked based on a spectral angle mapper (SAM) comparison to the mean spectrum derived from the population set. Additionally, the contextual data of a test spectrum is qualitatively analyzed using lexical analysis text mining. This technique analyzes the syntax of the meta-data to reveal local patterns and trends within the spectral data that are indicative of the test spectrum's quality. Text mining applications have successfully been implemented for security2 (text encryption/decryption), biomedical3, and marketing4 applications. The text mining lexical analysis algorithm is trained on the meta-data patterns of a subset of high and low quality spectra, in order to have a model to apply to the entire SigDB data set. The statistical and textual methods combine to assess the quality of a test spectrum existing in a database without the need of an expert user. This method has been compared to other validation methods accepted by the spectral science community, and has provided promising results when a baseline spectral signature is present for comparison. The spectral validation method proposed is described from a practical application and analytical perspective.
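
    A small sketch of the numerical half of the validation described above, assuming NumPy: it forms the mean spectrum of a reference population and scores a test spectrum by its spectral angle mapper (SAM) distance to that mean. The array shapes, stand-in population data and acceptance threshold are illustrative assumptions rather than SigDB specifics.

        # Sketch: score a test spectrum against a population mean with the
        # spectral angle mapper (SAM). Smaller angles indicate closer agreement.
        import numpy as np

        def spectral_angle(a, b):
            """Angle (radians) between two spectra treated as vectors."""
            cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

        # Invented stand-in data: 50 reference spectra sampled at 200 wavelengths.
        rng = np.random.default_rng(0)
        population = rng.normal(loc=1.0, scale=0.05, size=(50, 200))
        mean_spectrum = population.mean(axis=0)

        test_spectrum = mean_spectrum + rng.normal(scale=0.02, size=200)

        angle = spectral_angle(test_spectrum, mean_spectrum)
        print(f"SAM angle to population mean: {angle:.4f} rad")
        # An assumed, database-specific threshold would decide pass/fail.
        print("within tolerance" if angle < 0.05 else "flag for review")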

  11. Medical Language Processing for Knowledge Representation and Retrievals

    PubMed Central

    Lyman, Margaret; Sager, Naomi; Chi, Emile C.; Tick, Leo J.; Nhan, Ngo Thanh; Su, Yun; Borst, Francois; Scherrer, Jean-Raoul

    1989-01-01

    The Linguistic String Project-Medical Language Processor, a system for computer analysis of narrative patient documents in English, is being adapted for French Lettres de Sortie. The system converts the free-text input to a semantic representation which is then mapped into a relational database. Retrievals of clinical data from the database are described.

  12. Fast neutron mutants database and web displays at SoyBase

    USDA-ARS?s Scientific Manuscript database

    SoyBase, the USDA-ARS soybean genetics and genomics database, has been expanded to include data for the fast neutron mutants produced by Bolon, Vance, et al. In addition to the expected text and sequence homology searches and visualization of the indels in the context of the genome sequence viewer, ...

  13. Extending the Online Public Access Catalog into the Microcomputer Environment.

    ERIC Educational Resources Information Center

    Sutton, Brett

    1990-01-01

    Describes PCBIS, a database program for MS-DOS microcomputers that features a utility for automatically converting online public access catalog search results stored as text files into structured database files that can be searched, sorted, edited, and printed. Topics covered include the general features of the program, record structure, record…

  14. Semi-Automated Annotation of Biobank Data Using Standard Medical Terminologies in a Graph Database.

    PubMed

    Hofer, Philipp; Neururer, Sabrina; Goebel, Georg

    2016-01-01

    Data describing biobank resources frequently contains unstructured free-text information or insufficient coding standards. (Bio-) medical ontologies like Orphanet Rare Diseases Ontology (ORDO) or the Human Disease Ontology (DOID) provide a high number of concepts, synonyms and entity relationship properties. Such standard terminologies increase quality and granularity of input data by adding comprehensive semantic background knowledge from validated entity relationships. Moreover, cross-references between terminology concepts facilitate data integration across databases using different coding standards. In order to encourage the use of standard terminologies, our aim is to identify and link relevant concepts with free-text diagnosis inputs within a biobank registry. Relevant concepts are selected automatically by lexical matching and SPARQL queries against an RDF triplestore. To ensure correctness of annotations, proposed concepts have to be confirmed by medical data administration experts before they are entered into the registry database. Relevant (bio-) medical terminologies describing diseases and phenotypes were identified and stored in a graph database which was tied to a local biobank registry. Concept recommendations during data input trigger a structured description of medical data and facilitate data linkage between heterogeneous systems.
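
    A rough sketch of the concept-lookup step described above, assuming the terminologies have already been loaded into a local SPARQL endpoint (the endpoint URL and graph layout are placeholders) and that the SPARQLWrapper package is installed. It proposes ontology concepts whose rdfs:label contains a free-text diagnosis string, leaving confirmation to a human expert as in the paper.

        # Sketch: propose terminology concepts for a free-text diagnosis by
        # lexical matching against rdfs:label in an RDF triplestore.
        from SPARQLWrapper import SPARQLWrapper, JSON

        ENDPOINT = "http://localhost:3030/terminologies/query"  # assumed endpoint

        def propose_concepts(free_text, limit=10):
            sparql = SPARQLWrapper(ENDPOINT)
            sparql.setReturnFormat(JSON)
            sparql.setQuery(f"""
                PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                SELECT ?concept ?label WHERE {{
                    ?concept rdfs:label ?label .
                    FILTER CONTAINS(LCASE(STR(?label)), LCASE("{free_text}"))
                }} LIMIT {limit}
            """)
            rows = sparql.query().convert()["results"]["bindings"]
            return [(r["concept"]["value"], r["label"]["value"]) for r in rows]

        # Candidate concepts are only suggestions; an expert confirms them
        # before they are entered into the registry database.
        for uri, label in propose_concepts("fabry disease"):
            print(uri, "-", label)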

  15. Medical data mining: knowledge discovery in a clinical data warehouse.

    PubMed Central

    Prather, J. C.; Lobach, D. F.; Goodwin, L. K.; Hales, J. W.; Hage, M. L.; Hammond, W. E.

    1997-01-01

    Clinical databases have accumulated large quantities of information about patients and their medical conditions. Relationships and patterns within this data could provide new medical knowledge. Unfortunately, few methodologies have been developed and applied to discover this hidden knowledge. In this study, the techniques of data mining (also known as Knowledge Discovery in Databases) were used to search for relationships in a large clinical database. Specifically, data accumulated on 3,902 obstetrical patients were evaluated for factors potentially contributing to preterm birth using exploratory factor analysis. Three factors were identified by the investigators for further exploration. This paper describes the processes involved in mining a clinical database including data warehousing, data query and cleaning, and data analysis. PMID:9357597

  16. Nosql for Storage and Retrieval of Large LIDAR Data Collections

    NASA Astrophysics Data System (ADS)

    Boehm, J.; Liu, K.

    2015-08-01

    Developments in LiDAR technology over the past decades have made LiDAR a mature and widely accepted source of geospatial information. This in turn has led to an enormous growth in data volume. The central idea for a file-centric storage of LiDAR point clouds is the observation that large collections of LiDAR data are typically delivered as large collections of files, rather than single files of terabyte size. This split of the dataset, commonly referred to as tiling, was usually done to accommodate a specific processing pipeline. It therefore makes sense to preserve this split. A document-oriented NoSQL database can easily emulate this data partitioning by representing each tile (file) in a separate document. The document stores the metadata of the tile. The actual files are stored in a distributed file system emulated by the NoSQL database. We demonstrate the use of MongoDB, a highly scalable document-oriented NoSQL database, for storing large LiDAR files. MongoDB, like any NoSQL database, allows queries on the attributes of the document. As a specialty, MongoDB also supports spatial queries. Hence we can perform spatial queries on the bounding boxes of the LiDAR tiles. Insertion and retrieval speeds for files on a cloud-based database are compared to transfer speeds for a native file system and cloud storage.
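
    A condensed sketch of the tile-per-document idea discussed above, assuming pymongo and a running MongoDB instance; the field names and example footprints are placeholders, not the authors' schema. Each LiDAR tile gets one metadata document with a GeoJSON footprint, a 2dsphere index enables the spatial query, and the file payload itself could be kept in MongoDB's GridFS (omitted here).

        # Sketch: store per-tile LiDAR metadata in MongoDB and run a spatial
        # query over tile footprints. Assumes a local MongoDB instance.
        from pymongo import MongoClient, GEOSPHERE

        client = MongoClient("mongodb://localhost:27017")
        tiles = client["lidar"]["tiles"]
        tiles.create_index([("footprint", GEOSPHERE)])

        # One document per delivered tile (file); footprint is a GeoJSON polygon.
        tiles.insert_one({
            "filename": "survey_tile_0001.las",
            "point_count": 12_500_000,
            "acquired": "2014-06-01",
            "footprint": {
                "type": "Polygon",
                "coordinates": [[[-0.14, 51.50], [-0.13, 51.50],
                                 [-0.13, 51.51], [-0.14, 51.51],
                                 [-0.14, 51.50]]],
            },
        })

        # Find every tile whose footprint intersects an area of interest.
        aoi = {"type": "Polygon",
               "coordinates": [[[-0.135, 51.500], [-0.130, 51.500],
                                [-0.130, 51.505], [-0.135, 51.505],
                                [-0.135, 51.500]]]}
        query = {"footprint": {"$geoIntersects": {"$geometry": aoi}}}
        for tile in tiles.find(query):
            print(tile["filename"], tile["point_count"])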

  17. What have we learned in minimally invasive colorectal surgery from NSQIP and NIS large databases? A systematic review.

    PubMed

    Batista Rodríguez, Gabriela; Balla, Andrea; Corradetti, Santiago; Martinez, Carmen; Hernández, Pilar; Bollo, Jesús; Targarona, Eduard M

    2018-06-01

    "Big data" refers to large amount of dataset. Those large databases are useful in many areas, including healthcare. The American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) and the National Inpatient Sample (NIS) are big databases that were developed in the USA in order to record surgical outcomes. The aim of the present systematic review is to evaluate the type and clinical impact of the information retrieved through NISQP and NIS big database articles focused on laparoscopic colorectal surgery. A systematic review was conducted using The Meta-Analysis Of Observational Studies in Epidemiology (MOOSE) guidelines. The research was carried out on PubMed database and revealed 350 published papers. Outcomes of articles in which laparoscopic colorectal surgery was the primary aim were analyzed. Fifty-five studies, published between 2007 and February 2017, were included. Articles included were categorized in groups according to the main topic as: outcomes related to surgical technique comparisons, morbidity and perioperatory results, specific disease-related outcomes, sociodemographic disparities, and academic training impact. NSQIP and NIS databases are just the tip of the iceberg for the potential application of Big Data technology and analysis in MIS. Information obtained through big data is useful and could be considered as external validation in those situations where a significant evidence-based medicine exists; also, those databases establish benchmarks to measure the quality of patient care. Data retrieved helps to inform decision-making and improve healthcare delivery.

  18. Development of Human Face Literature Database Using Text Mining Approach: Phase I.

    PubMed

    Kaur, Paramjit; Krishan, Kewal; Sharma, Suresh K

    2018-06-01

    The face is an important part of the human body by which an individual communicates in society. Its importance can be highlighted by the fact that a person deprived of a face cannot sustain in the living world. The number of experiments being performed and the number of research papers being published in the domain of the human face have surged in the past few decades. Scientific disciplines conducting research on the human face include Medical Science, Anthropology, Information Technology (Biometrics, Robotics, Artificial Intelligence, etc.), Psychology, Forensic Science, and Neuroscience. This highlights the need to collect and manage data concerning the human face so that free public access to it can be provided to the scientific community. This can be attained by developing databases and tools on the human face using a bioinformatics approach. The current research emphasizes creating a database of literature data on the human face. The database can be accessed on the basis of specific keywords, journal name, date of publication, author's name, etc. The collected research papers will be stored in the form of a database. Hence, the database will be beneficial to the research community, as comprehensive information dedicated to the human face can be found in one place. Information related to facial morphologic features, facial disorders, facial asymmetry, facial abnormalities, and many other parameters can be extracted from this database. The front end has been developed using Hyper Text Mark-up Language and Cascading Style Sheets. The back end has been developed using the hypertext preprocessor (PHP). JavaScript is used as the scripting language. MySQL (Structured Query Language) is used for database development, as it is the most widely used relational database management system. XAMPP (X (cross platform), Apache, MySQL, PHP, Perl) open-source web application software has been used as the server. The database is still in the developmental phase, and the current paper describes the initial steps of its creation and the work done to date.

  19. Keeping Track of Our Treasures: Managing Historical Data with Relational Database Software.

    ERIC Educational Resources Information Center

    Gutmann, Myron P.; And Others

    1989-01-01

    Describes the way a relational database management system manages a large historical data collection project. Shows that such databases are practical to construct. States that the programming tasks involved are not for beginners, but the rewards of having data organized are worthwhile. (GG)

  20. Incorporating Aquatic Interspecies Toxicity Estimates into Large Databases: Model Evaluations and Data Gains

    EPA Science Inventory

    The Chemical Aquatic Fate and Effects (CAFE) database, developed by NOAA’s Emergency Response Division (ERD), is a centralized data repository that allows for unrestricted access to fate and effects data. While this database was originally designed to help support decisions...

  1. Validation study in four health-care databases: upper gastrointestinal bleeding misclassification affects precision but not magnitude of drug-related upper gastrointestinal bleeding risk.

    PubMed

    Valkhoff, Vera E; Coloma, Preciosa M; Masclee, Gwen M C; Gini, Rosa; Innocenti, Francesco; Lapi, Francesco; Molokhia, Mariam; Mosseveld, Mees; Nielsson, Malene Schou; Schuemie, Martijn; Thiessard, Frantz; van der Lei, Johan; Sturkenboom, Miriam C J M; Trifirò, Gianluca

    2014-08-01

    To evaluate the accuracy of disease codes and free text in identifying upper gastrointestinal bleeding (UGIB) from electronic health-care records (EHRs). We conducted a validation study in four European electronic health-care record (EHR) databases, namely Integrated Primary Care Information (IPCI), Health Search/CSD Patient Database (HSD), ARS, and Aarhus, in which we identified UGIB cases using free text or disease codes: (1) International Classification of Disease (ICD)-9 (HSD, ARS); (2) ICD-10 (Aarhus); and (3) International Classification of Primary Care (ICPC) (IPCI). From each database, we randomly selected and manually reviewed 200 cases to calculate positive predictive values (PPVs). We employed different case definitions to assess the effect of outcome misclassification on estimation of risk of drug-related UGIB. PPV was 22% [95% confidence interval (CI): 16, 28] and 21% (95% CI: 16, 28) in IPCI for free text and ICPC codes, respectively. PPV was 91% (95% CI: 86, 95) for ICD-9 codes and 47% (95% CI: 35, 59) for free text in HSD. PPV for ICD-9 codes in ARS was 72% (95% CI: 65, 78) and 77% (95% CI: 69, 83) for ICD-10 codes (Aarhus). More specific definitions did not have significant impact on risk estimation of drug-related UGIB, except for wider CIs. ICD-9-CM and ICD-10 disease codes have good PPV in identifying UGIB from EHR; less granular terminology (ICPC) may require additional strategies. Use of more specific UGIB definitions affects precision, but not magnitude, of risk estimates. Copyright © 2014 Elsevier Inc. All rights reserved.

  2. A Node Linkage Approach for Sequential Pattern Mining

    PubMed Central

    Navarro, Osvaldo; Cumplido, René; Villaseñor-Pineda, Luis; Feregrino-Uribe, Claudia; Carrasco-Ochoa, Jesús Ariel

    2014-01-01

    Sequential Pattern Mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, examining purchase behavior, and text mining, among others. Nevertheless, with the dramatic increase in data volume, the current approaches prove inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. In this paper, we propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover sequential patterns. Unlike most pattern growth algorithms, our approach does not build a data structure to represent the input dataset, but instead accesses the required sequences through pseudo-projection databases, achieving better runtime and reducing memory requirements. Our algorithm traverses the search space in a depth-first fashion and only preserves in memory a pattern node linkage and the pseudo-projections required for the branch being explored at the time. Experimental results show that our new approach, the Node Linkage Depth-First Traversal algorithm (NLDFT), has better performance and scalability in comparison with state of the art algorithms. PMID:24933123
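
    The pseudo-projection idea described above can be illustrated with a compact pattern-growth routine in Python. This is a generic PrefixSpan-style sketch under simplifying assumptions (sequences of single symbols, in-memory data), not the NLDFT algorithm itself: each projected database is represented only by (sequence index, offset) pairs pointing into the original sequences, and the search proceeds depth-first.

        # Sketch: pattern-growth sequential pattern mining with pseudo-projections.
        # Each projection is a (sequence_id, offset) pair into the original data,
        # so no projected copies of the dataset are materialised.
        def prefixspan(sequences, min_support):
            results = []

            def grow(prefix, projections):
                # Count, once per sequence, the symbols that can extend the prefix.
                counts = {}
                for seq_id, offset in projections:
                    for symbol in set(sequences[seq_id][offset:]):
                        counts[symbol] = counts.get(symbol, 0) + 1
                for symbol, count in counts.items():
                    if count < min_support:
                        continue
                    pattern = prefix + [symbol]
                    results.append((pattern, count))
                    # Pseudo-project: advance each sequence past its first match.
                    new_proj = []
                    for seq_id, offset in projections:
                        seq = sequences[seq_id]
                        for i in range(offset, len(seq)):
                            if seq[i] == symbol:
                                new_proj.append((seq_id, i + 1))
                                break
                    grow(pattern, new_proj)

            grow([], [(i, 0) for i in range(len(sequences))])
            return results

        data = [["a", "b", "c", "b"], ["a", "c", "b"], ["b", "a", "c"]]
        for pattern, support in prefixspan(data, min_support=2):
            print(pattern, support)

    Because the traversal is depth-first, only the pattern currently being grown and the pseudo-projections for its branch need to be held in memory, which is the property the abstract highlights.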

  3. Document similarity measures and document browsing

    NASA Astrophysics Data System (ADS)

    Ahmadullin, Ildus; Fan, Jian; Damera-Venkata, Niranjan; Lim, Suk Hwan; Lin, Qian; Liu, Jerry; Liu, Sam; O'Brien-Strain, Eamonn; Allebach, Jan

    2011-03-01

    Managing large document databases is an important task today. Being able to automatically compare document layouts and classify and search documents with respect to their visual appearance proves to be desirable in many applications. We measure single page documents' similarity with respect to distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution, and distances between different documents' components are calculated as probabilistic similarities between corresponding distributions. The similarity measure between documents is represented as a weighted sum of the components' distances. Using this document similarity measure, we propose a browsing mechanism operating on a document dataset. For these purposes, we use a hierarchical browsing environment which we call the document similarity pyramid. It allows the user to browse a large document dataset and to search for documents in the dataset that are similar to the query. The user can browse the dataset on different levels of the pyramid, and zoom into the documents that are of interest.

  4. FERN Ethnomedicinal Plant Database: Exploring Fern Ethnomedicinal Plants Knowledge for Computational Drug Discovery.

    PubMed

    Thakar, Sambhaji B; Ghorpade, Pradnya N; Kale, Manisha V; Sonawane, Kailas D

    2015-01-01

    Fern plants are known for their ethnomedicinal applications. A huge amount of information on medicinal fern plants is scattered in the form of text. Hence, database development is an appropriate endeavor to cope with the situation. Given the importance of medicinally useful fern plants, we developed a web-based database which contains information about several groups of ferns, their medicinal uses, chemical constituents, and protein/enzyme sequences isolated from different fern plants. The Fern ethnomedicinal plant database is an all-embracing, content-managed, web-based database system used to retrieve a collection of factual knowledge related to ethnomedicinal fern species. Most of the protein/enzyme sequences have been extracted from the NCBI Protein sequence database. The fern species, family name, identification, taxonomy ID from NCBI, geographical occurrence, trial for, plant parts used, ethnomedicinal importance, and morphological characteristics were collected from scientific literature and journals available in text form. Links to NCBI's BLAST, InterPro, phylogeny, and Clustal W web resources have also been provided for future comparative studies, so users can get information related to fern plants and their medicinal applications in one place. This Fern ethnomedicinal plant database includes information on 100 medicinal fern species. This web-based database would be advantageous for deriving information for computational drug discovery, and for botanists and persons interested in botany, pharmacologists, researchers, biochemists, plant biotechnologists, ayurvedic practitioners, doctors/pharmacists, traditional medicine users, farmers, agricultural students and teachers from universities and colleges, and finally fern plant lovers. This effort should provide users with essential knowledge about applications for drug discovery, the conservation of fern species around the world, and finally help to create social awareness.

  5. Architecture for biomedical multimedia information delivery on the World Wide Web

    NASA Astrophysics Data System (ADS)

    Long, L. Rodney; Goh, Gin-Hua; Neve, Leif; Thoma, George R.

    1997-10-01

    Research engineers at the National Library of Medicine are building a prototype system for the delivery of multimedia biomedical information on the World Wide Web. This paper discusses the architecture and design considerations for the system, which will be used initially to make images and text from the third National Health and Nutrition Examination Survey (NHANES) publicly available. We categorized our analysis as follows: (1) fundamental software tools: we analyzed trade-offs among use of conventional HTML/CGI, X Window Broadway, and Java; (2) image delivery: we examined the use of unconventional TCP transmission methods; (3) database manager and database design: we discuss the capabilities and planned use of the Informix object-relational database manager and the planned schema for the NHANES database; (4) storage requirements for our Sun server; (5) user interface considerations; (6) the compatibility of the system with other standard research and analysis tools; (7) image display: we discuss considerations for consistent image display for end users. Finally, we discuss the scalability of the system in terms of incorporating larger or more databases of similar data, and the extendibility of the system for supporting content-based retrieval of biomedical images. The system prototype is called the Web-based Medical Information Retrieval System. An early version was built as a Java applet and tested on Unix, PC, and Macintosh platforms. This prototype used the MiniSQL database manager to do text queries on a small database of records of participants in the second NHANES survey. The full records and associated x-ray images were retrievable and displayable on a standard Web browser. A second version has now been built, also a Java applet, using the MySQL database manager.

  6. High Performance Descriptive Semantic Analysis of Semantic Graph Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joslyn, Cliff A.; Adolf, Robert D.; al-Saffar, Sinan

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to understand their inherent semantic structure, whether codified in explicit ontologies or not. Our group is researching novel methods for what we call descriptive semantic analysis of RDF triplestores, to serve purposes of analysis, interpretation, visualization, and optimization. But data size and computational complexity make it increasingly necessary to bring high performance computational resources to bear on this task. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multi-threaded architecture of the Cray XMT platform, conventional servers, and large data stores. In this paper we describe that architecture and our methods, and present the results of our analyses of basic properties, connected components, namespace interaction, and typed paths for the Billion Triple Challenge 2010 dataset.
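
    The following toy sketch illustrates what two of the descriptive statistics mentioned above (namespace usage and connected components) can look like when computed with commodity tools; it uses rdflib and networkx on a tiny in-memory graph and is in no way a stand-in for the Cray XMT implementation described in the record.

        from collections import Counter
        import networkx as nx
        from rdflib import Graph, URIRef

        ttl = """
        @prefix ex:   <http://example.org/> .
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        ex:alice foaf:knows ex:bob .
        ex:bob   foaf:knows ex:carol .
        ex:dave  foaf:name  "Dave" .
        """

        def namespace(term):
            # Crude namespace split at '#' or the last '/'.
            s = str(term)
            return s.rsplit("#", 1)[0] + "#" if "#" in s else s.rsplit("/", 1)[0] + "/"

        g = Graph()
        g.parse(data=ttl, format="turtle")

        # Descriptive statistics: predicate-namespace usage and connected components
        # of the undirected node graph induced by URI-to-URI triples.
        ns_counts = Counter(namespace(p) for _, p, _ in g)
        ug = nx.Graph((s, o) for s, _, o in g if isinstance(o, URIRef))

        print("namespace usage:", dict(ns_counts))
        print("connected components:", [sorted(map(str, c)) for c in nx.connected_components(ug)])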

  7. GRBase, a new gene regulation data base available by anonymous ftp.

    PubMed Central

    Collier, B; Danielsen, M

    1994-01-01

    The Gene Regulation Database (GRBase) is a compendium of information on the structure and function of proteins involved in the control of gene expression in eukaryotes. These proteins include transcription factors, proteins involved in signal transduction, and receptors. The database can be obtained by FTP in Filemaker Pro, text, and postscript formats. The database will be expanded in the coming year to include reviews on families of proteins involved in gene regulation and to allow online searching. PMID:7937071

  8. Staradmin -- Starlink User Database Maintainer

    NASA Astrophysics Data System (ADS)

    Fish, Adrian

    The subject of this SSN is a utility called STARADMIN. This utility allows the system administrator to build and maintain a Starlink User Database (UDB). The principal source of information for each user is a text file, named after their username. The content of each file is a list consisting of one keyword followed by the relevant user data per line. These user database files reside in a single directory. The STARADMIN program is used to manipulate these user data files and automatically generate user summary lists.
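
    A small illustrative parser for the kind of per-user keyword/value text files the note describes; the exact keyword set and file layout are assumptions, since the SSN's format specification is not reproduced here.

        from pathlib import Path

        def read_user_file(path):
            """Parse one user file: each non-empty line is assumed to be 'KEYWORD value...'."""
            record = {}
            for line in Path(path).read_text().splitlines():
                if line.strip():
                    keyword, _, value = line.partition(" ")
                    record[keyword.upper()] = value.strip()
            return record

        def read_user_database(directory):
            """Build {username: record} from a directory of per-user files named after usernames."""
            return {p.name: read_user_file(p) for p in Path(directory).iterdir() if p.is_file()}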

  9. Creating a VAPEPS database: A VAPEPS tutorial

    NASA Technical Reports Server (NTRS)

    Graves, George

    1989-01-01

    A procedural method is outlined for creating a Vibroacoustic Payload Environment Prediction System (VAPEPS) Database. The method of presentation employs flowcharts of sequential VAPEPS Commands used to create a VAPEPS Database. The commands are accompanied by explanatory text to the right of the command in order to minimize the need for repetitive reference to the VAPEPS user's manual. The method is demonstrated by examples of varying complexity. It is assumed that the reader has acquired a basic knowledge of the VAPEPS software program.

  10. [Applications and advantages of a multimedia system for autopsies].

    PubMed

    Gualco, M; Benzi, D; Fulcheri, E

    2001-10-01

    This work evaluates the benefits and applications of computers and multimedia systems in post-mortem examination practice and, in particular, in the definition of data collection protocols. We examined issues concerning the different aims of autopsy (e.g. diagnostic, scientific, educational, legal), and found that the pathologist's main duty is to acquire a large amount of data in the best possible way. However, despite the will to carry out detailed post-mortem examinations, many pathologic anatomy services face objective difficulties in doing so, especially due to understaffing, lack of time and high costs. The Institute for Pathologic Anatomy of the University of Genoa has developed software for data handling and for outcome reporting, a particularly important aspect in fetal-perinatal diagnosis. The system consists of a relational database in a client-server environment (Fourth Dimension) with two integrated parts. The first part, with unrestricted access, contains patients' personal data, including gender, age, time and date of death, hospital department of origin, person and department requiring the post-mortem examination, hour and time of autopsy, pathologist's name, and clinical diagnosis of death. Using a scanner, a copy of the autopsy application is also filed, together with the patient's medical file and any diagnostic images useful to document the case history. The second part of the information system is accessible by pathologists only, and contains the autopsy report. This part is organized to balance two different needs: it allows sufficient space and freedom for autopsy description while providing guidelines for presentation of the report. The structure of the conventional autopsy protocol has been maintained, with subdivisions for all the organs and apparatuses according to topographic criteria. Before this part, a section is dedicated to external cadaver examination and anthropometric data; weight, shape, volume and texture are described for each organ, together with external and cut-surface features. A third section allows the examiner to report other observations not requested previously, while a final section is also provided for the epicrisis and for the formulation of the final diagnosis, the same as that reported in the first form. The database is coupled with an interactive system for collecting voice comments, thereby replacing the need for tape-recorders in the autopsy room. The user can recall a dictation window, dictate a text, check spelling and insert additional text. The database is also coupled to an image acquisition system, on the assumption that moving images allow a more faithful documentation of reality. Therefore, all rooms in which autopsies are carried out on fetuses or neonates have been equipped with a fixed camera linked to a monitor and a video-recorder. A PCB, used for image digitalization, recognizes up to 16,000,000 different colors. Guided by dedicated software, image files are transferred to a computer and then saved with the autoptic report. The database can be consulted and queried in two principal ways: by key words in the contents or main disease descriptions, or by individual words or phrases contained within the complete text of the reports. The present database system for autopsy reporting has proved itself useful in a pathological anatomy service. The combined presence of images and texts renders the system useful also as a research tool.
By linking to a Web site dedicated to pathologic anatomy, it will be possible to display online rare cases involving diagnostic difficulties. The system offers great advantages for present and retrospective diagnostics, as well as for research and education purposes.

  11. Routine health insurance data for scientific research: potential and limitations of the Agis Health Database.

    PubMed

    Smeets, Hugo M; de Wit, Niek J; Hoes, Arno W

    2011-04-01

    Observational studies performed within routine health care databases have the advantage of their large size and, when the aim is to assess the effect of interventions, can offer a complement to randomized controlled trials, which usually rely on small samples drawn from experimental settings. Institutional Health Insurance Databases (HIDs) are attractive for research because of their large size, their longitudinal perspective, and their practice-based information. As they are based on financial reimbursement, the information is generally reliable. The database of one of the major insurance companies in the Netherlands, the Agis Health Database (AHD), is described in detail. Whether the AHD data sets meet the specific requirements to conduct several types of clinical studies is discussed according to the classification of the four different types of clinical research; that is, diagnostic, etiologic, prognostic, and intervention research. The potential of the AHD for these various types of research is illustrated using examples of studies recently conducted in the AHD. HIDs such as the AHD offer large potential for several types of clinical research, in particular etiologic and intervention studies, but at present the lack of detailed clinical information is an important limitation. Copyright © 2011 Elsevier Inc. All rights reserved.

  12. Image-based query-by-example for big databases of galaxy images

    NASA Astrophysics Data System (ADS)

    Shamir, Lior; Kuminski, Evan

    2017-01-01

    Very large astronomical databases containing millions or even billions of galaxy images have become increasingly important tools in astronomy research. However, in many cases the very large size makes it more difficult to analyze these data manually, reinforcing the need for computer algorithms that can automate the data analysis process. An example of such a task is the identification of galaxies of a certain morphology of interest. For instance, if a rare galaxy is identified it is reasonable to expect that more galaxies of similar morphology exist in the database, but it is virtually impossible to manually search these databases to identify such galaxies. Here we describe computer vision and pattern recognition methodology that receives a galaxy image as input and automatically searches a large dataset of galaxies to return a list of galaxies that are visually similar to the query galaxy. The returned list is not necessarily complete or clean, but it provides a substantial reduction of the original database into a smaller dataset, in which the frequency of objects visually similar to the query galaxy is much higher. Experimental results show that the algorithm can identify rare galaxies such as ring galaxies among datasets of 10,000 astronomical objects.
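
    A rough sketch of the query-by-example idea: extract a fixed-length feature vector from each galaxy image and rank the database by distance to the query. The histogram/radial-profile features below are stand-ins chosen for brevity, not the descriptors used in the paper.

        import numpy as np

        def features(image):
            """Crude morphology descriptor: normalized intensity histogram plus a
            simple radial brightness profile (a stand-in for the paper's features)."""
            hist, _ = np.histogram(image, bins=32, range=(0.0, 1.0), density=True)
            cy, cx = np.array(image.shape) / 2.0
            yy, xx = np.indices(image.shape)
            r = np.hypot(yy - cy, xx - cx).astype(int)
            radial = np.bincount(r.ravel(), weights=image.ravel()) / np.maximum(np.bincount(r.ravel()), 1)
            return np.concatenate([hist, radial[:32]])

        def query_by_example(query_img, database_imgs, k=10):
            """Return indices of the k database images most similar to the query."""
            q = features(query_img)
            d = np.array([np.linalg.norm(q - features(img)) for img in database_imgs])
            return np.argsort(d)[:k]

        # Toy usage with random "images"; real inputs would be galaxy cutouts.
        rng = np.random.default_rng(1)
        db = [rng.random((64, 64)) for _ in range(100)]
        print(query_by_example(db[0], db, k=5))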

  13. Selecting Full-Text Undergraduate Periodicals Databases.

    ERIC Educational Resources Information Center

    Still, Julie M.; Kassabian, Vibiana

    1999-01-01

    Examines how libraries and librarians can compare full-text general periodical indices, using ProQuest Direct, Periodical Abstracts (via Ovid), and EBSCOhost as examples. Explores breadth and depth of coverage; manipulation of results (email/download/print); ease of use (searching); and indexing quirks. (AEF)

  14. DBMap: a TreeMap-based framework for data navigation and visualization of brain research registry

    NASA Astrophysics Data System (ADS)

    Zhang, Ming; Zhang, Hong; Tjandra, Donny; Wong, Stephen T. C.

    2003-05-01

    The purpose of this study is to investigate and apply a new, intuitive and space-conscious visualization framework to facilitate efficient data presentation and exploration of large-scale data warehouses. We have implemented the DBMap framework for the UCSF Brain Research Registry. Such a novel utility would help medical specialists and clinical researchers better explore and evaluate the many attributes organized in the brain research registry. The current UCSF Brain Research Registry consists of a federation of disease-oriented database modules, including Epilepsy, Brain Tumor, Intracerebral Hemorrhage, and CJD (Creutzfeldt-Jakob disease). These database modules organize large volumes of imaging and non-imaging data to support Web-based clinical research. While the data warehouse supports general information retrieval and analysis, there is no effective way to visualize and present the voluminous and complex data stored. This study investigates whether the TreeMap algorithm can be adapted to display and navigate a categorical biomedical data warehouse or registry. TreeMap is a space-constrained graphical representation of large hierarchical data sets, mapped to a matrix of rectangles whose size and color represent database fields of interest. It allows the display of a large amount of numerical and categorical information in the limited real estate of a computer screen with an intuitive user interface. The paper describes DBMap, the proposed new data visualization framework for large biomedical databases. Built upon XML, Java and JDBC technologies, the prototype system includes a set of software modules that reside in the application server tier and provide interfaces to the backend database tier and the front-end Web tier of the brain registry.
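
    For illustration, a compact slice-and-dice TreeMap layout of a toy disease-module hierarchy; the registry contents and sizes are hypothetical, and DBMap may well use a different TreeMap variant (squarified layouts, for example, are common).

        def slice_and_dice(items, x, y, w, h, vertical=True):
            """Basic slice-and-dice TreeMap: split the rectangle (x, y, w, h)
            in proportion to each item's size, alternating split direction.

            items: list of (label, size) or (label, [children...]) tuples.
            Returns a flat list of (label, x, y, w, h) rectangles for the leaves.
            """
            def size(node):
                _, payload = node
                return sum(size(c) for c in payload) if isinstance(payload, list) else payload

            total = float(sum(size(n) for n in items)) or 1.0
            rects, offset = [], 0.0
            for label, payload in items:
                frac = size((label, payload)) / total
                if vertical:
                    rect = (x + offset * w, y, frac * w, h)
                else:
                    rect = (x, y + offset * h, w, frac * h)
                if isinstance(payload, list):
                    rects += slice_and_dice(payload, *rect, vertical=not vertical)
                else:
                    rects.append((label, *rect))
                offset += frac
            return rects

        # Toy registry: disease modules sized by hypothetical record counts.
        registry = [("Epilepsy", 120), ("Brain Tumor", [("Glioma", 40), ("Meningioma", 25)]),
                    ("ICH", 60), ("CJD", 5)]
        for r in slice_and_dice(registry, 0, 0, 1.0, 1.0):
            print(r)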

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    The system is developed to collect, process, store and present the information provided by radio frequency identification (RFID) devices. The system contains three parts: the application software, the database and the web page. The application software manages multiple RFID devices, such as readers and portals, simultaneously. It communicates with the devices through the application programming interface (API) provided by the device vendor. The application software converts data collected by the RFID readers and portals to readable information. It is capable of encrypting data using the 256-bit Advanced Encryption Standard (AES). The application software has a graphical user interface (GUI). The GUI mimics the configurations of the nuclear material storage sites or transport vehicles. The GUI gives the user and system administrator an intuitive way to read the information and/or configure the devices. The application software is capable of sending the information to a remote, dedicated and secured web and database server. Two captured screen samples, one for storage and one for transport, are attached. The database is constructed to handle a large number of RFID tag readers and portals. A SQL server is employed for this purpose. An XML script is used to update the database once the information is sent from the application software. The design of the web page imitates the design of the application software. The web page retrieves data from the database and presents it in different panels. The user needs a user name combined with a password to access the web page. The web page is capable of sending e-mail and text messages based on preset criteria, such as when alarm thresholds are exceeded. A captured screen sample is attached. The application software is designed to be installed on a local computer. The local computer is directly connected to the RFID devices and can be controlled locally or remotely. There are multiple local computers managing different sites or transport vehicles. Control from remote sites and transmission of information to the central database server take place over a secured internet connection. The information stored in the central database server is shown on the web page. The users can view the web page on the internet. A dedicated and secured web and database server (https) is used to provide information security.
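
    A minimal sketch of the storage side of such a system: one table of tag-read events and a helper that files each read. SQLite stands in here for the SQL server mentioned in the record, and the schema and field names are assumptions.

        import sqlite3
        from datetime import datetime, timezone

        # Stand-in schema for a reader-event table (hypothetical field names).
        conn = sqlite3.connect(":memory:")
        conn.execute("""CREATE TABLE rfid_events (
                            tag_id     TEXT,
                            reader_id  TEXT,
                            site       TEXT,
                            read_time  TEXT,
                            alarm      INTEGER)""")

        def record_read(tag_id, reader_id, site, alarm=False):
            """Store one tag read; a real system would also encrypt and forward it."""
            conn.execute("INSERT INTO rfid_events VALUES (?, ?, ?, ?, ?)",
                         (tag_id, reader_id, site,
                          datetime.now(timezone.utc).isoformat(), int(alarm)))
            conn.commit()

        record_read("TAG-0001", "PORTAL-3", "storage-site-A", alarm=False)
        print(conn.execute("SELECT * FROM rfid_events").fetchall())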

  16. NREL Opens Large Database of Inorganic Thin-Film Materials | News | NREL

    Science.gov Websites

    April 3, 2018. An extensive experimental database of inorganic thin-film materials from the National Renewable Energy Laboratory (NREL), the High Throughput Experimental Materials (HTEM) database, is now publicly available.

  17. Active Exploration of Large 3D Model Repositories.

    PubMed

    Gao, Lin; Cao, Yan-Pei; Lai, Yu-Kun; Huang, Hao-Zhi; Kobbelt, Leif; Hu, Shi-Min

    2015-12-01

    With broader availability of large-scale 3D model repositories, the need for efficient and effective exploration becomes more and more urgent. Existing model retrieval techniques do not scale well with the size of the database since often a large number of very similar objects are returned for a query, and the possibilities to refine the search are quite limited. We propose an interactive approach where the user feeds an active learning procedure by labeling either entire models or parts of them as "like" or "dislike" such that the system can automatically update an active set of recommended models. To provide an intuitive user interface, candidate models are presented based on their estimated relevance for the current query. From the methodological point of view, our main contribution is to exploit not only the similarity between a query and the database models but also the similarities among the database models themselves. We achieve this by an offline pre-processing stage, where global and local shape descriptors are computed for each model and a sparse distance metric is derived that can be evaluated efficiently even for very large databases. We demonstrate the effectiveness of our method by interactively exploring a repository containing over 100 K models.

  18. Vehicle-triggered video compression/decompression for fast and efficient searching in large video databases

    NASA Astrophysics Data System (ADS)

    Bulan, Orhan; Bernal, Edgar A.; Loce, Robert P.; Wu, Wencheng

    2013-03-01

    Video cameras are widely deployed along city streets, interstate highways, traffic lights, stop signs and toll booths by entities that perform traffic monitoring and law enforcement. The videos captured by these cameras are typically compressed and stored in large databases. Performing a rapid search for a specific vehicle within a large database of compressed videos is often required and can be a time-critical, even life-or-death, task. In this paper, we propose video compression and decompression algorithms that enable fast and efficient vehicle or, more generally, event searches in large video databases. The proposed algorithm selects reference frames (i.e., I-frames) based on a vehicle having been detected at a specified position within the scene being monitored while compressing a video sequence. A search for a specific vehicle in the compressed video stream is performed across the reference frames only, which does not require decompression of the full video sequence as in traditional search algorithms. Our experimental results on videos captured on a local road show that the proposed algorithm significantly reduces the search space (thus reducing time and computational resources) in vehicle search tasks within compressed video streams, particularly those captured in light traffic volume conditions.
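
    The core idea, detection-triggered reference frames that can later be searched without decoding the rest of the stream, can be sketched as follows; the detector and query functions are placeholders, and real encoding details (GOP structure, codec integration) are omitted.

        def compress(frames, vehicle_at_trigger):
            """Toy model of the encoding policy: a frame becomes a reference (I-) frame
            when a vehicle is detected at the trigger position, otherwise it stays a
            dependent (P-) frame.  Returns the list of reference-frame indices."""
            return [i for i, frame in enumerate(frames) if vehicle_at_trigger(frame)]

        def search(frames, reference_indices, matches_query):
            """Search only the reference frames, so the bulk of the stream
            never has to be decoded."""
            return [i for i in reference_indices if matches_query(frames[i])]

        # Toy usage with stand-in detectors (real ones would analyse pixels).
        frames = list(range(1000))                      # placeholder "frames"
        refs = compress(frames, lambda f: f % 50 == 0)  # pretend a vehicle triggers every 50th frame
        print(search(frames, refs, lambda f: f in (200, 650)))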

  19. Inferring Higher Functional Information for RIKEN Mouse Full-Length cDNA Clones With FACTS

    PubMed Central

    Nagashima, Takeshi; Silva, Diego G.; Petrovsky, Nikolai; Socha, Luis A.; Suzuki, Harukazu; Saito, Rintaro; Kasukawa, Takeya; Kurochkin, Igor V.; Konagaya, Akihiko; Schönbach, Christian

    2003-01-01

    FACTS (Functional Association/Annotation of cDNA Clones from Text/Sequence Sources) is a semiautomated knowledge discovery and annotation system that integrates molecular function information derived from sequence analysis results (sequence inferred) with functional information extracted from text. Text-inferred information was extracted from keyword-based retrievals of MEDLINE abstracts and by matching of gene or protein names to OMIM, BIND, and DIP database entries. Using FACTS, we found that 47.5% of the 60,770 RIKEN mouse cDNA FANTOM2 clone annotations were informative for text searches. MEDLINE queries yielded molecular interaction-containing sentences for 23.1% of the clones. When disease MeSH and GO terms were matched with retrieved abstracts, 22.7% of clones were associated with potential diseases, and 32.5% with GO identifiers. A significant number (23.5%) of disease MeSH-associated clones were also found to have a hereditary disease association (OMIM Morbidmap). Inferred neoplastic and nervous system disease represented 49.6% and 36.0% of disease MeSH-associated clones, respectively. A comparison of sequence-based GO assignments with informative text-based GO assignments revealed that for 78.2% of clones, identical GO assignments were provided for that clone by either method, whereas for 21.8% of clones, the assignments differed. In contrast, for OMIM assignments, only 28.5% of clones had identical sequence-based and text-based OMIM assignments. Sequence, sentence, and term-based functional associations are included in the FACTS database (http://facts.gsc.riken.go.jp/), which permits results to be annotated and explored through web-accessible keyword and sequence search interfaces. The FACTS database will be a critical tool for investigating the functional complexity of the mouse transcriptome, cDNA-inferred interactome (molecular interactions), and pathome (pathologies). PMID:12819151

  20. Desktop Access to Full-Text NACA and NASA Reports: Systems Developed by NASA Langley Technical Library

    NASA Technical Reports Server (NTRS)

    Ambur, Manjula Y.; Adams, David L.; Trinidad, P. Paul

    1997-01-01

    NASA Langley Technical Library has been involved in developing systems for full-text information delivery of NACA/NASA technical reports since 1991. This paper will describe the two prototypes it has developed and the present production system configuration. The prototype systems are a NACA CD-ROM of thirty-three classic paper NACA reports and a network-based Full-text Electronic Reports Documents System (FEDS) constructed from both paper and electronic formats of NACA and NASA reports. The production system is the DigiDoc System (DIGItal Documents) presently being developed based on the experiences gained from the two prototypes. The DigiDoc configuration integrates the on-line catalog database, a World Wide Web interface, and PDF technology to provide a powerful and flexible search and retrieval system. The paper describes in detail significant achievements and lessons learned in terms of data conversion, storage technologies, full-text searching and retrieval, and image databases. The conclusions from the experiences of digitization and full-text access and future plans for DigiDoc system implementation are discussed.

  1. Crystallography Open Databases and Preservation: a World-wide Initiative

    NASA Astrophysics Data System (ADS)

    Chateigner, Daniel

    In 2003, an international team of crystallographers proposed the Crystallography Open Database (COD), a fully free collection of crystal structure data, with the aim of ensuring their preservation. With nearly 250000 entries, this database represents a large open set of data for crystallographers, academics and industry, is located at five different places world-wide, and is included in Thomson-Reuters' ISI. As a large step towards data preservation, raw data can now be uploaded along with «digested» structure files, and COD can be queried from most crystallography-related industrial software. The COD initiative has also motivated several other open developments.

  2. Iris indexing based on local intensity order pattern

    NASA Astrophysics Data System (ADS)

    Emerich, Simina; Malutan, Raul; Crisan, Septimiu; Lefkovits, Laszlo

    2017-03-01

    In recent years, iris biometric systems have increased in popularity and have been shown to be capable of handling large-scale databases. The main advantages of these systems are accuracy and reliability. Proper iris pattern classification is expected to reduce the matching time in huge databases. This paper presents an iris indexing technique based on the Local Intensity Order Pattern. The performance of the present approach is evaluated on the UPOL database and is compared with other recent systems designed for iris indexing. The results illustrate the potential of the proposed method for large-scale iris identification.

  3. Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

    PubMed Central

    Rodríguez-Penagos, Carlos; Salgado, Heladia; Martínez-Flores, Irma; Collado-Vides, Julio

    2007-01-01

    Background Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12. Results Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners. Conclusion Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages. PMID:17683642
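
    As an illustration of the rule-based flavor of such systems, here is a single hand-written pattern that pulls (regulator, action, target) triples out of sentences; the real pipeline uses many such rules plus gene-name dictionaries and normalization, so this pattern is only a toy assumption.

        import re

        # One illustrative extraction rule of the kind such systems chain together.
        PATTERN = re.compile(
            r"(?P<regulator>[A-Z][\w-]+)\s+(?P<action>activates|represses|regulates)\s+"
            r"(the\s+)?(?P<target>[a-zA-Z][\w-]+)(\s+(gene|operon|promoter))?",
            re.IGNORECASE)

        def extract_interactions(sentences):
            """Return (regulator, action, target) triples found by the pattern."""
            triples = []
            for s in sentences:
                for m in PATTERN.finditer(s):
                    triples.append((m.group("regulator"), m.group("action").lower(), m.group("target")))
            return triples

        abstract = ["CRP activates the araBAD operon under low glucose.",
                    "LexA represses recA following DNA damage."]
        print(extract_interactions(abstract))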

  4. OpenTrials: towards a collaborative open database of all available information on all clinical trials.

    PubMed

    Goldacre, Ben; Gray, Jonathan

    2016-04-08

    OpenTrials is a collaborative and open database for all available structured data and documents on all clinical trials, threaded together by individual trial. With a versatile and expandable data schema, it is initially designed to host and match the following documents and data for each trial: registry entries; links, abstracts, or texts of academic journal papers; portions of regulatory documents describing individual trials; structured data on methods and results extracted by systematic reviewers or other researchers; clinical study reports; and additional documents such as blank consent forms, blank case report forms, and protocols. The intention is to create an open, freely re-usable index of all such information and to increase discoverability, facilitate research, identify inconsistent data, enable audits on the availability and completeness of this information, support advocacy for better data and drive up standards around open data in evidence-based medicine. The project has phase I funding. This will allow us to create a practical data schema and populate the database initially through web-scraping, basic record linkage techniques, crowd-sourced curation around selected drug areas, and import of existing sources of structured data and documents. It will also allow us to create user-friendly web interfaces onto the data and conduct user engagement workshops to optimise the database and interface designs. Where other projects have set out to manually and perfectly curate a narrow range of information on a smaller number of trials, we aim to use a broader range of techniques and attempt to match a very large quantity of information on all trials. We are currently seeking feedback and additional sources of structured data.

  5. Gene Discovery in the Apicomplexa as Revealed by EST Sequencing and Assembly of a Comparative Gene Database

    PubMed Central

    Li, Li; Brunk, Brian P.; Kissinger, Jessica C.; Pape, Deana; Tang, Keliang; Cole, Robert H.; Martin, John; Wylie, Todd; Dante, Mike; Fogarty, Steven J.; Howe, Daniel K.; Liberator, Paul; Diaz, Carmen; Anderson, Jennifer; White, Michael; Jerome, Maria E.; Johnson, Emily A.; Radke, Jay A.; Stoeckert, Christian J.; Waterston, Robert H.; Clifton, Sandra W.; Roos, David S.; Sibley, L. David

    2003-01-01

    Large-scale EST sequencing projects for several important parasites within the phylum Apicomplexa were undertaken for the purpose of gene discovery. Included were several parasites of medical importance (Plasmodium falciparum, Toxoplasma gondii) and others of veterinary importance (Eimeria tenella, Sarcocystis neurona, and Neospora caninum). A total of 55,192 ESTs, deposited into dbEST/GenBank, were included in the analyses. The resulting sequences have been clustered into nonredundant gene assemblies and deposited into a relational database that supports a variety of sequence and text searches. This database has been used to compare the gene assemblies using BLAST similarity comparisons to the public protein databases to identify putative genes. Of these new entries, ∼15%–20% represent putative homologs with a conservative cutoff of p < 10^-9, thus identifying many conserved genes that are likely to share common functions with other well-studied organisms. Gene assemblies were also used to identify strain polymorphisms, examine stage-specific expression, and identify gene families. An interesting class of genes that are confined to members of this phylum and not shared by plants, animals, or fungi, was identified. These genes likely mediate the novel biological features of members of the Apicomplexa and hence offer great potential for biological investigation and as possible therapeutic targets. [The sequence data from this study have been submitted to the dbEST division of GenBank under accession numbers for Toxoplasma gondii, Plasmodium falciparum, Sarcocystis neurona, Eimeria tenella, and Neospora caninum.] PMID:12618375

  6. Publications of the Western Geologic Mapping Team 1997-1998

    USGS Publications Warehouse

    Stone, Paul; Powell, C.L.

    1999-01-01

    The Western Geologic Mapping Team (WGMT) of the U.S. Geological Survey, Geologic Division (USGS, GD), conducts geologic mapping and related topical earth-science studies in the western United States. This work is focused on areas where modern geologic maps and associated earth-science data are needed to address key societal and environmental issues such as ground-water quality, potential geologic hazards, and land-use decisions. Areas of primary emphasis currently include southern California, the San Francisco Bay region, the Pacific Northwest, the Las Vegas urban corridor, and selected National Park lands. The team has its headquarters in Menlo Park, California, and maintains smaller field offices at several other locations in the western United States. The results of research conducted by the WGMT are released to the public as a variety of databases, maps, text reports, and abstracts, both through the internal publication system of the USGS and in diverse external publications such as scientific journals and books. This report lists publications of the WGMT released in calendar years 1997 and 1998. Most of the publications listed were authored or coauthored by WGMT staff. However, the list also includes some publications authored by formal non-USGS cooperators with the WGMT, as well as some authored by USGS staff outside the WGMT in cooperation with WGMT projects. Several of the publications listed are available on the World Wide Web; for these, URL addresses are provided. Most of these Web publications are USGS open-file reports that contain large digital databases of geologic map and related information. For these, the bibliographic citation refers specifically to an explanatory pamphlet containing information about the content and accessibility of the database, not to the actual map or related information comprising the database itself.

  7. HMDB 4.0: the human metabolome database for 2018.

    PubMed

    Wishart, David S; Feunang, Yannick Djoumbou; Marcu, Ana; Guo, An Chi; Liang, Kevin; Vázquez-Fresno, Rosa; Sajed, Tanvir; Johnson, Daniel; Li, Carin; Karu, Naama; Sayeeda, Zinat; Lo, Elvis; Assempour, Nazanin; Berjanskii, Mark; Singhal, Sandeep; Arndt, David; Liang, Yonjie; Badran, Hasan; Grant, Jason; Serra-Cayuela, Arnau; Liu, Yifeng; Mandal, Rupa; Neveu, Vanessa; Pon, Allison; Knox, Craig; Wilson, Michael; Manach, Claudine; Scalbert, Augustin

    2018-01-04

    The Human Metabolome Database or HMDB (www.hmdb.ca) is a web-enabled metabolomic database containing comprehensive information about human metabolites along with their biological roles, physiological concentrations, disease associations, chemical reactions, metabolic pathways, and reference spectra. First described in 2007, the HMDB is now considered the standard metabolomic resource for human metabolic studies. Over the past decade the HMDB has continued to grow and evolve in response to emerging needs for metabolomics researchers and continuing changes in web standards. This year's update, HMDB 4.0, represents the most significant upgrade to the database in its history. For instance, the number of fully annotated metabolites has increased by nearly threefold, the number of experimental spectra has grown by almost fourfold and the number of illustrated metabolic pathways has grown by a factor of almost 60. Significant improvements have also been made to the HMDB's chemical taxonomy, chemical ontology, spectral viewing, and spectral/text searching tools. A great deal of brand new data has also been added to HMDB 4.0. This includes large quantities of predicted MS/MS and GC-MS reference spectral data as well as predicted (physiologically feasible) metabolite structures to facilitate novel metabolite identification. Additional information on metabolite-SNP interactions and the influence of drugs on metabolite levels (pharmacometabolomics) has also been added. Many other important improvements in the content, the interface, and the performance of the HMDB website have been made and these should greatly enhance its ease of use and its potential applications in nutrition, biochemistry, clinical chemistry, clinical genetics, medicine, and metabolomics science. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. A new method for the automatic retrieval of medical cases based on the RadLex ontology.

    PubMed

    Spanier, A B; Cohen, D; Joskowicz, L

    2017-03-01

    The goal of medical case-based image retrieval (M-CBIR) is to assist radiologists in the clinical decision-making process by finding medical cases in large archives that most resemble a given case. Cases are described by radiology reports comprising radiological images and textual information on the anatomy and pathology findings. The textual information, when available in standardized terminology, e.g., the RadLex ontology, and used in conjunction with the radiological images, provides a substantial advantage for M-CBIR systems. We present a new method for incorporating textual radiological findings from medical case reports in M-CBIR. The input is a database of medical cases, a query case, and the number of desired relevant cases. The output is an ordered list of the most relevant cases in the database. The method is based on a new case formulation, the Augmented RadLex Graph and an Anatomy-Pathology List. It uses a new case relatedness metric [Formula: see text] that prioritizes more specific medical terms in the RadLex tree over less specific ones and that incorporates the length of the query case. An experimental study on 8 CT queries from the 2015 VISCERAL 3D Case Retrieval Challenge database consisting of 1497 volumetric CT scans shows that our method has accuracy rates of 82% and 70% on the first 10 and 30 most relevant cases, respectively, thereby outperforming six other methods. The increasing amount of medical imaging data acquired in clinical practice constitutes a vast database of untapped diagnostically relevant information. This paper presents a new hybrid approach to retrieving the most relevant medical cases based on textual and image information.
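
    Since the abstract does not reproduce the relatedness formula, the sketch below is only a stand-in that captures the stated intuition: shared RadLex terms are weighted by their depth (specificity) in the ontology, and the score is normalized by the query length. The term depths and case contents are hypothetical.

        def case_relatedness(query_terms, case_terms, depth):
            """Toy stand-in for the paper's metric: shared RadLex terms weighted by
            ontology depth, normalized by the number of query terms."""
            shared = set(query_terms) & set(case_terms)
            return sum(depth.get(t, 1) for t in shared) / max(len(query_terms), 1)

        # Hypothetical depths: deeper (more specific) terms score higher.
        depth = {"mass": 3, "lung mass": 5, "right upper lobe mass": 7, "nodule": 4}
        query = ["right upper lobe mass", "nodule"]
        cases = {"case-17": ["lung mass", "nodule"], "case-42": ["right upper lobe mass"]}
        print({c: case_relatedness(query, t, depth) for c, t in cases.items()})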

  9. Big Data and Total Hip Arthroplasty: How Do Large Databases Compare?

    PubMed

    Bedard, Nicholas A; Pugely, Andrew J; McHugh, Michael A; Lux, Nathan R; Bozic, Kevin J; Callaghan, John J

    2018-01-01

    Use of large databases for orthopedic research has become extremely popular in recent years. Each database varies in the methods used to capture data and the population it represents. The purpose of this study was to evaluate how these databases differed in reported demographics, comorbidities, and postoperative complications for primary total hip arthroplasty (THA) patients. Primary THA patients were identified within the National Surgical Quality Improvement Program (NSQIP), the Nationwide Inpatient Sample (NIS), Medicare Standard Analytic Files (MED), and the Humana administrative claims database (HAC). NSQIP definitions for comorbidities and complications were matched to corresponding International Classification of Diseases, 9th Revision/Current Procedural Terminology codes to query the other databases. Demographics, comorbidities, and postoperative complications were compared. The number of patients from each database was 22,644 in HAC, 371,715 in MED, 188,779 in NIS, and 27,818 in NSQIP. Age and gender distribution were clinically similar. Overall, there was variation in prevalence of comorbidities and rates of postoperative complications between databases. As an example, NSQIP had more than twice the rate of obesity reported in NIS. HAC and MED had more than twice the rate of diabetes reported in NSQIP. Rates of deep infection and stroke 30 days after THA differed more than 2-fold across databases. Among databases commonly used in orthopedic research, there is considerable variation in complication rates following THA depending upon the database used for analysis. It is important to consider these differences when critically evaluating database research. Additionally, with the advent of bundled payments, these differences must be considered in risk adjustment models. Copyright © 2017 Elsevier Inc. All rights reserved.

  10. Differences in the Reporting of Racial and Socioeconomic Disparities among Three Large National Databases for Breast Reconstruction.

    PubMed

    Kamali, Parisa; Zettervall, Sara L; Wu, Winona; Ibrahim, Ahmed M S; Medin, Caroline; Rakhorst, Hinne A; Schermerhorn, Marc L; Lee, Bernard T; Lin, Samuel J

    2017-04-01

    Research derived from large-volume databases plays an increasing role in the development of clinical guidelines and health policy. In breast cancer research, the Surveillance, Epidemiology and End Results, National Surgical Quality Improvement Program, and Nationwide Inpatient Sample databases are widely used. This study aims to compare the trends in immediate breast reconstruction and identify the drawbacks and benefits of each database. Patients with invasive breast cancer and ductal carcinoma in situ were identified from each database (2005-2012). Trends of immediate breast reconstruction over time were evaluated. Patient demographics and comorbidities were compared. Subgroup analysis of immediate breast reconstruction use per race was conducted. Within the three databases, 1.2 million patients were studied. Immediate breast reconstruction in invasive breast cancer patients increased significantly over time in all databases. A similar significant upward trend was seen in ductal carcinoma in situ patients. Significant differences in immediate breast reconstruction rates were seen among races; and the disparity differed among the three databases. Rates of comorbidities were similar among the three databases. There has been a significant increase in immediate breast reconstruction; however, the extent of the reporting of overall immediate breast reconstruction rates and of racial disparities differs significantly among databases. The Nationwide Inpatient Sample and the National Surgical Quality Improvement Program report similar findings, with the Surveillance, Epidemiology and End Results database reporting results significantly lower in several categories. These findings suggest that use of the Surveillance, Epidemiology and End Results database may not be universally generalizable to the entire U.S.

  11. Mean velocity and turbulence measurements in a 90 deg curved duct with thin inlet boundary layer

    NASA Technical Reports Server (NTRS)

    Crawford, R. A.; Peters, C. E.; Steinhoff, J.; Hornkohl, J. O.; Nourinejad, J.; Ramachandran, K.

    1985-01-01

    The experimental database established by this investigation of the flow in a large rectangular turning duct is of benchmark quality. The experimental Reynolds numbers, Dean numbers and boundary layer characteristics are significantly different from previous benchmark curved-duct experimental parameters. This investigation extends the experimental database to higher Reynolds number and thinner entrance boundary layers. The 5% to 10% thick boundary layers, based on duct half-width, result in a large region of near-potential flow in the duct core surrounded by developing boundary layers with large crossflows. The turbulent entrance boundary layer case at Re_d = 328,000 provides an incompressible flowfield which approaches real turbine blade cascade characteristics. The results of this investigation provide a challenging benchmark database for computational fluid dynamics code development.

  12. The Odense University Pharmacoepidemiological Database (OPED)

    Cancer.gov

    The Odense University Pharmacoepidemiological Database is one of two large prescription registries in Denmark and covers a stable population that is representative of the Danish population as a whole.

  13. Evaluation of an open source tool for indexing and searching enterprise radiology and pathology reports

    NASA Astrophysics Data System (ADS)

    Kim, Woojin; Boonn, William

    2010-03-01

    Data mining of existing radiology and pathology reports within an enterprise health system can be used for clinical decision support, research, education, as well as operational analyses. In our health system, the database of radiology and pathology reports exceeds 13 million entries combined. We are building a web-based tool to allow search and data analysis of these combined databases using freely available and open source tools. This presentation will compare performance of an open source full-text indexing tool to MySQL's full-text indexing and searching and describe implementation procedures to incorporate these capabilities into a radiology-pathology search engine.
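
    For comparison, the two querying styles being evaluated might look roughly like this: a MySQL MATCH ... AGAINST full-text query versus an inverted-index engine, with SQLite's FTS5 module standing in here for the open-source indexing tool (the actual tool, schema, and column names in the presentation may differ).

        import sqlite3

        # MySQL-style full-text query (shown for comparison); it assumes a
        # FULLTEXT index on reports(report_text).
        MYSQL_QUERY = """
            SELECT report_id, report_text
            FROM reports
            WHERE MATCH(report_text) AGAINST ('pneumothorax chest tube' IN NATURAL LANGUAGE MODE)
        """

        # Inverted-index alternative, sketched with SQLite's FTS5 module
        # (included in most builds) standing in for a Lucene-style engine.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE VIRTUAL TABLE reports USING fts5(report_id, report_text)")
        conn.executemany("INSERT INTO reports VALUES (?, ?)",
                         [("1", "Small right pneumothorax; chest tube placed."),
                          ("2", "No acute cardiopulmonary abnormality.")])
        hits = conn.execute("SELECT report_id FROM reports WHERE reports MATCH ?",
                            ('pneumothorax AND "chest tube"',)).fetchall()
        print(hits)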

  14. Investigating trends in acoustics research from 1970-1999.

    PubMed

    Viator, J A; Pestorius, F M

    2001-05-01

    Text data mining is a burgeoning field in which new information is extracted from existing text databases. Computational methods are used to compare relationships between database elements to yield new information about the existing data. Text data mining software was used to determine research trends in acoustics for the years 1970, 1980, 1990, and 1999. Trends were indicated by the number of published articles in the categories of acoustics using the Journal of the Acoustical Society of America (JASA) as the article source. Research was classified using a method based on the Physics and Astronomy Classification Scheme (PACS). Research was further subdivided into world regions, including North and South America, Eastern and Western Europe, Asia, Africa, Middle East, and Australia/New Zealand. In order to gauge the use of JASA as an indicator of international acoustics research, three subjects, underwater sound, nonlinear acoustics, and bioacoustics, were further tracked in 1999, using all journals in the INSPEC database. Research trends indicated a shift in emphasis of certain areas, notably underwater sound, audition, and speech. JASA also showed steady growth, with increasing participation by non-US authors, from about 20% in 1970 to nearly 50% in 1999.

  15. HPIminer: A text mining system for building and visualizing human protein interaction networks and pathways.

    PubMed

    Subramani, Suresh; Kalpana, Raja; Monickaraj, Pankaj Moses; Natarajan, Jeyakumar

    2015-04-01

    The knowledge on protein-protein interactions (PPI) and their related pathways are equally important to understand the biological functions of the living cell. Such information on human proteins is highly desirable to understand the mechanism of several diseases such as cancer, diabetes, and Alzheimer's disease. Because much of that information is buried in biomedical literature, an automated text mining system for visualizing human PPI and pathways is highly desirable. In this paper, we present HPIminer, a text mining system for visualizing human protein interactions and pathways from biomedical literature. HPIminer extracts human PPI information and PPI pairs from biomedical literature, and visualize their associated interactions, networks and pathways using two curated databases HPRD and KEGG. To our knowledge, HPIminer is the first system to build interaction networks from literature as well as curated databases. Further, the new interactions mined only from literature and not reported earlier in databases are highlighted as new. A comparative study with other similar tools shows that the resultant network is more informative and provides additional information on interacting proteins and their associated networks. Copyright © 2015 Elsevier Inc. All rights reserved.

  16. Geologic map and map database of the Palo Alto 30' x 60' quadrangle, California

    USGS Publications Warehouse

    Brabb, E.E.; Jones, D.L.; Graymer, R.W.

    2000-01-01

    This digital map database, compiled from previously published and unpublished data, and new mapping by the authors, represents the general distribution of bedrock and surficial deposits in the mapped area. Together with the accompanying text file (pamf.ps, pamf.pdf, pamf.txt), it provides current information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The scale of the source maps limits the spatial resolution (scale) of the database to 1:62,500 or smaller.

  17. Geologic map and map database of western Sonoma, northernmost Marin, and southernmost Mendocino counties, California

    USGS Publications Warehouse

    Blake, M.C.; Graymer, R.W.; Stamski, R.E.

    2002-01-01

    This digital map database, compiled from previously published and unpublished data, and new mapping by the authors, represents the general distribution of bedrock and surficial deposits in the mapped area. Together with the accompanying text file (wsomf.ps, wsomf.pdf, wsomf.txt), it provides current information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The scale of the source maps limits the spatial resolution (scale) of the database to 1:62,500 or smaller.

  18. Using the TIGR gene index databases for biological discovery.

    PubMed

    Lee, Yuandan; Quackenbush, John

    2003-11-01

    The TIGR Gene Index web pages provide access to analyses of ESTs and gene sequences for nearly 60 species, as well as a number of resources derived from these. Each species-specific database is presented using a common format with a homepage. A variety of methods exist that allow users to search each species-specific database. Methods implemented currently include nucleotide or protein sequence queries using WU-BLAST, text-based searches using various sequence identifiers, searches by gene, tissue and library name, and searches using functional classes through Gene Ontology assignments. This protocol provides guidance for using the Gene Index Databases to extract information.

  19. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnette, Daniel W.

    eCo-PylotDB, written completely in Python, provides a script that parses incoming emails and prepares extracted data for submission to a database table. The script extracts the database server, the database table, the server password, and the server username all from the email address to which the email is sent. The database table is specified on the Subject line. Any text in the body of the email is extracted as user comments for the database table. Attached files are extracted as data files, with each file submitted to a specified table field but in separate rows of the targeted database table. Other information such as sender, date, time, and machine from which the email was sent is extracted and submitted to the database table as well. An email is sent back to the user specifying whether the data from the initial email was accepted or rejected by the database server. If rejected, the return email includes details as to why.
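
    A minimal sketch of the email-parsing step using Python's standard email module; the field names and example message are hypothetical, and credential extraction from the address, database submission, and the reply email are omitted.

        from email import message_from_string

        def parse_submission(raw_email: str):
            """Pull out the pieces this style of processing needs: the address the
            mail was sent to, a target table name (Subject line), the body as user
            comments, and any attached data files.  Field naming is hypothetical."""
            msg = message_from_string(raw_email)
            submission = {
                "sent_to": msg["To"],
                "table": (msg["Subject"] or "").strip(),
                "comments": "",
                "attachments": [],
            }
            for part in msg.walk():
                if part.get_content_maintype() == "multipart":
                    continue
                if part.get_filename():
                    submission["attachments"].append(
                        (part.get_filename(), part.get_payload(decode=True)))
                elif part.get_content_type() == "text/plain":
                    submission["comments"] += part.get_payload(decode=True).decode("utf-8", "replace")
            return submission

        # Toy usage with a made-up message.
        raw = "To: results.mydb@example.org\nSubject: run_results\n\nRun 42 looked clean.\n"
        print(parse_submission(raw))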

  20. Integrating Query of Relational and Textual Data in Clinical Databases: A Case Study

    PubMed Central

    Fisk, John M.; Mutalik, Pradeep; Levin, Forrest W.; Erdos, Joseph; Taylor, Caroline; Nadkarni, Prakash

    2003-01-01

    Objectives: The authors designed and implemented a clinical data mart composed of an integrated information retrieval (IR) and relational database management system (RDBMS). Design: Using commodity software, which supports interactive, attribute-centric text and relational searches, the mart houses 2.8 million documents that span a five-year period and supports basic IR features such as Boolean searches, stemming, and proximity and fuzzy searching. Measurements: Results are relevance-ranked using either “total documents per patient” or “report type weighting.” Results: Non-curated medical text has a significant degree of malformation with respect to spelling and punctuation, which creates difficulties for text indexing and searching. Presently, the IR facilities of RDBMS packages lack the features necessary to handle such malformed text adequately. Conclusion: A robust IR+RDBMS system can be developed, but it requires integrating RDBMSs with third-party IR software. RDBMS vendors need to make their IR offerings more accessible to non-programmers. PMID:12509355

  1. Social.Water - A crowdsourcing tool for environmental data acquisition

    USGS Publications Warehouse

    Fienen, Michael N.; Lowry, Christopher

    2012-01-01

    Remote telemetry has a long history of use for collection of environmental measurements. With the rise of mobile phones and SMS text-messaging capacity, many members of the general public carry communications equipment in their pockets at all times. Enabling the general public to provide environmental data through text messages has the potential both to provide additional data to scientific projects and also to raise awareness of the projects through participation. Hydrologic measurements – some of which can be made without training, involve a single measurement, and are often made in rural areas – are well-suited to text-message conveyance. Many other environmental measurements are similarly well-suited for this technology. Social.Water is a software package, written in Python, that collects, parses, and categorizes text messages sent to a dedicated phone number, updates a simple database, and posts both graphical results and the database on the Web. Social.Water was designed as the backend to the Crowdhydrology project and is written in an object-oriented design that makes customization and modification straightforward.

  2. Social.Water—A crowdsourcing tool for environmental data acquisition

    NASA Astrophysics Data System (ADS)

    Fienen, Michael N.; Lowry, Christopher S.

    2012-12-01

    Remote telemetry has a long history of use for collection of environmental measurements. With the rise of mobile phones and SMS text-messaging capacity, many members of the general public carry communications equipment in their pockets at all times. Enabling the general public to provide environmental data through text messages has the potential both to provide additional data to scientific projects and also to raise awareness of the projects through participation. Hydrologic measurements - some of which can be made without training, involve a single measurement, and are often made in rural areas - are well-suited to text-message conveyance. Many other environmental measurements are similarly well-suited for this technology. Social.Water is a software package, written in Python, that collects, parses, and categorizes text messages sent to a dedicated phone number, updates a simple database, and posts both graphical results and the database on the Web. Social.Water was designed as the backend to the Crowdhydrology project and is written in an object-oriented design that makes customization and modification straightforward.
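
    A toy version of the parse-and-store step: a regular expression pulls a station code and gauge reading out of an incoming text message and files it in a small table. The message format, station-code pattern, and schema are assumptions, not the actual Crowdhydrology conventions.

        import re
        import sqlite3

        # Hypothetical message format: a station code followed by a gauge reading,
        # e.g. "NY1001 4.25" -- the real message format may differ.
        MSG = re.compile(r"(?P<station>[A-Z]{2}\d{4})\D+(?P<stage_ft>\d+(\.\d+)?)")

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE readings (station TEXT, stage_ft REAL, sender TEXT, received TEXT)")

        def handle_sms(body, sender, received):
            """Parse one incoming text message and file it in the readings table;
            unparseable messages are returned for manual review."""
            m = MSG.search(body.upper())
            if not m:
                return body
            conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
                         (m.group("station"), float(m.group("stage_ft")), sender, received))
            conn.commit()

        handle_sms("ny1001 water level 4.25", "+15550123", "2012-06-01T14:32:00")
        print(conn.execute("SELECT * FROM readings").fetchall())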

  3. Content Is King: Databases Preserve the Collective Information of Science.

    PubMed

    Yates, John R

    2018-04-01

    Databases store sequence information experimentally gathered to create resources that further science. In the last 20 years databases have become critical components of fields like proteomics where they provide the basis for large-scale and high-throughput proteomic informatics. Amos Bairoch, winner of the Association of Biomolecular Resource Facilities Frederick Sanger Award, has created some of the important databases proteomic research depends upon for accurate interpretation of data.

  4. Use of a German longitudinal prescription database (LRx) in pharmacoepidemiology.

    PubMed

    Richter, Hartmut; Dombrowski, Silvia; Hamer, Hajo; Hadji, Peyman; Kostev, Karel

    2015-01-01

    Large epidemiological databases are often used to examine matters pertaining to drug utilization, health services, and drug safety. The major strength of such databases is that they include large sample sizes, which allow precise estimates to be made. The IMS® LRx database has in recent years been used as a data source for epidemiological research. The aim of this paper is to review a number of recent studies published with the aid of this database and compare these with the results of similar studies using independent data published in the literature. In spite of being somewhat limited to studies for which comparative independent results were available, it was possible to include a wide range of possible uses of the LRx database in a variety of therapeutic fields: prevalence/incidence rate determination (diabetes, epilepsy), persistence analyses (diabetes, osteoporosis), use of comedication (diabetes), drug utilization (G-CSF market) and treatment costs (diabetes, G-CSF market). In general, the results of the LRx studies were found to be clearly in line with previously published reports. In some cases, noticeable discrepancies between the LRx results and the literature data were found (e.g. prevalence in epilepsy, persistence in osteoporosis) and these were discussed and possible reasons presented. Overall, it was concluded that the IMS® LRx database forms a suitable database for pharmacoepidemiological studies.

  5. Rhinoplasty perioperative database using a personal digital assistant.

    PubMed

    Kotler, Howard S

    2004-01-01

    To construct a reliable, accurate, and easy-to-use handheld computer database that facilitates the point-of-care acquisition of perioperative text and image data specific to rhinoplasty. A user-modified database (Pendragon Forms [v.3.2]; Pendragon Software Corporation, Libertyville, Ill) and graphic image program (Tealpaint [v.4.87]; Tealpaint Software, San Rafael, Calif) were used to capture text and image data, respectively, on a Palm OS (v.4.11) handheld operating with 8 megabytes of memory. The handheld and desktop databases were maintained secure using PDASecure (v.2.0) and GoldSecure (v.3.0) (Trust Digital LLC, Fairfax, Va). The handheld data were then uploaded to a desktop database of either FileMaker Pro 5.0 (v.1) (FileMaker Inc, Santa Clara, Calif) or Microsoft Access 2000 (Microsoft Corp, Redmond, Wash). Patient data were collected from 15 patients undergoing rhinoplasty in a private practice outpatient ambulatory setting. Data integrity was assessed after 6 months' disk and hard drive storage. The handheld database was able to facilitate data collection and accurately record, transfer, and reliably maintain perioperative rhinoplasty data. Query capability allowed rapid search using a multitude of keyword search terms specific to the operative maneuvers performed in rhinoplasty. Handheld computer technology provides a method of reliably recording and storing perioperative rhinoplasty information. The handheld computer facilitates the reliable and accurate storage and query of perioperative data, assisting the retrospective review of one's own results and enhancement of surgical skills.

  6. Osteoporosis therapies: evidence from health-care databases and observational population studies.

    PubMed

    Silverman, Stuart L

    2010-11-01

    Osteoporosis is a well-recognized disease with severe consequences if left untreated. Randomized controlled trials are the most rigorous method for determining the efficacy and safety of therapies. Nevertheless, randomized controlled trials underrepresent the real-world patient population and are costly in both time and money. Modern technology has enabled researchers to use information gathered from large health-care or medical-claims databases to assess the practical utilization of available therapies in appropriate patients. Observational database studies lack randomization but, if carefully designed and successfully completed, can provide valuable information that complements results obtained from randomized controlled trials and extends our knowledge to real-world clinical patients. Randomized controlled trials comparing fracture outcomes among osteoporosis therapies are difficult to perform. In this regard, large observational database studies could be useful in identifying clinically important differences among therapeutic options. Database studies can also provide important information with regard to osteoporosis prevalence, health economics, and compliance and persistence with treatment. This article describes the strengths and limitations of both randomized controlled trials and observational database studies, discusses considerations for observational study design, and reviews a wealth of information generated by database studies in the field of osteoporosis.

  7. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks.

    PubMed

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance adoption of NLP in practice. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.
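
    The speed advantage of dictionary-based term recognition comes from simple lookups against a fixed vocabulary, roughly as sketched below. The term list, concept identifiers, and clinical note are invented and do not represent the NCBO Annotator's ontologies or matching logic.

      # Toy dictionary-based term recognizer; terms, codes and the note are invented examples.
      TERM_DICT = {
          "myocardial infarction": "C0027051",
          "obesity": "C0028754",
          "metformin": "C0025598",
      }

      def recognize_terms(note, term_dict=TERM_DICT):
          """Return (term, concept_id) pairs found by exact lookup in lower-cased text."""
          text = note.lower()
          return [(term, cid) for term, cid in term_dict.items() if term in text]

      print(recognize_terms("Pt with obesity, started on metformin after myocardial infarction."))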

  8. Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users

    PubMed Central

    Shatkay, Hagit; Pan, Fengxia; Rzhetsky, Andrey; Wilbur, W. John

    2008-01-01

    Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca PMID:18718948
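
    A minimal sketch of categorizing sentence fragments along one such dimension is shown below, using an off-the-shelf bag-of-words classifier. The labels, example sentences, and model choice are placeholders; they do not reproduce the features or classifiers evaluated in the paper.

      # One plausible setup for a single categorization dimension (e.g. evidence/methods/fact);
      # the training sentences and labels are invented for illustration only.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      train_sentences = [
          "Western blot analysis confirmed expression of the tagged protein.",
          "Cells were lysed and immunoprecipitated with anti-FLAG beads.",
          "p53 is a tumour suppressor protein.",
          "The receptor binds its ligand with nanomolar affinity.",
      ]
      train_labels = ["evidence", "methods", "fact", "fact"]

      classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
      classifier.fit(train_sentences, train_labels)
      print(classifier.predict(["Lysates were analysed by mass spectrometry."]))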

  9. Aging assessment of large electric motors in nuclear power plants

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villaran, M.; Subudhi, M.

    1996-03-01

    Large electric motors serve as the prime movers to drive high capacity pumps, fans, compressors, and generators in a variety of nuclear plant systems. This study examined the stressors that cause degradation and aging in large electric motors operating in various plant locations and environments. The operating history of these machines in nuclear plant service was studied by review and analysis of failure reports in the NPRDS and LER databases. This was supplemented by a review of motor designs, and their nuclear and balance of plant applications, in order to characterize the failure mechanisms that cause degradation, aging, and failure in large electric motors. A generic failure modes and effects analysis for large squirrel cage induction motors was performed to identify the degradation and aging mechanisms affecting various components of these large motors, the failure modes that result, and their effects upon the function of the motor. The effects of large motor failures upon the systems in which they are operating, and on the plant as a whole, were analyzed from failure reports in the databases. The effectiveness of the industry's large motor maintenance programs was assessed based upon the failure reports in the databases and reviews of plant maintenance procedures and programs.

  10. MET network in PubMed: a text-mined network visualization and curation system.

    PubMed

    Dai, Hong-Jie; Su, Chu-Hsien; Lai, Po-Ting; Huang, Ming-Siang; Jonnagaddala, Jitendra; Rose Jue, Toni; Rao, Shruti; Chou, Hui-Jou; Milacic, Marija; Singh, Onkar; Syed-Abdul, Shabbir; Hsu, Wen-Lian

    2016-01-01

    Metastasis is the dissemination of a cancer/tumor from one organ to another, and it is the most dangerous stage during cancer progression, causing more than 90% of cancer deaths. Improving the understanding of the complicated cellular mechanisms underlying metastasis requires investigations of the signaling pathways. To this end, we developed a METastasis (MET) network visualization and curation tool to assist metastasis researchers retrieve network information of interest while browsing through the large volume of studies in PubMed. MET can recognize relations among genes, cancers, tissues and organs of metastasis mentioned in the literature through text-mining techniques, and then produce a visualization of all mined relations in a metastasis network. To facilitate the curation process, MET is developed as a browser extension that allows curators to review and edit concepts and relations related to metastasis directly in PubMed. PubMed users can also view the metastatic networks integrated from the large collection of research papers directly through MET. For the BioCreative 2015 interactive track (IAT), a curation task was proposed to curate metastatic networks among PubMed abstracts. Six curators participated in the proposed task and a post-IAT task, curating 963 unique metastatic relations from 174 PubMed abstracts using MET.Database URL: http://btm.tmu.edu.tw/metastasisway. © The Author(s) 2016. Published by Oxford University Press.

  11. AutoFACT: An Automatic Functional Annotation and Classification Tool

    PubMed Central

    Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud

    2005-01-01

    Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857
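
    The "most informative description" step can be caricatured as follows: collect the best hit from each database and skip uninformative descriptions before assigning one. The hit data, database priority, and filter terms below are invented and are not AutoFACT's actual rules.

      # Caricature of picking an informative annotation from per-database best hits;
      # hits, priority order and the "uninformative" filter are invented examples.
      UNINFORMATIVE = ("hypothetical protein", "unknown", "unnamed protein product")

      hits = {
          "uniref90": ("hypothetical protein", 310.0),
          "kegg": ("pyruvate kinase", 295.0),
          "nr": ("unnamed protein product", 320.0),
      }

      def assign_annotation(hits, priority=("uniref90", "kegg", "nr")):
          """Return the first informative description following the database priority."""
          for db in priority:
              description, score = hits.get(db, ("", 0.0))
              if description and not any(t in description.lower() for t in UNINFORMATIVE):
                  return db, description, score
          return None, "unclassified", 0.0

      print(assign_annotation(hits))   # -> ('kegg', 'pyruvate kinase', 295.0)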

  12. Techniques for Efficiently Managing Large Geosciences Data Sets

    NASA Astrophysics Data System (ADS)

    Kruger, A.; Krajewski, W. F.; Bradley, A. A.; Smith, J. A.; Baeck, M. L.; Steiner, M.; Lawrence, R. E.; Ramamurthy, M. K.; Weber, J.; Delgreco, S. A.; Domaszczynski, P.; Seo, B.; Gunyon, C. A.

    2007-12-01

    We have developed techniques and software tools for efficiently managing large geosciences data sets. While the techniques were developed as part of an NSF-funded ITR project that focuses on making NEXRAD weather data and rainfall products available to hydrologists and other scientists, they are relevant to other geosciences disciplines that deal with large data sets. Metadata, relational databases, data compression, and networking are central to our methodology. Data and derived products are stored on file servers in a compressed format. URLs to, and metadata about, the data and derived products are managed in a PostgreSQL database. Virtually all access to the data and products is through this database. Geosciences data normally require a number of processing steps to transform the raw data into useful products: data quality assurance, coordinate transformations and georeferencing, applying calibration information, and many more. We have developed the concept of crawlers that manage this scientific workflow. Crawlers are unattended processes that run indefinitely, and at set intervals query the database for their next assignment. A database table functions as a roster for the crawlers. Crawlers perform well-defined tasks that are, except for perhaps sequencing, largely independent from other crawlers. Once a crawler is done with its current assignment, it updates the database roster table, and gets its next assignment by querying the database. We have developed a library that enables one to quickly add crawlers. The library provides hooks to external (i.e., C-language) compiled codes, so that developers can work and contribute independently. Processes called ingesters inject data into the system. The bulk of the data are from a real-time feed using UCAR/Unidata's IDD/LDM software. An exciting recent development is the establishment of a Unidata HYDRO feed that feeds value-added metadata over the IDD/LDM. Ingesters grab the metadata and populate the PostgreSQL tables. These and other concepts we have developed have enabled us to efficiently manage a 70 TB (and growing) weather radar data set.
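
    A minimal version of the crawler loop described above might look like the sketch below. The table and column names (crawler_roster, task_id, status) and the polling interval are assumptions for illustration, not the project's actual schema.

      # Sketch of an unattended crawler polling a roster table for its next assignment.
      import time
      import psycopg2

      def run_crawler(dsn, crawler_name, process_task, poll_seconds=60):
          con = psycopg2.connect(dsn)
          while True:
              with con, con.cursor() as cur:
                  cur.execute(
                      "SELECT task_id, payload FROM crawler_roster "
                      "WHERE crawler = %s AND status = 'pending' LIMIT 1",
                      (crawler_name,),
                  )
                  row = cur.fetchone()
                  if row is not None:
                      task_id, payload = row
                      process_task(payload)   # e.g. QC, georeferencing, calibration
                      cur.execute(
                          "UPDATE crawler_roster SET status = 'done' WHERE task_id = %s",
                          (task_id,),
                      )
              time.sleep(poll_seconds)        # idle until the next polling interval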

  13. BoreholeAR: A mobile tablet application for effective borehole database visualization using an augmented reality technology

    NASA Astrophysics Data System (ADS)

    Lee, Sangho; Suh, Jangwon; Park, Hyeong-Dong

    2015-03-01

    Boring logs are widely used in geological field studies since the data describe various attributes of underground and surface environments. However, it is difficult to manage multiple boring logs in the field, as the conventional management and visualization methods are not suitable for integrating and combining large data sets. We developed an iPad application to enable its user to search boring logs rapidly and visualize them using the augmented reality (AR) technique. For the development of the application, a standard borehole database appropriate for a mobile-based borehole database management system was designed. The application consists of three modules: an AR module, a map module, and a database module. The AR module superimposes borehole data on camera imagery as viewed by the user and provides intuitive visualization of borehole locations. The map module shows the locations of corresponding borehole data on a 2D map with additional map layers. The database module provides data management functions for large borehole databases for the other modules. A field survey was also carried out using more than 100,000 borehole records.

  14. FishTraits Database

    USGS Publications Warehouse

    Angermeier, Paul L.; Frimpong, Emmanuel A.

    2009-01-01

    The need for integrated and widely accessible sources of species traits data to facilitate studies of ecology, conservation, and management has motivated development of traits databases for various taxa. In spite of the increasing number of traits-based analyses of freshwater fishes in the United States, no consolidated database of traits of this group exists publicly, and much useful information on these species is documented only in obscure sources. The largely inaccessible and unconsolidated traits information makes large-scale analysis involving many fishes and/or traits particularly challenging. FishTraits is a database of >100 traits for 809 (731 native and 78 exotic) fish species found in freshwaters of the conterminous United States, including 37 native families and 145 native genera. The database contains information on four major categories of traits: (1) trophic ecology, (2) body size and reproductive ecology (life history), (3) habitat associations, and (4) salinity and temperature tolerances. Information on geographic distribution and conservation status is also included. Together, we refer to the traits, distribution, and conservation status information as attributes. Descriptions of attributes are available here. Many sources were consulted to compile attributes, including state and regional species accounts and other databases.

  15. Active browsing using similarity pyramids

    NASA Astrophysics Data System (ADS)

    Chen, Jau-Yuen; Bouman, Charles A.; Dalton, John C.

    1998-12-01

    In this paper, we describe a new approach to managing large image databases, which we call active browsing. Active browsing integrates relevance feedback into the browsing environment, so that users can modify the database's organization to suit the desired task. Our method is based on a similarity pyramid data structure, which hierarchically organizes the database, so that it can be efficiently browsed. At coarse levels, the similarity pyramid allows users to view the database as large clusters of similar images. Alternatively, users can 'zoom into' finer levels to view individual images. We discuss relevance feedback for the browsing process, and argue that it is fundamentally different from relevance feedback for more traditional search-by-query tasks. We propose two fundamental operations for active browsing: pruning and reorganization. Both of these operations depend on a user-defined relevance set, which represents the image or set of images desired by the user. We present statistical methods for accurately pruning the database, and we propose a new 'worm hole' distance metric for reorganizing the database, so that members of the relevance set are grouped together.
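
    The coarse-to-fine organization can be approximated with ordinary agglomerative clustering of image feature vectors, as in the sketch below; the paper's similarity pyramid construction and its 'worm hole' distance metric are more elaborate than this illustration.

      # Coarse and fine groupings of stand-in image feature vectors via hierarchical clustering.
      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster

      rng = np.random.default_rng(0)
      features = rng.random((200, 16))                      # 200 image feature vectors

      tree = linkage(features, method="average")
      coarse = fcluster(tree, t=4, criterion="maxclust")    # top level: a few large clusters
      fine = fcluster(tree, t=32, criterion="maxclust")     # "zoomed in": many small clusters
      print(len(set(coarse)), len(set(fine)))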

  16. The BioLexicon: a large-scale terminological resource for biomedical text mining

    PubMed Central

    2011-01-01

    Background Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. Results This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard. Conclusions The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring. PMID:21992002

  17. The BioLexicon: a large-scale terminological resource for biomedical text mining.

    PubMed

    Thompson, Paul; McNaught, John; Montemagni, Simonetta; Calzolari, Nicoletta; del Gratta, Riccardo; Lee, Vivian; Marchi, Simone; Monachini, Monica; Pezik, Piotr; Quochi, Valeria; Rupp, C J; Sasaki, Yutaka; Venturi, Giulia; Rebholz-Schuhmann, Dietrich; Ananiadou, Sophia

    2011-10-12

    Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard. The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.
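
    One simple way a text-mining pipeline can consume such a resource is to normalize written variants to a single lexicon entry, as in the invented miniature example below; the real BioLexicon is distributed as Lexical Markup Framework XML rather than as a lookup table.

      # Invented miniature variant-to-entry mapping, only to illustrate lexicon lookup.
      VARIANTS = {
          "p53": "BL:0001", "TP53": "BL:0001", "tumor protein 53": "BL:0001",
          "IL-2": "BL:0002", "interleukin 2": "BL:0002", "interleukin-2": "BL:0002",
      }

      def normalize(term):
          """Map a written variant to its canonical entry identifier, if known."""
          return VARIANTS.get(term.strip())

      print(normalize("interleukin 2"), normalize("TP53"))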

  18. Meta-analysis of the effect and safety of berberine in the treatment of type 2 diabetes mellitus, hyperlipemia and hypertension.

    PubMed

    Lan, Jiarong; Zhao, Yanyun; Dong, Feixia; Yan, Ziyou; Zheng, Wenjie; Fan, Jinping; Sun, Guoli

    2015-02-23

    Berberine, extracted from Coptis root and Phellodendron chinense, has been frequently used for the adjuvant treatment of type 2 diabetes mellitus, hyperlipidemia, and hypertension in China. Safety and efficacy studies in terms of evidence-based medical practice have become more prevalent in application to Chinese Herbal Medicine. It is necessary to assess the efficacy and safety of berberine in the treatment of type 2 diabetes mellitus, hyperlipidemia and hypertension by conducting a systematic review and meta-analysis of available clinical data. We searched the English databases PubMed, ScienceDirect, Cochrane Library and EMbase, and Chinese databases including the China Biomedical Literature Database (CBM), the Chinese Technology Journal Full-text Database, the Chinese Journal Full-text Database (CNKI), and the Wanfang Digital Periodical Full-text Database. Relevant studies were selected based on the inclusion and exclusion criteria. Meta-analysis was performed with RevMan 5.0 software after data extraction and assessment of study quality. Twenty-seven randomized controlled clinical trials with 2569 patients were included. There are seven subgroups in our meta-analysis: berberine versus placebo, or berberine with intensive lifestyle intervention versus intensive lifestyle intervention alone; berberine combined with an oral hypoglycaemic versus the hypoglycaemic alone; berberine versus an oral hypoglycaemic; berberine combined with oral lipid-lowering drugs versus lipid-lowering drugs alone; berberine versus oral lipid-lowering drugs; berberine combined with an oral hypotensor versus hypotensive medications; and berberine versus oral hypotensive medications. In the treatment of type 2 diabetes mellitus, we found that berberine with lifestyle intervention tended to lower the levels of FPG, PPG and HbA1c more than lifestyle intervention alone or placebo; the same held for berberine combined with oral hypoglycaemics compared with the same hypoglycaemics alone; but there was no statistically significant difference between berberine and oral hypoglycaemics. For the treatment of hyperlipidemia, berberine with lifestyle intervention was better than lifestyle intervention alone, and berberine with oral lipid-lowering drugs was better than lipid-lowering drugs alone, in reducing the levels of TC and LDL-C and raising the level of HDL-C. In the comparison between berberine and oral lipid-lowering drugs, there was no statistically significant difference in reducing the levels of TC and LDL-C, but berberine showed a better effect in lowering the level of TG and raising the level of HDL-C. In the treatment of hypertension, berberine with lifestyle intervention tended to lower blood pressure more than lifestyle intervention alone or placebo did; the same occurred when berberine combined with an oral hypotensor was compared with the same hypotensor. Notably, no serious adverse reaction was reported in the 27 trials. This study indicates that berberine has a comparable therapeutic effect on type 2 DM, hyperlipidemia and hypertension with no serious side effects. Considering its relatively low cost compared with other first-line medicines and treatments, berberine might be a good alternative for patients of low socioeconomic status to treat type 2 DM, hyperlipidemia and hypertension over a long time period. Due to the overall limited quality of the included studies, the therapeutic benefit of berberine can be substantiated only to a limited degree. Large controlled trials of better methodological quality, using standardized preparations, are expected to further quantify the therapeutic effect of berberine. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.

  19. Alternatives to relational database: comparison of NoSQL and XML approaches for clinical data storage.

    PubMed

    Lee, Ken Ka-Yin; Tang, Wai-Choi; Choi, Kup-Sze

    2013-04-01

    Clinical data are dynamic in nature, often arranged hierarchically and stored as free text and numbers. Effective management of clinical data and the transformation of the data into structured format for data analysis are therefore challenging issues in electronic health records development. Despite the popularity of relational databases, the scalability of the NoSQL database model and the document-centric data structure of XML databases appear to be promising features for effective clinical data management. In this paper, three database approaches--NoSQL, XML-enabled and native XML--are investigated to evaluate their suitability for structured clinical data. The database query performance is reported, together with our experience in the databases development. The results show that NoSQL database is the best choice for query speed, whereas XML databases are advantageous in terms of scalability, flexibility and extensibility, which are essential to cope with the characteristics of clinical data. While NoSQL and XML technologies are relatively new compared to the conventional relational database, both of them demonstrate potential to become a key database technology for clinical data management as the technology further advances. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
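
    To make the contrast concrete, the sketch below holds the same small, invented clinical fragment as a JSON-style document (as a document-oriented NoSQL store would) and as XML queried through an XPath expression; neither reflects the schemas or queries used in the study.

      # The same invented fragment as a nested document and as XML with an XPath-style query.
      import json
      import xml.etree.ElementTree as ET

      record_json = {"patient": {"id": "P001",
                                 "labs": [{"test": "HbA1c", "value": 7.2, "unit": "%"}]}}
      print(json.dumps(record_json["patient"]["labs"][0]))   # direct key access

      record_xml = ET.fromstring(
          "<patient id='P001'><labs><lab test='HbA1c' value='7.2' unit='%'/></labs></patient>"
      )
      print(record_xml.find("./labs/lab[@test='HbA1c']").get("value"))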

  20. PylotDB - A Database Management, Graphing, and Analysis Tool Written in Python

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnette, Daniel W.

    2012-01-04

    PylotDB, written completely in Python, provides a user interface (UI) with which to interact with, analyze, graph data from, and manage open source databases such as MySQL. The UI spares the user from needing in-depth knowledge of the database application programming interface (API). PylotDB allows the user to generate various kinds of plots from user-selected data; generate statistical information on text as well as numerical fields; back up and restore databases; compare database tables across different databases as well as across different servers; extract information from any field to create new fields; generate, edit, and delete databases, tables, and fields; generate CSV data or read it into a table; and perform similar operations. Since much of the database information is brought under the control of the Python language, PylotDB is not intended for huge databases, for which MySQL and Oracle, for example, are better suited. PylotDB is better suited for the smaller databases that might typically be needed in a small research group. PylotDB can also be used as a learning tool for database applications in general.

  1. Commercial Database Design vs. Library Terminology Comprehension: Why Do Students Print Abstracts Instead of Full-Text Articles?

    ERIC Educational Resources Information Center

    Imler, Bonnie; Eichelberger, Michelle

    2014-01-01

    When asked to print the full text of an article, many undergraduate college students print the abstract instead of the full text. This study seeks to determine the underlying cause(s) of this confusion. In this quantitative study, participants (n = 40) performed five usability tasks to assess ease of use and usefulness of five commercial library…

  2. The use of DRG for identifying clinical trials centers with high recruitment potential: a feasibility study.

    PubMed

    Aegerter, Philippe; Bendersky, Noelle; Tran, Thi-Chien; Ropers, Jacques; Taright, Namik; Chatellier, Gilles

    2014-01-01

    Recruitment of large samples of patients is crucial for the evidence level and efficacy of clinical trials (CTs). Clinical Trial Recruitment Support Systems (CTRSS) used to estimate patient recruitment are generally specific to Hospital Information Systems, and few have been evaluated on a large number of trials. Our aim was to assess, on a large number of CTs, the usefulness of commonly available data such as Diagnosis Related Groups (DRG) databases for estimating potential recruitment. We used the DRG database of a large French multicenter medical institution (1.2 million inpatient stays and 400 new trials each year). Eligibility criteria of protocols were broken down into atomic entities (diagnosis, procedures, treatments...) and then translated into codes and operators recorded in a standardized form. A program parsed the forms and generated requests on the DRG database. A large majority of selection criteria could be coded, and the final estimates of the numbers of eligible patients were close to the observed ones (median difference = 25). Such a system could be part of the feasibility evaluation and center selection process before the start of a clinical trial.
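
    In the same spirit, coded eligibility criteria can be turned into a count query over a DRG-style stay table, as in the sketch below; the table layout, codes, and criterion structure are invented for illustration and are not the authors' form syntax or the hospital's schema.

      # Invented stay table and criteria; counts stays matching diagnosis codes and an age limit.
      import sqlite3

      criteria = {"diagnosis": ["I21", "I22"],   # inclusion codes
                  "min_age": 18}

      con = sqlite3.connect(":memory:")
      con.execute("CREATE TABLE stays (stay_id INTEGER, diagnosis TEXT, age INTEGER)")
      con.executemany("INSERT INTO stays VALUES (?, ?, ?)",
                      [(1, "I21", 64), (2, "J18", 47), (3, "I22", 17), (4, "I21", 80)])

      placeholders = ",".join("?" * len(criteria["diagnosis"]))
      query = (f"SELECT COUNT(DISTINCT stay_id) FROM stays "
               f"WHERE diagnosis IN ({placeholders}) AND age >= ?")
      count = con.execute(query, (*criteria["diagnosis"], criteria["min_age"])).fetchone()[0]
      print(count)   # estimated number of potentially eligible stays -> 2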

  3. Efficient RNA structure comparison algorithms.

    PubMed

    Arslan, Abdullah N; Anandan, Jithendar; Fry, Eric; Monschke, Keith; Ganneboina, Nitin; Bowerman, Jason

    2017-12-01

    The recently proposed relative addressing-based ([Formula: see text]) RNA secondary structure representation has features that allow an RNA structure database to be stored in a suffix array. A fast substructure search algorithm has been proposed based on binary search on this suffix array. Using this substructure search algorithm, we present a fast algorithm that finds the largest common substructure of multiple given RNA structures in [Formula: see text] format. The multiple RNA structure comparison problem is NP-hard in its general formulation. We introduce a new problem for comparing multiple RNA structures with a stricter similarity definition and objective, and we propose an algorithm that solves this problem efficiently. We also develop another comparison algorithm that iteratively calls this algorithm to locate nonoverlapping large common substructures in the compared RNAs. With the resulting tools, we improved the RNASSAC website (linked from http://faculty.tamuc.edu/aarslan). This website now also includes two drawing tools: one specialized for preparing RNA substructures that can be used as input by the search tool, and another for automatically drawing the entire RNA structure from a given structure sequence.
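
    The suffix-array idea behind the substructure search can be illustrated generically on a dot-bracket string, as below; the paper's relative-addressing representation and its algorithms are more involved than this sketch.

      # Binary search for a substructure over the sorted suffixes of a dot-bracket string.
      structure = "((..((...))..))"
      suffix_array = sorted(range(len(structure)), key=lambda i: structure[i:])

      def find_substructure(pattern):
          """Return one start index of `pattern` in `structure`, or -1."""
          lo, hi = 0, len(suffix_array)
          while lo < hi:
              mid = (lo + hi) // 2
              prefix = structure[suffix_array[mid]:suffix_array[mid] + len(pattern)]
              if prefix < pattern:
                  lo = mid + 1
              else:
                  hi = mid
          if lo < len(suffix_array) and structure[suffix_array[lo]:].startswith(pattern):
              return suffix_array[lo]
          return -1

      print(find_substructure("((...))"))   # -> 4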

  4. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.

    PubMed

    Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan

    2015-10-01

    The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature, with more than 24 million citations. Data-mining of such voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with a focus on data visualization, they have limitations in speed, are rigid, and are not available as open source. We have developed an R package, pubmed.mineR, in which we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and links with other packages in Bioconductor and the Comprehensive R Archive Network (CRAN) in order to expand the user's capabilities for executing multifaceted approaches. Three case studies are presented, namely 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity', to illustrate the use of pubmed.mineR. The package generally runs fast, with small elapsed times on regular workstations, even for large corpora and compute-intensive functions. pubmed.mineR is available at http://cran.r-project.org/web/packages/pubmed.mineR.

  5. Identification of literary movements using complex networks to represent texts

    NASA Astrophysics Data System (ADS)

    Amancio, Diego Raphael; Oliveira, Osvaldo N., Jr.; da Fontoura Costa, Luciano

    2012-04-01

    The use of statistical methods to analyze large databases of text has been useful in unveiling patterns of human behavior and establishing historical links between cultures and languages. In this study, we identified literary movements by treating books published from 1590 to 1922 as complex networks, whose metrics were analyzed with multivariate techniques to generate six clusters of books. The latter correspond to time periods coinciding with relevant literary movements over the last five centuries. The most important factor contributing to the distinctions between different literary styles was the average shortest path length, in particular the asymmetry of its distribution. Furthermore, over time there has emerged a trend toward larger average shortest path lengths, which is correlated with increased syntactic complexity, and a more uniform use of the words reflected in a smaller power-law coefficient for the distribution of word frequency. Changes in literary style were also found to be driven by opposition to earlier writing styles, as revealed by the analysis performed with geometrical concepts. The approaches adopted here are generic and may be extended to analyze a number of features of languages and cultures.
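
    A toy version of the key metric is easy to compute on a word-adjacency network, as in the sketch below; the study's network construction and the full set of metrics analysed are considerably richer.

      # Word-adjacency graph for a snippet of text and its average shortest path length.
      import networkx as nx

      text = "it was the best of times it was the worst of times".split()

      graph = nx.Graph()
      graph.add_edges_from(zip(text, text[1:]))   # edge between consecutive words

      if nx.is_connected(graph):
          print(nx.average_shortest_path_length(graph))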

  6. On-site comparison of Quantum Hall Effect resistance standards of the CMI and the BIPM: ongoing key comparison BIPM.EM-K12

    NASA Astrophysics Data System (ADS)

    Gournay, Pierre; Rolland, Benjamin; Kučera, Jan; Vojáčková, Lucie

    2017-01-01

    An on-site comparison of the quantum Hall effect (QHE) resistance standards of the Czech Metrology Institute (CMI) and of the Bureau International des Poids et Mesures (BIPM) was made in April 2017. Measurements of a 100 Ω standard in terms of the conventional value of the von Klitzing constant, R_K-90, agreed to 6 parts in 10^10 with a relative combined standard uncertainty u_c = 25 × 10^-10. Measurements of 10000 Ω/100 Ω and 100 Ω/1 Ω ratios agreed to 11 parts in 10^10 with u_c = 22 × 10^-10 and to 33 parts in 10^10 with u_c = 32 × 10^-10, respectively. The influence of the current reversal time has been systematically investigated in order to take into account possible errors due to the Peltier effect, which may be particularly large in 1 Ω standards. Main text: To reach the main text of this paper, click on Final Report. Note that this text is that which appears in Appendix B of the BIPM key comparison database kcdb.bipm.org/. The final report has been peer-reviewed and approved for publication by the CCEM, according to the provisions of the CIPM Mutual Recognition Arrangement (CIPM MRA).

  7. Open source database of images DEIMOS: extension for large-scale subjective image quality assessment

    NASA Astrophysics Data System (ADS)

    Vítek, Stanislav

    2014-09-01

    DEIMOS (Database of Images: Open Source) is an open-source database of images and video sequences for testing, verification and comparison of various image and/or video processing techniques such as compression, reconstruction and enhancement. This paper describes an extension of the database that allows large-scale, web-based subjective image quality assessment to be performed. The extension implements both an administrative and a client interface. The proposed system is aimed mainly at mobile communication devices and takes advantage of HTML5 technology; this means that participants do not need to install any application, and the assessment can be performed using a web browser. The assessment campaign administrator can select images from the large database and then apply rules defined by various test procedure recommendations. The standard test procedures may be fully customized and saved as a template. Alternatively, the administrator can define a custom test, using images from the pool and other components, such as evaluation forms and ongoing questionnaires. The image sequence is delivered to the online client, e.g. a smartphone or tablet, either as a fully automated assessment sequence or with the viewer deciding on the timing of the assessment if required. Environmental data and viewing conditions (e.g. illumination, vibrations, GPS coordinates) may be collected and subsequently analyzed.

  8. A Brief Review of RNA–Protein Interaction Database Resources

    PubMed Central

    Yi, Ying; Zhao, Yue; Huang, Yan; Wang, Dong

    2017-01-01

    RNA–Protein interactions play critical roles in various biological processes. By collecting and analyzing the RNA–Protein interactions and binding sites obtained from experiments and predictions, RNA–Protein interaction databases have become an essential resource for exploring the transcriptional and post-transcriptional regulatory network. Here, we briefly review several widely used RNA–Protein interaction database resources developed in recent years to provide a guide to these databases. The content and major functions of each database are presented. These brief descriptions help users to quickly choose the database containing the information they are interested in. In short, these RNA–Protein interaction database resources are continually updated, and their current state reflects ongoing efforts to identify and analyze the large number of RNA–Protein interactions. PMID:29657278

  9. MimoSA: a system for minimotif annotation

    PubMed Central

    2010-01-01

    Background Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature. Results We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, Mimosa provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database. Conclusions MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context. PMID:20565705

  10. Remote visual analysis of large turbulence databases at multiple scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  11. Remote visual analysis of large turbulence databases at multiple scales

    DOE PAGES

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...

    2018-06-15

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
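
    A minimal one-dimensional analogue of the wavelet-based multi-resolution idea is sketched below with PyWavelets: decompose a signal, discard the finest detail band, and reconstruct a coarser version. The framework's compression of three-dimensional turbulence fields is, of course, far more sophisticated.

      # Multi-level wavelet decomposition and a coarse reconstruction of a 1-D test signal.
      import numpy as np
      import pywt

      signal = np.sin(np.linspace(0, 8 * np.pi, 1024)) + 0.1 * np.random.randn(1024)

      coeffs = pywt.wavedec(signal, "db4", level=4)   # approximation + 4 detail bands
      coeffs[-1] = np.zeros_like(coeffs[-1])          # drop the finest detail band
      coarse = pywt.waverec(coeffs, "db4")

      print(signal.shape, coarse.shape)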

  12. How does the size and shape of local populations in China compare to general anthropometric surveys currently used for product design?

    PubMed

    Daniell, Nathan; Fraysse, François; Paul, Gunther

    2012-01-01

    Anthropometry has long been used for a range of ergonomic applications & product design. Although products are often designed for specific cohorts, anthropometric data are typically sourced from large scale surveys representative of the general population. Additionally, few data are available for emerging markets like China and India. This study measured 80 Chinese males that were representative of a specific cohort targeted for the design of a new product. Thirteen anthropometric measurements were recorded and compared to two large databases that represented a general population, a Chinese database and a Western database. Substantial differences were identified between the Chinese males measured in this study and both databases. The subjects were substantially taller, heavier and broader than subjects in the older Chinese database. However, they were still substantially smaller, lighter and thinner than Western males. Data from current Western anthropometric surveys are unlikely to accurately represent the target population for product designers and manufacturers in emerging markets like China.

  13. Mars Global Digital Dune Database: MC2-MC29

    USGS Publications Warehouse

    Hayward, Rosalyn K.; Mullins, Kevin F.; Fenton, L.K.; Hare, T.M.; Titus, T.N.; Bourke, M.C.; Colaprete, Anthony; Christensen, P.R.

    2007-01-01

    The Mars Global Digital Dune Database presents data and describes the methodology used in creating the database. The database provides a comprehensive and quantitative view of the geographic distribution of moderate- to large-size dune fields from 65° N to 65° S latitude and encompasses ~ 550 dune fields. The database will be expanded to cover the entire planet in later versions. Although we have attempted to include all dune fields between 65° N and 65° S, some have likely been excluded for two reasons: 1) incomplete THEMIS IR (daytime) coverage may have caused us to exclude some moderate- to large-size dune fields or 2) the resolution of THEMIS IR coverage (100 m/pixel) certainly caused us to exclude smaller dune fields. The smallest dune fields in the database are ~ 1 km² in area. While the moderate to large dune fields are likely to constitute the largest compilation of sediment on the planet, smaller stores of dune sediment are likely to be found elsewhere via higher resolution data. Thus, it should be noted that our database excludes all small dune fields and some moderate to large dune fields as well. Therefore, the absence of mapped dune fields does not mean that such dune fields do not exist and is not intended to imply a lack of saltating sand in other areas. Where the availability and quality of THEMIS visible (VIS) or Mars Orbiter Camera narrow angle (MOC NA) images allowed, we classified dunes and included dune slipface measurements, which were derived from gross dune morphology and represent the prevailing wind direction at the last time of significant dune modification. For dunes located within craters, the azimuth from crater centroid to dune field centroid was calculated. Output from a general circulation model (GCM) is also included. In addition to polygons locating dune fields, the database includes over 1800 selected Thermal Emission Imaging System (THEMIS) infrared (IR), THEMIS visible (VIS) and Mars Orbiter Camera Narrow Angle (MOC NA) images that were used to build the database. The database is presented in a variety of formats. It is presented as a series of ArcReader projects, which can be opened using the free ArcReader software. The latest version of ArcReader can be downloaded at http://www.esri.com/software/arcgis/arcreader/download.html. The database is also presented in ArcMap projects. The ArcMap projects allow fuller use of the data, but require ESRI ArcMap software. Multiple projects were required to accommodate the large number of images needed. A fuller description of the projects can be found in the Dunes_ReadMe file and the ReadMe_GIS file in the Documentation folder. For users who prefer to create their own projects, the data are available in ESRI shapefile and geodatabase formats, as well as the open Geographic Markup Language (GML) format. A printable map of the dunes and craters in the database is available as a Portable Document Format (PDF) document. The map is also included as a JPEG file. ReadMe files are available in PDF and ASCII (.txt) formats. Tables are available in both Excel (.xls) and ASCII formats.
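
    The crater-centroid-to-dune-centroid azimuth mentioned above can be computed with a standard forward-azimuth formula, as in the sketch below; the coordinates are invented, and the database's own processing steps are described in its documentation rather than reproduced here.

      # Initial bearing from one centroid to another, given latitude/longitude in degrees.
      import math

      def azimuth_deg(lat1, lon1, lat2, lon2):
          """Bearing from point 1 (e.g. crater centroid) to point 2 (dune field centroid)."""
          phi1, phi2 = math.radians(lat1), math.radians(lat2)
          dlon = math.radians(lon2 - lon1)
          x = math.sin(dlon) * math.cos(phi2)
          y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
          return math.degrees(math.atan2(x, y)) % 360.0

      print(azimuth_deg(-42.0, 70.0, -42.3, 70.4))   # centroid-to-centroid azimuth in degrees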

  14. Improving wilderness stewardship through searchable databases of U.S. legislative history and legislated special provisions

    Treesearch

    David R. Craig; Peter Landres; Laurie Yung

    2010-01-01

    The online resource Wilderness.net currently provides quick access to the text of every public law designating wilderness in the U.S. National Wilderness Preservation System (NWPS). This article describes two new searchable databases recently completed and added to the information available on Wilderness.net to help wilderness managers and others understand and...

  15. SciELO, Scientific Electronic Library Online, a Database of Open Access Journals

    ERIC Educational Resources Information Center

    Meneghini, Rogerio

    2013-01-01

    This essay discusses SciELO, a scientific journal database operating in 14 countries. It covers over 1000 journals providing open access to full text and table sets of scientometrics data. In Brazil it is responsible for a collection of nearly 300 journals, selected along 15 years as the best Brazilian periodicals in natural and social sciences.…

  16. FDT 2.0: Improving scalability of the fuzzy decision tree induction tool - integrating database storage.

    PubMed

    Durham, Erin-Elizabeth A; Yu, Xiaxia; Harrison, Robert W

    2014-12-01

    Effective machine-learning handles large datasets efficiently. One key feature of handling large data is the use of databases such as MySQL. The freeware fuzzy decision tree induction tool, FDT, is a scalable supervised-classification software tool implementing fuzzy decision trees. It is based on an optimized fuzzy ID3 (FID3) algorithm. FDT 2.0 improves upon FDT 1.0 by bridging the gap between data science and data engineering: it combines a robust decisioning tool with data retention for future decisions, so that the tool does not need to be recalibrated from scratch every time a new decision is required. In this paper we briefly review the analytical capabilities of the freeware FDT tool and its major features and functionalities; examples of large biological datasets from HIV, microRNAs and sRNAs are included. This work shows how to integrate fuzzy decision algorithms with modern database technology. In addition, we show that integrating the fuzzy decision tree induction tool with database storage allows for optimal user satisfaction in today's Data Analytics world.
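
    The data-retention idea, keeping a trained decisioning model in the database so it need not be rebuilt from scratch, can be sketched with an ordinary (non-fuzzy) tree and SQLite, as below; FDT's own fuzzy ID3 trees and MySQL integration work differently.

      # Persist a trained (non-fuzzy) decision tree in a database and reuse it for new decisions.
      import pickle
      import sqlite3
      from sklearn.tree import DecisionTreeClassifier

      X, y = [[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0]   # toy training data
      model = DecisionTreeClassifier().fit(X, y)

      con = sqlite3.connect(":memory:")
      con.execute("CREATE TABLE models (name TEXT PRIMARY KEY, blob BLOB)")
      con.execute("INSERT INTO models VALUES (?, ?)", ("toy_tree", pickle.dumps(model)))

      stored = pickle.loads(con.execute(
          "SELECT blob FROM models WHERE name = ?", ("toy_tree",)).fetchone()[0])
      print(stored.predict([[1, 0]]))   # reuse the stored model without retraining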

  17. Digital geomorphological landslide hazard mapping of the Alpago area, Italy

    NASA Astrophysics Data System (ADS)

    van Westen, Cees J.; Soeters, Rob; Sijmons, Koert

    Large-scale geomorphological maps of mountainous areas are traditionally made using complex symbol-based legends. They can serve as excellent "geomorphological databases", from which an experienced geomorphologist can extract a large amount of information for hazard mapping. However, these maps are not designed to be used in combination with a GIS, due to their complex cartographic structure. In this paper, two methods are presented for digital geomorphological mapping at large scales using GIS and digital cartographic software. The methods are applied to an area with a complex geomorphological setting in the Borsoia catchment, located in the Alpago region, near Belluno in the Italian Alps. The GIS database set-up is presented with an overview of the data layers that have been generated and how they are interrelated. The GIS database was also converted into a paper map, using a digital cartographic package. The resulting large-scale geomorphological hazard map is attached. The resulting GIS database and cartographic product can be used to analyse the hazard type and hazard degree for each polygon, and to find the reasons for the hazard classification.

  18. Efficient Single-Pass Index Construction for Text Databases.

    ERIC Educational Resources Information Center

    Heinz, Steffen; Zobel, Justin

    2003-01-01

    Discusses index construction for text collections, reviews principal approaches to inverted indexes, analyzes their theoretical cost, and presents experimental results of the use of a single-pass inversion method on Web document collections. Shows that the single-pass approach is faster and does not require the complete vocabulary of the indexed…
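
    The structure at issue, an inverted index mapping each term to the documents that contain it, can be built in a single pass as sketched below; the paper's contribution concerns the in-memory vocabulary handling and on-disk merging needed at web scale, which this toy example omits.

      # Single-pass construction of a toy in-memory inverted index (term -> document ids).
      from collections import defaultdict

      documents = {
          1: "single pass index construction for text databases",
          2: "inverted indexes support fast text search",
          3: "web document collections grow quickly",
      }

      index = defaultdict(set)
      for doc_id, text in documents.items():      # one pass over the collection
          for term in text.lower().split():
              index[term].add(doc_id)

      print(sorted(index["text"]))   # -> [1, 2]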

  19. Understanding Student Article Retrieval Behaviors: Instructional Implications

    ERIC Educational Resources Information Center

    Cook-Cottone, Catherine P.; Dutt-Doner, Karen; Schoen, David

    2005-01-01

    This study evaluates the use of full-text databases amongst 425 undergraduate and graduate students in western New York. A review of literature implicated convenience, time issues, article retrieval option knowledge, and the appreciation and understanding of research article quality as potential predictors of full-text reliance. These variables…

  20. Key comparison BIPM.RI(I)-K9 of the absorbed dose to water standards of the PTB, Germany and the BIPM in medium-energy x-rays

    NASA Astrophysics Data System (ADS)

    Burns, D. T.; Kessler, C.; Büermann, L.; Ketelhut, S.

    2018-01-01

    A key comparison has been made between the absorbed dose to water standards of the PTB, Germany and the BIPM in the medium-energy x-ray range. The results show the standards to be in general agreement at the level of the standard uncertainty of the comparison of 9 to 11 parts in 10^3. The results are combined with those of a EURAMET comparison and presented in terms of degrees of equivalence for entry in the BIPM key comparison database. Main text: To reach the main text of this paper, click on Final Report. Note that this text is that which appears in Appendix B of the BIPM key comparison database kcdb.bipm.org/. The final report has been peer-reviewed and approved for publication by the CCRI, according to the provisions of the CIPM Mutual Recognition Arrangement (CIPM MRA).

  1. Early ICU Standardized Rehabilitation Therapy for the Critically Injured Burn Patient

    DTIC Science & Technology

    2017-10-01

    phase proposed to examine medical records within a large national hospital database to identify optimal care delivery patterns. Minimizing the...The original study was deemed phase I and closed. The second phase proposed to examine medical records within a large national hospital database to...engineering, and the academic world on areas such as: • improving public knowledge, attitudes, skills, and abilities; • changing behavior, practices

  2. Knowledge Quality Functions for Rule Discovery

    DTIC Science & Technology

    1994-09-01

    Managers in many organizations finding themselves in the possession of large and rapidly growing databases are beginning to suspect the information in their...missing values (Smyth and Goodman, 1992, p. 303). Decision trees "tend to grow very large for realistic applications and are thus difficult to interpret...by humans" (Holsheimer, 1994, p. 42). Decision trees also grow excessively complicated in the presence of noisy databases (Dhar and Tuzhilin, 1993, p

  3. Leveraging Cognitive Context for Object Recognition

    DTIC Science & Technology

    2014-06-01

    learned from large image databases. We build upon this concept by exploring cognitive context, demonstrating how rich dynamic context provided by...context that people rely upon as they perceive the world. Context in ACT-R/E takes the form of associations between related concepts that are learned ...and accuracy of object recognition. Context is most often viewed as a static concept, learned from large image databases. We build upon this concept by

  4. Reference Material Kydex(registered trademark)-100 Test Data Message for Flammability Testing

    NASA Technical Reports Server (NTRS)

    Engel, Carl D.; Richardson, Erin; Davis, Eddie

    2003-01-01

    The Marshall Space Flight Center (MSFC) Materials and Processes Technical Information System (MAPTIS) database contains, as an engineering resource, a large amount of material test data carefully obtained and recorded over a number of years. Flammability test data obtained using Test 1 of NASA-STD-6001 is a significant component of this database. NASA-STD-6001 recommends that Kydex 100 be used as a reference material for testing certification and for comparison between test facilities in the round-robin certification testing that occurs every 2 years. As a result of these regular activities, a large volume of test data is recorded within the MAPTIS database. The activity described in this technical report was undertaken to mine the database, recover flammability (Test 1) Kydex 100 data, and review the lessons learned from analysis of these data.

  5. An ab initio electronic transport database for inorganic materials.

    PubMed

    Ricci, Francesco; Chen, Wei; Aydemir, Umut; Snyder, G Jeffrey; Rignanese, Gian-Marco; Jain, Anubhav; Hautier, Geoffroy

    2017-07-04

    Electronic transport in materials is governed by a series of tensorial properties such as conductivity, Seebeck coefficient, and effective mass. These quantities are paramount to the understanding of materials in many fields from thermoelectrics to electronics and photovoltaics. Transport properties can be calculated from a material's band structure using the Boltzmann transport theory framework. We present here the largest computational database of electronic transport properties based on a large set of 48,000 materials originating from the Materials Project database. Our results were obtained through the interpolation approach developed in the BoltzTraP software, assuming a constant relaxation time. We present the workflow to generate the data, the data validation procedure, and the database structure. Our aim is to target the large community of scientists developing materials selection strategies and performing studies involving transport properties.

  6. Architectural Implications for Spatial Object Association Algorithms*

    PubMed Central

    Kumar, Vijay S.; Kurc, Tahsin; Saltz, Joel; Abdulla, Ghaleb; Kohn, Scott R.; Matarazzo, Celeste

    2013-01-01

    Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST). PMID:25692244
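
    Independently of the database architectures compared in the study, the crossmatch operation itself can be illustrated with a small positional matcher: two catalogues are paired wherever their angular separation falls below a tolerance. The catalogue rows, identifiers and the one-arcsecond tolerance below are hypothetical, and a survey-scale crossmatch would use spatial indexing rather than the brute-force loop shown.

```python
# Hedged sketch of a positional crossmatch between two small catalogues (hypothetical data).
import math

def ang_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def crossmatch(cat_a, cat_b, tol_deg=1.0 / 3600):   # 1 arcsecond tolerance
    """Return (id_a, id_b) pairs whose separation is below tol_deg (brute force)."""
    return [(ia, ib)
            for ia, ra_a, dec_a in cat_a
            for ib, ra_b, dec_b in cat_b
            if ang_sep_deg(ra_a, dec_a, ra_b, dec_b) < tol_deg]

cat_a = [(1, 10.0000, -5.0000), (2, 25.0, 12.0)]
cat_b = [(7, 10.0001, -5.0001), (8, 90.0, 30.0)]
print(crossmatch(cat_a, cat_b))   # [(1, 7)]
```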

  7. ApoptoProteomics, an integrated database for analysis of proteomics data obtained from apoptotic cells.

    PubMed

    Arntzen, Magnus Ø; Thiede, Bernd

    2012-02-01

    Apoptosis is the most commonly described form of programmed cell death, and dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins were included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and the expected apoptosis-related function, and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathway revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis in identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no.

  8. ApoptoProteomics, an Integrated Database for Analysis of Proteomics Data Obtained from Apoptotic Cells*

    PubMed Central

    Arntzen, Magnus Ø.; Thiede, Bernd

    2012-01-01

    Apoptosis is the most commonly described form of programmed cell death, and dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins were included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and the expected apoptosis-related function, and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathway revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis in identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no. PMID:22067098

  9. SING: Subgraph search In Non-homogeneous Graphs

    PubMed Central

    2010-01-01

    Background Finding the subgraphs of a graph database that are isomorphic to a given query graph has practical applications in several fields, from cheminformatics to image understanding. Since subgraph isomorphism is a computationally hard problem, indexing techniques have been intensively exploited to speed up the process. Such systems filter out those graphs which cannot contain the query, and apply a subgraph isomorphism algorithm to each residual candidate graph. The applicability of such systems is limited to databases of small graphs, because their filtering power degrades on large graphs. Results In this paper, SING (Subgraph search In Non-homogeneous Graphs), a novel indexing system able to cope with large graphs, is presented. The method uses the notion of feature, which can be a small subgraph, subtree or path. Each graph in the database is annotated with the set of all its features. The key point is to make use of feature locality information. This idea is used to both improve the filtering performance and speed up the subgraph isomorphism task. Conclusions Extensive tests on chemical compounds, biological networks and synthetic graphs show that the proposed system outperforms the most popular systems in query time over databases of medium and large graphs. Other specific tests show that the proposed system is effective for single large graphs. PMID:20170516
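
    The filter-and-verify strategy described above can be sketched as follows: each database graph is summarized by a set of small features (here, label paths of length one and two), and a graph survives the filter only if it contains every feature of the query; a full subgraph isomorphism test would then be run on the surviving candidates. The graphs, labels and feature choice below are illustrative only, not SING's actual feature set.

```python
# Hedged sketch of filter-and-verify subgraph search with path features (hypothetical graphs).
def path_features(edges, labels):
    """Collect label paths of length 1 and 2 as a feature set for one undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    feats = set()
    for u, v in edges:
        feats.add(tuple(sorted((labels[u], labels[v]))))        # length-1 paths
        for w in adj[v] - {u}:
            feats.add((labels[u], labels[v], labels[w]))         # length-2 paths
            feats.add((labels[w], labels[v], labels[u]))
    return feats

def filter_candidates(db, query):
    """Keep only database graphs whose feature set contains every query feature."""
    qf = path_features(*query)
    return [gid for gid, g in db.items() if qf <= path_features(*g)]

db = {
    "g1": ([("a", "b"), ("b", "c")], {"a": "C", "b": "O", "c": "C"}),
    "g2": ([("x", "y")], {"x": "C", "y": "C"}),
}
query = ([("p", "q")], {"p": "C", "q": "O"})
# Each surviving candidate would still be checked with a subgraph isomorphism algorithm.
print(filter_candidates(db, query))   # ['g1']: g2 lacks a C-O edge
```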

  10. Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers

    PubMed Central

    2011-01-01

    Background Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations using natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures to map textual mentions of variations to database identifiers need to be developed. Results This article describes a workflow for normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html. Conclusions Comparable approaches usually focus on variations mentioned on the protein sequence and neglect problems for other SNP mentions. The results presented here indicate that normalizing SNPs described on the DNA level is more difficult than the normalization of SNPs described on the protein level. The challenges associated with normalization are exemplified with ambiguities and errors, which occur in this corpus. PMID:21992066

  11. A New Interface for the Magnetics Information Consortium (MagIC) Paleo and Rock Magnetic Database

    NASA Astrophysics Data System (ADS)

    Jarboe, N.; Minnett, R.; Koppers, A. A. P.; Tauxe, L.; Constable, C.; Shaar, R.; Jonestrask, L.

    2014-12-01

    The Magnetic Information Consortium (MagIC) database (http://earthref.org/MagIC/) continues to improve the ease of uploading data, the creation of complex searches, data visualization, and data downloads for the paleomagnetic, geomagnetic, and rock magnetic communities. Data uploading has been simplified and no longer requires the use of the Excel SmartBook interface. Instead, properly formatted MagIC text files can be dragged-and-dropped onto an HTML 5 web interface. Data can be uploaded one table at a time to facilitate ease of uploading and data error checking is done online on the whole dataset at once instead of incrementally in an Excel Console. Searching the database has improved with the addition of more sophisticated search parameters and with the ability to use them in complex combinations. Searches may also be saved as permanent URLs for easy reference or for use as a citation in a publication. Data visualization plots (ARAI, equal area, demagnetization, Zijderveld, etc.) are presented with the data when appropriate to aid the user in understanding the dataset. Data from the MagIC database may be downloaded from individual contributions or from online searches for offline use and analysis in the tab delimited MagIC text file format. With input from the paleomagnetic, geomagnetic, and rock magnetic communities, the MagIC database will continue to improve as a data warehouse and resource.

  12. Validation of a common data model for active safety surveillance research

    PubMed Central

    Ryan, Patrick B; Reich, Christian G; Hartzema, Abraham G; Stang, Paul E

    2011-01-01

    Objective Systematic analysis of observational medical databases for active safety surveillance is hindered by the variation in data models and coding systems. Data analysts often find robust clinical data models difficult to understand and ill suited to support their analytic approaches. Further, some models do not facilitate the computations required for systematic analysis across many interventions and outcomes for large datasets. Translating the data from these idiosyncratic data models to a common data model (CDM) could facilitate both the analysts' understanding and the suitability for large-scale systematic analysis. In addition to facilitating analysis, a suitable CDM has to faithfully represent the source observational database. Before beginning to use the Observational Medical Outcomes Partnership (OMOP) CDM and a related dictionary of standardized terminologies for a study of large-scale systematic active safety surveillance, the authors validated the model's suitability for this use by example. Validation by example To validate the OMOP CDM, the model was instantiated into a relational database, data from 10 different observational healthcare databases were loaded into separate instances, a comprehensive array of analytic methods that operate on the data model was created, and these methods were executed against the databases to measure performance. Conclusion There was acceptable representation of the data from 10 observational databases in the OMOP CDM using the standardized terminologies selected, and a range of analytic methods was developed and executed with sufficient performance to be useful for active safety surveillance. PMID:22037893

  13. Patterns of Undergraduates' Use of Scholarly Databases in a Large Research University

    ERIC Educational Resources Information Center

    Mbabu, Loyd Gitari; Bertram, Albert; Varnum, Ken

    2013-01-01

    Authentication data was utilized to explore undergraduate usage of subscription electronic databases. These usage patterns were linked to the information literacy curriculum of the library. The data showed that of the 26,208 enrolled undergraduate students, 42% accessed a scholarly database at least once in the course of the entire…

  14. Database constraints applied to metabolic pathway reconstruction tools.

    PubMed

    Vilaplana, Jordi; Solsona, Francesc; Teixido, Ivan; Usié, Anabel; Karathia, Hiren; Alves, Rui; Mateo, Jordi

    2014-01-01

    Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, accessing the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we tried to adjust and tune the configurable parameters of the database server to reach the best performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. Then, the same database was implemented by a MapReduce-based database named HBase. The results indicated that the standard configuration of MySQL gives an acceptable performance for low or medium size databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes.

  15. An evaluation of information retrieval accuracy with simulated OCR output

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Croft, W.B.; Harding, S.M.; Taghva, K.

    Optical Character Recognition (OCR) is a critical part of many text-based applications. Although some commercial systems use the output from OCR devices to index documents without editing, there is very little quantitative data on the impact of OCR errors on the accuracy of a text retrieval system. Because of the difficulty of constructing test collections to obtain this data, we have carried out an evaluation using simulated OCR output on a variety of databases. The results show that high quality OCR devices have little effect on the accuracy of retrieval, but low quality devices used with databases of short documents can result in significant degradation.

  16. The database design of LAMOST based on MYSQL/LINUX

    NASA Astrophysics Data System (ADS)

    Li, Hui-Xian; Sang, Jian; Wang, Sha; Luo, A.-Li

    2006-03-01

    The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) will be set up in the coming years. A fully automated software system for reducing and analyzing the spectra has to be developed with the telescope. This database system is an important part of the software system. The requirements for the database of the LAMOST, the design of the LAMOST database system based on MYSQL/LINUX and performance tests of this system are described in this paper.

  17. Building a Database for a Quantitative Model

    NASA Technical Reports Server (NTRS)

    Kahn, C. Joseph; Kleinhammer, Roger

    2014-01-01

    A database can greatly benefit a quantitative analysis. The defining characteristic of a quantitative risk, or reliability, model is the use of failure estimate data. Models can easily contain a thousand Basic Events, relying on hundreds of individual data sources. Obviously, entering so much data by hand will eventually lead to errors. Less obviously, entering data this way does not aid in linking the Basic Events to the data sources. The best way to organize large amounts of data on a computer is with a database. But a model does not require a large, enterprise-level database with dedicated developers and administrators. A database built in Excel can be quite sufficient. A simple spreadsheet database can link every Basic Event to the individual data source selected for it. This database can also contain the manipulations appropriate for how the data is used in the model. These manipulations include stressing factors based on use and maintenance cycles, dormancy, unique failure modes, the modeling of multiple items as a single "Super component" Basic Event, and Bayesian updating based on flight and testing experience. A simple, unique metadata field in both the model and database provides a link from any Basic Event in the model to its data source and all relevant calculations. The credibility of the entire model often rests on the credibility and traceability of the data.
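
    As a rough illustration of the linking idea, the sketch below joins a table of Basic Events to a table of data sources on a shared metadata key and applies a simple stressing factor; the column names, keys, source citations and rates are hypothetical, and two small data frames stand in for the spreadsheet.

```python
# Hedged sketch: linking Basic Events to their data sources via a shared metadata key
# (column names, keys and numbers are hypothetical, not taken from the report).
import pandas as pd

basic_events = pd.DataFrame({
    "event_id":   ["BE-001", "BE-002"],
    "data_key":   ["VALVE-A", "PUMP-B"],     # metadata field shared with the data source table
    "duty_cycle": [0.25, 1.0],               # stressing factor based on use
})
data_sources = pd.DataFrame({
    "data_key":     ["VALVE-A", "PUMP-B"],
    "source_doc":   ["Vendor handbook p. 12", "Test report T-42"],
    "failure_rate": [2.0e-6, 5.0e-7],        # failures per hour from the cited source
})

# The join gives every Basic Event a traceable source and its adjusted failure rate.
model = basic_events.merge(data_sources, on="data_key")
model["adjusted_rate"] = model["failure_rate"] * model["duty_cycle"]
print(model[["event_id", "source_doc", "adjusted_rate"]])
```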

  18. ARCPHdb: A comprehensive protein database for SF1 and SF2 helicase from archaea.

    PubMed

    Moukhtar, Mirna; Chaar, Wafi; Abdel-Razzak, Ziad; Khalil, Mohamad; Taha, Samir; Chamieh, Hala

    2017-01-01

    Superfamily 1 and Superfamily 2 helicases, two of the largest helicase protein families, play vital roles in many biological processes including replication, transcription and translation. Study of helicase proteins in the model microorganisms of archaea has largely contributed to the understanding of their function, architecture and assembly. Based on a large phylogenomics approach, we have identified and classified all SF1 and SF2 protein families in ninety-five sequenced archaea genomes. Here we developed an online webserver linked to a specialized protein database named ARCPHdb to provide access to SF1 and SF2 helicase families from archaea. ARCPHdb was implemented using the MySQL relational database. Web interfaces were developed using NetBeans. Data were stored according to UniProt accession numbers, NCBI RefSeq IDs, PDB IDs and Entrez databases. A user-friendly interactive web interface has been developed to browse, search and download archaeal helicase protein sequences, their available 3D structure models, and related documentation available in the literature provided by ARCPHdb. The database provides direct links to matching external databases. ARCPHdb is the first online database to compile all protein information on SF1 and SF2 helicases from archaea in one platform. This database provides an essential information resource for all researchers interested in the field. Copyright © 2016 Elsevier Ltd. All rights reserved.

  19. Building a protein name dictionary from full text: a machine learning term extraction approach.

    PubMed

    Shi, Lei; Campagne, Fabien

    2005-04-07

    The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.
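
    A minimal sketch of the term-classification step might look like the following: candidate terms collected from an article are turned into character n-gram features and labelled by a linear SVM. The training terms, labels and feature choice are toy placeholders rather than the authors' feature set, and a handful of examples will not generalize; the sketch only shows the shape of the approach.

```python
# Hedged sketch: classifying candidate high-frequency terms as protein names or not with an SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

terms  = ["p53", "BRCA1 protein", "cell culture", "western blot", "cyclin D1", "buffer solution"]
labels = [1, 1, 0, 0, 1, 0]               # 1 = protein name, 0 = other term (toy labels)

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),   # character n-grams of each term
    LinearSVC(),
)
clf.fit(terms, labels)

# Terms accepted by the classifier would be added to the growing name dictionary.
print(clf.predict(["cdk2 kinase", "room temperature"]))
```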

  20. Building a protein name dictionary from full text: a machine learning term extraction approach

    PubMed Central

    Shi, Lei; Campagne, Fabien

    2005-01-01

    Background The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. Results We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. Conclusion This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt. PMID:15817129

  1. "Hyperstat": an educational and working tool in epidemiology.

    PubMed

    Nicolosi, A

    1995-01-01

    The work of a researcher in epidemiology is based on studying literature, planning studies, gathering data, analyzing data and writing results. He therefore needs to perform more or less simple calculations, to consult or quote the literature, to consult textbooks about certain issues or procedures, and to look up specific formulas. There are no programs conceived as a workstation to assist the different aspects of a researcher's work in an integrated fashion. A hypertextual system was developed which supports different stages of the epidemiologist's work. It combines database management, statistical analysis or planning, and literature searches. The software was developed on the Apple Macintosh by using HyperCard 2.1 as a database and HyperTalk as a programming language. The program is structured in 7 "stacks" or files: Procedures; Statistical Tables; Graphs; References; Text; Formulas; Help. Each stack has its own management system with an automated Table of Contents. Stacks contain "cards" which make up the databases and carry executable programs. The programs are of four kinds: association; statistical procedure; formatting (input/output); database management. The system performs general statistical procedures, procedures applicable to epidemiological studies only (follow-up and case-control), and procedures for clinical trials. All commands are given by clicking the mouse on self-explanatory "buttons". In order to perform calculations, the user only needs to enter the data into the appropriate cells and then click on the selected procedure's button. The system has a hypertextual structure. The user can go from a procedure to other cards following the preferred order of succession and according to built-in associations. The user can access different levels of knowledge or information from any stack he is consulting or operating. From every card, the user can go to a selected procedure to perform statistical calculations, to the reference database management system, to the textbook in which all procedures and issues are discussed in detail, to the database of statistical formulas with an automated table of contents, to statistical tables with an automated table of contents, or to the help module. The program has a very user-friendly interface and leaves the user free to use the same format he would use on paper. The interface does not require special skills. It reflects the Macintosh philosophy of using windows, buttons and the mouse. This allows the user to perform complicated calculations, weigh alternatives and run simulations without losing the "feel" of the data. This program shares many features in common with hypertexts. It has an underlying network database where the nodes consist of text, graphics, executable procedures, and combinations of these; the nodes in the database correspond to windows on the screen; the links between the nodes in the database are visible as "active" text or icons in the windows; the text is read by following links and opening new windows. The program is especially useful as an educational tool, directed at medical and epidemiology students. The combination of computing capabilities with a textbook and databases of formulas and literature references makes the program versatile and attractive as a learning tool. The program is also helpful in the work done at the desk, where the researcher examines results, consults literature, explores different analytic approaches, plans new studies, or writes grant proposals or scientific articles.

  2. Morphology-based Query for Galaxy Image Databases

    NASA Astrophysics Data System (ADS)

    Shamir, Lior

    2017-02-01

    Galaxies of rare morphology are of paramount scientific interest, as they carry important information about the past, present, and future Universe. Once a rare galaxy is identified, studying it more effectively requires a set of galaxies of similar morphology, allowing generalization and statistical analysis that cannot be done when N=1. Databases generated by digital sky surveys can contain a very large number of galaxy images, and therefore once a rare galaxy of interest is identified it is possible that more instances of the same morphology are also present in the database. However, when a researcher identifies a certain galaxy of rare morphology in the database, it is virtually impossible to mine the database manually in the search for galaxies of similar morphology. Here we propose a computer method that can automatically search databases of galaxy images and identify galaxies that are morphologically similar to a certain user-defined query galaxy. That is, the researcher provides an image of a galaxy of interest, and the pattern recognition system automatically returns a list of galaxies that are visually similar to the target galaxy. The algorithm uses a comprehensive set of descriptors, allowing it to support different types of galaxies, and it is not limited to a finite set of known morphologies. While the list of returned galaxies is neither clean nor complete, it contains a far higher frequency of galaxies of the morphology of interest, providing a substantial reduction of the data. Such algorithms can be integrated into data management systems of autonomous digital sky surveys such as the Large Synoptic Survey Telescope (LSST), where the number of galaxies in the database is extremely large. The source code of the method is available at http://vfacstaff.ltu.edu/lshamir/downloads/udat.
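
    The query-by-example step can be pictured as a nearest-neighbour search over precomputed image descriptors: the researcher supplies the descriptor of the query galaxy and the system returns the database galaxies with the smallest distance to it. The descriptor values, identifiers and unweighted Euclidean distance below are placeholders, not the paper's descriptor set or weighting scheme.

```python
# Hedged sketch of query-by-example over precomputed image descriptors (toy values).
import numpy as np

# Each row: a descriptor vector extracted from one galaxy image in the database.
descriptors = np.array([
    [0.12, 0.80, 0.33],
    [0.90, 0.10, 0.45],
    [0.11, 0.78, 0.30],
])
galaxy_ids = ["g001", "g002", "g003"]

def most_similar(query_vec, k=2):
    """Return the k database galaxies closest to the query descriptor."""
    dists = np.linalg.norm(descriptors - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return [(galaxy_ids[i], float(dists[i])) for i in order]

print(most_similar(np.array([0.10, 0.79, 0.31])))   # g003 and g001 are the closest matches
```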

  3. An SQL query generator for CLIPS

    NASA Technical Reports Server (NTRS)

    Snyder, James; Chirica, Laurian

    1990-01-01

    As expert systems become more widely used, their access to large amounts of external information becomes increasingly important. This information exists in several forms, such as statistical and tabular data, knowledge gained by experts, and large databases of information maintained by companies. Because many expert systems, including CLIPS, do not provide access to this external information, much of the usefulness of expert systems is left untapped. The scope of this paper is to describe a database extension for the CLIPS expert system shell. The current industry standard database language is SQL. Due to SQL standardization, large amounts of information stored on various computers, potentially at different locations, will be more easily accessible. Expert systems should be able to directly access these existing databases rather than requiring information to be re-entered into the expert system environment. The ORACLE relational database management system (RDBMS) was used to provide a database connection within the CLIPS environment. To facilitate relational database access, a query generation system was developed as a CLIPS user function. The queries are entered in a CLIPS-like syntax and are passed to the query generator, which constructs an SQL query and submits it to the ORACLE RDBMS for execution. The query results are asserted as CLIPS facts. The query generator was developed primarily for use within the ICADS project (Intelligent Computer Aided Design System) currently being developed by the CAD Research Unit at the California Polytechnic State University (Cal Poly). In ICADS, there are several parallel or distributed expert systems accessing a common knowledge base of facts. Each expert system has a narrow domain of interest and therefore needs only certain portions of the information. The query generator provides a common method of accessing this information and allows the expert system to specify what data is needed without specifying how to retrieve it.
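
    The translation step can be sketched very simply: a relation name and a set of attribute constraints taken from a CLIPS-style query are rendered as an SQL SELECT statement, and the returned rows would then be asserted back into CLIPS as facts. The query syntax, table and column names below are illustrative, not the ICADS generator's actual grammar.

```python
# Hedged sketch of translating a CLIPS-style query fact into an SQL string
# (table and column names are hypothetical).
def to_sql(relation, constraints):
    """Build 'SELECT * FROM relation WHERE col = value AND ...' from a constraint dict."""
    where = " AND ".join(f"{col} = '{val}'" for col, val in constraints.items())
    return f"SELECT * FROM {relation}" + (f" WHERE {where}" if where else "")

# A query the expert system might assert, e.g. (query (relation rooms) (floor 2) (use office)):
print(to_sql("rooms", {"floor": "2", "use": "office"}))
# SELECT * FROM rooms WHERE floor = '2' AND use = 'office'
```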

  4. SFINX-a drug-drug interaction database designed for clinical decision support systems.

    PubMed

    Böttiger, Ylva; Laine, Kari; Andersson, Marine L; Korhonen, Tuomas; Molin, Björn; Ovesjö, Marie-Louise; Tirkkonen, Tuire; Rane, Anders; Gustafsson, Lars L; Eiermann, Birgit

    2009-06-01

    The aim was to develop a drug-drug interaction database (SFINX) to be integrated into decision support systems or to be used in website solutions for clinical evaluation of interactions. Key elements such as substance properties and names, drug formulations, text structures and references were defined before development of the database. Standard operating procedures for literature searches, text writing rules and a classification system for clinical relevance and documentation level were determined. ATC codes, CAS numbers and country-specific codes for substances were identified and quality assured to ensure safe integration of SFINX into other data systems. Much effort was put into giving short and practical advice regarding clinically relevant drug-drug interactions. SFINX includes over 8,000 interaction pairs and is integrated into Swedish and Finnish computerised decision support systems. Over 31,000 physicians and pharmacists are receiving interaction alerts through SFINX. User feedback is collected for continuous improvement of the content. SFINX is a potentially valuable tool delivering instant information on drug interactions during prescribing and dispensing.

  5. Bedrock geologic map of the Grafton quadrangle, Worcester County, Massachusetts

    USGS Publications Warehouse

    Walsh, Gregory J.; Aleinikoff, John N.; Dorais, Michael J.

    2011-01-01

    The bedrock geology of the 7.5-minute Grafton, Massachusetts, quadrangle consists of deformed Neoproterozoic to early Paleozoic crystalline metamorphic and intrusive igneous rocks. Neoproterozoic intrusive, metasedimentary, and metavolcanic rocks crop out in the Avalon zone, and Cambrian to Silurian intrusive, metasedimentary, and metavolcanic rocks crop out in the Nashoba zone. Rocks of the Avalon and Nashoba zones, or terranes, are separated by the Bloody Bluff fault. The bedrock geology was mapped to study the tectonic history of the area and to provide a framework for ongoing hydrogeologic characterization of the fractured bedrock of Massachusetts. This report presents mapping by G.J. Walsh, geochronology by J.N. Aleinikoff, geochemistry by M.J. Dorais, and consists of a map, text pamphlet, and GIS database. The map and text pamphlet are available in paper format or as downloadable files (see frame at right). The GIS database is available for download. The database includes contacts of bedrock geologic units, faults, outcrops, structural geologic information, and photographs.

  6. Systematic review for geo-authentic Lonicerae Japonicae Flos.

    PubMed

    Yang, Xingyue; Liu, Yali; Hou, Aijuan; Yang, Yang; Tian, Xin; He, Liyun

    2017-06-01

    In traditional Chinese medicine, Lonicerae Japonicae Flos is commonly used as anti-inflammatory, antiviral, and antipyretic herbal medicine, and geo-authentic herbs are believed to present the highest quality among all samples from different regions. To discuss the current situation and trend of geo-authentic Lonicerae Japonicae Flos, we searched Chinese Biomedicine Literature Database, Chinese Journal Full-text Database, Chinese Scientific Journal Full-text Database, Cochrane Central Register of Controlled Trials, Wanfang, and PubMed. We investigated all studies up to November 2015 pertaining to quality assessment, discrimination, pharmacological effects, planting or processing, or ecological system of geo-authentic Lonicerae Japonicae Flos. Sixty-five studies mainly discussing about chemical fingerprint, component analysis, planting and processing, discrimination between varieties, ecological system, pharmacological effects, and safety were systematically reviewed. By analyzing these studies, we found that the key points of geo-authentic Lonicerae Japonicae Flos research were quality and application. Further studies should focus on improving the quality by selecting the more superior of all varieties and evaluating clinical effectiveness.

  7. Sting_RDB: a relational database of structural parameters for protein analysis with support for data warehousing and data mining.

    PubMed

    Oliveira, S R M; Almeida, G V; Souza, K R R; Rodrigues, D N; Kuser-Falcão, P R; Yamagishi, M E B; Santos, E H; Vieira, F D; Jardine, J G; Neshich, G

    2007-10-05

    An effective strategy for managing protein databases is to provide mechanisms to transform raw data into consistent, accurate and reliable information. Such mechanisms will greatly reduce operational inefficiencies and improve one's ability to better handle scientific objectives and interpret the research results. To achieve this challenging goal for the STING project, we introduce Sting_RDB, a relational database of structural parameters for protein analysis with support for data warehousing and data mining. In this article, we highlight the main features of Sting_RDB and show how a user can explore it for efficient and biologically relevant queries. Considering its importance for molecular biologists, effort has been made to advance Sting_RDB toward data quality assessment. To the best of our knowledge, Sting_RDB is one of the most comprehensive data repositories for protein analysis, now also capable of providing its users with a data quality indicator. This paper differs from our previous study in many aspects. First, we introduce Sting_RDB, a relational database with mechanisms for efficient and relevant queries using SQL. Sting_RDB evolved from the earlier text (flat file)-based database, in which data consistency and integrity were not guaranteed. Second, we provide support for data warehousing and mining. Third, the data quality indicator was introduced. Finally and probably most importantly, complex queries that could not be posed on a text-based database are now easily implemented. Further details are accessible at the Sting_RDB demo web page: http://www.cbi.cnptia.embrapa.br/StingRDB.

  8. Full Text Searching and Customization in the NASA ADS Abstract Service

    NASA Technical Reports Server (NTRS)

    Eichhorn, G.; Accomazzi, A.; Grant, C. S.; Kurtz, M. J.; Henneken, E. A.; Thompson, D. M.; Murray, S. S.

    2004-01-01

    The NASA-ADS Abstract Service provides a sophisticated search capability for the literature in Astronomy, Planetary Sciences, Physics/Geophysics, and Space Instrumentation. The ADS is funded by NASA and access to the ADS services is free to anybody worldwide without restrictions. It allows the user to search the literature by author, title, and abstract text. The ADS database contains over 3.6 million references, with 965,000 in the Astronomy/Planetary Sciences database, and 1.6 million in the Physics/Geophysics database. 2/3 of the records have full abstracts, the rest are table of contents entries (titles and author lists only). The coverage for the Astronomy literature is better than 95% from 1975. Before that we cover all major journals and many smaller ones. Most of the journal literature is covered back to volume 1. We now get abstracts on a regular basis from most journals. Over the last year we have entered basically all conference proceedings tables of contents that are available at the Harvard Smithsonian Center for Astrophysics library. This has greatly increased the coverage of conference proceedings in the ADS. The ADS also covers the ArXiv Preprints. We download these preprints every night and index all the preprints. They can be searched either together with the other abstracts or separately. There are currently about 260,000 preprints in that database. In January 2004 we have introduced two new services, full text searching and a personal notification service called "myADS". As all other ADS services, these are free to use for anybody.

  9. OntoMate: a text-mining tool aiding curation at the Rat Genome Database

    PubMed Central

    Liu, Weisong; Laulederkind, Stanley J. F.; Hayman, G. Thomas; Wang, Shur-Jen; Nigam, Rajni; Smith, Jennifer R.; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary

    2015-01-01

    The Rat Genome Database (RGD) is the premier repository of rat genomic, genetic and physiologic data. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism databases. RGD spends considerable effort manually curating gene, Quantitative Trait Locus (QTL) and strain information. The rapidly growing volume of biomedical literature and the active research in the biological natural language processing (bioNLP) community have given RGD the impetus to adopt text-mining tools to improve curation efficiency. Recently, RGD has initiated a project to use OntoMate, an ontology-driven, concept-based literature search engine developed at RGD, as a replacement for the PubMed (http://www.ncbi.nlm.nih.gov/pubmed) search engine in the gene curation workflow. OntoMate tags abstracts with gene names, gene mutations, organism name and most of the 16 ontologies/vocabularies used at RGD. All terms/entities tagged to an abstract are listed with the abstract in the search results. All listed terms are linked both to data entry boxes and a term browser in the curation tool. OntoMate also provides user-activated filters for species, date and other parameters relevant to the literature search. Using the system for literature search and import has streamlined the process compared to using PubMed. The system was built with a scalable and open architecture, including features specifically designed to accelerate the RGD gene curation process. With the use of bioNLP tools, RGD has added more automation to its curation workflow. Database URL: http://rgd.mcw.edu PMID:25619558

  10. The influence of the Bible geographic objects peculiarities on the concept of the spatiotemporal geoinformation system

    NASA Astrophysics Data System (ADS)

    Linsebarth, A.; Moscicka, A.

    2010-01-01

    The article describes the influence of the peculiarities of Bible geographic objects on the spatiotemporal geoinformation system of the Bible events. In the proposed concept of this system, special attention was given to the Bible geographic objects and the interrelations between the names of these objects and their location in geospace. In the Bible, both in the Old and New Testament, there are hundreds of geographical names, but the selection of these names from the Bible text is not easy. The same names are applied to persons and to geographic objects. The next problem which arises is the classification of the geographical object, because in several cases the same name is used for towns, mountains, hills, valleys, etc. Another serious problem is related to changes of the names over time. The interrelation between the object name and its location is also complicated. Geographic objects of the same name are located in various places, which should be properly correlated with the Bible text. The above-mentioned peculiarities of Bible geographic objects influenced the concept of the proposed system, which consists of three databases: reference, geographic object, and subject/thematic. The crucial component of this system is the proper architecture of the geographic object database. In the paper, a very detailed description of this database is presented. The interrelation between the databases allows Bible readers to connect the Bible text with the geography of the terrain on which the Bible events occurred and, additionally, to have access to other geographical and historical information related to the geographic objects.

  11. Application of Ferulic Acid for Alzheimer’s Disease: Combination of Text Mining and Experimental Validation

    PubMed Central

    Meng, Guilin; Meng, Xiulin; Ma, Xiaoye; Zhang, Gengping; Hu, Xiaolin; Jin, Aiping; Liu, Xueyuan

    2018-01-01

    Alzheimer’s disease (AD) is an increasing concern in human health. Despite significant research, highly effective drugs to treat AD are lacking. The present study describes the text mining process to identify drug candidates from a traditional Chinese medicine (TCM) database, along with associated protein target mechanisms. We carried out text mining to identify literature that referenced both AD and TCM and focused on identifying compounds and protein targets of interest. After targeting one potential TCM candidate, corresponding protein-protein interaction (PPI) networks were assembled in STRING to decipher the most likely mechanism of action. This was followed by validation using Western blot and co-immunoprecipitation in an AD cell model. The text mining strategy using a vast amount of AD-related literature and the TCM database identified curcumin, whose major component was ferulic acid (FA). This was used as a key candidate compound for further study. Using the top calculated interaction score in STRING, BACE1 and MMP2 were implicated in the activity of FA in AD. Exposure of SHSY5Y-APP cells to FA resulted in a decrease in the expression levels of BACE-1 and APP, while the expression of MMP-2 and MMP-9 increased in a dose-dependent manner. This suggests that FA-induced BACE1 and MMP2 pathways may be novel potential mechanisms involved in AD. The text mining of AD-related literature and the TCM database suggested FA as a promising TCM ingredient for the treatment of AD. Potential mechanisms interconnected and integrated with Aβ aggregation inhibition and extracellular matrix remodeling underlying the activity of FA were identified using in vitro studies. PMID:29896095

  12. Application of Ferulic Acid for Alzheimer's Disease: Combination of Text Mining and Experimental Validation.

    PubMed

    Meng, Guilin; Meng, Xiulin; Ma, Xiaoye; Zhang, Gengping; Hu, Xiaolin; Jin, Aiping; Zhao, Yanxin; Liu, Xueyuan

    2018-01-01

    Alzheimer's disease (AD) is an increasing concern in human health. Despite significant research, highly effective drugs to treat AD are lacking. The present study describes the text mining process to identify drug candidates from a traditional Chinese medicine (TCM) database, along with associated protein target mechanisms. We carried out text mining to identify literature that referenced both AD and TCM and focused on identifying compounds and protein targets of interest. After targeting one potential TCM candidate, corresponding protein-protein interaction (PPI) networks were assembled in STRING to decipher the most likely mechanism of action. This was followed by validation using Western blot and co-immunoprecipitation in an AD cell model. The text mining strategy using a vast amount of AD-related literature and the TCM database identified curcumin, whose major component was ferulic acid (FA). This was used as a key candidate compound for further study. Using the top calculated interaction score in STRING, BACE1 and MMP2 were implicated in the activity of FA in AD. Exposure of SHSY5Y-APP cells to FA resulted in a decrease in the expression levels of BACE-1 and APP, while the expression of MMP-2 and MMP-9 increased in a dose-dependent manner. This suggests that FA-induced BACE1 and MMP2 pathways may be novel potential mechanisms involved in AD. The text mining of AD-related literature and the TCM database suggested FA as a promising TCM ingredient for the treatment of AD. Potential mechanisms interconnected and integrated with Aβ aggregation inhibition and extracellular matrix remodeling underlying the activity of FA were identified using in vitro studies.

  13. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.
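
    The "virtual partitioning" idea can be pictured as assigning every map task one chunk of query sequences and one virtual segment (an offset range) of a single physical database, so that both axes are split without re-formatting the database. The chunk sizes, offsets and task layout below are illustrative, not HBlast's actual scheduling; a reduce phase would merge the per-segment hits for each query.

```python
# Hedged sketch of "virtual partitioning" of queries and database segments across map tasks
# (toy sizes and offsets; no actual BLAST search is performed here).
from itertools import product

def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

queries   = [f"q{i}" for i in range(6)]          # input query sequences
db_ranges = [(0, 499), (500, 999), (1000, 1499)] # virtual offsets into one BLAST database

map_tasks = list(product(chunks(queries, 2), db_ranges))
for qchunk, (start, end) in map_tasks:
    print(f"map task: queries={qchunk} against database records {start}-{end}")
```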

  14. Large scale database scrubbing using object oriented software components.

    PubMed

    Herting, R L; Barnes, M R

    1998-01-01

    Now that case managers, quality improvement teams, and researchers use medical databases extensively, the ability to share and disseminate such databases while maintaining patient confidentiality is paramount. A process called scrubbing addresses this problem by removing personally identifying information while keeping the integrity of the medical information intact. Scrubbing entire databases, containing multiple tables, requires that the implicit relationships between data elements in different tables of the database be maintained. To address this issue we developed DBScrub, a Java program that interfaces with any JDBC compliant database and scrubs the database while maintaining the implicit relationships within it. DBScrub uses a small number of highly configurable object-oriented software components to carry out the scrubbing. We describe the structure of these software components and how they maintain the implicit relationships within the database.
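
    The key requirement described above, that implicit relationships survive scrubbing, can be illustrated by replacing every identifying value with a deterministic pseudonym: the same input always maps to the same output, so join keys shared between tables stay consistent. The table layouts, secret and hashing choice below are a hypothetical sketch, not DBScrub's actual configuration.

```python
# Hedged sketch of scrubbing two related tables while preserving their implicit link
# (hypothetical records; a real deployment would manage the secret carefully).
import hashlib

def pseudonym(value, secret="local-secret"):
    """Deterministic, non-reversible replacement for an identifying value."""
    return "P" + hashlib.sha256((secret + value).encode()).hexdigest()[:8]

patients = [{"mrn": "12345", "name": "Jane Doe", "dx": "asthma"}]
visits   = [{"mrn": "12345", "date": "1998-03-02", "note": "follow-up"}]

scrubbed_patients = [{"mrn": pseudonym(p["mrn"]), "name": None, "dx": p["dx"]} for p in patients]
scrubbed_visits   = [{"mrn": pseudonym(v["mrn"]), "date": v["date"], "note": v["note"]} for v in visits]

print(scrubbed_patients[0]["mrn"] == scrubbed_visits[0]["mrn"])   # True: the join key survives
```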

  15. Cohort profile of the South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLaM BRC) Case Register: current status and recent enhancement of an Electronic Mental Health Record-derived data resource.

    PubMed

    Perera, Gayan; Broadbent, Matthew; Callard, Felicity; Chang, Chin-Kuo; Downs, Johnny; Dutta, Rina; Fernandes, Andrea; Hayes, Richard D; Henderson, Max; Jackson, Richard; Jewell, Amelia; Kadra, Giouliana; Little, Ryan; Pritchard, Megan; Shetty, Hitesh; Tulloch, Alex; Stewart, Robert

    2016-03-01

    The South London and Maudsley National Health Service (NHS) Foundation Trust Biomedical Research Centre (SLaM BRC) Case Register and its Clinical Record Interactive Search (CRIS) application were developed in 2008, generating a research repository of real-time, anonymised, structured and open-text data derived from the electronic health record system used by SLaM, a large mental healthcare provider in southeast London. In this paper, we update this register's descriptive data, and describe the substantial expansion and extension of the data resource since its original development. Descriptive data were generated from the SLaM BRC Case Register on 31 December 2014. Currently, there are over 250,000 patient records accessed through CRIS. Since 2008, the most significant developments in the SLaM BRC Case Register have been the introduction of natural language processing to extract structured data from open-text fields, linkages to external sources of data, and the addition of a parallel relational database (Structured Query Language) output. Natural language processing applications to date have brought in new and hitherto inaccessible data on cognitive function, education, social care receipt, smoking, diagnostic statements and pharmacotherapy. In addition, through external data linkages, large volumes of supplementary information have been accessed on mortality, hospital attendances and cancer registrations. Coupled with robust data security and governance structures, electronic health records provide potentially transformative information on mental disorders and outcomes in routine clinical care. The SLaM BRC Case Register continues to grow as a database, with approximately 20,000 new cases added each year, in addition to extension of follow-up for existing cases. Data linkages and natural language processing present important opportunities to enhance this type of research resource further, achieving both volume and depth of data. However, research projects still need to be carefully tailored, so that they take into account the nature and quality of the source information. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/

  16. Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature.

    PubMed

    Dahdul, Wasila M; Balhoff, James P; Engeman, Jeffrey; Grande, Terry; Hilton, Eric J; Kothari, Cartik; Lapp, Hilmar; Lundberg, John G; Midford, Peter E; Vision, Todd J; Westerfield, Monte; Mabee, Paula M

    2010-05-20

    The wealth of phenotypic descriptions documented in the published articles, monographs, and dissertations of phylogenetic systematics is traditionally reported in a free-text format, and it is therefore largely inaccessible for linkage to biological databases for genetics, development, and phenotypes, and difficult to manage for large-scale integrative work. The Phenoscape project aims to represent these complex and detailed descriptions with rich and formal semantics that are amenable to computation and integration with phenotype data from other fields of biology. This entails reconceptualizing the traditional free-text characters into the computable Entity-Quality (EQ) formalism using ontologies. We used ontologies and the EQ formalism to curate a collection of 47 phylogenetic studies on ostariophysan fishes (including catfishes, characins, minnows, knifefishes) and their relatives with the goal of integrating these complex phenotype descriptions with information from an existing model organism database (zebrafish, http://zfin.org). We developed a curation workflow for the collection of character, taxonomic and specimen data from these publications. A total of 4,617 phenotypic characters (10,512 states) for 3,449 taxa, primarily species, were curated into EQ formalism (for a total of 12,861 EQ statements) using anatomical and taxonomic terms from teleost-specific ontologies (Teleost Anatomy Ontology and Teleost Taxonomy Ontology) in combination with terms from a quality ontology (Phenotype and Trait Ontology). Standards and guidelines for consistently and accurately representing phenotypes were developed in response to the challenges that were evident from two annotation experiments and from feedback from curators. The challenges we encountered and many of the curation standards and methods for improving consistency that we developed are generally applicable to any effort to represent phenotypes using ontologies. This is because an ontological representation of the detailed variations in phenotype, whether between mutant or wildtype, among individual humans, or across the diversity of species, requires a process by which a precise combination of terms from domain ontologies are selected and organized according to logical relations. The efficiencies that we have developed in this process will be useful for any attempt to annotate complex phenotypic descriptions using ontologies. We also discuss some ramifications of EQ representation for the domain of systematics.

  17. What is lost when searching only one literature database for articles relevant to injury prevention and safety promotion?

    PubMed

    Lawrence, D W

    2008-12-01

    To assess what is lost if only one literature database is searched for articles relevant to injury prevention and safety promotion (IPSP) topics. Serial textword (keyword, free-text) searches using multiple synonym terms for five key IPSP topics (bicycle-related brain injuries, ethanol-impaired driving, house fires, road rage, and suicidal behaviors among adolescents) were conducted in four of the bibliographic databases that are most used by IPSP professionals: EMBASE, MEDLINE, PsycINFO, and Web of Science. Through a systematic procedure, an inventory of articles on each topic in each database was conducted to identify the total unduplicated count of all articles on each topic, the number of articles unique to each database, and the articles available if only one database is searched. No single database included all of the relevant articles on any topic, and the database with the broadest coverage differed by topic. A search of only one literature database will return 16.7-81.5% (median 43.4%) of the available articles on any of five key IPSP topics. Each database contributed unique articles to the total bibliography for each topic. A literature search performed in only one database will, on average, lead to a loss of more than half of the available literature on a topic.
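    The coverage arithmetic reported above can be reproduced mechanically once each database's retrieved article identifiers are known; a toy sketch (with invented identifiers) is given below.

      # Toy coverage calculation: per-database sets of retrieved article IDs
      # (invented here) give the unduplicated total, each database's share of
      # the topic bibliography, and the articles unique to each database.
      retrieved = {
          "EMBASE":         {"a1", "a2", "a3", "a5"},
          "MEDLINE":        {"a1", "a2", "a4"},
          "PsycINFO":       {"a2", "a6"},
          "Web of Science": {"a1", "a3", "a7"},
      }

      all_articles = set().union(*retrieved.values())
      print(f"Unduplicated total: {len(all_articles)}")

      for name, ids in retrieved.items():
          others = set().union(*(v for k, v in retrieved.items() if k != name))
          share = 100 * len(ids) / len(all_articles)
          print(f"{name}: {share:.1f}% of the bibliography, {len(ids - others)} unique article(s)")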

  18. Text Mining the Biomedical Literature

    DTIC Science & Technology

    2007-11-05

    Only fragmentary snippet text is available for this record. The recoverable fragments concern coordination of text-mining efforts across agencies: avoiding duplicated work and repeated mistakes, whether the balance among single-department and joint projects (multi-department, multi-agency, multi-national, and government-industry) is appropriate, and whether the overall database taxonomy covers all concepts consistently across the databases.

  19. Computer-Assisted Promotion of Recreational Opportunities in Natural Resource Areas: A Demonstration and Case Example

    Treesearch

    Emilyn Sheffield; Leslie Furr; Charles Nelson

    1992-01-01

    Filevision IV is a multilayer imaging and database management system that combines drawing, filing and extensive report-writing capabilities (Filevision IV, 1988). Filevision IV users access data by attaching graphics to text-oriented database records. Tourist attractions, support services, and geographic features can be located on a base map of an area or region....

  20. NUCFRG2: An evaluation of the semiempirical nuclear fragmentation database

    NASA Technical Reports Server (NTRS)

    Wilson, J. W.; Tripathi, R. K.; Cucinotta, F. A.; Shinn, J. L.; Badavi, F. F.; Chun, S. Y.; Norbury, J. W.; Zeitlin, C. J.; Heilbronn, L.; Miller, J.

    1995-01-01

    A semiempirical abrasion-ablation model has been successful in generating a large nuclear database for the study of high charge and energy (HZE) ion beams, radiation physics, and galactic cosmic ray shielding. The cross sections that are generated are compared with measured HZE fragmentation data from various experimental groups. A research program for improvement of the database generator is also discussed.

  1. NCBI2RDF: enabling full RDF-based access to NCBI databases.

    PubMed

    Anguita, Alberto; García-Remesal, Miguel; de la Iglesia, Diana; Maojo, Victor

    2013-01-01

    RDF has become the standard technology for enabling interoperability among heterogeneous biomedical databases. The NCBI provides access to a large set of life sciences databases through a common interface called Entrez. However, the latter does not provide RDF-based access to such databases, and, therefore, they cannot be integrated with other RDF-compliant databases and accessed via SPARQL query interfaces. This paper presents the NCBI2RDF system, aimed at providing RDF-based access to the complete NCBI data repository. This API creates a virtual endpoint for servicing SPARQL queries over different NCBI repositories and presenting to users the query results in SPARQL results format, thus enabling this data to be integrated and/or stored with other RDF-compliant repositories. SPARQL queries are dynamically resolved, decomposed, and forwarded to the NCBI-provided E-utilities programmatic interface to access the NCBI data. Furthermore, we show how our approach increases the expressiveness of the native NCBI querying system, allowing several databases to be accessed simultaneously. This feature significantly boosts productivity when working with complex queries and saves time and effort for biomedical researchers. Our approach has been validated with a large number of SPARQL queries, thus proving its reliability and enhanced capabilities in biomedical environments.
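    The query pattern described above can be illustrated as follows; the endpoint URL and vocabulary are hypothetical stand-ins, since the actual NCBI2RDF namespace is not reproduced in this record.

      from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

      # Hypothetical endpoint and predicates: one SPARQL query that spans a gene
      # database and a literature database, of the kind NCBI2RDF decomposes and
      # forwards to the E-utilities interface.
      endpoint = SPARQLWrapper("http://example.org/ncbi2rdf/sparql")
      endpoint.setQuery("""
          PREFIX ex: <http://example.org/ncbi2rdf/vocab#>
          SELECT ?article ?pubmedId WHERE {
              ?gene     ex:symbol       "BRCA1" .
              ?article  ex:mentionsGene ?gene ;
                        ex:pubmedId     ?pubmedId .
          } LIMIT 10
      """)
      endpoint.setReturnFormat(JSON)
      results = endpoint.query().convert()
      for row in results["results"]["bindings"]:
          print(row["pubmedId"]["value"])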

  2. A comparison of database systems for XML-type data.

    PubMed

    Risse, Judith E; Leunissen, Jack A M

    2010-01-01

    In the field of bioinformatics interchangeable data formats based on XML are widely used. XML-type data is also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. In this paper we analyse the suitability of different database systems for storing and querying large datasets in general and Medline in particular. All reviewed database systems perform well when tested with small to medium sized datasets; however, when the full Medline dataset is queried, a large variation in query times is observed. There is not one system that is vastly superior to the others in this comparison and, depending on the database size and the query requirements, different systems are most suitable. The best all-round solution is the Oracle 11g database system using the new binary storage option. Alias-i's LingPipe is a more lightweight, customizable and sufficiently fast solution. It does however require more initial configuration steps. For data with a changing XML structure Sedna and BaseX as native XML database systems or MySQL with an XML-type column are suitable.
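    For readers unfamiliar with XML-type queries, the sketch below shows the general shape of such a query against Medline-style records using XPath; the toy document and element names loosely follow the MEDLINE citation layout and should be checked against the real schema.

      from lxml import etree  # pip install lxml

      # Toy Medline-style fragment; element names approximate the MEDLINE
      # citation layout and are assumptions for illustration.
      xml = b"""
      <MedlineCitationSet>
        <MedlineCitation>
          <Article>
            <Journal><Title>Example Journal</Title></Journal>
            <ArticleTitle>Storing XML in relational and native XML databases</ArticleTitle>
          </Article>
        </MedlineCitation>
      </MedlineCitationSet>
      """
      root = etree.fromstring(xml.strip())
      titles = root.xpath(
          "//MedlineCitation[Article/Journal/Title='Example Journal']"
          "/Article/ArticleTitle/text()"
      )
      print(titles)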

  3. FishTraits: a database of ecological and life-history traits of freshwater fishes of the United States

    USGS Publications Warehouse

    Angermeier, Paul L.; Frimpong, Emmanuel A.

    2011-01-01

    The need for integrated and widely accessible sources of species traits data to facilitate studies of ecology, conservation, and management has motivated development of traits databases for various taxa. In spite of the increasing number of traits-based analyses of freshwater fishes in the United States, no consolidated database of traits of this group exists publicly, and much useful information on these species is documented only in obscure sources. The largely inaccessible and unconsolidated traits information makes large-scale analysis involving many fishes and/or traits particularly challenging. We have compiled a database of > 100 traits for 809 (731 native and 78 nonnative) fish species found in freshwaters of the conterminous United States, including 37 native families and 145 native genera. The database, named FishTraits, contains information on four major categories of traits: (1) trophic ecology; (2) body size, reproductive ecology, and life history; (3) habitat preferences; and (4) salinity and temperature tolerances. Information on geographic distribution and conservation status was also compiled. The database enhances many opportunities for conducting research on fish species traits and constitutes the first step toward establishing a central repository for a continually expanding set of traits of North American fishes.

  4. Evolution of the use of relational and NoSQL databases in the ATLAS experiment

    NASA Astrophysics Data System (ADS)

    Barberis, D.

    2016-09-01

    For many years, the ATLAS experiment has used a large database infrastructure based on Oracle to store several different types of non-event data: time-dependent detector configuration and conditions data, calibrations and alignments, configurations of Grid sites, catalogues for data management tools, job records for distributed workload management tools, run and event metadata. The rapid development of "NoSQL" databases (structured storage services) in the last five years has allowed an extended and complementary usage of traditional relational databases and new structured storage tools in order to improve the performance of existing applications and to extend their functionalities using the possibilities offered by the modern storage systems. The trend is towards using the best tool for each kind of data: separating, for example, intrinsically relational metadata from payload storage, and frequently updated records that benefit from transactions from archived information. Access to all components has to be orchestrated by specialised services that run on front-end machines and shield the user from the complexity of data storage infrastructure. This paper describes this technology evolution in the ATLAS database infrastructure and presents a few examples of large database applications that benefit from it.

  5. NVST Data Archiving System Based On FastBit NoSQL Database

    NASA Astrophysics Data System (ADS)

    Liu, Ying-bo; Wang, Feng; Ji, Kai-fan; Deng, Hui; Dai, Wei; Liang, Bo

    2014-06-01

    The New Vacuum Solar Telescope (NVST) is a 1-meter vacuum solar telescope that aims to observe the fine structures of active regions on the Sun. The main tasks of the NVST are high resolution imaging and spectral observations, including the measurements of the solar magnetic field. The NVST has collected more than 20 million FITS files since it began routine observations in 2012 and produces up to 120 thousand files of observational records in a day. Given the large number of files, the effective archiving and retrieval of files becomes a critical and urgent problem. In this study, we implement a new data archiving system for the NVST based on the FastBit Not Only Structured Query Language (NoSQL) database. Compared to a relational database (i.e., MySQL; My Structured Query Language), the FastBit database shows distinct advantages in indexing and querying performance. In a large-scale database of 40 million records, the multi-field combined query response time of the FastBit database is about 15 times faster and fully meets the requirements of the NVST. Our study offers a new approach to massive astronomical data archiving and would contribute to the design of data management systems for other astronomical telescopes.

  6. Scaling up health knowledge at European level requires sharing integrated data: an approach for collection of database specification.

    PubMed

    Menditto, Enrica; Bolufer De Gea, Angela; Cahir, Caitriona; Marengoni, Alessandra; Riegler, Salvatore; Fico, Giuseppe; Costa, Elisio; Monaco, Alessandro; Pecorelli, Sergio; Pani, Luca; Prados-Torres, Alexandra

    2016-01-01

    Computerized health care databases have been widely described as an excellent opportunity for research. The availability of "big data" has brought about a wave of innovation in projects when conducting health services research. Most of the available secondary data sources are restricted to the geographical scope of a given country and present heterogeneous structure and content. Under the umbrella of the European Innovation Partnership on Active and Healthy Ageing, collaborative work conducted by the partners of the group on "adherence to prescription and medical plans" identified the use of observational and large-population databases to monitor medication-taking behavior in the elderly. This article describes the methodology used to gather the information from available databases among the Adherence Action Group partners with the aim of improving data sharing on a European level. A total of six databases belonging to three different European countries (Spain, Republic of Ireland, and Italy) were included in the analysis. Preliminary results suggest that there are some similarities. However, these results should be applied in different contexts and European countries, supporting the idea that large European studies should be designed in order to get the most out of already available databases.

  7. FCDD: A Database for Fruit Crops Diseases.

    PubMed

    Chauhan, Rupal; Jasrai, Yogesh; Pandya, Himanshu; Chaudhari, Suman; Samota, Chand Mal

    2014-01-01

    The Fruit Crops Diseases Database (FCDD) requires a number of biotechnology and bioinformatics tools. The FCDD is a unique bioinformatics resource that compiles 162 detailed entries on fruit crop diseases, covering disease type, causal organism, images, symptoms and their control. The FCDD contains 171 phytochemicals from 25 fruits, their 2D images and their 20 possible sequences. This information has been manually extracted and manually verified from numerous sources, including other electronic databases, textbooks and scientific journals. FCDD is fully searchable and supports extensive text search. The main focus of the FCDD is on providing possible information on fruit crop diseases, which will help in the discovery of potential drugs from a common bioresource, fruits. The database was developed using MySQL. The database interface is developed in PHP, HTML and JAVA. FCDD is freely available. http://www.fruitcropsdd.com/

  8. Querying databases of trajectories of differential equations: Data structures for trajectories

    NASA Technical Reports Server (NTRS)

    Grossman, Robert

    1989-01-01

    One approach to qualitative reasoning about dynamical systems is to extract qualitative information by searching or making queries on databases containing very large numbers of trajectories. The efficiency of such queries depends crucially upon finding an appropriate data structure for trajectories of dynamical systems. Suppose that a large number of parameterized trajectories γ of a dynamical system evolving in R^N are stored in a database. Let η ⊂ R^N denote a parameterized path in Euclidean space, and let ‖·‖ denote a norm on the space of paths. A data structure is defined to represent trajectories of dynamical systems, and an algorithm is sketched which answers queries.
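    The abstract leaves the data structure itself to the paper; purely as an illustration of the kind of query involved (and not the paper's construction), the sketch below stores trajectories as sampled points on a common time grid and answers a nearness query in a discrete sup norm.

      import numpy as np

      # Illustrative only: each trajectory gamma is an array of samples gamma(t_i)
      # in R^N on a common time grid; the query asks which stored trajectories
      # stay within eps of a reference path eta under a discrete sup norm.
      rng = np.random.default_rng(0)
      n_steps, dim = 100, 3
      t = np.linspace(0.0, 1.0, n_steps)

      eta = np.stack([np.sin(t), np.cos(t), t], axis=1)  # reference path in R^3
      trajectories = [eta + 0.01 * k * rng.standard_normal((n_steps, dim)) for k in range(20)]

      def sup_distance(gamma: np.ndarray, eta: np.ndarray) -> float:
          """max_i || gamma(t_i) - eta(t_i) ||_2 : a discrete sup-norm distance."""
          return float(np.max(np.linalg.norm(gamma - eta, axis=1)))

      eps = 0.1
      close = [i for i, gamma in enumerate(trajectories) if sup_distance(gamma, eta) <= eps]
      print(f"{len(close)} of {len(trajectories)} trajectories stay within {eps} of eta")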

  9. Ice Accretion Test Results for Three Large-Scale Swept-Wing Models in the NASA Icing Research Tunnel

    NASA Technical Reports Server (NTRS)

    Broeren, Andy; Potapczuk, Mark; Lee, Sam; Malone, Adam; Paul, Ben; Woodard, Brian

    2016-01-01

    The design and certification of modern transport airplanes for flight in icing conditions increasingly relies on three-dimensional numerical simulation tools for ice accretion prediction. There is currently no publicly available, high-quality, ice accretion database upon which to evaluate the performance of icing simulation tools for large-scale swept wings that are representative of modern commercial transport airplanes. The purpose of this presentation is to present the results of a series of icing wind tunnel test campaigns whose aim was to provide an ice accretion database for large-scale, swept wings.

  10. An ab initio electronic transport database for inorganic materials

    DOE PAGES

    Ricci, Francesco; Chen, Wei; Aydemir, Umut; ...

    2017-07-04

    Electronic transport in materials is governed by a series of tensorial properties such as conductivity, Seebeck coefficient, and effective mass. These quantities are paramount to the understanding of materials in many fields from thermoelectrics to electronics and photovoltaics. Transport properties can be calculated from a material’s band structure using the Boltzmann transport theory framework. We present here the largest computational database of electronic transport properties based on a large set of 48,000 materials originating from the Materials Project database. Our results were obtained through the interpolation approach developed in the BoltzTraP software, assuming a constant relaxation time. We present the workflow to generate the data, the data validation procedure, and the database structure. In conclusion, our aim is to target the large community of scientists developing materials selection strategies and performing studies involving transport properties.

  11. Architectural Implications for Spatial Object Association Algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kumar, V S; Kurc, T; Saltz, J

    2009-01-29

    Spatial object association, also referred to as cross-match of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST).
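    As a point of reference for what a positional cross-match computes (this is a generic nearest-neighbour sketch, not one of the algorithms or systems evaluated in the study): convert sky coordinates to unit vectors and search within a small angular radius.

      import numpy as np
      from scipy.spatial import cKDTree

      # Minimal cross-match illustration: convert (RA, Dec) to unit vectors and,
      # for each object in catalogue A, find objects in catalogue B within a
      # small angular radius. Catalogues are random points for demonstration.
      def radec_to_xyz(ra_deg: np.ndarray, dec_deg: np.ndarray) -> np.ndarray:
          ra, dec = np.radians(ra_deg), np.radians(dec_deg)
          return np.column_stack([np.cos(dec) * np.cos(ra),
                                  np.cos(dec) * np.sin(ra),
                                  np.sin(dec)])

      rng = np.random.default_rng(1)
      cat_a = radec_to_xyz(rng.uniform(0, 360, 1000), rng.uniform(-90, 90, 1000))
      cat_b = radec_to_xyz(rng.uniform(0, 360, 1000), rng.uniform(-90, 90, 1000))

      radius_deg = 0.1
      chord = 2.0 * np.sin(np.radians(radius_deg) / 2.0)  # angular radius -> chord length
      matches = cKDTree(cat_b).query_ball_point(cat_a, r=chord)
      n_matched = sum(1 for m in matches if m)
      print(f"{n_matched} objects in A have at least one match in B within {radius_deg} deg")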

  12. The future of medical diagnostics: large digitized databases.

    PubMed

    Kerr, Wesley T; Lau, Edward P; Owens, Gwen E; Trefler, Aaron

    2012-09-01

    The electronic health record mandate within the American Recovery and Reinvestment Act of 2009 will have a far-reaching effect on medicine. In this article, we provide an in-depth analysis of how this mandate is expected to stimulate the production of large-scale, digitized databases of patient information. There is evidence to suggest that millions of patients and the National Institutes of Health will fully support the mining of such databases to better understand the process of diagnosing patients. This data mining likely will reaffirm and quantify known risk factors for many diagnoses. This quantification may be leveraged to further develop computer-aided diagnostic tools that weigh risk factors and provide decision support for health care providers. We expect that creation of these databases will stimulate the development of computer-aided diagnostic support tools that will become an integral part of modern medicine.

  13. An ab initio electronic transport database for inorganic materials

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ricci, Francesco; Chen, Wei; Aydemir, Umut

    Electronic transport in materials is governed by a series of tensorial properties such as conductivity, Seebeck coefficient, and effective mass. These quantities are paramount to the understanding of materials in many fields from thermoelectrics to electronics and photovoltaics. Transport properties can be calculated from a material’s band structure using the Boltzmann transport theory framework. We present here the largest computational database of electronic transport properties based on a large set of 48,000 materials originating from the Materials Project database. Our results were obtained through the interpolation approach developed in the BoltzTraP software, assuming a constant relaxation time. We present the workflow to generate the data, the data validation procedure, and the database structure. In conclusion, our aim is to target the large community of scientists developing materials selection strategies and performing studies involving transport properties.

  14. The MPI Facial Expression Database — A Validated Database of Emotional and Conversational Facial Expressions

    PubMed Central

    Kaulard, Kathrin; Cunningham, Douglas W.; Bülthoff, Heinrich H.; Wallraven, Christian

    2012-01-01

    The ability to communicate is one of the core aspects of human life. For this, we use not only verbal but also nonverbal signals of remarkable complexity. Among the latter, facial expressions belong to the most important information channels. Despite the large variety of facial expressions we use in daily life, research on facial expressions has so far mostly focused on the emotional aspect. Consequently, most databases of facial expressions available to the research community also include only emotional expressions, neglecting the largely unexplored aspect of conversational expressions. To fill this gap, we present the MPI facial expression database, which contains a large variety of natural emotional and conversational expressions. The database contains 55 different facial expressions performed by 19 German participants. Expressions were elicited with the help of a method-acting protocol, which guarantees both well-defined and natural facial expressions. The method-acting protocol was based on every-day scenarios, which are used to define the necessary context information for each expression. All facial expressions are available in three repetitions, in two intensities, as well as from three different camera angles. A detailed frame annotation is provided, from which a dynamic and a static version of the database have been created. In addition to describing the database in detail, we also present the results of an experiment with two conditions that serve to validate the context scenarios as well as the naturalness and recognizability of the video sequences. Our results provide clear evidence that conversational expressions can be recognized surprisingly well from visual information alone. The MPI facial expression database will enable researchers from different research fields (including the perceptual and cognitive sciences, but also affective computing, as well as computer vision) to investigate the processing of a wider range of natural facial expressions. PMID:22438875

  15. Biological Databases for Behavioral Neurobiology

    PubMed Central

    Baker, Erich J.

    2014-01-01

    Databases are, at their core, abstractions of data and their intentionally derived relationships. They serve as a central organizing metaphor and repository, supporting or augmenting nearly all bioinformatics. Behavioral domains provide a unique stage for contemporary databases, as research in this area spans diverse data types, locations, and data relationships. This chapter provides foundational information on the diversity and prevalence of databases and on how data structures support the various needs of behavioral neuroscience analysis and interpretation. The focus is on the classes of databases, data curation, and advanced applications in bioinformatics using examples largely drawn from research efforts in behavioral neuroscience. PMID:23195119

  16. A crowdsourcing workflow for extracting chemical-induced disease relations from free text

    PubMed Central

    Li, Tong Shu; Bravo, Àlex; Furlong, Laura I.; Good, Benjamin M.; Su, Andrew I.

    2016-01-01

    Relations between chemicals and diseases are one of the most queried biomedical interactions. Although expert manual curation is the standard method for extracting these relations from the literature, it is expensive and impractical to apply to large numbers of documents, and therefore alternative methods are required. We describe here a crowdsourcing workflow for extracting chemical-induced disease relations from free text as part of the BioCreative V Chemical Disease Relation challenge. Five non-expert workers on the CrowdFlower platform were shown each potential chemical-induced disease relation highlighted in the original source text and asked to make binary judgments about whether the text supported the relation. Worker responses were aggregated through voting, and relations receiving four or more votes were predicted as true. On the official evaluation dataset of 500 PubMed abstracts, the crowd attained a 0.505 F-score (0.475 precision, 0.540 recall), with a maximum theoretical recall of 0.751 due to errors with named entity recognition. The total crowdsourcing cost was $1290.67 ($2.58 per abstract), and the task took a total of 7 h. A qualitative error analysis revealed that 46.66% of sampled errors were due to task limitations and gold standard errors, indicating that performance can still be improved. All code and results are publicly available at https://github.com/SuLab/crowd_cid_relex Database URL: https://github.com/SuLab/crowd_cid_relex PMID:27087308
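    The aggregation rule is simple enough to state in a few lines; the sketch below (with invented identifiers and votes) applies the four-or-more-positive-votes threshold described above.

      # Sketch of the vote-aggregation rule: each candidate chemical-induced
      # disease relation gets binary judgments from five workers; relations with
      # four or more positive votes are predicted true. Identifiers and votes
      # below are invented for illustration.
      judgments = [
          ("chem_1", "disease_1", [True, True, True, True, False]),
          ("chem_2", "disease_2", [True, False, False, True, False]),
          ("chem_3", "disease_3", [True, True, True, True, True]),
      ]

      predicted_true = [
          (chemical, disease)
          for chemical, disease, votes in judgments
          if sum(votes) >= 4
      ]
      print(predicted_true)  # -> [('chem_1', 'disease_1'), ('chem_3', 'disease_3')]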

  17. BioCarian: search engine for exploratory searches in heterogeneous biological databases.

    PubMed

    Zaki, Nazar; Tennakoon, Chandana

    2017-10-02

    There are a large number of biological databases publicly available to scientists on the web. Also, there are many private databases generated in the course of research projects. These databases are in a wide variety of formats. Web standards have evolved in recent times and semantic web technologies are now available to interconnect diverse and heterogeneous sources of data. Therefore, integration and querying of biological databases can be facilitated by techniques used in the semantic web. Heterogeneous databases can be converted into Resource Description Framework (RDF) and queried using the SPARQL language. Searching for exact queries in these databases is trivial. However, exploratory searches need customized solutions, especially when multiple databases are involved. This process is cumbersome and time-consuming for those without a sufficient background in computer science. In this context, a search engine facilitating exploratory searches of databases would be of great help to the scientific community. We present BioCarian, an efficient and user-friendly search engine for performing exploratory searches on biological databases. The search engine is an interface for SPARQL queries over RDF databases. We note that many of the databases can be converted to tabular form. We first convert the tabular databases to RDF. The search engine provides a graphical interface based on facets to explore the converted databases. The facet interface is more advanced than conventional facets. It allows complex queries to be constructed, and has additional features such as ranking of facet values based on several criteria, visually indicating the relevance of a facet value and presenting the most important facet values when a large number of choices are available. For advanced users, SPARQL queries can be run directly on the databases. Using this feature, users will be able to incorporate federated searches of SPARQL endpoints. We used the search engine to do an exploratory search on previously published viral integration data and were able to deduce the main conclusions of the original publication. BioCarian is accessible via http://www.biocarian.com . We have developed a search engine to explore RDF databases that can be used by both novice and advanced users.

  18. A Study of the Efficiency of Spatial Indexing Methods Applied to Large Astronomical Databases

    NASA Astrophysics Data System (ADS)

    Donaldson, Tom; Berriman, G. Bruce; Good, John; Shiao, Bernie

    2018-01-01

    Spatial indexing of astronomical databases generally uses quadrature methods, which partition the sky into cells used to create an index (usually a B-tree) written as a database column. We report the results of a study to compare the performance of two common indexing methods, HTM and HEALPix, on Solaris and Windows database servers installed with a PostgreSQL database, and a Windows Server installed with MS SQL Server. The indexing was applied to the 2MASS All-Sky Catalog and to the Hubble Source Catalog. On each server, the study compared indexing performance by submitting 1 million queries at each index level with random sky positions and random cone search radii, computed on a logarithmic scale between 1 arcsec and 1 degree, and measuring the time to complete the query and write the output. These simulated queries, intended to model realistic use patterns, were run in a uniform way on many combinations of indexing method and indexing level. The query times in all simulations are strongly I/O-bound and are linear with the number of records returned for large numbers of sources. There are, however, considerable differences between simulations, which reveal that hardware I/O throughput is a more important factor in managing the performance of a DBMS than the choice of indexing scheme. The choice of index itself is relatively unimportant: for comparable index levels, the performance is consistent within the scatter of the timings. At small index levels (large cells; e.g. level 4; cell size 3.7 deg), there is large scatter in the timings because of wide variations in the number of sources found in the cells. At larger index levels, performance improves and scatter decreases, but the improvement at level 8 (14 arcmin) and higher is masked to some extent by the timing scatter caused by the range of query sizes. At very high levels (20; 0.0004 arcsec), the granularity of the cells becomes so high that a large number of extraneous empty cells begin to degrade performance. Thus, for the use patterns studied here the database performance is not critically dependent on the exact choices of index or level.
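    To make the role of the index level concrete, the sketch below uses the healpy package to list the HEALPix cells touched by one cone search; a database query would then be restricted to rows whose indexed cell column falls in that list. The index level, position and radius are arbitrary, and the timing experiment itself is not reproduced.

      import numpy as np
      import healpy as hp  # pip install healpy

      # A cone search only needs to scan rows whose precomputed HEALPix cell
      # (the indexed column) intersects the search cone.
      level = 8
      nside = 2 ** level  # HEALPix "level 8" corresponds to nside = 256

      ra_deg, dec_deg, radius_deg = 150.0, 2.0, 0.25
      vec = hp.ang2vec(ra_deg, dec_deg, lonlat=True)
      cells = hp.query_disc(nside, vec, np.radians(radius_deg), inclusive=True)

      print(f"Cone of {radius_deg} deg touches {len(cells)} of {hp.nside2npix(nside)} cells")
      # A DBMS query would then be restricted to: WHERE healpix_cell IN (<cells>)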

  19. Systematic review and technological overview of the antimicrobial activity of Tagetes minuta and future perspectives.

    PubMed

    Santos, Daniela Coelho Dos; Schneider, Lara Rodrigues; da Silva Barboza, Andressa; Diniz Campos, Ângela; Lund, Rafael Guerra

    2017-08-17

    The antimicrobial potential of Tagetes minuta was correlated with its traditional use as an antibacterial, insecticidal, biocide, disinfectant, anthelminthic, antifungal, and antiseptic agent as well as its use in urinary tract infections. This study aimed to systematically review articles and patents regarding the antimicrobial activity of T. minuta and give rise to perspectives on this plant as a potential antimicrobial agent. A literature search of studies published between 1997 and 2015 was conducted over five databases: MedLine (PubMed), Web of Science, Scopus, Google Scholar, Portal de Periódicos Capes and SciFinder; grey literature was explored using the System for Information on Dissertations database, and theses were searched using the ProQuest Dissertations and Theses Full text database and the Periódicos Capes Theses database. Additionally, the following databases for patents were analysed: United States Patent and Trademark Office (USPTO), Google Patents, National Institute of Industrial Property (INPI) and Espacenet patent search (EPO). The data were tabulated and analysed using Microsoft Office Excel 2010. After title screening, 51 studies remained and this number decreased to 26 after careful examination of the abstracts. The full texts of these 26 studies were assessed to check if they were eligible. Among them, 3 were excluded for not having full text access, and 11 were excluded because they did not fit the inclusion criteria, which left 10 articles for this systematic review. The same process was conducted for the patent search, resulting in 4 patents being included in this study. Recent advances highlighted by this review may shed light on future directions of studies concerning T. minuta as a novel antimicrobial agent, which should be repeatedly proven in future animal and clinical studies. Although more evidence on its specificity and clinical efficacy is necessary to support its clinical use, T. minuta is expected to be a highly effective, safe and affordable treatment for infectious diseases.

  20. FOUNTAIN: A JAVA open-source package to assist large sequencing projects

    PubMed Central

    Buerstedde, Jean-Marie; Prill, Florian

    2001-01-01

    Background Better automation, lower cost per reaction and a heightened interest in comparative genomics have led to a dramatic increase in DNA sequencing activities. Although the large sequencing projects of specialized centers are supported by in-house bioinformatics groups, many smaller laboratories face difficulties managing the appropriate processing and storage of their sequencing output. The challenges include documentation of clones, templates and sequencing reactions, and the storage, annotation and analysis of the large number of generated sequences. Results We describe here a new program, named FOUNTAIN, for the management of large sequencing projects. FOUNTAIN uses the Java programming language and stores its data in a relational database. Starting with a collection of sequencing objects (clones), the program generates and stores information related to the different stages of the sequencing project using a web browser interface for user input. The generated sequences are subsequently imported and annotated based on BLAST searches against the public databases. In addition, simple algorithms to cluster sequences and determine putative polymorphic positions are implemented. Conclusions A simple, but flexible and scalable software package is presented to facilitate data generation and storage for large sequencing projects. FOUNTAIN is open source and largely platform and database independent, and we hope that it will be improved and extended in a community effort. PMID:11591214
