Sample records for red text annotations

  1. INDIGO – INtegrated Data Warehouse of MIcrobial GenOmes with Examples from the Red Sea Extremophiles

    PubMed Central

    Alam, Intikhab; Antunes, André; Kamau, Allan Anthony; Ba alawi, Wail; Kalkatawi, Manal; Stingl, Ulrich; Bajic, Vladimir B.

    2013-01-01

    Background: The next generation sequencing technologies substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation, but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes.

    Results: We developed a data warehouse system (INDIGO) that enables the integration of annotations for exploration and analysis of newly sequenced microbial genomes. INDIGO offers an opportunity to construct complex queries and combine annotations from multiple sources, from genomic sequence to protein domain, gene ontology and pathway levels. This data warehouse is aimed at being populated with information from genomes of pure cultures and uncultured single cells of Red Sea bacteria and Archaea. Currently, INDIGO contains information from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus tiamatea - extremophiles isolated from deep-sea anoxic brine lakes of the Red Sea. We provide examples of utilizing the system to gain new insights into specific aspects of the unique lifestyle and adaptations of these organisms to extreme environments.

    Conclusions: We developed a data warehouse system, INDIGO, which enables comprehensive integration of information from various resources to be used for annotation, exploration and analysis of microbial genomes. It will be regularly updated and extended with new genomes. It is aimed to serve as a resource dedicated to the Red Sea microbes. In addition, through INDIGO, we provide our Automatic Annotation of Microbial Genomes (AAMG) pipeline. The INDIGO web server is freely available at http://www.cbrc.kaust.edu.sa/indigo.

PMID:24324765

  2. INDIGO - INtegrated data warehouse of microbial genomes with examples from the red sea extremophiles.

    PubMed

    Alam, Intikhab; Antunes, André; Kamau, Allan Anthony; Ba Alawi, Wail; Kalkatawi, Manal; Stingl, Ulrich; Bajic, Vladimir B

    2013-01-01

    The next generation sequencing technologies substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes. We developed a data warehouse system (INDIGO) that enables the integration of annotations for exploration and analysis of newly sequenced microbial genomes. INDIGO offers an opportunity to construct complex queries and combine annotations from multiple sources starting from genomic sequence to protein domain, gene ontology and pathway levels. This data warehouse is aimed at being populated with information from genomes of pure cultures and uncultured single cells of Red Sea bacteria and Archaea. Currently, INDIGO contains information from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus tiamatea - extremophiles isolated from deep-sea anoxic brine lakes of the Red Sea. We provide examples of utilizing the system to gain new insights into specific aspects of the unique lifestyle and adaptations of these organisms to extreme environments. We developed a data warehouse system, INDIGO, which enables comprehensive integration of information from various resources to be used for annotation, exploration and analysis of microbial genomes. It will be regularly updated and extended with new genomes. It is aimed to serve as a resource dedicated to the Red Sea microbes. In addition, through INDIGO, we provide our Automatic Annotation of Microbial Genomes (AAMG) pipeline. The INDIGO web server is freely available at http://www.cbrc.kaust.edu.sa/indigo.

  3. New directions in biomedical text annotation: definitions, guidelines and corpus construction

    PubMed Central

    Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit

    2006-01-01

    Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.

    Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.

    Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. PMID:16867190
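
    The 70–80% inter-annotator agreement cited in this record is typically computed as average pairwise agreement across all annotator pairs. A minimal sketch, using made-up polarity labels rather than the paper's data:

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Mean fraction of items on which each pair of annotators agrees.

    `annotations` is a list of equal-length label sequences, one per
    annotator (the labels below are illustrative, not the paper's data).
    """
    scores = []
    for a, b in combinations(annotations, 2):
        scores.append(sum(x == y for x, y in zip(a, b)) / len(a))
    return sum(scores) / len(scores)

# Three annotators labelling four sentences for, say, polarity:
labels = [
    ["pos", "neg", "pos", "pos"],
    ["pos", "neg", "neg", "pos"],
    ["pos", "pos", "neg", "pos"],
]
print(round(pairwise_agreement(labels), 2))  # -> 0.67
```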

  4. Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.

    PubMed

    He, Bin; Dong, Bin; Guan, Yi; Yang, Jinfeng; Jiang, Zhipeng; Yu, Qiubin; Cheng, Jianyi; Qu, Chunyan

    2017-05-01

    To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, with corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus, which supply baselines for research on Chinese texts in the clinical domain. An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, by using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on our annotated corpus. The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2612 full parsing trees, while the semantic corpus includes 992 documents in which 39,511 entities, their assertions, and 7693 relations were annotated. IAA evaluation shows that this comprehensive corpus is of good quality, and the system modules are effective. The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, this corpus has a number of limitations. Some additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be utilized to promote annotation efficiency. In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain. Copyright © 2017. Published by Elsevier Inc.
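
    Inter-annotator agreement for a layered corpus like this one is often reported as Cohen's kappa, which corrects raw agreement for chance. A self-contained sketch with toy entity tags, not the paper's data:

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum((a.count(c) / n) * (b.count(c) / n)
                   for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Two annotators tagging four tokens as entity (ENT) or outside (O):
print(cohens_kappa(["ENT", "O", "ENT", "O"], ["ENT", "O", "O", "O"]))  # -> 0.5
```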

  5. Accurate identification of RNA editing sites from primitive sequence with deep neural networks.

    PubMed

    Ouyang, Zhangyi; Liu, Feng; Zhao, Chenghui; Ren, Chao; An, Gaole; Mei, Chuan; Bo, Xiaochen; Shu, Wenjie

    2018-04-16

    RNA editing is a post-transcriptional RNA sequence alteration. Current methods have identified editing sites and facilitated research but require sufficient genomic annotations and prior-knowledge-based filtering steps, resulting in a cumbersome, time-consuming identification process. Moreover, these methods have limited generalizability and applicability in species with insufficient genomic annotations or in conditions of limited prior knowledge. We developed DeepRed, a deep learning-based method that identifies RNA editing from primitive RNA sequences without prior-knowledge-based filtering steps or genomic annotations. DeepRed achieved 98.1% and 97.9% area under the curve (AUC) in training and test sets, respectively. We further validated DeepRed using experimentally verified U87 cell RNA-seq data, achieving 97.9% positive predictive value (PPV). We demonstrated that DeepRed offers better prediction accuracy and computational efficiency than current methods with large-scale, mass RNA-seq data. We used DeepRed to assess the impact of multiple factors on editing identification with RNA-seq data from the Association of Biomolecular Resource Facilities and Sequencing Quality Control projects. We explored developmental RNA editing pattern changes during human early embryogenesis and evolutionary patterns in Drosophila species and the primate lineage using DeepRed. Our work illustrates DeepRed's state-of-the-art performance; it may decipher the hidden principles behind RNA editing, making editing detection convenient and effective.
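
    A model like DeepRed consumes a sequence window around each candidate editing site, which first requires numeric encoding of the bases. A minimal one-hot encoding sketch; the window size and downstream network are assumptions, not taken from the paper:

```python
def one_hot_rna(seq, alphabet="ACGU"):
    """One-hot encode an RNA window around a candidate editing site:
    one row per base, one column per letter of the alphabet."""
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        row[index[base]] = 1
        matrix.append(row)
    return matrix

# A (hypothetical) 4-base window; each row is fed to the network:
print(one_hot_rna("ACGU"))
```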

  6. Corpus annotation for mining biomedical events from literature

    PubMed Central

    Kim, Jin-Dong; Ohta, Tomoko; Tsujii, Jun'ichi

    2008-01-01

    Background Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflect biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. Conclusion The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain. PMID:18182099

  7. Mapping annotations with textual evidence using an scLDA model.

    PubMed

    Jin, Bo; Chen, Vicky; Chen, Lujia; Lu, Xinghua

    2011-01-01

    Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provide evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.
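
    As a rough intuition for task 3 (locating evidence sentences), each sentence can be scored under the word distribution of the topic tied to an annotation, keeping the best-scoring sentence. This is only a crude stand-in for scLDA inference, with hypothetical word probabilities:

```python
import math

def sentence_topic_score(tokens, topic_word_prob, floor=1e-6):
    """Average log-probability of a sentence's tokens under one topic's
    word distribution; unseen words get a small floor probability."""
    return sum(math.log(topic_word_prob.get(t, floor)) for t in tokens) / len(tokens)

def best_evidence_sentence(sentences, topic_word_prob):
    """Return the tokenized sentence most consistent with the topic
    assumed to underlie a given GO annotation."""
    return max(sentences, key=lambda s: sentence_topic_score(s, topic_word_prob))

# Hypothetical topic for a GO term such as "kinase activity":
topic = {"kinase": 0.3, "phosphorylates": 0.2, "substrate": 0.1}
sentences = [
    ["the", "kinase", "phosphorylates", "its", "substrate"],
    ["cells", "were", "cultured", "overnight"],
]
print(best_evidence_sentence(sentences, topic))
```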

  8. Semantator: semantic annotator for converting biomedical text to linked data.

    PubMed

    Tao, Cui; Song, Dezhao; Sharma, Deepak; Chute, Christopher G

    2013-10-01

    More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be annotated either manually or semi-automatically using plug-in information extraction tools. The annotated results are stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and has been used by the biomedical ontology community, which provided positive feedback. Our evaluation results indicated that (1) Semantator can perform the annotation functionalities as designed; (2) Semantator can be adopted in real applications in clinical and translational research; and (3) the annotated results using Semantator can be easily used in Semantic-web-based reasoning tools for further inference. Copyright © 2013 Elsevier Inc. All rights reserved.
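
    Converting an annotation to linked data amounts to emitting RDF triples for each annotated span. A minimal sketch that serializes one annotation as N-Triples; the predicate URIs here are hypothetical placeholders, not Semantator's actual vocabulary:

```python
def annotation_to_ntriples(doc_uri, ann_id, ann_type_uri, text):
    """Serialize one text annotation as a list of N-Triples lines.
    Predicate URIs are illustrative placeholders only."""
    subject = f"<{doc_uri}#{ann_id}>"
    return [
        f"{subject} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{ann_type_uri}> .",
        f'{subject} <http://example.org/annotatedText> "{text}" .',
        f"{subject} <http://example.org/sourceDocument> <{doc_uri}> .",
    ]

# One annotated mention of a disease concept in a document:
for line in annotation_to_ntriples(
        "http://example.org/doc1", "ann1",
        "http://example.org/Disease", "type 2 diabetes"):
    print(line)
```

The resulting triples can be loaded into any RDF store and queried with SPARQL, which is the querying route the record describes.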

  9. Enhanced functionalities for annotating and indexing clinical text with the NCBO Annotator.

    PubMed

    Tchechmedjiev, Andon; Abdaoui, Amine; Emonet, Vincent; Melzi, Soumia; Jonnagaddala, Jitendra; Jonquet, Clement

    2018-06-01

    Second use of clinical data commonly involves annotating biomedical text with terminologies and ontologies. The National Center for Biomedical Ontology Annotator is a frequently used annotation service, originally designed for biomedical data but not very suitable for clinical text annotation. In order to add new functionalities to the NCBO Annotator without hosting or modifying the original Web service, we have designed a proxy architecture that enables seamless extensions by pre-processing of the input text and parameters, and post-processing of the annotations. We have then implemented enhanced functionalities for annotating and indexing free text, such as scoring, detection of context (negation, experiencer, temporality), new output formats and coarse-grained concept recognition (with UMLS Semantic Groups). In this paper, we present the NCBO Annotator+, a Web service which incorporates these new functionalities, as well as a small set of evaluation results for concept recognition and clinical context detection on two standard evaluation tasks (Clef eHealth 2017, SemEval 2014). The Annotator+ has been successfully integrated into the SIFR BioPortal platform (an implementation of NCBO BioPortal for French biomedical terminologies and ontologies) to annotate English text. A Web user interface is available for testing and ontology selection (http://bioportal.lirmm.fr/ncbo_annotatorplus); however, the Annotator+ is meant to be used through the Web service application programming interface (http://services.bioportal.lirmm.fr/ncbo_annotatorplus). The code is openly available, and we also provide a Docker packaging to enable easy local deployment to process sensitive (e.g. clinical) data in-house (https://github.com/sifrproject). andon.tchechmedjiev@lirmm.fr. Supplementary data are available at Bioinformatics online.

  10. A Novel Approach to Semantic and Coreference Annotation at LLNL

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Firpo, M

    A case is made for the importance of high quality semantic and coreference annotation. The challenges of providing such annotation are described. Asperger's Syndrome is introduced, and the connections are drawn between the needs of text annotation and the abilities of persons with Asperger's Syndrome to meet those needs. Finally, a pilot program is recommended wherein semantic annotation is performed by people with Asperger's Syndrome. The primary points embodied in this paper are as follows: (1) Document annotation is essential to the Natural Language Processing (NLP) projects at Lawrence Livermore National Laboratory (LLNL); (2) LLNL does not currently have a system in place to meet its need for text annotation; (3) Text annotation is challenging for a variety of reasons, many related to its very rote nature; (4) Persons with Asperger's Syndrome are particularly skilled at rote verbal tasks, and behavioral experts agree that they would excel at text annotation; and (5) A pilot study is recommended in which two to three people with Asperger's Syndrome annotate documents and then the quality and throughput of their work is evaluated relative to that of their neuro-typical peers.

  11. Pyrosequencing of the northern red oak (Quercus rubra L.) chloroplast genome reveals high quality polymorphisms for population management

    Treesearch

    Lisa W. Alexander; Keith E. Woeste

    2014-01-01

    Given the low intraspecific chloroplast diversity detected in northern red oak (Quercus rubra L.), more powerful genetic tools are necessary to accurately characterize Q. rubra chloroplast diversity and structure. We report the sequencing, assembly, and annotation of the chloroplast genome of northern red oak via pyrosequencing and...

  12. DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures

    PubMed Central

    Yin, Xu-Cheng; Yang, Chun; Pei, Wei-Yi; Man, Haixia; Zhang, Jun; Learned-Miller, Erik; Yu, Hong

    2015-01-01

    Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/. PMID:25951377

  13. Semantic annotation in biomedicine: the current landscape.

    PubMed

    Jovanović, Jelena; Bagheri, Ebrahim

    2017-09-22

    The abundance and unstructured nature of biomedical texts, be it clinical or research content, impose significant challenges for the effective and efficient use of information and knowledge stored in such texts. Annotation of biomedical documents with machine intelligible semantics facilitates advanced, semantics-based text management, curation, indexing, and search. This paper focuses on annotation of biomedical entity mentions with concepts from relevant biomedical knowledge bases such as UMLS. As a result, the meaning of those mentions is unambiguously and explicitly defined, and thus made readily available for automated processing. This process is widely known as semantic annotation, and the tools that perform it are known as semantic annotators. Over the last dozen years, the biomedical research community has invested significant efforts in the development of biomedical semantic annotation technology. Aiming to establish grounds for further developments in this area, we review a selected set of state-of-the-art biomedical semantic annotators, focusing particularly on general purpose annotators, that is, semantic annotation tools that can be customized to work with texts from any area of biomedicine. We also examine potential directions for further improvements of today's annotators which could make them even more capable of meeting the needs of real-world applications. To motivate and encourage further developments in this area, along the suggested and/or related directions, we review existing and potential practical applications and benefits of semantic annotators.

  14. Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA)

    PubMed Central

    Grötzinger, Stefan W.; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B.; Stingl, Ulrich; Eppinger, Jörg

    2014-01-01

    Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and the poor homology of novel extremophiles' genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxa of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable if at least two relevant descriptors (GO-terms and/or consensus patterns) are present.
Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website. PMID:24778629
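
    The reliability rule at the heart of the PPM strategy, keeping a hit only when at least two relevant descriptors (GO terms and/or consensus patterns) support it, can be sketched as follows; the gene names and exact descriptor IDs are illustrative, not drawn from the INDIGO data:

```python
def reliable_hits(gene_descriptors, min_descriptors=2):
    """Keep only genes whose predicted function is supported by at least
    `min_descriptors` relevant descriptors (GO terms and/or PROSITE
    consensus patterns), mirroring the PPM reliability filter."""
    return {gene: desc for gene, desc in gene_descriptors.items()
            if len(desc) >= min_descriptors}

# Illustrative scan results per gene:
hits = {
    "gene_a": {"GO:0016787", "PS00120"},  # profile hit + pattern hit
    "gene_b": {"GO:0016787"},             # single descriptor, discarded
}
print(sorted(reliable_hits(hits)))  # -> ['gene_a']
```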

  15. Deep landscape update of dispersed and tandem repeats in the genome model of the red jungle fowl, Gallus gallus, using a series of de novo investigating tools.

    PubMed

    Guizard, Sébastien; Piégu, Benoît; Arensburger, Peter; Guillou, Florian; Bigot, Yves

    2016-08-19

    The program RepeatMasker and the database Repbase-ISB are part of the most widely used strategy for annotating repeats in animal genomes. They have been used to show that avian genomes have a lower repeat content (8-12 %) than the sequenced genomes of many vertebrate species (30-55 %). However, the efficiency of such library-based strategies depends on the quality and completeness of the sequences in the database that is used. An alternative to these library-based methods are methods that identify repeats de novo. These alternative methods have existed for at least a decade and may be more powerful than the library-based methods. We have used an annotation strategy involving several complementary de novo tools to determine the repeat content of the model genome galGal4 (1.04 Gbp), including identifying simple sequence repeats (SSRs), tandem repeats and transposable elements (TEs). We annotated over one Gbp of the galGal4 genome and showed that it is composed of approximately 19 % SSRs and TE repeats. Furthermore, we estimate that the actual genome of the red jungle fowl contains about 31-35 % repeats. We find that library-based methods tend to overestimate TE diversity. These results have a major impact on the current understanding of repeat distributions throughout chromosomes in the red jungle fowl. Our results are a proof of concept of the reliability of using de novo tools to annotate repeats in large animal genomes. They have also revealed issues that will need to be resolved in order to develop gold-standard methodologies for annotating repeats in eukaryote genomes.

  16. A sentence sliding window approach to extract protein annotations from biomedical articles

    PubMed Central

    Krallinger, Martin; Padron, Maria; Valencia, Alfonso

    2005-01-01

    Background: Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed during the past years. Nevertheless, there is still a great need for comparative assessment of the performance of the proposed methods and for the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of this contest was to assess the performance of text mining systems applied to biomedical texts, including tools which recognize named entities such as genes and proteins, and tools which automatically extract protein annotations.

    Results: The "sentence sliding window" approach proposed here was found to efficiently extract text fragments from full text articles containing annotations on proteins, providing the highest number of correctly predicted annotations. Moreover, the number of correct extractions of individual entities (i.e. proteins and GO terms) involved in the relationships used for the annotations was significantly higher than the correct extractions of the complete annotations (protein-function relations).

    Conclusion: We explored the use of averaging sentence sliding windows for information extraction, especially in a context where conventional training data is unavailable. The combination of our approach with more refined statistical estimators and machine learning techniques might be a way to improve annotation extraction for future biomedical text mining applications. PMID:15960831
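
    The sliding-window idea, scoring overlapping runs of consecutive sentences as candidate annotation fragments, can be sketched as below. The scoring function and keywords are simplified placeholders for the statistical estimators the paper combines:

```python
def sliding_windows(sentences, size=3):
    """Yield overlapping windows of consecutive sentences; each window is
    a candidate fragment that may carry a protein annotation."""
    for i in range(len(sentences) - size + 1):
        yield sentences[i:i + size]

def score_window(window, protein, keywords):
    """Toy score: the window must mention the protein; add one point per
    annotation keyword found (placeholder scoring, not the paper's)."""
    text = " ".join(window).lower()
    if protein.lower() not in text:
        return 0
    return 1 + sum(kw in text for kw in keywords)

sents = ["P53 is a tumour suppressor.", "It binds DNA.",
         "It regulates apoptosis.", "Unrelated sentence."]
best = max(sliding_windows(sents, 2),
           key=lambda w: score_window(w, "p53", ["binds", "apoptosis"]))
print(best)
```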

  17. A manually annotated Actinidia chinensis var. chinensis (kiwifruit) genome highlights the challenges associated with draft genomes and gene prediction in plants.

    PubMed

    Pilkington, Sarah M; Crowhurst, Ross; Hilario, Elena; Nardozza, Simona; Fraser, Lena; Peng, Yongyan; Gunaseelan, Kularajathevan; Simpson, Robert; Tahir, Jibran; Deroles, Simon C; Templeton, Kerry; Luo, Zhiwei; Davy, Marcus; Cheng, Canhong; McNeilage, Mark; Scaglione, Davide; Liu, Yifei; Zhang, Qiong; Datson, Paul; De Silva, Nihal; Gardiner, Susan E; Bassett, Heather; Chagné, David; McCallum, John; Dzierzon, Helge; Deng, Cecilia; Wang, Yen-Yi; Barron, Lorna; Manako, Kelvina; Bowen, Judith; Foster, Toshi M; Erridge, Zoe A; Tiffin, Heather; Waite, Chethi N; Davies, Kevin M; Grierson, Ella P; Laing, William A; Kirk, Rebecca; Chen, Xiuyin; Wood, Marion; Montefiori, Mirco; Brummell, David A; Schwinn, Kathy E; Catanach, Andrew; Fullerton, Christina; Li, Dawei; Meiyalaghan, Sathiyamoorthy; Nieuwenhuizen, Niels; Read, Nicola; Prakash, Roneel; Hunter, Don; Zhang, Huaibi; McKenzie, Marian; Knäbel, Mareike; Harris, Alastair; Allan, Andrew C; Gleave, Andrew; Chen, Angela; Janssen, Bart J; Plunkett, Blue; Ampomah-Dwamena, Charles; Voogd, Charlotte; Leif, Davin; Lafferty, Declan; Souleyre, Edwige J F; Varkonyi-Gasic, Erika; Gambi, Francesco; Hanley, Jenny; Yao, Jia-Long; Cheung, Joey; David, Karine M; Warren, Ben; Marsh, Ken; Snowden, Kimberley C; Lin-Wang, Kui; Brian, Lara; Martinez-Sanchez, Marcela; Wang, Mindy; Ileperuma, Nadeesha; Macnee, Nikolai; Campin, Robert; McAtee, Peter; Drummond, Revel S M; Espley, Richard V; Ireland, Hilary S; Wu, Rongmei; Atkinson, Ross G; Karunairetnam, Sakuntala; Bulley, Sean; Chunkath, Shayhan; Hanley, Zac; Storey, Roy; Thrimawithana, Amali H; Thomson, Susan; David, Charles; Testolin, Raffaele; Huang, Hongwen; Hellens, Roger P; Schaffer, Robert J

    2018-04-16

    Most published genome sequences are drafts, and most are dominated by computational gene prediction. Draft genomes typically incorporate considerable sequence data that are not assigned to chromosomes, as well as predicted genes that lack quality confidence measures. The current Actinidia chinensis (kiwifruit) 'Hongyang' draft genome has 164 Mb of sequence unassigned to pseudo-chromosomes, and omissions have been identified in its gene models. A second A. chinensis genome (genotype Red5) was therefore fully sequenced, yielding a 554.0 Mb assembly with all but 6 Mb assigned to pseudo-chromosomes. Pseudo-chromosomal comparisons showed that a considerable number of translocation events have occurred following a whole genome duplication (WGD) event, some consistent with centromeric Robertsonian-like translocations. RNA sequencing data from 12 tissues and ab initio analysis informed a genome-wide manual annotation using the WebApollo tool. In total, 33,044 gene loci represented by 33,123 isoforms were identified, named, and tagged for quality of evidential support. Of these, 3114 (9.4%) were identical to a 'Hongyang' protein in the Kiwifruit Information Resource (KIR v2). Some proportion of the differences will be varietal polymorphisms; however, as most computationally predicted Red5 models required manual re-annotation, this proportion is expected to be small. The quality of the new gene models was tested by fully sequencing 550 cloned 'Hort16A' cDNAs and comparing them with the predicted protein models for Red5 and for both the original 'Hongyang' assembly and the revised annotation from KIR v2. Only 48.9% and 63.5% of the cDNAs matched the original and revised 'Hongyang' annotations, respectively, at 90% identity or better, compared with 90.9% for the Red5 models. Our study highlights the need for a cautious approach to draft genomes and computationally predicted genes. Our use of the manual annotation tool WebApollo facilitated manual checking and correction of gene models, enabling improvement over computational prediction alone. This utility was especially relevant for certain gene families, such as the EXPANSIN-like genes. Finally, this high-quality gene set will supply the kiwifruit and wider plant communities with a new tool for genomics and other comparative analyses.
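    The acceptance test described above (a cDNA-derived protein "matches" a predicted model at 90% identity or better) can be sketched as follows. This is a toy illustration only: it uses Python's stdlib difflib as a crude stand-in for a real sequence aligner such as BLAST, and the sequences are invented.

    ```python
    from difflib import SequenceMatcher

    def percent_identity(protein_a, protein_b):
        """Rough identity between two protein sequences (0-100).
        difflib's matching-block ratio is only a stand-in here;
        a real pipeline would use a proper aligner."""
        return 100.0 * SequenceMatcher(None, protein_a, protein_b).ratio()

    def matches_model(cdna_protein, model_protein, threshold=90.0):
        """Accept the cDNA-derived protein if it matches the predicted
        model at `threshold` percent identity or better."""
        return percent_identity(cdna_protein, model_protein) >= threshold

    # Invented sequences, for illustration only.
    print(matches_model("MKTAYIAKQR", "MKTAYIAKQR"))  # True  (100% identical)
    print(matches_model("MKTAYIAKQR", "MKTAYIWWWW"))  # False (60% identity)
    ```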

  18. Annotated bibliography of the red tree vole (Arborimus longicaudus), Sonoma tree vole (A. pomo), and white-footed vole (A. albipes).

    Treesearch

    James K. Swingle; Eric D. Forsman

    2016-01-01

    This annotated bibliography contains all citations that we could find on tree voles (Arborimus longicaudus, A. pomo) and white-footed voles (A. albipes), including many unpublished sources in museum archives, court proceedings, and agency documents. Some documents not readily available in published form or museum archives are...

  19. Aggregating and Predicting Sequence Labels from Crowd Annotations

    PubMed Central

    Nguyen, An T.; Wallace, Byron C.; Li, Junyi Jessy; Nenkova, Ani; Lease, Matthew

    2017-01-01

    Despite sequences being core to NLP, scant work has considered how to handle noisy sequence labels from multiple annotators for the same text. Given such annotations, we consider two complementary tasks: (1) aggregating sequential crowd labels to infer a best single set of consensus annotations; and (2) using crowd annotations as training data for a model that can predict sequences in unannotated text. For aggregation, we propose a novel Hidden Markov Model variant. To predict sequences in unannotated text, we propose a neural approach using Long Short-Term Memory (LSTM) networks. We evaluate a suite of methods across two different applications and text genres: Named-Entity Recognition in news articles and Information Extraction from biomedical abstracts. Results show improvement over strong baselines. Our source code and data are available online. PMID:29093611
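    For context on the aggregation task, the simplest baseline (which models such as the paper's HMM variant aim to improve on) is token-level majority voting over the annotators' label sequences. A minimal sketch, with invented BIO tags:

    ```python
    from collections import Counter

    def aggregate_majority(annotations):
        """Token-wise majority vote over per-annotator label sequences.

        `annotations` is a list of equal-length label sequences, one per
        annotator. Ties break toward the label first seen among the tied
        candidates (Counter insertion order)."""
        n_tokens = len(annotations[0])
        consensus = []
        for i in range(n_tokens):
            votes = Counter(seq[i] for seq in annotations)
            consensus.append(votes.most_common(1)[0][0])
        return consensus

    # Three noisy annotators labelling the same five tokens.
    crowd = [
        ["B-PER", "I-PER", "O", "O",     "B-LOC"],
        ["B-PER", "O",     "O", "O",     "B-LOC"],
        ["B-PER", "I-PER", "O", "B-LOC", "B-LOC"],
    ]
    print(aggregate_majority(crowd))
    # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
    ```

    Unlike an HMM-based aggregator, this baseline ignores label-transition structure and per-annotator reliability, which is precisely what sequence-aware models exploit.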

  20. BioCreative V CDR task corpus: a resource for chemical disease relation extraction.

    PubMed

    Li, Jiao; Sun, Yueping; Johnson, Robin J; Sciaky, Daniela; Wei, Chih-Hsuan; Leaman, Robert; Davis, Allan Peter; Mattingly, Carolyn J; Wiegers, Thomas C; Lu, Zhiyong

    2016-01-01

    Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus, called BC5CDR, during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: the average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for diseases and chemicals, respectively, in the test set, according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the United States.
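    The Jaccard similarity coefficient used for the IAA scores above is the size of the intersection of the two annotators' annotation sets divided by the size of their union. A minimal sketch; the document/span/MeSH identifiers below are hypothetical, for illustration only:

    ```python
    def jaccard_agreement(annotations_a, annotations_b):
        """Inter-annotator agreement as the Jaccard similarity
        coefficient |A intersect B| / |A union B| over two annotators'
        sets of annotations."""
        a, b = set(annotations_a), set(annotations_b)
        if not a and not b:
            return 1.0  # both annotators marked nothing: trivial agreement
        return len(a & b) / len(a | b)

    # Each annotation as (document id, mention span, MeSH concept id).
    ann_a = {("pmid1", (0, 8), "D003924"), ("pmid1", (20, 27), "D009369")}
    ann_b = {("pmid1", (0, 8), "D003924"), ("pmid1", (40, 49), "D001943")}
    print(jaccard_agreement(ann_a, ann_b))  # 1 shared / 3 total = 0.333...
    ```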

  1. Challenges and Insights in Using HIPAA Privacy Rule for Clinical Text Annotation.

    PubMed

    Kayaalp, Mehmet; Browne, Allen C; Sagan, Pamela; McGee, Tyne; McDonald, Clement J

    2015-01-01

    The Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) requires that clinical documents be stripped of personally identifying information before they can be released to researchers and others. We have been manually annotating clinical text since 2008 in order to test and evaluate an algorithmic clinical text de-identification tool, NLM Scrubber, which we have been developing in parallel. Although HIPAA provides some guidance about what must be de-identified, translating those guidelines into practice is not straightforward, especially when one deals with free text. As a result, we have changed our manual annotation labels and methods six times. This paper explains the annotation choices we made, which evolved over seven years of practice in this field. The aim of this paper is to start a community discussion towards developing standards for clinical text annotation, with the end goal of studying and comparing clinical text de-identification systems more accurately.

  2. Effects of Multimedia Annotations on Thai EFL Readers' Words and Text Recall

    ERIC Educational Resources Information Center

    Gasigijtamrong, Jenjit

    2013-01-01

    This study aimed to investigate the effects of using multimedia annotations on EFL readers' word recall and text recall and to explore which type of multimedia annotations--L1 meaning, L2 meaning, sound, and image--would have a better effect on their recall of new words and text comprehension. The participants were 78 students who enrolled in an…

  3. A survey on annotation tools for the biomedical literature.

    PubMed

    Neves, Mariana; Leser, Ulf

    2014-03-01

    New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms, and for illustrating, by example, the information being sought. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered one of the main bottlenecks for developing novel text mining methods. This situation has led to the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts, and should generate an easy-to-parse output format. Today, a range of tools implementing some of these functionalities are available. Here we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered a true comprehensive solution.

  4. Draft genome of the red harvester ant Pogonomyrmex barbatus.

    PubMed

    Smith, Chris R; Smith, Christopher D; Robertson, Hugh M; Helmkampf, Martin; Zimin, Aleksey; Yandell, Mark; Holt, Carson; Hu, Hao; Abouheif, Ehab; Benton, Richard; Cash, Elizabeth; Croset, Vincent; Currie, Cameron R; Elhaik, Eran; Elsik, Christine G; Favé, Marie-Julie; Fernandes, Vilaiwan; Gibson, Joshua D; Graur, Dan; Gronenberg, Wulfila; Grubbs, Kirk J; Hagen, Darren E; Viniegra, Ana Sofia Ibarraran; Johnson, Brian R; Johnson, Reed M; Khila, Abderrahman; Kim, Jay W; Mathis, Kaitlyn A; Munoz-Torres, Monica C; Murphy, Marguerite C; Mustard, Julie A; Nakamura, Rin; Niehuis, Oliver; Nigam, Surabhi; Overson, Rick P; Placek, Jennifer E; Rajakumar, Rajendhran; Reese, Justin T; Suen, Garret; Tao, Shu; Torres, Candice W; Tsutsui, Neil D; Viljakainen, Lumi; Wolschin, Florian; Gadau, Jürgen

    2011-04-05

    We report the draft genome sequence of the red harvester ant, Pogonomyrmex barbatus. The genome was sequenced using 454 pyrosequencing, and the current assembly and annotation were completed in less than 1 y. Analyses of conserved gene groups (more than 1,200 manually annotated genes to date) suggest a high-quality assembly and annotation comparable to recently sequenced insect genomes using Sanger sequencing. The red harvester ant is a model for studying reproductive division of labor, phenotypic plasticity, and sociogenomics. Although the genome of P. barbatus is similar to other sequenced hymenopterans (Apis mellifera and Nasonia vitripennis) in GC content and compositional organization, and possesses a complete CpG methylation toolkit, its predicted genomic CpG content differs markedly from the other hymenopterans. Gene networks involved in generating key differences between the queen and worker castes (e.g., wings and ovaries) show signatures of increased methylation and suggest that ants and bees may have independently co-opted the same gene regulatory mechanisms for reproductive division of labor. Gene family expansions (e.g., 344 functional odorant receptors) and pseudogene accumulation in chemoreception and P450 genes compared with A. mellifera and N. vitripennis are consistent with major life-history changes during the adaptive radiation of Pogonomyrmex spp., perhaps in parallel with the development of the North American deserts.

  5. Accurate and consistent automatic seismocardiogram annotation without concurrent ECG.

    PubMed

    Laurin, A; Khosrow-Khavar, F; Blaber, A P; Tavakolian, Kouhyar

    2016-09-01

    Seismocardiography (SCG) is the measurement of vibrations in the sternum caused by the beating of the heart. Precise cardiac mechanical timings that are easily obtained from SCG are critically dependent on accurate identification of fiducial points. So far, SCG annotation has relied on concurrent ECG measurements. An algorithm capable of annotating SCG without the use of any other concurrent measurement was designed. We subjected 18 participants to graded lower body negative pressure. We collected ECG and SCG, obtained R peaks from the former, and annotated the latter by hand, using these identified peaks. We also annotated the SCG automatically. We compared the isovolumic moment timings obtained by hand to those obtained using our algorithm. Mean ± confidence interval of the percentage of accurately annotated cardiac cycles were [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], and [Formula: see text] for levels of negative pressure 0, -20, -30, -40, and -50 mmHg. LF/HF ratios, the relative power of low-frequency variations to high-frequency variations in heart beat intervals, obtained from isovolumic moments were also compared to those obtained from R peaks. The mean differences ± confidence interval were [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], and [Formula: see text] for increasing levels of negative pressure. The accuracy and consistency of the algorithm enables the use of SCG as a stand-alone heart monitoring tool in healthy individuals at rest, and could serve as a basis for an eventual application in pathological cases.

  6. Hypertext Annotation: Effects of Presentation Formats and Learner Proficiency on Reading Comprehension and Vocabulary Learning in Foreign Languages

    ERIC Educational Resources Information Center

    Chen, I-Jung; Yen, Jung-Chuan

    2013-01-01

    This study extends current knowledge by exploring the effect of different annotation formats, namely in-text annotation, glossary annotation, and pop-up annotation, on hypertext reading comprehension in a foreign language and vocabulary acquisition across student proficiencies. User attitudes toward the annotation presentation were also…

  7. A multi-ontology approach to annotate scientific documents based on a modularization technique.

    PubMed

    Gomes, Priscilla Corrêa E Castro; Moura, Ana Maria de Carvalho; Cavalcanti, Maria Cláudia

    2015-12-01

    Scientific text annotation has become an important task for biomedical scientists. Nowadays, there is an increasing need for the development of intelligent systems to support new scientific findings. Public databases available on the Web provide useful data, but much more useful information is accessible only in scientific texts. Text annotation may help, as it relies on the use of ontologies to maintain annotations based on a uniform vocabulary. However, it is difficult to use an ontology, especially one that covers a large domain. In addition, since scientific texts explore multiple domains, which are covered by distinct ontologies, such a task becomes even more difficult. Moreover, there are dozens of ontologies in the biomedical area, and they are usually large in terms of number of concepts. It is in this context that ontology modularization can be useful. This work presents an approach to annotate scientific documents using modules of different ontologies, built according to a module extraction technique. The main idea is to analyze a set of single-ontology annotations on a text to discover the user's interests. Based on these annotations, a set of modules is extracted from a set of distinct ontologies and made available to the user for complementary annotation. The reduced size and focus of the extracted modules tend to facilitate the annotation task. An experiment was conducted to evaluate this approach, with the participation of a bioinformatics specialist from the Laboratory of Peptides and Proteins of the IOC/Fiocruz, who was interested in discovering new drug targets for combating tropical diseases. Copyright © 2015 Elsevier Inc. All rights reserved.

  8. Knowledge Extraction and Semantic Annotation of Text from the Encyclopedia of Life

    PubMed Central

    Thessen, Anne E.; Parr, Cynthia Sims

    2014-01-01

    Numerous digitization and ontological initiatives have focused on translating biological knowledge from narrative text to machine-readable formats. In this paper, we describe two workflows for knowledge extraction and semantic annotation of text data objects featured in an online biodiversity aggregator, the Encyclopedia of Life. One workflow tags text with DBpedia URIs based on keywords. Another workflow finds taxon names in text using GNRD for the purpose of building a species association network. Both workflows work well: the annotation workflow has an F1 Score of 0.941 and the association algorithm has an F1 Score of 0.885. Existing text annotators such as Terminizer and DBpedia Spotlight performed well, but require some optimization to be useful in the ecology and evolution domain. Important future work includes scaling up and improving accuracy through the use of distributional semantics. PMID:24594988
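    The F1 scores reported above combine precision and recall as their harmonic mean; a minimal sketch of the calculation (the counts below are invented for illustration, not the paper's actual confusion matrix):

    ```python
    def f1_score(true_positives, false_positives, false_negatives):
        """F1 as the harmonic mean of precision and recall."""
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return 2 * precision * recall / (precision + recall)

    # Illustrative counts: 94 correct tags, 5 spurious, 7 missed.
    print(round(f1_score(94, 5, 7), 3))  # 0.94
    ```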

  9. Annotated chemical patent corpus: a gold standard for text mining.

    PubMed

    Akhondi, Saber A; Klenner, Alexander G; Tyrchan, Christian; Manchala, Anil K; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A R P; Sayle, Roger; Kors, Jan A; Muresan, Sorel

    2014-01-01

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.

  10. Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

    PubMed Central

    Akhondi, Saber A.; Klenner, Alexander G.; Tyrchan, Christian; Manchala, Anil K.; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A. R. P.; Sayle, Roger; Kors, Jan A.; Muresan, Sorel

    2014-01-01

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org. PMID:25268232

  11. Annotations and the Collaborative Digital Library: Effects of an Aligned Annotation Interface on Student Argumentation and Reading Strategies

    ERIC Educational Resources Information Center

    Wolfe, Joanna

    2008-01-01

    Recent research on annotation interfaces provides provocative evidence that anchored, annotation-based discussion environments may lead to better conversations about a text. However, annotation interfaces raise complicated tradeoffs regarding screen real estate and positioning. It is argued that solving this screen real estate problem requires…

  12. Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification.

    PubMed

    Carrell, David S; Cronkite, David J; Malin, Bradley A; Aberdeen, John S; Hirschman, Lynette

    2016-08-05

    Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet the accuracy of annotations may vary considerably across individual annotators, and annotation is costly. As such, the marginal benefit of incorporating additional annotators has not been well characterized. This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold standard annotations to evaluate the performance of teams of annotators of increasing size. Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold standard corpus for evaluation. Recall rates for all PII types ranged from 0.90 to 0.98 for individual annotators and from 0.998 to 1.0 for teams of three, when measured against the gold standard. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for annotations discovered only by a fourth annotator. Incorporating a second annotator into a PII annotation process reduces unredacted PII and improves the quality of annotations to 0.99 recall, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly.
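    The marginal-benefit calculation the study describes can be sketched as follows: a team's recall is the recall of the pooled (union of) findings of its members against the gold standard, averaged over all teams of a given size. The PII instances and annotator error patterns below are toy data, not the study's corpus:

    ```python
    from itertools import combinations
    from statistics import mean

    def team_recall(annotator_sets, gold, team_size):
        """Mean recall over all teams of `team_size` annotators, where a
        team's annotations are the union of its members' PII findings."""
        recalls = []
        for team in combinations(annotator_sets, team_size):
            pooled = set().union(*team)
            recalls.append(len(pooled & gold) / len(gold))
        return mean(recalls)

    # Toy gold standard of 10 PII instances; each annotator misses a few.
    gold = set(range(10))
    annotators = [gold - {1}, gold - {2, 3}, gold - {1, 4}, gold - {5}]
    for k in (1, 2, 3):
        print(k, round(team_recall(annotators, gold, k), 3))
    # 1 0.85
    # 2 0.983
    # 3 1.0
    ```

    As in the study, the jump from one annotator to two recovers most of the missed PII, while larger teams yield rapidly diminishing returns.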

  13. The Function of Annotations in the Comprehension of Scientific Texts: Cognitive Load Effects and the Impact of Verbal Ability

    ERIC Educational Resources Information Center

    Wallen, Erik; Plass, Jan L.; Brunken, Roland

    2005-01-01

    Students participated in a study (n = 98) investigating the effectiveness of three types of annotations on three learning outcome measures. The annotations were designed to support the cognitive processes in the comprehension of scientific texts, with a function to aid either the process of selecting relevant information, organizing the…

  14. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

    PubMed Central

    Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

    2015-01-01

    Objective To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. Results The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language. Discussion The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. Conclusion To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are covered, and the diversity of text genres that were annotated. PMID:25948699

  15. Concept annotation in the CRAFT corpus.

    PubMed

    Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E

    2012-07-09

    Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

  16. Concept annotation in the CRAFT corpus

    PubMed Central

    2012-01-01

    Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. PMID:22776079

  17. Teaching Students To Annotate and Underline Text Effectively--Guidelines and Procedures. College Reading and Learning Assistance Technical Report No. 87-02.

    ERIC Educational Resources Information Center

    Nist, Sherrie L.

    Of all the effective strategies available to college developmental reading students, annotating (noting important ideas or examples in text margins) and underlining have the widest appeal among students and the most practical application in any course. Annotating/underlining serves a dual function: students can isolate key ideas at the time of the…

  18. The BioC-BioGRID corpus: full text articles annotated for curation of protein–protein and genetic interactions

    PubMed Central

    Kim, Sun; Chatr-aryamontri, Andrew; Chang, Christie S.; Oughtred, Rose; Rust, Jennifer; Wilbur, W. John; Comeau, Donald C.; Dolinski, Kara; Tyers, Mike

    2017-01-01

    A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html PMID:28077563

  19. "Doch ein Begriff muss bei dem Worte sein": Some Thoughts on Electronically Annotated Literary Texts.

    ERIC Educational Resources Information Center

    Duncan, Bruce; Foelsche, Otmar K. E.

    1995-01-01

    Discusses the use of electronic glossaries, annotated texts, and other ancillae in German-as-a-Second-Language courses at the college level, and asserts that electronic media deliver lexical assistance far more effectively than do traditional methods, using texts in ways previously unavailable. (MDM)

  20. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.

    PubMed

    Kors, Jan A; Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

    2015-09-01

    To create a multilingual gold-standard corpus for biomedical concept recognition. We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.
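The inter-annotator agreement figures reported above can be illustrated with a minimal exact-match F-score computation between two annotators' concept annotations. A sketch only: the spans and UMLS-style concept IDs below are invented, not taken from the Mantra GSC.

```python
# Pairwise inter-annotator F-score on concept annotations, as used to
# gauge agreement when building a gold-standard corpus.

def f_score(anns_a, anns_b):
    """Exact-match F-score between two annotators' annotation sets.

    Each annotation is a (start, end, concept_id) triple; one set is
    treated as reference, the other as prediction (F1 is symmetric
    under swapping the two annotators).
    """
    a, b = set(anns_a), set(anns_b)
    if not a or not b:
        return 0.0
    tp = len(a & b)                 # annotations both produced
    precision = tp / len(b)
    recall = tp / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: two annotators agree on two of three annotations each.
annotator_1 = {(0, 7, "C0004057"), (15, 24, "C0000970"), (30, 38, "C0018681")}
annotator_2 = {(0, 7, "C0004057"), (15, 24, "C0000970"), (40, 49, "C0013227")}

print(round(f_score(annotator_1, annotator_2), 2))  # 2 shared of 3 each -> 0.67
```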

  1. Text Mining Improves Prediction of Protein Functional Sites

    PubMed Central

    Cohn, Judith D.; Ravikumar, Komandur E.

    2012-01-01

    We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions. PMID:22393388

  2. Annotating Socio-Cultural Structures in Text

    DTIC Science & Technology

    2012-10-31

…parts of speech (POS) within text, using the Stanford Part of Speech Tagger (Stanford Log-Linear, 2011). The ERDC-CERL taxonomy is then used to… …annotated NP/VP Pane: shows the sentence parsed using the Parts of Speech tagger. Document View Pane: specifies the document (being annotated) in three… …first parsed using the Stanford Parts of Speech tagger and converted to an XML document, both components of which are done through the Import function…

  3. Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users

    PubMed Central

    Shatkay, Hagit; Pan, Fengxia; Rzhetsky, Andrey; Wilbur, W. John

    2008-01-01

    Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca PMID:18718948

  4. RysannMD: A biomedical semantic annotator balancing speed and accuracy.

    PubMed

    Cuzzola, John; Jovanović, Jelena; Bagheri, Ebrahim

    2017-07-01

Recently, both researchers and practitioners have explored the possibility of semantically annotating large and continuously evolving collections of biomedical texts such as research papers, medical reports, and physician notes in order to enable their efficient and effective management and use in clinical practice or research laboratories. Such annotations can be automatically generated by biomedical semantic annotators: tools that are specifically designed for detecting and disambiguating biomedical concepts mentioned in text. The biomedical community has already presented several solid automated semantic annotators. However, the existing tools are either strong in their disambiguation capacity, i.e., the ability to identify the correct biomedical concept for a given piece of text among several candidate concepts, or they excel in their processing time, i.e., work very efficiently, but none of the semantic annotation tools reported in the literature has both of these qualities. In this paper, we present RysannMD (Ryerson Semantic Annotator for Medical Domain), a biomedical semantic annotation tool that strikes a balance between processing time and performance while disambiguating biomedical terms. In other words, RysannMD provides reasonable disambiguation performance when choosing the right sense for a biomedical term in a given context, and does so in reasonable time. To examine how RysannMD stands with respect to state-of-the-art biomedical semantic annotators, we have conducted a series of experiments using standard benchmarking corpora, including both gold and silver standards, and four modern biomedical semantic annotators, namely cTAKES, MetaMap, NOBLE Coder, and Neji. The annotators were compared with respect to the quality of the produced annotations, measured against gold and silver standards using precision, recall, and F1 measure, and with respect to speed, i.e., processing time.
In the experiments, RysannMD achieved the best median F1 measure across the benchmarking corpora, independent of the standard used (silver/gold), biomedical subdomain, and document size. In terms of annotation speed, RysannMD scored the second best median processing time across all the experiments. The obtained results indicate that RysannMD offers the best performance among the examined semantic annotators when both quality of annotation and speed are considered simultaneously. Copyright © 2017 Elsevier Inc. All rights reserved.

  5. Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.

    PubMed

    Oellrich, Anika; Collier, Nigel; Smedley, Damian; Groza, Tudor

    2015-01-01

Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess its quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independent of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when they are combined with the NCBO Annotator and cTAKES due to low recall. In conclusion, the performances of individual systems need to be improved independently of the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and the annotations assigned by the four concept recognition systems, as well as the generated silver standard annotation sets, are available from http://purl.org/phenotype/resources.
The textual content of the ShARe/CLEF (https://sites.google.com/site/shareclefehealth/data) and i2b2 (https://i2b2.org/NLP/DataSets/) corpora needs to be requested with the individual corpus providers.
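The silver-standard construction described above is, at its core, a voting scheme over the component systems' outputs. A minimal sketch under that assumption (the system names come from the abstract, but the spans, concept IDs and vote threshold are invented for illustration):

```python
# Silver-standard generation by voting: an annotation is kept when at
# least `min_votes` of the component systems produced it.

from collections import Counter

def silver_standard(system_outputs, min_votes=2):
    """system_outputs: mapping of system name -> set of (start, end, concept)."""
    votes = Counter()
    for anns in system_outputs.values():
        votes.update(anns)
    return {ann for ann, n in votes.items() if n >= min_votes}

# Invented outputs for the four systems compared in the paper.
outputs = {
    "cTAKES":  {(0, 5, "C1"), (10, 18, "C2")},
    "NCBO":    {(0, 5, "C1"), (10, 18, "C2"), (22, 30, "C3")},
    "BeCAS":   {(0, 5, "C1"), (40, 44, "C4")},
    "MetaMap": {(22, 30, "C3"), (40, 44, "C4")},
}

# Every annotation here has at least two votes, so all four survive.
print(sorted(silver_standard(outputs)))
```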

  6. VideoANT: Extending Online Video Annotation beyond Content Delivery

    ERIC Educational Resources Information Center

    Hosack, Bradford

    2010-01-01

This paper expands the boundaries of video annotation in education by outlining the need for extended interaction in online video use, identifying the challenges faced by existing video annotation tools, and introducing VideoANT, a tool designed to create text-based annotations integrated within the timeline of a video hosted online. Several…

  7. EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation.

    PubMed

    Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra; Pereira, Emiliano; Schnetzer, Julia; Arvanitidis, Christos; Jensen, Lars Juhl

    2016-01-01

    The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15-25% and helps curators to detect terms that would otherwise have been missed. Database URL: https://extract.hcmr.gr/. © The Author(s) 2016. Published by Oxford University Press.

  8. EXTRACT: Interactive extraction of environment metadata and term suggestion for metagenomic sample annotation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra

The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Here the comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.

  9. EXTRACT: Interactive extraction of environment metadata and term suggestion for metagenomic sample annotation

    DOE PAGES

    Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra; ...

    2016-01-01

The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Here the comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.

  10. ezTag: tagging biomedical concepts via interactive learning.

    PubMed

    Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan; Leaman, Robert; Lu, Zhiyong

    2018-05-18

    Recently, advanced text-mining techniques have been shown to speed up manual data curation by providing human annotators with automated pre-annotations generated by rules or machine learning models. Due to the limited training data available, however, current annotation systems primarily focus only on common concept types such as genes or diseases. To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central. It also provides lexicon-based concept tagging as well as the state-of-the-art pre-trained taggers such as TaggerOne, GNormPlus and tmVar. ezTag is freely available at http://eztag.bioqrator.org.

  11. The Effects of Literacy Support Tools on the Comprehension of Informational e-Books and Print-Based Text

    ERIC Educational Resources Information Center

    Herman, Heather A.

    2017-01-01

This mixed methods research explores the effects of literacy support tools on the comprehension strategies of 14 first-grade students when reading informational e-books and print-based text. This study focused on the following comprehension strategies: annotating connections, annotating "I wonders," and looking back in the text.…

  12. Developing a corpus of clinical notes manually annotated for part-of-speech.

    PubMed

    Pakhomov, Serguei V; Coden, Anni; Chute, Christopher G

    2006-06-01

    This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation. Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation. We used the Trigrams'n'Tags (TnT) [T. Brants, TnT-a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] tagger trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial to improve correctness of POS tagging. Our preliminary experimental results indicate the necessity for adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.
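The benefit of in-domain training data can be illustrated with a far simpler model than TnT: a unigram tagger that memorizes the most frequent tag per word and falls back to a default for unseen words. Toy data and a hypothetical setup, not the authors' experiment:

```python
# Minimal unigram POS tagger: per-word majority tag from a (tiny,
# invented) "clinical" training set, with a default tag for unknowns.

from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    # Keep the most frequent tag observed for each word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, model, default="NN"):
    return [(w, model.get(w.lower(), default)) for w in sentence]

clinical = [
    [("patient", "NN"), ("denies", "VBZ"), ("chest", "NN"), ("pain", "NN")],
    [("patient", "NN"), ("reports", "VBZ"), ("mild", "JJ"), ("pain", "NN")],
]
model = train_unigram(clinical)
print(tag(["patient", "denies", "pain"], model))
```

A real adaptation experiment would, as in the abstract, combine such in-domain counts with a general-English model rather than replace it.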

  13. Displaying Annotations for Digitised Globes

    NASA Astrophysics Data System (ADS)

    Gede, Mátyás; Farbinger, Anna

    2018-05-01

Thanks to the efforts of the various globe digitising projects, there are nowadays plenty of old globes that can be examined as 3D models on the computer screen. These globes usually contain many interesting details that an average observer would not fully discover at first sight. The authors developed a website that can display annotations for such digitised globes. These annotations help observers of the globe discover all the important, interesting details. Annotations consist of a plain text title, an HTML-formatted descriptive text and a corresponding polygon, and are stored in KML format. The website is powered by the Cesium virtual globe engine.
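A KML annotation of the kind described (plain-text title, HTML description, polygon) can be read with standard tools. The Placemark below is a hand-made example, not taken from the authors' website:

```python
# Reading a globe annotation (title, HTML description, polygon ring)
# from a minimal KML Placemark using only the standard library.

import xml.etree.ElementTree as ET

KML = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document><Placemark>
  <name>Terra Australis</name>
  <description><![CDATA[<b>Hypothetical southern continent</b>]]></description>
  <Polygon><outerBoundaryIs><LinearRing>
    <coordinates>110,-40,0 150,-40,0 150,-20,0 110,-40,0</coordinates>
  </LinearRing></outerBoundaryIs></Polygon>
</Placemark></Document></kml>"""

NS = {"kml": "http://www.opengis.net/kml/2.2"}
root = ET.fromstring(KML)
for pm in root.iterfind(".//kml:Placemark", NS):
    title = pm.findtext("kml:name", namespaces=NS)
    # KML coordinates are whitespace-separated "lon,lat,alt" triples.
    coords = pm.findtext(".//kml:coordinates", namespaces=NS).split()
    ring = [tuple(map(float, c.split(",")[:2])) for c in coords]
    print(title, ring)
```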

  14. The Effect of Hypertext Annotation Presentation Formats on Perceived Cognitive Load and Learner Control

    ERIC Educational Resources Information Center

    Yao, Yuanming; Gill, Michele

    2009-01-01

    The impact of hypertext presentation formats on learner control and cognitive load was examined in this study using Campbell and Stanley's (1963) Posttest Only Control Group design. One hundred eighty-six undergraduate students were randomly assigned to read a web-based text with no annotations, online glossary annotations, embedded annotations,…

  15. Data mart construction based on semantic annotation of scientific articles: A case study for the prioritization of drug targets.

    PubMed

    Teixeira, Marlon Amaro Coelho; Belloze, Kele Teixeira; Cavalcanti, Maria Cláudia; Silva-Junior, Floriano P

    2018-04-01

Semantic text annotation enables the association of semantic information (ontology concepts) with text expressions (terms), making them readable by software agents. In the scientific scenario this is particularly useful, because it reveals many scientific discoveries that are hidden within academic articles. The biomedical area has more than 300 ontologies, most of them composed of over 500 concepts. These ontologies can be used to annotate scientific papers and thus facilitate data extraction. However, in the context of a scientific research project, a simple keyword-based query using the interface of a digital scientific text library can return more than a thousand hits. The analysis of such a large set of texts, annotated with such numerous and large ontologies, is not an easy task. Therefore, the main objective of this work is to provide a method that facilitates this task. This work describes a method called Text and Ontology ETL (TOETL) to build an analytical view over such texts. First, a corpus of selected papers is semantically annotated using distinct ontologies. Then, the annotation data is extracted, organized and aggregated into the dimensional schema of a data mart. Besides the TOETL method, this work illustrates its application through the development of the TaP DM (Target Prioritization data mart). This data mart focuses on research into gene essentiality, a key concept when searching for genes showing potential as anti-infective drug targets. This work reveals that the proposed approach is a relevant tool to support decision making in the prioritization of new drug targets, and is more efficient than traditional keyword-based tools. Copyright © 2018 Elsevier B.V. All rights reserved.

  16. Detecting long tandem duplications in genomic sequences.

    PubMed

    Audemard, Eric; Schiex, Thomas; Faraut, Thomas

    2012-05-08

Detecting duplicated segments within completely sequenced genomes provides valuable information to address genome evolution and, in particular, the important question of the emergence of novel functions. The usual approach to gene duplication detection, based on all-pairs protein gene comparisons, provides only a restricted view of duplication. In this paper, we introduce ReD Tandem, a software tool using a flow-based chaining algorithm targeted at detecting tandem duplication arrays of moderate to long regions, with possibly locally weak similarities, directly at the DNA level. On the A. thaliana genome, using a reference set of tandemly duplicated genes built using TAIR, we show that ReD Tandem is able to predict a large fraction of recently duplicated genes (dS < 1) and that it is also able to predict tandem duplications involving non-coding elements such as pseudogenes or RNA genes. ReD Tandem identifies large tandem duplications without any annotation, leading to agnostic identification of tandem duplications. This approach nicely complements the usual protein-gene-based approach, which ignores duplications involving non-coding regions. It is, however, inherently restricted to relatively recent duplications. By recovering otherwise ignored events, ReD Tandem gives a more comprehensive view of existing evolutionary processes and may also help improve existing annotations.

  17. @Note: a workbench for biomedical text mining.

    PubMed

    Lourenço, Anália; Carreira, Rafael; Carneiro, Sónia; Maia, Paulo; Glez-Peña, Daniel; Fdez-Riverola, Florentino; Ferreira, Eugénio C; Rocha, Isabel; Rocha, Miguel

    2009-08-01

Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows users to correct annotations; and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves interoperability, modularity and flexibility when integrating in-home and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although still ongoing, the project has already allowed the development of applications that are currently in use.

  18. Efficient Queries of Stand-off Annotations for Natural Language Processing on Electronic Medical Records.

    PubMed

    Luo, Yuan; Szolovits, Peter

    2016-01-01

In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing to narrative clinical notes in electronic medical records (EMRs), and of efficiently retrieving such annotations that satisfy position constraints. Efficient storage and retrieval of stand-off annotations can facilitate tasks such as mapping unstructured text to electronic medical record ontologies. We first formulate this problem as the interval query problem, for which optimal query/update time is in general logarithmic. We next perform a tight time complexity analysis of the basic interval tree query algorithm and show its nonoptimality when applied to a collection of 13 query types from Allen's interval algebra. We then study two closely related state-of-the-art interval query algorithms, propose query reformulations, and develop augmentations to the second algorithm. Our proposed algorithm achieves logarithmic stabbing-max query time and solves the stabbing-interval query tasks on all of Allen's relations in logarithmic time, attaining the theoretical lower bound. Update time is kept logarithmic and the space requirement is kept linear at the same time. We also discuss interval management in external memory models and higher dimensions.
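The stabbing query at the heart of this problem, i.e. finding all stand-off annotations covering a given character offset, can be sketched with a simple centered interval tree. This is a minimal sketch with invented annotations; the stabbing-max structures analysed in the paper are considerably more involved:

```python
# Stabbing query over stand-off annotations: given (start, end, label)
# character intervals, find all annotations covering a position.

class IntervalNode:
    """Centered interval tree over closed intervals (start, end, label)."""

    def __init__(self, intervals):
        starts = sorted(s for s, e, _ in intervals)
        self.center = starts[len(starts) // 2]
        here, left, right = [], [], []
        for iv in intervals:
            s, e, _ = iv
            if e < self.center:
                left.append(iv)       # entirely left of center
            elif s > self.center:
                right.append(iv)      # entirely right of center
            else:
                here.append(iv)       # spans the center point
        self.here = here
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def stab(self, pos):
        hits = [iv for iv in self.here if iv[0] <= pos <= iv[1]]
        child = self.left if pos < self.center else self.right
        return hits + (child.stab(pos) if child else [])

anns = [(0, 6, "token"), (0, 28, "sentence"), (8, 15, "gene"), (20, 28, "disease")]
tree = IntervalNode(anns)
print(sorted(tree.stab(10)))  # annotations covering offset 10
```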

  19. Efficient Queries of Stand-off Annotations for Natural Language Processing on Electronic Medical Records

    PubMed Central

    Luo, Yuan; Szolovits, Peter

    2016-01-01

In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing to narrative clinical notes in electronic medical records (EMRs), and of efficiently retrieving such annotations that satisfy position constraints. Efficient storage and retrieval of stand-off annotations can facilitate tasks such as mapping unstructured text to electronic medical record ontologies. We first formulate this problem as the interval query problem, for which optimal query/update time is in general logarithmic. We next perform a tight time complexity analysis of the basic interval tree query algorithm and show its nonoptimality when applied to a collection of 13 query types from Allen's interval algebra. We then study two closely related state-of-the-art interval query algorithms, propose query reformulations, and develop augmentations to the second algorithm. Our proposed algorithm achieves logarithmic stabbing-max query time and solves the stabbing-interval query tasks on all of Allen's relations in logarithmic time, attaining the theoretical lower bound. Update time is kept logarithmic and the space requirement is kept linear at the same time. We also discuss interval management in external memory models and higher dimensions. PMID:27478379

  20. ChemBrowser: a flexible framework for mining chemical documents.

    PubMed

    Wu, Xian; Zhang, Li; Chen, Ying; Rhodes, James; Griffin, Thomas D; Boyer, Stephen K; Alba, Alfredo; Cai, Keke

    2010-01-01

    The ability to extract chemical and biological entities and relations from text documents automatically has great value to biochemical research and development activities. The growing maturity of text mining and artificial intelligence technologies shows promise in enabling such automatic chemical entity extraction capabilities (called "Chemical Annotation" in this paper). Many techniques have been reported in the literature, ranging from dictionary and rule-based techniques to machine learning approaches. In practice, we found that no single technique works well in all cases. A combinatorial approach that allows one to quickly compose different annotation techniques together for a given situation is most effective. In this paper, we describe the key challenges we face in real-world chemical annotation scenarios. We then present a solution called ChemBrowser which has a flexible framework for chemical annotation. ChemBrowser includes a suite of customizable processing units that might be utilized in a chemical annotator, a high-level language that describes the composition of various processing units that would form a chemical annotator, and an execution engine that translates the composition language to an actual annotator that can generate annotation results for a given set of documents. We demonstrate the impact of this approach by tailoring an annotator for extracting chemical names from patent documents and show how this annotator can be easily modified with simple configuration alone.
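The combinatorial idea described above, i.e. composing individual annotation techniques behind one interface, can be sketched as follows. The lexicon, the formula regex and the labels are invented examples, not ChemBrowser's actual processing units or composition language:

```python
# Composable chemical annotators: each annotator maps text to a list of
# (start, end, label) spans; compose() merges and deduplicates them.

import re

def dictionary_annotator(lexicon):
    def annotate(text):
        hits = []
        for term in lexicon:
            for m in re.finditer(re.escape(term), text, re.IGNORECASE):
                hits.append((m.start(), m.end(), "CHEM-DICT"))
        return hits
    return annotate

def regex_annotator(pattern, label):
    def annotate(text):
        return [(m.start(), m.end(), label) for m in re.finditer(pattern, text)]
    return annotate

def compose(*annotators):
    def annotate(text):
        return sorted({a for ann in annotators for a in ann(text)})
    return annotate

pipeline = compose(
    dictionary_annotator(["aspirin", "ethanol"]),
    # Crude molecular-formula pattern, e.g. "C2H5OH"; illustration only.
    regex_annotator(r"\b[A-Z][a-z]?\d+[A-Za-z\d]*\b", "CHEM-FORMULA"),
)
results = pipeline("Dissolve aspirin in C2H5OH (ethanol).")
print(results)
```

Swapping in a machine-learning annotator would only require the same text-in, spans-out interface, which is the flexibility the framework aims for.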

  1. Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows.

    PubMed

    Fu, Xiao; Batista-Navarro, Riza; Rak, Rafal; Ananiadou, Sophia

    2015-01-01

    Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients. A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents. When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors. We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.
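The micro-averaged F-score reported in this record pools true positives, false positives and false negatives across all annotation categories before computing precision and recall, so frequent categories weigh more than rare ones. A minimal sketch of that calculation in Python (the category names and counts below are invented for illustration, not taken from the paper):

```python
# Micro-averaged precision/recall/F-score: pool TP/FP/FN over all
# categories first, then compute the ratios once on the pooled counts.
def micro_f1(counts):
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-category counts from comparing a workflow's output
# against gold-standard annotations.
counts = {
    "Sign/Symptom": {"tp": 40, "fp": 25, "fn": 30},
    "Drug":         {"tp": 15, "fp": 20, "fn": 25},
}
print(round(micro_f1(counts), 4))  # 0.5238
```

A macro-averaged score would instead compute F per category and average the results, giving every category equal weight regardless of frequency.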

  2. Ontology design patterns to disambiguate relations between genes and gene products in GENIA

    PubMed Central

    2011-01-01

    Motivation Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers the means for ensuring consistency as well as enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences. Results We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable for automated verification, deductive inferences and other knowledge-based applications. Availability Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/. PMID:22166341

  3. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion.

    PubMed

    Agarwal, Shashank; Yu, Hong

    2009-12-01

    Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in full-text biomedical articles could be reliably annotated into the IMRAD format and then explored different approaches for automatically classifying these sentences into the IMRAD categories. Our results show an overall annotation agreement of 82.14% with a Kappa score of 0.756. The best classification system is a multinomial naïve Bayes classifier trained on manually annotated data that achieved 91.95% accuracy and an average F-score of 91.55%, which is significantly higher than baseline systems. A web version of this system is available online at http://wood.ims.uwm.edu/full_text_classifier/.
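The Kappa score of 0.756 above is Cohen's kappa, which corrects raw percentage agreement for the agreement two annotators would reach by chance given their label distributions. A small self-contained sketch (the two IMRAD label sequences below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[label] * cb[label] for label in ca) / n ** 2
    return (po - pe) / (1 - pe)

# Hypothetical sentence-level IMRAD labels from two annotators.
ann1 = ["I", "M", "M", "R", "R", "R", "D", "D"]
ann2 = ["I", "M", "R", "R", "R", "D", "D", "D"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.652
```

Here the annotators agree on 6 of 8 sentences (75%), but kappa discounts the portion of that agreement expected by chance, which is why kappa values are always lower than raw agreement when the label distribution is skewed.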

  4. Desiderata for ontologies to be used in semantic annotation of biomedical documents.

    PubMed

    Bada, Michael; Hunter, Lawrence

    2011-02-01

    A wealth of knowledge valuable to the translational research scientist is contained within the vast biomedical literature, but this knowledge is typically in the form of natural language. Sophisticated natural-language-processing systems are needed to translate text into unambiguous formal representations grounded in high-quality consensus ontologies, and these systems in turn rely on gold-standard corpora of annotated documents for training and testing. To this end, we are constructing the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-text biomedical journal articles that are being manually annotated with the entire sets of terms from select vocabularies, predominantly from the Open Biomedical Ontologies (OBO) library. Our efforts in building this corpus have illuminated infelicities of these ontologies with respect to the semantic annotation of biomedical documents, and we propose desiderata whose implementation could substantially improve their utility in this task; these include the integration of overlapping terms across OBOs, the resolution of OBO-specific ambiguities, the integration of the BFO with the OBOs and the use of mid-level ontologies, the inclusion of noncanonical instances, and the expansion of relations and realizable entities. Copyright © 2010 Elsevier Inc. All rights reserved.

  5. Protein Annotators' Assistant: A Novel Application of Information Retrieval Techniques.

    ERIC Educational Resources Information Center

    Wise, Michael J.

    2000-01-01

    Protein Annotators' Assistant (PAA) is a software system which assists protein annotators in assigning functions to newly sequenced proteins. PAA employs a number of information retrieval techniques in a novel setting and is thus related to text categorization, where multiple categories may be suggested, except that in this case none of the…

  6. The Effects of Multimedia Annotation and Summary Writing on Taiwanese EFL Students' Reading Comprehension

    ERIC Educational Resources Information Center

    Huang, Wen-Chi

    2014-01-01

    The present study investigates the effects of multimedia annotation through the discourse scheme and summary writing through the grounding theory (Chang, 1997) on text comprehension. Specifically, the study focuses on examining the influences of multimedia annotation from a special perspective, namely, the use of modified discourse scheme to…

  7. Dense Annotation of Free-Text Critical Care Discharge Summaries from an Indian Hospital and Associated Performance of a Clinical NLP Annotator.

    PubMed

    Ramanan, S V; Radhakrishna, Kedar; Waghmare, Abijeet; Raj, Tony; Nathan, Senthil P; Sreerama, Sai Madhukar; Sampath, Sriram

    2016-08-01

    Electronic Health Record (EHR) use in India is generally poor, and structured clinical information is mostly lacking. This work is the first attempt aimed at evaluating unstructured text mining for extracting relevant clinical information from Indian clinical records. We annotated a corpus of 250 discharge summaries from an Intensive Care Unit (ICU) in India, with markups for diseases, procedures, and lab parameters, their attributes, as well as key demographic information and administrative variables such as patient outcomes. In this process, we have constructed guidelines for an annotation scheme useful to clinicians in the Indian context. We evaluated the performance of an NLP engine, Cocoa, on a cohort of these Indian clinical records. We have produced an annotated corpus of roughly 90 thousand words, which to our knowledge is the first tagged clinical corpus from India. Cocoa was evaluated on a test corpus of 50 documents. The overlap F-scores across the major categories, namely disease/symptoms, procedures, laboratory parameters and outcomes, are 0.856, 0.834, 0.961 and 0.872 respectively. These results are competitive with results from recent shared tasks based on US records. The annotated corpus and associated results from the Cocoa engine indicate that unstructured text mining is a viable method for cohort analysis in the Indian clinical context, where structured EHR records are largely absent.

  8. Building gold standard corpora for medical natural language processing tasks.

    PubMed

    Deleger, Louise; Li, Qi; Lingren, Todd; Kaiser, Megan; Molnar, Katalin; Stoutenborough, Laura; Kouril, Michal; Marsolo, Keith; Solti, Imre

    2012-01-01

    We present the construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks. Clinical notes from the medical record, clinical trial announcements, and FDA drug labels are annotated. We report high inter-annotator agreement (overall F-measures between 0.8467 and 0.9176) for the annotation of Personal Health Information (PHI) elements for a de-identification task, and of medications, diseases/disorders, and signs/symptoms for an information extraction (IE) task. The annotated corpora of clinical trials and FDA labels will be publicly released; to facilitate translational NLP tasks that require cross-corpora interoperability (e.g. clinical trial eligibility screening), their annotation schemas are aligned with a large-scale, NIH-funded clinical text annotation project.

  9. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  10. A Case Study of Using a Social Annotation Tool to Support Collaboratively Learning

    ERIC Educational Resources Information Center

    Gao, Fei

    2013-01-01

    The purpose of the study was to understand student interaction and learning supported by a collaborative social annotation tool--Diigo. The researcher examined through a case study how students participated and interacted when learning an online text with the social annotation tool--Diigo, and how they perceived their experience. The findings…

  11. METSP: a maximum-entropy classifier based text mining tool for transporter-substrate identification with semistructured text.

    PubMed

    Zhao, Min; Chen, Yanming; Qu, Dacheng; Qu, Hong

    2015-01-01

    The substrates of a transporter are not only useful for inferring function of the transporter, but also important to discover compound-compound interaction and to reconstruct metabolic pathway. Though plenty of data has been accumulated with the development of new technologies such as in vitro transporter assays, the search for substrates of transporters is far from complete. In this article, we introduce METSP, a maximum-entropy classifier devoted to retrieve transporter-substrate pairs (TSPs) from semistructured text. Based on the high-quality annotation from UniProt, METSP achieves high precision and recall in cross-validation experiments. When METSP is applied to 182,829 human transporter annotation sentences in UniProt, it identifies 3942 sentences with transporter and compound information. Finally, 1547 high-confidence human TSPs are identified for further manual curation, of which 58.37% involve novel substrates not annotated in public transporter databases. METSP is the first efficient tool to extract TSPs from semistructured annotation text in UniProt. This tool can help to determine the precise substrates and drugs of transporters, thus facilitating drug-target prediction, metabolic network reconstruction, and literature classification.

  12. Using text mining techniques to extract phenotypic information from the PhenoCHF corpus

    PubMed Central

    2015-01-01

    Background Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes. Text mining (TM) techniques have previously been applied successfully to extract different types of information from text in the biomedical domain. They have the potential to be extended to allow the extraction of information relating to phenotypes from free text. Methods To stimulate the development of TM systems that are able to extract phenotypic information from text, we have created a new corpus (PhenoCHF) that is annotated by domain experts with several types of phenotypic information relating to congestive heart failure. To ensure that systems developed using the corpus are robust to multiple text types, it integrates text from heterogeneous sources, i.e., electronic health records (EHRs) and scientific articles from the literature. We have developed several different phenotype extraction methods to demonstrate the utility of the corpus, and tested these methods on a further corpus, i.e., ShARe/CLEF 2013. Results Evaluation of our automated methods showed that PhenoCHF can facilitate the training of reliable phenotype extraction systems, which are robust to variations in text type. These results have been reinforced by evaluating our trained systems on the ShARe/CLEF corpus, which contains clinical records of various types. Like other studies within the biomedical domain, we found that solutions based on conditional random fields produced the best results, when coupled with a rich feature set. Conclusions PhenoCHF is the first annotated corpus aimed at encoding detailed phenotypic information. The unique heterogeneous composition of the corpus has been shown to be advantageous in the training of systems that can accurately extract phenotypic information from a range of different text types. Although the scope of our annotation is currently limited to a single disease, the promising results achieved can stimulate further work into the extraction of phenotypic information for other diseases. The PhenoCHF annotation guidelines and annotations are publicly available at https://code.google.com/p/phenochf-corpus. PMID:26099853

  13. Using text mining techniques to extract phenotypic information from the PhenoCHF corpus.

    PubMed

    Alnazzawi, Noha; Thompson, Paul; Batista-Navarro, Riza; Ananiadou, Sophia

    2015-01-01

    Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes. Text mining (TM) techniques have previously been applied successfully to extract different types of information from text in the biomedical domain. They have the potential to be extended to allow the extraction of information relating to phenotypes from free text. To stimulate the development of TM systems that are able to extract phenotypic information from text, we have created a new corpus (PhenoCHF) that is annotated by domain experts with several types of phenotypic information relating to congestive heart failure. To ensure that systems developed using the corpus are robust to multiple text types, it integrates text from heterogeneous sources, i.e., electronic health records (EHRs) and scientific articles from the literature. We have developed several different phenotype extraction methods to demonstrate the utility of the corpus, and tested these methods on a further corpus, i.e., ShARe/CLEF 2013. Evaluation of our automated methods showed that PhenoCHF can facilitate the training of reliable phenotype extraction systems, which are robust to variations in text type. These results have been reinforced by evaluating our trained systems on the ShARe/CLEF corpus, which contains clinical records of various types. Like other studies within the biomedical domain, we found that solutions based on conditional random fields produced the best results, when coupled with a rich feature set. PhenoCHF is the first annotated corpus aimed at encoding detailed phenotypic information. The unique heterogeneous composition of the corpus has been shown to be advantageous in the training of systems that can accurately extract phenotypic information from a range of different text types. Although the scope of our annotation is currently limited to a single disease, the promising results achieved can stimulate further work into the extraction of phenotypic information for other diseases. The PhenoCHF annotation guidelines and annotations are publicly available at https://code.google.com/p/phenochf-corpus.

  14. Bloodroot (Sanguinaria canadensis): an annotated bibliography

    Treesearch

    Mary L. Predny; James L. Chamberlain

    2005-01-01

    Bloodroot (Sanguinaria canadensis) is a spring-blooming herbaceous perennial found mainly in rich woods throughout the Appalachian Mountain regions and across the Eastern United States. The common name bloodroot and scientific name Sanguinaria denote the blood-red sap found throughout the plant, particularly in the roots. This sap...

  15. MimoSA: a system for minimotif annotation

    PubMed Central

    2010-01-01

    Background Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature. Results We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, MimoSA provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database. Conclusions MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context. PMID:20565705

  16. Common data model for natural language processing based on two existing standard information models: CDA+GrAF.

    PubMed

    Meystre, Stéphane M; Lee, Sanghoon; Jung, Chai Young; Chevrier, Raphaël D

    2012-08-01

    An increasing need for collaboration and resources sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated and extracted from clinical text. We have combined two existing standards: the HL7 Clinical Document Architecture (CDA), and the ISO Graph Annotation Format (GrAF; in development), to develop such a data model entitled "CDA+GrAF". We experimented with several methods to combine these existing standards, and eventually selected a method wrapping separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. Two use cases, clinical document sections, and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), were used to create examples of such standoff annotation documents, and were successfully validated with the XML schemata provided with both standards. We developed a tool to automatically translate annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF, and automatically generated 50 annotation documents using this tool, all successfully validated. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may ease directly comparing NLP tools and applications, combining their output, transforming and "translating" annotations between different NLP applications, and eventually "plug-and-play" of different modules in NLP applications. Copyright © 2011 Elsevier Inc. All rights reserved.
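Standoff annotation, as used in the CDA+GrAF model above, keeps the source text untouched and stores each annotation separately with character offsets pointing back into it. A minimal sketch using Python's stdlib XML support (the element names, sentence and offsets below are invented for illustration and are not the actual CDA or GrAF schema):

```python
import xml.etree.ElementTree as ET

# Standoff annotation: the clinical text is stored once, unmodified, and
# each annotation references it by character offsets rather than inline tags.
text = "Patient denies chest pain. Started metformin 500 mg."

root = ET.Element("annotations")
for start, end, label in [(15, 25, "problem"), (35, 44, "treatment")]:
    ann = ET.SubElement(root, "annotation",
                        {"start": str(start), "end": str(end), "type": label})
    ann.text = text[start:end]  # redundant copy of the span, for readability

print(ET.tostring(root, encoding="unicode"))
```

Because the offsets are the only link between annotations and text, multiple overlapping annotation layers (e.g. sections, concepts, relations) can coexist over the same document without interfering with one another, which is what makes the wrapped CDA and GrAF parts composable.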

  17. Essential Annotation Schema for Ecology (EASE)-A framework supporting the efficient data annotation and faceted navigation in ecology.

    PubMed

    Pfaff, Claas-Thido; Eichenberg, David; Liebergesell, Mario; König-Ries, Birgitta; Wirth, Christian

    2017-01-01

    Ecology has become a data-intensive science over the last decades which often relies on the reuse of data in cross-experimental analyses. However, finding data which qualifies for the reuse in a specific context can be challenging. It requires good quality metadata and annotations as well as efficient search strategies. To date, full text search (often on the metadata only) is the most widely used search strategy although it is known to be inaccurate. Faceted navigation is providing a filter mechanism which is based on fine granular metadata, categorizing search objects along numeric and categorical parameters relevant for their discovery. Selecting from these parameters during a full text search creates a system of filters which allows the results to be refined and improved towards more relevance. We developed a framework for the efficient annotation and faceted navigation in ecology. It consists of an XML schema for storing the annotation of search objects and is accompanied by a vocabulary focused on ecology to support the annotation process. The framework consolidates ideas which originate from widely accepted metadata standards, textbooks, scientific literature, and vocabularies as well as from expert knowledge contributed by researchers from ecology and adjacent disciplines.
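The faceted navigation this record describes amounts to intersecting filters: each selected facet value narrows the candidate set of search objects. A toy sketch of that mechanism (the dataset titles and facet names below are invented, not taken from the EASE vocabulary):

```python
# Faceted filtering over annotated records: every selected facet must match.
datasets = [
    {"title": "Beech growth plots",   "habitat": "forest",    "taxon": "plants"},
    {"title": "Carabid pitfall data", "habitat": "grassland", "taxon": "insects"},
    {"title": "Canopy beetle survey", "habitat": "forest",    "taxon": "insects"},
]

def facet_filter(records, **facets):
    """Keep only records matching every selected facet value."""
    return [r for r in records
            if all(r.get(key) == value for key, value in facets.items())]

hits = facet_filter(datasets, habitat="forest", taxon="insects")
print([r["title"] for r in hits])  # ['Canopy beetle survey']
```

In a real system the facet values come from the fine-granular metadata annotations, and the interface would also report the remaining value counts per facet so users can see which further refinements are available.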

  18. An annotated corpus with nanomedicine and pharmacokinetic parameters

    PubMed Central

    Lewinski, Nastassja A; Jimenez, Ivan; McInnes, Bridget T

    2017-01-01

    A vast amount of data on nanomedicines is being generated and published, and natural language processing (NLP) approaches can automate the extraction of unstructured text-based data. Annotated corpora are a key resource for NLP and information extraction methods which employ machine learning. Although corpora are available for pharmaceuticals, resources for nanomedicines and nanotechnology are still limited. To foster nanotechnology text mining (NanoNLP) efforts, we have constructed a corpus of annotated drug product inserts taken from the US Food and Drug Administration’s Drugs@FDA online database. In this work, we present the development of the Engineered Nanomedicine Database corpus to support the evaluation of nanomedicine entity extraction. The data were manually annotated for 21 entity mentions consisting of nanomedicine physicochemical characterization, exposure, and biologic response information of 41 Food and Drug Administration-approved nanomedicines. We evaluate the reliability of the manual annotations and demonstrate the use of the corpus by evaluating two state-of-the-art named entity extraction systems, OpenNLP and Stanford NER. The annotated corpus is available open source and, based on these results, guidelines and suggestions for future development of additional nanomedicine corpora are provided. PMID:29066897

  19. The CHEMDNER corpus of chemicals and drugs and its annotation principles.

    PubMed

    Krallinger, Martin; Rabal, Obdulia; Leitner, Florian; Vazquez, Miguel; Salgado, David; Lu, Zhiyong; Leaman, Robert; Lu, Yanan; Ji, Donghong; Lowe, Daniel M; Sayle, Roger A; Batista-Navarro, Riza Theresa; Rak, Rafal; Huber, Torsten; Rocktäschel, Tim; Matos, Sérgio; Campos, David; Tang, Buzhou; Xu, Hua; Munkhdalai, Tsendsuren; Ryu, Keun Ho; Ramanan, S V; Nathan, Senthil; Žitnik, Slavko; Bajec, Marko; Weber, Lutz; Irmer, Matthias; Akhondi, Saber A; Kors, Jan A; Xu, Shuo; An, Xin; Sikdar, Utpal Kumar; Ekbal, Asif; Yoshioka, Masaharu; Dieb, Thaer M; Choi, Miji; Verspoor, Karin; Khabsa, Madian; Giles, C Lee; Liu, Hongfang; Ravikumar, Komandur Elayavilli; Lamurias, Andre; Couto, Francisco M; Dai, Hong-Jie; Tsai, Richard Tzong-Han; Ata, Caglar; Can, Tolga; Usié, Anabel; Alves, Rui; Segura-Bedmar, Isabel; Martínez, Paloma; Oyarzabal, Julen; Valencia, Alfonso

    2015-01-01

    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.

  20. The CHEMDNER corpus of chemicals and drugs and its annotation principles

    PubMed Central

    2015-01-01

    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/ PMID:25810773

  1. Transcriptome sequence analysis of an ornamental plant, Ananas comosus var. bracteatus, revealed the potential unigenes involved in terpenoid and phenylpropanoid biosynthesis.

    PubMed

    Ma, Jun; Kanakala, S; He, Yehua; Zhang, Junli; Zhong, Xiaolan

    2015-01-01

    Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in its growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. Of the 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 were annotated in the NCBI non-redundant protein database and 23,134 were annotated in the Swiss-Prot database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be Ananas comosus var. bracteatus unique. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge, this is the first report on the de novo transcriptome sequencing of Ananas comosus var. bracteatus. Unigenes obtained in this study may help improve future gene expression, genetic and genomics studies in Ananas comosus var. bracteatus.

  2. Transcriptome Sequence Analysis of an Ornamental Plant, Ananas comosus var. bracteatus, Revealed the Potential Unigenes Involved in Terpenoid and Phenylpropanoid Biosynthesis

    PubMed Central

    Ma, Jun; Kanakala, S.; He, Yehua; Zhang, Junli; Zhong, Xiaolan

    2015-01-01

    Background Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant, valued for its colorful leaves and decorative red fruits. Because of its complex genome, the molecular mechanisms involved in its growth and development are difficult to study. High-throughput transcriptome sequencing of Ananas comosus var. bracteatus is therefore needed to generate large quantities of transcript sequences for gene discovery and functional genomic studies. Results The Ananas comosus var. bracteatus transcriptome was sequenced using Illumina paired-end sequencing technology. We obtained a total of 23.5 million high-quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. Of the 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 were annotated in the NCBI non-redundant protein database and 23,134 in the Swiss-Prot database. Of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes mapped to 117 pathways. The assembly predicted many previously unknown unigenes. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis and sorghum; unigenes that did not match any of these five sequence datasets are considered unique to Ananas comosus var. bracteatus. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. Conclusion The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge, this is the first report of de novo transcriptome sequencing of Ananas comosus var. bracteatus. The unigenes obtained in this study may help advance future gene expression, genetic and genomics studies in Ananas comosus var. bracteatus. PMID:25769053

  3. How to Lessen the Effects of User Resistance on the Adoption of an E-Learning Environment: Screenshot Annotation on Flickr

    ERIC Educational Resources Information Center

    Huang, T. K.

    2018-01-01

    The study makes use of the photo-hosting site Flickr, to which students upload screenshots demonstrating computer software problems and troubleshooting software. By creating non-text stickers and text-based annotations over the screenshots, students are able to help one another diagnose and solve problems with greater certainty. In…

  4. Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora.

    PubMed

    Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana; Wilbur, W John

    2014-01-01

    BioC is a recently created XML format for sharing text data and annotations, with an accompanying input/output library, intended to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text: the frequent abbreviation of entity names. We selected three different abbreviation definition identification modules and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the older MEDSTRACT corpus of 199 PubMed citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora were re-evaluated by four annotators, improving their consistency and quality. We converted the corpora to BioC format and describe the representation of the annotations. These corpora were then used to evaluate the three abbreviation-finding algorithms, and the results are reported. The BioC-compatible modules, compared with their original forms, show no difference in efficiency, running time or any other comparable aspect. They can conveniently be used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: code and data are available for download at the BioC site: http://bioc.sourceforge.net. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
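
    The best known of these abbreviation definition identification methods, the Schwartz and Hearst algorithm, pairs a parenthesized short form with the text preceding it by character matching. A minimal Python sketch of that matching step follows; it is simplified from the published algorithm and is not the Ab3P or BioC code, and the example strings are illustrative only.

```python
def find_long_form(short_form, preceding_text):
    """Schwartz-Hearst style matching: scan right-to-left, requiring each
    short-form character to appear in the candidate text, with the first
    short-form character starting a word."""
    s_idx = len(short_form) - 1
    l_idx = len(preceding_text) - 1
    while s_idx >= 0:
        c = short_form[s_idx].lower()
        if not c.isalnum():          # skip punctuation in the short form
            s_idx -= 1
            continue
        # Move left until a matching character is found; the short form's
        # first character must additionally begin a word.
        while l_idx >= 0 and (
            preceding_text[l_idx].lower() != c
            or (s_idx == 0 and l_idx > 0 and preceding_text[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None              # no valid long form found
        l_idx -= 1
        s_idx -= 1
    # Expand left to the start of the word where matching stopped.
    start = preceding_text.rfind(" ", 0, l_idx + 2) + 1
    return preceding_text[start:]

# A definition pattern "long form (SF)" pairs the abbreviation inside the
# parentheses with the text before them.
print(find_long_form("HMM", "We trained a hidden Markov model"))
```

In the full algorithm the candidate window is also bounded by the short form's length, a refinement omitted here for brevity.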

  5. Assisted annotation of medical free text using RapTAT

    PubMed Central

    Gobbel, Glenn T; Garvin, Jennifer; Reeves, Ruth; Cronin, Robert M; Heavirland, Julia; Williams, Jenifer; Weaver, Allison; Jayaramaraja, Shrimalini; Giuse, Dario; Speroff, Theodore; Brown, Steven H; Xu, Hua; Matheny, Michael E

    2014-01-01

    Objective To determine whether assisted annotation using interactive training can reduce the time required to annotate a clinical document corpus without introducing bias. Materials and methods A tool, RapTAT, was designed to assist annotation by iteratively pre-annotating probable phrases of interest within a document, presenting the annotations to a reviewer for correction, and then using the corrected annotations for further machine learning-based training before pre-annotating subsequent documents. Annotators reviewed 404 clinical notes either manually or using RapTAT assistance for concepts related to quality of care during heart failure treatment. Notes were divided into 20 batches of 19–21 documents for iterative annotation and training. Results The number of correct RapTAT pre-annotations increased significantly and annotation time per batch decreased by ∼50% over the course of annotation. Annotation rate increased from batch to batch for assisted but not manual reviewers. Pre-annotation F-measure increased from 0.5 to 0.6 to >0.80 (relative to both assisted reviewer and reference annotations) over the first three batches and more slowly thereafter. Overall inter-annotator agreement was significantly higher between RapTAT-assisted reviewers (0.89) than between manual reviewers (0.85). Discussion The tool reduced workload by decreasing the number of annotations needing to be added and helping reviewers to annotate at an increased rate. Agreement between the pre-annotations and reference standard, and agreement between the pre-annotations and assisted annotations, were similar throughout the annotation process, which suggests that pre-annotation did not introduce bias. Conclusions Pre-annotations generated by a tool capable of interactive training can reduce the time required to create an annotated document corpus by up to 50%. PMID:24431336
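
    The pre-annotation F-measure reported above scores a tool's pre-annotations against reference annotations by exact span matching. A minimal sketch of that computation; the (start, end, label) span representation is a hypothetical simplification, not RapTAT's internal format.

```python
def span_f1(predicted, reference):
    """F-measure between two sets of (start, end, label) annotation spans,
    counting only exact matches as true positives."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # spans agreeing on offsets and label
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

pre = {(0, 4, "DRUG"), (10, 18, "DOSE"), (25, 30, "DRUG")}
ref = {(0, 4, "DRUG"), (10, 18, "DOSE"), (40, 45, "ROUTE")}
print(round(span_f1(pre, ref), 2))  # 2 of 3 spans match on each side -> 0.67
```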

  6. 33 CFR 164.70 - Definitions.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... Allen-Morgan City Alternative Route, and that part of the Atchafalaya River above its junction with the Port Allen-Morgan City Alternative Route including the Old River and the Red River and those waters... Navigation reasonably available and that apply to the vessel's transit. Hand-annotated river maps from the U...

  7. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task

    PubMed Central

    Arighi, Cecilia N.; Carterette, Ben; Cohen, K. Bretonnel; Krallinger, Martin; Wilbur, W. John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E.; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L.; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P.; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O.; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV. PMID:23327936

  8. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

    PubMed

    Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel; Krallinger, Martin; Wilbur, W John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

  9. Alga-PrAS (Algal Protein Annotation Suite): A Database of Comprehensive Annotation in Algal Proteomes

    PubMed Central

    Kurotani, Atsushi; Yamada, Yutaka

    2017-01-01

    Algae are smaller organisms than land plants and offer clear advantages in research over terrestrial species in terms of rapid production, short generation time and varied commercial applications. Thus, studies investigating the practical development of effective algal production are important and will improve our understanding of both aquatic and terrestrial plants. In this study we estimated multiple physicochemical and secondary structural properties of protein sequences, the predicted presence of post-translational modification (PTM) sites, and subcellular localization using a total of 510,123 protein sequences from the proteomes of 31 algal and three plant species. Algal species were broadly selected from green and red algae, glaucophytes, oomycetes, diatoms and other microalgal groups. The results were deposited in the Algal Protein Annotation Suite database (Alga-PrAS; http://alga-pras.riken.jp/), which can be freely accessed online. PMID:28069893

  10. Active learning reduces annotation time for clinical concept extraction.

    PubMed

    Kholghi, Mahnoosh; Sitbon, Laurianne; Zuccon, Guido; Nguyen, Anthony

    2017-10-01

    To investigate: (1) the annotation time savings achieved by various active learning query strategies compared to supervised learning and a random sampling baseline, and (2) the benefits of active learning-assisted pre-annotations in accelerating the manual annotation process compared to de novo annotation. The train and test sets of the concept extraction task in the i2b2/VA 2010 challenge contain 73 and 120 discharge summary reports, respectively, provided by the Beth Israel institute. The 73 reports were used in user study experiments for manual annotation. First, all sequences within the 73 reports were manually annotated from scratch. Next, active learning models were built to generate pre-annotations for the sequences selected by a query strategy. The annotation/reviewing time per sequence was recorded. The 120 test reports were used to measure the effectiveness of the active learning models. When annotating from scratch, active learning reduced the annotation time by up to 35% and 28% compared to a fully supervised approach and a random sampling baseline, respectively. Reviewing active learning-assisted pre-annotations resulted in a further 20% reduction of the annotation time compared to de novo annotation. The number of concepts that require manual annotation is a good indicator of the annotation time for various active learning approaches, as demonstrated by the high correlation between time rate and concept annotation rate. Active learning has a key role in reducing the time required to manually annotate domain concepts from clinical free text, whether annotating from scratch or reviewing active learning-assisted pre-annotations. Copyright © 2017 Elsevier B.V. All rights reserved.
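
    Pool-based active learning of this kind repeatedly scores the unlabeled pool and queries an annotator for the examples the model is least sure about. A toy sketch of one query round using least-confidence sampling; the one-dimensional "model" and oracle are purely illustrative, not the paper's system.

```python
import random

def uncertainty(prob):
    """Least-confidence score: higher means the model is less sure."""
    return 1.0 - max(prob, 1.0 - prob)

def active_learning_round(model_prob, unlabeled, oracle, batch_size=5):
    """One iteration of pool-based active learning: rank the unlabeled pool
    by uncertainty, query the oracle for the top batch, and return the
    newly labeled examples."""
    ranked = sorted(unlabeled, key=lambda x: uncertainty(model_prob(x)),
                    reverse=True)
    batch = ranked[:batch_size]
    return [(x, oracle(x)) for x in batch]

# Toy setup: the true boundary is at 0.5, and the current model's predicted
# probability is simply the feature value, so uncertainty peaks near 0.5.
random.seed(0)
pool = [random.random() for _ in range(100)]
labeled = active_learning_round(lambda x: x, pool,
                                oracle=lambda x: int(x > 0.5))
print(labeled)
```

In a real pipeline the newly labeled batch would be added to the training set and the model retrained before the next round, mirroring the iterative setup described in the abstract.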

  11. Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles.

    PubMed

    Cohen, K Bretonnel; Lanfranchi, Arrick; Choi, Miji Joo-Young; Bada, Michael; Baumgartner, William A; Panteleyeva, Natalya; Verspoor, Karin; Palmer, Martha; Hunter, Lawrence E

    2017-08-17

    Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations. The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached an F-measure of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus. The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. 
The work raised issues concerning the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain, because they refer to specific classes in domain-specific ontologies. The comparison of a publicly available, well-understood coreference resolution system with a domain-adapted system produced results consistent with the notion that the requirements for successful coreference resolution in this genre differ substantially from those of the general domain, and also suggests that the baseline performance difference is quite large.
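
    The B3 metric cited above scores, for each mention, the overlap between the chain containing it in the system response and the chain containing it in the gold key, then averages over all mentions. A minimal sketch, with chains as lists of mention identifiers and the assumption (made here for simplicity) that both sides cover the same mentions:

```python
def b_cubed(response, key):
    """B3 precision, recall and F for coreference chains."""
    def chain_of(mention, chains):
        return next(c for c in chains if mention in c)

    mentions = {m for c in key for m in c}
    p = r = 0.0
    for m in mentions:
        rc, kc = chain_of(m, response), chain_of(m, key)
        overlap = len(set(rc) & set(kc))
        p += overlap / len(rc)   # per-mention precision contribution
        r += overlap / len(kc)   # per-mention recall contribution
    n = len(mentions)
    precision, recall = p / n, r / n
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Gold key: {a,b,c} and {d,e}; system response merged c into the wrong chain.
key = [["a", "b", "c"], ["d", "e"]]
response = [["a", "b"], ["c", "d", "e"]]
print(b_cubed(response, key))
```

Entity-based CEAF, the stricter metric in the agreement range above, instead finds an optimal one-to-one alignment between response and key chains before scoring.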

  12. An ensemble deep learning based approach for red lesion detection in fundus images.

    PubMed

    Orlando, José Ignacio; Prokofyeva, Elena; Del Fresno, Mariana; Blaschko, Matthew B

    2018-01-01

    Diabetic retinopathy (DR) is one of the leading causes of preventable blindness in the world. Its earliest signs are red lesions, a general term that groups both microaneurysms (MAs) and hemorrhages (HEs). In daily clinical practice, these lesions are manually detected by physicians using fundus photographs. However, this task is tedious and time consuming, and requires intensive effort due to the small size of the lesions and their lack of contrast. Computer-assisted diagnosis of DR based on red lesion detection is being actively explored because it improves both clinicians' consistency and accuracy. Moreover, it provides comprehensive feedback that is easy for physicians to assess. Several methods for detecting red lesions have been proposed in the literature, most of them based on characterizing lesion candidates using hand-crafted features and classifying them into true or false positive detections. Deep learning based approaches, by contrast, are scarce in this domain due to the high expense of annotating the lesions manually. In this paper we propose a novel method for red lesion detection that combines deep learned features with domain knowledge. Features learned by a convolutional neural network (CNN) are augmented by incorporating hand-crafted features. This ensemble vector of descriptors is then used to identify true lesion candidates with a Random Forest classifier. We empirically observed that combining both sources of information significantly improves results with respect to using each approach separately. Furthermore, our method reported the highest performance on a per-lesion basis on DIARETDB1 and e-ophtha, and for screening and need for referral on MESSIDOR compared to a second human expert. The results highlight that integrating manually engineered approaches with deep learned features improves results when the networks are trained from lesion-level annotated data. 
An open source implementation of our system is publicly available at https://github.com/ignaciorlando/red-lesion-detection. Copyright © 2017 Elsevier B.V. All rights reserved.

  13. Essential Annotation Schema for Ecology (EASE)—A framework supporting the efficient data annotation and faceted navigation in ecology

    PubMed Central

    Eichenberg, David; Liebergesell, Mario; König-Ries, Birgitta; Wirth, Christian

    2017-01-01

    Ecology has become a data-intensive science over the last decades and often relies on the reuse of data in cross-experimental analyses. However, finding data that qualifies for reuse in a specific context can be challenging. It requires good quality metadata and annotations as well as efficient search strategies. To date, full-text search (often on the metadata only) is the most widely used search strategy, although it is known to be inaccurate. Faceted navigation provides a filter mechanism based on fine-granular metadata, categorizing search objects along numeric and categorical parameters relevant to their discovery. Selecting from these parameters during a full-text search creates a system of filters that allows the results to be refined and improved towards greater relevance. We developed a framework for efficient annotation and faceted navigation in ecology. It consists of an XML schema for storing the annotation of search objects and is accompanied by a vocabulary focused on ecology to support the annotation process. The framework consolidates ideas originating from widely accepted metadata standards, textbooks, scientific literature and vocabularies, as well as expert knowledge contributed by researchers from ecology and adjacent disciplines. PMID:29023519
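
    At its core, faceted navigation reduces to filtering records on categorical values and numeric ranges. A toy sketch with hypothetical ecological metadata records; the field names are invented for illustration and are not the EASE schema.

```python
def faceted_filter(records, **facets):
    """Faceted navigation sketch: each keyword argument names a facet and
    the value it must take; numeric facets may use (lo, hi) range tuples."""
    def matches(record, facet, wanted):
        value = record.get(facet)
        if isinstance(wanted, tuple):        # numeric range facet
            lo, hi = wanted
            return value is not None and lo <= value <= hi
        return value == wanted               # categorical facet

    return [r for r in records if all(matches(r, f, w)
                                      for f, w in facets.items())]

# Hypothetical dataset metadata records.
datasets = [
    {"id": "D1", "habitat": "forest", "year": 2014, "taxon": "plants"},
    {"id": "D2", "habitat": "grassland", "year": 2019, "taxon": "plants"},
    {"id": "D3", "habitat": "forest", "year": 2020, "taxon": "insects"},
]
hits = faceted_filter(datasets, habitat="forest", year=(2015, 2021))
print([r["id"] for r in hits])  # -> ['D3']
```

Each additional facet narrows the result set, which is what lets a user refine an initially broad full-text search toward the datasets that actually qualify for reuse.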

  14. Enhancement of the Shared Graphics Workspace.

    DTIC Science & Technology

    1987-12-31

    participants to share videodisc images and computer graphics displayed in color, and text and facsimile information displayed in black on amber. They...could annotate the information in up to five colors and print the annotated version at both sites, using a standard fax machine. The SGWS also used a fax...system to display a document, whether text or photo: the camera scans the document, digitizes the data, and sends it via direct memory access (DMA) to

  15. Marky: a tool supporting annotation consistency in multi-user and iterative document annotation projects.

    PubMed

    Pérez-Pérez, Martín; Glez-Peña, Daniel; Fdez-Riverola, Florentino; Lourenço, Anália

    2015-02-01

    Document annotation is a key task in the development of Text Mining methods and applications. High quality annotated corpora are invaluable, but their preparation requires a considerable amount of resources and time. Although the existing annotation tools offer good user interaction interfaces to domain experts, project management and quality control abilities are still limited. Therefore, the current work introduces Marky, a new Web-based document annotation tool equipped to manage multi-user and iterative projects, and to evaluate annotation quality throughout the project life cycle. At the core, Marky is a Web application based on the open source CakePHP framework. User interface relies on HTML5 and CSS3 technologies. Rangy library assists in browser-independent implementation of common DOM range and selection tasks, and Ajax and JQuery technologies are used to enhance user-system interaction. Marky grants solid management of inter- and intra-annotator work. Most notably, its annotation tracking system supports systematic and on-demand agreement analysis and annotation amendment. Each annotator may work over documents as usual, but all the annotations made are saved by the tracking system and may be further compared. So, the project administrator is able to evaluate annotation consistency among annotators and across rounds of annotation, while annotators are able to reject or amend subsets of annotations made in previous rounds. As a side effect, the tracking system minimises resource and time consumption. Marky is a novel environment for managing multi-user and iterative document annotation projects. Compared to other tools, Marky offers a similar visually intuitive annotation experience while providing unique means to minimise annotation effort and enforce annotation quality, and therefore corpus consistency. Marky is freely available for non-commercial use at http://sing.ei.uvigo.es/marky. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.

  16. CUILESS2016: a clinical corpus applying compositional normalization of text mentions.

    PubMed

    Osborne, John D; Neu, Matthew B; Danila, Maria I; Solorio, Thamar; Bethard, Steven J

    2018-01-10

    Traditionally, text mention normalization corpora have normalized concepts to single ontology identifiers ("pre-coordinated concepts"). Less frequently, normalization corpora have used concepts with multiple identifiers ("post-coordinated concepts"), but the additional identifiers have been restricted to a defined set of relationships to the core concept. This approach limits the ability of the normalization process to express semantic meaning. We generated a freely available corpus using post-coordinated concepts without a defined set of relationships, which we term "compositional concepts", to evaluate their use in clinical text. We annotated 5397 disorder mentions from the ShARe corpus to SNOMED CT that were previously normalized as "CUI-less" in the "SemEval-2015 Task 14" shared task because they lacked a pre-coordinated mapping. Unlike the previous normalization method, we do not restrict concept mappings to a particular set of Unified Medical Language System (UMLS) semantic types, and we allow normalization to multiple UMLS Concept Unique Identifiers (CUIs). We computed annotator agreement and assessed semantic coverage with this method. We generated the largest clinical text normalization corpus to date with mappings to multiple identifiers and made it freely available. All but 8 of the 5397 disorder mentions were normalized using this methodology. Annotator agreement ranged from 52.4% using the strictest metric (exact matching) to 78.2% using a hierarchical agreement that measures the overlap of shared ancestral nodes. Our results provide evidence that compositional concepts can increase semantic coverage in clinical text. To our knowledge, we provide the first freely available corpus of compositional concept annotation in clinical text.
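
    A hierarchical agreement of the kind described can be computed as the overlap of the ancestor sets of two annotated concepts rather than requiring exact identity. A minimal sketch over a hypothetical mini-hierarchy standing in for SNOMED CT is-a links; the exact metric used by the authors may differ.

```python
def ancestors(code, parents):
    """All ancestral nodes of a concept, following the is-a parent map."""
    seen = set()
    stack = [code]
    while stack:
        node = stack.pop()
        for p in parents.get(node, []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def hierarchical_agreement(a, b, parents):
    """Jaccard overlap of the two annotations' ancestor sets (each
    including the concept itself): a lenient alternative to exact match."""
    sa = ancestors(a, parents) | {a}
    sb = ancestors(b, parents) | {b}
    return len(sa & sb) / len(sa | sb)

# Hypothetical mini-hierarchy; real SNOMED CT uses numeric concept IDs.
parents = {
    "bacterial pneumonia": ["pneumonia"],
    "viral pneumonia": ["pneumonia"],
    "pneumonia": ["lung disease"],
    "lung disease": ["disease"],
}
print(hierarchical_agreement("bacterial pneumonia", "viral pneumonia",
                             parents))  # siblings share 3 of 5 nodes -> 0.6
```

Under exact matching the two annotations above would score 0, which illustrates why a hierarchical metric yields higher agreement figures on fine-grained clinical codes.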

  17. Automatic extraction of angiogenesis bioprocess from text

    PubMed Central

    Wang, Xinglong; McKendrick, Iain; Barrett, Ian; Dix, Ian; French, Tim; Tsujii, Jun'ichi; Ananiadou, Sophia

    2011-01-01

    Motivation: Understanding key biological processes (bioprocesses) and their relationships with constituent biological entities and pharmaceutical agents is crucial for drug design and discovery. One way to harvest such information is to search the literature. However, bioprocesses are difficult to capture because they may occur in text in a variety of textual expressions. Moreover, a bioprocess is often composed of a series of bioevents, where a bioevent denotes changes to one or a group of cells involved in the bioprocess. Such bioevents are often used to refer to bioprocesses in text, which current techniques, relying solely on specialized lexicons, struggle to find. Results: This article presents a range of methods for finding bioprocess terms and events. To facilitate the study, we built a gold standard corpus in which terms and events related to angiogenesis, a key biological process of the growth of new blood vessels, were annotated. Statistics of the annotated corpus revealed that over 36% of the text expressions that referred to angiogenesis appeared as events. The proposed methods respectively employed domain-specific vocabularies, a manually annotated corpus and unstructured domain-specific documents. Evaluation results showed that, while a supervised machine-learning model yielded the best precision, recall and F1 scores, the other methods achieved reasonable performance at a lower development cost. Availability: The angiogenesis vocabularies, gold standard corpus, annotation guidelines and software described in this article are available at http://text0.mib.man.ac.uk/~mbassxw2/angiogenesis/ Contact: xinglong.wang@gmail.com PMID:21821664

  18. Galax (Galax urceolata):an annotated bibliography

    Treesearch

    Mary L. Predny; James L. Chamberlain

    2005-01-01

    Galax (Galax urceolata) is an evergreen groundcover harvested for use in the floral industry. The plant’s durable, shiny green leaves turn red in the fall and are popular background foliage in floral arrangements. People living in the mountains of North Carolina and other rural Appalachian locations have harvested (pulled) galax to supplement their...

  19. Active Learning with a Human in The Loop

    DTIC Science & Technology

    2012-11-01

    handwritten digits (LeCun et al. [1998]). In the red curve the model is built iteratively: at each iteration the twenty examples with the lowest...continuum. The most we can say about MUC annotation is that it’s simple enough that other tasks are likely to impose a heavier load on the user for

  20. An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB.

    PubMed

    Bell, Michael J; Gillespie, Colin S; Swan, Daniel; Lord, Phillip

    2012-09-15

    Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly, time-consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we address in this article. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study, since it allows us to compare annotation change both over time and between automated and manually curated annotations. By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, consistent with existing studies of quality on free-text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Source code is available at the authors' website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation. phillip.lord@newcastle.ac.uk.
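
    Zipf's law predicts that word frequency falls off as a power of rank, so a least-squares fit of log frequency against log rank on natural free text gives a slope near -1. A minimal sketch of that fit on a hypothetical annotation string; this is illustrative only, not the authors' released code.

```python
import math
from collections import Counter

def zipf_slope(text):
    """Fit log(frequency) against log(rank) by ordinary least squares;
    Zipfian word reuse yields a slope near -1."""
    freqs = sorted(Counter(text.lower().split()).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope of the least-squares line through (log rank, log frequency).
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Hypothetical bulk annotation text with roughly Zipfian word reuse.
annotation = ("binds atp " * 8 + "membrane protein " * 4
              + "putative kinase " * 2 + "uncharacterized fragment")
print(zipf_slope(annotation))
```

Comparing the fitted slope across cohorts of annotations (e.g. manual versus automated, or by record age) is one way to turn word-reuse statistics into a quality signal of the kind the abstract describes.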

  1. Representing annotation compositionality and provenance for the Semantic Web

    PubMed Central

    2013-01-01

    Background Though the annotation of digital artifacts with metadata has a long history, the bulk of that work focuses on the association of single terms or concepts to single targets. As annotation efforts expand to capture more complex information, annotations will need to be able to refer to knowledge structures formally defined in terms of more atomic knowledge structures. Existing provenance efforts in the Semantic Web domain primarily focus on tracking provenance at the level of whole triples and do not provide enough detail to track how individual triple elements of annotations were derived from triple elements of other annotations. Results We present a task- and domain-independent ontological model for capturing annotations and their linkage to their denoted knowledge representations, which can be singular concepts or more complex sets of assertions. We have implemented this model as an extension of the Information Artifact Ontology in OWL and made it freely available, and we show how it can be integrated with several prominent annotation and provenance models. We present several application areas for the model, ranging from linguistic annotation of text to the annotation of disease-associations in genome sequences. Conclusions With this model, progressively more complex annotations can be composed from other annotations, and the provenance of compositional annotations can be represented at the annotation level or at the level of individual elements of the RDF triples composing the annotations. This in turn allows for progressively richer annotations to be constructed from previous annotation efforts, the precise provenance recording of which facilitates evidence-based inference and error tracking. PMID:24268021
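The idea of composing annotations from earlier annotations while tracking their provenance can be sketched with a toy in-memory store. This is plain Python, not the paper's OWL/IAO model; the ids, triples, and traversal below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A toy annotation: subject-predicate-object triples plus the ids
    of earlier annotations it was composed from (its provenance)."""
    aid: str
    triples: list
    derived_from: list = field(default_factory=list)

store = {}

def add_annotation(aid, triples, derived_from=()):
    store[aid] = Annotation(aid, list(triples), list(derived_from))
    return store[aid]

def provenance_chain(aid):
    """Walk derived_from links back to the original annotations."""
    seen, stack = [], [aid]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(store[current].derived_from)
    return seen

# Two atomic annotations, and a composite one built from both.
add_annotation("a1", [("BRCA1", "associated_with", "breast_cancer")])
add_annotation("a2", [("BRCA1", "located_on", "chr17")])
add_annotation("a3", [("breast_cancer", "has_locus", "chr17")],
               derived_from=["a1", "a2"])
chain = provenance_chain("a3")
```

The paper's point is finer-grained than this sketch: it tracks how individual triple *elements* derive from elements of other annotations, not just whole-annotation lineage.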

  2. Closely Reading Informational Texts in the Primary Grades

    ERIC Educational Resources Information Center

    Fisher, Douglas; Frey, Nancy

    2014-01-01

In this article we discuss the differences between close reading in the primary grades and upper elementary grades. We focus on text selection, initial reading, repeated reading, annotation, text-based discussions, and responding to texts.

  3. Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature.

    PubMed

    Müller, H-M; Van Auken, K M; Li, Y; Sternberg, P W

    2018-03-09

    The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. 
It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world. Textpresso Central URL: http://www.textpresso.org/tpc.

  4. Annotation an effective device for student feedback: a critical review of the literature.

    PubMed

    Ball, Elaine C

    2010-05-01

The paper examines hand-written annotation, its many features, difficulties and strengths as a feedback tool. It extends and clarifies what modest evidence is in the public domain and offers an evaluation of how to use annotation effectively in the support of student feedback [Marshall, C.M., 1998a. The Future of Annotation in a Digital (paper) World. Presented at the 35th Annual GLSLIS Clinic: Successes and Failures of Digital Libraries, June 20-24, University of Illinois at Urbana-Champaign, March 24, pp. 1-20; Marshall, C.M., 1998b. Toward an ecology of hypertext annotation. Hypertext. In: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia, June 20-24, Pittsburgh, Pennsylvania, US, pp. 40-49; Wolfe, J.L., Neuwirth, C.M., 2001. From the margins to the centre: the future of annotation. Journal of Business and Technical Communication, 15(3), 333-371; Diyanni, R., 2002. One Hundred Great Essays. Addison-Wesley, New York; Wolfe, J.L., 2002. Marginal pedagogy: how annotated texts affect writing-from-source texts. Written Communication, 19(2), 297-333; Liu, K., 2006. Annotation as an index to critical writing. Urban Education, 41, 192-207; Feito, A., Donahue, P., 2008. Minding the gap: annotation as preparation for discussion. Arts and Humanities in Higher Education, 7(3), 295-307; Ball, E., 2009. A participatory action research study on handwritten annotation feedback and its impact on staff and students. Systemic Practice and Action Research, 22(2), 111-124; Ball, E., Franks, H., McGrath, M., Leigh, J., 2009. Annotation is a valuable tool to enhance learning and assessment in student essays. Nurse Education Today, 29(3), 284-291]. Although a significant number of studies examine annotation, this is largely related to on-line tools and computer mediated communication and not hand-written annotation as comment, phrase or sign written on the student essay to provide critique.
Little systematic research has been conducted to consider how this latter form of annotation influences student learning and assessment or, indeed, helps tutors to employ better annotative practices [Juwah, C., Macfarlane-Dick, D., Matthew, B., Nicol, D., Ross, D., Smith, B., 2004. Enhancing student learning through effective formative feedback. The Higher Education Academy, 1-40; Jewitt, C., Kress, G., 2005. English in classrooms: only write down what you need to know: annotation for what? English in Education, 39(1), 5-18]. There is little evidence on ways to heighten students' self-awareness when their essays are returned with annotated feedback [Storch, N., Tapper, J., 1997. Student annotations: what NNS and NS university students say about their own writing. Journal of Second Language Writing, 6(3), 245-265]. The literature review clarifies forms of annotation as feedback practice and offers a summary of the challenges and usefulness of annotation. Copyright 2009. Published by Elsevier Ltd.

  5. The language of gene ontology: a Zipf's law analysis.

    PubMed

    Kalankesh, Leila Ranandeh; Stevens, Robert; Brass, Andy

    2012-06-07

    Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.

  6. Annotation analysis for testing drug safety signals using unstructured clinical notes

    PubMed Central

    2012-01-01

    Background The electronic surveillance for adverse drug events is largely based upon the analysis of coded data from reporting systems. Yet, the vast majority of electronic health data lies embedded within the free text of clinical notes and is not gathered into centralized repositories. With the increasing access to large volumes of electronic medical data—in particular the clinical notes—it may be possible to computationally encode and to test drug safety signals in an active manner. Results We describe the application of simple annotation tools on clinical text and the mining of the resulting annotations to compute the risk of getting a myocardial infarction for patients with rheumatoid arthritis that take Vioxx. Our analysis clearly reveals elevated risks for myocardial infarction in rheumatoid arthritis patients taking Vioxx (odds ratio 2.06) before 2005. Conclusions Our results show that it is possible to apply annotation analysis methods for testing hypotheses about drug safety using electronic medical records. PMID:22541596
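The risk estimate this record reports is an odds ratio computed over a 2x2 exposure/outcome table. A minimal sketch follows; the counts are invented for illustration, chosen only so the ratio lands near the reported 2.06.

```python
def odds_ratio(exposed_cases, exposed_noncases,
               unexposed_cases, unexposed_noncases):
    """Odds ratio from a 2x2 table: (a/b) / (c/d) = (a*d) / (b*c)."""
    return ((exposed_cases * unexposed_noncases) /
            (exposed_noncases * unexposed_cases))

# Hypothetical counts: rheumatoid arthritis patients on the drug who did
# (30) or did not (70) have an MI, versus patients not on the drug
# (100 MI, 480 no MI). These numbers are made up for the example.
or_value = odds_ratio(30, 70, 100, 480)
```

An odds ratio near 2 means the odds of the outcome are roughly doubled in the exposed group, which is the kind of signal the annotation analysis was designed to surface.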

  7. BioTextQuest: a web-based biomedical text mining suite for concept discovery.

    PubMed

    Papanikolaou, Nikolas; Pafilis, Evangelos; Nikolaou, Stavros; Ouzounis, Christos A; Iliopoulos, Ioannis; Promponas, Vasilis J

    2011-12-01

BioTextQuest combines automated discovery of significant terms in article clusters with structured knowledge annotation, via Named Entity Recognition services, offering interactive user-friendly visualization. The terms labeling each document cluster are illustrated as a tag cloud and semantically annotated according to biological entity, and a list of document titles enables users to simultaneously compare terms and documents of each cluster, facilitating concept association and hypothesis generation. BioTextQuest allows customization of analysis parameters, e.g. clustering/stemming algorithms, exclusion of documents/significant terms, to better match the biological question addressed. http://biotextquest.biol.ucy.ac.cy vprobon@ucy.ac.cy; iliopj@med.uoc.gr Supplementary data are available at Bioinformatics online.

  8. TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations

    PubMed Central

    Miyao, Yusuke; Collier, Nigel

    2017-01-01

Background Work on pharmacovigilance systems using texts from PubMed and Twitter typically targets different elements and uses different annotation guidelines, resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner. Objective This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed. Methods We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines. Results The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes. Conclusions We present a corpus that is unique in its characteristics as this is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest for researchers willing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that given the comprehensive set of drug names and the annotated entities and relations, this corpus becomes a standard resource to compare results from different pharmacovigilance studies in the area of NLP. PMID:28468748

  9. TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

    PubMed

    Alvaro, Nestor; Miyao, Yusuke; Collier, Nigel

    2017-05-03

Work on pharmacovigilance systems using texts from PubMed and Twitter typically targets different elements and uses different annotation guidelines, resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner. This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed. We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines. The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes. We present a corpus that is unique in its characteristics as this is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest for researchers willing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that given the comprehensive set of drug names and the annotated entities and relations, this corpus becomes a standard resource to compare results from different pharmacovigilance studies in the area of NLP. ©Nestor Alvaro, Yusuke Miyao, Nigel Collier. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 03.05.2017.

  10. Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation.

    PubMed

    Clark, Alex M; Bunin, Barry A; Litterman, Nadia K; Schürer, Stephan C; Visser, Ubbo

    2014-01-01

    Bioinformatics and computer aided drug design rely on the curation of a large number of protocols for biological assays that measure the ability of potential drugs to achieve a therapeutic effect. These assay protocols are generally published by scientists in the form of plain text, which needs to be more precisely annotated in order to be useful to software methods. We have developed a pragmatic approach to describing assays according to the semantic definitions of the BioAssay Ontology (BAO) project, using a hybrid of machine learning based on natural language processing, and a simplified user interface designed to help scientists curate their data with minimum effort. We have carried out this work based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to annotate their protocols manually is unrealistic. By combining these approaches, we have created an effective prototype for which annotation of bioassay text within the domain of the training set can be accomplished very quickly. Well-trained annotations require single-click user approval, while annotations from outside the training set domain can be identified using the search feature of a well-designed user interface, and subsequently used to improve the underlying models. By drastically reducing the time required for scientists to annotate their assays, we can realistically advocate for semantic annotation to become a standard part of the publication process. Once even a small proportion of the public body of bioassay data is marked up, bioinformatics researchers can begin to construct sophisticated and useful searching and analysis algorithms that will provide a diverse and powerful set of tools for drug discovery researchers.
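The hybrid workflow this record describes, where high-confidence model suggestions need only single-click approval and the rest are routed to manual search, can be sketched as a simple confidence triage. The threshold, labels, and example texts below are invented for illustration and are not from the BAO tooling.

```python
def triage(predictions, threshold=0.8):
    """Split model predictions into one-click approvals (high
    confidence) and items routed to manual search/annotation."""
    approve, manual = [], []
    for text, label, confidence in predictions:
        (approve if confidence >= threshold else manual).append((text, label))
    return approve, manual

# Hypothetical (text, predicted label, model confidence) triples.
preds = [
    ("measure IC50 in HepG2 cells", "cell-based assay", 0.95),
    ("binding to purified kinase", "biochemical assay", 0.91),
    ("novel phenotype readout", "unknown", 0.40),
]
approve, manual = triage(preds)
```

In the paper's framing, items resolved manually are then fed back as training data, so the manual queue shrinks as the models improve.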

  11. Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation

    PubMed Central

    Bunin, Barry A.; Litterman, Nadia K.; Schürer, Stephan C.; Visser, Ubbo

    2014-01-01

    Bioinformatics and computer aided drug design rely on the curation of a large number of protocols for biological assays that measure the ability of potential drugs to achieve a therapeutic effect. These assay protocols are generally published by scientists in the form of plain text, which needs to be more precisely annotated in order to be useful to software methods. We have developed a pragmatic approach to describing assays according to the semantic definitions of the BioAssay Ontology (BAO) project, using a hybrid of machine learning based on natural language processing, and a simplified user interface designed to help scientists curate their data with minimum effort. We have carried out this work based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to annotate their protocols manually is unrealistic. By combining these approaches, we have created an effective prototype for which annotation of bioassay text within the domain of the training set can be accomplished very quickly. Well-trained annotations require single-click user approval, while annotations from outside the training set domain can be identified using the search feature of a well-designed user interface, and subsequently used to improve the underlying models. By drastically reducing the time required for scientists to annotate their assays, we can realistically advocate for semantic annotation to become a standard part of the publication process. Once even a small proportion of the public body of bioassay data is marked up, bioinformatics researchers can begin to construct sophisticated and useful searching and analysis algorithms that will provide a diverse and powerful set of tools for drug discovery researchers. PMID:25165633

  12. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

    PubMed Central

    2012-01-01

    Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications. PMID:22901054
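Evaluations like the one this record describes typically score named entity recognition by exact span match against the gold annotation. A minimal precision/recall/F1 sketch over (start, end, type) spans follows; the gold and predicted spans are invented.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 for NER output,
    where each span is a (start, end, type) tuple."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One predicted span has the right offsets but the wrong type,
# so it counts as both a false positive and a false negative.
gold = [(0, 5, "GENE"), (10, 18, "DISEASE"), (25, 30, "GENE")]
pred = [(0, 5, "GENE"), (10, 18, "GENE"), (25, 30, "GENE")]
p, r, f = span_prf(gold, pred)
```

Differences between published scores and CRAFT scores, as reported in the record, come from exactly this kind of strict scoring applied to harder, full-text input.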

  13. Inferring Higher Functional Information for RIKEN Mouse Full-Length cDNA Clones With FACTS

    PubMed Central

    Nagashima, Takeshi; Silva, Diego G.; Petrovsky, Nikolai; Socha, Luis A.; Suzuki, Harukazu; Saito, Rintaro; Kasukawa, Takeya; Kurochkin, Igor V.; Konagaya, Akihiko; Schönbach, Christian

    2003-01-01

    FACTS (Functional Association/Annotation of cDNA Clones from Text/Sequence Sources) is a semiautomated knowledge discovery and annotation system that integrates molecular function information derived from sequence analysis results (sequence inferred) with functional information extracted from text. Text-inferred information was extracted from keyword-based retrievals of MEDLINE abstracts and by matching of gene or protein names to OMIM, BIND, and DIP database entries. Using FACTS, we found that 47.5% of the 60,770 RIKEN mouse cDNA FANTOM2 clone annotations were informative for text searches. MEDLINE queries yielded molecular interaction-containing sentences for 23.1% of the clones. When disease MeSH and GO terms were matched with retrieved abstracts, 22.7% of clones were associated with potential diseases, and 32.5% with GO identifiers. A significant number (23.5%) of disease MeSH-associated clones were also found to have a hereditary disease association (OMIM Morbidmap). Inferred neoplastic and nervous system disease represented 49.6% and 36.0% of disease MeSH-associated clones, respectively. A comparison of sequence-based GO assignments with informative text-based GO assignments revealed that for 78.2% of clones, identical GO assignments were provided for that clone by either method, whereas for 21.8% of clones, the assignments differed. In contrast, for OMIM assignments, only 28.5% of clones had identical sequence-based and text-based OMIM assignments. Sequence, sentence, and term-based functional associations are included in the FACTS database (http://facts.gsc.riken.go.jp/), which permits results to be annotated and explored through web-accessible keyword and sequence search interfaces. The FACTS database will be a critical tool for investigating the functional complexity of the mouse transcriptome, cDNA-inferred interactome (molecular interactions), and pathome (pathologies). PMID:12819151

  14. Benchmarking infrastructure for mutation text mining

    PubMed Central

    2014-01-01

    Background Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. Conclusion We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption. PMID:24568600

  15. Benchmarking infrastructure for mutation text mining.

    PubMed

    Klein, Artjom; Riazanov, Alexandre; Hindle, Matthew M; Baker, Christopher Jo

    2014-02-25

    Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.

  16. PANNZER2: a rapid functional annotation web server.

    PubMed

    Törönen, Petri; Medlar, Alan; Holm, Liisa

    2018-05-08

    The unprecedented growth of high-throughput sequencing has led to an ever-widening annotation gap in protein databases. While computational prediction methods are available to make up the shortfall, a majority of public web servers are hindered by practical limitations and poor performance. Here, we introduce PANNZER2 (Protein ANNotation with Z-scoRE), a fast functional annotation web server that provides both Gene Ontology (GO) annotations and free text description predictions. PANNZER2 uses SANSparallel to perform high-performance homology searches, making bulk annotation based on sequence similarity practical. PANNZER2 can output GO annotations from multiple scoring functions, enabling users to see which predictions are robust across predictors. Finally, PANNZER2 predictions scored within the top 10 methods for molecular function and biological process in the CAFA2 NK-full benchmark. The PANNZER2 web server is updated on a monthly schedule and is accessible at http://ekhidna2.biocenter.helsinki.fi/sanspanz/. The source code is available under the GNU Public Licence v3.

  17. Game-powered machine learning

    PubMed Central

    Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert

    2012-01-01

    Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the “wisdom of the crowds.” Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., “funky jazz with saxophone,” “spooky electronica,” etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data. PMID:22460786

  18. Game-powered machine learning.

    PubMed

    Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert

    2012-04-24

    Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the "wisdom of the crowds." Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., "funky jazz with saxophone," "spooky electronica," etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data.

  19. Assessment of disease named entity recognition on a corpus of annotated sentences.

    PubMed

    Jimeno, Antonio; Jimenez-Ruiz, Ernesto; Lee, Vivian; Gaudan, Sylvain; Berlanga, Rafael; Rebholz-Schuhmann, Dietrich

    2008-04-11

    In recent years, the recognition of semantic types from the biomedical scientific literature has been focused on named entities like protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types like diseases have not received the same level of attention. Different solutions have been proposed to identify disease named entities in the scientific literature. While matching the terminology with language patterns suffers from low recall (e.g., Whatizit) other solutions make use of morpho-syntactic features to better cover the full scope of terminological variability (e.g., MetaMap). Currently, MetaMap that is provided from the National Library of Medicine (NLM) is the state of the art solution for the annotation of concepts from UMLS (Unified Medical Language System) in the literature. Nonetheless, its performance has not yet been assessed on an annotated corpus. In addition, little effort has been invested so far to generate an annotated dataset that links disease entities in text to disease entries in a database, thesaurus or ontology and that could serve as a gold standard to benchmark text mining solutions. As part of our research work, we have taken a corpus that has been delivered in the past for the identification of associations of genes to diseases based on the UMLS Metathesaurus and we have reprocessed and re-annotated the corpus. We have gathered annotations for disease entities from two curators, analyzed their disagreement (0.51 in the kappa-statistic) and composed a single annotated corpus for public use. Thereafter, three solutions for disease named entity recognition including MetaMap have been applied to the corpus to automatically annotate it with UMLS Metathesaurus concepts. The resulting annotations have been benchmarked to compare their performance. The annotated corpus is publicly available at ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases and can serve as a benchmark to other systems. 
In addition, we found that dictionary look-up already provides competitive results, indicating that the use of disease terminology is highly standardized throughout the terminologies and the literature. MetaMap generates precise results at the expense of insufficient recall, while our statistical method obtains better recall at a lower precision rate. Even better precision is achieved by combining at least two of the three methods, but this approach again lowers recall. Altogether, our analysis gives a better understanding of the complexity of disease annotations in the literature. MetaMap and the dictionary-based approach are available through the Whatizit web service infrastructure (Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through Web services: Calling Whatizit. Bioinformatics 2008, 24:296-298).
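The inter-annotator disagreement reported above is measured with Cohen's kappa, which discounts agreement expected by chance. A minimal sketch of the computation; the two label sequences below are hypothetical, not drawn from the corpus:

```python
# Cohen's kappa between two annotators' per-mention labels.
from collections import Counter

def cohens_kappa(a, b):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of six candidate mentions by two curators.
ann1 = ["disease", "other", "disease", "disease", "other", "other"]
ann2 = ["disease", "disease", "disease", "other", "other", "other"]
print(round(cohens_kappa(ann1, ann2), 2))
```

With four of six mentions agreeing and balanced label marginals, chance agreement is 0.5 and kappa works out to 0.33.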

  20. Galileo's Discorsi as a Tool for the Analytical Art.

    PubMed

    Raphael, Renee Jennifer

    2015-01-01

    A heretofore overlooked response to Galileo's 1638 Discorsi is described by examining two extant copies of the text (one which has received little attention in the historiography, the other apparently unknown) which are heavily annotated. It is first demonstrated that these copies contain annotations made by Seth Ward and Sir Christopher Wren. This article then examines one feature of Ward's and Wren's responses to the Discorsi, namely their decision to re-write several of Galileo's geometrical demonstrations into the language of symbolic algebra. It is argued that this type of active reading of period mathematical texts may have been part of the regular scholarly and pedagogical practices of early modern British mathematicians like Ward and Wren. A set of Appendices contains a transcription and translation of the analytical solutions found in these annotated copies.

  1. In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access.

    PubMed

    Garcia Castro, Leyla Jael; Berlanga, Rafael; Garcia, Alexander

    2015-10-01

Although full-text articles are provided by the publishers in electronic formats, it remains a challenge to find related work beyond the title and abstract context. Identifying related articles based on their abstract is indeed a good starting point; this process is straightforward and does not consume as many resources as full-text based similarity would require. However, further analyses may require in-depth understanding of the full content. Two articles with highly related abstracts can be substantially different regarding the full content. How similarity differs when considering title-and-abstract versus full-text and which semantic similarity metric provides better results when dealing with full-text articles are the main issues addressed in this manuscript. We have benchmarked three similarity metrics (BM25, PMRA, and Cosine) to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated variations in similarity values based on title-and-abstract against those relying on full-text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. Initially, we used entity recognition software to semantically annotate titles and abstracts as well as full-text with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts, term frequency, and inverse document frequency; we then applied various similarity metrics to those document profiles. We considered correlation, precision, recall, and F1 in order to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses in order to understand how similarity varies when considering full-text articles.
We have found that the PubMed Related Articles similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision, close to 1; in all cases the concept-based metrics performed better than the word-stem-based one. Our experiments show that similarity values vary when considering only title-and-abstract versus full-text similarity. Therefore, analyses based on full-text become useful when a given study requires going beyond the title and abstract, particularly regarding connectivity across articles. Visualization available at ljgarcia.github.io/semsim.benchmark/, data available at http://dx.doi.org/10.5281/zenodo.13323. Copyright © 2015 Elsevier Inc. All rights reserved.
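One of the benchmarked metrics, cosine similarity over concept-based document profiles, can be sketched as follows. The profiles map UMLS concept identifiers to tf-idf weights; the CUIs and weights below are hypothetical, chosen only to illustrate the shape of the computation:

```python
# Cosine similarity between two concept-based document profiles
# (UMLS concept identifier -> tf-idf weight).
import math

def cosine(p, q):
    dot = sum(w * q.get(c, 0.0) for c, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0

# Hypothetical profiles: each key is a UMLS CUI, each value a tf-idf weight.
doc_a = {"C0007852": 0.8, "C0302142": 0.5}
doc_b = {"C0007852": 0.6, "C0012634": 0.9}
print(round(cosine(doc_a, doc_b), 3))
```

Only the shared concept contributes to the numerator, so documents annotated with disjoint concept sets score zero regardless of their word overlap, which is what makes the concept-based variant behave differently from a word-stem-based one.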

  2. Finding Cervical Cancer Symptoms in Swedish Clinical Text using a Machine Learning Approach and NegEx

    PubMed Central

    Weegar, Rebecka; Kvist, Maria; Sundström, Karin; Brunak, Søren; Dalianis, Hercules

    2015-01-01

    Detection of early symptoms in cervical cancer is crucial for early treatment and survival. To find symptoms of cervical cancer in clinical text, Named Entity Recognition is needed. In this paper the Clinical Entity Finder, a machine-learning tool trained on annotated clinical text from a Swedish internal medicine emergency unit, is evaluated on cervical cancer records. The Clinical Entity Finder identifies entities of the types body part, finding and disorder and is extended with negation detection using the rule-based tool NegEx, to distinguish between negated and non-negated entities. To measure the performance of the tools on this new domain, two physicians annotated a set of clinical notes from the health records of cervical cancer patients. The inter-annotator agreement for finding, disorder and body part obtained an average F-score of 0.677 and the Clinical Entity Finder extended with NegEx had an average F-score of 0.667. PMID:26958270

  3. Finding Cervical Cancer Symptoms in Swedish Clinical Text using a Machine Learning Approach and NegEx.

    PubMed

    Weegar, Rebecka; Kvist, Maria; Sundström, Karin; Brunak, Søren; Dalianis, Hercules

    2015-01-01

    Detection of early symptoms in cervical cancer is crucial for early treatment and survival. To find symptoms of cervical cancer in clinical text, Named Entity Recognition is needed. In this paper the Clinical Entity Finder, a machine-learning tool trained on annotated clinical text from a Swedish internal medicine emergency unit, is evaluated on cervical cancer records. The Clinical Entity Finder identifies entities of the types body part, finding and disorder and is extended with negation detection using the rule-based tool NegEx, to distinguish between negated and non-negated entities. To measure the performance of the tools on this new domain, two physicians annotated a set of clinical notes from the health records of cervical cancer patients. The inter-annotator agreement for finding, disorder and body part obtained an average F-score of 0.677 and the Clinical Entity Finder extended with NegEx had an average F-score of 0.667.
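The NegEx extension described above is rule-based: an entity mention is treated as negated when a trigger phrase occurs within a small token window before it. A minimal sketch of that idea; the trigger list and sentences are illustrative English stand-ins, not the actual (Swedish) triggers used with the Clinical Entity Finder:

```python
# NegEx-style negation detection: a pre-negation trigger within a few
# tokens before the entity marks the entity as negated.
NEG_TRIGGERS = ["no", "denies", "without", "no sign of"]

def is_negated(sentence, entity, window=5):
    tokens = sentence.lower().split()
    ent_pos = tokens.index(entity.lower())
    left = " ".join(tokens[max(0, ent_pos - window):ent_pos])
    # Pad with spaces so triggers match whole words, not substrings.
    return any(f" {trig} " in f" {left} " for trig in NEG_TRIGGERS)

print(is_negated("Patient denies abdominal pain", "pain"))   # True
print(is_negated("Patient reports abdominal pain", "pain"))  # False
```

The real tool also handles post-negation triggers and window termination, but the core mechanism is this kind of trigger-and-scope rule rather than a learned model.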

  4. French as a Second Language. Annotated Bibliography of Learning Resources: Beginning Level. Early Childhood Services - Grade 12.

    ERIC Educational Resources Information Center

    Alberta Dept. of Education, Edmonton. Language Services Branch.

    This annotated bibliography of instructional resources for Alberta (Canada) introductory French second language teaching in early childhood, elementary, and secondary education consists of citations in 10 categories: audio/video recordings; communicative activity resources (primarily texts and workbooks); dictionaries and vocabulary handbooks;…

  5. Effective Writing Tasks and Feedback for the Internet Generation

    ERIC Educational Resources Information Center

    Buyse, Kris

    2012-01-01

    Teaching foreign language writing often lacks adjustments to the requirements of today's students of the "Internet Generation" (iGen): traditionally teachers set a--not very inspiring--topic, a deadline and then return a discouraging, manually underlined and/or annotated text without systematic labeling. The annotated document is then…

  6. BioC: a minimalist approach to interoperability for biomedical text processing

    PubMed Central

    Comeau, Donald C.; Islamaj Doğan, Rezarta; Ciccarese, Paolo; Cohen, Kevin Bretonnel; Krallinger, Martin; Leitner, Florian; Lu, Zhiyong; Peng, Yifan; Rinaldi, Fabio; Torii, Manabu; Valencia, Alfonso; Verspoor, Karin; Wiegers, Thomas C.; Wu, Cathy H.; Wilbur, W. John

    2013-01-01

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create human-labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented, including sentences, tokens, parts of speech, named entities such as genes or diseases, and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/ PMID:24048470
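The interchange format nests annotations inside passages inside documents. A sketch of writing and re-reading such a file with the standard library; the element names follow BioC's collection/document/passage/annotation layout, but the text, identifiers, and offsets are made up, and a real BioC file carries additional metadata (source, date, key) omitted here:

```python
# Build a minimal BioC-flavoured XML document with one annotated gene
# mention, serialize it, and parse it back.
import xml.etree.ElementTree as ET

coll = ET.Element("collection")
doc = ET.SubElement(coll, "document")
ET.SubElement(doc, "id").text = "12345"                     # hypothetical id
passage = ET.SubElement(doc, "passage")
ET.SubElement(passage, "offset").text = "0"
ET.SubElement(passage, "text").text = "BRCA1 is linked to breast cancer."
ann = ET.SubElement(passage, "annotation", id="A1")
ET.SubElement(ann, "infon", key="type").text = "gene"       # annotation type
ET.SubElement(ann, "location", offset="0", length="5")      # span in passage
ET.SubElement(ann, "text").text = "BRCA1"

xml_bytes = ET.tostring(coll)
parsed = ET.fromstring(xml_bytes)
print(parsed.find("./document/passage/annotation/text").text)
```

Because every annotation records its offset and length relative to the passage text, tools that never saw each other's code can still line their annotations up against the same characters.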

  7. AphidBase: A centralized bioinformatic resource for annotation of the pea aphid genome

    PubMed Central

    Legeai, Fabrice; Shigenobu, Shuji; Gauthier, Jean-Pierre; Colbourne, John; Rispe, Claude; Collin, Olivier; Richards, Stephen; Wilson, Alex C. C.; Tagu, Denis

    2015-01-01

AphidBase is a centralized bioinformatic resource that was developed to facilitate community annotation of the pea aphid genome by the International Aphid Genomics Consortium (IAGC). The AphidBase Information System, designed to organize and distribute genomic data and annotations for a large international community, was constructed using open source software tools from the Generic Model Organism Database (GMOD). The system includes Apollo and GBrowse utilities as well as a wiki, blast search capabilities and a full text search engine. AphidBase strongly supported community cooperation and coordination in the curation of gene models during community annotation of the pea aphid genome. AphidBase can be accessed at http://www.aphidbase.com. PMID:20482635

  8. GeneView: a comprehensive semantic search engine for PubMed.

    PubMed

    Thomas, Philippe; Starlinger, Johannes; Vowinkel, Alexander; Arzt, Sebastian; Leser, Ulf

    2012-07-01

    Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. The plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces too large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein-protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271,000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.

  9. Chemical Geology: An Annotated Bibliography. CEGS Programs Publication Number 11.

    ERIC Educational Resources Information Center

    Billings, Gale K.

    The annotated bibliography is intended to aid geologists whose primary background is not in geochemistry. The references thus range from chemistry texts to papers on complex geochemical applications. The emphasis has been on those books and papers concerned with the application of chemical concepts to geology. Citations are arranged topically to…

  10. The Africa Collection: An Annotated Historical Resource Bibliography for the Student of Africa.

    ERIC Educational Resources Information Center

    Lynn, Karen

This annotated bibliographic collection of resources on Africa, including non-fiction, fiction, texts, poetry, draft papers, addresses, periodicals, film, records, and travel agencies, is designed to aid secondary students and their teachers interested in research on Africa. An instructional approach is taken, drawing upon examples to demonstrate…

  11. Temporal Annotation in the Clinical Domain

    PubMed Central

    Styler, William F.; Bethard, Steven; Finan, Sean; Palmer, Martha; Pradhan, Sameer; de Groen, Piet C; Erickson, Brad; Miller, Timothy; Lin, Chen; Savova, Guergana; Pustejovsky, James

    2014-01-01

    This article discusses the requirements of a formal specification for the annotation of temporal information in clinical narratives. We discuss the implementation and extension of ISO-TimeML for annotating a corpus of clinical notes, known as the THYME corpus. To reflect the information task and the heavily inference-based reasoning demands in the domain, a new annotation guideline has been developed, “the THYME Guidelines to ISO-TimeML (THYME-TimeML)”. To clarify what relations merit annotation, we distinguish between linguistically-derived and inferentially-derived temporal orderings in the text. We also apply a top performing TempEval 2013 system against this new resource to measure the difficulty of adapting systems to the clinical domain. The corpus is available to the community and has been proposed for use in a SemEval 2015 task. PMID:29082229

  12. Combining rules, background knowledge and change patterns to maintain semantic annotations.

    PubMed

    Cardoso, Silvio Domingos; Chantal, Reynaud-Delaître; Da Silveira, Marcos; Pruski, Cédric

    2017-01-01

Knowledge Organization Systems (KOS) play a key role in enriching biomedical information in order to make it machine-understandable and shareable. This is done by annotating medical documents, or more specifically, associating concept labels from KOS with pieces of digital information, e.g., images or texts. However, the dynamic nature of KOS may impact the annotations, thus creating a mismatch between the evolved concept and the associated information. To solve this problem, methods to maintain the quality of the annotations are required. In this paper, we define a framework based on rules, background knowledge and change patterns to drive the annotation adaptation process. We experimentally evaluate the proposed approach in realistic case studies and demonstrate the overall performance of our approach in different KOS, considering the precision, recall, F1-score and AUC value of the system.

  13. Combining rules, background knowledge and change patterns to maintain semantic annotations

    PubMed Central

    Cardoso, Silvio Domingos; Chantal, Reynaud-Delaître; Da Silveira, Marcos; Pruski, Cédric

    2017-01-01

Knowledge Organization Systems (KOS) play a key role in enriching biomedical information in order to make it machine-understandable and shareable. This is done by annotating medical documents, or more specifically, associating concept labels from KOS with pieces of digital information, e.g., images or texts. However, the dynamic nature of KOS may impact the annotations, thus creating a mismatch between the evolved concept and the associated information. To solve this problem, methods to maintain the quality of the annotations are required. In this paper, we define a framework based on rules, background knowledge and change patterns to drive the annotation adaptation process. We experimentally evaluate the proposed approach in realistic case studies and demonstrate the overall performance of our approach in different KOS, considering the precision, recall, F1-score and AUC value of the system. PMID:29854115

  14. Use of Co-occurrences for Temporal Expressions Annotation

    NASA Astrophysics Data System (ADS)

    Craveiro, Olga; Macedo, Joaquim; Madeira, Henrique

The annotation or extraction of temporal information from text documents is becoming increasingly important in many natural language processing applications such as text summarization, information retrieval, and question answering. This paper presents an original method for easy recognition of temporal expressions in text documents. The method creates semantically classified temporal patterns, using word co-occurrences obtained from training corpora and a pre-defined set of seed keywords derived from the temporal references of the target language. Participation in a Portuguese named-entity evaluation contest showed promising effectiveness and efficiency results. This approach can be adapted to recognize other types of expressions or languages, within other contexts, by defining the suitable word sets and training corpora.

  15. Text Sets.

    ERIC Educational Resources Information Center

    Giorgis, Cyndi; Johnson, Nancy J.

    2002-01-01

    Presents annotations of approximately 30 titles grouped in text sets. Defines a text set as five to ten books on a particular topic or theme. Discusses books on the following topics: living creatures; pirates; physical appearance; natural disasters; and the Irish potato famine. (SG)

  16. Data Acquisition and Linguistic Resources

    NASA Astrophysics Data System (ADS)

    Strassel, Stephanie; Christianson, Caitlin; McCary, John; Staderman, William; Olive, Joseph

    All human language technology demands substantial quantities of data for system training and development, plus stable benchmark data to measure ongoing progress. While creation of high quality linguistic resources is both costly and time consuming, such data has the potential to profoundly impact not just a single evaluation program but language technology research in general. GALE's challenging performance targets demand linguistic data on a scale and complexity never before encountered. Resources cover multiple languages (Arabic, Chinese, and English) and multiple genres -- both structured (newswire and broadcast news) and unstructured (web text, including blogs and newsgroups, and broadcast conversation). These resources include significant volumes of monolingual text and speech, parallel text, and transcribed audio combined with multiple layers of linguistic annotation, ranging from word aligned parallel text and Treebanks to rich semantic annotation.

  17. Construction of an annotated corpus to support biomedical information extraction

    PubMed Central

    Thompson, Paul; Iqbal, Syed A; McNaught, John; Ananiadou, Sophia

    2009-01-01

    Background Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources. Results We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%. Conclusion The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). 
Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes. PMID:19852798

  18. Transcriptome analysis and metabolic profiling of green and red kale (Brassica oleracea var. acephala) seedlings.

    PubMed

    Jeon, Jin; Kim, Jae Kwang; Kim, HyeRan; Kim, Yeon Jeong; Park, Yun Ji; Kim, Sun Ju; Kim, Changsoo; Park, Sang Un

    2018-02-15

    Kale (Brassica oleracea var. acephala) is a rich source of numerous health-benefiting compounds, including vitamins, glucosinolates, phenolic compounds, and carotenoids. However, the genetic resources for exploiting the phyto-nutritional traits of kales are limited. To acquire precise information on secondary metabolites in kales, we performed a comprehensive analysis of the transcriptome and metabolome of green and red kale seedlings. Kale transcriptome datasets revealed 37,149 annotated genes and several secondary metabolite biosynthetic genes. HPLC analysis revealed 14 glucosinolates, 20 anthocyanins, 3 phenylpropanoids, and 6 carotenoids in the kale seedlings that were examined. Red kale contained more glucosinolates, anthocyanins, and phenylpropanoids than green kale, whereas the carotenoid contents were much higher in green kale than in red kale. Ultimately, our data will be a valuable resource for future research on kale bio-engineering and will provide basic information to define gene-to-metabolite networks in kale. Copyright © 2017 Elsevier Ltd. All rights reserved.

  19. German as a Second Language: Annotated Bibliography of Learning Resources, Grades 1-12.

    ERIC Educational Resources Information Center

    Alberta Dept. of Education, Edmonton. Language Services Branch.

    The annotated bibliography of print and non-print materials for students and teachers of German includes standard student texts, audiovisual materials, student and teacher references, and other media. It is intended to guide teachers in the selection of student and instructional materials for the teaching and learning of German at the elementary…

  20. Examining Students' Use of Online Annotation Tools in Support of Argumentative Reading

    ERIC Educational Resources Information Center

    Lu, Jingyan; Deng, Liping

    2013-01-01

    This study examined how students in a Hong Kong high school used Diigo, an online annotation tool, to support their argumentative reading activities. Two year 10 classes, a high-performance class (HPC) and an ordinary-performance class (OPC), highlighted passages of text and wrote and attached sticky notes to them to clarify argumentation…

  1. Multi-label literature classification based on the Gene Ontology graph.

    PubMed

    Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua

    2008-12-08

    The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
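The graph-based classifiers above exploit the fact that GO terms are not independent labels: under the ontology's true-path rule, predicting a term also implies all of its ancestors. A minimal sketch of that closure step, over a three-term fragment of the is_a hierarchy (the parent links shown are illustrative, hand-copied rather than loaded from the ontology):

```python
# Close a predicted GO label set under the ancestor (is_a) relation.
GO_PARENTS = {                      # child term -> parent terms
    "GO:0006355": ["GO:0010468"],   # regulation of transcription
    "GO:0010468": ["GO:0065007"],   # regulation of gene expression
    "GO:0065007": [],               # biological regulation (root-ward)
}

def propagate(labels):
    """Return the label set extended with every reachable ancestor."""
    closed = set(labels)
    stack = list(labels)
    while stack:
        for parent in GO_PARENTS.get(stack.pop(), []):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed

print(sorted(propagate({"GO:0006355"})))
```

This is also why a graph-aware classifier can be "close" even when wrong: a prediction of GO:0010468 for a document truly about GO:0006355 lands one edge away in the hierarchy, which a flat classifier's loss cannot see.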

  2. Challenges for automatically extracting molecular interactions from full-text articles.

    PubMed

    McIntosh, Tara; Curran, James R

    2009-09-24

The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles. We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set. We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.

  3. GFFview: A Web Server for Parsing and Visualizing Annotation Information of Eukaryotic Genome.

    PubMed

    Deng, Feilong; Chen, Shi-Yi; Wu, Zhou-Lin; Hu, Yongsong; Jia, Xianbo; Lai, Song-Jia

    2017-10-01

Owing to the wide application of RNA sequencing (RNA-seq) technology, more and more eukaryotic genomes have been extensively annotated, covering gene structure, alternative splicing, and noncoding loci. Annotation information for a genome is typically stored as plain text in General Feature Format (GFF), which can be hundreds or thousands of megabytes in size. Manipulating GFF files is therefore a challenge for biologists without bioinformatics skills. In this study, we provide a web server (GFFview) for parsing the annotation information of eukaryotic genomes and generating statistical descriptions of six indices for visualization. GFFview is very useful for investigating the quality and differences of de novo assembled transcriptomes in RNA-seq studies.

  4. Overview of the gene ontology task at BioCreative IV.

    PubMed

    Mao, Yuqing; Van Auken, Kimberly; Li, Donghui; Arighi, Cecilia N; McQuilton, Peter; Hayman, G Thomas; Tweedie, Susan; Schaeffer, Mary L; Laulederkind, Stanley J F; Wang, Shur-Jen; Gobeill, Julien; Ruch, Patrick; Luu, Anh Tuan; Kim, Jung-Jae; Chiang, Jung-Hsien; Chen, Yu-De; Yang, Chia-Jung; Liu, Hongfang; Zhu, Dongqing; Li, Yanpeng; Yu, Hong; Emadzadeh, Ehsan; Gonzalez, Graciela; Chen, Jian-Ming; Dai, Hong-Jie; Lu, Zhiyong

    2014-01-01

    Gene ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. 
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation. http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.

  5. AutoFACT: An Automatic Functional Annotation and Classification Tool

    PubMed Central

    Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud

    2005-01-01

    Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, Gene Ontology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857
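AutoFACT's second key feature, choosing the most informative description across several BLAST reports, can be illustrated with a small sketch. The placeholder keyword list, tuple layout, and tie-breaking below are assumptions for illustration, not AutoFACT's actual rules:

```python
# Illustrative sketch: among top BLAST hits from several databases, prefer
# the first description that is biologically informative, i.e. not a
# generic placeholder. Keyword list and data layout are hypothetical.

UNINFORMATIVE = ("hypothetical protein", "unknown", "unnamed protein product",
                 "predicted protein")

def best_annotation(hits):
    """hits: (database, description, e_value) tuples, pre-sorted by
    significance. Return the first informative description; if every hit
    is a generic placeholder, fall back to the top hit."""
    if not hits:
        return None, None
    for db, desc, _evalue in hits:
        if not any(term in desc.lower() for term in UNINFORMATIVE):
            return db, desc
    return hits[0][0], hits[0][1]

hits = [("nr", "hypothetical protein ABC123", 1e-40),
        ("uniref90", "ATP synthase subunit beta", 1e-35)]
print(best_annotation(hits))  # -> ('uniref90', 'ATP synthase subunit beta')
```

This is why combining several databases beats the top-BLAST-hit rule: the single best-scoring hit is often a placeholder, while a slightly weaker hit carries the informative description.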

  6. Toward the Automated Scoring of Written Arguments: Developing an Innovative Approach for Annotation. Research Report. ETS RR-17-11

    ERIC Educational Resources Information Center

    Song, Yi; Deane, Paul; Beigman Klebanov, Beata

    2017-01-01

    This project focuses on laying the foundations for automated analysis of argumentation schemes, supporting identification and classification of the arguments being made in a text, for the purpose of scoring the quality of written analyses of arguments. We developed annotation protocols for 20 argument prompts from a college-level test under the…

  7. A Protocol for Annotating Parser Differences. Research Report. ETS RR-16-02

    ERIC Educational Resources Information Center

    Bruno, James V.; Cahill, Aoife; Gyawali, Binod

    2016-01-01

    We present an annotation scheme for classifying differences in the outputs of syntactic constituency parsers when a gold standard is unavailable or undesired, as in the case of texts written by nonnative speakers of English. We discuss its automated implementation and the results of a case study that uses the scheme to choose a parser best suited…

  8. A UIMA wrapper for the NCBO annotator.

    PubMed

    Roeder, Christophe; Jonquet, Clement; Shah, Nigam H; Baumgartner, William A; Verspoor, Karin; Hunter, Lawrence

    2010-07-15

    The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator, an ontology-based annotation service, to make it available as a component in UIMA workflows. This wrapper is freely available on the web at http://bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.

  9. Transcriptome deep-sequencing and clustering of expressed isoforms from Favia corals

    PubMed Central

    2013-01-01

    Background Genomic and transcriptomic sequence data are essential tools for tackling ecological problems. Using an approach that combines next-generation sequencing, de novo transcriptome assembly, gene annotation and synthetic gene construction, we identify and cluster the protein families from Favia corals from the northern Red Sea. Results We obtained 80 million 75 bp paired-end cDNA reads from two Favia adult samples collected at 65 m (Fav1, Fav2) on the Illumina GA platform, and generated two de novo assemblies using ABySS and CAP3. After removing redundancy and filtering out low quality reads, our transcriptome datasets contained 58,268 (Fav1) and 62,469 (Fav2) contigs longer than 100 bp, with N50 values of 1,665 bp and 1,439 bp, respectively. Using the proteome of the sea anemone Nematostella vectensis as a reference, we were able to annotate almost 20% of each dataset using reciprocal homology searches. Homologous clustering of these annotated transcripts allowed us to divide them into 7,186 (Fav1) and 6,862 (Fav2) homologous transcript clusters (E-value ≤ 2e-30). Functional annotation categories were assigned to homologous clusters using the functional annotation of Nematostella vectensis. General annotation of the assembled transcripts was improved by 1-3% using the Acropora digitifera proteome. In addition, we screened these transcript isoform clusters for fluorescent protein (FP) homologs and identified seven potential FP homologs in Fav1, and four in Fav2. These transcripts were validated as bona fide FP transcripts via robust fluorescence heterologous expression. Annotation of the assembled contigs revealed that 1.34% and 1.61% (in Fav1 and Fav2, respectively) of the total assembled contigs likely originated from the corals’ algal symbiont, Symbiodinium spp. Conclusions Here we present a study to identify the homologous transcript isoform clusters from the transcriptome of Favia corals using a distantly related reference proteome. 
Furthermore, the symbiont-derived transcripts were isolated from the datasets and their contribution quantified. This is the first annotated transcriptome of the genus Favia, representing a major increase in the genomic resources available for this important family of corals. PMID:23937070

  10. On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions.

    PubMed

    Oronoz, Maite; Gojenola, Koldo; Pérez, Alicia; de Ilarraza, Arantza Díaz; Casillas, Arantza

    2015-08-01

    The advances achieved in Natural Language Processing make it possible to automatically mine information from electronically created documents. Many Natural Language Processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we present the creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance. The experts mainly annotated entities related to diseases and drugs, but also relationships between entities indicating adverse drug reaction events. To help the experts in the annotation task, we adapted a general corpus linguistic analyzer to the medical domain. The quality of the annotation process in the IxaMed-GS corpus has been assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events. In addition, the corpus has been used for the automatic extraction of adverse drug reaction events using machine learning. Copyright © 2015 Elsevier Inc. All rights reserved.

  11. A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text

    PubMed Central

    Miwa, Makoto; Ohta, Tomoko; Rak, Rafal; Rowley, Andrew; Kell, Douglas B.; Pyysalo, Sampo; Ananiadou, Sophia

    2013-01-01

    Motivation: To create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature. These methods should identify and order documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge. Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches. Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The successful query extraction and ranking methods were used to update our existing pathway search system, PathText. Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/. Contact: makoto.miwa@manchester.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23813008

  12. Integrating UIMA annotators in a web-based text processing framework.

    PubMed

    Chen, Xiang; Arnold, Corey W

    2013-01-01

    The Unstructured Information Management Architecture (UIMA) [1] framework is a growing platform for natural language processing (NLP) applications. However, such applications may be difficult for non-technical users to deploy. This project presents a web-based framework that wraps UIMA-based annotator systems into a graphical user interface for researchers and clinicians, and a web service for developers. An annotator that extracts data elements from lung cancer radiology reports is presented to illustrate the use of the system. Annotation results from the web system can be exported to multiple formats for users to utilize in other aspects of their research and workflow. This project demonstrates the benefits of a lay-user interface for complex NLP applications. Efforts such as this can lead to increased interest and support for NLP work in the clinical domain.

  13. Developing a disease outbreak event corpus.

    PubMed

    Conway, Mike; Kawazoe, Ai; Chanlekha, Hutchatai; Collier, Nigel

    2010-09-28

    In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking. This study seeks to create a "gold standard" data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area. We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event--in the context of our annotation scheme--consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain salient concepts (eg, international travel, species, and food contamination). The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration. In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus were designed both with the particular evaluation requirements of the BioCaster system in mind as well as the wider need for further evaluation resources in this growing research area.
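The minimal event structure described above (geographical information plus a disease name, with optional domain-salient attributes) might be modeled as follows. The field names are hypothetical illustrations, not the BioCaster scheme's actual attribute names:

```python
from dataclasses import dataclass

# Sketch of the minimal event record the annotation scheme requires:
# geographical information and a disease name, with optional
# domain-salient attributes. Field names are hypothetical.

@dataclass
class OutbreakEvent:
    country: str
    disease: str
    province: str = ""            # optional finer-grained location
    international_travel: bool = False
    species: str = ""             # e.g. affected animal species

event = OutbreakEvent(country="Vietnam", province="Ha Tay",
                      disease="avian influenza", species="poultry")
print(event.country, event.disease)  # -> Vietnam avian influenza
```

A corpus of such records (394 events over 200 documents here) is what lets an evaluation script compare a system's extracted events against the gold standard field by field.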

  14. Comparative Transcriptome Analysis of Genes Involved in Anthocyanin Biosynthesis in the Red and Yellow Fruits of Sweet Cherry (Prunus avium L.)

    PubMed Central

    Wei, Hairong; Chen, Xin; Zong, Xiaojuan; Shu, Huairui; Gao, Dongsheng; Liu, Qingzhong

    2015-01-01

    Background Fruit color is one of the most important economic traits of the sweet cherry (Prunus avium L.). The red coloration of sweet cherry fruit is mainly attributed to anthocyanins. However, limited information is available regarding the molecular mechanisms underlying anthocyanin biosynthesis and its regulation in sweet cherry. Methodology/Principal Findings In this study, a reference transcriptome of P. avium L. was sequenced and annotated to identify the transcriptional determinants of fruit color. Normalized cDNA libraries from red and yellow fruits were sequenced using the next-generation Illumina/Solexa sequencing platform and de novo assembly. Over 66 million high-quality reads were assembled into 43,128 unigenes using a combined assembly strategy. Then a total of 22,452 unigenes were compared to public databases using homology searches, and 20,095 of these unigenes were annotated in the Nr protein database. Furthermore, transcriptome differences between the four stages of fruit ripening were analyzed using Illumina digital gene expression (DGE) profiling. Biological pathway analysis revealed that 72 unigenes were involved in anthocyanin biosynthesis. The expression patterns of unigenes encoding phenylalanine ammonia-lyase (PAL), 4-coumarate-CoA ligase (4CL), chalcone synthase (CHS), chalcone isomerase (CHI), flavanone 3-hydroxylase (F3H), flavanone 3’-hydroxylase (F3’H), dihydroflavonol 4-reductase (DFR), anthocyanidin synthase (ANS) and UDP glucose: flavonol 3-O-glucosyltransferase (UFGT) during fruit ripening differed between red and yellow fruit. In addition, we identified some transcription factor families (such as MYB, bHLH and WD40) that may control anthocyanin biosynthesis. We confirmed the altered expression levels of eighteen unigenes that encode anthocyanin biosynthetic enzymes and transcription factors using quantitative real-time PCR (qRT-PCR). 
Conclusions/Significance The obtained sweet cherry transcriptome and DGE profiling data provide comprehensive gene expression information that lends insights into the molecular mechanisms underlying anthocyanin biosynthesis. These results will provide a platform for further functional genomic research on this fruit crop. PMID:25799516

  15. MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence.

    PubMed

    Liu, Ke; Peng, Shengwen; Wu, Junqiu; Zhai, Chengxiang; Mamitsuka, Hiroshi; Zhu, Shanfeng

    2015-06-15

    Medical Subject Headings (MeSHs) are used by National Library of Medicine (NLM) to index almost all citations in MEDLINE, which greatly facilitates the applications of biomedical information retrieval and text mining. To reduce the time and financial cost of manual annotation, NLM has developed a software package, Medical Text Indexer (MTI), for assisting MeSH annotation, which uses k-nearest neighbors (KNN), pattern matching and indexing rules. Other types of information, such as prediction by MeSH classifiers (trained separately), can also be used for automatic MeSH annotation. However, existing methods cannot effectively integrate multiple evidence for MeSH annotation. We propose a novel framework, MeSHLabeler, to integrate multiple evidence for accurate MeSH annotation by using 'learning to rank'. Evidence includes numerous predictions from MeSH classifiers, KNN, pattern matching, MTI and the correlation between different MeSH terms, etc. Each MeSH classifier is trained independently, and thus prediction scores from different classifiers are incomparable. To address this issue, we have developed an effective score normalization procedure to improve the prediction accuracy. MeSHLabeler won the first place in Task 2A of 2014 BioASQ challenge, achieving the Micro F-measure of 0.6248 for 9,040 citations provided by the BioASQ challenge. Note that this accuracy is around 9.15% higher than 0.5724, obtained by MTI. The software is available upon request. © The Author 2015. Published by Oxford University Press.
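Because each classifier is trained independently, its raw scores live on different scales. One simple way to make them comparable before combining, shown below, is to replace raw scores with normalized ranks; this is an illustrative stand-in for the general idea, not MeSHLabeler's actual normalization procedure:

```python
# Sketch: make scores from independently trained classifiers comparable
# by rank normalization, then combine them as weighted evidence.
# The weights here are fixed by hand for illustration; a learning-to-rank
# model would learn them from data.

def rank_normalize(scores):
    """Map raw scores to (0, 1] by rank: the top score gets 1.0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(scores)
    normalized = [0.0] * n
    for rank, i in enumerate(order):
        normalized[i] = (n - rank) / n
    return normalized

def combine_evidence(score_lists, weights):
    """Weighted sum of rank-normalized scores from several classifiers."""
    normed = [rank_normalize(s) for s in score_lists]
    return [sum(w * ns[i] for w, ns in zip(weights, normed))
            for i in range(len(score_lists[0]))]

clf_a = [12.0, 3.5, 7.1]   # raw SVM margins (arbitrary scale)
clf_b = [0.9, 0.2, 0.4]    # probabilities from another model
print(combine_evidence([clf_a, clf_b], [0.5, 0.5]))
```

Rank normalization discards the scale information entirely, which is why frameworks in this space typically learn the combination ("learning to rank") rather than fixing the weights by hand.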

  17. Applying Active Learning to Assertion Classification of Concepts in Clinical Text

    PubMed Central

    Chen, Yukun; Mani, Subramani; Xu, Hua

    2012-01-01

    Supervised machine learning methods for clinical natural language processing (NLP) research require a large number of annotated samples, which are very expensive to build because of the involvement of physicians. Active learning, an approach that actively samples from a large pool, provides an alternative solution. Its major goal in classification is to reduce the annotation effort while maintaining the quality of the predictive model. However, few studies have investigated its uses in clinical NLP. This paper reports an application of active learning to a clinical text classification task: to determine the assertion status of clinical concepts. The annotated corpus for the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge was used in this study. We implemented several existing and newly developed active learning algorithms and assessed their uses. The outcome is reported in the global ALC score, based on the Area under the average Learning Curve of the AUC (Area Under the Curve) score. Results showed that when the same number of annotated samples was used, active learning strategies could generate better classification models (best ALC – 0.7715) than the passive learning method (random sampling) (ALC – 0.7411). Moreover, to achieve the same classification performance, active learning strategies required fewer samples than the random sampling method. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort. PMID:22127105
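The pool-based sampling the study describes can be sketched in a few lines. The uncertainty-sampling rule below (pick the pool items whose predicted probability is closest to 0.5) is one common active learning heuristic, not necessarily the study's best-performing algorithm; the probabilities and batch size are illustrative:

```python
# Sketch of pool-based active learning with uncertainty sampling.
# The probabilities below stand in for a real classifier's predictions;
# in practice the model is retrained after each batch of newly
# annotated samples and the pool is re-scored.

def uncertainty_sample(probs, k):
    """Return indices of the k pool items whose predicted probability is
    closest to 0.5, i.e. where the model is least certain."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

def active_learning_loop(pool_probs, rounds, batch):
    """Repeatedly send the most uncertain pool items to the annotator."""
    pool = dict(enumerate(pool_probs))        # id -> predicted probability
    selected = []
    for _ in range(rounds):
        ids = sorted(pool)
        picks = uncertainty_sample([pool[i] for i in ids], batch)
        for p in picks:
            selected.append(ids[p])
            del pool[ids[p]]                  # item leaves the unlabeled pool
    return selected

probs = [0.95, 0.52, 0.10, 0.48, 0.80, 0.33]
print(uncertainty_sample(probs, 2))  # -> [1, 3]
```

With a real model, the pool would be re-scored each round; that feedback loop is what lets active learning reach a target AUC with far fewer labeled samples than random sampling, as the study's 32-vs-12 comparison illustrates.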

  18. Ontology modularization to improve semantic medical image annotation.

    PubMed

    Wennerberg, Pinar; Schulz, Klaus; Buitelaar, Paul

    2011-02-01

    Searching for medical images and patient reports is a significant challenge in a clinical setting. The contents of such documents are often not described in sufficient detail, thus making it difficult to utilize the inherent wealth of information contained within them. Semantic image annotation addresses this problem by describing the contents of images and reports using medical ontologies. Medical images and patient reports are then linked to each other through common annotations. Subsequently, search algorithms can more effectively find related sets of documents on the basis of these semantic descriptions. A prerequisite to realizing such a semantic search engine is that the data contained within should have been previously annotated with concepts from medical ontologies. One major challenge in this regard is the size and complexity of medical ontologies as annotation sources. Manual annotation is particularly time-consuming and labor-intensive in a clinical environment. In this article we propose an approach to reducing the size of clinical ontologies for more efficient manual image and text annotation. More precisely, our goal is to identify smaller fragments of a large anatomy ontology that are relevant for annotating medical images from patients suffering from lymphoma. Our work is in the area of ontology modularization, which is a recent and active field of research. We describe our approach, methods and data set in detail and we discuss our results. Copyright © 2010 Elsevier Inc. All rights reserved.

  19. Self-evaluation and peer-feedback of medical students' communication skills using a web-based video annotation system. Exploring content and specificity.

    PubMed

    Hulsman, Robert L; van der Vloodt, Jane

    2015-03-01

    Self-evaluation and peer-feedback are important strategies within the reflective practice paradigm for the development and maintenance of professional competencies like medical communication. Characteristics of the self-evaluation and peer-feedback annotations of medical students' video-recorded communication skills were analyzed. Twenty-five year 4 medical students recorded history-taking consultations with a simulated patient, uploaded the video to a web-based platform, and marked and annotated positive and negative events. Peers reviewed the video and self-evaluations and provided feedback. Analyzed were the number of marked positive and negative annotations and the amount of text entered. Topics and specificity of the annotations were coded and analyzed qualitatively. Students annotated on average more negative than positive events. Additional peer-feedback was more often positive. Topics most often related to structuring the consultation. Students were most critical about their biomedical topics. Negative annotations were more specific than positive annotations. Self-evaluations were more specific than peer-feedback and both showed a significant correlation. Four response patterns were detected that negatively bias specificity assessment ratings. Teaching students to be more specific in their self-evaluations may be effective for receiving more specific peer-feedback. Videofragmentrating is a convenient tool for bringing reflective practice activities like self-evaluation and peer-feedback into classroom teaching of clinical skills. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.

  20. Natural-Annotation-based Unsupervised Construction of Korean-Chinese Domain Dictionary

    NASA Astrophysics Data System (ADS)

    Liu, Wuying; Wang, Lin

    2018-03-01

    The large-scale bilingual parallel resource is significant to statistical learning and deep learning in natural language processing. This paper addresses the automatic construction of a Korean-Chinese domain dictionary, and presents a novel unsupervised construction method based on natural annotation in the raw corpus. We first extract all Korean-Chinese word pairs from Korean texts according to natural annotations, then transform the traditional Chinese characters into simplified ones, and finally distill a bilingual domain dictionary by retrieving the simplified Chinese words in an extra Chinese domain dictionary. The experimental results show that our method can automatically build multiple Korean-Chinese domain dictionaries efficiently.
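A minimal sketch of what mining such "natural annotations" can look like: Korean text often glosses a Sino-Korean word with its Hanja/Chinese form in parentheses, and a regular expression can pair the two. The pattern and sample sentence below are illustrative assumptions, not the paper's actual extraction rules:

```python
import re

# Illustrative sketch: pair a Hangul word with an immediately following
# parenthesized CJK (Hanja/Chinese) gloss, e.g. "어휘(語彙)".
# Unicode ranges: Hangul syllables U+AC00-U+D7A3, CJK ideographs U+4E00-U+9FFF.

PAIR = re.compile(r'([\uac00-\ud7a3]+)\(([\u4e00-\u9fff]+)\)')

def extract_pairs(text):
    """Return (korean, chinese) word pairs found via parenthetical glosses."""
    return PAIR.findall(text)

sample = "사전(辭典)은 어휘(語彙)를 모은 책이다."
print(extract_pairs(sample))  # -> [('사전', '辭典'), ('어휘', '語彙')]
```

The paper's remaining steps (traditional-to-simplified conversion and filtering against an existing Chinese domain dictionary) would then be applied to the extracted pairs.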

  1. Identification of candidate genes involved in the sugar metabolism and accumulation during pear fruit post-harvest ripening of 'Red Clapp's Favorite' (Pyrus communis L.) by transcriptome analysis.

    PubMed

    Wang, Long; Chen, Yun; Wang, Suke; Xue, Huabai; Su, Yanli; Yang, Jian; Li, Xiugen

    2018-01-01

    Pear (Pyrus spp.) is a popular fruit that is commercially cultivated in most temperate regions. In fruits, sugar metabolism and accumulation are important factors for fruit organoleptic quality. Post-harvest ripening is a special feature of 'Red Clapp's Favorite'. In this study, transcriptome sequencing based on the Illumina platform generated 23.8-35.8 million reads from nine cDNA libraries constructed using RNAs from the 'Red Clapp's Favorite' pear variety with different treatments, in which 2629 new genes were discovered, and 2121 of them were annotated. A total of 2146, 3650, and 1830 DEGs were identified in the respective comparisons. Moreover, the gene expression patterns of 8 unigenes related to sugar metabolism were revealed by qPCR. The main constituents of soluble sugars were fructose and glucose after pear fruit post-harvest ripening, and five unigenes involved in sugar metabolism were discovered. Our study not only provides a large-scale assessment of transcriptome resources of 'Red Clapp's Favorite' but also lays the foundation for further research into genes correlated with sugar metabolism.

  2. Context and Domain Knowledge Enhanced Entity Spotting in Informal Text

    NASA Astrophysics Data System (ADS)

    Gruhl, Daniel; Nagarajan, Meena; Pieper, Jan; Robson, Christine; Sheth, Amit

    This paper explores the application of restricted relationship graphs (RDF) and statistical NLP techniques to improve named entity annotation in challenging Informal English domains. We validate our approach using on-line forums discussing popular music. Named entity annotation is particularly difficult in this domain because it is characterized by a large number of ambiguous entities, such as the Madonna album "Music" or Lily Allen's pop hit "Smile".

  3. MicroScope: a platform for microbial genome annotation and comparative genomics

    PubMed Central

    Vallenet, D.; Engelen, S.; Mornico, D.; Cruveiller, S.; Fleury, L.; Lajus, A.; Rouy, Z.; Roche, D.; Salvignol, G.; Scarpelli, C.; Médigue, C.

    2009-01-01

    The initial outcome of genome sequencing is the creation of long text strings written in a four letter alphabet. The role of in silico sequence analysis is to assist biologists in the act of associating biological knowledge with these sequences, allowing investigators to make inferences and predictions that can be tested experimentally. A wide variety of software is available to the scientific community, and can be used to identify genomic objects, before predicting their biological functions. However, only a limited number of biologically interesting features can be revealed from an isolated sequence. Comparative genomics tools, on the other hand, by bringing together the information contained in numerous genomes simultaneously, allow annotators to make inferences based on the idea that evolution and natural selection are central to the definition of all biological processes. We have developed the MicroScope platform in order to offer a web-based framework for the systematic and efficient revision of microbial genome annotation and comparative analysis (http://www.genoscope.cns.fr/agc/microscope). Starting with the description of the flow chart of the annotation processes implemented in the MicroScope pipeline, and the development of traditional and novel microbial annotation and comparative analysis tools, this article emphasizes the essential role of expert annotation as a complement of automatic annotation. Several examples illustrate the use of implemented tools for the review and curation of annotations of both new and publicly available microbial genomes within MicroScope’s rich integrated genome framework. The platform is used as a viewer in order to browse updated annotation information of available microbial genomes (more than 440 organisms to date), and in the context of new annotation projects (117 bacterial genomes). 
The human expertise gathered in the MicroScope database (about 280,000 independent annotations) contributes to improve the quality of microbial genome annotation, especially for genomes initially analyzed by automatic procedures alone. Database URLs: http://www.genoscope.cns.fr/agc/mage and http://www.genoscope.cns.fr/agc/microcyc PMID:20157493

  4. MicroScope: a platform for microbial genome annotation and comparative genomics.

    PubMed

    Vallenet, D; Engelen, S; Mornico, D; Cruveiller, S; Fleury, L; Lajus, A; Rouy, Z; Roche, D; Salvignol, G; Scarpelli, C; Médigue, C

    2009-01-01

    The initial outcome of genome sequencing is the creation of long text strings written in a four letter alphabet. The role of in silico sequence analysis is to assist biologists in the act of associating biological knowledge with these sequences, allowing investigators to make inferences and predictions that can be tested experimentally. A wide variety of software is available to the scientific community, and can be used to identify genomic objects, before predicting their biological functions. However, only a limited number of biologically interesting features can be revealed from an isolated sequence. Comparative genomics tools, on the other hand, by bringing together the information contained in numerous genomes simultaneously, allow annotators to make inferences based on the idea that evolution and natural selection are central to the definition of all biological processes. We have developed the MicroScope platform in order to offer a web-based framework for the systematic and efficient revision of microbial genome annotation and comparative analysis (http://www.genoscope.cns.fr/agc/microscope). Starting with the description of the flow chart of the annotation processes implemented in the MicroScope pipeline, and the development of traditional and novel microbial annotation and comparative analysis tools, this article emphasizes the essential role of expert annotation as a complement of automatic annotation. Several examples illustrate the use of implemented tools for the review and curation of annotations of both new and publicly available microbial genomes within MicroScope's rich integrated genome framework. The platform is used as a viewer in order to browse updated annotation information of available microbial genomes (more than 440 organisms to date), and in the context of new annotation projects (117 bacterial genomes). 
The human expertise gathered in the MicroScope database (about 280,000 independent annotations) helps improve the quality of microbial genome annotation, especially for genomes initially analyzed by automatic procedures alone. Database URLs: http://www.genoscope.cns.fr/agc/mage and http://www.genoscope.cns.fr/agc/microcyc.

  5. Elucidating high-dimensional cancer hallmark annotation via enriched ontology.

    PubMed

    Yan, Shankai; Wong, Ka-Chun

    2017-09-01

    Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive text information, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, a novel approach, UDT-RF, which makes use of ontological features, is proposed. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selections for elucidating the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated by a multitude of performance metrics, revealing the full performance spectrum on the full set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. https://github.com/cskyan/chmannot. Copyright © 2017 Elsevier Inc. All rights reserved.
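    The ontology-graph feature expansion described in this abstract can be sketched as an ancestor closure over a term hierarchy. The mini-ontology, term names, and propagation rule below are hypothetical illustrations, not UDT-RF's actual construction:

```python
# Expand a document's MeSH-style term features with all ontology
# ancestors, so documents annotated with narrow terms also activate
# the broader concepts they imply (hypothetical mini-ontology).
MESH_PARENTS = {
    "Melanoma": ["Neoplasms"],
    "Neoplasms": ["Diseases"],
    "Diseases": [],
}

def expand_features(terms):
    """Return the ancestor closure of a set of ontology terms."""
    expanded = set(terms)
    stack = list(terms)
    while stack:
        for parent in MESH_PARENTS.get(stack.pop(), []):
            if parent not in expanded:
                expanded.add(parent)
                stack.append(parent)
    return expanded

print(sorted(expand_features({"Melanoma"})))
# ['Diseases', 'Melanoma', 'Neoplasms']
```

    The expanded sets would then serve as additional features for a classifier over the hallmark labels.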

  6. Transcriptome profiling of anthocyanin-related genes reveals effects of light intensity on anthocyanin biosynthesis in red leaf lettuce.

    PubMed

    Zhang, Yanzhao; Xu, Shuzhen; Cheng, Yanwei; Peng, Zhengfeng; Han, Jianming

    2018-01-01

    Red leaf lettuce (Lactuca sativa L.) is popular due to its high anthocyanin content, but poor leaf coloring often occurs under low light intensity. To reveal how anthocyanin production is affected by light intensity, we compared the transcriptome of L. sativa L. var. capitata under light intensities of 40 and 100 μmol m⁻² s⁻¹. A total of 62,111 unigenes were de novo assembled with an N50 of 1,681 bp, and 48,435 unigenes were functionally annotated in public databases. A total of 3,899 differentially expressed genes (DEGs) were detected, of which 1,377 unigenes were up-regulated and 2,552 unigenes were down-regulated in the high-light samples. By Kyoto Encyclopedia of Genes and Genomes enrichment analysis, the DEGs were significantly enriched in 14 pathways. Using gene annotation and phylogenetic analysis, we identified seven anthocyanin structural genes, including CHS, CHI, F3H, F3'H, DFR, ANS, and 3GT, and two anthocyanin transport genes, GST and MATE. In terms of anthocyanin regulatory genes, five MYBs and one bHLH gene were identified. An HY5 gene was discovered, which may respond to light signaling and regulate anthocyanin structural genes. These genes showed a log2FC of 2.7-9.0 under high irradiance, and were validated using quantitative real-time PCR. In conclusion, our results indicate transcriptome variance in red leaf lettuce under low and high light intensity, and reveal an anthocyanin biosynthesis and regulation pattern. The data should further help to unravel the molecular mechanisms by which light intensity influences anthocyanin production.
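    The log2FC values reported above translate directly into fold changes in expression. A minimal sketch of the arithmetic (the expression values are illustrative, not data from the study):

```python
import math

def log2_fold_change(expr_high, expr_low):
    """log2 ratio of expression under high vs. low light."""
    return math.log2(expr_high / expr_low)

# A log2FC of 2.7 corresponds to a ~6.5-fold increase in expression,
# and a log2FC of 9.0 to a 512-fold increase.
print(round(2 ** 2.7, 1))  # 6.5
print(2 ** 9.0)            # 512.0
```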

  7. Arbeitsmittel fur den Deutschunterricht an Auslander (Materials for German Instruction to Foreigners).

    ERIC Educational Resources Information Center

    Kaufmann, Gerhard, Comp.

    This annotated German bibliography of textbooks for use in language programs includes a wide variety of instructional media. Materials are classified under: (1) language texts, (2) teachers' texts, (3) language texts for foreigners, (4) exercise books and tapes, (5) testing materials, (6) radio courses, (7) instructional films, (8) vocational…

  8. Collaborative Workspaces within Distributed Virtual Environments.

    DTIC Science & Technology

    1996-12-01

    such as a text document, a 3D model, or a captured image using a collaborative workspace called the InPerson Whiteboard. The Whiteboard contains a...commands for editing objects drawn on the screen. Finally, when the call is completed, the Whiteboard can be saved to a file for future use. IRIS Annotator... use, and a shared whiteboard that includes a number of multimedia annotation tools. Both systems are also mindful of bandwidth limitations and can

  9. tmBioC: improving interoperability of text-mining tools with BioC.

    PubMed

    Khare, Ritu; Wei, Chih-Hsuan; Mao, Yuqing; Leaman, Robert; Lu, Zhiyong

    2014-01-01

    The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC-wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces the lines of code required for text-mining tool integration by more than 60%.
The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
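    The BioC structure described above can be sketched by building and re-reading a minimal document with Python's standard library. The element and attribute names (collection, document, passage, annotation, infon, location) follow the published BioC format; the document ID and gene mention here are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Build a minimal BioC collection: one document, one passage,
# and one gene annotation with its character offset into the text.
collection = ET.Element("collection")
doc = ET.SubElement(collection, "document")
ET.SubElement(doc, "id").text = "PMC0000001"
passage = ET.SubElement(doc, "passage")
ET.SubElement(passage, "offset").text = "0"
ET.SubElement(passage, "text").text = "BRCA1 is a tumor suppressor gene."
ann = ET.SubElement(passage, "annotation", id="T1")
infon = ET.SubElement(ann, "infon", key="type")
infon.text = "Gene"
ET.SubElement(ann, "location", offset="0", length="5")
ET.SubElement(ann, "text").text = "BRCA1"

# Round-trip: serialize, parse back, and recover the annotated span.
xml_bytes = ET.tostring(collection)
root = ET.fromstring(xml_bytes)
mention = root.find(".//annotation/text").text
print(mention)  # BRCA1
```

    Because every tool reads and writes this one shared shape, a pipeline needs only a BioC reader/writer per tool rather than a pairwise converter for each tool combination.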

  10. Social Image Tag Ranking by Two-View Learning

    NASA Astrophysics Data System (ADS)

    Zhuang, Jinfeng; Hoi, Steven C. H.

    Tags play a central role in text-based social image retrieval and browsing. However, the tags annotated by web users could be noisy, irrelevant, and often incomplete for describing the image contents, which may severely deteriorate the performance of text-based image retrieval models. In order to solve this problem, researchers have proposed techniques to rank the annotated tags of a social image according to their relevance to the visual content of the image. In this paper, we aim to overcome the challenge of social image tag ranking for a corpus of social images with rich user-generated tags by proposing a novel two-view learning approach. It can effectively exploit both textual and visual contents of social images to discover the complicated relationship between tags and images. Unlike the conventional learning approaches that usually assume some parametric models, our method is completely data-driven and makes no assumption about the underlying models, making the proposed solution practically more effective. We formulate our method as an optimization task and present an efficient algorithm to solve it. To evaluate the efficacy of our method, we conducted an extensive set of experiments by applying our technique to both text-based social image retrieval and automatic image annotation tasks. Our empirical results showed that the proposed method can be more effective than the conventional approaches.

  11. pGenN, a gene normalization tool for plant genes and proteins in scientific literature.

    PubMed

    Ding, Ruoyao; Arighi, Cecilia N; Lee, Jung-Youn; Wu, Cathy H; Vijay-Shanker, K

    2015-01-01

    Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these phases. We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant-relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).
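    The F-value quoted above is the balanced harmonic mean of precision and recall; a quick check shows the reported 88.9% is consistent with the stated P and R up to rounding:

```python
def f_value(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# With the rounded P = 90.9 and R = 87.2 from the abstract:
print(round(f_value(90.9, 87.2), 1))  # 89.0, matching the reported 88.9 up to rounding
```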

  12. SNAD: Sequence Name Annotation-based Designer.

    PubMed

    Sidorov, Igor A; Reshetov, Denis A; Gorbalenya, Alexander E

    2009-08-14

    A growing diversity of biological data is tagged with unique identifiers (UIDs) associated with polynucleotides and proteins to ensure efficient computer-mediated data storage, maintenance, and processing. These identifiers, which are not informative for most people, are often substituted by biologically meaningful names in various presentations to facilitate utilization and dissemination of sequence-based knowledge. This substitution is commonly done manually, which can be a tedious exercise prone to mistakes and omissions. Here we introduce SNAD (Sequence Name Annotation-based Designer), which mediates automatic conversion of sequence UIDs (associated with a multiple alignment or phylogenetic tree, or supplied as a plain-text list) into biologically meaningful names and acronyms. This conversion is directed by precompiled or user-defined templates that exploit the wealth of annotation available in cognate entries of external databases. Using examples, we demonstrate how this tool can be used to generate names for practical purposes, particularly in virology. A tool for controllable annotation-based conversion of sequence UIDs into biologically meaningful names and acronyms has been developed and placed into service, fostering links between the quality of sequence annotation and the efficiency of communication and knowledge dissemination among researchers.

  13. Figure-associated text summarization and evaluation.

    PubMed

    Polepalli Ramesh, Balaji; Sethi, Ricky J; Yu, Hong

    2015-01-01

    Biomedical literature incorporates millions of figures, which are a rich and important knowledge resource for biomedical researchers. Scientists need access to the figures and the knowledge they represent in order to validate research findings and to generate new hypotheses. By themselves, these figures are nearly always incomprehensible to both humans and machines and their associated texts are therefore essential for full comprehension. The associated text of a figure, however, is scattered throughout its full-text article and contains redundant information content. In this paper, we report the continued development and evaluation of several figure summarization systems, the FigSum+ systems, that automatically identify associated texts, remove redundant information, and generate a text summary for every figure in an article. Using a set of 94 annotated figures selected from 19 different journals, we conducted an intrinsic evaluation of FigSum+. We evaluate the performance by precision, recall, F1, and ROUGE scores. The best FigSum+ system is based on an unsupervised method, achieving F1 score of 0.66 and ROUGE-1 score of 0.97. The annotated data is available at figshare.com (http://figshare.com/articles/Figure_Associated_Text_Summarization_and_Evaluation/858903).
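    The ROUGE-1 score used in the evaluation above measures unigram overlap against a reference summary. A minimal recall-oriented sketch (toy sentences for illustration, not FigSum+ code):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams covered by the candidate (with clipping)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

print(rouge1_recall("the figure shows tumor growth",
                    "figure shows tumor growth over time"))  # 4/6 ≈ 0.667
```

    Full ROUGE implementations also report bigram (ROUGE-2) and longest-common-subsequence (ROUGE-L) variants, but the unigram recall above conveys the core idea.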

  14. Figure-Associated Text Summarization and Evaluation

    PubMed Central

    Polepalli Ramesh, Balaji; Sethi, Ricky J.; Yu, Hong

    2015-01-01

    Biomedical literature incorporates millions of figures, which are a rich and important knowledge resource for biomedical researchers. Scientists need access to the figures and the knowledge they represent in order to validate research findings and to generate new hypotheses. By themselves, these figures are nearly always incomprehensible to both humans and machines and their associated texts are therefore essential for full comprehension. The associated text of a figure, however, is scattered throughout its full-text article and contains redundant information content. In this paper, we report the continued development and evaluation of several figure summarization systems, the FigSum+ systems, that automatically identify associated texts, remove redundant information, and generate a text summary for every figure in an article. Using a set of 94 annotated figures selected from 19 different journals, we conducted an intrinsic evaluation of FigSum+. We evaluate the performance by precision, recall, F1, and ROUGE scores. The best FigSum+ system is based on an unsupervised method, achieving F1 score of 0.66 and ROUGE-1 score of 0.97. The annotated data is available at figshare.com (http://figshare.com/articles/Figure_Associated_Text_Summarization_and_Evaluation/858903). PMID:25643357

  15. Text-mining and information-retrieval services for molecular biology

    PubMed Central

    Krallinger, Martin; Valencia, Alfonso

    2005-01-01

    Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators. PMID:15998455

  16. Argo: enabling the development of bespoke workflows and services for disease annotation.

    PubMed

    Batista-Navarro, Riza; Carter, Jacob; Ananiadou, Sophia

    2016-01-01

    Argo (http://argo.nactem.ac.uk) is a generic text mining workbench that can cater to a variety of use cases, including the semi-automatic annotation of literature. It enables its technical users to build their own customised text mining solutions by providing a wide array of interoperable and configurable elementary components that can be seamlessly integrated into processing workflows. With Argo's graphical annotation interface, domain experts can then make use of the workflows' automatically generated output to curate information of interest. With the continuously rising need to understand the aetiology of diseases as well as the demand for their informed diagnosis and personalised treatment, the curation of disease-relevant information from medical and clinical documents has become an indispensable scientific activity. In the Fifth BioCreative Challenge Evaluation Workshop (BioCreative V), there was substantial interest in the mining of literature for disease-relevant information. Apart from a panel discussion focussed on disease annotations, the chemical-disease relations (CDR) track was also organised to foster the sharing and advancement of disease annotation tools and resources. This article presents the application of Argo's capabilities to the literature-based annotation of diseases. As part of our participation in BioCreative V's User Interactive Track (IAT), we demonstrated and evaluated Argo's suitability to the semi-automatic curation of chronic obstructive pulmonary disease (COPD) phenotypes. Furthermore, the workbench facilitated the development of some of the CDR track's top-performing web services for normalising disease mentions against the Medical Subject Headings (MeSH) database.
In this work, we highlight Argo's support for developing various types of bespoke workflows ranging from ones which enabled us to easily incorporate information from various databases, to those which train and apply machine learning-based concept recognition models, through to user-interactive ones which allow human curators to manually provide their corrections to automatically generated annotations. Our participation in the BioCreative V challenges shows Argo's potential as an enabling technology for curating disease and phenotypic information from literature.Database URL: http://argo.nactem.ac.uk. © The Author(s) 2016. Published by Oxford University Press.

  17. Argo: enabling the development of bespoke workflows and services for disease annotation

    PubMed Central

    Batista-Navarro, Riza; Carter, Jacob; Ananiadou, Sophia

    2016-01-01

    Argo (http://argo.nactem.ac.uk) is a generic text mining workbench that can cater to a variety of use cases, including the semi-automatic annotation of literature. It enables its technical users to build their own customised text mining solutions by providing a wide array of interoperable and configurable elementary components that can be seamlessly integrated into processing workflows. With Argo's graphical annotation interface, domain experts can then make use of the workflows' automatically generated output to curate information of interest. With the continuously rising need to understand the aetiology of diseases as well as the demand for their informed diagnosis and personalised treatment, the curation of disease-relevant information from medical and clinical documents has become an indispensable scientific activity. In the Fifth BioCreative Challenge Evaluation Workshop (BioCreative V), there was substantial interest in the mining of literature for disease-relevant information. Apart from a panel discussion focussed on disease annotations, the chemical-disease relations (CDR) track was also organised to foster the sharing and advancement of disease annotation tools and resources. This article presents the application of Argo’s capabilities to the literature-based annotation of diseases. As part of our participation in BioCreative V’s User Interactive Track (IAT), we demonstrated and evaluated Argo’s suitability to the semi-automatic curation of chronic obstructive pulmonary disease (COPD) phenotypes. Furthermore, the workbench facilitated the development of some of the CDR track’s top-performing web services for normalising disease mentions against the Medical Subject Headings (MeSH) database. 
In this work, we highlight Argo’s support for developing various types of bespoke workflows ranging from ones which enabled us to easily incorporate information from various databases, to those which train and apply machine learning-based concept recognition models, through to user-interactive ones which allow human curators to manually provide their corrections to automatically generated annotations. Our participation in the BioCreative V challenges shows Argo’s potential as an enabling technology for curating disease and phenotypic information from literature. Database URL: http://argo.nactem.ac.uk PMID:27189607

  18. Text-mining-assisted biocuration workflows in Argo

    PubMed Central

    Rak, Rafal; Batista-Navarro, Riza Theresa; Rowley, Andrew; Carter, Jacob; Ananiadou, Sophia

    2014-01-01

    Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have the potential to significantly reduce the effort of biocurators in all three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced.
Database URL: http://argo.nactem.ac.uk PMID:25037308

  19. Exploring Preservice Teachers' Emerging Understandings of Disciplinary Literacy

    ERIC Educational Resources Information Center

    Masuda, Avis M.

    2014-01-01

    This qualitative inquiry examined 14 secondary preservice teachers' emerging understandings of disciplinary literacy. Data included preservice teachers' written reflections and annotated lesson plans, which were analyzed for understanding of discipline-specific habits of thinking, texts, reading and writing demands of academic texts, language and…

  20. The gene normalization task in BioCreative III

    PubMed Central

    2011-01-01

    Background We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). Results We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. Conclusions By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance. PMID:22151901

  1. The gene normalization task in BioCreative III.

    PubMed

    Lu, Zhiyong; Kao, Hung-Yu; Wei, Chih-Hsuan; Huang, Minlie; Liu, Jingchen; Kuo, Cheng-Ju; Hsu, Chun-Nan; Tsai, Richard Tzong-Han; Dai, Hong-Jie; Okazaki, Naoaki; Cho, Han-Cheol; Gerner, Martin; Solt, Illes; Agarwal, Shashank; Liu, Feifan; Vishnyakova, Dina; Ruch, Patrick; Romacker, Martin; Rinaldi, Fabio; Bhattacharya, Sanmitra; Srinivasan, Padmini; Liu, Hongfang; Torii, Manabu; Matos, Sergio; Campos, David; Verspoor, Karin; Livingston, Kevin M; Wilbur, W John

    2011-10-03

    We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. 
Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
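    The percentage improvements quoted in this record follow directly from the reported TAP-k scores; a quick check of the arithmetic:

```python
def pct_improvement(new, baseline):
    """Relative improvement of a new score over a baseline, in percent."""
    return 100 * (new - baseline) / baseline

# Composite system vs. best single-team result on the gold standard,
# using the TAP-k scores reported in the abstract.
for k, best_team, composite in [(5, 0.3297, 0.3707),
                                (10, 0.3538, 0.4311),
                                (20, 0.3535, 0.4477)]:
    print(f"k={k}: {pct_improvement(composite, best_team):.1f}%")
# k=5: 12.4%   k=10: 21.8%   k=20: 26.6%
```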

  2. Ontorat: automatic generation of new ontology terms, annotations, and axioms based on ontology design patterns.

    PubMed

    Xiang, Zuoshuang; Zheng, Jie; Lin, Yu; He, Yongqun

    2015-01-01

    It is time-consuming to build an ontology with many terms and axioms, so it is desirable to automate the process of ontology development. Ontology Design Patterns (ODPs) provide a reusable solution to solve a recurrent modeling problem in the context of ontology engineering. Because ontology terms often follow specific ODPs, the Ontology for Biomedical Investigations (OBI) developers proposed a Quick Term Templates (QTTs) process targeted at generating new ontology classes following the same pattern, using term templates in a spreadsheet format. Inspired by the ODPs and QTTs, the Ontorat web application is developed to automatically generate new ontology terms, annotations of terms, and logical axioms based on a specific ODP(s). The inputs of an Ontorat execution include axiom expression settings, an input data file, ID generation settings, and a target ontology (optional). The axiom expression settings can be saved as a predesigned Ontorat setting format text file for reuse. The input data file is generated based on a template file created by a specific ODP (text or Excel format). Ontorat is an efficient tool for ontology expansion. Different use cases are described. For example, Ontorat was applied to automatically generate over 1,000 Japan RIKEN cell line cell terms with both logical axioms and rich annotation axioms in the Cell Line Ontology (CLO). Approximately 800 licensed animal vaccines were represented and annotated in the Vaccine Ontology (VO) by Ontorat. The OBI team used Ontorat to add assay and device terms required by the ENCODE project. Ontorat was also used to add missing annotations to all existing Biobank-specific terms in the Biobank Ontology. A collection of ODPs and templates with examples are provided on the Ontorat website and can be reused to facilitate ontology development.
With ever increasing ontology development and applications, Ontorat provides a timely platform for generating and annotating a large number of ontology terms by following design patterns. http://ontorat.hegroup.org/.

  3. Genome sequence of the oleaginous yeast Rhodotorula toruloides strain CGMCC 2.1609.

    PubMed

    Sambles, Christine; Middelhaufe, Sabine; Soanes, Darren; Kolak, Dagmara; Lux, Thomas; Moore, Karen; Matoušková, Petra; Parker, David; Lee, Rob; Love, John; Aves, Stephen J

    2017-09-01

    Most eukaryotic oleaginous species are yeasts, and among them the basidiomycete red yeast Rhodotorula (Rhodosporidium) toruloides (Pucciniomycotina) is known to produce high quantities of lipids when grown in nitrogen-limiting media, and has potential for biodiesel production. The genome of the CGMCC 2.1609 strain of this oleaginous red yeast was sequenced using a hybrid of Roche 454 and Illumina technology, generating 13× coverage. The de novo assembly was carried out using MIRA and scaffolded using MAQ and BAMBUS. The sequencing and assembly resulted in 365 scaffolds with a total genome size of 33.4 Mb. The complete genome sequence of this strain was deposited in GenBank under accession number LKER00000000. The annotation is available on Figshare (doi:10.6084/m9.figshare.4754251).

  4. Linking DICOM pixel data with radiology reports using automatic semantic annotation

    NASA Astrophysics Data System (ADS)

    Pathak, Sayan D.; Kim, Woojin; Munasinghe, Indeera; Criminisi, Antonio; White, Steve; Siddiqui, Khan

    2012-02-01

    Improved access to DICOM studies for both physicians and patients is changing the ways medical imaging studies are visualized and interpreted beyond the confines of radiologists' PACS workstations. While radiologists are trained in image viewing and interpretation, non-radiologist physicians rely on radiologists' reports. Consequently, patients have historically been informed about their imaging findings via oral communication with their physicians, even though clinical studies have shown that patients respond to a physician's advice significantly better when shown their own actual data. Our previous work on automated semantic annotation of DICOM Computed Tomography (CT) images allows us to link a radiology report with the corresponding images, bridging the gap between the image data and the human-interpreted textual description of the corresponding imaging studies. The mapping of radiology text is facilitated by a natural language processing (NLP) based search application. When combined with our automated semantic annotation of images, it enables navigation in large DICOM studies by clicking hyperlinked text in the radiology reports. An added advantage of using semantic annotation is the ability to render the organs at their default window level settings, thus eliminating another barrier to image sharing and distribution. We believe such approaches could enable consumers to access their imaging data and navigate it in an informed manner.

  5. It’s about This and That: A Description of Anaphoric Expressions in Clinical Text

    PubMed Central

    Wang, Yan; Melton, Genevieve B.; Pakhomov, Serguei

    2011-01-01

    Although anaphoric expressions are very common in biomedical and clinical documents, little work has been done to systematically characterize their use in clinical text. Samples of ‘it’, ‘this’, and ‘that’ expressions occurring in inpatient clinical notes from four metropolitan hospitals were analyzed using a combination of semi-automated and manual annotation techniques. We developed a rule-based approach to filter potential non-referential expressions. A physician then manually annotated 1000 potential referential instances to determine the referent status and the antecedent of each referring expression. A distributional analysis of the three referring expressions in the entire corpus of notes demonstrates a high prevalence of anaphora and large variance in the distributions of referential expressions across different notes. Our results confirm that anaphoric expressions are common in clinical texts. Effective co-reference resolution of anaphoric expressions remains an important challenge in medical natural language processing research. PMID:22195211
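The rule-based filtering step for non-referential expressions can be sketched as a small set of cue patterns for pleonastic "it"; the patterns below are illustrative assumptions, not the authors' actual rules:

```python
import re

# A few surface cues that often signal non-referential (pleonastic) "it",
# e.g. "it is possible that ...". Real filters would use many more rules.
PLEONASTIC = [
    re.compile(r"\bit\s+(is|was|seems|appears)\s+(likely|possible|important|clear)\b", re.I),
    re.compile(r"\bit\s+(is|was)\s+noted\s+that\b", re.I),
]

def likely_nonreferential(sentence):
    """True if the sentence matches any pleonastic-'it' cue pattern."""
    return any(p.search(sentence) for p in PLEONASTIC)
```

Instances surviving such a filter would then go to the manual annotation stage for referent status and antecedent marking.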

  6. The role of fine-grained annotations in supervised recognition of risk factors for heart disease from EHRs.

    PubMed

    Roberts, Kirk; Shooshan, Sonya E; Rodriguez, Laritza; Abhyankar, Swapna; Kilicoglu, Halil; Demner-Fushman, Dina

    2015-12-01

    This paper describes a supervised machine learning approach for identifying heart disease risk factors in clinical text, and assessing the impact of annotation granularity and quality on the system's ability to recognize these risk factors. We utilize a series of support vector machine models in conjunction with manually built lexicons to classify triggers specific to each risk factor. The features used for classification were quite simple, utilizing only lexical information and ignoring higher-level linguistic information such as syntax and semantics. Instead, we incorporated high-quality data to train the models by annotating additional information on top of a standard corpus. Despite the relative simplicity of the system, it achieves the highest scores (micro- and macro-F1, and micro- and macro-recall) out of the 20 participants in the 2014 i2b2/UTHealth Shared Task. This system obtains a micro- (macro-) precision of 0.8951 (0.8965), recall of 0.9625 (0.9611), and F1-measure of 0.9276 (0.9277). Additionally, we perform a series of experiments to assess the value of the annotated data we created. These experiments show how manually-labeled negative annotations can improve information extraction performance, demonstrating the importance of high-quality, fine-grained natural language annotations. Copyright © 2015 Elsevier Inc. All rights reserved.
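The lexicon-plus-lexical-features design described above can be sketched as follows: a hand-built lexicon proposes candidate triggers, and each candidate becomes a bag-of-words feature dictionary that a linear classifier (an SVM in the paper) could consume. The lexicon entries and window size here are illustrative assumptions:

```python
# Toy trigger lexicon mapping surface forms to risk-factor labels; the
# real system used manually built lexicons per risk factor.
LEXICON = {"smoker": "SMOKER", "hypertension": "HYPERTENSION", "cabg": "CAD"}

def candidate_triggers(text, window=3):
    """Find lexicon triggers and extract bag-of-words context features."""
    tokens = text.lower().split()
    candidates = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            candidates.append({
                "trigger": tok,
                "label": LEXICON[tok],
                # purely lexical features, mirroring the paper's design
                "features": {f"w={w}": 1 for w in ctx},
            })
    return candidates

note = "Patient is a former smoker with hypertension"
cands = candidate_triggers(note)
```

Each feature dictionary would then be vectorized and passed to a per-risk-factor SVM for trigger classification.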

  7. Semantics-Based Intelligent Indexing and Retrieval of Digital Images - A Case Study

    NASA Astrophysics Data System (ADS)

    Osman, Taha; Thakker, Dhavalkumar; Schaefer, Gerald

    The proliferation of digital media has led to huge interest in classifying and indexing media objects for generic search and usage. In particular, we are witnessing colossal growth in digital image repositories that are difficult to navigate using free-text search mechanisms, which often return inaccurate matches as they typically rely on statistical analysis of query keyword recurrence in the image annotation or surrounding text. In this chapter we present a semantically enabled image annotation and retrieval engine designed to satisfy the requirements of the commercial image collections market in terms of both the accuracy and the efficiency of the retrieval process. Our search engine relies on methodically structured ontologies for image annotation, allowing for more intelligent reasoning about image content and subsequently obtaining a more accurate set of results and a richer set of alternatives matching the original query. We also show how our well-analysed and well-designed domain ontology contributes to the implicit expansion of user queries, and present our initial thoughts on exploiting lexical databases for explicit semantics-based query expansion.

  8. An open annotation ontology for science on web 3.0

    PubMed Central

    2011-01-01

    Background There is currently a gap between the rich and expressive collection of published biomedical ontologies and the natural language of biomedical papers consumed on a daily basis by scientific researchers. The purpose of this paper is to provide an open, shareable structure for dynamic integration of biomedical domain ontologies with the scientific document, in the form of an Annotation Ontology (AO), thus closing this gap and enabling application of formal biomedical ontologies directly to the literature as it emerges. Methods Initial requirements for AO were elicited by analysis of integration needs between biomedical web communities, and of needs for representing and integrating results of biomedical text mining. Analysis of the strengths and weaknesses of previous efforts in this area was also performed. A series of increasingly refined annotation tools was then developed, along with a metadata model in OWL, and deployed to users at a major pharmaceutical company and a major academic center for feedback and additional requirements on the ontology. Further requirements and critiques of the model were also elicited through discussions with many colleagues and incorporated into the work. Results This paper presents the Annotation Ontology (AO), an open ontology in OWL-DL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables “stand-off” or independent metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under the update control of the annotator. AO contains a provenance model to support versioning, and a set model for specifying groups and containers of annotation. AO is freely available under an open-source license at http://purl.org/ao/, and extensive documentation, including screencasts, is available on AO’s Google Code page: http://code.google.com/p/annotation-ontology/.
Conclusions The Annotation Ontology meets critical requirements for an open, freely shareable model in OWL, of annotation metadata created against scientific documents on the Web. We believe AO can become a very useful common model for annotation metadata on Web documents, and will enable biomedical domain ontologies to be used quite widely to annotate the scientific literature. Potential collaborators and those with new relevant use cases are invited to contact the authors. PMID:21624159

  9. An open annotation ontology for science on web 3.0.

    PubMed

    Ciccarese, Paolo; Ocana, Marco; Garcia Castro, Leyla Jael; Das, Sudeshna; Clark, Tim

    2011-05-17

    There is currently a gap between the rich and expressive collection of published biomedical ontologies and the natural language of biomedical papers consumed on a daily basis by scientific researchers. The purpose of this paper is to provide an open, shareable structure for dynamic integration of biomedical domain ontologies with the scientific document, in the form of an Annotation Ontology (AO), thus closing this gap and enabling application of formal biomedical ontologies directly to the literature as it emerges. Initial requirements for AO were elicited by analysis of integration needs between biomedical web communities, and of needs for representing and integrating results of biomedical text mining. Analysis of the strengths and weaknesses of previous efforts in this area was also performed. A series of increasingly refined annotation tools was then developed, along with a metadata model in OWL, and deployed to users at a major pharmaceutical company and a major academic center for feedback and additional requirements on the ontology. Further requirements and critiques of the model were also elicited through discussions with many colleagues and incorporated into the work. This paper presents the Annotation Ontology (AO), an open ontology in OWL-DL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables "stand-off" or independent metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under the update control of the annotator. AO contains a provenance model to support versioning, and a set model for specifying groups and containers of annotation. AO is freely available under an open-source license at http://purl.org/ao/, and extensive documentation, including screencasts, is available on AO's Google Code page: http://code.google.com/p/annotation-ontology/.
The Annotation Ontology meets critical requirements for an open, freely shareable model in OWL, of annotation metadata created against scientific documents on the Web. We believe AO can become a very useful common model for annotation metadata on Web documents, and will enable biomedical domain ontologies to be used quite widely to annotate the scientific literature. Potential collaborators and those with new relevant use cases are invited to contact the authors.

  10. Using Nonexperts for Annotating Pharmacokinetic Drug-Drug Interaction Mentions in Product Labeling: A Feasibility Study

    PubMed Central

    Ning, Yifan; Hernandez, Andres; Horn, John R; Jacobson, Rebecca; Boyce, Richard D

    2016-01-01

    Background Because vital details of potential pharmacokinetic drug-drug interactions are often described in free-text structured product labels, manual curation is a necessary but expensive step in the development of electronic drug-drug interaction information resources. The use of nonexperts to annotate potential drug-drug interaction (PDDI) mentions in drug product labels may be a means of lessening the burden of manual curation. Objective Our goal was to explore the practicality of using nonexpert participants to annotate drug-drug interaction descriptions from structured product labels. By presenting annotation tasks to both pharmacy experts and relatively naïve participants, we hoped to demonstrate the feasibility of using nonexpert annotators for drug-drug information annotation. We were also interested in exploring whether and to what extent natural language processing (NLP) preannotation helped improve task completion time, accuracy, and subjective satisfaction. Methods Two experts and 4 nonexperts were asked to annotate 208 structured product label sections under 4 conditions completed sequentially: (1) no NLP assistance, (2) preannotation of drug mentions, (3) preannotation of drug mentions and PDDIs, and (4) a repeat of the no-assistance condition. Results were evaluated within the 2 groups and relative to an existing gold standard. Participants were asked to report the time required to complete tasks and their perceptions of task difficulty. Results One of the experts and 3 of the nonexperts completed all tasks. Annotation results from the nonexpert group were relatively strong in every scenario and better than the performance of the NLP pipeline. The expert and 2 of the nonexperts were able to complete most tasks in less than 3 hours. Usability perceptions were generally positive (3.67 for the expert, mean of 3.33 for the nonexperts). 
Conclusions The results suggest that nonexpert annotation might be a feasible option for comprehensive labeling of annotated PDDIs across a broader range of drug product labels. Preannotation of drug mentions may ease the annotation task. However, preannotation of PDDIs, as operationalized in this study, presented the participants with difficulties. Future work should test if these issues can be addressed by the use of better performing NLP and a different approach to presenting the PDDI preannotations to users during the annotation workflow. PMID:27066806

  11. Using Nonexperts for Annotating Pharmacokinetic Drug-Drug Interaction Mentions in Product Labeling: A Feasibility Study.

    PubMed

    Hochheiser, Harry; Ning, Yifan; Hernandez, Andres; Horn, John R; Jacobson, Rebecca; Boyce, Richard D

    2016-04-11

    Because vital details of potential pharmacokinetic drug-drug interactions are often described in free-text structured product labels, manual curation is a necessary but expensive step in the development of electronic drug-drug interaction information resources. The use of nonexperts to annotate potential drug-drug interaction (PDDI) mentions in drug product labels may be a means of lessening the burden of manual curation. Our goal was to explore the practicality of using nonexpert participants to annotate drug-drug interaction descriptions from structured product labels. By presenting annotation tasks to both pharmacy experts and relatively naïve participants, we hoped to demonstrate the feasibility of using nonexpert annotators for drug-drug information annotation. We were also interested in exploring whether and to what extent natural language processing (NLP) preannotation helped improve task completion time, accuracy, and subjective satisfaction. Two experts and 4 nonexperts were asked to annotate 208 structured product label sections under 4 conditions completed sequentially: (1) no NLP assistance, (2) preannotation of drug mentions, (3) preannotation of drug mentions and PDDIs, and (4) a repeat of the no-assistance condition. Results were evaluated within the 2 groups and relative to an existing gold standard. Participants were asked to report the time required to complete tasks and their perceptions of task difficulty. One of the experts and 3 of the nonexperts completed all tasks. Annotation results from the nonexpert group were relatively strong in every scenario and better than the performance of the NLP pipeline. The expert and 2 of the nonexperts were able to complete most tasks in less than 3 hours. Usability perceptions were generally positive (3.67 for the expert, mean of 3.33 for the nonexperts). 
The results suggest that nonexpert annotation might be a feasible option for comprehensive labeling of annotated PDDIs across a broader range of drug product labels. Preannotation of drug mentions may ease the annotation task. However, preannotation of PDDIs, as operationalized in this study, presented the participants with difficulties. Future work should test if these issues can be addressed by the use of better performing NLP and a different approach to presenting the PDDI preannotations to users during the annotation workflow.

  12. Corpus Linguistics and the Design of a Response Message

    NASA Astrophysics Data System (ADS)

    Atwell, E.

    2002-01-01

    Most research related to SETI, the Search for Extra-Terrestrial Intelligence, is focussed on techniques for the detection of possible incoming signals from extra-terrestrial intelligent sources (e.g. Turnbull et al. 1999), and on algorithms for analysis of these signals to identify intelligent language-like characteristics (e.g. Elliott and Atwell 1999, 2000). However, another issue for research and debate is the nature of our response, should a signal arrive and be detected. The design of potentially the most significant communicative act in history should not be decided solely by astrophysicists; the Corpus Linguistics research community has a contribution to make to what is essentially a corpus design and implementation project. Vakoch (1998) advocated that the message constructed for transmission to extraterrestrials should include a broad, representative collection of perspectives rather than a single viewpoint or genre; this should strike a chord with corpus linguists, for whom a central principle is that a corpus must be "balanced" to be representative (Meyer 2001). One idea favoured by SETI researchers is to transmit an encyclopaedia summarising human knowledge, such as the Encyclopaedia Britannica, to give ET communicators an overview and "training set" key to the analysis of subsequent messages. Furthermore, this should be sent in several versions in parallel: the text; page-images, to include illustrations left out of the text-file; and perhaps some sort of abstract linguistic representation of the text, using a functional or logic language (Ollongren 1999, Freudenthal 1960). The idea of "enriching" the message corpus with annotations at several levels should also strike a chord with corpus linguists, who have long known that natural language exhibits highly complex multi-layered sequencing, structural and functional patterns, as difficult to model as sequences and structures found in the more traditional physical and biological sciences. 
Some corpora have been annotated with several levels or layers of linguistic knowledge, for example the SEC corpus (Taylor and Knowles 1988) and the ISLE corpus (Menzel et al. 2000). Tagged and parsed corpora can be used by corpus linguists as a testbed to guide their development of grammars (e.g. Souter and Atwell 1994), and they can be used to train Natural Language Learning or data-mining models of complex sequence data (e.g. Brill 1993, Hughes 1993, Atwell 1996). Corpus linguists have a range of standards and tools for the design and annotation of representative corpus resources, and experience of which annotation types are more amenable to Natural Language Learning algorithms. An advisory panel of corpus linguists could help design and implement an extended Multi-annotated Interstellar Corpus of English, incorporating ideas from Corpus Linguistics such as:
- Augment the Encyclopaedia Britannica with a collection of samples representing the diversity of language in real use.
- As an additional "key", transmit a dictionary aimed at language learners, which has also been a rich source for NLP.
- Supply our ET communicators with several levels of linguistic annotation, to give them a richer training set.
- Add translations of the English text into other human languages: Humanity should not be represented by English alone.
This calls for a large-scale corpus annotation project, requiring an Interstellar Corpus Advisory Panel, analogous to the BNC or MATE advisory panels, to include experts in English grammar and semantics, English language learning, computational Natural Language Learning algorithms, and corpus design, implementation, annotation, standardisation, and analysis.

  13. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: data, metrics, and protocol.

    PubMed

    Kasturi, Rangachar; Goldgof, Dmitry; Soundararajan, Padmanabhan; Manohar, Vasant; Garofolo, John; Bowers, Rachel; Boonstra, Matthew; Korzhova, Valentina; Zhang, Jing

    2009-02-01

    Common benchmark data sets, standardized performance metrics, and baseline algorithms have demonstrated considerable impact on research and development in a variety of application domains. These resources provide both consumers and developers of technology with a common framework to objectively compare the performance of different algorithms and algorithmic improvements. In this paper, we present such a framework for evaluating object detection and tracking in video, specifically for face, text, and vehicle objects. This framework includes the source video data, ground-truth annotations (along with guidelines for annotation), performance metrics, evaluation protocols, and tools including scoring software and baseline algorithms. For each detection and tracking task and supported domain, we developed a 50-clip training set and a 50-clip test set. Each data clip is approximately 2.5 minutes long and has been completely spatially/temporally annotated at the I-frame level. Each task/domain, therefore, has an associated annotated corpus of approximately 450,000 frames. The scope of such annotation is unprecedented and was designed to begin to support the necessary quantities of data for robust machine learning approaches, as well as a statistically significant comparison of the performance of algorithms. The goal of this work was to systematically address the challenges of object detection and tracking through a common evaluation framework that permits a meaningful objective comparison of techniques, provides the research community with sufficient data for the exploration of automatic modeling techniques, encourages the incorporation of objective evaluation into the development process, and contributes lasting resources of a scale and magnitude that will prove extremely useful to the computer vision research community for years to come.
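The kind of frame-level detection scoring such a framework standardizes can be sketched as follows: detections are matched to ground-truth boxes when their overlap (intersection-over-union) exceeds a threshold, and precision, recall, and F1 follow from the match counts. This is a generic sketch, not the framework's actual scoring software:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def score_frame(gt, det, thr=0.5):
    """Greedy one-to-one matching of detections to ground truth; P/R/F1."""
    matched, tp = set(), 0
    for d in det:
        for j, g in enumerate(gt):
            if j not in matched and iou(d, g) >= thr:
                matched.add(j)
                tp += 1
                break
    fp, fn = len(det) - tp, len(gt) - tp
    p = tp / (tp + fp) if det else 0.0
    r = tp / (tp + fn) if gt else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Per-frame scores would then be aggregated over the roughly 450,000 annotated frames per task to compare systems.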

  14. The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature

    PubMed Central

    Korhonen, Anna; Silins, Ilona; Sun, Lin; Stenius, Ulla

    2009-01-01

    Background One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy capable of supporting extensive data gathering from the biomedical literature. Results The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies the 1742 unique keywords found in the corpus into 48 classes which specify core evidence required for CRA. We report promising results from inter-annotator agreement tests and from automatic classification of PubMed abstracts into taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates, along with other evaluation, that the resources we have built are well-defined, accurate, and applicable in practice. Conclusion We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA. PMID:19772619

  15. v3NLP Framework: Tools to Build Applications for Extracting Concepts from Clinical Text

    PubMed Central

    Divita, Guy; Carter, Marjorie E.; Tran, Le-Thuy; Redd, Doug; Zeng, Qing T; Duvall, Scott; Samore, Matthew H.; Gundlapalli, Adi V.

    2016-01-01

    Introduction: Substantial amounts of clinically significant information are contained only within the narrative of the clinical notes in electronic medical records. The v3NLP Framework is a set of “best-of-breed” functionalities developed to transform this information into structured data for use in quality improvement, research, population health surveillance, and decision support. Background: MetaMap, cTAKES and similar well-known natural language processing (NLP) tools do not have sufficient scalability out of the box. The v3NLP Framework evolved out of the necessity to scale these tools up and to provide a framework to customize and tune techniques to fit a variety of tasks, including document classification, tuned concept extraction for specific conditions, patient classification, and information retrieval. Innovation: Beyond scalability, several projects developed with the v3NLP Framework have been efficacy tested and benchmarked. While the v3NLP Framework includes annotators, pipelines and applications, its functionalities enable developers to create novel annotators and to place annotators into pipelines and scaled applications. Discussion: The v3NLP Framework has been successfully utilized in many projects including general concept extraction, risk factors for homelessness among veterans, and identification of mentions of the presence of an indwelling urinary catheter. Projects as diverse as predicting colonization with methicillin-resistant Staphylococcus aureus and extracting references to military sexual trauma are being built using v3NLP Framework components. Conclusion: The v3NLP Framework is a set of functionalities and components that provide Java developers with the ability to create novel annotators and to place those annotators into pipelines and applications to extract concepts from clinical text. There are scale-up and scale-out functionalities to process large numbers of records. PMID:27683667

  16. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial.

    PubMed

    Velupillai, Sumithra; Dalianis, Hercules; Hassel, Martin; Nilsson, Gunnar H

    2009-12-01

    Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research but is also very sensitive, since the free-text parts may contain information that could reveal the identity of a patient. Therefore, methods for de-identifying EPRs are needed. The work presented here aims to perform a manual and automatic Protected Health Information (PHI) annotation trial for EPRs written in Swedish. This study consists of two main parts: the initial creation of a manually PHI-annotated gold standard, and the porting and evaluation of existing de-identification software written for American English to Swedish in a preliminary automatic de-identification trial. Results are measured with precision, recall and F-measure. This study reports fairly high Inter-Annotator Agreement (IAA) results on the manually created gold standard, especially for specific tags such as names. The average IAA over all tags was 0.65 F-measure (0.84 F-measure highest pairwise agreement). For name tags the average IAA was 0.80 F-measure (0.91 F-measure highest pairwise agreement). Porting de-identification software written for American English directly to Swedish was unfortunately non-trivial, yielding poor results. Developing gold standard sets as well as automatic systems for de-identification tasks in Swedish is feasible. However, discussion and definitions of identifiable information are needed, as well as further development of both the tag sets and the annotation guidelines, in order to obtain a reliable gold standard. A completely new de-identification software needs to be developed.
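Pairwise inter-annotator agreement reported as F-measure can be sketched by treating one annotator as the reference and scoring exact span/tag matches; the exact-match criterion here is an assumption, as the study's matching rules are not given above:

```python
def pairwise_f(ann_a, ann_b):
    """F-measure between two annotators' PHI annotations.

    Each annotation is a (start, end, tag) tuple; annotator A is treated
    as the reference and B as the system, which makes F symmetric.
    """
    a, b = set(ann_a), set(ann_b)
    tp = len(a & b)
    p = tp / len(b) if b else 0.0  # precision of B against A
    r = tp / len(a) if a else 0.0  # recall of B against A
    return 2 * p * r / (p + r) if p + r else 0.0
```

Averaging this quantity over all annotator pairs (and per tag) yields figures like the 0.65 overall and 0.80 name-tag IAA reported above.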

  17. Argo: an integrative, interactive, text mining-based workbench supporting curation

    PubMed Central

    Rak, Rafal; Rowley, Andrew; Black, William; Ananiadou, Sophia

    2012-01-01

    Curation of biomedical literature is often supported by the automatic analysis of textual content, which generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variety of tasks, types of information and applications. Processing components usually come from different sources and often lack interoperability. The well-established Unstructured Information Management Architecture is a framework that addresses interoperability by defining common data structures and interfaces. However, most of these efforts are targeted towards software developers and are not suitable for curators, or are otherwise inconvenient to use at a higher level of abstraction. To overcome these issues we introduce Argo, an interoperable, integrative, interactive and collaborative system for text analysis with a convenient graphical user interface to ease the development of processing workflows and boost productivity in labour-intensive manual curation. Robust, scalable text analytics follow a modular approach, adopting component modules for distinct levels of text analysis. The user interface is available entirely through a web browser, which saves the user from going through often complicated and platform-dependent installation procedures. Argo comes with a predefined set of processing components commonly used in text analysis, while giving users the ability to deposit their own components. The system accommodates various areas and levels of user expertise, from TM and computational linguistics to ontology-based curation. One of the key functionalities of Argo is its ability to seamlessly incorporate user-interactive components, such as manual annotation editors, into otherwise completely automatic pipelines. 
As a use case, we demonstrate the functionality of an in-built manual annotation editor that is well suited for in-text corpus annotation tasks. Database URL: http://www.nactem.ac.uk/Argo PMID:22434844

  18. Cell line name recognition in support of the identification of synthetic lethality in cancer from text

    PubMed Central

    Kaewphan, Suwisa; Van Landeghem, Sofie; Ohta, Tomoko; Van de Peer, Yves; Ginter, Filip; Pyysalo, Sampo

    2016-01-01

    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers. Availability and implementation: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/. Contact: sukaew@utu.fi PMID:26428294

  19. PAPARA(ZZ)I: An open-source software interface for annotating photographs of the deep-sea

    NASA Astrophysics Data System (ADS)

    Marcon, Yann; Purser, Autun

    PAPARA(ZZ)I is a lightweight and intuitive image annotation program developed for the study of benthic megafauna. It offers functionalities such as free, grid and random point annotation. Annotations may be made following existing classification schemes for marine biota and substrata or with the use of user defined, customised lists of keywords, which broadens the range of potential application of the software to other types of studies (e.g. marine litter distribution assessment). If Internet access is available, PAPARA(ZZ)I can also query and use standardised taxa names directly from the World Register of Marine Species (WoRMS). Program outputs include abundances, densities and size calculations per keyword (e.g. per taxon). These results are written into text files that can be imported into spreadsheet programs for further analyses. PAPARA(ZZ)I is open-source and is available at http://papara-zz-i.github.io. Compiled versions exist for most 64-bit operating systems: Windows, Mac OS X and Linux.

  20. ANNOTATION TAKEN, IN THE PERSPECTIVE OF CRIMINAL AND CONSTITUTIONAL LAW, AS WELL AS IN CRIMINOLOGY, TO THE DECISION OF THE PORTUGUESE CONSTITUTIONAL COURT, OF JANUARY 13, 2011--WITH RESPECT TO THE PROBLEMS OF "CONSENT" AND "MEDICAL ACT".

    PubMed

    Bandeira, Gonçalo S de Melo

    2014-07-01

1--Summary of the decision taken by the Portuguese Constitutional Court, of January 13, 2011; 2--Complete text of the decision of the Portuguese Constitutional Court, of January 13, 2011, Judge Maria João ANTUNES (Reporter), Judge Carlos Pamplona de OLIVEIRA, Judge José Borges SOEIRO, Judge Gil GALVÃO, Judge Rui Manuel Moura RAMOS (President)--in terms of the appositive declaration to the sentence n. 487/2010: http://www.tribunalconstitucional.pt, August 1, 2011; 3--Brief annotation to the problem of the "medical act"; 3.1--Plus some conclusions on the brief annotation to the problem of the "medical act"; 3.2--Brief annotation to the problem of "consent"--continuation of the previous comments; 4--Conclusions. It must never be forgotten that "consent" does not stand as the only cause of exclusion of unlawfulness.

  1. pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature

    PubMed Central

    Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.

    2015-01-01

Background Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. Methods In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these phases. Results We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant-relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/). PMID:26258475

  2. Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

    PubMed Central

    Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping

    2007-01-01

Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance-related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good-quality ESTs (GenBank dbEST accession numbers EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms: Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored and retrievable from a high-performance, web-based and user-friendly relational database called the EST model database (ESTMD) version 2. Conclusion The ESTMD, containing the sequence and annotation information of the 4032 E. fetida ESTs, is publicly accessible at . PMID:18047730

  3. Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: an annotation and machine learning study.

    PubMed

    Skeppstedt, Maria; Kvist, Maria; Nilsson, Gunnar H; Dalianis, Hercules

    2014-06-01

    Automatic recognition of clinical entities in the narrative text of health records is useful for constructing applications for documentation of patient care, as well as for secondary usage in the form of medical knowledge extraction. There are a number of named entity recognition studies on English clinical text, but less work has been carried out on clinical text in other languages. This study was performed on Swedish health records, and focused on four entities that are highly relevant for constructing a patient overview and for medical hypothesis generation, namely the entities: Disorder, Finding, Pharmaceutical Drug and Body Structure. The study had two aims: to explore how well named entity recognition methods previously applied to English clinical text perform on similar texts written in Swedish; and to evaluate whether it is meaningful to divide the more general category Medical Problem, which has been used in a number of previous studies, into the two more granular entities, Disorder and Finding. Clinical notes from a Swedish internal medicine emergency unit were annotated for the four selected entity categories, and the inter-annotator agreement between two pairs of annotators was measured, resulting in an average F-score of 0.79 for Disorder, 0.66 for Finding, 0.90 for Pharmaceutical Drug and 0.80 for Body Structure. A subset of the developed corpus was thereafter used for finding suitable features for training a conditional random fields model. Finally, a new model was trained on this subset, using the best features and settings, and its ability to generalise to held-out data was evaluated. This final model obtained an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure and 0.78 for the combined category Disorder+Finding. 
The obtained results, which are in line with or slightly lower than those for similar studies on English clinical text, many of them conducted using a larger training data set, show that the approaches used for English are also suitable for Swedish clinical text. However, a small proportion of the errors made by the model are less likely to occur in English text, showing that results might be improved by further tailoring the system to clinical Swedish. The entity recognition results for the individual entities Disorder and Finding show that it is meaningful to separate the general category Medical Problem into these two more granular entity types, e.g. for knowledge mining of co-morbidity relations and disorder-finding relations. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
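The feature-engineering step described above can be sketched as a token feature extractor of the kind commonly fed to a conditional random fields tagger. The features below (casing, affix, context window) are generic CRF choices for clinical NER, not the specific feature set selected in this study:

```python
# Generic per-token feature extractor for a CRF sequence tagger.
# Casing, suffix, and neighbouring-word features are standard choices
# for clinical NER; the study's exact feature set is not reproduced.

def token_features(tokens, i):
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "word.isupper": w.isupper(),
        "word.istitle": w.istitle(),
        "word.isdigit": w.isdigit(),
        "suffix3": w[-3:],  # e.g. Latinate disorder suffixes like "-cit"
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

sentence = "pat med akut appendicit".split()  # toy Swedish-style note text
print(token_features(sentence, 3))
```

A CRF trainer would receive one such feature dict per token, paired with the gold B/I/O labels from the annotated corpus.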

  4. Discovering gene annotations in biomedical text databases

    PubMed Central

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-01-01

Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision.
The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics. The relatively low recall of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, by adjusting the statistical enrichment threshold to lower values for applications that place more value on achieving higher recall. PMID:18325104
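A textual extraction pattern of the kind GEANN builds on can be sketched as a surface pattern with a gene-name slot, mapped to a GO namespace. The pattern, sentence, and mapping below are invented for illustration and are not taken from GEANN:

```python
# Hypothetical surface extraction pattern mapped to a GO namespace.
# GEANN learns such patterns from existing annotations and adds a
# semantic (WordNet-based) matching layer; this sketch shows only the
# literal-pattern baseline.
import re

PATTERNS = [
    # "<GENE> is involved in <process>" -> candidate process annotation
    (re.compile(r"(\w+) is involved in ([\w\s]+?)(?:\.|,|$)"),
     "biological_process"),
]

def extract(sentence):
    annotations = []
    for pattern, namespace in PATTERNS:
        for m in pattern.finditer(sentence):
            annotations.append((m.group(1), namespace, m.group(2).strip()))
    return annotations

print(extract("BRCA1 is involved in DNA repair."))
# -> [('BRCA1', 'biological_process', 'DNA repair')]
```

The semantic matching layer described in the abstract would additionally accept paraphrases of "is involved in" (e.g. synonyms found via WordNet), which is where the reported precision/recall gains come from.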

  5. Towards the VWO Annotation Service: a Success Story of the IMAGE RPI Expert Rating System

    NASA Astrophysics Data System (ADS)

    Reinisch, B. W.; Galkin, I. A.; Fung, S. F.; Benson, R. F.; Kozlov, A. V.; Khmyrov, G. M.; Garcia, L. N.

    2010-12-01

Interpretation of Heliophysics wave data requires specialized knowledge of wave phenomena. Users of the virtual wave observatory (VWO) will greatly benefit from a data annotation service that will allow querying of data by phenomenon type, thus helping accomplish the VWO goal to make Heliophysics wave data searchable, understandable, and usable by the scientific community. Individual annotations can be sorted by phenomenon type and reduced into event lists (catalogs). However, in contrast to the event lists, annotation records allow a greater flexibility of collaborative management by more easily admitting operations of addition, revision, or deletion. They can therefore become the building blocks for an interactive Annotation Service with a suitable graphic user interface to the VWO middleware. The VWO Annotation Service vision is an interactive, collaborative sharing of domain expert knowledge with fellow scientists and students alike. An effective prototype of the VWO Annotation Service has been in operation at the University of Massachusetts Lowell since 2001. An expert rating system (ERS) was developed for annotating the IMAGE radio plasma imager (RPI) active sounding data containing 1.2 million plasmagrams. The RPI data analysts can use ERS to submit expert ratings of plasmagram features, such as the presence of echo traces resulting from RPI signals reflected from distant plasma structures. Since its inception in 2001, the RPI ERS has accumulated 7351 expert plasmagram ratings in 16 phenomenon categories, together with free-text descriptions and other metadata. In addition to human expert ratings, the system holds 225,125 ratings submitted by the CORPRAL data prospecting software that employs a model of human pre-attentive vision to select images potentially containing interesting features. The annotation records proved to be instrumental in a number of investigations where manual data exploration would have been prohibitively tedious and expensive.
Especially useful are queries of the annotation database for successive plasmagrams containing echo traces. Several success stories of the RPI ERS using this capability will be discussed, particularly in terms of how they may be extended to develop the VWO Annotation Service.

  6. Discovering gene annotations in biomedical text databases.

    PubMed

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-03-06

Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics.
The relatively low recall of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, by adjusting the statistical enrichment threshold to lower values for applications that place more value on achieving higher recall.

  7. Facet Annotation by Extending CNN with a Matching Strategy.

    PubMed

    Wu, Bei; Wei, Bifan; Liu, Jun; Guo, Zhaotong; Zheng, Yuanhao; Chen, Yihe

    2018-06-01

Most community question answering (CQA) websites manage large numbers of question-answer pairs (QAPs) through topic-based organization, which may not satisfy users' fine-grained search demands. Facets of topics serve as a powerful tool to navigate, refine, and group the QAPs. In this work, we propose FACM, a model to annotate QAPs with facets by extending convolutional neural networks (CNNs) with a matching strategy. First, phrase information is incorporated into text representation by CNNs with different kernel sizes. Then, through a matching strategy among QAPs and facet label texts (FaLTs) acquired from Wikipedia, we generate similarity matrices to deal with the facet heterogeneity. Finally, a three-channel CNN is trained for facet label assignment of QAPs. Experiments on three real-world data sets show that FACM outperforms the state-of-the-art methods.

  8. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

    PubMed Central

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. PMID:25336595
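The "simple dictionary-based" end of the trade-off described above can be sketched as exact n-gram lookup against a term-to-concept map. The terms and concept identifiers below are invented placeholders, not entries from the NCBO Annotator's ontologies:

```python
# Minimal dictionary-based term recognizer: exact lookup of note
# n-grams in a term -> concept-ID map. The map below uses placeholder
# terms and IDs purely for illustration.

TERM_TO_CONCEPT = {
    "type 2 diabetes": "C0011860",  # placeholder concept ID
    "obesity": "C0028754",          # placeholder concept ID
}

def annotate(text, max_len=4):
    tokens = text.lower().split()
    hits = []
    for i in range(len(tokens)):
        # Longest match first at each start position.
        for j in range(min(len(tokens), i + max_len), i, -1):
            term = " ".join(tokens[i:j])
            if term in TERM_TO_CONCEPT:
                hits.append((term, TERM_TO_CONCEPT[term]))
                break
    return hits

print(annotate("Patient has type 2 diabetes and obesity"))
# -> [('type 2 diabetes', 'C0011860'), ('obesity', 'C0028754')]
```

The appeal quantified in the study is that this kind of lookup scales to millions of notes with trivial cost, while forgoing the negation, hedging, and syntactic context that a full NLP system like REVEAL can model.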

  9. Student and Teacher Perspectives on a Close Reading Protocol

    ERIC Educational Resources Information Center

    Fisher, Douglas; Frey, Nancy

    2014-01-01

    Close reading is an instructional practice that has gained attention of late. It involves reading a complex text, annotation, and repeatedly reading to answer text-dependent questions. Although there are a number of recommendations for the use of close reading, there has not been a systematic analysis of student or teacher perceptions of this…

  10. Wege zur Festigung der Fachlexik (Ways of Building a Technical Vocabulary)

    ERIC Educational Resources Information Center

    Kalmykowa, Elena

    1974-01-01

    Learning to read foreign technical texts requires (a) a minimum vocabulary and (b) effective drill. The following are suggested: translation, question-answer method, definition exercises, exercises to develop skill in the use of reference works, schematic presentations and tables, annotating and making reports and summaries. (Text is in German.)…

  11. BioUSeR: a semantic-based tool for retrieving Life Science web resources driven by text-rich user requirements

    PubMed Central

    2013-01-01

Background Open metadata registries are a fundamental tool for researchers in the Life Sciences trying to locate resources. While most current registries assume that resources are annotated with well-structured metadata, evidence shows that most resource annotations simply consist of informal free text. This reality must be taken into account in order to develop effective techniques for resource discovery in Life Sciences. Results BioUSeR is a semantic-based tool aimed at retrieving Life Sciences resources described in free text. The retrieval process is driven by the user requirements, which consist of a target task and a set of facets of interest, both expressed in free text. BioUSeR is able to effectively exploit the available textual descriptions to find relevant resources by using semantic-aware techniques. Conclusions BioUSeR overcomes the limitations of the current registries thanks to: (i) rich specification of user information needs, (ii) use of semantics to manage textual descriptions, (iii) retrieval and ranking of resources based on user requirements. PMID:23635042

  12. Database citation in full text biomedical articles.

    PubMed

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.
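The mining step described above can be sketched with regular expressions over full text. The patterns below are simplified approximations of real accession formats (they over-generate, e.g. a four-digit year matches the PDB pattern) and are not the patterns used by the Europe PMC pipeline, which relies on additional context to disambiguate:

```python
# Hedged sketch of accession-number spotting in full text. The regexes
# are simplified approximations of ENA/UniProt/PDB formats; a production
# pipeline must disambiguate with surrounding context (e.g. "PDB entry").
import re

ACCESSION_PATTERNS = {
    # ENA/GenBank nucleotide style, e.g. "AB123456"
    "ENA": re.compile(r"\b[A-Z]{1,2}\d{5,6}\b"),
    # UniProt style, e.g. "P12345"
    "UniProt": re.compile(r"\b[OPQ]\d[A-Z0-9]{3}\d\b|\b[A-NR-Z]\d[A-Z][A-Z0-9]{2}\d\b"),
    # PDB style, e.g. "1TUP" (also matches plain 4-digit numbers!)
    "PDB": re.compile(r"\b\d[A-Za-z0-9]{3}\b"),
}

def find_citations(text):
    hits = []
    for db, pattern in ACCESSION_PATTERNS.items():
        hits.extend((db, m.group()) for m in pattern.finditer(text))
    return hits

print(find_citations("Coordinates are deposited as PDB entry 1TUP; sequence under P04637."))
```

Note that "P04637" matches both the UniProt and the ENA pattern here, which is exactly why high-precision pipelines of the kind described combine pattern hits with contextual cues and database lookups before emitting a structured annotation.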

  13. Database Citation in Full Text Biomedical Articles

    PubMed Central

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R.

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176

  14. Unsupervised Biomedical Named Entity Recognition: Experiments with Clinical and Biological Texts

    PubMed Central

    Zhang, Shaodian; Elhadad, Nóemie

    2013-01-01

    Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and ultimately reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or identifying new types of entities requires major effort in re-annotation or rule development. In this paper, we propose an unsupervised approach to extracting named entities from biomedical text. We describe a stepwise solution to tackle the challenges of entity boundary detection and entity type classification without relying on any handcrafted rules, heuristics, or annotated data. A noun phrase chunker followed by a filter based on inverse document frequency extracts candidate entities from free text. Classification of candidate entities into categories of interest is carried out by leveraging principles from distributional semantics. Experiments show that our system, especially the entity classification step, yields competitive results on two popular biomedical datasets of clinical notes and biological literature, and outperforms a baseline dictionary match approach. Detailed error analysis provides a road map for future work. PMID:23954592
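The candidate-filtering step described above can be sketched as ranking noun-phrase candidates by inverse document frequency and discarding ubiquitous phrases. The toy corpus, candidate list, and threshold are invented for illustration; the paper's chunker and exact cutoff are not reproduced:

```python
# IDF filter over candidate entity phrases: very common phrases get a
# low (even negative, with this smoothing) IDF and are discarded as
# unlikely entities. Corpus and threshold are toy values.
import math

def idf_filter(candidates, documents, threshold=0.3):
    n = len(documents)
    kept = []
    for phrase in candidates:
        df = sum(1 for doc in documents if phrase in doc.lower())
        idf = math.log(n / (1 + df))
        if idf >= threshold:  # frequent boilerplate falls below the cutoff
            kept.append(phrase)
    return kept

docs = [
    "the patient was given aspirin for chest pain",
    "the patient denies chest pain",
    "the patient was discharged home",
]
print(idf_filter(["the patient", "aspirin"], docs))
# -> ['aspirin']
```

In the full approach, the surviving candidates would then be classified into entity types using distributional-semantics similarity to seed terms, rather than handcrafted rules or annotated data.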

  15. MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads

    PubMed Central

    Lukjancenko, Oksana; Thomsen, Martin Christen Frølund; Maddalena Sperotto, Maria; Lund, Ole; Møller Aarestrup, Frank; Sicheritz-Pontén, Thomas

    2017-01-01

An increasing number of species and gene identification studies rely on the use of next generation sequence analysis of either single isolate or metagenomics samples. Several methods are available to perform taxonomic annotations, and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package to process raw next generation sequence data and perform reference based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species and strain level resolution. An in-vitro bacterial mock community sample comprising 8 genera, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision of 75% were obtained for strain level annotations. A comparison between MGmapper and Kraken at species level shows that MGmapper assigns taxonomy at species level using 84.8% of the sequence reads, compared to 70.5% for Kraken, and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and Excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible with the command-line version of MGmapper, and the complete pipeline is freely available as a Bitbucket package (https://bitbucket.org/genomicepidemiology/mgmapper). A web-version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets. PMID:28467460

  16. MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads.

    PubMed

    Petersen, Thomas Nordahl; Lukjancenko, Oksana; Thomsen, Martin Christen Frølund; Maddalena Sperotto, Maria; Lund, Ole; Møller Aarestrup, Frank; Sicheritz-Pontén, Thomas

    2017-01-01

An increasing number of species and gene identification studies rely on the use of next generation sequence analysis of either single isolate or metagenomics samples. Several methods are available to perform taxonomic annotations, and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package to process raw next generation sequence data and perform reference based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species and strain level resolution. An in-vitro bacterial mock community sample comprising 8 genera, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision of 75% were obtained for strain level annotations. A comparison between MGmapper and Kraken at species level shows that MGmapper assigns taxonomy at species level using 84.8% of the sequence reads, compared to 70.5% for Kraken, and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and Excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible with the command-line version of MGmapper, and the complete pipeline is freely available as a Bitbucket package (https://bitbucket.org/genomicepidemiology/mgmapper). A web-version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets.

  17. NCBI disease corpus: a resource for disease name recognition and concept normalization.

    PubMed

    Doğan, Rezarta Islamaj; Leaman, Robert; Lu, Zhiyong

    2014-02-01

    Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. 
To demonstrate its utility, we conducted a benchmarking experiment in which we compared three different knowledge-based disease normalization methods, the best of which achieved an F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard, thus enabling the development of machine-learning-based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. Published by Elsevier Inc.
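As an illustrative aside, the concept-normalization step described above (mapping a disease mention to a MeSH or OMIM identifier) can be sketched as a dictionary lookup. The mini-lexicon below is an invented example (though the two identifiers are real MeSH/OMIM codes); the knowledge-based methods benchmarked on the corpus are far richer than this.

```python
def normalize(mention, lexicon):
    """Map a disease mention to a concept ID after case-folding and
    whitespace normalization; return None for unknown mentions."""
    return lexicon.get(" ".join(mention.lower().split()))

# Toy lexicon for illustration; real systems load MeSH/OMIM wholesale.
lexicon = {
    "colorectal cancer": "MeSH:D015179",
    "huntington disease": "OMIM:143100",
}

print(normalize("Colorectal  Cancer", lexicon))  # MeSH:D015179
```

A real normalizer must additionally handle abbreviations, composite mentions and concept combinations, which is where the corpus's gold standard becomes essential.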

  18. Negation’s Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing

    PubMed Central

    Wu, Stephen; Miller, Timothy; Masanz, James; Coarr, Matt; Halgrim, Scott; Carrell, David; Clark, Cheryl

    2014-01-01

    A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been “solved.” This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP. PMID:25393544
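To make the task concrete, rule-based negation detection in the NegEx tradition (one of the method families this study compares) can be reduced to a window check for trigger phrases before a concept. The trigger list and window size below are illustrative assumptions, not the paper's Polarity Module.

```python
# Minimal NegEx-style rule: a concept token is negated if a trigger
# phrase occurs within `window` tokens before it in the same sentence.
TRIGGERS = {"no", "denies", "without", "not"}

def is_negated(tokens, concept_index, window=4):
    start = max(0, concept_index - window)
    return any(t.lower() in TRIGGERS for t in tokens[start:concept_index])

tokens = "Patient denies chest pain on exertion".split()
print(is_negated(tokens, tokens.index("pain")))  # True
```

The abstract's point is that tuning such triggers and windows to one corpus optimizes performance there without generalizing to other clinical domains.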

  19. PedAM: a database for Pediatric Disease Annotation and Medicine.

    PubMed

    Jia, Jinmeng; An, Zhongxin; Ming, Yue; Guo, Yongli; Li, Wei; Li, Xin; Liang, Yunxiang; Guo, Dongming; Tai, Jun; Chen, Geng; Jin, Yaqiong; Liu, Zhimei; Ni, Xin; Shi, Tieliu

    2018-01-04

    A significant number of children around the world suffer from the consequences of misdiagnosis and ineffective treatment of various diseases. To facilitate precision medicine in pediatrics, a database, the Pediatric Disease Annotations & Medicines (PedAM), has been built to standardize and classify pediatric diseases. The PedAM integrates both biomedical resources and clinical data from Electronic Medical Records to support the development of computational tools, enabling robust data analysis and integration. It also uses disease-manifestation (D-M) relationships integrated from existing biomedical ontologies as prior knowledge to automatically recognize text-mined, D-M-specific syntactic patterns from 774 514 full-text articles and 8 848 796 abstracts in MEDLINE. Additionally, disease connections based on phenotypes or genes can be visualized on the web page of PedAM. Currently, the PedAM contains 8528 standardized pediatric disease terms (4542 unique disease concepts and 3986 synonyms) with eight annotation fields for each disease, including definition, synonyms, gene, symptom, cross-reference (Xref), human phenotypes and the corresponding phenotypes in the mouse. The database PedAM is freely accessible at http://www.unimd.org/pedam/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. Metabolic Profiles and cDNA-AFLP Analysis of Salvia miltiorrhiza and Salvia castanea Diels f. tomentosa Stib

    PubMed Central

    Liang, Xiao; Liang, Zongsuo; Zhang, Meixiang; Shen, Shuang; Liu, Hongyun; Liu, Yan

    2012-01-01

    Plants of the genus Salvia produce various types of phenolic compounds and tanshinones, which are effective in the treatment of coronary heart disease. Salvia miltiorrhiza and S. castanea Diels f. tomentosa Stib are two important members of the genus. In this study, metabolic profiles and cDNA-AFLP analysis of four samples were employed to identify novel genes potentially involved in phenolic compound and tanshinone biosynthesis, including the red roots from the two species and two tanshinone-free roots from S. miltiorrhiza. The results showed that the red roots of S. castanea Diels f. tomentosa Stib produced high contents of rosmarinic acid (21.77 mg/g) and tanshinone IIA (12.60 mg/g), but a low content of salvianolic acid B (1.45 mg/g). The red roots of S. miltiorrhiza produced a high content of salvianolic acid B (18.69 mg/g), while tanshinone accumulation in this sample was much lower than that in S. castanea Diels f. tomentosa Stib. Tanshinones were not detected in the two tanshinone-free samples, which produced high contents of phenolic compounds. A cDNA-AFLP analysis with 128 primer pairs revealed that 2300 transcript-derived fragments (TDFs) were differentially expressed among the four samples. In total, 323 TDFs were sequenced, of which 78 TDFs were annotated with known functions through BLASTX searches against the GenBank database and 14 annotated TDFs were assigned to secondary metabolic pathways through searches of the KEGG PATHWAY database. Quantitative real-time PCR analysis indicated that the expression of 9 TDFs was positively correlated with the accumulation of phenolic compounds and tanshinones. These TDFs additionally showed a coordinated transcriptional response with 6 previously identified genes involved in the biosynthesis of tanshinones and phenolic compounds in S. miltiorrhiza hairy roots treated with yeast extract. 
The sequence data in the present work not only provided us candidate genes involved in phenolic compounds and tanshinones biosynthesis but also gave us further insight into secondary metabolism in Salvia. PMID:22303439

  1. ENVIRONMENTS and EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life.

    PubMed

    Pafilis, Evangelos; Frankild, Sune P; Schnetzer, Julia; Fanini, Lucia; Faulwetter, Sarah; Pavloudi, Christina; Vasileiadou, Katerina; Leary, Patrick; Hammock, Jennifer; Schulz, Katja; Parr, Cynthia Sims; Arvanitidis, Christos; Jensen, Lars Juhl

    2015-06-01

    The association of organisms to their environments is a key issue in exploring biodiversity patterns. This knowledge has traditionally been scattered, but textual descriptions of taxa and their habitats are now being consolidated in centralized resources. However, structured annotations are needed to facilitate large-scale analyses. Therefore, we developed ENVIRONMENTS, a fast dictionary-based tagger capable of identifying Environment Ontology (ENVO) terms in text. We evaluate the accuracy of the tagger on a new manually curated corpus of 600 Encyclopedia of Life (EOL) species pages. We use the tagger to associate taxa with environments by tagging EOL text content monthly, and integrate the results into the EOL to disseminate them to a broad audience of users. The software and the corpus are available under the open-source BSD and the CC-BY-NC-SA 3.0 licenses, respectively, at http://environments.hcmr.gr. © The Author 2015. Published by Oxford University Press.
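The core of a dictionary-based tagger such as ENVIRONMENTS can be sketched as exhaustive matching of ontology terms against text. The sketch below is illustrative only (the real tagger uses a highly optimized matching engine over the full ENVO vocabulary); the two ENVO identifiers are real, but the dictionary is truncated to two entries.

```python
import re

# Toy dictionary of Environment Ontology terms -> ENVO identifiers.
DICTIONARY = {
    "coral reef": "ENVO:00000150",
    "estuary": "ENVO:00000045",
}

def tag(text):
    """Return (start, end, ENVO id) spans for dictionary terms in text."""
    hits = []
    for term, envo_id in DICTIONARY.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.I):
            hits.append((m.start(), m.end(), envo_id))
    return sorted(hits)

print(tag("This species inhabits the coral reef and the nearby estuary."))
```

Aggregating such spans over EOL species pages yields the taxon-to-environment associations the abstract describes.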

  2. Recognizing chemicals in patents: a comparative analysis.

    PubMed

    Habibi, Maryam; Wiegandt, David Luis; Schmedding, Florian; Leser, Ulf

    2016-01-01

    Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need to automatically analyze today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical findings. However, NER on patents has long been neglected by the research community, mostly because of the lack of sufficiently large annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold-standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results.

  3. Ahuna Mons

    NASA Image and Video Library

    2018-03-14

    This view from NASA's Dawn mission shows Ceres' tallest mountain, Ahuna Mons, 2.5 miles (4 kilometers) high and 11 miles (17 kilometers) wide. This is one of the few sites on Ceres at which a significant amount of sodium carbonate has been found, shown in green and red colors in the lower right image. The top and lower left images were collected by Dawn's framing camera. The top image is a 3D view reconstructed with the help of topography data. A non-annotated version is available at https://photojournal.jpl.nasa.gov/catalog/PIA21919

  4. LC-MS analysis of Hep-2 and Hek-293 cell lines treated with Brazilian red propolis reveals differences in protein expression.

    PubMed

    da Silva Frozza, Caroline O; da Silva Brum, Emyle; Alving, Anjali; Moura, Sidnei; Henriques, João A P; Roesch-Ely, Mariana

    2016-08-01

    Red propolis, an exclusive variety of propolis found in the northeast of Brazil, has been shown to exhibit antitumour activity, among several other biological properties. This article aims to evaluate the underlying molecular mechanisms of the potential anticancer effects of red propolis on tumour (Hep-2) and non-tumour (Hek-293) cells. Differentially expressed proteins in human cell lines were identified through a label-free quantitative MS-based proteomic platform, and cells were stained with Giemsa to show morphological changes. A total of 1336 and 773 proteins were identified for Hep-2 and Hek-293, respectively. Among the proteins identified here, 16 were regulated in the Hep-2 cell line and four in the Hek-293 line. In total, over 2000 proteins were identified by MS analysis, and approximately 1% presented differential expression patterns. GO annotation using the Protein Analysis THrough Evolutionary Relationships classification system revealed a predominant molecular function of catalytic activity, and among the biological processes, the most prominent was associated with cell metabolism. The proteomic profile presented here should help elucidate the molecular mechanisms involved in the inhibition of cancer cell proliferation by red propolis, which remain unclear to date. © 2016 Royal Pharmaceutical Society.

  5. Goethe-Institut Lehrerhandbibliothek: A Bibliography of Texts and Reference Works for the Teacher of German.

    ERIC Educational Resources Information Center

    Goethe Inst., Munich (West Germany).

    Teachers of German will find 120 items dating from 1960 entered in this alphabetized and annotated bibliography of texts and references. Language entries include material on: (1) grammar, (2) vocabulary, (3) pronunciation, (4) writing and style, (5) syntax, (6) morphology, (7) etymology, (8) dictation, (9) translation exercises, (10) verbs, (11)…

  6. From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks

    PubMed Central

    Thomer, Andrea; Vaidya, Gaurav; Guralnick, Robert; Bloom, David; Russell, Laura

    2012-01-01

    Part diary, part scientific record, biological field notebooks often contain details necessary to understanding the location and environmental conditions existent during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from field notebooks have been idiosyncratic to specific research projects, and impossible to discover or re-use. Best practices and workflows for digitization, transcription, extraction, and integration with other sources are nascent or non-existent. In this paper, we demonstrate a workflow to generate structured outputs while also maintaining links to the original texts. The first step in this workflow was to place already digitized and transcribed field notebooks from the University of Colorado Museum of Natural History founder, Junius Henderson, on Wikisource, an open text transcription platform. Next, we created Wikisource templates to document places, dates, and taxa to facilitate annotation and wiki-linking. We then requested help from the public, through social media tools, to take advantage of volunteer efforts and energy. After three notebooks were fully annotated, content was converted into XML and annotations were extracted and cross-walked into Darwin Core compliant record sets. Finally, these recordsets were vetted, to provide valid taxon names, via a process we call “taxonomic referencing.” The result is identification and mobilization of 1,068 observations from three of Henderson’s thirteen notebooks and a publishable Darwin Core record set for use in other analyses. Although challenges remain, this work demonstrates a feasible approach to unlock observations from field notebooks that enhances their discovery and interoperability without losing the narrative context from which those observations are drawn. “Compose your notes as if you were writing a letter to someone a century in the future.” Perrine and Patton (2011) PMID:22859891

  7. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database.

    PubMed

    Carver, Tim; Berriman, Matthew; Tivey, Adrian; Patel, Chinmay; Böhme, Ulrike; Barrell, Barclay G; Parkhill, Julian; Rajandream, Marie-Adèle

    2008-12-01

    Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/

  8. Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus

    PubMed Central

    Comeau, Donald C.; Liu, Haibin; Islamaj Doğan, Rezarta; Wilbur, W. John

    2014-01-01

    BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated, along with other BioC programs, into any BioC-compliant text mining system. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines have successfully run on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net PMID:24935050
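A toy rendition of the first two pipeline stages (sentence segmentation and tokenization) is shown below. This is not the BioC code: the actual pipelines wrap the MedPost and Stanford tool sets, whereas these regex rules are a minimal stand-in to show what each stage produces.

```python
import re

def sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokens(sentence):
    """Split into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

doc = "BioC is a format. It has code libraries."
for s in sentences(doc):
    print(tokens(s))
```

In the real pipelines each stage's output is serialized back into BioC passages and annotations so downstream components can consume it.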

  9. A Bayesian network coding scheme for annotating biomedical information presented to genetic counseling clients.

    PubMed

    Green, Nancy

    2005-04-01

    We developed a Bayesian network coding scheme for annotating biomedical content in layperson-oriented clinical genetics documents. The coding scheme supports the representation of probabilistic and causal relationships among concepts in this domain, at a high enough level of abstraction to capture commonalities among genetic processes and their relationship to health. We are using the coding scheme to annotate a corpus of genetic counseling patient letters as part of the requirements analysis and knowledge acquisition phase of a natural language generation project. This paper describes the coding scheme and presents an evaluation of intercoder reliability for its tag set. In addition to giving examples of use of the coding scheme for analysis of discourse and linguistic features in this genre, we suggest other uses for it in analysis of layperson-oriented text and dialogue in medical communication.

  10. De novo transcriptomic analysis and development of EST-SSRs for Sorbus pohuashanensis (Hance) Hedl.

    PubMed Central

    Guan, Xuelian; Fu, Qiang; Zhang, Ze; Hu, Zenghui; Zheng, Jian; Lu, Yizeng; Li, Wei

    2017-01-01

    Sorbus pohuashanensis is a native tree species of northern China that is used for a variety of ecological purposes. The species is often grown as an ornamental landscape tree because of its beautiful form, silver flowers in early summer, attractive pinnate leaves in summer, and red leaves and fruits in autumn. However, development and further utilization of the species are hindered by the lack of comprehensive genetic information, which impedes research into its genetics and molecular biology. Recent advances in de novo transcriptome sequencing (RNA-seq) technology have provided an effective means to obtain genomic information from non-model species. Here, we applied RNA-seq for sequencing S. pohuashanensis leaves and obtained a total of 137,506 clean reads. After assembly, 96,213 unigenes with an average length of 770 bp were obtained. We found that 64.5% of the unigenes could be annotated using bioinformatics tools to analyze gene function and alignment with the NCBI database. Overall, 59,089 unigenes were annotated using the Nr database(non-redundant protein database), 35,225 unigenes were annotated using the GO (Gene Ontology categories) database, and 33,168 unigenes were annotated using COG (Cluster of Orthologous Groups). Analysis of the unigenes using the KEGG (Kyoto Encyclopedia of Genes and Genomes) database indicated that 13,953 unigenes were involved in 322 metabolic pathways. Finally, simple sequence repeat (SSR) site detection identified 6,604 unigenes that included EST-SSRs and a total of 7,473 EST-SSRs in the unigene sequences. Fifteen polymorphic SSRs were screened and found to be of use for future genetic research. These unigene sequences will provide important genetic resources for genetic improvement and investigation of biochemical processes in S. pohuashanensis. PMID:28614366

  11. Complete genome sequence of Coriobacterium glomerans type strain (PW2T) from the midgut of Pyrrhocoris apterus L. (red soldier bug)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Stackebrandt, Erko; Zeytun, Ahmet; Lapidus, Alla L.

    2013-01-01

    Coriobacterium glomerans Haas and Ko nig 1988, is the only species of the genus Coriobacterium, family Coriobacteriaceae, order Coriobacteriales, phylum Actinobacteria. The bacterium thrives as an endosymbiont of pyrrhocorid bugs, i.e. the red fire bug Pyrrhocoris apterus L. The rationale for sequencing the genome of strain PW2T is its endosymbiotic life style which is rare among members of Actinobacteria. Here we describe the features of this symbiont, together with the complete genome sequence and its annotation. This is the first complete genome sequence of a member of the genus Coriobacterium and the sixth member of the order Coriobacteriales for whichmore » complete genome sequences are now available. The 2,115,681 bp long single replicon genome with its 1,804 protein-coding and 54 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.« less

  12. Literature-based concept profiles for gene annotation: the issue of weighting.

    PubMed

    Jelier, Rob; Schuemie, Martijn J; Roes, Peter-Jan; van Mulligen, Erik M; Kors, Jan A

    2008-05-01

    Text-mining has been used to link biomedical concepts, such as genes or biological processes, to each other for annotation purposes or the generation of new hypotheses. To relate two concepts to each other several authors have used the vector space model, as vectors can be compared efficiently and transparently. Using this model, a concept is characterized by a list of associated concepts, together with weights that indicate the strength of the association. The associated concepts in the vectors and their weights are derived from a set of documents linked to the concept of interest. An important issue with this approach is the determination of the weights of the associated concepts. Various schemes have been proposed to determine these weights, but no comparative studies of the different approaches are available. Here we compare several weighting approaches in a large scale classification experiment. Three different techniques were evaluated: (1) weighting based on averaging, an empirical approach; (2) the log likelihood ratio, a test-based measure; (3) the uncertainty coefficient, an information-theory based measure. The weighting schemes were applied in a system that annotates genes with Gene Ontology codes. As the gold standard for our study we used the annotations provided by the Gene Ontology Annotation project. Classification performance was evaluated by means of the receiver operating characteristics (ROC) curve using the area under the curve (AUC) as the measure of performance. All methods performed well with median AUC scores greater than 0.84, and scored considerably higher than a binary approach without any weighting. Especially for the more specific Gene Ontology codes excellent performance was observed. The differences between the methods were small when considering the whole experiment. However, the number of documents that were linked to a concept proved to be an important variable. 
When larger amounts of texts were available for the generation of the concepts' vectors, the performance of the methods diverged considerably, with the uncertainty coefficient then outperforming the two other methods.
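The vector space comparison underlying these concept profiles can be sketched as cosine similarity over sparse concept-weight maps. The concept names and weights below are invented for illustration; the study's point is precisely that how such weights are derived (averaging, log likelihood ratio, or uncertainty coefficient) affects the outcome.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {concept: weight} vectors."""
    num = sum(w * v.get(concept, 0.0) for concept, w in u.items())
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0

# Hypothetical concept profiles for two genes.
gene_a = {"apoptosis": 0.8, "kinase": 0.3}
gene_b = {"apoptosis": 0.5, "mitosis": 0.2}
print(round(cosine(gene_a, gene_b), 3))  # 0.869
```

Swapping in a different weighting scheme changes only how the weight values are computed, not this comparison step, which is what makes the schemes directly comparable.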

  13. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks.

    PubMed

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  14. Annotating images by mining image search results.

    PubMed

    Wang, Xin-Jing; Zhang, Lei; Li, Xirong; Ma, Wei-Ying

    2008-11-01

    Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results. Some 2.4 million images with their surrounding text are collected from a few photo forums to support this approach. The entire process is formulated in a divide-and-conquer framework where a query keyword is provided along with the uncaptioned image to improve both the effectiveness and efficiency. This is helpful when the collected data set is not dense everywhere. In this sense, our approach contains three steps: 1) the search process to discover visually and semantically similar search results, 2) the mining process to identify salient terms from textual descriptions of the search results, and 3) the annotation rejection process to filter out noisy terms yielded by Step 2. To ensure real-time annotation, two key techniques are leveraged: one is to map the high-dimensional image visual features into hash codes, and the other is to implement it as a distributed system, in which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than 1 second. Since no training data set is required, our approach enables annotation with an unlimited vocabulary and is highly scalable and robust to outliers. Experimental results on both real Web images and a benchmark image data set show the effectiveness and efficiency of the proposed algorithm. It is also worth noting that, although the entire approach is illustrated within the divide-and-conquer framework, a query keyword is not crucial to our current implementation. We provide experimental results to prove this.
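The mining step, identifying salient terms from the textual descriptions of visually similar search results, can be approximated with simple frequency counting, as sketched below. The paper's actual saliency scoring, hashing and rejection machinery are considerably more elaborate, and the stop-word list here is an assumption.

```python
from collections import Counter

STOP = {"a", "the", "of", "in", "on", "and"}

def salient_terms(descriptions, k=3):
    """Rank non-stop-words by how often they recur across descriptions."""
    counts = Counter(
        w for d in descriptions for w in d.lower().split() if w not in STOP
    )
    return [w for w, _ in counts.most_common(k)]

docs = ["sunset on the beach", "beach sunset in summer", "a sunset photo"]
print(salient_terms(docs))
```

Terms that recur across many independent descriptions of similar images are likely annotations; the rejection step then filters the residual noise.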

  15. Boosting drug named entity recognition using an aggregate classifier.

    PubMed

    Korkontzelos, Ioannis; Piliouras, Dimitrios; Dowsey, Andrew W; Ananiadou, Sophia

    2015-10-01

    Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations. We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system to combine a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means for recognition. Our approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database, as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers. Aggregating drug NER methods, based on gold-standard annotations, dictionary knowledge and patterns, improved the performance on models trained on gold-standard annotations, only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns are shown to achieve comparable performance to models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names. We conclude that gold-standard data are not a hard requirement for drug NER. 
Combining heterogeneous models built on dictionary knowledge can achieve classification performance similar or comparable to that of the best-performing model trained on gold-standard annotations. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
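A minimal sketch of the aggregation idea, combining a dictionary vote with a suffix-pattern vote, is given below. The suffixes, dictionary entries and vote threshold are illustrative assumptions, not the 11 evolved regular expressions or the trained maximum entropy classifiers from the study.

```python
import re

# Toy drug-suffix pattern and toy dictionary standing in for DrugBank.
SUFFIX = re.compile(r"\w+(?:mab|vir|azole|cillin)\b", re.I)
DICTIONARY = {"amoxicillin", "ritonavir"}

def recognizers(token):
    yield token.lower() in DICTIONARY      # dictionary-based vote
    yield bool(SUFFIX.fullmatch(token))    # suffix-pattern vote

def is_drug(token, min_votes=1):
    """Accept a token as a drug name if enough recognizers agree."""
    return sum(recognizers(token)) >= min_votes

print([t for t in "took amoxicillin and fluconazole daily".split() if is_drug(t)])
```

Raising `min_votes` trades recall for precision, which is how such ensembles are tuned in practice.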

  16. An Annotated Bibliography of Materials for Teaching Advanced Written Skills in English as a Second Language.

    ERIC Educational Resources Information Center

    Macha, Dyne; Angelis, Paul

    This bibliography includes textbooks useful for the teaching of written skills in English as a second language to college and university students. The major emphasis of the texts listed is the teaching of writing but some contain material on reading as well. The first section contains only texts designed specifically for second language use. The…

  17. Comprehensive red blood cell and platelet antigen prediction from whole genome sequencing: proof of principle

    PubMed Central

    Westhoff, Connie M.; Uy, Jon Michael; Aguad, Maria; Smeland‐Wagman, Robin; Kaufman, Richard M.; Rehm, Heidi L.; Green, Robert C.; Silberstein, Leslie E.

    2015-01-01

    BACKGROUND There are 346 serologically defined red blood cell (RBC) antigens and 33 serologically defined platelet (PLT) antigens, most of which have known genetic changes in 45 RBC or six PLT genes that correlate with antigen expression. Polymorphic sites associated with antigen expression in the primary literature and reference databases are annotated according to nucleotide positions in cDNA. This makes antigen prediction from next-generation sequencing data challenging, since it uses genomic coordinates. STUDY DESIGN AND METHODS The conventional cDNA reference sequences for all known RBC and PLT genes that correlate with antigen expression were aligned to the human reference genome. The alignments allowed conversion of conventional cDNA nucleotide positions to the corresponding genomic coordinates. RBC and PLT antigen prediction was then performed using the human reference genome and whole genome sequencing (WGS) data with serologic confirmation. RESULTS Some major differences and alignment issues were found when attempting to convert the conventional cDNA to human reference genome sequences for the following genes: ABO, A4GALT, RHD, RHCE, FUT3, ACKR1 (previously DARC), ACHE, FUT2, CR1, GCNT2, and RHAG. However, it was possible to create usable alignments, which facilitated the prediction of all RBC and PLT antigens with a known molecular basis from WGS data. Traditional serologic typing for 18 RBC antigens was in agreement with the WGS-based antigen predictions, providing proof of principle for this approach. CONCLUSION Detailed mapping of conventionally cDNA-annotated RBC and PLT alleles can enable accurate prediction of RBC and PLT antigens from whole genome sequencing data. PMID:26634332
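The coordinate conversion at the heart of this work, turning a cDNA position into a genomic position via an exon map, can be sketched as follows for a forward-strand gene. The exon coordinates are invented for illustration, and a real converter must also handle the reverse strand, UTR offsets and the alignment gaps the authors describe.

```python
def cdna_to_genomic(cdna_pos, exons):
    """Convert a 1-based cDNA position to a genomic coordinate.

    exons: ordered list of (genomic_start, genomic_end), 1-based inclusive,
    for a forward-strand transcript.
    """
    offset = cdna_pos
    for start, end in exons:
        length = end - start + 1
        if offset <= length:
            return start + offset - 1
        offset -= length
    raise ValueError("position beyond transcript length")

exons = [(1000, 1099), (2000, 2149)]   # hypothetical 100 bp + 150 bp exons
print(cdna_to_genomic(120, exons))     # 2019: 20th base of the second exon
```

Once every antigen-defining cDNA position is mapped this way, genotype calls from WGS data can be read off directly at the resulting genomic coordinates.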

  18. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications

    PubMed Central

    Masanz, James J; Ogren, Philip V; Zheng, Jiaping; Sohn, Sunghwan; Kipper-Schuler, Karin C; Chute, Christopher G

    2010-01-01

    We aim to build and evaluate an open-source natural language processing system for information extraction from electronic medical record clinical free-text. We describe and evaluate our system, the clinical Text Analysis and Knowledge Extraction System (cTAKES), released open-source at http://www.ohnlp.org. The cTAKES builds on existing open-source technologies—the Unstructured Information Management Architecture framework and OpenNLP natural language processing toolkit. Its components, specifically trained for the clinical domain, create rich linguistic and semantic annotations. Performance of individual components: sentence boundary detector accuracy=0.949; tokenizer accuracy=0.949; part-of-speech tagger accuracy=0.936; shallow parser F-score=0.924; named entity recognizer and system-level evaluation F-score=0.715 for exact and 0.824 for overlapping spans, and accuracy for concept mapping, negation, and status attributes for exact and overlapping spans of 0.957, 0.943, 0.859, and 0.580, 0.939, and 0.839, respectively. Overall performance is discussed against five applications. The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free-text. PMID:20819853

  19. Feature engineering for MEDLINE citation categorization with MeSH.

    PubMed

    Jimeno Yepes, Antonio Jose; Plaza, Laura; Carrillo-de-Albornoz, Jorge; Mork, James G; Aronson, Alan R

    2015-04-08

    Research in biomedical text categorization has mostly used the bag-of-words representation. Other, more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus perform worse than the more traditional features. Conceptual annotation of the texts by MetaMap shows performance similar to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs considerably better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those more resilient to class imbalance obtain markedly better performance. We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence on the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance markedly improves performance. However, when using a large set of features, care must be taken in choosing the algorithm due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.
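
    The strong unigram/bigram baseline described above amounts to bag-of-words counting. A minimal sketch, using a toy sentence rather than MEDLINE citation text, might look like:

```python
# Sketch of unigram + bigram bag-of-words feature extraction (illustrative,
# not the study's implementation; real systems also normalize and prune terms).
from collections import Counter

def ngram_features(text, n_values=(1, 2)):
    tokens = text.lower().split()
    feats = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            feats["_".join(tokens[i:i + n])] += 1  # count each n-gram occurrence
    return feats

feats = ngram_features("gene expression regulates gene expression")
print(feats["gene"])             # 2
print(feats["gene_expression"])  # 2
```

    These sparse count vectors would then feed the classifiers the paper compares, possibly alongside the richer feature sets (noun phrases, MetaMap concepts, citation meta-data).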

  20. Film annotation system for a space experiment

    NASA Technical Reports Server (NTRS)

    Browne, W. R.; Johnson, S. S.

    1989-01-01

    This microprocessor system was designed to control and annotate a Nikon 35 mm camera for the purpose of obtaining photographs and data at predefined time intervals. The single STD Bus interface card was designed so that it can be used either in a stand-alone application with minimum features or installed in an STD Bus computer for maximum features. This control system also allows the exposure of twenty-eight alphanumeric characters across the bottom of each photograph. The data contain such information as camera identification, frame count, user-defined text, and time to 0.01 second.

  1. Challenges in discriminating profanity from hate speech

    NASA Astrophysics Data System (ADS)

    Malmasi, Shervin; Zampieri, Marcos

    2018-03-01

    In this study, we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered. Using a new dataset annotated specifically for this task, we employ supervised classification along with a set of features that includes n-grams, skip-grams and clustering-based word representations. We apply approaches based on single classifiers as well as more advanced ensemble classifiers and stacked generalisation, achieving our best result on this 3-class classification task. Analysis of the results reveals that discriminating hate speech from profanity is not a simple task, and may require features that capture a deeper understanding of the text than is possible with surface n-grams. The variability of gold labels in the annotated data, due to differences in the subjective adjudications of the annotators, is also an issue. Other directions for future work are discussed.
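
    Skip-grams, one of the surface features named above, relax n-grams by allowing gaps between the paired tokens. A sketch of k-skip-bigram extraction over a toy token list (the function name and parameters are illustrative, not the authors' code):

```python
# Illustrative k-skip-bigram extraction: all ordered token pairs with at
# most k tokens skipped between them (k=0 reduces to ordinary bigrams).

def skip_bigrams(tokens, k):
    pairs = []
    for i, left in enumerate(tokens):
        # right partner may sit up to k+1 positions ahead of the left token
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs

print(skip_bigrams(["this", "is", "so", "offensive"], k=1))
# [('this','is'), ('this','so'), ('is','so'), ('is','offensive'), ('so','offensive')]
```

    Gapped pairs like these let a classifier capture loose collocations that exact bigrams miss, at the cost of a larger, sparser feature space.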

  2. Discourse Analysis in Stylistics and Literature Instruction.

    ERIC Educational Resources Information Center

    Short, Mick

    1990-01-01

    A review of research regarding discourse analysis in stylistics and literature instruction covers studies of text, systematic analysis, meaning, style, literature pedagogy, and applied linguistics. A 10-citation annotated bibliography and a larger unannotated bibliography are included. (CB)

  3. SureChEMBL: a large-scale, chemically annotated patent document database.

    PubMed

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P

    2016-01-04

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  4. Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus.

    PubMed

    Comeau, Donald C; Liu, Haibin; Islamaj Doğan, Rezarta; Wilbur, W John

    2014-01-01

    BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated along with other BioC programs into any BioC compliant text mining systems. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines have successfully run on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net. © The Author(s) 2014. Published by Oxford University Press.

  5. Generating quality word sense disambiguation test sets based on MeSH indexing.

    PubMed

    Fan, Jung-Wei; Friedman, Carol

    2009-11-14

    Word sense disambiguation (WSD) determines the correct meaning of a word that has more than one meaning, and is a critical step in biomedical natural language processing, as interpretation of information in text can be correct only if the meanings of its component terms are correctly identified first. Quality evaluation sets are important to WSD because they can be used as representative samples for developing automatic programs and as referees for comparing different WSD programs. To help create quality test sets for WSD, we developed a MeSH-based automatic sense-tagging method that preferentially annotates terms that are topical to the text. Preliminary results were promising and revealed important issues to be addressed in biomedical WSD research. We also suggest that, by cross-validating with 2 or 3 annotators, the method should be able to efficiently generate quality WSD test sets. An online supplement is available at: http://www.dbmi.columbia.edu/~juf7002/AMIA09.
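
    The core idea of preferring topical senses can be sketched as follows: assuming each ambiguous term's candidate senses map to MeSH headings, a sense is accepted when its heading appears among the citation's own MeSH index terms. The sense-to-heading mapping and example values below are hypothetical, not taken from the paper.

```python
# Illustrative MeSH-based sense tagging: pick the candidate sense whose MeSH
# heading is among the terms the citation is indexed with (i.e., topical terms).

def tag_sense(term, candidate_senses, citation_mesh):
    """candidate_senses: {sense_name: mesh_heading}; returns the first sense
    whose heading occurs in the citation's MeSH indexing, else None."""
    for sense, heading in candidate_senses.items():
        if heading in citation_mesh:
            return sense
    return None

# Hypothetical ambiguity: "cold" as temperature vs. the illness.
senses = {"cold_temperature": "Cold Temperature",
          "common_cold": "Common Cold"}
print(tag_sense("cold", senses, {"Common Cold", "Rhinovirus"}))  # common_cold
```

    When no candidate heading is topical, the instance is left untagged, which is one way a sense-tagger can trade coverage for precision when building test sets.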

  6. Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics.

    PubMed

    Percha, Bethany; Altman, Russ B

    2013-01-01

    The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology.
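
    Random indexing itself is simple enough to sketch: each word receives a sparse random ternary "index vector", and a word's semantic vector is the sum of the index vectors of its window neighbours across the corpus. The dimensionality, window size, and toy drug/gene sentences below are illustrative assumptions, not the paper's settings.

```python
# Illustrative random indexing: sparse ternary index vectors, context vectors
# accumulated over a co-occurrence window, cosine similarity for comparison.
import random
from collections import defaultdict

DIM, NONZERO, WINDOW = 100, 4, 2
random.seed(0)  # deterministic for the toy example

def index_vector():
    v = [0] * DIM
    for pos in random.sample(range(DIM), NONZERO):
        v[pos] = random.choice([-1, 1])
    return v

def train(sentences):
    index = defaultdict(index_vector)        # one fixed random vector per word
    context = defaultdict(lambda: [0] * DIM)  # accumulated semantic vectors
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
                if j != i:
                    context[w] = [a + b for a, b in zip(context[w], index[tokens[j]])]
    return context

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(a * a for a in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

vecs = train([["warfarin", "inhibits", "vkorc1"],
              ["heparin", "inhibits", "antithrombin"]])
# Drugs sharing the context word "inhibits" tend to get similar vectors:
print(cosine(vecs["warfarin"], vecs["heparin"]))
```

    At realistic scale (thousands of dimensions, millions of sentences) the fixed dimensionality is the appeal: the model never grows with vocabulary size, which is what makes the method practical for large unlabeled corpora.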

  7. Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics

    PubMed Central

    Percha, Bethany; Altman, Russ B.

    2013-01-01

    The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology. PMID:24551397

  8. Automatic textual annotation of video news based on semantic visual object extraction

    NASA Astrophysics Data System (ADS)

    Boujemaa, Nozha; Fleuret, Francois; Gouet, Valerie; Sahbi, Hichem

    2003-12-01

    In this paper, we present our work on automatic generation of textual metadata based on visual content analysis of video news. We present two methods for semantic object detection and recognition using a cross-modal image-text thesaurus. These thesauri represent a supervised association between models and semantic labels. This paper is concerned with two semantic objects: faces and TV logos. In the first part, we present our work on efficient face detection and recognition with automatic name generation. This method also allows us to suggest textual annotation of shots through close-up estimation. In the second part, we automatically detect and recognize the different TV logos present in incoming news from different TV channels. This work was done jointly with the French TV channel TF1 within the "MediaWorks" project, a hybrid text-image indexing and retrieval platform for video news.

  9. SureChEMBL: a large-scale, chemically annotated patent document database

    PubMed Central

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A.; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P.

    2016-01-01

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. PMID:26582922

  10. The National Cancer Informatics Program (NCIP) Annotation and Image Markup (AIM) Foundation model.

    PubMed

    Mongkolwat, Pattanasak; Kleper, Vladimir; Talbot, Skip; Rubin, Daniel

    2014-12-01

    Knowledge contained within in vivo imaging annotated by human experts or computer programs is typically stored as unstructured text and separated from other associated information. The National Cancer Informatics Program (NCIP) Annotation and Image Markup (AIM) Foundation information model is an evolution of the National Institutes of Health's (NIH) National Cancer Institute's (NCI) cancer Biomedical Informatics Grid (caBIG®) AIM model. The model applies to various image types created by various techniques and disciplines. It has evolved in response to the feedback and changing demands from the imaging community at NCI. The foundation model serves as a base for other imaging disciplines that want to extend the type of information the model collects. The model captures physical entities and their characteristics, imaging observation entities and their characteristics, markups (two- and three-dimensional), AIM statements, calculations, image source, inferences, annotation role, task context or workflow, audit trail, AIM creator details, equipment used to create AIM instances, subject demographics, and adjudication observations. An AIM instance can be stored as a Digital Imaging and Communications in Medicine (DICOM) structured reporting (SR) object or an Extensible Markup Language (XML) document for further processing and analysis. An AIM instance consists of one or more annotations and associated markups of a single finding, along with other ancillary information in the AIM model. An annotation describes information about the meaning of pixel data in an image. A markup is a graphical drawing placed on the image that depicts a region of interest. This paper describes fundamental AIM concepts and how to use and extend AIM for various imaging disciplines.

  11. Color- and motion-specific units in the tectum opticum of goldfish.

    PubMed

    Gruber, Morna; Behrend, Konstantin; Neumeyer, Christa

    2016-01-05

    Extracellular recordings were performed from 69 units at different depths between 50 and [Formula: see text]m below the surface of the tectum opticum in goldfish. Using large-field stimuli (86° visual angle) of 21 colored HKS papers, we were able to record from 54 color-sensitive units. The colored papers were presented for 5 s each. They were arranged in the sequence of the color circle in humans, separated by gray of medium brightness. We found 22 units with best responses between orange, red and pink. About 12 of these red-sensitive units were of the opponent "red-ON/blue-green-OFF" type, as found in retinal bipolar and ganglion cells as well. Most of them were also activated or inhibited by black and/or white. Some units responded specifically to red, either with activation or inhibition. 18 units were sensitive to blue and/or green, 10 of them to both colors and most of them to black as well. They were inhibited by red, and belonged to the opponent "blue-green-ON/red-OFF" type. Other units responded more selectively, either to blue, to green or to purple. Two units were selectively sensitive to yellow. A total of 15 units were sensitive to motion, stimulated by an eccentrically rotating black-and-white random dot pattern. Activity of these units was also large when a red-green random dot pattern of high L-cone contrast was used. Activity dropped to zero when the red-green pattern did not modulate the L-cones. None of these motion-selective units responded to any color. The results directly show the color-blindness of motion vision, and confirm the hypothesis of separate and parallel processing of "color" and "motion".

  12. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts

    PubMed Central

    Lu, Zhiyong

    2012-01-01

    Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/ PMID:23160414

  13. ENVIRONMENTS and EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life

    PubMed Central

    Pafilis, Evangelos; Frankild, Sune P.; Schnetzer, Julia; Fanini, Lucia; Faulwetter, Sarah; Pavloudi, Christina; Vasileiadou, Katerina; Leary, Patrick; Hammock, Jennifer; Schulz, Katja; Parr, Cynthia Sims; Arvanitidis, Christos; Jensen, Lars Juhl

    2015-01-01

    Summary: The association of organisms to their environments is a key issue in exploring biodiversity patterns. This knowledge has traditionally been scattered, but textual descriptions of taxa and their habitats are now being consolidated in centralized resources. However, structured annotations are needed to facilitate large-scale analyses. Therefore, we developed ENVIRONMENTS, a fast dictionary-based tagger capable of identifying Environment Ontology (ENVO) terms in text. We evaluate the accuracy of the tagger on a new manually curated corpus of 600 Encyclopedia of Life (EOL) species pages. We use the tagger to associate taxa with environments by tagging EOL text content monthly, and integrate the results into the EOL to disseminate them to a broad audience of users. Availability and implementation: The software and the corpus are available under the open-source BSD and the CC-BY-NC-SA 3.0 licenses, respectively, at http://environments.hcmr.gr. Contact: pafilis@hcmr.gr or lars.juhl.jensen@cpr.ku.dk. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25619994

  14. Automated annotation of functional imaging experiments via multi-label classification

    PubMed Central

    Turner, Matthew D.; Chakrabarti, Chayan; Jones, Thomas B.; Xu, Jiawei F.; Fox, Peter T.; Luger, George F.; Laird, Angela R.; Turner, Jessica A.

    2013-01-01

    Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text. PMID:24409112
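
    The binary relevance transformation that performed well here is easy to sketch: the multi-label problem over CogPO-style labels is split into one independent binary classifier per label, and a document's predicted label set is the union of the per-label positives. The tiny Bernoulli naive Bayes and the toy "abstracts" below are illustrative, not the study's implementation or data.

```python
# Illustrative binary relevance with a minimal Bernoulli naive Bayes.
import math

class BernoulliNB:
    def fit(self, docs, labels):            # docs: list of token sets; labels: 0/1
        self.vocab = sorted({w for d in docs for w in d})
        self.prior, self.cond = {}, {}
        for c in (0, 1):
            idx = [i for i, y in enumerate(labels) if y == c]
            self.prior[c] = (len(idx) + 1) / (len(docs) + 2)   # Laplace smoothing
            self.cond[c] = {w: (sum(w in docs[i] for i in idx) + 1) / (len(idx) + 2)
                            for w in self.vocab}
        return self

    def predict(self, doc):
        scores = {}
        for c in (0, 1):
            s = math.log(self.prior[c])
            for w in self.vocab:
                p = self.cond[c][w]
                s += math.log(p if w in doc else 1 - p)
            scores[c] = s
        return max(scores, key=scores.get)

def binary_relevance_fit(docs, labelsets, all_labels):
    # one binary NB per label, trained on label-present vs. label-absent
    return {lab: BernoulliNB().fit(docs, [int(lab in ls) for ls in labelsets])
            for lab in all_labels}

def binary_relevance_predict(models, doc):
    return {lab for lab, m in models.items() if m.predict(doc) == 1}

docs = [{"visual", "stimuli", "fmri"}, {"auditory", "tones", "fmri"},
        {"visual", "faces"}, {"auditory", "speech"}]
labelsets = [{"visual"}, {"auditory"}, {"visual"}, {"auditory"}]
models = binary_relevance_fit(docs, labelsets, {"visual", "auditory"})
print(binary_relevance_predict(models, {"visual", "stimuli"}))  # {'visual'}
```

    Binary relevance ignores correlations between labels, but, as the abstract notes, its simplicity and robustness to overfitting make it a strong off-the-shelf baseline.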

  15. ADEPt, a semantically-enriched pipeline for extracting adverse drug events from free-text electronic health records.

    PubMed

    Iqbal, Ehtesham; Mallah, Robbie; Rhodes, Daniel; Wu, Honghan; Romero, Alvin; Chang, Nynn; Dzahini, Olubanke; Pandey, Chandra; Broadbent, Matthew; Stewart, Robert; Dobson, Richard J B; Ibrahim, Zina M

    2017-01-01

    Adverse drug events (ADEs) are unintended responses to medical treatment. They can greatly affect a patient's quality of life and present a substantial burden on healthcare. Although electronic health records (EHRs) document a wealth of information relating to ADEs, it is frequently stored in unstructured or semi-structured free-text narrative, requiring Natural Language Processing (NLP) techniques to mine the relevant information. Here we present a rule-based ADE detection and classification pipeline built and tested on a large psychiatric corpus comprising 264k patients, using the de-identified EHRs of four UK-based psychiatric hospitals. The pipeline uses characteristics specific to psychiatric EHRs to guide the annotation process, and distinguishes: a) the temporal value associated with the ADE mention (whether it is historical or present), b) the categorical value of the ADE (whether it is assertive, hypothetical, retrospective or a general discussion) and c) the implicit contextual value, where the status of the ADE is deduced from surrounding indicators rather than explicitly stated. We manually created the rulebase in collaboration with clinicians and pharmacists by studying ADE mentions in various types of clinical notes. We evaluated the open-source Adverse Drug Event annotation Pipeline (ADEPt) using 19 ADEs specific to antipsychotic and antidepressant medications. The ADEs chosen vary in severity, regularity and persistence. The average F-measure and accuracy achieved by our tool across all tested ADEs were both 0.83. In addition to its annotation power, the ADEPt pipeline presents an improvement to ConText, the state-of-the-art context-discerning algorithm.

  16. Modeling semantic aspects for cross-media image indexing.

    PubMed

    Monay, Florent; Gatica-Perez, Daniel

    2007-10-01

    To go beyond the query-by-example paradigm in image retrieval, there is a need for semantic indexing of large image collections for intuitive text-based image search. Different models have been proposed to learn the dependencies between the visual content of an image set and the associated text captions, thus allowing for the automatic creation of semantic indices for unannotated images. The task, however, remains unsolved. In this paper, we present three alternatives to learn a Probabilistic Latent Semantic Analysis (PLSA) model for annotated images, and evaluate their respective performance for automatic image indexing. Under the PLSA assumptions, an image is modeled as a mixture of latent aspects that generates both image features and text captions, and we investigate three ways to learn the mixture of aspects. We also propose a more discriminative image representation than the traditional Blob histogram, concatenating quantized local color information and quantized local texture descriptors. The first learning procedure of a PLSA model for annotated images is a standard EM algorithm, which implicitly assumes that the visual and the textual modalities can be treated equivalently. The other two models are based on an asymmetric PLSA learning, allowing us to constrain the definition of the latent space on the visual or on the textual modality. We demonstrate that the textual modality is more appropriate for learning a semantically meaningful latent space, which translates into improved annotation performance. A comparison of our learning algorithms with recent methods on a standard dataset is presented, and a detailed evaluation of the performance shows the validity of our framework.

  17. Datalingvistik, 2000.

    ERIC Educational Resources Information Center

    Kjaersgaard, Poul Soren, Ed.

    2002-01-01

    Papers from the conference in this volume include the following: "Towards Corpus Annotation Standards--The MATE Workbench" (Laila Dybkjaer and Niels Ole Bernsen); "Danish Text-to-Speech Synthesis Based on Stored Acoustic Segments" (Charles Hoequist); "Toward a Method for the Automated Design of Semantic…

  18. Portable automatic text classification for adverse drug reaction detection via multi-corpus training.

    PubMed

    Sarker, Abeed; Gonzalez, Graciela

    2015-02-01

    Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units), respectively. Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.

  19. Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-corpus Training

    PubMed Central

    Gonzalez, Graciela

    2014-01-01

    Objective Automatic detection of Adverse Drug Reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media — where enormous amounts of user posted data is available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. Methods One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Results Our feature-rich classification approach performs significantly better than previously published approaches with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively. 
Conclusions Our research results indicate that using advanced NLP techniques for generating information rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. PMID:25451103
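The multi-corpus training idea in this record can be sketched with a deliberately minimal bag-of-words classifier. This is a pure-Python illustration under stated assumptions: the texts, labels, and feature set are invented stand-ins for the study's richer semantic features (sentiment, topics, polarities) and its actual corpora.

```python
from collections import Counter

def featurize(text):
    # Bag-of-words features; a real system would add semantic features here.
    return Counter(text.lower().split())

def train(corpus):
    # corpus: list of (text, label); returns a per-label token profile.
    model = {}
    for text, label in corpus:
        model.setdefault(label, Counter()).update(featurize(text))
    return model

def classify(model, text):
    # Score each label by weighted token overlap with its profile.
    feats = featurize(text)
    return max(model, key=lambda lbl: sum(model[lbl][t] * c for t, c in feats.items()))

# Multi-corpus training: concatenate compatible labeled corpora before training.
clinical = [("patient developed severe rash after drug", "ADR"),
            ("dosage was well tolerated", "noADR")]
social   = [("this med gave me a horrible rash", "ADR"),
            ("feeling fine on the new med", "noADR")]
combined = train(clinical + social)
print(classify(combined, "rash after taking the med"))
```

The key move, as in the paper, is simply pooling training data from distinct but compatible sources (`clinical + social`) so the minority class gains examples.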

  20. Annotating image ROIs with text descriptions for multimodal biomedical document retrieval

    NASA Astrophysics Data System (ADS)

    You, Daekeun; Simpson, Matthew; Antani, Sameer; Demner-Fushman, Dina; Thoma, George R.

    2013-01-01

    Regions of interest (ROIs) that are pointed to by overlaid markers (arrows, asterisks, etc.) in biomedical images are expected to contain more important and relevant information than other regions for biomedical article indexing and retrieval. We have developed several algorithms that localize and extract the ROIs by recognizing markers on images. Cropped ROIs then need to be annotated with the content that best describes them. In most cases accurate textual descriptions of the ROIs can be found in figure captions, and these need to be combined with image ROIs for annotation. The annotated ROIs can then be used to, for example, train classifiers that separate ROIs into known categories (medical concepts), or to build visual ontologies, for indexing and retrieval of biomedical articles. We propose an algorithm that pairs visual and textual ROIs that are extracted from images and figure captions, respectively. This algorithm, based on dynamic time warping (DTW), clusters recognized pointers into groups, each of which contains pointers with identical visual properties (shape, size, color, etc.). Then a rule-based matching algorithm finds the best matching group for each textual ROI mention. Our method yields a precision and recall of 96% and 79%, respectively, when ground truth textual ROI data is used.
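The DTW step underlying the clustering can be sketched with the standard dynamic-programming recurrence. The per-pointer "shape signatures" below are invented for illustration; the paper's actual visual features are not specified here.

```python
def dtw_distance(a, b):
    # Classic dynamic time warping distance between two numeric sequences.
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of match, insertion, and deletion, plus local cost.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Pointers with similar signatures would be clustered together
# (signatures and threshold are illustrative, not from the paper).
signatures = {"arrow1": [0, 1, 3, 4], "arrow2": [0, 1, 2, 4], "asterisk": [5, 5, 5]}
print(dtw_distance(signatures["arrow1"], signatures["arrow2"]))
```

Similar shapes yield small distances, so a simple threshold on `dtw_distance` suffices to group pointers with near-identical visual properties.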

  1. Transcriptome analysis of the exocarp of apple fruit identifies light-induced genes involved in red color pigmentation.

    PubMed

    Vimolmangkang, Sornkanok; Zheng, Danman; Han, Yuepeng; Khan, M Awais; Soria-Guerra, Ruth Elena; Korban, Schuyler S

    2014-01-15

    Although the mechanism of light regulation of color pigmentation of apple fruit is not fully understood, it has been shown that light can regulate expression of genes in the anthocyanin biosynthesis pathway by inducing transcription factors (TFs). Moreover, expression of genes encoding enzymes involved in this pathway may be coordinately regulated by multiple TFs. In this study, fruits on trees of apple cv. Red Delicious were covered with paper bags during early stages of fruit development and then removed prior to maturation to analyze the transcriptome in the exocarp of apple fruit. Comparisons of gene expression profiles of fruit covered with paper bags (dark-grown treatment) and those subjected to 14 h light treatment, following removal of paper bags, were investigated using an apple microarray of 40,000 sequences. Expression profiles were investigated over three time points, at one week intervals, during fruit development. Overall, 736 genes with expression values greater than two-fold were found to be modulated by light treatment. Light-induced products were classified into 19 categories with highest scores in primary metabolism (17%) and transcription (12%). Based on the Arabidopsis gene ontology annotation, 18 genes were identified as TFs. To further confirm expression patterns of flavonoid-related genes, these were subjected to quantitative RT-PCR (qRT-PCR) using fruit of red-skinned apple cv. Red Delicious and yellow-skinned apple cv. Golden Delicious. Of these, two genes showed higher levels of expression in 'Red Delicious' than in 'Golden Delicious', and were likely involved in the regulation of fruit red color pigmentation. © 2013 Elsevier B.V. All rights reserved.

  2. A UIMA wrapper for the NCBO annotator

    PubMed Central

    Roeder, Christophe; Jonquet, Clement; Shah, Nigam H.; Baumgartner, William A.; Verspoor, Karin; Hunter, Lawrence

    2010-01-01

    Summary: The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator—an ontology-based annotation service—to make it available as a component in UIMA workflows. Availability: This wrapper is freely available on the web at http://bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows. Contact: chris.roeder@ucdenver.edu PMID:20505005

  3. De Novo Assembly and Comparative Transcriptome Analyses of Red and Green Morphs of Sweet Basil Grown in Full Sunlight.

    PubMed

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Guidi, Lucia; Gori, Antonella; Marzano, Cristina; Landi, Marco; Sebastiani, Federico

    2016-01-01

    Sweet basil (Ocimum basilicum), one of the most popular cultivated herbs worldwide, displays a number of varieties differing in several characteristics, such as the color of the leaves. The development of a reference transcriptome for sweet basil, and the analysis of differentially expressed genes in acyanic and cyanic cultivars exposed to natural sunlight irradiance, has interest from horticultural and biological point of views. There is still great uncertainty about the significance of anthocyanins in photoprotection, and how green and red morphs may perform when exposed to photo-inhibitory light, a condition plants face on daily and seasonal basis. We sequenced the leaf transcriptome of the green-leaved Tigullio (TIG) and the purple-leaved Red Rubin (RR) exposed to full sunlight over a four-week experimental period. We assembled and annotated 111,007 transcripts. A total of 5,468 and 5,969 potential SSRs were identified in TIG and RR, respectively, out of which 66 were polymorphic in silico. Comparative analysis of the two transcriptomes showed 2,372 differentially expressed genes (DEGs) clustered in 222 enriched Gene ontology terms. Green and red basil mostly differed for transcripts abundance of genes involved in secondary metabolism. While the biosynthesis of waxes was up-regulated in red basil, the biosynthesis of flavonols and carotenoids was up-regulated in green basil. Data from our study provides a comprehensive transcriptome survey, gene sequence resources and microsatellites that can be used for further investigations in sweet basil. The analysis of DEGs and their functional classification also offers new insights on the functional role of anthocyanins in photoprotection.

  4. U.S. Army Research Laboratory (ARL) Corporate Dari Document Transcription and Translation Guidelines

    DTIC Science & Technology

    2012-10-01

    text file format. 15. SUBJECT TERMS Transcription, Translation, guidelines, ground truth, Optical character recognition, OCR, Machine Translation, MT... foreign language into a target language in order to train, test, and evaluate optical character recognition (OCR) and machine translation (MT) embedded... graphic element and should not be transcribed. Elements that are not part of the primary text, such as handwritten annotations or stamps, should not be

  5. Part-of-speech tagging for clinical text: wall or bridge between institutions?

    PubMed Central

    Fan, Jung-wei; Prasad, Rashmi; Yabut, Rommel M.; Loomis, Richard M.; Zisook, Daniel S.; Mattison, John E.; Huang, Yang

    2011-01-01

    Part-of-speech (POS) tagging is a fundamental step required by various NLP systems. The training of a POS tagger relies on sufficient quality annotations. However, the annotation process is both knowledge-intensive and time-consuming in the clinical domain. A promising solution appears to be for institutions to share their annotation efforts, and yet there is little research on associated issues. We performed experiments to understand how POS tagging performance would be affected by using a pre-trained tagger versus raw training data across different institutions. We manually annotated a set of clinical notes at Kaiser Permanente Southern California (KPSC) and a set from the University of Pittsburgh Medical Center (UPMC), and trained/tested POS taggers with intra- and inter-institution settings. The cTAKES POS tagger was also included in the comparison to represent a tagger partially trained from the notes of a third institution, Mayo Clinic at Rochester. Intra-institution 5-fold cross-validation estimated an accuracy of 0.953 and 0.945 on the KPSC and UPMC notes, respectively. Trained purely on KPSC notes, the accuracy was 0.897 when tested on UPMC notes. Trained purely on UPMC notes, the accuracy was 0.904 when tested on KPSC notes. Applying the cTAKES tagger pre-trained with Mayo Clinic’s notes, the accuracy was 0.881 on KPSC notes and 0.883 on UPMC notes. After adding UPMC annotations to KPSC training data, the average accuracy on tested KPSC notes increased to 0.965. After adding KPSC annotations to UPMC training data, the average accuracy on tested UPMC notes increased to 0.953. The results indicated: first, the performance of pre-trained POS taggers dropped by about 5% when applied directly across the institutions; second, mixing annotations from another institution following the same guideline increased tagging accuracy by about 1%.
Our findings suggest that institutions can benefit more from sharing raw annotations but less from sharing pre-trained models for the POS tagging task. We believe the study could also provide general insights on cross-institution data sharing for other types of NLP tasks. PMID:22195091
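The raw-annotation-sharing result can be illustrated with a toy experiment. The sketch below uses a most-frequent-tag baseline as a deliberate simplification of the study's taggers; the "institution" corpora, sentences, and tags are all invented.

```python
from collections import Counter, defaultdict

def train_tagger(tagged_sents):
    # Baseline tagger: most frequent tag seen for each word in training.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_sents, default="UNK"):
    # Token-level accuracy; unseen words get a placeholder tag.
    pairs = [(w, t) for s in tagged_sents for w, t in s]
    hits = sum(model.get(w, default) == t for w, t in pairs)
    return hits / len(pairs)

# Toy stand-ins for two institutions' annotated notes.
kpsc = [[("patient", "NN"), ("denies", "VBZ"), ("pain", "NN")]]
upmc = [[("pt", "NN"), ("denies", "VBZ"), ("chest", "NN"), ("pain", "NN")]]

cross = accuracy(train_tagger(kpsc), upmc)         # pre-trained model, applied across institutions
mixed = accuracy(train_tagger(kpsc + upmc), upmc)  # pooled raw annotations
print(cross, mixed)
```

Even in this toy setting, pooling the raw annotations (`kpsc + upmc`) recovers the vocabulary the cross-institution model never saw, mirroring the study's finding that sharing raw annotations helps more than sharing pre-trained models.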

  6. Part-of-speech tagging for clinical text: wall or bridge between institutions?

    PubMed

    Fan, Jung-wei; Prasad, Rashmi; Yabut, Rommel M; Loomis, Richard M; Zisook, Daniel S; Mattison, John E; Huang, Yang

    2011-01-01

    Part-of-speech (POS) tagging is a fundamental step required by various NLP systems. The training of a POS tagger relies on sufficient quality annotations. However, the annotation process is both knowledge-intensive and time-consuming in the clinical domain. A promising solution appears to be for institutions to share their annotation efforts, and yet there is little research on associated issues. We performed experiments to understand how POS tagging performance would be affected by using a pre-trained tagger versus raw training data across different institutions. We manually annotated a set of clinical notes at Kaiser Permanente Southern California (KPSC) and a set from the University of Pittsburgh Medical Center (UPMC), and trained/tested POS taggers with intra- and inter-institution settings. The cTAKES POS tagger was also included in the comparison to represent a tagger partially trained from the notes of a third institution, Mayo Clinic at Rochester. Intra-institution 5-fold cross-validation estimated an accuracy of 0.953 and 0.945 on the KPSC and UPMC notes, respectively. Trained purely on KPSC notes, the accuracy was 0.897 when tested on UPMC notes. Trained purely on UPMC notes, the accuracy was 0.904 when tested on KPSC notes. Applying the cTAKES tagger pre-trained with Mayo Clinic's notes, the accuracy was 0.881 on KPSC notes and 0.883 on UPMC notes. After adding UPMC annotations to KPSC training data, the average accuracy on tested KPSC notes increased to 0.965. After adding KPSC annotations to UPMC training data, the average accuracy on tested UPMC notes increased to 0.953. The results indicated: first, the performance of pre-trained POS taggers dropped by about 5% when applied directly across the institutions; second, mixing annotations from another institution following the same guideline increased tagging accuracy by about 1%.
Our findings suggest that institutions can benefit more from sharing raw annotations but less from sharing pre-trained models for the POS tagging task. We believe the study could also provide general insights on cross-institution data sharing for other types of NLP tasks.

  7. LipidPedia: a comprehensive lipid knowledgebase.

    PubMed

    Kuo, Tien-Chueh; Tseng, Yufeng Jane

    2018-04-10

    Lipids are divided into fatty acyls, glycerolipids, glycerophospholipids, sphingolipids, saccharolipids, sterols, prenol lipids and polyketides. Fatty acyls and glycerolipids are commonly used as energy storage, whereas glycerophospholipids, sphingolipids, sterols and saccharolipids are commonly used as components of cell membranes. Lipids in the fatty acyl, glycerophospholipid, sphingolipid and sterol classes play important roles in signaling. Although more than 36 million lipids can be identified or computationally generated, no single lipid database provides comprehensive information on lipids. Furthermore, the complex systematic or common names of lipids make the discovery of related information challenging. Here, we present LipidPedia, a comprehensive lipid knowledgebase. The content of this database is derived from integrating annotation data with full-text mining of 3,923 lipids and more than 400,000 annotations of associated diseases, pathways, functions, and locations that are essential for interpreting lipid functions and mechanisms from over 1,400,000 scientific publications. Each lipid in LipidPedia also has its own entry containing a text summary curated from the most frequently cited diseases, pathways, genes, locations, functions, lipids and experimental models in the biomedical literature. LipidPedia aims to provide an overall synopsis of lipids to summarize lipid annotations and provide a detailed listing of references for understanding complex lipid functions and mechanisms. LipidPedia is available at http://lipidpedia.cmdm.tw. Contact: yjtseng@csie.ntu.edu.tw. Supplementary data are available at Bioinformatics online.

  8. TrOn: an anatomical ontology for the beetle Tribolium castaneum.

    PubMed

    Dönitz, Jürgen; Grossmann, Daniela; Schild, Inga; Schmitt-Engel, Christian; Bradler, Sven; Prpic, Nikola-Michael; Bucher, Gregor

    2013-01-01

    In a morphological ontology the expert's knowledge is represented in terms that describe morphological structures and how these structures relate to each other. With the assistance of ontologies this expert knowledge is made processable by machines through a formal and standardized representation of terms and their relations to each other. The red flour beetle Tribolium castaneum, a representative of the most species-rich animal taxon on earth (the Coleoptera), is an emerging model organism for development, evolution, physiology, and pest control. In order to foster Tribolium research, we have initiated the Tribolium Ontology (TrOn), which describes the morphology of the red flour beetle. So far, the content of this ontology comprises most external morphological structures as well as some internal ones. All modeled structures are consistently annotated for the developmental stages larva, pupa and adult. In TrOn all terms are grouped into three categories: Generic terms represent morphological structures, which are independent of a developmental stage. In contrast, downstream of such terms are concrete terms which stand for a dissectible structure of a beetle at a specific life stage. Finally, there are mixed terms describing structures that are only found at one developmental stage. These terms combine features of both generic and concrete terms. These annotation principles take into account the changing morphology of the beetle during development and provide generic terms to be used in applications or for cross-linking with other ontologies and data resources. We use the ontology for implementing an intuitive search function at the electronic iBeetle-Base, which stores morphological defects found in a genome-wide RNA interference (RNAi) screen. The ontology is available for download at http://ibeetle-base.uni-goettingen.de.

  9. THE BERBER LANGUAGES. A SELECTED BIBLIOGRAPHY.

    ERIC Educational Resources Information Center

    Applegate, Joseph R.

    Organized into two main sections--books and articles, and serial publications and publishers--this bibliography of the Berber languages represents 758 entries, some of which are annotated. Subdivisions include general linguistics, dialect geography, phonology, morphology, syntax, lexicon, texts and translations, historical and comparative studies,…

  10. Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing

    PubMed Central

    Deleger, Louise; Li, Qi; Kaiser, Megan; Stoutenborough, Laura

    2013-01-01

    Background A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based upon very small pilot sample sizes. In addition, the quality of the crowdsourced biomedical NLP corpora was never exceptional when compared to traditionally-developed gold standards. Previously reported results on a medical named entity annotation task showed a 0.68 F-measure-based agreement between crowdsourced and traditionally-developed corpora. Objective Building upon previous work from the general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain with special emphasis on achieving high agreement between crowdsourced and traditionally-developed corpora. Methods To build the gold standard for evaluating the crowdsourcing workers’ performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd’s work and tested the statistical significance (P<.001, chi-square test) to detect differences between the crowdsourced and traditionally-developed annotations.
    Results The agreement between the crowd’s annotations and the traditionally-generated corpora was high for: (1) annotations (0.87, F-measure for medication names; 0.73, medication types), (2) correction of previous annotations (0.90, medication names; 0.76, medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowd and traditionally-generated corpora. Our results showed a 27.9% improvement over previously reported results on the medication named entity annotation task. Conclusions This study offers three contributions. First, we proved that crowdsourcing is a feasible, inexpensive, fast, and practical approach to collect high-quality annotations for clinical text (when protected health information was excluded). We believe that well-designed user interfaces and a rigorous quality control strategy for entity annotation and linking were critical to the success of this work. Second, as a further contribution to the Internet-based crowdsourcing field, we will publicly release the JavaScript and CrowdFlower Markup Language infrastructure code that is necessary to utilize CrowdFlower’s quality control and crowdsourcing interfaces for named entity annotations. Finally, to spur future research, we will release the CTA annotations that were generated by traditional and crowdsourced approaches. PMID:23548263
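The agreement figures in this record are ordinary span-level precision/recall/F-measure computations between a gold standard and crowd annotations, which can be sketched as follows (the spans and entity types below are invented for illustration):

```python
def prf(gold, predicted):
    # Span-level precision, recall, and F-measure: a span counts as a
    # true positive only if its offsets and type match exactly.
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Annotations as (start, end, type) character spans; values are illustrative.
gold  = {(0, 9, "MedName"), (15, 22, "MedType"), (30, 38, "MedName")}
crowd = {(0, 9, "MedName"), (15, 22, "MedType"), (40, 45, "MedName")}
print(prf(gold, crowd))
```

Exact-match span scoring is the strictest convention; relaxed (overlap-based) matching would credit partially correct spans and typically raises all three numbers.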

  11. A study of active learning methods for named entity recognition in clinical text.

    PubMed

    Chen, Yukun; Lasko, Thomas A; Mei, Qiaozhu; Denny, Joshua C; Xu, Hua

    2015-12-01

    Named entity recognition (NER), a sequential labeling task, is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance, but they often require large numbers of annotated samples, which are expensive to build due to the requirement of domain experts in annotation. Active learning (AL), a sample selection approach integrated with supervised ML, aims to minimize the annotation cost while maximizing the performance of ML-based models. In this study, our goal was to develop and evaluate both existing and new AL methods for a clinical NER task to identify concepts of medical problems, treatments, and lab tests from clinical notes. Using the annotated NER corpus from the 2010 i2b2/VA NLP challenge that contained 349 clinical documents with 20,423 unique sentences, we simulated AL experiments using a number of existing and novel algorithms in three different categories including uncertainty-based, diversity-based, and baseline sampling strategies. These were compared with passive learning, which uses random sampling. Learning curves that plot performance of the NER model against the estimated annotation cost (based on number of sentences or words in the training set) were generated to evaluate the different active learning methods against passive learning, and the area under the learning curve (ALC) score was computed. Based on the learning curves of F-measure vs. number of sentences, uncertainty sampling algorithms outperformed all other methods in ALC. Most diversity-based methods also performed better than random sampling in ALC. To achieve an F-measure of 0.80, the best method based on uncertainty sampling could save 66% of annotations in sentences, as compared to random sampling. For the learning curves of F-measure vs. number of words, uncertainty sampling methods again outperformed all other methods in ALC.
    To achieve 0.80 in F-measure, in comparison to random sampling, the best uncertainty-based method saved 42% of annotations in words. The best diversity-based method, however, reduced annotation effort by only 7%. In the simulated setting, AL methods, particularly uncertainty-sampling-based approaches, seemed to significantly save annotation cost for the clinical NER task. The actual benefit of active learning in clinical NER should be further evaluated in a real-time setting. Copyright © 2015 Elsevier Inc. All rights reserved.
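A least-confidence uncertainty-sampling step, the family of methods that performed best in this study, can be sketched as below. The pool of sentence-level probabilities is invented; in a real loop these scores would come from the NER model, retrained after each annotated batch.

```python
def least_confidence(probs):
    # Uncertainty of a sentence = 1 - probability of its most likely labeling.
    return 1.0 - max(probs)

def select_batch(pool, k):
    # pool: {sentence_id: class-probability list from the current model}.
    # Pick the k sentences the model is least confident about.
    ranked = sorted(pool, key=lambda s: least_confidence(pool[s]), reverse=True)
    return ranked[:k]

# Illustrative posterior probabilities over three labelings per sentence.
pool = {
    "sent1": [0.95, 0.03, 0.02],   # confident -> low annotation value
    "sent2": [0.40, 0.35, 0.25],   # uncertain -> high annotation value
    "sent3": [0.60, 0.30, 0.10],
}
print(select_batch(pool, 2))
```

Diversity-based strategies would replace the `least_confidence` ranking with a measure of how different a candidate sentence is from those already annotated.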

  12. Using Information from the Electronic Health Record to Improve Measurement of Unemployment in Service Members and Veterans with mTBI and Post-Deployment Stress

    PubMed Central

    Dillahunt-Aspillaga, Christina; Finch, Dezon; Massengale, Jill; Kretzmer, Tracy; Luther, Stephen L.; McCart, James A.

    2014-01-01

    Objective The purpose of this pilot study is 1) to develop an annotation schema and a training set of annotated notes to support the future development of a natural language processing (NLP) system to automatically extract employment information, and 2) to determine if information about employment status, goals and work-related challenges reported by service members and Veterans with mild traumatic brain injury (mTBI) and post-deployment stress can be identified in the Electronic Health Record (EHR). Design Retrospective cohort study using data from selected progress notes stored in the EHR. Setting Post-deployment Rehabilitation and Evaluation Program (PREP), an in-patient rehabilitation program for Veterans with TBI at the James A. Haley Veterans' Hospital in Tampa, Florida. Participants Service members and Veterans with TBI who participated in the PREP program (N = 60). Main Outcome Measures Documentation of employment status, goals, and work-related challenges reported by service members and recorded in the EHR. Results Two hundred notes were examined and unique vocational information was found indicating a variety of self-reported employment challenges. Current employment status and future vocational goals along with information about cognitive, physical, and behavioral symptoms that may affect return-to-work were extracted from the EHR. The annotation schema developed for this study provides an excellent tool upon which NLP studies can be developed. Conclusions Information related to employment status and vocational history is stored in text notes in the EHR system. Information stored in text does not lend itself to easy extraction or summarization for research and rehabilitation planning purposes. Development of NLP systems to automatically extract text-based employment information provides data that may improve the understanding and measurement of employment in this important cohort. PMID:25541956

  13. EliXR-TIME: A Temporal Knowledge Representation for Clinical Research Eligibility Criteria.

    PubMed

    Boland, Mary Regina; Tu, Samson W; Carini, Simona; Sim, Ida; Weng, Chunhua

    2012-01-01

    Effective clinical text processing requires accurate extraction and representation of temporal expressions. Multiple temporal information extraction models were developed but a similar need for extracting temporal expressions in eligibility criteria (e.g., for eligibility determination) remains. We identified the temporal knowledge representation requirements of eligibility criteria by reviewing 100 temporal criteria. We developed EliXR-TIME, a frame-based representation designed to support semantic annotation for temporal expressions in eligibility criteria by reusing applicable classes from well-known clinical temporal knowledge representations. We used EliXR-TIME to analyze a training set of 50 new temporal eligibility criteria. We evaluated EliXR-TIME using an additional random sample of 20 eligibility criteria with temporal expressions that have no overlap with the training data, yielding 92.7% (76 / 82) inter-coder agreement on sentence chunking and 72% (72 / 100) agreement on semantic annotation. We conclude that this knowledge representation can facilitate semantic annotation of the temporal expressions in eligibility criteria.
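A frame-based temporal representation of the kind this record describes can be sketched as a small record type. The field names below are illustrative assumptions for the sketch, not the actual EliXR-TIME schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TemporalFrame:
    # Minimal frame for one temporal eligibility criterion.
    # Field names are hypothetical, not taken from EliXR-TIME.
    event: str                     # the clinical event being constrained
    relation: str                  # e.g. "within", "before", "after"
    value: float                   # magnitude of the temporal constraint
    unit: str                      # e.g. "day", "month", "year"
    anchor: Optional[str] = None   # reference point, e.g. "enrollment"

# "Myocardial infarction within 6 months prior to enrollment"
frame = TemporalFrame(event="myocardial infarction", relation="within",
                      value=6, unit="month", anchor="enrollment")
print(frame.event, frame.relation, frame.value, frame.unit, frame.anchor)
```

Annotating a criterion then amounts to chunking the sentence and filling one such frame per temporal expression, which is what the reported inter-coder agreement figures measure.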

  14. Grass buffers for playas in agricultural landscapes: An annotated bibliography

    USGS Publications Warehouse

    Melcher, Cynthia P.; Skagen, Susan K.

    2005-01-01

    References on best management practices (BMPs) for agricultural lands were included because certain BMPs are crucial for informing decisions about buffer design/effectiveness and overall playa ecology. We also included various papers that broaden the span of time over which buffer theories and practices have evolved. An unannotated section lists references that we did not prioritize for annotation and references that may be helpful but were beyond the scope of this document. Finally, we provide notes on conversations we had with scientists, land managers, and other buffer experts whom we consulted, and their contact information. We conclude the bibliography with appendices of the common and scientific names of birds and plants and of the acronyms used in the bibliography. In the annotations, italicized text signifies our own editorial remarks. Readers should also note that much of the work on buffers has been designed using English units of measure rather than metric; in most cases, their results have been converted to metric units for publication, explaining the seemingly odd or irregular buffer widths and other parameters reported.

  15. Semantic Annotation of Complex Text Structures in Problem Reports

    NASA Technical Reports Server (NTRS)

    Malin, Jane T.; Throop, David R.; Fleming, Land D.

    2011-01-01

    Text analysis is important for effective information retrieval from databases where the critical information is embedded in text fields. Aerospace safety depends on effective retrieval of relevant and related problem reports for the purpose of trend analysis. The complex text syntax in problem descriptions has limited statistical text mining of problem reports. The presentation describes an intelligent tagging approach that applies syntactic and then semantic analysis to overcome this problem. The tags identify types of problems and equipment that are embedded in the text descriptions. The power of these tags is illustrated in a faceted searching and browsing interface for problem report trending that combines automatically generated tags with database code fields and temporal information.

  16. Resources for Performance-Based Education.

    ERIC Educational Resources Information Center

    Houston, W. Robert; And Others

    This volume presents annotations of resources on performance-based teacher education. The materials, produced after 1967, include films, slide/tapes, modules, programmed texts, and multimedia kits for training pre- and in-service educational personnel. The materials are indexed according to both competency categories and key words, descriptions,…

  17. Solar Energy Education Bibliography.

    ERIC Educational Resources Information Center

    Center for Renewable Resources, Washington, DC.

    This annotated bibliography lists publications and audiovisual materials devoted to renewable energy sources: sun, wind, water and biomass. A few general texts are included that present concepts fundamental to all energy sources. Materials were selected to be adaptable to classroom, workshops, and training sessions. Also, many do-it-yourself…

  18. The Prosodic Basis of the Tiberian Hebrew System of Accents.

    ERIC Educational Resources Information Center

    Dresher, Bezalel Elan

    1994-01-01

    It is argued that the Tiberian system of accents that annotate the text of the Hebrew Bible has a prosodic basis. Tiberian representation can best be understood by integrating results of phonological, phonetic, and psycholinguistic research on prosodic structure. (93 references) (Author/LB)

  19. SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data.

    PubMed

    Venkatesan, Aravind; Kim, Jee-Hyub; Talo, Francesco; Ide-Smith, Michele; Gobeill, Julien; Carter, Jacob; Batista-Navarro, Riza; Ananiadou, Sophia; Ruch, Patrick; McEntyre, Johanna

    2016-01-01

    The tremendous growth in biological data has resulted in an increase in the number of research papers being published. This presents a great challenge for scientists in searching and assimilating facts described in those papers. In particular, biological databases depend on curators to add highly precise and useful information that is usually extracted by reading research articles. Therefore, there is an urgent need to find ways to improve linking literature to the underlying data, thereby minimising the effort in browsing content and identifying key biological concepts. As part of the development of Europe PMC, we have developed a new platform, SciLite, which integrates text-mined annotations from different sources and overlays those outputs on research articles. The aim is to aid researchers and curators using Europe PMC in finding key concepts more easily and provide links to related resources or tools, bridging the gap between literature and biological data.

  20. SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data

    PubMed Central

    Talo, Francesco; Ide-Smith, Michele; Gobeill, Julien; Carter, Jacob; Batista-Navarro, Riza; Ananiadou, Sophia; Ruch, Patrick; McEntyre, Johanna

    2017-01-01

    The tremendous growth in biological data has resulted in an increase in the number of research papers being published. This presents a great challenge for scientists in searching and assimilating the facts described in those papers. In particular, biological databases depend on curators to add highly precise and useful information that is usually extracted by reading research articles. Therefore, there is an urgent need to find ways of improving the linking of literature to the underlying data, thereby minimising the effort of browsing content and identifying key biological concepts. As part of the development of Europe PMC, we have developed a new platform, SciLite, which integrates text-mined annotations from different sources and overlays those outputs on research articles. The aim is to help researchers and curators using Europe PMC find key concepts more easily and to provide links to related resources or tools, bridging the gap between the literature and biological data. PMID:28948232

  1. Assessing the Representation of Occupation Information in Free-Text Clinical Documents Across Multiple Sources

    PubMed Central

    Lindemann, Elizabeth A.; Chen, Elizabeth S.; Rajamani, Sripriya; Manohar, Nivedha; Wang, Yan; Melton, Genevieve B.

    2017-01-01

    There has been increasing recognition of the key role of social determinants such as occupation in health. Given the relatively poor understanding of occupation information in electronic health records (EHRs), we sought to characterize occupation information within free-text clinical document sources. From six distinct clinical sources, 868 occupation-related sentences were identified for the study corpus. Building on approaches from previous studies, refined annotation guidelines were created using the National Institute for Occupational Safety and Health Occupational Data for Health data model, with elements added to increase granularity. Our corpus generated 2,005 total annotations representing 39 of 41 entity types from the enhanced data model. The highest-frequency entities were: Occupation Description (17.7%); Employment Status – Not Specified (12.5%); Employer Name (11.0%); Subject (9.8%); and Industry Description (6.2%). Our findings support the value of standardizing the entry of occupation information in EHRs to improve data quality for patient care and for secondary uses of this information. PMID:29295142

  2. Information Retrieval and Text Mining Technologies for Chemistry.

    PubMed

    Krallinger, Martin; Rabal, Obdulia; Lourenço, Anália; Oyarzabal, Julen; Valencia, Alfonso

    2017-06-28

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

  3. ChemicalTagger: A tool for semantic text-mining in chemistry.

    PubMed

    Hawizy, Lezan; Jessop, David M; Adams, Nico; Murray-Rust, Peter

    2011-05-16

    The primary method of scientific communication is published scientific articles and theses, which use natural language combined with domain-specific terminology. As such, they contain free-flowing, unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches. We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regexes and English taggers to identify parts of speech. An ANTLR grammar is used to structure this into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names). It is thus possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed on over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.

  4. Extracting BI-RADS Features from Portuguese Clinical Texts.

    PubMed

    Nassif, Houssam; Cunha, Filipe; Moreira, Inês C; Cruz-Correia, Ricardo; Sousa, Eliana; Page, David; Burnside, Elizabeth; Dutra, Inês

    2012-01-01

    In this work we build the first BI-RADS parser for Portuguese free texts, modeled after existing approaches to extract BI-RADS features from English medical records. Our concept finder uses a semantic grammar based on the BI-RADS lexicon and on iteratively transferred expert knowledge. We compare the performance of our algorithm to manual annotation by a specialist in mammography. Our results show that our parser's performance is comparable to the manual method.

  5. Emotionality of Colors: An Implicit Link between Red and Dominance.

    PubMed

    Mentzel, Stijn V; Schücker, Linda; Hagemann, Norbert; Strauss, Bernd

    2017-01-01

    The color red has been shown to alter emotions, physiology, psychology, and behavior. Research has suggested that these alterations could be due to a link between red and perceived dominance. In this study we examined whether the color red is implicitly associated with the concept of dominance. In addition, we similarly hypothesized that blue is implicitly linked to rest. A modified Stroop word evaluation task was used in which 30 participants (23.07 ± 4.42 years) were asked to classify words shown in either red, blue, or gray (control condition) as being either dominance- or rest-related. The responses were recorded and analyzed for latency time and accuracy. The results revealed a significant word type × color interaction effect for both latency times, F(2,56) = 5.09, p = 0.009, ηp² = 0.15, and accuracy, F(1.614,45.193) = 8.57, p = 0.001, ηp² = 0.23. On average, participants showed significantly shorter latency times and made fewer errors when categorizing dominance words shown in red, compared to blue and gray. The measured effects show strong evidence for an implicit red-dominance association and a partial red-rest dissociation. We discuss how this association may affect emotionality, with the presentation of red eliciting a dominant emotional and behavioral response.

  6. USI: a fast and accurate approach for conceptual document annotation.

    PubMed

    Fiorini, Nicolas; Ranwez, Sylvie; Montmain, Jacky; Ranwez, Vincent

    2015-03-14

    Semantic approaches such as concept-based information retrieval rely on a corpus in which resources are indexed by concepts belonging to a domain ontology. To keep such applications up to date, new entities need to be annotated frequently to enrich the corpus. However, this task is time-consuming and requires a high level of expertise in both the domain and the related ontology. Different strategies have thus been proposed to ease this indexing process, each one taking advantage of particular features of the document. In this paper we present USI (User-oriented Semantic Indexer), a fast and intuitive method for indexing tasks. We introduce a solution that suggests a conceptual annotation for new entities based on related, already-indexed documents. Our results, compared to those obtained by previous authors using the MeSH thesaurus and a dataset of biomedical papers, show that the method surpasses text-specific methods in terms of both quality and speed. Evaluations are done using standard metrics and semantic similarity. By relying only on neighbor documents, the User-oriented Semantic Indexer does not need a representative learning set. Yet it provides better results than the other approaches by giving a consistent annotation scored with a global criterion, instead of one score per concept.
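    The neighbour-based suggestion USI relies on can be caricatured as a majority vote over the concept sets of the most similar already-indexed documents. This is a deliberately simplified sketch (USI actually scores candidate annotations with a global semantic-similarity criterion, which is not reproduced here; the input format is assumed):

```python
from collections import Counter

def suggest_annotation(neighbors, k=3):
    """Suggest concepts for a new document from its k most similar
    already-indexed neighbours: keep every concept carried by a
    majority of those neighbours.
    neighbors: list of (similarity, concept set) pairs, sorted
    most-similar first (assumed input format)."""
    votes = Counter()
    for _, concepts in neighbors[:k]:
        votes.update(concepts)            # one vote per neighbour per concept
    majority = (min(k, len(neighbors)) // 2) + 1
    return {c for c, n in votes.items() if n >= majority}
```

For instance, if two of three nearest neighbours carry a given MeSH concept, it is suggested; a concept carried by only one neighbour is not.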

  7. 'King George Island' Brushed

    NASA Technical Reports Server (NTRS)

    2006-01-01

    [figure removed for brevity, see original site] Annotated Version

    This mosaic was made from frames acquired by the microscopic imager on NASA's Mars Exploration Rover Spirit during Spirit's 1,031st Martian day, or sol, on the red planet (Nov. 27, 2006). It shows a rock target called 'King George Island' after the target was brushed by the rover's rock abrasion tool. The mosaic covers approximately 6 centimeters (2.4 inches) across and shows the granular nature of the rock exposure. The grains are typically about 1 millimeter (0.04 inches) wide. Data from the rover's Moessbauer spectrometer provide evidence that the grains have an enhanced amount of the mineral hematite relative to surrounding soils.

  8. Triage by ranking to support the curation of protein interactions

    PubMed Central

    Pasche, Emilie; Gobeill, Julien; Rech de Laval, Valentine; Gleizes, Anne; Michel, Pierre-André; Bairoch, Amos

    2017-01-01

    Today, molecular biology databases are the cornerstone of knowledge sharing for the life and health sciences. The curation and maintenance of these resources are labour intensive. Although text mining is gaining impetus among curators, its integration into curation workflows has not yet been widely adopted. The Swiss Institute of Bioinformatics Text Mining and CALIPHO groups joined forces to design a new curation support system named neXtA5. In this report, we explore the integration of novel triage services to support the curation of two types of biological data: protein–protein interactions (PPIs) and post-translational modifications (PTMs). The recognition of PPIs and PTMs poses a special challenge, as it requires not only the identification of biological entities (proteins or residues) but also that of particular relationships (e.g. binding or position). These relationships cannot be described with onto-terminological descriptors such as the Gene Ontology for molecular functions, which makes the triage task more challenging. Prioritizing papers for these tasks thus requires the development of different approaches. In this report, we propose a new method to prioritize articles containing information specific to PPIs and PTMs. The new resources (RESTful APIs, a semantically annotated MEDLINE library) enrich the neXtA5 platform. We tuned the article prioritization model on a set of 100 proteins previously annotated by the CALIPHO group. The effectiveness of the triage service was tested with a dataset of 200 annotated proteins. We defined two sets of descriptors to support automatic triage: the first set to enrich for papers with PPI data, and the second for PTMs. All occurrences of these descriptors were marked up in MEDLINE and indexed, thus constituting a semantically annotated version of MEDLINE. These annotations were then used to estimate the relevance of a particular article with respect to the chosen annotation type. This relevance score was combined with a local vector-space search engine to generate a ranked list of PMIDs. We also evaluated a query refinement strategy, which adds specific keywords (such as ‘binds’ or ‘interacts’) to the original query. Compared to PubMed, the search effectiveness of the neXtA5 triage service is improved by 190% for the prioritization of papers with PPI information and by 260% for papers with PTM information. Combining advanced retrieval and query refinement strategies with automatically enriched MEDLINE content is effective for improving triage in complex curation tasks such as the curation of PPIs and PTMs. Database URL: http://candy.hesge.ch/nextA5 PMID:29220432
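    The score combination this record describes can be illustrated with a small sketch: a pure-Python TF-IDF cosine ranker whose similarity is blended with a per-article annotation-based relevance score. This is an illustrative reconstruction, not the neXtA5 code; the `alpha` weight, the tokenized inputs and the `annot_score` values are all assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Build TF-IDF vectors for a {pmid: token list} corpus."""
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))                      # document frequency per term
    n = len(docs)
    return {pmid: {t: c * math.log((1 + n) / (1 + df[t]))
                   for t, c in Counter(toks).items()}
            for pmid, toks in docs.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def triage_rank(query, docs, annot_score, alpha=0.5):
    """Rank PMIDs by a blend of vector-space similarity and a
    (precomputed, assumed) annotation-based relevance score."""
    vecs = tfidf(docs)
    q = dict(Counter(query))                      # raw-count query vector
    combined = {p: alpha * cosine(q, v) + (1 - alpha) * annot_score.get(p, 0.0)
                for p, v in vecs.items()}
    return sorted(combined, key=combined.get, reverse=True)
```

With a toy corpus, an article rich in interaction vocabulary and annotation hits rises to the top of the ranked list.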

  9. Development Issues on Linked Data Weblog Enrichment

    NASA Astrophysics Data System (ADS)

    Ruiz-Rube, Iván; Cornejo, Carlos M.; Dodero, Juan Manuel; García, Vicente M.

    In this paper, we describe the issues found during the development of LinkedBlog, a Linked Data extension for WordPress blogs. The extension makes it possible to enrich text-based and video information contained in blog entries with RDF triples that can be stored, managed and exploited by other web-based applications. The issues concern the generality, usability, tracking, depth, security, trustworthiness and performance of the linked-data enrichment process. The presented annotation approach aims at keeping web-based contents independent of the underlying ontological model by providing a loosely coupled RDFa-based approach in the linked-data application. Finally, we detail how the performance of annotations can be improved through a semantic reasoner.

  10. EUCLID: automatic classification of proteins in functional classes by their database annotations.

    PubMed

    Tamames, J; Ouzounis, C; Casari, G; Sander, C; Valencia, A

    1998-01-01

    A tool is described for the automatic classification of sequences into functional classes using their database annotations. The EUCLID system is based on a simple learning procedure from examples provided by human experts. EUCLID is freely available to academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. E-mail: valencia@cnb.uam.es. The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac.uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam.es/EUCLID/Full_Paper
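    Dictionary-driven classification of the kind EUCLID performs can be sketched as keyword voting: each functional class has a dictionary of indicative words, and an annotation is assigned to the class whose dictionary it hits most often. A minimal illustration under assumed dictionaries, not the published implementation:

```python
def classify(annotation, dictionaries):
    """Assign a database-annotation string to the functional class whose
    keyword dictionary it matches most often; None if nothing matches.
    dictionaries: {class_name: set of lower-case keywords} (assumed format)."""
    words = annotation.lower().split()
    scores = {cls: sum(w in keywords for w in words)   # count keyword hits
              for cls, keywords in dictionaries.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

In practice the dictionaries would be learned from expert-labelled examples, as the abstract describes; here they are hand-written for illustration.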

  11. Red blood cells, still vital after all these years: Commentary on Canadian Blood Services' International Symposium 2017.

    PubMed

    Qadri, Syed M; Donkor, David A; Yan, Matthew; Ning, Shuoyan; Branch, Donald R; Seghatchian, Jerard; Sheffield, William P

    2018-04-01

    Canadian Blood Services (CBS), Canada's national blood transfusion service, has for many years sponsored an annual conference, for the education and awareness of interested participants, showcasing the latest evidence-based understanding of both basic science and clinical issues in transfusion medicine and science. The 15th iteration of this symposium took place September 9, 2017 and focused on some of the vital aspects of red blood cells (RBC), in line with the "3Rs" concept, namely the provision of the Right red blood cell (RBC) product to the Right patient at the Right time. Presentations touched upon: the evolution of blood banking in North America; the monocyte monolayer assay as a predictor of post-transfusion hemolysis; hemoglobin-based oxygen carriers; RBC alloimmunization; serological approaches to complex RBC antibody problems; randomized clinical trials related to the age of stored RBC; RBC genotyping; pathophysiology, prevention and treatment of hemolytic disease of the fetus and newborn (HDFN); and testing and timing in perinatal serology. This commentary provides summaries of all speakers' presentations annotated with relevant references. Special thanks are due to all contributors for their praiseworthy approaches in sharing their experiences and knowledge on this interesting scientific/clinical and management theme. Copyright © 2018 Elsevier Ltd. All rights reserved.

  12. Morphological Characters and Transcriptome Profiles Associated with Black Skin and Red Skin in Crimson Snapper (Lutjanus erythropterus)

    PubMed Central

    Zhang, Yan-Ping; Wang, Zhong-Duo; Guo, Yu-Song; Liu, Li; Yu, Juan; Zhang, Shun; Liu, Shao-Jun; Liu, Chu-Wu

    2015-01-01

    In this study, morphological observation and Illumina sequencing were performed on skin from two differently colored regions of the crimson snapper (Lutjanus erythropterus): the black zone and the red zone. Three types of chromatophores (melanophores, iridophores and xanthophores) were organized in the skins. The main differences between the two colorations were in the amount and distribution of the three chromatophores. Comparing the two transcriptomes identified 9,200 unigenes with significantly different expression (ratio change ≥ 2 and q-value ≤ 0.05), of which 5,972 were up-regulated in black skin and 3,228 were up-regulated in red skin. Through functional annotation, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of the differentially transcribed genes, we identified a number of uncharacterized candidate pigment genes as well as conserved genes affecting pigmentation in crimson snapper. The expression patterns of 14 pigment genes were confirmed by quantitative real-time PCR analysis of the two skin colors. Overall, this study provides a global survey of the morphological characters and a transcriptome analysis of the differently colored skins of crimson snapper, and supplies valuable cellular and genetic information to help uncover the mechanism of pigment-pattern formation in snappers. PMID:26569232
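    The selection criterion quoted in this record (ratio change ≥ 2 and q-value ≤ 0.05) amounts to a two-sided fold-change filter over significance-tested genes. A schematic re-implementation of that filter follows; the expression and q-value inputs are invented for illustration, and this is not the authors' pipeline:

```python
def call_degs(expr_black, expr_red, qvals, min_fold=2.0, max_q=0.05):
    """Split genes into black-skin- and red-skin-up-regulated lists using
    the abstract's thresholds (ratio change >= 2, q-value <= 0.05).
    expr_*: {gene: normalized expression}; qvals: {gene: q-value}."""
    up_black, up_red = [], []
    for gene, q in sorted(qvals.items()):
        if q > max_q:
            continue                              # not statistically significant
        b, r = expr_black.get(gene, 0.0), expr_red.get(gene, 0.0)
        if r > 0 and b / r >= min_fold:
            up_black.append(gene)                 # higher in black skin
        elif b > 0 and r / b >= min_fold:
            up_red.append(gene)                   # higher in red skin
    return up_black, up_red
```

Genes that pass the significance cut but change less than two-fold in either direction fall into neither list, mirroring how the 9,200 DEGs split into up- and down-regulated sets.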

  13. Morphological Characters and Transcriptome Profiles Associated with Black Skin and Red Skin in Crimson Snapper (Lutjanus erythropterus).

    PubMed

    Zhang, Yan-Ping; Wang, Zhong-Duo; Guo, Yu-Song; Liu, Li; Yu, Juan; Zhang, Shun; Liu, Shao-Jun; Liu, Chu-Wu

    2015-11-12

    In this study, morphological observation and Illumina sequencing were performed on skin from two differently colored regions of the crimson snapper (Lutjanus erythropterus): the black zone and the red zone. Three types of chromatophores (melanophores, iridophores and xanthophores) were organized in the skins. The main differences between the two colorations were in the amount and distribution of the three chromatophores. Comparing the two transcriptomes identified 9,200 unigenes with significantly different expression (ratio change ≥ 2 and q-value ≤ 0.05), of which 5,972 were up-regulated in black skin and 3,228 were up-regulated in red skin. Through functional annotation, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of the differentially transcribed genes, we identified a number of uncharacterized candidate pigment genes as well as conserved genes affecting pigmentation in crimson snapper. The expression patterns of 14 pigment genes were confirmed by quantitative real-time PCR analysis of the two skin colors. Overall, this study provides a global survey of the morphological characters and a transcriptome analysis of the differently colored skins of crimson snapper, and supplies valuable cellular and genetic information to help uncover the mechanism of pigment-pattern formation in snappers.

  14. Promoting Different Reading Comprehension Levels through Online Annotations

    ERIC Educational Resources Information Center

    Tseng, Sheng-Shiang; Yeh, Hui-Chin; Yang, Shih-hsien

    2015-01-01

    Previous studies have evaluated reading comprehension as the general understanding of reading texts. However, this broad and generic assessment of reading comprehension overlooks the specific aspects and processes that students need to develop. This study adopted Kintsch's Construction-Integration model to tap into reading comprehension at…

  15. In silico exploration of Red Sea Bacillus genomes for natural product biosynthetic gene clusters.

    PubMed

    Othoum, Ghofran; Bougouffa, Salim; Razali, Rozaimi; Bokhari, Ameerah; Alamoudi, Soha; Antunes, André; Gao, Xin; Hoehndorf, Robert; Arold, Stefan T; Gojobori, Takashi; Hirt, Heribert; Mijakovic, Ivan; Bajic, Vladimir B; Lafi, Feras F; Essack, Magbubah

    2018-05-22

    The increasing spectrum of multidrug-resistant bacteria is a major global public health concern, necessitating the discovery of novel antimicrobial agents. Here, members of the genus Bacillus are investigated as a potentially attractive source of novel antibiotics due to their broad spectrum of antimicrobial activities. We focus specifically on a computational analysis of the distinctive biosynthetic potential of Bacillus paralicheniformis strains isolated from the Red Sea, an ecosystem exposed to adverse, highly saline and hot conditions. We report the complete circular and annotated genomes of two Red Sea strains: B. paralicheniformis Bac48, isolated from mangrove mud, and B. paralicheniformis Bac84, isolated from a microbial mat collected from Rabigh Harbor Lagoon in Saudi Arabia. Comparing the genomes of B. paralicheniformis Bac48 and B. paralicheniformis Bac84 with nine publicly available complete genomes of B. licheniformis and three genomes of B. paralicheniformis revealed that all of the B. paralicheniformis strains in this study are more enriched in nonribosomal peptides (NRPs). We further report the first computationally identified trans-acyltransferase (trans-AT) nonribosomal peptide synthetase/polyketide synthase (NRPS/PKS) cluster in strains of this species. B. paralicheniformis species have more genes associated with the biosynthesis of antimicrobial bioactive compounds than previously characterized species of B. licheniformis, which suggests that these species are better potential sources of novel antibiotics. Moreover, the genome of the Red Sea strain B. paralicheniformis Bac48 is more enriched in modular PKS genes than B. licheniformis strains and other B. paralicheniformis strains. This may be linked to adaptations these strains underwent to survive in the relatively hot and saline ecosystems of the Red Sea.

  16. De Novo Assembly and Comparative Transcriptome Analyses of Red and Green Morphs of Sweet Basil Grown in Full Sunlight

    PubMed Central

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Guidi, Lucia; Gori, Antonella; Marzano, Cristina; Landi, Marco; Sebastiani, Federico

    2016-01-01

    Sweet basil (Ocimum basilicum), one of the most popular cultivated herbs worldwide, displays a number of varieties differing in several characteristics, such as leaf color. The development of a reference transcriptome for sweet basil, and the analysis of differentially expressed genes in acyanic and cyanic cultivars exposed to natural sunlight irradiance, is of interest from both horticultural and biological points of view. There is still great uncertainty about the significance of anthocyanins in photoprotection and about how green and red morphs perform when exposed to photo-inhibitory light, a condition plants face on a daily and seasonal basis. We sequenced the leaf transcriptome of the green-leaved Tigullio (TIG) and the purple-leaved Red Rubin (RR) exposed to full sunlight over a four-week experimental period. We assembled and annotated 111,007 transcripts. A total of 5,468 and 5,969 potential SSRs were identified in TIG and RR, respectively, of which 66 were polymorphic in silico. Comparative analysis of the two transcriptomes showed 2,372 differentially expressed genes (DEGs) clustered in 222 enriched Gene Ontology terms. Green and red basil differed mostly in the transcript abundance of genes involved in secondary metabolism. While the biosynthesis of waxes was up-regulated in red basil, the biosynthesis of flavonols and carotenoids was up-regulated in green basil. Data from our study provide a comprehensive transcriptome survey, gene sequence resources and microsatellites that can be used for further investigations in sweet basil. The analysis of DEGs and their functional classification also offers new insights into the functional role of anthocyanins in photoprotection. PMID:27483170

  17. Extracting BI-RADS Features from Portuguese Clinical Texts

    PubMed Central

    Nassif, Houssam; Cunha, Filipe; Moreira, Inês C.; Cruz-Correia, Ricardo; Sousa, Eliana; Page, David; Burnside, Elizabeth; Dutra, Inês

    2013-01-01

    In this work we build the first BI-RADS parser for Portuguese free texts, modeled after existing approaches to extract BI-RADS features from English medical records. Our concept finder uses a semantic grammar based on the BI-RADS lexicon and on iteratively transferred expert knowledge. We compare the performance of our algorithm to manual annotation by a specialist in mammography. Our results show that our parser’s performance is comparable to the manual method. PMID:23797461

  18. Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

    PubMed Central

    Rodríguez-Penagos, Carlos; Salgado, Heladia; Martínez-Flores, Irma; Collado-Vides, Julio

    2007-01-01

    Background Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12. Results Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners. Conclusion Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages. PMID:17683642

  19. ORFer--retrieval of protein sequences and open reading frames from GenBank and storage into relational databases or text files.

    PubMed

    Büssow, Konrad; Hoffmann, Steve; Sievert, Volker

    2002-12-19

    Functional genomics involves parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite for the cloning and recombinant expression of these proteins. A Java program was developed for the retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include the sequence name, the organism and the completeness of the sequence. The program has a graphical user interface, although it can also be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks its correct translation. ORFer accepts user input in the form of single GenBank GI identifiers or accession numbers, or lists of them. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or exported as text files in Fasta or tab-delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer. It allows fast retrieval of DNA sequences, protein sequences and their open reading frames, and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported; such a database can supplement a laboratory information management system (LIMS) with appropriate sequence information.
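    The kind of translation check ORFer performs on an extracted open reading frame can be sketched with the standard genetic code: require an ATG start, an in-frame length and a single terminal stop codon, then return the protein. This is a simplified stand-in for the program's validation step, not its actual Java implementation:

```python
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def translate_orf(cds):
    """Translate a CDS, returning the protein only if the sequence is a
    well-formed ORF: ATG start, in-frame, one stop codon at the very end."""
    cds = cds.upper()
    if len(cds) % 3 or not cds.startswith("ATG"):
        return None
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3])
        if aa is None:
            return None                      # ambiguous or invalid codon
        if aa == "*":
            # the stop codon must be the final codon, with none before it
            return "".join(protein) if i == len(cds) - 3 else None
        protein.append(aa)
    return None                              # no terminal stop codon
```

A sequence with an internal stop codon or a non-ATG start fails the check, which is the sort of inconsistency such a tool would flag before cloning.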

  20. MAPU: Max-Planck Unified database of organellar, cellular, tissue and body fluid proteomes

    PubMed Central

    Zhang, Yanling; Zhang, Yong; Adachi, Jun; Olsen, Jesper V.; Shi, Rong; de Souza, Gustavo; Pasini, Erica; Foster, Leonard J.; Macek, Boris; Zougman, Alexandre; Kumar, Chanchal; Wiśniewski, Jacek R.; Jun, Wang; Mann, Matthias

    2007-01-01

    Mass spectrometry (MS)-based proteomics has become a powerful technology for mapping the protein composition of organelles, cell types and tissues. In our department, a large-scale effort to map these proteomes is complemented by the Max-Planck Unified (MAPU) proteome database. MAPU contains several body fluid proteomes, including plasma, urine and cerebrospinal fluid. Cell lines have been mapped to a depth of several thousand proteins, and the red blood cell proteome has also been analyzed in depth. The liver proteome is represented with 3,200 proteins. By employing high-resolution MS and stringent validation criteria, false-positive identification rates in MAPU are lower than 1:1000. MAPU datasets can thus serve as reference proteomes in biomarker discovery. MAPU contains the peptides identifying each protein, measured masses, scores and intensities, and is freely available via a clickable interface of cell or body parts. Proteome data can be queried across proteomes by protein name, accession number, sequence similarity, peptide sequence and annotation information. More than 4,500 mouse and 2,500 human proteins have already been identified in at least one proteome. Basic annotation information and links to other public databases are provided in MAPU, and we plan to add further analysis tools. PMID:17090601

  1. Effects of Multimedia Vocabulary Annotations on Vocabulary Learning and Text Comprehension in ESP Classrooms

    ERIC Educational Resources Information Center

    Lin, Huifen

    2012-01-01

    For the past few decades, instructional materials enriched with multimedia elements have enjoyed increasing popularity. Multimedia-based instruction incorporating stimulating visuals, authentic audios, and interactive animated graphs of different kinds all provide additional and valuable opportunities for students to learn beyond what conventional…

  2. The TREC Interactive Track: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Over, Paul

    2001-01-01

    Discussion of the study of interactive information retrieval (IR) at the Text Retrieval Conferences (TREC) focuses on summaries of the Interactive Track at each conference. Describes evolution of the track, which has changed from comparing human-machine systems with fully automatic systems to comparing interactive systems that focus on the search…

  3. The Long Road to the NRSV.

    ERIC Educational Resources Information Center

    Bailey, J. Martin

    1990-01-01

    Examines the revision process, begun in 1973, that created the New Revised Standard Version of the Bible, published in 1990. Stresses the importance of accurate reflection of ancient texts in contemporary yet enduring language. Notes the ecumenical nature of the committee responsible for this revision. Provides an annotated list of Bible…

  4. Choosing a President, 1968: The American Political Process.

    ERIC Educational Resources Information Center

    Tufts Univ., Medford, MA. Lincoln Filene Center for Citizenship and Public Affairs.

    The student text portion of this set contains an account of the 1968 presidential election, from the state primaries to the election of Richard M. Nixon. The Teacher's Guide outlines objectives, teaching strategies, discussion questions, vocabulary, and an annotated listing of resources which includes educational and commercial films as well as…

  5. Social Operational Information, Competence, and Participation in Online Collective Action

    ERIC Educational Resources Information Center

    Antin, Judd David

    2010-01-01

    Recent advances in interactive web technologies, combined with widespread broadband and mobile device adoption, have made online collective action commonplace. Millions of individuals work together to aggregate, annotate, and share digital text, audio, images, and video. Given the prevalence and importance of online collective action systems,…

  6. English through Latin and Greek: A Textbook List for Vocabulary Courses (Part 2).

    ERIC Educational Resources Information Center

    Sebesta, Judith Lynn

    1983-01-01

    Nine general books on developing English vocabulary from Latin and Greek and three related books on medical terminology are listed and annotated. The texts are oriented to various groups from fifth grade through professional education levels and are for classroom and self-directed learning. (MSE)

  7. Learn by Yourself: The Self-Learning Tools for Qualitative Analysis Software Packages

    ERIC Educational Resources Information Center

    Freitas, Fábio; Ribeiro, Jaime; Brandão, Catarina; Reis, Luís Paulo; de Souza, Francislê Neri; Costa, António Pedro

    2017-01-01

    Computer Assisted Qualitative Data Analysis Software (CAQDAS) are tools that help researchers to develop qualitative research projects. These software packages help the users with tasks such as transcription analysis, coding and text interpretation, writing and annotation, content search and analysis, recursive abstraction, grounded theory…

  8. Linking chloroplast relocation to different responses of photosynthesis to blue and red radiation in low and high light-acclimated leaves of Arabidopsis thaliana (L.).

    PubMed

    Pfündel, Erhard E; Latouche, Gwendal; Meister, Armin; Cerovic, Zoran G

    2018-01-27

    Low light (LL)- and high light (HL)-acclimated plants of A. thaliana were exposed to blue (BB) or red (RR) light, or to a mixture of blue and red light (BR), at incrementally increasing intensities. The light response of photosystem II was measured by pulse amplitude-modulated chlorophyll fluorescence, and that of photosystem I by near-infrared difference spectroscopy. The LL but not the HL leaves exhibited blue light-specific responses, which were assigned to relocation of chloroplasts from the dark to the light-avoidance arrangement. Blue light (BB and BR) decreased the minimum fluorescence ([Formula: see text]) more than RR light. This extra reduction of the [Formula: see text] was stronger than theoretically predicted for [Formula: see text] quenching by energy dissipation, whereas measurement and theory agreed in the RR treatments. The extra [Formula: see text] reduction was assigned to decreased light absorption by chloroplasts in the avoidance position; a maximum reduction of 30% was calculated. Increasing intensities of blue light affected the fluorescence parameters NPQ and qP to a lesser degree than red light. After correcting for the optical effects of chloroplast relocation, NPQ responded similarly to blue and red light. The same correction diminished the color-specific variations in qP but did not abolish them, strongly indicating the presence of another blue light effect which also moderates excitation pressure in PSII but cannot be ascribed to absorption variations. Only after RR exposure did a post-illumination overshoot of [Formula: see text] and fast oxidation of PSI electron acceptors occur, suggesting an electron flow from stromal reductants to the plastoquinone pool.

  9. MapReduce in the Cloud: A Use Case Study for Efficient Co-Occurrence Processing of MEDLINE Annotations with MeSH.

    PubMed

    Kreuzthaler, Markus; Miñarro-Giménez, Jose Antonio; Schulz, Stefan

    2016-01-01

    Big data resources are difficult to process without a scaled hardware environment that is specifically adapted to the problem. The emergence of flexible cloud-based virtualization techniques promises solutions to this problem. This paper demonstrates how a billion lines can be processed in a reasonable amount of time in a cloud-based environment. Our use case addresses the accumulation of concept co-occurrence data in MEDLINE annotations as a series of MapReduce jobs, which can be scaled and executed in the cloud. Besides demonstrating an efficient way of solving this problem, we generated an additional resource for the scientific community to use in advanced text mining approaches.
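    The co-occurrence accumulation described above can be sketched as a tiny in-memory map/reduce pair: the map step emits one key per unordered pair of MeSH terms annotating the same citation, and the reduce step sums the emitted counts. This is an illustrative sketch only; the function names and toy data are invented, not taken from the paper's MEDLINE pipeline.

    ```python
    from collections import Counter
    from itertools import combinations

    def map_phase(citations):
        """Emit one ((term_a, term_b), 1) record per unordered pair of
        MeSH terms annotating the same citation."""
        for terms in citations:
            for pair in combinations(sorted(set(terms)), 2):
                yield pair, 1

    def reduce_phase(records):
        """Sum the counts emitted for each pair key."""
        counts = Counter()
        for pair, n in records:
            counts[pair] += n
        return counts

    # Toy stand-in for MEDLINE citations, each with its MeSH annotations
    citations = [
        ["Humans", "Neoplasms", "Mutation"],
        ["Humans", "Neoplasms"],
        ["Mutation", "Neoplasms"],
    ]
    cooc = reduce_phase(map_phase(citations))
    print(cooc[("Humans", "Neoplasms")])  # 2
    ```

    In a real MapReduce deployment the map and reduce functions would run as distributed jobs over the annotation files, but the key/value contract is the same.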

  10. GOGrapher: A Python library for GO graph representation and analysis.

    PubMed

    Muller, Brian; Richards, Adam J; Jin, Bo; Lu, Xinghua

    2009-07-07

    The Gene Ontology is the most commonly used controlled vocabulary for annotating proteins. The concepts in the ontology are organized as a directed acyclic graph, in which a node corresponds to a biological concept and a directed edge denotes the parent-child semantic relationship between a pair of terms. A large number of protein annotations further create links between proteins and their functional annotations, reflecting the contemporary knowledge about proteins and their functional relationships. This leads to a complex graph consisting of interleaved biological concepts and their associated proteins. What is needed is a simple, open source library that provides tools not only to create and view the Gene Ontology graph, but to analyze and manipulate it as well. Here we describe the development and use of GOGrapher, a Python library that can be used for the creation, analysis, manipulation, and visualization of Gene Ontology related graphs. An object-oriented approach was adopted to organize the hierarchy of graph types and associated classes. An Application Programming Interface is provided through which different types of graphs can be programmatically created, manipulated, and visualized. GOGrapher has been successfully utilized in multiple research projects, e.g., a graph-based multi-label text classifier for protein annotation. The GOGrapher project provides a reusable programming library designed for the manipulation and analysis of Gene Ontology graphs. The library is freely available for the scientific community to use and improve.
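    The graph structure the abstract describes can be sketched in a few lines, independently of GOGrapher's actual API: terms form a directed acyclic graph via child-to-parent edges, and by the true-path rule a protein annotated to a term implicitly carries every ancestor of that term. The GO identifiers and protein accession below are invented placeholders.

    ```python
    # Toy fragment of a GO-like DAG: child -> list of parents
    parents = {
        "GO:b": ["GO:root"],
        "GO:c": ["GO:root"],
        "GO:d": ["GO:b", "GO:c"],   # a term may have several parents
    }

    def ancestors(term):
        """All terms reachable by following parent edges upward."""
        seen = set()
        stack = [term]
        while stack:
            for p in parents.get(stack.pop(), []):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    # A protein annotated to one term carries that term plus its ancestors
    annotations = {"P12345": ["GO:d"]}
    implied = set()
    for t in annotations["P12345"]:
        implied |= {t} | ancestors(t)
    print(sorted(implied))  # ['GO:b', 'GO:c', 'GO:d', 'GO:root']
    ```

    A library like GOGrapher layers typed graph classes and visualization on top of exactly this kind of traversal.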

  11. GoGene: gene annotation in the fast lane.

    PubMed

    Plake, Conrad; Royer, Loic; Winnenburg, Rainer; Hakenberg, Jörg; Schroeder, Michael

    2009-07-01

    High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched Gene Ontology terms. The strength of such analyses is that they build on the high-quality manual annotations provided with the Gene Ontology. The weakness, however, is that these annotations are restricted to process, function and location and do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining that extracts co-occurrences of genes and ontology terms from the literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms, extracted from more than 18,000,000 PubMed entries. It covers not only the process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input, and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene.

  12. Information extraction from Italian medical reports: An ontology-driven approach.

    PubMed

    Viani, Natalia; Larizza, Cristiana; Tibollo, Valentina; Napolitano, Carlo; Priori, Silvia G; Bellazzi, Riccardo; Sacchi, Lucia

    2018-03-01

    In this work, we propose an ontology-driven approach to identify events and their attributes from episodes of care included in medical reports written in Italian. For this language, shared resources for clinical information extraction are not easily accessible. The corpus considered in this work includes 5432 non-annotated medical reports belonging to patients with rare arrhythmias. To guide the information extraction process, we built a domain-specific ontology that includes the events and the attributes to be extracted, with related regular expressions. The ontology and the annotation system were constructed on a development set, while the performance was evaluated on an independent test set. As a gold standard, we considered a manually curated hospital database named TRIAD, which stores most of the information written in reports. The proposed approach performs well on the considered Italian medical corpus, with a percentage of correct annotations above 90% for most considered clinical events. We also assessed the possibility to adapt the system to the analysis of another language (i.e., English), with promising results. Our annotation system relies on a domain ontology to extract and link information in clinical text. We developed an ontology that can be easily enriched and translated, and the system performs well on the considered task. In the future, it could be successfully used to automatically populate the TRIAD database.
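    The design the abstract describes, in which each clinical event in the ontology carries the regular expressions that recognize it, can be sketched as data-driven matching: adding or translating an event means editing the ontology table, not the code. The event names and patterns below are invented for illustration and are not taken from the TRIAD system.

    ```python
    import re

    # Hypothetical mini-ontology: event name -> recognizing patterns
    ONTOLOGY = {
        "syncope":  [r"\bsyncope\b", r"\bfaint(ed|ing)?\b"],
        "ecg_exam": [r"\bECG\b", r"\belectrocardiogram\b"],
    }

    def extract_events(text):
        """Return the set of ontology events whose patterns match the text."""
        found = set()
        for event, patterns in ONTOLOGY.items():
            if any(re.search(p, text, re.IGNORECASE) for p in patterns):
                found.add(event)
        return found

    report = "Patient fainted twice; an ECG was performed on admission."
    print(sorted(extract_events(report)))  # ['ecg_exam', 'syncope']
    ```

    Porting such a system to another language amounts to swapping the pattern lists while keeping the event inventory, which is consistent with the adaptation experiment the authors report.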

  13. WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Putman, Tim E.; Lelong, Sebastien; Burgstaller-Muehlbacher, Sebastian

    With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes, which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomic data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction.

  14. WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata

    DOE PAGES

    Putman, Tim E.; Lelong, Sebastien; Burgstaller-Muehlbacher, Sebastian; ...

    2017-03-06

    With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes, which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomic data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction.

  15. ChemicalTagger: A tool for semantic text-mining in chemistry

    PubMed Central

    2011-01-01

    Background The primary method for scientific communication is in the form of published scientific articles and theses, which use natural language combined with domain-specific terminology. As such, they contain free-flowing unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches. Results We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regex and English taggers to identify parts-of-speech. The ANTLR grammar is used to structure this into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names). Conclusions It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed for over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision. PMID:21575201

  16. Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach.

    PubMed

    Rinaldi, Fabio; Schneider, Gerold; Kaljurand, Kaarel; Hess, Michael; Andronis, Christos; Konstandi, Ourania; Persidis, Andreas

    2007-02-01

    The number of new discoveries (as published in the scientific literature) in the biomedical area is growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, and the extraction of the core information thus becomes very expensive. Therefore, there is a growing interest in text processing approaches that can deliver selected information from scientific publications and limit the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the process of extracting functional relations (e.g. interactions between genes and proteins) from scientific literature in the biomedical domain. The approach, using a novel dependency-based parser, is based on a complete syntactic analysis of the corpus. We have implemented a state-of-the-art text mining system for biomedical literature, based on a deep-linguistic, full-parsing approach. The results are validated on two different corpora: the manually annotated genomics information access (GENIA) corpus and the automatically annotated Arabidopsis thaliana circadian rhythms (ATCR) corpus. We show how a deep-linguistic approach (contrary to common belief) can be used in a real-world text mining application, offering high-precision relation extraction while retaining sufficient recall.

  17. Using text mining to link journal articles to neuroanatomical databases

    PubMed Central

    French, Leon; Pavlidis, Paul

    2013-01-01

    The electronic linking of neuroscience information, including data embedded in the primary literature, would permit powerful queries and analyses driven by structured databases. This task would be facilitated by automated procedures which can identify biological concepts in journals. Here we apply an approach for automatically mapping formal identifiers of neuroanatomical regions to text found in journal abstracts, and apply it to a large body of abstracts from the Journal of Comparative Neurology (JCN). The analyses yield over one hundred thousand brain region mentions which we map to 8,225 brain region concepts in multiple organisms. Based on the analysis of a manually annotated corpus, we estimate mentions are mapped at 95% precision and 63% recall. Our results provide insights into the patterns of publication on brain regions and species of study in the Journal, but also point to important challenges in the standardization of neuroanatomical nomenclatures. We find that many terms in the formal terminologies never appear in a JCN abstract, while conversely, many terms authors use are not reflected in the terminologies. To improve the terminologies we deposited 136 unrecognized brain regions into the Neuroscience Lexicon (NeuroLex). The training data, terminologies, normalizations, evaluations and annotated journal abstracts are freely available at http://www.chibi.ubc.ca/WhiteText/. PMID:22120205

  18. Video-to-film color-image recorder.

    NASA Technical Reports Server (NTRS)

    Montuori, J. S.; Carnes, W. R.; Shim, I. H.

    1973-01-01

    A precision video-to-film recorder for use in image data processing systems, being developed for NASA, will convert three video input signals (red, blue, green) into a single full-color light beam for image recording on color film. Argon ion and krypton lasers are used to produce three spectral lines which are independently modulated by the appropriate video signals, combined into a single full-color light beam, and swept over the recording film in a raster format for image recording. A rotating multi-faceted spinner mounted on a translating carriage generates the raster, and an annotation head is used to record up to 512 alphanumeric characters in a designated area outside the image area.

  19. Assessing the role of a medication-indication resource in the treatment relation extraction from clinical text

    PubMed Central

    Bejan, Cosmin Adrian; Wei, Wei-Qi; Denny, Joshua C

    2015-01-01

    Objective To evaluate the contribution of the MEDication Indication (MEDI) resource and SemRep for identifying treatment relations in clinical text. Materials and methods We first processed clinical documents with SemRep to extract the Unified Medical Language System (UMLS) concepts and the treatment relations between them. Then, we incorporated MEDI into a simple algorithm that identifies treatment relations between two concepts if they match a medication-indication pair in this resource. For better coverage, we expanded MEDI using ontology relationships from RxNorm and the UMLS Metathesaurus. We also developed two ensemble methods, which combined the predictions of SemRep and the MEDI algorithm. We evaluated our selected methods on two datasets, a Vanderbilt corpus of 6864 discharge summaries and the 2010 Informatics for Integrating Biology and the Bedside (i2b2)/Veterans Affairs (VA) challenge dataset. Results The Vanderbilt dataset included 958 manually annotated treatment relations. A double annotation was performed on 25% of relations with high agreement (Cohen's κ = 0.86). The evaluation consisted of comparing the manually annotated relations with the relations identified by SemRep, the MEDI algorithm, and the two ensemble methods. On the first dataset, the best F1-measure results achieved by the MEDI algorithm and the union of the two resources (78.7 and 80, respectively) were significantly higher than the SemRep results (72.3). On the second dataset, the MEDI algorithm achieved better precision and significantly lower recall values than the best system in the i2b2 challenge. The two systems obtained comparable F1-measure values on the subset of i2b2 relations with both arguments in MEDI. Conclusions Both SemRep and MEDI can be used to extract treatment relations from clinical text. Knowledge-based extraction with MEDI outperformed use of SemRep alone, but superior performance was achieved by integrating both systems. 
The integration of knowledge-based resources such as MEDI into information extraction systems such as SemRep and the i2b2 relation extractors may improve treatment relation extraction from clinical text. PMID:25336593
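    The "simple algorithm" the abstract describes, asserting a treatment relation whenever a drug/disease concept pair appears in the medication-indication resource, reduces to a set lookup. The concept identifiers and pairs below are invented placeholders, not actual MEDI content.

    ```python
    # Hypothetical stand-in for MEDI: (medication CUI, indication CUI) pairs
    MEDI_PAIRS = {
        ("C_metformin", "C_type2_diabetes"),
        ("C_lisinopril", "C_hypertension"),
    }

    def treats(drug_cui, disease_cui):
        """Assert a treatment relation iff the pair is in the resource."""
        return (drug_cui, disease_cui) in MEDI_PAIRS

    print(treats("C_metformin", "C_type2_diabetes"))  # True
    print(treats("C_metformin", "C_hypertension"))    # False
    ```

    Expanding the resource with RxNorm/UMLS ontology relationships, as the authors did, would simply grow this pair set; the ensemble methods then combine such lookups with SemRep's linguistic predictions.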

  20. Secondary LD Mainstreaming Methods: Instructional Module. (An Instructional Module for Preservice or Inservice Training of Regular Secondary Educators).

    ERIC Educational Resources Information Center

    Reetz, Linda J.; Hoover, John H.

    Intended for use in preservice or inservice training of regular secondary educators, the module examines principles of communication, assessment, teaching methods, and classroom management through text, an annotated bibliography, and overhead masters. The first section covers communicating with handicapped students, their parents, and other…

  1. BC4GO: a full-text corpus for the BioCreative IV GO Task

    USDA-ARS?s Scientific Manuscript database

    Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database (MOD) groups. Due to its manual nature, this task is time-consuming and labor-intensive, and thus considered one of the bottlenecks in literature curation. There have been many previous attempts a...

  2. Selected Bibliography of Educational Materials: Algeria, Libya, Morocco, Tunisia. Vol. 1 No. 3 1967.

    ERIC Educational Resources Information Center

    Azzouz, Azzedine; And Others

    This annotated compilation of 108 educational publications in Arabic, French, and English includes texts and data on elementary and secondary school curricula in the Maghreb countries (Tunisia, Morroco, Algeria, and Libya). Categories in this bibliography include--(1) Philosophy and Theory of Education, (2) Administration of the Educational…

  3. A Path to Formative Assessment through Naturalistic Inputs

    ERIC Educational Resources Information Center

    Cohen, Jonathan; Leroux, Audrey

    2017-01-01

    This paper reports on the development of a system in which naturalistic inputs are collected by a web-based e-reader and, in combination with a measurement of readers' comprehension of that text, are analyzed by a neural network to determine the nature of the relationship between the annotations and comprehension. Results showed that neural…

  4. At Issue: Helicopter Parents and Millennial Students, an Annotated Bibliography

    ERIC Educational Resources Information Center

    Pricer, Wayne F.

    2008-01-01

    Technological advances have made it easy for parents and children--many of them students--to communicate instantaneously. Devices and technologies such as cell phones, laptops, texting, and e-mail all enable various forms of instant communication. "Helicopter parents" are regarded as very overprotective and overly involved in the affairs of their…

  5. Textbooks in German 1942-1973: A Descriptive Bibliography.

    ERIC Educational Resources Information Center

    Buck, Kathryn, Comp.; Haase, Arthur, Comp.

    This annotated bibliography lists German textbooks published in the United States for use in junior high, high school, and college by students whose first language is English. Intended for those interested in researching old and writing new texts, as well as for those searching for an extant volume for classroom use, the bibliography consists of…

  6. A Pointing Out and Naming Paradigm to Support Radiological Teaching and Case-Oriented Learning.

    ERIC Educational Resources Information Center

    Van Cleynenbreugel, J.; And Others

    1994-01-01

    The use of computer programs for authoring and presenting case materials in professional instruction in radiology is discussed. A workstation-based multimedia program for presenting and annotating images accompanied by both voice and text is described. Comments are also included on validity results and student response. (MSE)

  7. Using Standardized Lexicons for Report Template Validation with LexMap, a Web-based Application.

    PubMed

    Hostetter, Jason; Wang, Kenneth; Siegel, Eliot; Durack, Jeremy; Morrison, James J

    2015-06-01

    An enormous amount of data exists in unstructured diagnostic and interventional radiology reports. Free text or non-standardized terminologies limit the ability to parse, extract, and analyze these report data elements. Medical lexicons and ontologies contain standardized terms for relevant concepts including disease entities, radiographic technique, and findings. The use of standardized terms offers the potential to improve reporting consistency and facilitate computer analysis. The purpose of this project was to implement an interface to aid in the creation of standards-compliant reporting templates for use in interventional radiology. Non-standardized procedure report text was analyzed and referenced to RadLex, SNOMED-CT, and LOINC. Using JavaScript, a web application was developed which determined whether exact terms or synonyms in reports existed within these three reference resources. The NCBO BioPortal Annotator web service was used to map terms, and output from this application was used to create an interactive annotated version of the original report. The application was successfully used to analyze and modify five distinct reports for the Society of Interventional Radiology's standardized reporting project.

  8. Semi-Automated Annotation of Biobank Data Using Standard Medical Terminologies in a Graph Database.

    PubMed

    Hofer, Philipp; Neururer, Sabrina; Goebel, Georg

    2016-01-01

    Data describing biobank resources frequently contains unstructured free-text information or insufficient coding standards. (Bio-) medical ontologies like Orphanet Rare Diseases Ontology (ORDO) or the Human Disease Ontology (DOID) provide a high number of concepts, synonyms and entity relationship properties. Such standard terminologies increase quality and granularity of input data by adding comprehensive semantic background knowledge from validated entity relationships. Moreover, cross-references between terminology concepts facilitate data integration across databases using different coding standards. In order to encourage the use of standard terminologies, our aim is to identify and link relevant concepts with free-text diagnosis inputs within a biobank registry. Relevant concepts are selected automatically by lexical matching and SPARQL queries against a RDF triplestore. To ensure correctness of annotations, proposed concepts have to be confirmed by medical data administration experts before they are entered into the registry database. Relevant (bio-) medical terminologies describing diseases and phenotypes were identified and stored in a graph database which was tied to a local biobank registry. Concept recommendations during data input trigger a structured description of medical data and facilitate data linkage between heterogeneous systems.

  9. STOP using just GO: a multi-ontology hypothesis generation tool for high throughput experimentation

    PubMed Central

    2013-01-01

    Background Gene Ontology (GO) enrichment analysis remains one of the most common methods for hypothesis generation from high-throughput datasets. However, we believe that researchers strive to test other hypotheses that fall outside of GO. Here, we developed and evaluated a tool for hypothesis generation from gene or protein lists using ontological concepts present in manually curated text that describes those genes and proteins. Results As a consequence, we have developed the method Statistical Tracking of Ontological Phrases (STOP), which expands the realm of testable hypotheses in gene set enrichment analyses by integrating automated annotations of genes to terms from over 200 biomedical ontologies. While not as precise as manually curated terms, we find that the additional enriched concepts have value when coupled with traditional enrichment analyses using curated terms. Conclusion Multiple ontologies have been developed for gene and protein annotation. By using a dataset of both manually curated GO terms and automatically recognized concepts from curated text, we can expand the realm of hypotheses that can be discovered. The web application STOP is available at http://mooneygroup.org/stop/. PMID:23409969
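    Enrichment analyses of the kind discussed here, whether over GO terms or over the broader ontology concepts STOP adds, typically rest on the hypergeometric upper tail: the probability of seeing at least k annotated genes in a hit list of size n drawn from a background of N genes of which K carry the term. A minimal sketch using only the standard library (function name and toy numbers are illustrative, not from the paper):

    ```python
    from math import comb

    def hypergeom_enrichment_p(N, K, n, k):
        """P(X >= k) for X ~ Hypergeometric(N, K, n): the chance that a
        random n-gene list contains at least k of the K annotated genes."""
        total = comb(N, n)
        return sum(
            comb(K, i) * comb(N - K, n - i)
            for i in range(k, min(K, n) + 1)
        ) / total

    # e.g. 3 of 10 hits annotated; term covers 50 of 1000 background genes
    p = hypergeom_enrichment_p(N=1000, K=50, n=10, k=3)
    print(f"{p:.4f}")
    ```

    Tools differ mainly in which term vocabulary they test and how they correct for multiple testing, not in this core statistic.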

  10. LMSD: LIPID MAPS structure database

    PubMed Central

    Sud, Manish; Fahy, Eoin; Cotter, Dawn; Brown, Alex; Dennis, Edward A.; Glass, Christopher K.; Merrill, Alfred H.; Murphy, Robert C.; Raetz, Christian R. H.; Russell, David W.; Subramaniam, Shankar

    2007-01-01

    The LIPID MAPS Structure Database (LMSD) is a relational database encompassing structures and annotations of biologically relevant lipids. Structures of lipids in the database come from four sources: (i) LIPID MAPS Consortium's core laboratories and partners; (ii) lipids identified by LIPID MAPS experiments; (iii) computationally generated structures for appropriate lipid classes; (iv) biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public sources. All the lipid structures in LMSD are drawn in a consistent fashion. In addition to a classification-based retrieval of lipids, users can search LMSD using either text-based or structure-based search options. The text-based search implementation supports data retrieval by any combination of the following data fields: LIPID MAPS ID, systematic or common name, mass, formula, category, main class, and subclass. The structure-based search, in conjunction with optional data fields, provides the capability to perform a substructure search or exact match for the structure drawn by the user. Search results, in addition to structure and annotations, also include relevant links to external databases. LMSD is publicly available. PMID:17098933

  11. Review of footnotes and annotations to the 1949–2013 tables of standard atomic weights and tables of isotopic compositions of the elements (IUPAC Technical Report)

    DOE PAGES

    Coplen, Tyler B.; Holden, Norman E.

    2016-01-01

    The Commission on Isotopic Abundances and Atomic Weights uses annotations, given in footnotes that are an integral part of the Tables of Standard Atomic Weights, to alert users to the possibility of quite extraordinary occurrences, as well as to sources with abnormal atomic-weight values outside an otherwise acceptable range. The basic need for footnotes to the Standard Atomic Weights Table, and for equivalent annotations to the Table of Isotopic Compositions of the Elements, arises from the necessity to provide users with information that is relevant to one or more elements but that cannot be conveyed using numerical data in columns. Any desire to increase the additional information conveyed by annotations to these Tables is tempered by the need to preserve a compact format and a style that can alert users who would not be inclined to consult either the last full element-by-element review or the full text of a current Standard Atomic Weights of the Elements report. Since 1989, the footnotes of the Tables of Standard Atomic Weights and the annotations in column 5 of the Table of Isotopic Compositions of the Elements have been harmonized by the use of three lowercase footnotes, “g”, “m”, and “r”, which signify geologically exceptional specimens (“g”), modified isotopic compositions in material subjected to undisclosed or inadvertent isotopic fractionation (“m”), and a range in isotopic composition of normal terrestrial material that prevents a more precise atomic-weight value from being given (“r”). As some elements have been assigned intervals for their standard atomic-weight values (12 elements since 2009), footnotes “g” and “r” are no longer needed for these elements.

  12. Active learning: a step towards automating medical concept extraction.

    PubMed

    Kholghi, Mahnoosh; Sitbon, Laurianne; Zuccon, Guido; Nguyen, Anthony

    2016-03-01

    This paper presents an automatic, active learning-based system for the extraction of medical concepts from clinical free-text reports. Specifically, (1) the contribution of active learning in reducing the annotation effort and (2) the robustness of an incremental active learning framework across different selection criteria and data sets are determined. The comparative performance of an active learning framework and a fully supervised approach was investigated to study how active learning reduces the annotation effort while achieving the same effectiveness as a supervised approach. Conditional random fields were used as the supervised method, with least confidence and information density as the two selection criteria for the active learning framework. The effect of incremental vs standard learning on the robustness of the models within the active learning framework with different selection criteria was also investigated. The following two clinical data sets were used for evaluation: the Informatics for Integrating Biology and the Bedside/Veteran Affairs (i2b2/VA) 2010 natural language processing challenge and the Shared Annotated Resources/Conference and Labs of the Evaluation Forum (ShARe/CLEF) 2013 eHealth Evaluation Lab. The annotation effort saved by active learning to achieve the same effectiveness as supervised learning is up to 77%, 57%, and 46% of the total number of sequences, tokens, and concepts, respectively. Compared with the random sampling baseline, the saving is at least doubled. Incremental active learning is a promising approach for building effective and robust medical concept extraction models while significantly reducing the burden of manual annotation. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
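The least-confidence criterion used in this record can be sketched in a few lines of numpy. This is a generic illustration (the function name and batch interface are assumptions, not the authors' implementation): the learner queries the unlabelled items whose top predicted label has the lowest probability.

```python
import numpy as np

def least_confidence_batch(probs, batch_size):
    """Indices of the unlabelled items the model is least confident about.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    """
    confidence = probs.max(axis=1)              # probability of the top label
    return np.argsort(confidence)[:batch_size]  # least confident first

# Toy posteriors for four unlabelled samples over two classes.
probs = np.array([
    [0.90, 0.10],   # confident
    [0.55, 0.45],   # uncertain
    [0.80, 0.20],
    [0.50, 0.50],   # most uncertain
])
print(least_confidence_batch(probs, 2))  # -> [3 1]
```

In an active learning loop these indices would be sent to the annotator, labelled, added to the training set, and the model retrained (or, in the incremental variant studied in the paper, updated on the new batch only).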

  13. A study of the effectiveness of machine learning methods for classification of clinical interview fragments into a large number of categories.

    PubMed

    Hasan, Mehedi; Kotov, Alexander; Carcone, April; Dong, Ming; Naar, Sylvie; Hartlieb, Kathryn Brogan

    2016-08-01

    This study examines the effectiveness of state-of-the-art supervised machine learning methods in conjunction with different feature types for the task of automatic annotation of fragments of clinical text based on codebooks with a large number of categories. We used a collection of motivational interview transcripts consisting of 11,353 utterances, which were manually annotated by two human coders as the gold standard, and experimented with state-of-the-art classifiers, including Naïve Bayes, J48 Decision Tree, Support Vector Machine (SVM), Random Forest (RF), AdaBoost, DiscLDA, Conditional Random Fields (CRF) and Convolutional Neural Network (CNN) in conjunction with lexical, contextual (label of the previous utterance) and semantic (distribution of words in the utterance across the Linguistic Inquiry and Word Count dictionaries) features. We found that, when the number of classes is large, the performance of CNN and CRF is inferior to SVM. When only lexical features were used, interview transcripts were automatically annotated by SVM with the highest classification accuracy among all classifiers: 70.8%, 61% and 53.7% based on the codebooks consisting of 17, 20 and 41 codes, respectively. Using contextual and semantic features, as well as their combination, in addition to lexical ones, improved the accuracy of SVM for annotation of utterances in motivational interview transcripts with a codebook consisting of 17 classes to 71.5%, 74.2%, and 75.1%, respectively. Our results demonstrate the potential of using machine learning methods in conjunction with lexical, semantic and contextual features for automatic annotation of clinical interview transcripts with near-human accuracy. Copyright © 2016 Elsevier Inc. All rights reserved.

  14. Review of footnotes and annotations to the 1949–2013 tables of standard atomic weights and tables of isotopic compositions of the elements (IUPAC Technical Report)

    USGS Publications Warehouse

    Coplen, Tyler B.; Holden, Norman E.

    2016-01-01

    The Commission on Isotopic Abundances and Atomic Weights uses annotations given in footnotes that are an integral part of the Tables of Standard Atomic Weights to alert users to the possibilities of quite extraordinary occurrences, as well as sources with abnormal atomic-weight values outside an otherwise acceptable range. The basic need for footnotes to the Standard Atomic Weights Table and equivalent annotations to the Table of Isotopic Compositions of the Elements arises from the necessity to provide users with information that is relevant to one or more elements, but that cannot be provided using numerical data in columns. Any desire to increase the additional information conveyed by annotations to these Tables is tempered by the need to preserve a compact format and a style that can alert users, who would not be inclined to consult either the last full element-by-element review or the full text of a current Standard Atomic Weights of the Elements report. Since 1989, the footnotes of the Tables of Standard Atomic Weights and the annotations in column 5 of the Table of Isotopic Compositions of the Elements have been harmonized by use of three lowercase footnotes, “g”, “m”, and “r”, that signify geologically exceptional specimens (“g”), modified isotopic compositions in material subjected to undisclosed or inadvertent isotopic fractionation (“m”), and cases in which the range in isotopic composition of normal terrestrial material prevents a more precise atomic-weight value from being given (“r”). As some elements are assigned intervals for their standard atomic-weight values (applies to 12 elements since 2009), footnotes “g” and “r” are no longer needed for these elements.

  15. Cellulose biosynthesis by the beta-proteobacterium, Chromobacterium violaceum.

    PubMed

    Recouvreux, Derce O S; Carminatti, Claudimir A; Pitlovanciv, Ana K; Rambo, Carlos R; Porto, Luismar M; Antônio, Regina V

    2008-11-01

    The Chromobacterium violaceum ATCC 12472 genome was sequenced by The Brazilian National Genome Project Consortium. Previous annotation reported the presence of cellulose biosynthesis genes in that genome. Analysis of these genes showed that, as observed in other bacteria, they are organized in two operons. In the present work, experimental evidence of the presence of cellulose in the extracellular matrix of the biofilm produced by C. violaceum in static cultures is presented. Biofilm samples were enzymatically digested by cellulase, releasing glucose units, suggesting the presence of cellulose as an extracellular matrix component. Fluorescence microscopy observations showed that C. violaceum produces a cellulase-sensitive extracellular matrix composed of fibers able to bind calcofluor. C. violaceum grows on medium containing Congo red, forming brown-red colonies. Together, these results suggest that the cellulase-susceptible matrix material is cellulose. Scanning electron microscopy analysis showed that the extracellular matrix exhibited a network of microfibrils, typical of bacterial cellulose. Although cellulose production is widely distributed among several bacterial species, including at least the alpha and gamma groups of Gram-negative proteobacteria, we provide the first experimental evidence for cellulose production in beta-proteobacteria.

  16. Semi-automatic semantic annotation of PubMed Queries: a study on quality, efficiency, satisfaction

    PubMed Central

    Névéol, Aurélie; Islamaj-Doğan, Rezarta; Lu, Zhiyong

    2010-01-01

    Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical information queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, quality and number of the resulting annotations. The analysis of annotation results showed that the number of required hand annotations is 28.9% less when using pre-annotated results from automatic tools. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that automatic pre-annotations are found helpful by most annotators. Our experience suggests using an automatic tool to assist large-scale manual annotation projects. This helps speed up the annotation process and improve annotation consistency while maintaining high quality of the final annotations. PMID:21094696

  17. Unsupervised discovery of information structure in biomedical documents.

    PubMed

    Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna

    2015-04-01

    Information structure (IS) analysis is a text mining technique, which classifies text in biomedical articles into categories that capture different types of information, such as objectives, methods, results and conclusions of research. It is a highly useful technique that can support a range of Biomedical Text Mining tasks and can help readers of biomedical literature find information of interest faster, accelerating the highly time-consuming process of literature review. Several approaches to IS analysis have been presented in the past, with promising results in real-world biomedical tasks. However, all existing approaches, even weakly supervised ones, require several hundreds of hand-annotated training sentences specific to the domain in question. Because biomedicine is subject to considerable domain variation, such annotations are expensive to obtain. This makes the application of IS analysis across biomedical domains difficult. In this article, we investigate an unsupervised approach to IS analysis and evaluate the performance of several unsupervised methods on a large corpus of biomedical abstracts collected from PubMed. Our best unsupervised algorithm (a multilevel weighted graph clustering algorithm) performs very well on the task, obtaining F-scores above 0.70 for most IS categories when applied to well-known IS schemes. This level of performance is close to that of lightly supervised IS methods and has proven sufficient to aid a range of practical tasks. Thus, using an unsupervised approach, IS could be applied to support a wide range of tasks across sub-domains of biomedicine. We also demonstrate that unsupervised learning brings novel insights into the IS of biomedical literature and discovers information categories that are not present in any of the existing IS schemes. The annotated corpus and software are available at http://www.cl.cam.ac.uk/∼dk427/bio14info.html. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  18. My Objections to Outcome [Note the Singular] Assessment

    ERIC Educational Resources Information Center

    Digby, Joan

    2014-01-01

    Rubric means red ochre--red earth--as in Bryce Canyon and Sedona. Red headers were used in medieval manuscripts as section or chapter markers, and you can bet that the Whore of Babylon got herself some fancy rubrics over the years. Through most of its history, the word has been attached to religious texts and liturgy; rubrics were used as…

  19. Towards computational improvement of DNA database indexing and short DNA query searching.

    PubMed

    Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska

    2014-09-03

    In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach that identifies all query hits in the database without having to examine all entries in the indexed data structure, while limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported if the database is searched against a query shorter than [Formula: see text] nucleotides, such that [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution to this drawback is also presented.
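The abstract does not reproduce the indexing equation itself, but the general idea of indexing fixed-length DNA words with a mapping function can be sketched as follows (a base-4 encoding is a common choice and is an assumption here, not the authors' actual function):

```python
def encode(kmer, code={'A': 0, 'C': 1, 'G': 2, 'T': 3}):
    """Map a DNA word to an integer key by treating it as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + code[base]
    return value

def build_index(genome, n):
    """Index every length-n word of the genome by its numeric key."""
    index = {}
    for i in range(len(genome) - n + 1):
        index.setdefault(encode(genome[i:i + n]), []).append(i)
    return index

def search(index, query, n):
    """Exact hits for a query of length exactly n."""
    return index.get(encode(query), [])

genome = "ACGTACGTTT"
idx = build_index(genome, 4)
print(search(idx, "ACGT", 4))  # -> [0, 4]
```

Queries shorter than the indexed word length run into the boundary effect the authors describe for Reneker's method; this sketch sidesteps it by searching words of exactly length n.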

  20. Metagenomic Analysis of the Rumen Microbiome of Steers with Wheat-Induced Frothy Bloat.

    PubMed

    Pitta, D W; Pinchak, W E; Indugu, N; Vecchiarelli, B; Sinha, R; Fulford, J D

    2016-01-01

    Frothy bloat is a serious metabolic disorder that affects stocker cattle grazing hard red winter wheat forage in the Southern Great Plains, causing reduced performance, morbidity, and mortality. We hypothesize that a microbial dysbiosis develops in the rumen microbiome of stocker cattle grazing on high-quality winter wheat pasture that predisposes them to frothy bloat risk. In this study, rumen contents were harvested from six cannulated steers grazing hard red winter wheat (three with bloat score "2" and three with bloat score "0"), extracted for genomic DNA and subjected to 16S rDNA and shotgun sequencing on a 454/Roche platform. Approximately 1.5 million reads were sequenced, assembled and assigned phylogenetic and functional annotations. Bacteria accounted for up to 84% of the sequences while archaea contributed nearly 5%. The abundance of archaea was higher in bloated animals (P < 0.05) and was dominated by Methanobrevibacter. Predominant bacterial phyla were Firmicutes (65%), Actinobacteria (13%), Bacteroidetes (10%), and Proteobacteria (6%) across all samples. Genera from Firmicutes such as Clostridium, Eubacterium, and Butyrivibrio increased (P < 0.05) while Prevotella from Bacteroidetes decreased in bloated samples. Co-occurrence analysis revealed syntrophic associations between bacteria and archaea in non-bloated samples; however, such interactions faded in bloated samples. Functional annotation of assembled reads against the Subsystems database revealed the abundance of several metabolic pathways, with carbohydrate and protein metabolism well represented. Assignment of contigs to the CaZy database revealed a greater diversity of Glycosyl Hydrolases, dominated by oligosaccharide-breaking enzymes (>70%), in non-bloated samples. However, the abundance and diversity of CaZymes were greatly reduced in bloated samples, indicating the disruption of carbohydrate metabolism. We conclude that mild to moderate frothy bloat results from tradeoffs both within and between microbial domains due to greater competition for substrates that are of limited availability as a result of biofilm formation.

  1. Metagenomic Analysis of the Rumen Microbiome of Steers with Wheat-Induced Frothy Bloat

    PubMed Central

    Pitta, D. W.; Pinchak, W. E.; Indugu, N.; Vecchiarelli, B.; Sinha, R.; Fulford, J. D.

    2016-01-01

    Frothy bloat is a serious metabolic disorder that affects stocker cattle grazing hard red winter wheat forage in the Southern Great Plains, causing reduced performance, morbidity, and mortality. We hypothesize that a microbial dysbiosis develops in the rumen microbiome of stocker cattle grazing on high-quality winter wheat pasture that predisposes them to frothy bloat risk. In this study, rumen contents were harvested from six cannulated steers grazing hard red winter wheat (three with bloat score “2” and three with bloat score “0”), extracted for genomic DNA and subjected to 16S rDNA and shotgun sequencing on a 454/Roche platform. Approximately 1.5 million reads were sequenced, assembled and assigned phylogenetic and functional annotations. Bacteria accounted for up to 84% of the sequences while archaea contributed nearly 5%. The abundance of archaea was higher in bloated animals (P < 0.05) and was dominated by Methanobrevibacter. Predominant bacterial phyla were Firmicutes (65%), Actinobacteria (13%), Bacteroidetes (10%), and Proteobacteria (6%) across all samples. Genera from Firmicutes such as Clostridium, Eubacterium, and Butyrivibrio increased (P < 0.05) while Prevotella from Bacteroidetes decreased in bloated samples. Co-occurrence analysis revealed syntrophic associations between bacteria and archaea in non-bloated samples; however, such interactions faded in bloated samples. Functional annotation of assembled reads against the Subsystems database revealed the abundance of several metabolic pathways, with carbohydrate and protein metabolism well represented. Assignment of contigs to the CaZy database revealed a greater diversity of Glycosyl Hydrolases, dominated by oligosaccharide-breaking enzymes (>70%), in non-bloated samples. However, the abundance and diversity of CaZymes were greatly reduced in bloated samples, indicating the disruption of carbohydrate metabolism. We conclude that mild to moderate frothy bloat results from tradeoffs both within and between microbial domains due to greater competition for substrates that are of limited availability as a result of biofilm formation. PMID:27242715

  2. Arkansas Department of Education Home School Report, 2005-2006

    ERIC Educational Resources Information Center

    Arkansas Department of Education, 2006

    2006-01-01

    This report presents data on home schooling in the state of Arkansas that covers: students withdrawn from home school; home school student count by county, district, and grade level; and home school enrollments by grade and gender. The report contains the texts of the Arkansas Code Annotated Section 6-15-501 through Section 6-15-508 Home School…

  3. Arkansas Department of Education Home School Report, 2006-2007

    ERIC Educational Resources Information Center

    Arkansas Department of Education, 2007

    2007-01-01

    This report presents data on home schooling in the state of Arkansas that covers: students withdrawn from home school; home school student count by county, district, and grade level; and home school enrollments by grade and gender. The report contains the texts of the Arkansas Code Annotated Section 6-15-501 through Section 6-15-508 Home School…

  4. Pour Adolescent et Adulte, Francais Langue Etrangere, Niveau 1 (French as a Foreign Language, Level 1, for Adolescents and Adults)

    ERIC Educational Resources Information Center

    Gibert, Pierre

    1975-01-01

    This annotated bibliography lists dictionaries and reading materials including stories and legends, biographies, works relating to cinema, theatre and French civilization, magazines, and educational activities and games for introductory instruction of French as a foreign language to adults and adolescents. (Text is in French.) (CLK)

  5. Instance-Based Question Answering

    DTIC Science & Technology

    2006-12-01

    answer clustering, composition, and scoring. Moreover, with the effort dedicated to improving monolingual system performance, system parameters are...text collections: document type, manual or automatic annotations (if any), and stylistic and notational differences in technical terms. Monolingual ...forum in which cross language retrieval systems and question answering systems are tested for various Eu- ropean languages. The CLEF QA monolingual task

  6. Children's Reading in America, 1776. A Selection of Titles.

    ERIC Educational Resources Information Center

    Haviland, Virginia, Comp.

    This annotated list of reading materials available to American children in 1776 includes both items published on this side of the Atlantic and works issued abroad and sold here as imports. The titles are divided into five categories: (1) primers and other instructional texts, (2) works intended for moral and religious instruction, (3) stories, (4)…

  7. MAPU: Max-Planck Unified database of organellar, cellular, tissue and body fluid proteomes.

    PubMed

    Zhang, Yanling; Zhang, Yong; Adachi, Jun; Olsen, Jesper V; Shi, Rong; de Souza, Gustavo; Pasini, Erica; Foster, Leonard J; Macek, Boris; Zougman, Alexandre; Kumar, Chanchal; Wisniewski, Jacek R; Jun, Wang; Mann, Matthias

    2007-01-01

    Mass spectrometry (MS)-based proteomics has become a powerful technology to map the protein composition of organelles, cell types and tissues. In our department, a large-scale effort to map these proteomes is complemented by the Max-Planck Unified (MAPU) proteome database. MAPU contains several body fluid proteomes, including plasma, urine, and cerebrospinal fluid. Cell lines have been mapped to a depth of several thousand proteins and the red blood cell proteome has also been analyzed in depth. The liver proteome is represented with 3200 proteins. By employing high resolution MS and stringent validation criteria, false positive identification rates in MAPU are lower than 1:1000. Thus MAPU datasets can serve as reference proteomes in biomarker discovery. MAPU contains the peptides identifying each protein, measured masses, scores and intensities and is freely available at http://www.mapuproteome.com using a clickable interface of cell or body parts. Proteome data can be queried across proteomes by protein name, accession number, sequence similarity, peptide sequence and annotation information. More than 4500 mouse and 2500 human proteins have already been identified in at least one proteome. Basic annotation information and links to other public databases are provided in MAPU and we plan to add further analysis tools.

  8. GOGrapher: A Python library for GO graph representation and analysis

    PubMed Central

    Muller, Brian; Richards, Adam J; Jin, Bo; Lu, Xinghua

    2009-01-01

    Background The Gene Ontology is the most commonly used controlled vocabulary for annotating proteins. The concepts in the ontology are organized as a directed acyclic graph, in which a node corresponds to a biological concept and a directed edge denotes the parent-child semantic relationship between a pair of terms. A large number of protein annotations further create links between proteins and their functional annotations, reflecting the contemporary knowledge about proteins and their functional relationships. This leads to a complex graph consisting of interleaved biological concepts and their associated proteins. What is needed is a simple, open source library that provides tools to not only create and view the Gene Ontology graph, but to analyze and manipulate it as well. Here we describe the development and use of GOGrapher, a Python library that can be used for the creation, analysis, manipulation, and visualization of Gene Ontology related graphs. Findings An object-oriented approach was adopted to organize the hierarchy of graph types and associated classes. An Application Programming Interface is provided through which different types of graphs can be programmatically created, manipulated, and visualized. GOGrapher has been successfully utilized in multiple research projects, e.g., a graph-based multi-label text classifier for protein annotation. Conclusion The GOGrapher project provides a reusable programming library designed for the manipulation and analysis of Gene Ontology graphs. The library is freely available for the scientific community to use and improve. PMID:19583843
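GOGrapher's actual API is not detailed in the abstract; the structure it manages, a directed acyclic graph of terms with attached protein annotations, can be sketched generically like this (term names and helper functions are hypothetical):

```python
# Generic sketch of a GO-like DAG: child -> parents edges, plus protein
# annotations propagated to every ancestor term (the "true-path" idea).
parents = {
    "GO:leaf_a": ["GO:mid"],
    "GO:leaf_b": ["GO:mid"],
    "GO:mid": ["GO:root"],
    "GO:root": [],
}
annotations = {"P1": ["GO:leaf_a"], "P2": ["GO:leaf_b", "GO:mid"]}

def ancestors(term):
    """All terms reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def propagated(protein):
    """Direct annotations of a protein plus every ancestor term."""
    terms = set(annotations[protein])
    for t in annotations[protein]:
        terms |= ancestors(t)
    return terms

print(sorted(propagated("P1")))  # -> ['GO:leaf_a', 'GO:mid', 'GO:root']
```

Propagating annotations upward in this way is what turns the term DAG plus annotation links into the interleaved concept-protein graph the abstract describes.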

  9. SSDOnt: An Ontology for Representing Single-Subject Design Studies.

    PubMed

    Berges, Idoia; Bermúdez, Jesus; Illarramendi, Arantza

    2018-02-01

    Single-Subject Design is used in several areas such as education and biomedicine. However, no suitable formal vocabulary exists for annotating the detailed configuration and the results of this type of research study with the appropriate granularity for looking for information about them. Therefore, the search for those study designs relies heavily on a syntactical search on the abstract, keywords or full text of the publications about the study, which entails some limitations. To present SSDOnt, a specific-purpose ontology for describing and annotating single-subject design studies, so that complex questions can be asked about them afterwards. The ontology was developed following the NeOn methodology. Once the requirements of the ontology were defined, a formal model was described in a Description Logic and later implemented in the ontology language OWL 2 DL. We show how the ontology provides a reference model with a suitable terminology for the annotation and searching of single-subject design studies and their main components, such as the phases, the intervention types, the outcomes and the results. Some mappings with terms of related ontologies have been established. We show, as a proof of concept, that classes in the ontology can be easily extended to annotate more precise information about specific interventions and outcomes such as those related to autism. Moreover, we provide examples of some types of queries that can be posed to the ontology. SSDOnt has achieved the purpose of covering the descriptions of the domain of single-subject research studies. Schattauer GmbH.

  10. Automated grading system for evaluation of ocular redness associated with dry eye.

    PubMed

    Rodriguez, John D; Johnston, Patrick R; Ousler, George W; Smith, Lisa M; Abelson, Mark B

    2013-01-01

    We have observed that dry eye redness is characterized by a prominence of fine horizontal conjunctival vessels in the exposed ocular surface of the interpalpebral fissure, and have incorporated this feature into the grading of redness in clinical studies of dry eye. To develop an automated method of grading dry eye-associated ocular redness in order to expand on the clinical grading system currently used. Ninety-nine images from 26 dry eye subjects were evaluated by five graders using a 0-4 dry eye redness scale in 0.5 increments (Ora Calibra™ Dry Eye Redness Scale [OCDER]). For the automated method, the OpenCV computer vision library was used to develop software for calculating redness and horizontal conjunctival vessels (noted as "horizontality"). From the original photograph, the region of interest (ROI) was selected manually using the open-source ImageJ software. Total average redness intensity (Com-Red) was calculated from a single-channel 8-bit image as R - 0.83G - 0.17B, where R, G and B were the respective intensities of the red, green and blue channels. The location of vessels was detected by normalizing the blue channel and selecting pixels with an intensity of less than 97% of the mean. The horizontal component (Com-Hor) was calculated by the first-order Sobel derivative in the vertical direction, and the score was calculated as the average blue-channel image intensity of this vertical derivative. Pearson correlation coefficients, accuracy and concordance correlation coefficients (CCC) were calculated after regression and standardized regression of the dataset. The agreement (both Pearson's and CCC) among investigators using the OCDER scale was 0.67, while the agreement of investigator to computer was 0.76. A multiple regression using both redness and horizontality improved the agreement CCC from 0.66 and 0.69 to 0.76, demonstrating the contribution of vessel geometry to the overall grade. Computer analysis of a given image has 100% repeatability and zero variability from session to session. This objective means of grading ocular redness in a unified fashion has potential significance as a new clinical endpoint. In comparisons between computer and investigator, computer grading proved to be more reliable than another investigator using the OCDER scale. The best-fitting model based on the present sample, and usable for future studies, was [Formula: see text], where [Formula: see text] is the predicted investigator grade, and [Formula: see text] and [Formula: see text] are logarithmic transformations of the computer-calculated parameters COM-Hor and COM-Red. Considering the superior repeatability, computer-automated grading might be preferable to investigator grading in multicentered dry eye studies in which the subtle differences in redness incurred by treatment have historically been difficult to define.
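The two image measures described in this record are plain channel arithmetic and can be illustrated with numpy alone. In this sketch the function names are invented, and a simple first-order vertical difference stands in for the Sobel derivative the authors computed with OpenCV; it assumes a float RGB array whose blue channel is not all zero:

```python
import numpy as np

def com_red(roi):
    """Average redness intensity over the ROI: R - 0.83G - 0.17B."""
    r, g, b = roi[..., 0], roi[..., 1], roi[..., 2]
    return float(np.mean(r - 0.83 * g - 0.17 * b))

def com_hor(roi):
    """Horizontality score: mean magnitude of the vertical derivative of
    the normalised blue channel, so horizontal vessel edges score high.
    (np.diff here is a crude stand-in for the first-order Sobel kernel.)"""
    blue = roi[..., 2].astype(float)
    blue = blue / blue.mean()  # normalise; assumes mean > 0
    return float(np.abs(np.diff(blue, axis=0)).mean())

# Synthetic 4x4 ROI: strong uniform red, faint uniform blue.
roi = np.zeros((4, 4, 3))
roi[..., 0] = 100.0
roi[..., 2] = 10.0
print(round(com_red(roi), 2))  # -> 98.3
print(com_hor(roi))            # -> 0.0 (no vertical variation)
```

A real image of horizontal vessels would yield a nonzero Com-Hor, since fine horizontal structures produce strong intensity changes from one pixel row to the next.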

  11. Community annotation experiment for ground truth generation for the i2b2 medication challenge

    PubMed Central

    Solti, Imre; Xia, Fei; Cadag, Eithon

    2010-01-01

    Objective Within the context of the Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records, the authors (also referred to as ‘the i2b2 medication challenge team’ or ‘the i2b2 team’ for short) organized a community annotation experiment. Design For this experiment, the authors released annotation guidelines and a small set of annotated discharge summaries. They asked the participants of the Third i2b2 Workshop to annotate 10 discharge summaries per person; each discharge summary was annotated by two annotators from two different teams, and a third annotator from a third team resolved disagreements. Measurements In order to evaluate the reliability of the annotations thus produced, the authors measured community inter-annotator agreement and compared it with the inter-annotator agreement of expert annotators when both the community and the expert annotators generated ground truth based on pooled system outputs. For this purpose, the pool consisted of the three most densely populated automatic annotations of each record. The authors also compared the community inter-annotator agreement with expert inter-annotator agreement when the experts annotated raw records without using the pool. Finally, they measured the quality of the community ground truth by comparing it with the expert ground truth. Results and conclusions The authors found that the community annotators achieved comparable inter-annotator agreement to expert annotators, regardless of whether the experts annotated from the pool. Furthermore, the ground truth generated by the community obtained F-measures above 0.90 against the ground truth of the experts, indicating the value of the community as a source of high-quality ground truth even on intricate and domain-specific annotation tasks. PMID:20819855

  12. Support patient search on pathology reports with interactive online learning based data extraction.

    PubMed

    Zheng, Shuai; Lu, James J; Appin, Christina; Brat, Daniel; Wang, Fusheng

    2015-01-01

    Structural reporting enables semantic understanding and prompt retrieval of clinical findings about patients. While synoptic pathology reporting provides templates for data entries, information in pathology reports remains primarily in narrative free text form. Extracting data of interest from narrative pathology reports could significantly improve the representation of the information and enable complex structured queries. However, manual extraction is tedious and error-prone, and automated tools are often constructed with a fixed training dataset and not easily adaptable. Our goal is to extract data from pathology reports to support advanced patient search with a highly adaptable semi-automated data extraction system, which can adjust and self-improve by learning from a user's interaction with minimal human effort. We have developed an online machine learning based information extraction system called IDEAL-X. With its graphical user interface, the system's data extraction engine automatically annotates values for users to review upon loading each report text. The system analyzes users' corrections regarding these annotations with online machine learning, and incrementally enhances and refines the learning model as reports are processed. The system also takes advantage of customized controlled vocabularies, which can be adaptively refined during the online learning process to further assist the data extraction. As the accuracy of automatic annotation improves overtime, the effort of human annotation is gradually reduced. After all reports are processed, a built-in query engine can be applied to conveniently define queries based on extracted structured data. We have evaluated the system with a dataset of anatomic pathology reports from 50 patients. Extracted data elements include demographical data, diagnosis, genetic marker, and procedure. The system achieves F-1 scores of around 95% for the majority of tests. 
Extracting data from pathology reports could provide more accurate knowledge to support biomedical research and clinical diagnosis. IDEAL-X provides a bridge that couples online machine learning based data extraction with knowledge from human feedback. By combining iterative online learning and adaptive controlled vocabularies, IDEAL-X can deliver highly adaptive and accurate data extraction to support patient search.

  13. High precision mass measurements for wine metabolomics

    PubMed Central

    Roullier-Gall, Chloé; Witting, Michael; Gougeon, Régis D.; Schmitt-Kopplin, Philippe

    2014-01-01

    An overview of the critical steps for the non-targeted Ultra-High Performance Liquid Chromatography coupled with Quadrupole Time-of-Flight Mass Spectrometry (UPLC-Q-ToF-MS) analysis of wine chemistry is given, ranging from the study design, data preprocessing and statistical analyses, to marker identification. UPLC-Q-ToF-MS data were enhanced by the alignment of exact mass data from FTICR-MS, and marker peaks were identified using UPLC-Q-ToF-MS². In combination with multivariate statistical tools and the annotation of peaks with metabolites from relevant databases, this analytical process provides a fine description of the chemical complexity of wines, as exemplified in the case of red (Pinot noir) and white (Chardonnay) wines from various geographic origins in Burgundy. PMID:25431760

  14. High precision mass measurements for wine metabolomics

    NASA Astrophysics Data System (ADS)

    Roullier-Gall, Chloé; Witting, Michael; Gougeon, Régis; Schmitt-Kopplin, Philippe

    2014-11-01

    An overview of the critical steps for the non-targeted Ultra-High Performance Liquid Chromatography coupled with Quadrupole Time-of-Flight Mass Spectrometry (UPLC-Q-ToF-MS) analysis of wine chemistry is given, ranging from the study design, data preprocessing and statistical analyses, to marker identification. UPLC-Q-ToF-MS data were enhanced by the alignment of exact mass data from FTICR-MS, and marker peaks were identified using UPLC-Q-ToF-MS². In combination with multivariate statistical tools and the annotation of peaks with metabolites from relevant databases, this analytical process provides a fine description of the chemical complexity of wines, as exemplified in the case of red (Pinot noir) and white (Chardonnay) wines from various geographic origins in Burgundy.
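The multivariate step can be illustrated with a minimal PCA sketch on synthetic peak intensities (invented numbers, not the paper's data): two groups of "wines" differ on a subset of mass features, and the first principal component separates them.

```python
# PCA on a feature matrix of mass-spectral peak intensities, as used to
# separate red (Pinot noir) from white (Chardonnay) wines. Synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 6 wines x 50 m/z features; "red" wines shifted on the first 10 features
red = rng.normal(5.0, 1.0, (3, 50))
red[:, :10] += 4.0
white = rng.normal(5.0, 1.0, (3, 50))
X = np.vstack([red, white])

scores = PCA(n_components=2).fit_transform(X)
red_pc1, white_pc1 = scores[:3, 0], scores[3:, 0]
# PC1 should separate the two groups (sign is arbitrary)
print(red_pc1.mean(), white_pc1.mean())
```

In the paper's workflow this scores plot is what reveals grouping by colour and geographic origin; here the group shift is planted, so PC1 captures it by construction.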

  15. Impact Site: Infrared Image

    NASA Image and Video Library

    2017-09-15

    This montage of images, made from data obtained by Cassini's visual and infrared mapping spectrometer, shows the location on Saturn where the NASA spacecraft entered Saturn's atmosphere on Sept. 15, 2017. This view shows Saturn in the thermal infrared, at a wavelength of 5 microns. Here, the instrument is sensing heat coming from Saturn's interior, in red. Clouds in the atmosphere are silhouetted against that inner glow. This location -- the site of Cassini's atmospheric entry -- was at this time on the night side of the planet, but would rotate into daylight by the time Cassini made its final dive into Saturn's upper atmosphere, ending its remarkable 13-year exploration of Saturn. Both an annotated version and an animation are available at https://photojournal.jpl.nasa.gov/catalog/PIA21896

  16. Phytochemical Profiles and Antimicrobial Activities of Allium cepa Red cv. and A. sativum Subjected to Different Drying Methods: A Comparative MS-Based Metabolomics.

    PubMed

    Farag, Mohamed A; Ali, Sara E; Hodaya, Rashad H; El-Seedi, Hesham R; Sultani, Haider N; Laub, Annegret; Eissa, Tarek F; Abou-Zaid, Fouad O F; Wessjohann, Ludger A

    2017-05-08

    Plants of the Allium genus produce sulphur compounds that give them a characteristic (alliaceous) flavour and underlie their medicinal use. In this study, the chemical composition and antimicrobial properties of Allium cepa red cv. and A. sativum in the context of three different drying processes were assessed using metabolomics. Bulbs were dried using either microwave drying, air drying, or freeze drying and further subjected to chemical analysis of their composition of volatile and non-volatile metabolites. Volatiles were collected using solid phase micro-extraction (SPME) coupled to gas chromatography-mass spectrometry (GC/MS) with 42 identified volatiles including 30 sulphur compounds, four nitriles, three aromatics, and three esters. Profiling of the polar non-volatile metabolites via ultra-performance liquid chromatography coupled to high resolution MS (UPLC/MS) annotated 51 metabolites including dipeptides, flavonoids, phenolic acids, and fatty acids. Major peaks in GC/MS or UPLC/MS contributing to the discrimination between A. sativum and A. cepa red cv. were assigned to sulphur compounds and flavonoids. Whereas sulphur conjugates amounted to the major forms in A. sativum, flavonoids predominated in the chemical composition of A. cepa red cv. With regard to drying impact on Allium metabolites, notable and clear separations among specimens were revealed using principal component analysis (PCA). The PCA scores plot of the UPLC/MS dataset showed that the metabolite composition of microwave-dried specimens was closer to that of freeze-dried ones and distant from that of air-dried bulbs, a pattern observed in both A. cepa and A. sativum. Compared to GC/MS, the UPLC/MS-derived PCA model was more consistent and better suited to assessing the impact of drying on Allium metabolism. A phthalate derivative of as yet unknown origin was found exclusively in a commercial garlic preparation via GC/MS.
The freeze-dried samples of both Allium species exhibited stronger antimicrobial activities than the other dried specimens, with A. sativum being in general more active than A. cepa red cv.
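The drying comparison in the abstract reduces to distances between metabolite profiles. A toy sketch with synthetic profiles (invented numbers, not the study's measurements) mirrors the reported pattern that microwave-dried bulbs sit closer to freeze-dried ones than air-dried bulbs do:

```python
# Compare mean metabolite profiles of drying methods by Euclidean distance.
import numpy as np

rng = np.random.default_rng(1)
freeze = rng.normal(10, 1, 30)                 # reference profile
microwave = freeze + rng.normal(0, 0.5, 30)    # mild perturbation
air = freeze + rng.normal(3, 1.5, 30)          # stronger compositional shift

d_micro = np.linalg.norm(microwave - freeze)
d_air = np.linalg.norm(air - freeze)
print(d_micro < d_air)  # → True
```

The PCA separation reported in the paper is the multivariate version of this comparison: methods whose profiles lie close together cluster together in the scores plot.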

  17. Functional characterization of novel ABCB6 mutations and their clinical implications in familial pseudohyperkalemia

    PubMed Central

    Andolfo, Immacolata; Russo, Roberta; Manna, Francesco; De Rosa, Gianluca; Gambale, Antonella; Zouwail, Soha; Detta, Nicola; Pardo, Catia Lo; Alper, Seth L.; Brugnara, Carlo; Sharma, Alok K.; De Franceschi, Lucia; Iolascon, Achille

    2016-01-01

    Isolated familial pseudohyperkalemia is a dominant red cell trait characterized by cold-induced ‘passive leak’ of red cell potassium ions into plasma. The causative gene of this condition is ABCB6, which encodes an erythrocyte membrane ABC transporter protein bearing the Langereis blood group antigen system. In this study analyzing three new families, we report the first functional characterization of ABCB6 mutants, including the homozygous mutation V454A, heterozygous mutation R276W, and compound heterozygous mutations R276W and R723Q (in trans). All these mutations are annotated in public databases, suggesting that familial pseudohyperkalemia could be common in the general population. Indeed, we identified variant R276W in one of 327 random blood donors (0.3%). Four weeks’ storage of heterozygous R276W blood cells resulted in massive loss of potassium compared to that from healthy control red blood cells. Moreover, measurement of cation flux demonstrated greater loss of potassium or rubidium ions from HEK-293 cells expressing ABCB6 mutants than from cells expressing wild-type ABCB6. The R276W/R723Q mutations elicited greater cellular potassium ion efflux than did the other mutants tested. In conclusion, ABCB6 missense mutations in red blood cells from subjects with familial pseudohyperkalemia show elevated potassium ion efflux. The prevalence of such individuals in the blood donor population is moderate. The fact that storage of blood from these subjects leads to significantly increased levels of potassium in the plasma could have serious clinical implications for neonates and infants receiving large-volume transfusions of whole blood. Genetic tests for familial pseudohyperkalemia could be added to blood donor pre-screening. Further study of ABCB6 function and trafficking could be informative for the study of other pathologies of red blood cell hydration. PMID:27151991

  18. Navigating the Functional Landscape of Transcription Factors via Non-Negative Tensor Factorization Analysis of MEDLINE Abstracts

    PubMed Central

    Roy, Sujoy; Yun, Daqing; Madahian, Behrouz; Berry, Michael W.; Deng, Lih-Yuan; Goldowitz, Daniel; Homayouni, Ramin

    2017-01-01

    In this study, we developed and evaluated a novel text-mining approach, using non-negative tensor factorization (NTF), to simultaneously extract and functionally annotate transcriptional modules consisting of sets of genes, transcription factors (TFs), and terms from MEDLINE abstracts. A sparse 3-mode term × gene × TF tensor was constructed that contained weighted frequencies of 106,895 terms in 26,781 abstracts shared among 7,695 genes and 994 TFs. The tensor was decomposed into sub-tensors using non-negative tensor factorization (NTF) across 16 different approximation ranks. Dominant entries of each of 2,861 sub-tensors were extracted to form term–gene–TF annotated transcriptional modules (ATMs). More than 94% of the ATMs were found to be enriched in at least one KEGG pathway or GO category, suggesting that the ATMs are functionally relevant. One advantage of this method is that it can discover potentially new gene–TF associations from the literature. Using a set of microarray and ChIP-Seq datasets as gold standard, we show that the precision of our method for predicting gene–TF associations is significantly higher than chance. In addition, we demonstrate that the terms in each ATM can be used to suggest new GO classifications to genes and TFs. Taken together, our results indicate that NTF is useful for simultaneous extraction and functional annotation of transcriptional regulatory networks from unstructured text, as well as for literature based discovery. A web tool called Transcriptional Regulatory Modules Extracted from Literature (TREMEL), available at http://binf1.memphis.edu/tremel, was built to enable browsing and searching of ATMs. PMID:28894735
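The core idea can be sketched in a simplified 2-mode analogue of the paper's 3-mode NTF: non-negative matrix factorization of a term × gene matrix, where the dominant entries of each factor play the role of the paper's annotated transcriptional modules. The matrix below is synthetic with two planted modules; the real system factorizes a term × gene × TF tensor:

```python
# NMF on a toy term x gene matrix; dominant entries of each component
# form a "module" of associated terms and genes.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# two planted modules: terms 0-4 with genes 0-4, terms 5-9 with genes 5-9
X = rng.random((10, 10)) * 0.1
X[:5, :5] += 1.0
X[5:, 5:] += 1.0

model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(X)   # term factors
H = model.components_        # gene factors

module_terms = np.argsort(W[:, 0])[-5:]   # dominant terms of component 0
module_genes = np.argsort(H[0])[-5:]      # dominant genes of component 0
print(sorted(map(int, module_terms)), sorted(map(int, module_genes)))
```

Extending this to three modes (adding a TF factor) is what lets the paper read off term-gene-TF triples from each sub-tensor.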

  19. OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents.

    PubMed

    Naderi, Nona; Kappler, Thomas; Baker, Christopher J O; Witte, René

    2011-10-01

    Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and if possible link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation. We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%. The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at http://www.semanticsoftware.info/organism-tagger. witte@semanticsoftware.info.
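A toy version of the rule-based layer described above can be written in a few lines: a regex detects binomial organism mentions, and abbreviated genus names ("E. coli") are resolved against genera already seen in the text. This is purely illustrative; the real OrganismTagger also uses NCBI Taxonomy lookups and machine learning for strain detection.

```python
import re

def tag_organisms(text):
    # full binomial mentions: Capitalized genus + lowercase species epithet
    full = re.findall(r"\b([A-Z][a-z]+) ([a-z]{3,})\b", text)
    genera = {genus[0]: genus for genus, _ in full}  # initial -> genus
    mentions = [f"{genus} {species}" for genus, species in full]
    # abbreviated mentions like "E. coli", expanded via seen genera
    for init, species in re.findall(r"\b([A-Z])\. ([a-z]{3,})\b", text):
        if init in genera:
            mentions.append(f"{genera[init]} {species}")
    return mentions

text = "Escherichia coli and Bacillus subtilis were compared; E. coli grew faster."
print(tag_organisms(text))
# → ['Escherichia coli', 'Bacillus subtilis', 'Escherichia coli']
```

Normalizing each expanded mention to a canonical taxonomy ID, as the paper does, would be the next step after this surface-level detection.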

  20. Text mining and expert curation to develop a database on psychiatric diseases and their genes

    PubMed Central

    Gutiérrez-Sacristán, Alba; Bravo, Àlex; Portero-Tresserra, Marta; Valverde, Olga; Armario, Antonio; Blanco-Gandía, M.C.; Farré, Adriana; Fernández-Ibarrondo, Lierni; Fonseca, Francina; Giraldo, Jesús; Leis, Angela; Mané, Anna; Mayer, M.A.; Montagud-Romero, Sandra; Nadal, Roser; Ortiz, Jordi; Pavon, Francisco Javier; Perez, Ezequiel Jesús; Rodríguez-Arias, Marta; Serrano, Antonia; Torrens, Marta; Warnault, Vincent; Sanz, Ferran

    2017-01-01

    Psychiatric disorders constitute one of the main causes of disability worldwide. In recent years, considerable research has been conducted on the genetic architecture of such diseases, although little understanding of their etiology has been achieved. The difficulty of accessing up-to-date, relevant genotype-phenotype information has hampered the application of this wealth of knowledge to translational research and clinical practice in order to improve diagnosis and treatment of psychiatric patients. PsyGeNET (http://www.psygenet.org/) has been developed with the aim of supporting research on the genetic architecture of psychiatric diseases, by providing integrated and structured accessibility to their genotype–phenotype association data, together with analysis and visualization tools. In this article, we describe the protocol developed for the sustainable update of this knowledge resource. It includes the recruitment of a team of domain experts in order to perform the curation of the data extracted by text mining. Annotation guidelines and a web-based annotation tool were developed to support the curators’ tasks. A curation workflow was designed including a pilot phase and two rounds of curation and analysis phases. Negative evidence from the literature on gene–disease associations (GDAs) was taken into account in the curation process. We report the results of the application of this workflow to the curation of GDAs for PsyGeNET, including an analysis of inter-annotator agreement, and suggest this model as a suitable approach for the sustainable development and update of knowledge resources. Database URL: http://www.psygenet.org PsyGeNET corpus: http://www.psygenet.org/ds/PsyGeNET/results/psygenetCorpus.tar PMID:29220439
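Inter-annotator agreement of the kind analyzed above is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. A small hand-computable sketch on made-up curator labels:

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    labels = sorted(set(a) | set(b))
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                       # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)    # chance agreement
    return (po - pe) / (1 - pe)

curator1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
curator2 = ["yes", "yes", "no", "no", "no", "no", "yes", "yes"]
print(round(cohen_kappa(curator1, curator2), 3))  # → 0.5
```

Here observed agreement is 6/8 = 0.75 and chance agreement is 0.5, giving kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5.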

  1. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.

    PubMed

    Sogancioglu, Gizem; Öztürk, Hakime; Özgür, Arzucan

    2017-07-15

    The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric. A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/ . gizemsogancioglu@gmail.com or arzucan.ozgur@boun.edu.tr. © The Author 2017. Published by Oxford University Press. All rights reserved. 
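One of the string-level similarity measures such a system typically includes can be sketched with TF-IDF vectors and cosine similarity (illustrative sentences; this is not the BIOSSES implementation, which additionally uses distributed representations and ontologies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "The protein inhibits tumor growth in mice."
s2 = "Tumor growth in mice is inhibited by the protein."
s3 = "The weather in Boston was cold yesterday."

X = TfidfVectorizer().fit_transform([s1, s2, s3])
sim_12 = cosine_similarity(X[0], X[1])[0, 0]
sim_13 = cosine_similarity(X[0], X[2])[0, 0]
print(sim_12 > sim_13)  # the semantically related pair scores higher
```

Purely lexical measures like this are exactly what the paper found insufficient for biomedical text, which motivates its corpus-trained and ontology-based components.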

  2. Maintaining the Dialogue: Joint Involvement in a Stand-Alone CD-Rom Chinese Course.

    ERIC Educational Resources Information Center

    Orton, Jane

    2003-01-01

    Presents a stand-alone CD-Rom version of a Web-based course in Modern Standard Chinese and addresses the problem of losing opportunities for negotiation in joint involvement episodes with teachers and peers in the CD-ROM course. Demonstrates trials of a mode of self-assessment in tasks using annotated sample texts generated by other students. (VWL)

  3. When Static Media Promote Active Learning: Annotated Illustrations Versus Narrated Animations in Multimedia Instruction

    ERIC Educational Resources Information Center

    Mayer, Richard E.; Hegarty, Mary; Mayer, Sarah; Campbell, Julie

    2005-01-01

    In 4 experiments, students received a lesson consisting of computer-based animation and narration or a lesson consisting of paper-based static diagrams and text. The lessons used the same words and graphics in the paper-based and computer-based versions to explain the process of lightning formation (Experiment 1), how a toilet tank works…

  4. Job Enrichment: Evaluation with Implications for Air Force Job Redesign. Interim Report, 1 January 1975-30 April 1977.

    ERIC Educational Resources Information Center

    Watson, Thomas W.; Zumbro, Patrick A.

    The main text of this report consists of a review and evaluation of job enrichment as an approach to job redesign, with implications for Air Force research and application. In addition, two appendixes are included: the first, a supplemental historical discussion; the second, a ninety-six-item annotated bibliography. Specific objectives are to…

  5. Resources for Teaching about Anti-Racism and Mutliethnic Education: Recent Outstanding Materials from Britain Selected Especially for American Teachers.

    ERIC Educational Resources Information Center

    Klein, Gillian; King, Edith W.

    This annotated list of resources for teachers is the product of several educators' efforts to promulgate the recent work being done in Britain in multicultural/multiethnic education, world studies, development studies and intercultural perspectives. An introduction cites appropriate texts for discussing race relations in the classroom. Section I,…

  6. Children's Books to Enrich the Social Studies: For the Elementary Grades. Revised Edition.

    ERIC Educational Resources Information Center

    Huus, Helen

    Six hundred and thirty social studies books for elementary children, published between 1935 and 1964, are listed in this revised annotated bibliography intended for teachers and librarians. The aim of the book is to provide, as a supplement to the classroom texts, a list of fact and fiction illustrating different ways of life in other countries in…

  7. Literature classification for semi-automated updating of biological knowledgebases

    PubMed Central

    2013-01-01

    Background As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403
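The classification setup described (TF-IDF features, k-NN classifier, five-fold cross-validation) can be sketched on synthetic abstracts; the texts and labels below are invented, not the TANTIGEN training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = (["tumor antigen epitope recognized by T cells"] * 10
         + ["crystal structure of a membrane transporter"] * 10)
labels = ["relevant"] * 10 + ["irrelevant"] * 10

pipe = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
scores = cross_val_score(pipe, texts, labels, cv=5)  # stratified 5-fold CV
print(scores.mean())  # → 1.0 on this trivially separable toy set
```

Real abstracts overlap far more in vocabulary, which is why the paper's reported accuracy of 0.95 on 310 abstracts is a meaningful result rather than a triviality.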

  8. Zone analysis in biology articles as a basis for information extraction.

    PubMed

    Mizuta, Yoko; Korhonen, Anna; Mullen, Tony; Collier, Nigel

    2006-06-01

    In the field of biomedicine, an overwhelming amount of experimental data has become available as a result of the high throughput of research in this domain. The amount of results reported has now grown beyond the limits of what can be managed by manual means. This makes it increasingly difficult for the researchers in this area to keep up with the latest developments. Information extraction (IE) in the biological domain aims to provide an effective automatic means to dynamically manage the information contained in archived journal articles and abstract collections and thus help researchers in their work. However, while considerable advances have been made in certain areas of IE, pinpointing and organizing factual information (such as experimental results) remains a challenge. In this paper we propose tackling this task by incorporating into IE information about rhetorical zones, i.e. classification of spans of text in terms of argumentation and intellectual attribution. As the first step towards this goal, we introduce a scheme for annotating biological texts for rhetorical zones and provide a qualitative and quantitative analysis of the data annotated according to this scheme. We also discuss our preliminary research on automatic zone analysis, and its incorporation into our IE framework.

  9. A semantic model for multimodal data mining in healthcare information systems.

    PubMed

    Iakovidis, Dimitris; Smailis, Christos

    2012-01-01

    Electronic health records (EHRs) are representative examples of multimodal/multisource data collections, including measurements, images and free texts. The diversity of such information sources and the increasing amounts of medical data produced by healthcare institutes annually pose significant challenges in data mining. In this paper we present a novel semantic model that describes knowledge extracted from the lowest level of a data mining process, where information is represented by multiple features, i.e. measurements or numerical descriptors extracted from measurements, images, texts or other medical data, forming multidimensional feature spaces. Knowledge collected by manual annotation or extracted by unsupervised data mining from one or more feature spaces is modeled through generalized qualitative spatial semantics. This model enables a unified representation of knowledge across multimodal data repositories. It contributes to bridging the semantic gap, by enabling direct links between low-level features and higher-level concepts, e.g. those describing body parts, anatomies and pathological findings. The proposed model has been developed in web ontology language based on description logics (OWL-DL) and can be applied to a variety of data mining tasks in medical informatics. Its utility is demonstrated for automatic annotation of medical data.
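The linking idea can be sketched without any ontology machinery: represent connections between a low-level feature and higher-level concepts as subject-predicate-object triples and query them uniformly. All names below are invented for illustration; the paper's model expresses such links in OWL-DL with qualitative spatial semantics.

```python
# A minimal in-memory triple store linking a feature to medical concepts.
triples = {
    ("feature42", "extractedFrom", "mriImage7"),
    ("feature42", "indicates", "LiverLesion"),
    ("LiverLesion", "locatedIn", "Liver"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching the given pattern; None is a wildcard."""
    return [(a, b, c) for a, b, c in triples
            if s in (None, a) and p in (None, b) and o in (None, c)]

print(query(s="feature42", p="indicates"))
# → [('feature42', 'indicates', 'LiverLesion')]
```

An OWL-DL model adds what this sketch lacks: class hierarchies, logical constraints, and automated reasoning over the links.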

  10. Literature mining of protein-residue associations with graph rules learned through distant supervision.

    PubMed

    Ravikumar, Ke; Liu, Haibin; Cohn, Judith D; Wall, Michael E; Verspoor, Karin

    2012-10-05

    We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein also named in the text. The methods presented in this work will enable improved protein functional site extraction from articles, ultimately supporting protein function prediction. Our method made use of linguistic patterns for identifying the amino acid residue mentions in text. Further, we applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. We finally present an approach to automated construction of relevant training and test data using the distant supervision model. The performance of the method was assessed by extracting protein-residue relations from a new automatically generated test set of sentences containing high confidence examples found using distant supervision. It achieved an F-measure of 0.84 on the automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods. The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.
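A toy version of the residue-mention patterns described above (illustrative, not the authors' exact rules) matches the common surface forms such as "Ser195", "histidine 57", and "D102":

```python
import re

AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
AAFULL = ("alanine|arginine|asparagine|aspartate|cysteine|glutamine|glutamate|"
          "glycine|histidine|isoleucine|leucine|lysine|methionine|phenylalanine|"
          "proline|serine|threonine|tryptophan|tyrosine|valine")
# three forms: three-letter code + number, full name + number, one-letter + number
pattern = re.compile(
    rf"\b(?:(?:{AA3})|(?:{AAFULL})\s*|[ACDEFGHIKLMNPQRSTVWY])(\d+)\b",
    re.IGNORECASE)

text = "Mutation of Ser195 abolished activity; histidine 57 and D102 form the triad."
print([m.group(0) for m in pattern.finditer(text)])
# → ['Ser195', 'histidine 57', 'D102']
```

The hard part the paper tackles is not this surface matching but associating each detected residue with the correct protein mentioned elsewhere in the sentence, which is where the learned syntactic graph rules come in.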

  11. Discovering body site and severity modifiers in clinical texts

    PubMed Central

    Dligach, Dmitriy; Bethard, Steven; Becker, Lee; Miller, Timothy; Savova, Guergana K

    2014-01-01

    Objective To research computational methods for discovering body site and severity modifiers in clinical texts. Methods We cast the task of discovering body site and severity modifiers as a relation extraction problem in the context of a supervised machine learning framework. We utilize rich linguistic features to represent the pairs of relation arguments and delegate the decision about the nature of the relationship between them to a support vector machine model. We evaluate our models using two corpora that annotate body site and severity modifiers. We also compare the model performance to a number of rule-based baselines. We conduct cross-domain portability experiments. In addition, we carry out feature ablation experiments to determine the contribution of various feature groups. Finally, we perform error analysis and report the sources of errors. Results The performance of our method for discovering body site modifiers achieves F1 of 0.740–0.908 and our method for discovering severity modifiers achieves F1 of 0.905–0.929. Discussion Results indicate that both methods perform well on both in-domain and out-domain data, approaching the performance of human annotators. The most salient features are token and named entity features, although syntactic dependency features also contribute to the overall performance. The dominant sources of errors are infrequent patterns in the data and inability of the system to discern deeper semantic structures. Conclusions We investigated computational methods for discovering body site and severity modifiers in clinical texts. Our best system is released open source as part of the clinical Text Analysis and Knowledge Extraction System (cTAKES). PMID:24091648

  12. Discovering body site and severity modifiers in clinical texts.

    PubMed

    Dligach, Dmitriy; Bethard, Steven; Becker, Lee; Miller, Timothy; Savova, Guergana K

    2014-01-01

    To research computational methods for discovering body site and severity modifiers in clinical texts. We cast the task of discovering body site and severity modifiers as a relation extraction problem in the context of a supervised machine learning framework. We utilize rich linguistic features to represent the pairs of relation arguments and delegate the decision about the nature of the relationship between them to a support vector machine model. We evaluate our models using two corpora that annotate body site and severity modifiers. We also compare the model performance to a number of rule-based baselines. We conduct cross-domain portability experiments. In addition, we carry out feature ablation experiments to determine the contribution of various feature groups. Finally, we perform error analysis and report the sources of errors. The performance of our method for discovering body site modifiers achieves F1 of 0.740-0.908 and our method for discovering severity modifiers achieves F1 of 0.905-0.929. Results indicate that both methods perform well on both in-domain and out-domain data, approaching the performance of human annotators. The most salient features are token and named entity features, although syntactic dependency features also contribute to the overall performance. The dominant sources of errors are infrequent patterns in the data and inability of the system to discern deeper semantic structures. We investigated computational methods for discovering body site and severity modifiers in clinical texts. Our best system is released open source as part of the clinical Text Analysis and Knowledge Extraction System (cTAKES).
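Casting modifier discovery as relation extraction, as both versions of this record describe, can be sketched with toy features: each candidate (sign/symptom, modifier) pair becomes a feature dictionary classified by a linear SVM. The features, pairs, and labels below are invented and far simpler than the cTAKES feature set:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# each candidate argument pair is described by token and type features
pairs = [
    {"arg1": "laceration", "arg2": "scalp", "dist": 1, "arg2_type": "anatomy"},
    {"arg1": "pain", "arg2": "severe", "dist": 1, "arg2_type": "severity"},
    {"arg1": "laceration", "arg2": "severe", "dist": 5, "arg2_type": "severity"},
    {"arg1": "edema", "arg2": "ankle", "dist": 2, "arg2_type": "anatomy"},
]
y = ["body_site", "severity", "none", "body_site"]

clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit(pairs, y)

test_pair = {"arg1": "fracture", "arg2": "femur", "dist": 1, "arg2_type": "anatomy"}
print(clf.predict([test_pair])[0])
```

The paper's finding that token and named-entity features dominate corresponds here to the `arg2_type` and token features carrying most of the decision weight.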

  13. Annotated translation of "Nota in verband met de voorgenomen putboring nabij Amsterdam [Note concerning the intended well drilling near Amsterdam]" by J. Drabbe and W. Badon Ghijben (1889)

    NASA Astrophysics Data System (ADS)

    Post, Vincent E. A.

    2018-06-01

    The famous report by engineers Drabbe and Badon Ghijben (1889), on an intended well drilling near Amsterdam (the Netherlands), was one of the key documents that contributed to the Ghijben-Herzberg formula, which links water-table elevation to the depth of the freshwater-saltwater interface in coastal aquifers. The report has been often cited but no English translation has appeared in the literature to date. The aim of this annotated translation of the report is to provide the international scientific community with easier access than was hitherto the case, plus electronic access to the original in Dutch. A brief introduction to the report is provided, followed by a translation that follows the original text as closely as possible.
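The Ghijben-Herzberg relation mentioned above links the water-table elevation h above sea level to the depth z of the freshwater-saltwater interface below sea level via the densities of fresh and salt water, z = ρf / (ρs − ρf) · h, roughly 40 m of interface depth per metre of head for typical densities:

```python
def interface_depth(h, rho_f=1000.0, rho_s=1025.0):
    """Depth of the fresh/salt interface below sea level (same units as h),
    using typical densities of fresh water (1000 kg/m^3) and seawater (1025 kg/m^3)."""
    return rho_f / (rho_s - rho_f) * h

print(interface_depth(1.0))  # → 40.0
```

This factor-of-40 leverage is what made the report's observation so consequential: small changes in water-table elevation imply large shifts in the depth of fresh water available in a coastal aquifer.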

  14. Incorporating Semantics into Data Driven Workflows for Content Based Analysis

    NASA Astrophysics Data System (ADS)

    Argüello, M.; Fernandez-Prieto, M. J.

    Finding meaningful associations between text elements and knowledge structures within clinical narratives in a highly verbal domain, such as psychiatry, is a challenging goal. The research presented here uses a small corpus of case histories and brings into play pre-existing knowledge, and therefore complements other approaches that use large corpora (millions of words) and no pre-existing knowledge. The paper describes a variety of experiments for content-based analysis: Linguistic Analysis using NLP-oriented approaches, Sentiment Analysis, and Semantically Meaningful Analysis. Although it is not standard practice, the paper advocates providing automatic support to annotate the functionality as well as the data for each experiment by performing semantic annotation that uses OWL and OWL-S. Lessons learnt can be transmitted to legacy clinical databases facing the conversion of clinical narratives according to prominent Electronic Health Records standards.

  15. Modeling loosely annotated images using both given and imagined annotations

    NASA Astrophysics Data System (ADS)

    Tang, Hong; Boujemaa, Nozha; Chen, Yunhao; Deng, Lei

    2011-12-01

    In this paper, we present an approach to learn latent semantic analysis models from loosely annotated images for automatic image annotation and indexing. The given annotation in training images is loose due to: 1. ambiguous correspondences between visual features and annotated keywords; 2. incomplete lists of annotated keywords. The second reason motivates us to enrich the incomplete annotation in a simple way before learning a topic model. In particular, some "imagined" keywords are poured into the incomplete annotation through measuring similarity between keywords in terms of their co-occurrence. Then, both given and imagined annotations are employed to learn probabilistic topic models for automatically annotating new images. We conduct experiments on two image databases (i.e., Corel and ESP) coupled with their loose annotations, and compare the proposed method with state-of-the-art discrete annotation methods. The proposed method improves word-driven probabilistic latent semantic analysis (PLSA-words) to a performance comparable with the best discrete annotation method, while a merit of PLSA-words is still kept, i.e., a wider semantic range.
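    The enrichment step described above (pouring "imagined" keywords into an incomplete annotation by measuring keyword co-occurrence) can be sketched roughly as follows. This is an illustrative simplification, not the paper's exact formulation: the Dice-style similarity, the function names, and the 0.5 threshold are all assumptions.

    ```python
    from collections import Counter
    from itertools import combinations

    def cooccurrence_similarity(annotations):
        """Build a keyword-keyword similarity table from co-occurrence counts
        across a collection of (loosely) annotated images."""
        pair_counts = Counter()
        word_counts = Counter()
        for keywords in annotations:
            ks = set(keywords)
            word_counts.update(ks)
            for a, b in combinations(sorted(ks), 2):
                pair_counts[(a, b)] += 1
        sim = {}
        for (a, b), n_ab in pair_counts.items():
            # Dice-style coefficient: 2 * joint count / sum of marginal counts
            sim[(a, b)] = sim[(b, a)] = 2 * n_ab / (word_counts[a] + word_counts[b])
        return sim

    def enrich(keywords, sim, threshold=0.5):
        """Add 'imagined' keywords to an incomplete annotation: any keyword
        sufficiently similar to a given keyword is poured in."""
        imagined = set(keywords)
        for (a, b), s in sim.items():
            if a in keywords and s >= threshold:
                imagined.add(b)
        return imagined
    ```

    For example, if "sky" and "cloud" always co-occur in the training annotations, `enrich(["sky"], sim)` would add "cloud" before topic-model learning.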

  16. Scholarly Information Extraction Is Going to Make a Quantum Leap with PubMed Central (PMC).

    PubMed

    Matthies, Franz; Hahn, Udo

    2017-01-01

    With the increasing availability of complete full texts (journal articles), rather than their surrogates (titles, abstracts), as resources for text analytics, entirely new opportunities arise for information extraction and text mining from scholarly publications. Yet, we gathered evidence that a range of problems are encountered for full-text processing when biomedical text analytics simply reuse existing NLP pipelines which were developed on the basis of abstracts (rather than full texts). We conducted experiments with four different relation extraction engines all of which were top performers in previous BioNLP Event Extraction Challenges. We found that abstract-trained engines lose up to 6.6% F-score points when run on full-text data. Hence, the reuse of existing abstract-based NLP software in a full-text scenario is considered harmful because of heavy performance losses. Given the current lack of annotated full-text resources to train on, our study quantifies the price paid for this shortcut.

  17. Text Mining in Cancer Gene and Pathway Prioritization

    PubMed Central

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes. PMID:25392685

  18. Text mining in cancer gene and pathway prioritization.

    PubMed

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.

  19. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidence, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  20. EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome

    PubMed Central

    Thibaud-Nissen, Françoise; Campbell, Matthew; Hamilton, John P; Zhu, Wei; Buell, C Robin

    2007-01-01

    Background Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1) the submission of gene annotation to an annotation project, 2) the review of the submitted models by project annotators, and 3) the incorporation of the submitted models in the ongoing annotation effort. Results We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website, as well as in the Community Annotation track of the Genome Browser. Conclusion We have applied EuCAP to rice. 
As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families has been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at . PMID:17961238

  1. Discovery of A-type procyanidin dimers in yellow raspberries by untargeted metabolomics and correlation based data analysis.

    PubMed

    Carvalho, Elisabete; Franceschi, Pietro; Feller, Antje; Herrera, Lorena; Palmieri, Luisa; Arapitsas, Panagiotis; Riccadonna, Samantha; Martens, Stefan

    2016-01-01

    Raspberries are becoming increasingly popular due to their reported health-beneficial properties. Despite the presence of only trace amounts of anthocyanins, yellow varieties seem to show similar or better effects in comparison to conventional raspberries. The aim of this work is to characterize the metabolic differences between red and yellow berries, focusing on the compounds showing a higher concentration in yellow varieties. The metabolomic profile of 13 red and 12 yellow raspberries (of different varieties, locations and collection dates) was determined by UPLC-TOF-MS. A novel approach based on Pearson correlation on the extracted ion chromatograms was implemented to extract the pseudospectra of the most relevant biomarkers from high energy LC-MS runs. The raw data will be made publicly available on MetaboLights (MTBLS333). Among the metabolites showing higher concentration in yellow raspberries it was possible to identify a series of compounds showing a pseudospectrum similar to that of A-type procyanidin polymers. The annotation of this group of compounds was confirmed by specific MS/MS experiments and performing standard injections. In berries lacking anthocyanins the polyphenol metabolism might be shifted to the formation of a novel class of A-type procyanidin polymers.
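    The correlation-based pseudospectrum extraction described above can be sketched as below. This is a hedged illustration, not the authors' implementation: function names and the r >= 0.9 cutoff are assumptions, and a real pipeline would operate on retention-time-aligned extracted ion chromatograms (EICs) from the UPLC-TOF-MS runs.

    ```python
    import math

    def pearson(x, y):
        """Pearson correlation coefficient between two intensity traces."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    def pseudospectrum(marker_eic, eics, r_min=0.9):
        """Collect the m/z traces whose elution profiles correlate with the
        marker's trace; those m/z values form the marker's pseudospectrum."""
        return [mz for mz, eic in eics.items() if pearson(marker_eic, eic) >= r_min]
    ```

    Traces that co-elute with the biomarker (correlation near 1) are grouped into its pseudospectrum, while uncorrelated traces are discarded.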

  2. Spitzer Observes Neutron Star Collision

    NASA Image and Video Library

    2017-10-16

    NASA's Spitzer Space Telescope has provisionally detected the faint afterglow of the explosive merger of two neutron stars in the galaxy NGC 4993. The event, labeled GW170817, was initially detected in gravitational waves and gamma rays. Subsequent observations by dozens of telescopes have monitored its afterglow across the entire spectrum of light. The event is located about 130 million light-years from Earth. Spitzer's observation on September 29, 2017, came late in the game, just over 6 weeks after the event was first seen. But if this weak detection is verified, it will play an important role in helping astronomers understand how many of the heaviest elements in the periodic table are created in explosive neutron star mergers. The left panel is a color composite of the 3.6 and 4.5 micron channels of the Spitzer IRAC instrument, rendered in cyan and red. The center panel is a median-filtered color composite showing a faint red dot at the known location of the event. The right panel shows the residual 4.5 micron data after subtracting out the light of the galaxy using an archival image that predates the event. An annotated version is at https://photojournal.jpl.nasa.gov/catalog/PIA21910

  3. Investigation of Improper Disclosure of For Official Use Only Information from the Comprehensive Review Working Group Draft Report

    DTIC Science & Technology

    2011-04-08

    and sealed in brown envelopes and the envelopes bound together with a large rubber band. The top envelope had an affixed label printed in 25-font red text with the words "EYES ONLY" and the name of the recipient in 18-font red text, all underlined. In addition to the draft Report and… [remainder of excerpt: a fragment of a distribution table listing recipients and copy counts]

  4. Negative Example Selection for Protein Function Prediction: The NoGO Database

    PubMed Central

    Youngs, Noah; Penfold-Brown, Duncan; Bonneau, Richard; Shasha, Dennis

    2014-01-01

    Negative examples – genes that are known not to carry out a given protein function – are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html). PMID:24922051
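    The empirical-conditional-probability idea mentioned above can be sketched minimally as follows. This is a deliberate simplification, not the NoGO algorithm itself: the scoring rule (a gene whose known annotations rarely co-occur with the target function is a stronger candidate negative example) and all names are illustrative.

    ```python
    from collections import Counter

    def conditional_probs(gene_annotations):
        """Estimate P(f2 | f1) from an annotation corpus: the fraction of
        genes annotated with f1 that are also annotated with f2."""
        single = Counter()
        joint = Counter()
        for terms in gene_annotations.values():
            ts = set(terms)
            single.update(ts)
            for a in ts:
                for b in ts:
                    if a != b:
                        joint[(a, b)] += 1
        return {(a, b): joint[(a, b)] / single[a] for (a, b) in joint}

    def negative_score(gene_terms, target, cond):
        """Maximum conditional probability of the target function given any
        of the gene's known annotations; lower values suggest a better
        candidate negative example for the target function."""
        return max((cond.get((t, target), 0.0) for t in gene_terms), default=0.0)
    ```

    Under the Open World Assumption this only ranks candidates; it cannot prove a gene lacks the function, which is why the paper's temporal-holdout evaluation is needed.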

  5. Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed.

    PubMed

    Eisinger, Daniel; Tsatsaronis, George; Bundschus, Markus; Wieneke, Ulrich; Schroeder, Michael

    2013-04-15

    Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes, respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources.

  6. Towards ontology-based decision support systems for complex ultrasound diagnosis in obstetrics and gynecology.

    PubMed

    Maurice, P; Dhombres, F; Blondiaux, E; Friszer, S; Guilbaud, L; Lelong, N; Khoshnood, B; Charlet, J; Perrot, N; Jauniaux, E; Jurkovic, D; Jouannic, J-M

    2017-05-01

    We have developed a new knowledge base intelligent system for obstetrics and gynecology ultrasound imaging, based on an ontology and a reference image collection. This study evaluates the new system to support accurate annotations of ultrasound images. We have used the early ultrasound diagnosis of ectopic pregnancies as a model clinical issue. The ectopic pregnancy ontology was derived from medical texts (4260 ultrasound reports of ectopic pregnancy from a specialist center in the UK and 2795 Pubmed abstracts indexed with the MeSH term "Pregnancy, Ectopic") and the reference image collection was built on a selection from 106 publications. We conducted a retrospective analysis of the signs in 35 scans of ectopic pregnancy by six observers using the new system. The resulting ectopic pregnancy ontology consisted of 1395 terms, and 80 images were collected for the reference collection. The observers used the knowledge base intelligent system to provide a total of 1486 sign annotations. The precision, recall and F-measure for the annotations were 0.83, 0.62 and 0.71, respectively. The global proportion of agreement was 40.35% (95% CI: 38.64-42.05). The ontology-based intelligent system provides accurate annotations of ultrasound images and suggests that it may benefit non-expert operators. The precision rate is appropriate for accurate input of a computer-based clinical decision support and could be used to support medical imaging diagnosis of complex conditions in obstetrics and gynecology. Copyright © 2017. Published by Elsevier Masson SAS.
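    The reported precision, recall and F-measure are related by the standard harmonic-mean formula, which the short sketch below reproduces (0.83 precision and 0.62 recall combine to roughly 0.71). The helper names are illustrative, not from the paper.

    ```python
    def precision_recall(tp, fp, fn):
        """Precision and recall from true-positive, false-positive and
        false-negative annotation counts."""
        return tp / (tp + fp), tp / (tp + fn)

    def f_measure(precision, recall):
        """F-measure: the harmonic mean of precision and recall."""
        return 2 * precision * recall / (precision + recall)
    ```

    For example, `round(f_measure(0.83, 0.62), 2)` yields 0.71, matching the figures reported for the 1486 sign annotations.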

  7. Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

    PubMed Central

    2013-01-01

    Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources. PMID:23734562

  8. Evaluating Computational Gene Ontology Annotations.

    PubMed

    Škunca, Nives; Roberts, Richard J; Steffen, Martin

    2017-01-01

    Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern. In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.

  9. BEACON: automated tool for Bacterial GEnome Annotation ComparisON.

    PubMed

    Kalkatawi, Manal; Alam, Intikhab; Bajic, Vladimir B

    2015-08-18

    Genome annotation is one way of summarizing the existing knowledge about genomic characteristics of an organism. There has been an increased interest during the last several decades in computer-based structural and functional genome annotation. Many methods for this purpose have been developed for eukaryotes and prokaryotes. Our study focuses on comparison of functional annotations of prokaryotic genomes. To the best of our knowledge there is no fully automated system for detailed comparison of functional genome annotations generated by different annotation methods (AMs). The presence of many AMs and development of new ones introduce needs to: a/ compare different annotations for a single genome, and b/ generate annotation by combining individual ones. To address these issues we developed an Automated Tool for Bacterial GEnome Annotation ComparisON (BEACON) that benefits both AM developers and annotation analysers. BEACON provides detailed comparison of gene function annotations of prokaryotic genomes obtained by different AMs and generates extended annotations through combination of individual ones. For the illustration of BEACON's utility, we provide a comparison analysis of multiple different annotations generated for four genomes and show on these examples that the extended annotation can increase the number of genes annotated by putative functions up to 27%, while the number of genes without any function assignment is reduced. We developed BEACON, a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/ .
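    The combination step that produces BEACON's extended annotation can be illustrated by a union-style merge across annotation methods (AMs). This is a deliberately simplified sketch under assumed data shapes, not BEACON's actual algorithm, which performs a detailed per-gene comparison.

    ```python
    def extend_annotation(per_method):
        """Combine gene-function annotations from several annotation methods:
        a gene's extended annotation is the union of the putative functions
        assigned by any method; genes no method annotates remain unknown."""
        extended = {}
        for method, annotations in per_method.items():
            for gene, functions in annotations.items():
                extended.setdefault(gene, set()).update(functions)
        return extended
    ```

    A union merge shows how combining AMs can only increase the number of genes with putative functions, consistent with the up-to-27% gain reported above, though a production tool would also reconcile conflicting assignments.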

  10. Identifying Issue Frames in Text

    PubMed Central

    Sagi, Eyal; Diermeier, Daniel; Kaufmann, Stefan

    2013-01-01

    Framing, the effect of context on cognitive processes, is a prominent topic of research in psychology and public opinion research. Research on framing has traditionally relied on controlled experiments and manually annotated document collections. In this paper we present a method that allows for quantifying the relative strengths of competing linguistic frames based on corpus analysis. This method requires little human intervention and can therefore be efficiently applied to large bodies of text. We demonstrate its effectiveness by tracking changes in the framing of terror over time and comparing the framing of abortion by Democrats and Republicans in the U.S. PMID:23874909

  11. Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research.

    PubMed

    Löpprich, Martin; Krauss, Felix; Ganzinger, Matthias; Senghas, Karsten; Riezler, Stefan; Knaup, Petra

    2016-08-05

    In the Multiple Myeloma clinical registry at Heidelberg University Hospital, most data are extracted from discharge letters. Our aim was to analyze if it is possible to make the manual documentation process more efficient by using methods of natural language processing for multiclass classification of free-text diagnostic reports to automatically document the diagnosis and state of disease of myeloma patients. The first objective was to create a corpus consisting of free-text diagnosis paragraphs of patients with multiple myeloma from German diagnostic reports, and its manual annotation of relevant data elements by documentation specialists. The second objective was to construct and evaluate a framework using different NLP methods to enable automatic multiclass classification of relevant data elements from free-text diagnostic reports. The main diagnosis paragraph was extracted from the clinical reports of one third of the patients, randomly selected from the multiple myeloma research database of Heidelberg University Hospital (in total 737 selected patients). An EDC system was set up and two data entry specialists independently performed manual documentation of at least nine specific data elements for multiple myeloma characterization. Both data entries were compared and assessed by a third specialist and an annotated text corpus was created. A framework was constructed, consisting of a self-developed package to split multiple diagnosis sequences into several subsequences, four different preprocessing steps to normalize the input data and two classifiers: a maximum entropy classifier (MEC) and a support vector machine (SVM). In total 15 different pipelines were examined and assessed by a ten-fold cross-validation, reiterated 100 times. As quality indicators, the average error rate and the average F1-score were computed. For significance testing the approximate randomization test was used. 
The created annotated corpus consists of 737 different diagnosis paragraphs with a total number of 865 coded diagnoses. The dataset is publicly available in the supplementary online files for training and testing of further NLP methods. Both classifiers showed low average error rates (MEC: 1.05; SVM: 0.84) and high F1-scores (MEC: 0.89; SVM: 0.92). However the results varied widely depending on the classified data element. Preprocessing methods increased this effect and had significant impact on the classification, both positive and negative. The automatic diagnosis splitter increased the average error rate significantly, even if the F1-score decreased only slightly. The low average error rates and high average F1-scores of each pipeline demonstrate the suitability of the investigated NLP methods. However, it was also shown that there is no best practice for an automatic classification of data elements from free-text diagnostic reports.
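    The evaluation protocol described above (ten-fold cross-validation, reiterated 100 times) can be sketched as a split generator. The function name and parameters are illustrative, not the authors' code; a real pipeline would feed each (train, test) split to the MEC or SVM classifier and average the error rate and F1-score over all splits.

    ```python
    import random

    def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
        """Yield (train, test) index lists for k-fold cross-validation,
        reshuffled and repeated `repeats` times."""
        rng = random.Random(seed)
        idx = list(range(n_samples))
        for _ in range(repeats):
            rng.shuffle(idx)
            folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
            for i in range(k):
                test = folds[i]
                train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
                yield train, test
    ```

    Each repeat produces k disjoint test folds, so 10 folds over 100 repeats gives 1000 train/test splits per pipeline.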

  12. Investigating the Effect of Using Multiple Sensory Modes of Glossing Vocabulary Items in a Reading Text with Multimedia Annotations

    ERIC Educational Resources Information Center

    Rezaee, Abbas Ali; Shoar, Neda Sharbaf

    2011-01-01

    In recent years, improvements in technology have enhanced the possibilities of teaching and learning various subjects. This is specially the case in foreign language instruction. The use of technology and multimedia brings new opportunities for learning different areas of language. In this regard, the present study attempts to find out if the use…

  13. The Dangers of Test Preparation: What Students Learn (And Don't Learn) about Reading Comprehension from Test-Centric Literacy Instruction

    ERIC Educational Resources Information Center

    Davis, Dennis S.; Vehabovic, Nermin

    2018-01-01

    The authors offer guidance on recognizing and resisting test-centric instruction in reading comprehension. They posit that five practices indicate a test-centric view of comprehension: when the tested content is privileged, when the test becomes the text, when annotation requirements replace strategic thinking, when test items frame how students…

  14. GarlicESTdb: an online database and mining tool for garlic EST sequences.

    PubMed

    Kim, Dae-Won; Jung, Tae-Sung; Nam, Seong-Hyeuk; Kwon, Hyuk-Ryul; Kim, Aeri; Chae, Sung-Hwa; Choi, Sang-Haeng; Kim, Dong-Wook; Kim, Ryong Nam; Park, Hong-Seog

    2009-05-18

    Allium sativum, commonly known as garlic, is a species in the onion genus (Allium), a large and diverse genus containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary and medicinal purposes and for its health benefits. Interest in garlic is currently increasing due to its nutritional and pharmaceutical value, including effects on high blood pressure, cholesterol, atherosclerosis and cancer. Despite this, no comprehensive database of garlic Expressed Sequence Tags (ESTs) has been available for gene discovery and future genome annotation efforts. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in Java and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information in MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/JavaServlet) for browsing and querying the database and for creating dynamic web pages on the client side; for mapping annotated enzymes to KEGG pathways, the AJAX framework was also partially used. The online resources, such as putative annotation, single nucleotide polymorphism (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. 
The GarlicESTdb web application is freely available at http://garlicdb.kribb.re.kr. GarlicESTdb is the first incorporated online information database of EST sequences isolated from garlic that can be freely accessed and downloaded. It has many useful features for interactive mining of EST contigs and datasets from each library, including curation of annotated information, expression profiling, information retrieval, and summary statistics of functional annotation. Consequently, GarlicESTdb will make a crucial contribution to data mining and more efficient experimental studies by biologists.

  15. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    PubMed Central

    2010-01-01

    Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105

  16. GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains

    PubMed Central

    Lu, Zhiyong

    2015-01-01

    The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available, and these tools are typically restricted to detecting only mention-level gene names or only document-level gene identifiers. In this work, we report GNormPlus: an end-to-end, open-source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations not only for gene names and their identifiers, but also for closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving an 86.7% F1-score on the BioCreative II Gene Normalization task dataset and a 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entirety of PubMed are freely accessible through our web-based tool PubTator. PMID:26380306
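The F1-scores cited here are the harmonic mean of precision and recall against gold-standard annotations. As a reminder of the metric, a minimal sketch (the counts below are made up for illustration, not GNormPlus's actual confusion matrix):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for a tagging task."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 867 correct gene mentions, 133 spurious, 133 missed.
# With precision == recall == 0.867, the F1-score is also 0.867.
print(round(f1_score(867, 133, 133), 3))
```

When precision and recall diverge, F1 is pulled toward the lower of the two, which is why it is the usual single-number summary for gene normalization benchmarks.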

  17. CulTO: An Ontology-Based Annotation Tool for Data Curation in Cultural Heritage

    NASA Astrophysics Data System (ADS)

    Garozzo, R.; Murabito, F.; Santagati, C.; Pino, C.; Spampinato, C.

    2017-08-01

    This paper proposes CulTO, a software tool relying on a computational ontology for Cultural Heritage domain modelling, with a specific focus on religious historical buildings, to support cultural heritage experts in their investigations. It is specifically designed to support the annotation, automatic indexing, classification and curation of photographic data and text documents relating to historical buildings. CulTO also serves as a useful tool for Historical Building Information Modeling (H-BIM) by enabling semantic 3D data modelling and further enrichment with non-geometrical information of historical buildings, through the inclusion of new concepts about historical documents, images, decay or deformation evidence, as well as decorative elements, into BIM platforms. CulTO is the result of a joint research effort between the Laboratory of Surveying and Architectural Photogrammetry "Luigi Andreozzi" and the PeRCeiVe Lab (Pattern Recognition and Computer Vision Lab) of the University of Catania.

  18. [Technological innovation and humanitarianism in the transport of war wounded: Nicasio Landa's report on a new elastic suspension system for stretchers (Pamplona, May 29, 1875)].

    PubMed

    Arrizabalaga, Jon; García-Reyes, Juan Carlos

    2016-01-01

    In May 1875, in the midst of a bloody civil conflict in Spain known as the Third Carlist War, Nicasio Landa, a medical officer with Military Health, wrote a report requesting authorization for the Spanish Red Cross, of which he was Inspector General, to adopt a new elastic suspension system for stretchers that he had designed, developed and tested. Intended above all for use in farm wagons - still the most widely-used method of transporting the wounded at the time - it was an inexpensive, sturdy mechanism that improved patient comfort and could also be installed in ambulance carriages, railway carriages and hospital ships. An annotated version of the report is included, preceded by a presentation of its contents.

  19. Data mining in newt-omics, the repository for omics data from the newt.

    PubMed

    Looso, Mario; Braun, Thomas

    2015-01-01

    Salamanders are an excellent model organism for studying regenerative processes due to their unique ability to regenerate lost appendages or organs. Straightforward bioinformatics tools to analyze and take advantage of the growing number of "omics" studies performed in salamanders have so far been lacking. To overcome this limitation, we have generated a comprehensive data repository for the red-spotted newt Notophthalmus viridescens, named newt-omics, merging omics-style datasets at the transcriptome and proteome levels, including expression values and annotations. The resource is freely available via a user-friendly Web-based graphical user interface (http://newt-omics.mpi-bn.mpg.de) that allows access and queries to the database without prior bioinformatics expertise. The repository is updated regularly, incorporating newly published datasets from omics technologies.

  20. pocketZebra: a web-server for automated selection and classification of subfamily-specific binding sites by bioinformatic analysis of diverse protein families

    PubMed Central

    Suplatov, Dmitry; Kirilin, Eugeny; Arbatsky, Mikhail; Takhaveev, Vakil; Švedas, Vytas

    2014-01-01

    The new web-server pocketZebra implements the power of bioinformatics and geometry-based structural approaches to identify and rank subfamily-specific binding sites in proteins by functional significance, and to select particular positions in the structure that determine the selective accommodation of ligands. A new scoring function has been developed to annotate binding sites by the presence of subfamily-specific positions in diverse protein families. The pocketZebra web-server has multiple input modes to meet the needs of users with different levels of experience in bioinformatics. The server provides on-site visualization of the results as well as an off-line version of the output in annotated text format and as PyMol sessions ready for structural analysis. pocketZebra can be used to study structure–function relationships and regulation in large protein superfamilies, classify functionally important binding sites and annotate proteins of unknown function. The server can be used to engineer ligand-binding sites and allosteric regulation of enzymes, or be implemented in a drug-discovery process to search for potential molecular targets and novel selective inhibitors/effectors. The server, documentation and examples are freely available at http://biokinet.belozersky.msu.ru/pocketzebra and there are no login requirements. PMID:24852248

  1. A Deep and Autoregressive Approach for Topic Modeling of Multimodal Data.

    PubMed

    Zheng, Yin; Zhang, Yu-Jin; Larochelle, Hugo

    2016-06-01

    Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to deal with multimodal data, such as in image annotation tasks. Another popular approach to model the multimodal data is through deep neural networks, such as the deep Boltzmann machine (DBM). Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for text document modeling. In this work, we show how to successfully apply and extend this model to multimodal data, such as simultaneous image classification and annotation. First, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the learned hidden topic features and show how to employ it to learn a joint representation from image visual words, annotation words and class label information. We test our model on the LabelMe and UIUC-Sports data sets and show that it compares favorably to other topic models. Second, we propose a deep extension of our model and provide an efficient way of training the deep model. Experimental results show that our deep model outperforms its shallow version and reaches state-of-the-art performance on the Multimedia Information Retrieval (MIR) Flickr data set.

  2. Widespread occurrence of organelle genome-encoded 5S rRNAs including permuted molecules

    PubMed Central

    Valach, Matus; Burger, Gertraud; Gray, Michael W.; Lang, B. Franz

    2014-01-01

    5S Ribosomal RNA (5S rRNA) is a universal component of ribosomes, and the corresponding gene is easily identified in archaeal, bacterial and nuclear genome sequences. However, organelle gene homologs (rrn5) appear to be absent from most mitochondrial and several chloroplast genomes. Here, we re-examine the distribution of organelle rrn5 by building mitochondrion- and plastid-specific covariance models (CMs) with which we screened organelle genome sequences. We not only recover all organelle rrn5 genes annotated in GenBank records, but also identify more than 50 previously unrecognized homologs in mitochondrial genomes of various stramenopiles, red algae, cryptomonads, malawimonads and apusozoans, and surprisingly, in the apicoplast (highly derived plastid) genomes of the coccidian pathogens Toxoplasma gondii and Eimeria tenella. Comparative modeling of RNA secondary structure reveals that mitochondrial 5S rRNAs from brown algae adopt a permuted triskelion shape that has not been seen elsewhere. Expression of the newly predicted rrn5 genes is confirmed experimentally in 10 instances, based on our own and published RNA-Seq data. This study establishes that particularly mitochondrial 5S rRNA has a much broader taxonomic distribution and a much larger structural variability than previously thought. The newly developed CMs will be made available via the Rfam database and the MFannot organelle genome annotator. PMID:25429974

  3. Widespread occurrence of organelle genome-encoded 5S rRNAs including permuted molecules.

    PubMed

    Valach, Matus; Burger, Gertraud; Gray, Michael W; Lang, B Franz

    2014-12-16

    5S Ribosomal RNA (5S rRNA) is a universal component of ribosomes, and the corresponding gene is easily identified in archaeal, bacterial and nuclear genome sequences. However, organelle gene homologs (rrn5) appear to be absent from most mitochondrial and several chloroplast genomes. Here, we re-examine the distribution of organelle rrn5 by building mitochondrion- and plastid-specific covariance models (CMs) with which we screened organelle genome sequences. We not only recover all organelle rrn5 genes annotated in GenBank records, but also identify more than 50 previously unrecognized homologs in mitochondrial genomes of various stramenopiles, red algae, cryptomonads, malawimonads and apusozoans, and surprisingly, in the apicoplast (highly derived plastid) genomes of the coccidian pathogens Toxoplasma gondii and Eimeria tenella. Comparative modeling of RNA secondary structure reveals that mitochondrial 5S rRNAs from brown algae adopt a permuted triskelion shape that has not been seen elsewhere. Expression of the newly predicted rrn5 genes is confirmed experimentally in 10 instances, based on our own and published RNA-Seq data. This study establishes that particularly mitochondrial 5S rRNA has a much broader taxonomic distribution and a much larger structural variability than previously thought. The newly developed CMs will be made available via the Rfam database and the MFannot organelle genome annotator. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  4. Quality of Computationally Inferred Gene Ontology Annotations

    PubMed Central

    Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-01-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439
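The release-comparison idea can be caricatured in a few lines: an electronic annotation made in an earlier release counts as confirmed if a later release asserts the same gene/term pair with experimental evidence. This is only a sketch of the general principle, with hypothetical gene names; the paper's actual measures of specificity, reliability and coverage are more involved.

```python
def reliability(electronic_old, experimental_new):
    """Fraction of old electronic (gene, GO term) annotations that a later
    release confirms with an experimental evidence code."""
    confirmed = electronic_old & experimental_new
    return len(confirmed) / len(electronic_old)

# Hypothetical annotation sets (gene symbol, GO term ID):
old = {("YFG1", "GO:0005634"), ("YFG2", "GO:0003677"), ("YFG3", "GO:0016020")}
new_exp = {("YFG1", "GO:0005634"), ("YFG3", "GO:0016020")}
print(reliability(old, new_exp))  # 2 of 3 electronic annotations confirmed
```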

  5. Qcorp: an annotated classification corpus of Chinese health questions.

    PubMed

    Guo, Haihong; Na, Xu; Li, Jiao

    2018-03-22

    Health question-answering (QA) systems have become a typical application scenario of Artificial Intelligence (AI). An annotated question corpus is a prerequisite for training machines to understand the health information needs of users. We therefore aimed to develop an annotated classification corpus of Chinese health questions (Qcorp) and make it openly accessible. We developed a two-layered classification schema and corresponding annotation rules on the basis of our previous work. Using the schema, we annotated 5000 questions that were randomly selected from 6 broad sections of 5 Chinese health websites. Eight annotators participated in the annotation task, and inter-annotator agreement was evaluated to ensure corpus quality. Furthermore, the distribution of and relationships between the annotated tags were measured by descriptive statistics and a social network map. The questions were annotated with 7101 tags covering 29 topic categories in the two-layered schema. In the released corpus, the distribution of questions across the top-layer categories was: treatment, 64.22%; diagnosis, 37.14%; epidemiology, 14.96%; healthy lifestyle, 10.38%; and health provider choice, 4.54%. Both the annotated health questions and the annotation schema are openly accessible on the Qcorp website, and users can download the annotated Chinese questions in CSV, XML, and HTML formats. In summary, we developed a Chinese health question corpus of 5000 manually annotated questions; it is openly accessible and should contribute to the development of intelligent health QA systems.
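The abstract does not say which agreement statistic was used; a common choice for categorical annotation tasks is Cohen's kappa, sketched here for two hypothetical annotators labeling questions with top-layer categories:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels, not actual Qcorp annotations:
a = ["treatment", "diagnosis", "treatment", "epidemiology", "treatment"]
b = ["treatment", "diagnosis", "diagnosis", "epidemiology", "treatment"]
print(round(cohens_kappa(a, b), 3))
```

Kappa discounts raw agreement by how often annotators would agree by chance alone, which matters when one category (here, treatment) dominates.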

  6. NoGOA: predicting noisy GO annotations using evidences and sparse representation.

    PubMed

    Yu, Guoxian; Lu, Chang; Wang, Jun

    2017-07-21

    Gene Ontology (GO) is a community effort to represent the functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resource limitations, only a small portion of annotations are manually checked by curators; the others are electronically inferred. Although quality-control techniques have been applied to ensure the quality of annotations, the community consistently reports that there are still considerable noisy (or incorrect) annotations. Given the wide application of annotations, how to identify noisy annotations is an important yet seldom-studied open problem. We introduce a novel approach called NoGOA to predict noisy annotations. First, NoGOA applies sparse representation to the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of the sparse representation coefficients to measure the semantic similarity between genes. Second, it preliminarily predicts the noisy annotations of a gene based on aggregated votes from that gene's semantic neighborhood. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived at different periods, then weights entries of the association matrix via the estimated ratios and propagates weights to ancestors of direct annotations using the GO hierarchy. Finally, it integrates the evidence-weighted association matrix and the aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and that removing noisy annotations improves the performance of gene function prediction. The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA.
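A drastically simplified sketch of the neighborhood-voting step: here plain cosine similarity on a toy gene-term matrix stands in for the sparse-representation coefficients NoGOA actually derives, and an annotation with no support among a gene's most similar neighbors is flagged as potentially noisy.

```python
import numpy as np

# Toy gene-term association matrix (rows: genes, cols: GO terms); a 1 marks
# an annotation. All values are invented for illustration.
A = np.array([
    [1, 1, 0, 1],   # gene 0 (suppose its annotation to term 3 is suspect)
    [1, 1, 0, 0],   # gene 1
    [1, 1, 0, 0],   # gene 2
    [0, 0, 1, 1],   # gene 3
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def neighbor_support(A, gene, term, k=2):
    """Fraction of the k most similar genes that share this annotation."""
    sims = [(cosine(A[g], A[gene]), g) for g in range(len(A)) if g != gene]
    top = [g for _, g in sorted(sims, reverse=True)[:k]]
    return sum(A[g, term] for g in top) / k

# Gene 0's annotation to term 3 has no support among its nearest neighbors
# (a noisy-annotation candidate); its annotation to term 0 is fully supported.
print(neighbor_support(A, gene=0, term=3), neighbor_support(A, gene=0, term=0))
```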

  7. Annotation Graphs: A Graph-Based Visualization for Meta-Analysis of Data Based on User-Authored Annotations.

    PubMed

    Zhao, Jian; Glueck, Michael; Breslav, Simon; Chevalier, Fanny; Khan, Azam

    2017-01-01

    User-authored annotations of data can support analysts in the activity of hypothesis generation and sensemaking, where it is not only critical to document key observations, but also to communicate insights between analysts. We present annotation graphs, a dynamic graph visualization that enables meta-analysis of data based on user-authored annotations. The annotation graph topology encodes annotation semantics, which describe the content of and relations between data selections, comments, and tags. We present a mixed-initiative approach to graph layout that integrates an analyst's manual manipulations with an automatic method based on similarity inferred from the annotation semantics. Various visual graph layout styles reveal different perspectives on the annotation semantics. Annotation graphs are implemented within C8, a system that supports authoring annotations during exploratory analysis of a dataset. We apply principles of Exploratory Sequential Data Analysis (ESDA) in designing C8, and further link these to an existing task typology in the visualization literature. We develop and evaluate the system through an iterative user-centered design process with three experts, situated in the domain of analyzing HCI experiment data. The results suggest that annotation graphs are effective as a method of visually extending user-authored annotations to data meta-analysis for discovery and organization of ideas.

  8. Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition.

    PubMed

    Funk, Christopher S; Cohen, K Bretonnel; Hunter, Lawrence E; Verspoor, Karin M

    2016-09-09

    Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from simpler terms. We present two different types of manually generated rules to help capture the variation in how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms, and an increase in F-measure performance for recognition of GO concepts on the CRAFT corpus from 0.498 to 0.636 is observed. Additionally, we evaluated the combination of both types of rules over one million full-text documents from Elsevier; manual validation and error analysis show we are able to recognize GO concepts with reasonable accuracy (88%) based on random sampling of annotations. In this work we present a set of simple synonym generation rules that utilize the highly compositional and formulaic nature of Gene Ontology concepts. We illustrate how the generated synonyms aid in improving recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO ontology quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.
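The decompose-vary-recombine idea can be illustrated on a single compositional term. The variant tables below are hand-written stand-ins; the published rule sets are far larger and were curated manually against real GO terms.

```python
import itertools

# Hypothetical derivational-variant tables for two small constituents.
VARIANTS = {
    "regulation": ["regulation", "regulating", "regulates", "regulated"],
    "cell proliferation": ["cell proliferation", "proliferation of cells"],
}

def synonyms(term):
    """Decompose a 'X of Y' GO term, vary each part, and recombine."""
    head, _, rest = term.partition(" of ")
    out = set()
    for h, r in itertools.product(VARIANTS.get(head, [head]),
                                  VARIANTS.get(rest, [rest])):
        out.add(f"{h} of {r}")
        out.add(f"{r} {h}")   # reordered form, e.g. "cell proliferation regulation"
    return sorted(out)

for s in synonyms("regulation of cell proliferation"):
    print(s)
```

Overgeneration is visible even at this scale: not every recombination is natural English, which is why the paper discusses validating the generated synonyms.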

  9. Annotation of UAV surveillance video

    NASA Astrophysics Data System (ADS)

    Howlett, Todd; Robertson, Mark A.; Manthey, Dan; Krol, John

    2004-08-01

    Significant progress toward the development of a video annotation capability is presented in this paper. Research and development of an object tracking algorithm applicable to UAV video is described. Object tracking is necessary for attaching the annotations to the objects of interest. A methodology and format is defined for encoding video annotations using the SMPTE Key-Length-Value encoding standard. This provides the following benefits: a non-destructive annotation, compliance with existing standards, video playback in systems that are not annotation enabled, and support for a real-time implementation. A model real-time video annotation system is also presented, at a high level, using the MPEG-2 Transport Stream as the transmission medium. This work was accomplished to meet the Department of Defense's (DoD's) need for a video annotation capability. Current practice for creating annotated products is to capture a still image frame, annotate it using an Electric Light Table application, and then pass the annotated image on as a product. That is not adequate for reporting or downstream cueing: it is too slow, and there is a severe loss of information. This paper describes a capability for annotating directly on the video.
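SMPTE KLV coding (SMPTE 336M) packs each item as a 16-byte key, a BER-encoded length, and the value bytes, which is what makes the annotation non-destructive: players that don't understand the key can skip exactly `length` bytes. A minimal sketch of the packing, using a placeholder key rather than a registered SMPTE label:

```python
def ber_length(n):
    """BER-encode a length (short form < 128, else long form)."""
    if n < 128:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body

def klv(key16, value):
    """Pack one Key-Length-Value triplet: 16-byte key, BER length, payload."""
    assert len(key16) == 16
    return key16 + ber_length(len(value)) + value

# Placeholder 16-byte key for illustration only (not a registered label).
key = bytes(range(16))
packet = klv(key, b"track-42: vehicle")
print(len(packet))  # 16-byte key + 1 length byte + 17 payload bytes = 34
```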

  10. Mars Digital Image Mosaic Globe

    NASA Technical Reports Server (NTRS)

    2000-01-01

    The photomosaic that forms the base for this globe was created by merging two global digital image models (DIM's) of Mars-a medium-resolution monochrome mosaic processed to emphasize topographic features and a lower resolution color mosaic emphasizing color and albedo variations.

    The medium-resolution (1/256° or roughly 231 m/pixel) monochromatic image model was constructed from about 6,000 images having resolutions of 150-350 m/pixel and oblique illumination (Sun 20°-45° above the horizon). Radiometric processing was intended to suppress or remove the effects of albedo variations through the use of a high-pass divide filter, followed by photometric normalization so that the contrast of a given topographic slope would be approximately the same in all images.

    The global color mosaic was assembled at 1/64° (roughly 864 m/pixel) from about 1,000 red- and violet-filter images having 500-1,000 m/pixel resolution. These images were first mosaicked in groups, each taken on a single orbit of the Viking spacecraft. The orbit mosaics were then processed to remove spatially and temporally varying atmospheric haze in the overlap regions. After haze removal, the per-orbit mosaics were photometrically normalized to equalize the contrast of albedo features and mosaicked together with cosmetic seam removal. The medium-resolution DIM was used for geometric control of this color mosaic. A green-filter image was synthesized by weighted averaging of the red- and violet-filter mosaics. Finally, the product seen here was obtained by multiplying each color image by the medium-resolution monochrome image. The color balance selected for images in this map series was designed to be close to natural color for brighter, redder regions, such as Arabia Terra and the Tharsis region, but the data have been stretched so that the relatively dark regions appear darker and less red than they actually are.

    The images are presented in a projection that portrays the entire surface of Mars in a manner suitable for the production of a globe; the number, size, and placement of text annotations were chosen for a 12-inch globe. Prominent features are labeled with names approved by the International Astronomical Union. A specialized program was used to create the 'flower petal' appearance of the images; the area of each petal from 0 to 75 degrees latitude is in the Transverse Mercator projection, and the area from 75 to 90 degrees latitude is in the Lambert Azimuthal Equal-Area projection. The northern hemisphere of Mars is shown on the left, and the southern hemisphere on the right.

  11. Persuasiveness of online flu-vaccination promotional banners.

    PubMed

    Chien, Yu-Hung

    2013-04-01

    Young people appear to have relatively little motivation to participate in flu-vaccination programs. This study assessed the effectiveness of online banners in efforts to persuade young people to get vaccinated. Specifically, a 2 × 3 between-subjects factorial design was used to examine the effects of message framing (gain vs. loss) and color configuration (white text on a red background, black text on a yellow background, and white text on a blue background) on 180 college students' perceptions of the persuasiveness of flu-vaccination promotional banners. Each participant completed a four-item questionnaire, and the results of an analysis of variance showed that persuasiveness scores were higher among participants exposed to a loss-framed message than to a gain-framed message, but only when the loss-framed message was presented in white text on a red background. The theoretical and practical implications of manipulating these two factors in the development of effective health-promotion materials are discussed.

  12. Matrix Management in DoD: An Annotated Bibliography

    DTIC Science & Technology

    1984-04-01

    ...completes their message that matrix organization is the likely format of the multiprogram Program Office. The text's discussion of matrix is... manager, and functional specialist are of vital importance to the effective operation of the matrix.... Matrix management will not achieve its

  13. 106-17 Telemetry Standards Appendix A.3 ADARIO Data Block Field Definitions

    DTIC Science & Technology

    2017-07-01

    A.3-14 4.7. Annotation Text ...significant bits (MSBs) of the word. The next sample is formatted into the next available MSBs, and so on until the word is full. As an example, data... to 5) WC - Word count, an 11-bit binary value. WC is the number of full channel data words that should be in the nth channel packet. WC may range

  14. Resistance evaluation of Chinese wild Vitis genotypes against Botrytis cinerea and different responses of resistant and susceptible hosts to the infection.

    PubMed

    Wan, Ran; Hou, Xiaoqing; Wang, Xianhang; Qu, Jingwu; Singer, Stacy D; Wang, Yuejin; Wang, Xiping

    2015-01-01

    The necrotrophic fungus Botrytis cinerea is a major threat to grapevine cultivation worldwide. A screen of 41 Vitis genotypes for leaf resistance to B. cinerea suggested species-independent variation and revealed 18 resistant Chinese wild Vitis genotypes, while most of the investigated V. vinifera genotypes, or their hybrids, were susceptible. A particularly resistant Chinese wild Vitis, "Pingli-5" (V. sp. [Qinling grape]), and a very susceptible V. vinifera cultivar, "Red Globe", were selected for further study. Microscopic analysis demonstrated that B. cinerea growth was limited during early infection on "Pingli-5" before 24 h post-inoculation (hpi), but not on "Red Globe". Reactive oxygen species (ROS) and the antioxidative system were found to be associated with fungal growth. Superoxide (O2·-) accumulated similarly in B. cinerea at 4 hpi on both Vitis genotypes. Lower levels of O2·- (but not H2O2) were detected at 4 hpi, and ROS (H2O2 and O2·-) accumulation from 8 hpi onwards was also lower in "Pingli-5" leaves than in "Red Globe" leaves. B. cinerea triggered sustained ROS production in "Red Globe" but not in "Pingli-5" as infection progressed. "Red Globe" displayed little change in antioxidative activities in response to B. cinerea infection; in contrast, antioxidative activities were rapidly and strongly elevated in resistant "Pingli-5", which correlated with its minimal ROS increases and its high resistance. These findings not only enhance our understanding of the resistance of Chinese wild Vitis species to B. cinerea, but also lay the foundation for breeding B. cinerea-resistant grapes in the future.

  15. Changes in the Ginsenoside Content During Fermentation Using an Appliance for the Preparation of Red Ginseng.

    PubMed

    Lee, So Jin; Ha, Na; Kim, Yunjeong; Kim, Min-Gul

    2016-01-01

    The total amount of ginsenoside in fermented red ginseng (FRG) is increased by microbial fermentation. The aim of this study was to evaluate whether fermentation time and temperature affect the ginsenoside content during fermentation using an appliance for the preparation of red ginseng. The FRG and fermented red ginseng extracts (FRG-e) were prepared using an appliance for the preparation of red ginseng. The temperature was recorded and time points for sampling were scheduled at pre-fermentation (0 h) and at 18, 36, 48, 60 and 72 h after the addition of the microbial strains. Samples of FRG and FRG-e were collected to identify changes in the ginsenoside contents at each time point during the fermentation process. The ginsenoside content was analyzed using high performance liquid chromatography (HPLC). The levels of ginsenoside Rh1, Rg3, and compound Y, which are known to have effective pharmacological properties, increased more than three-fold in the final products of FRG relative to samples prior to fermentation. Although the ginsenoside constituents of FRG-e decreased, or increased and then decreased, during fermentation, the total amount of ginsenoside in FRG-e was even higher than that in FRG; the total amounts of ginsenoside in FRG-e and FRG were 8282.8 and 738.0 mg, respectively. This study examined the changes in composition of ginsenosides and suggests a method to manufacture high-content total ginsenosides according to the fermentation temperature and process time. Reducing the extraction time is expected to counteract the decrease of ginsenosides in FRG-e as a function of the fermentation time.

  16. Gene Ontology annotations at SGD: new data sources and annotation methods

    PubMed Central

    Hong, Eurie L.; Balakrishnan, Rama; Dong, Qing; Christie, Karen R.; Park, Julie; Binkley, Gail; Costanzo, Maria C.; Dwight, Selina S.; Engel, Stacia R.; Fisk, Dianna G.; Hirschman, Jodi E.; Hitz, Benjamin C.; Krieger, Cynthia J.; Livstone, Michael S.; Miyasato, Stuart R.; Nash, Robert S.; Oughtred, Rose; Skrzypek, Marek S.; Weng, Shuai; Wong, Edith D.; Zhu, Kathy K.; Dolinski, Kara; Botstein, David; Cherry, J. Michael

    2008-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current. PMID:17982175

  17. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGES

    Brettin, Thomas; Davis, James J.; Disz, Terry; ...

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  18. An Annotated and Federated Digital Library of Marine Animal Sounds

    DTIC Science & Technology

    2005-01-01

    of the annotations and the relevant segment delimitation points and linkages to other relevant metadata fields; e) search engines that support the...annotators to add information to the same recording, and search engines that permit either all-annotator or specific-annotator searches. To our knowledge

  19. Summary of the BioLINK SIG 2013 meeting at ISMB/ECCB 2013.

    PubMed

    Verspoor, Karin; Shatkay, Hagit; Hirschman, Lynette; Blaschke, Christian; Valencia, Alfonso

    2015-01-15

    The ISMB Special Interest Group on Linking Literature, Information and Knowledge for Biology (BioLINK) organized a one-day workshop at ISMB/ECCB 2013 in Berlin, Germany. The theme of the workshop was 'Roles for text mining in biomedical knowledge discovery and translational medicine'. This summary reviews the outcomes of the workshop. Meeting themes included concept annotation methods and applications, extraction of biological relationships and the use of text-mined data for biological data analysis. All articles are available at http://biolinksig.org/proceedings-online/. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  20. Gene Expression Profiling of Development and Anthocyanin Accumulation in Kiwifruit (Actinidia chinensis) Based on Transcriptome Sequencing

    PubMed Central

    Zeng, Shaohua; Xiao, Gong; Wang, Gan; Wang, Ying; Peng, Ming; Huang, Hongwen

    2015-01-01

    Red-fleshed kiwifruit (Actinidia chinensis Planch. ‘Hongyang’) is a promising commercial cultivar due to its nutritious value and unique flesh color, derived from vitamin C and anthocyanins. In this study, we obtained transcriptome data of ‘Hongyang’ from seven developmental stages using Illumina sequencing. We mapped 39–54 million reads to the recently sequenced kiwifruit genome and other databases to define gene structure, to analyze alternative splicing, and to quantify gene transcript abundance at different developmental stages. The transcript profiles throughout red kiwifruit development were constructed and analyzed, with a focus on the biosynthesis and metabolism of compounds such as phytohormones, sugars, starch and L-ascorbic acid, which are indispensable for the development and formation of quality fruit. Candidate genes for these pathways were identified through MapMan and phylogenetic analysis. The transcript levels of genes involved in sucrose and starch metabolism were consistent with the change in soluble sugar and starch content throughout kiwifruit development. The metabolism of L-ascorbic acid was very active, primarily through the L-galactose pathway. The genes responsible for the accumulation of anthocyanin in red kiwifruit were identified, and their expression levels were investigated during kiwifruit development. This survey of gene expression during kiwifruit development paves the way for further investigation of the development of this uniquely colored and nutritious fruit and reveals which factors are needed for high quality fruit formation. This transcriptome data and its analysis will be useful for improving kiwifruit genome annotation, for basic fruit molecular biology research, and for kiwifruit breeding and improvement. PMID:26301713

  1. In Silico Comparative Transcriptome Analysis of Two Color Morphs of the Common Coral Trout (Plectropomus Leopardus)

    PubMed Central

    Wang, Le; Yu, Cuiping; Guo, Liang; Lin, Haoran; Meng, Zining

    2015-01-01

    The common coral trout is one species of major importance in commercial fisheries and aquaculture. Recently, two different color morphs of Plectropomus leopardus were discovered, and the biological importance of the color difference is unknown. Since coral trout species are poorly characterized at the molecular level, we undertook the transcriptomic characterization of the two color morphs, one black and one red coral trout, using Illumina next generation sequencing technologies. The study produced 55162966 and 54588952 paired-end reads, for black and red trout, respectively. De novo transcriptome assembly generated 95367 and 99424 unique sequences in black and red trout, respectively, with 88813 sequences shared between them. Approximately 50% of both transcriptomes were functionally annotated by BLAST searches against protein databases. The two transcriptomes were enriched into 25 functional categories and showed similar profiles of Gene Ontology category compositions. 34110 unigenes were grouped into 259 KEGG pathways. Moreover, we identified 14649 simple sequence repeats (SSRs) and designed primers for potential application. We also discovered 130524 putative single nucleotide polymorphisms (SNPs) in the two transcriptomes, supplying potential genomic resources for the coral trout species. In addition, we identified 936 fast-evolving genes and 165 candidate genes under positive selection between the two color morphs. Finally, 38 candidate genes underlying the mechanism of color and pigmentation were also isolated. This study presents the first transcriptome resources for the common coral trout and provides basic information for the development of genomic tools for the identification, conservation, and understanding of the speciation and local adaptation of coral reef fish species. PMID:26713756

  2. Query-oriented evidence extraction to support evidence-based medicine practice.

    PubMed

    Sarker, Abeed; Mollá, Diego; Paris, Cecile

    2016-02-01

    Evidence-based medicine practice requires medical practitioners to rely on the best available evidence, in addition to their expertise, when making clinical decisions. The medical domain boasts a large amount of published medical research data, indexed in various medical databases such as MEDLINE. As the size of this data grows, practitioners increasingly face the problem of information overload, and past research has established the time-associated obstacles faced by evidence-based medicine practitioners. In this paper, we focus on the problem of automatic text summarisation to help practitioners quickly find query-focused information from relevant documents. We utilise an annotated corpus that is specialised for the task of evidence-based summarisation of text. In contrast to past summarisation approaches, which mostly rely on surface level features to identify salient pieces of texts that form the summaries, our approach focuses on the use of corpus-based statistics, and domain-specific lexical knowledge for the identification of summary contents. We also apply a target-sentence-specific summarisation technique that reduces the problem of underfitting that persists in generic summarisation models. In automatic evaluations run over a large number of annotated summaries, our extractive summarisation technique statistically outperforms various baseline and benchmark summarisation models with a percentile rank of 96.8%. A manual evaluation shows that our extractive summarisation approach is capable of selecting content with high recall and precision, and may thus be used to generate bottom-line answers to practitioners' queries. Our research shows that the incorporation of specialised data and domain-specific knowledge can significantly improve text summarisation performance in the medical domain. 
Due to the vast amounts of medical text available, and the high growth of this form of data, we suspect that such summarisation techniques will address the time-related obstacles associated with evidence-based medicine. Copyright © 2015 Elsevier Inc. All rights reserved.

  3. Prokaryotic Contig Annotation Pipeline Server: Web Application for a Prokaryotic Genome Annotation Pipeline Based on the Shiny App Package.

    PubMed

    Park, Byeonghyeok; Baek, Min-Jeong; Min, Byoungnam; Choi, In-Geol

    2017-09-01

    Genome annotation is a primary step in genomic research. To establish a lightweight and portable prokaryotic genome annotation pipeline for use in individual laboratories, we developed a Shiny app package designated "P-CAPS" (Prokaryotic Contig Annotation Pipeline Server). The package is composed of R and Python scripts that integrate publicly available annotation programs into a server application. P-CAPS is not only a browser-based interactive application but also a distributable Shiny app package that can be installed on any personal computer. The final annotation is provided in various standard formats and is summarized in an R Markdown document. Annotations can be visualized and examined with a public genome browser. A benchmark test showed that the annotation quality and completeness of P-CAPS are reliable and comparable with those of currently available public pipelines.

  4. Computer systems for annotation of single molecule fragments

    DOEpatents

    Schwartz, David Charles; Severin, Jessica

    2016-07-19

    There are provided computer systems for visualizing and annotating single molecule images. Annotation systems in accordance with this disclosure allow a user to mark and annotate single molecules of interest and their restriction enzyme cut sites thereby determining the restriction fragments of single nucleic acid molecules. The markings and annotations may be automatically generated by the system in certain embodiments and they may be overlaid translucently onto the single molecule images. An image caching system may be implemented in the computer annotation systems to reduce image processing time. The annotation systems include one or more connectors connecting to one or more databases capable of storing single molecule data as well as other biomedical data. Such a diverse array of data can be retrieved and used to validate the markings and annotations. The annotation systems may be implemented and deployed over a computer network. They may be ergonomically optimized to facilitate user interactions.

  5. A guide to best practices for Gene Ontology (GO) manual annotation

    PubMed Central

    Balakrishnan, Rama; Harris, Midori A.; Huntley, Rachael; Van Auken, Kimberly; Cherry, J. Michael

    2013-01-01

    The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate, consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotations. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of interest to them, enhancing the corpus of GO annotations available to all. Database URL: http://www.geneontology.org PMID:23842463

  6. GARNET: gene set analysis with exploration of annotation relations.

    PubMed

    Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu

    2011-02-15

    Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties for biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms carrying similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieving genes from annotation databases, statistical analysis and visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulation, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules (gene set manager, gene set analysis and gene set retrieval) which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for the annotation network has been developed to facilitate exploration of the related annotations. GARNET is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
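
    The pair-wise kappa computation mentioned above can be sketched as Cohen's kappa over binary gene-membership vectors. This is a minimal illustration under our own assumptions: the six-gene universe and the two annotation sets are hypothetical, and GARNET's actual implementation may differ.

```java
public class KappaDemo {
    // Cohen's kappa between two binary membership vectors over the same
    // universe of genes (1 = gene belongs to the annotation set).
    static double kappa(int[] a, int[] b) {
        int n = a.length, agree = 0, aOnes = 0, bOnes = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] == b[i]) agree++;
            if (a[i] == 1) aOnes++;
            if (b[i] == 1) bOnes++;
        }
        double po = (double) agree / n;  // observed agreement
        double pe = ((double) aOnes / n) * ((double) bOnes / n)
                  + ((double) (n - aOnes) / n) * ((double) (n - bOnes) / n);  // chance agreement
        return (po - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        // Hypothetical membership of six genes in a GO term and a pathway
        int[] goTerm  = {1, 1, 0, 0, 1, 0};
        int[] pathway = {1, 1, 0, 0, 0, 1};
        System.out.println(kappa(goTerm, pathway));  // ~0.333 for these vectors
    }
}
```

    A kappa near 1 would indicate that the two annotation sets cover nearly the same genes; a value near 0 indicates no agreement beyond chance.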

  7. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Buttler, D J

    The Java Metadata Facility was introduced by Java Specification Request (JSR) 175 [1] and incorporated into the Java language specification [2] in version 1.5 of the language. The specification allows annotations on Java program elements: classes, interfaces, methods, and fields. Annotations give programmers a uniform way to add metadata to program elements that can be used by code checkers, code generators, or other compile-time or runtime components. Annotations are defined by annotation types. These are defined the same way as interfaces, but with the symbol @ preceding the interface keyword. There are additional restrictions on defining annotation types: (1) They cannot be generic; (2) They cannot extend other annotation types or interfaces; (3) Methods cannot have any parameters; (4) Methods cannot have type parameters; (5) Methods cannot throw exceptions; and (6) The return type of methods of an annotation type must be a primitive, a String, a Class, an annotation type, or an array, where the type of the array is restricted to one of the four allowed types. See [2] for additional restrictions and syntax. The methods of an annotation type define the elements that may be used to parameterize the annotation in code. Annotation types may have default values for any of their elements. For example, an annotation that specifies a defect report could initialize an element defining the defect outcome to "submitted". Annotations may also have zero elements. This could be used to indicate serializability for a class (as opposed to the current Serializable interface).
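
    A minimal, compilable sketch of an annotation type in this style, echoing the defect-report example: the names, the runtime-retention choice, and the reflective read are our additions for illustration, not part of the specification text.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical annotation type: methods define the elements, and
// "outcome" carries a default value, as described above.
@Retention(RetentionPolicy.RUNTIME)  // retained so reflection can read it
@interface DefectReport {
    String id();
    String outcome() default "submitted";
}

public class AnnotationDemo {
    @DefectReport(id = "JDK-0001")  // "outcome" falls back to its default
    static class Patched {}

    public static void main(String[] args) {
        DefectReport r = Patched.class.getAnnotation(DefectReport.class);
        System.out.println(r.id() + " " + r.outcome());  // JDK-0001 submitted
    }
}
```

    Note how the annotation type obeys the restrictions listed above: it extends nothing, its methods take no parameters, and both return types are String.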

  8. Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes.

    PubMed

    Feuermann, Marc; Gaudet, Pascale; Mi, Huaiyu; Lewis, Suzanna E; Thomas, Paul D

    2016-01-01

    We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations.Database URL: http://amigo.geneontology.org/amigo. © The Author(s) 2016. Published by Oxford University Press.

  9. AnnotateGenomicRegions: a web application.

    PubMed

    Zammataro, Luca; DeMolfetta, Rita; Bucci, Gabriele; Ceol, Arnaud; Muller, Heiko

    2014-01-01

    Modern genomic technologies produce large amounts of data that can be mapped to specific regions in the genome. Among the first steps in interpreting the results is annotation of genomic regions with known features such as genes, promoters, CpG islands etc. Several tools have been published to perform this task. However, using these tools often requires a significant amount of bioinformatics skills and/or downloading and installing dedicated software. Here we present AnnotateGenomicRegions, a web application that accepts genomic regions as input and outputs a selection of overlapping and/or neighboring genome annotations. Supported organisms include human (hg18, hg19), mouse (mm8, mm9, mm10), zebrafish (danRer7), and Saccharomyces cerevisiae (sacCer2, sacCer3). AnnotateGenomicRegions is accessible online on a public server or can be installed locally. Some frequently used annotations and genomes are embedded in the application while custom annotations may be added by the user. The increasing spread of genomic technologies generates the need for a simple-to-use annotation tool for genomic regions that can be used by biologists and bioinformaticians alike. AnnotateGenomicRegions meets this demand. AnnotateGenomicRegions is an open-source web application that can be installed on any personal computer or institute server. AnnotateGenomicRegions is available at: http://cru.genomics.iit.it/AnnotateGenomicRegions.

  10. AnnotateGenomicRegions: a web application

    PubMed Central

    2014-01-01

    Background Modern genomic technologies produce large amounts of data that can be mapped to specific regions in the genome. Among the first steps in interpreting the results is annotation of genomic regions with known features such as genes, promoters, CpG islands etc. Several tools have been published to perform this task. However, using these tools often requires a significant amount of bioinformatics skills and/or downloading and installing dedicated software. Results Here we present AnnotateGenomicRegions, a web application that accepts genomic regions as input and outputs a selection of overlapping and/or neighboring genome annotations. Supported organisms include human (hg18, hg19), mouse (mm8, mm9, mm10), zebrafish (danRer7), and Saccharomyces cerevisiae (sacCer2, sacCer3). AnnotateGenomicRegions is accessible online on a public server or can be installed locally. Some frequently used annotations and genomes are embedded in the application while custom annotations may be added by the user. Conclusions The increasing spread of genomic technologies generates the need for a simple-to-use annotation tool for genomic regions that can be used by biologists and bioinformaticians alike. AnnotateGenomicRegions meets this demand. AnnotateGenomicRegions is an open-source web application that can be installed on any personal computer or institute server. AnnotateGenomicRegions is available at: http://cru.genomics.iit.it/AnnotateGenomicRegions. PMID:24564446
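
    The core lookup behind such a tool can be sketched as a half-open interval intersection test over annotated regions. This is a generic sketch, not AnnotateGenomicRegions' actual code; the coordinates and feature names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapDemo {
    // Half-open genomic interval [start, end) on a named chromosome.
    record Region(String chrom, int start, int end, String name) {}

    // Return the names of annotations overlapping the query region:
    // same chromosome, and the two intervals intersect.
    static List<String> overlapping(Region query, List<Region> annotations) {
        List<String> hits = new ArrayList<>();
        for (Region a : annotations) {
            if (a.chrom().equals(query.chrom())
                    && a.start() < query.end() && query.start() < a.end()) {
                hits.add(a.name());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Region> ann = List.of(
                new Region("chr1", 1000, 2000, "GeneA"),
                new Region("chr1", 5000, 6000, "CpG_island_1"),
                new Region("chr2", 1000, 2000, "GeneB"));
        // prints [GeneA, CpG_island_1]
        System.out.println(overlapping(new Region("chr1", 1500, 5500, "query"), ann));
    }
}
```

    A production tool would use an indexed structure (e.g. an interval tree) instead of a linear scan, but the overlap condition is the same.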

  11. Evaluation of web-based annotation of ophthalmic images for multicentric clinical trials.

    PubMed

    Chalam, K V; Jain, P; Shah, V A; Shah, Gaurav Y

    2006-06-01

    An Internet browser-based annotation system can be used to identify and describe features in digitized retinal images in multicentric clinical trials, in real time. In this web-based annotation system, the user employs a mouse to draw and create annotations on a transparent layer that encapsulates the observations and interpretations of a specific image. Multiple annotation layers may be overlaid on a single image. These layers may correspond to annotations by different users on the same image or to annotations of a temporal sequence of images of a disease process over a period of time. In addition, geometrical properties of annotated figures may be computed and measured. The annotations are stored in a central repository database on a server, from which they can be retrieved by multiple users in real time. This system facilitates objective evaluation of digital images and comparison of double-blind readings of digital photographs, with an identifiable audit trail. Annotation of ophthalmic images allowed clinically feasible and useful interpretation to track properties of an area of fundus pathology. This provided an objective method to monitor properties of pathologies over time, an essential component of multicentric clinical trials. The annotation system also allowed users to view stereoscopic images as stereo pairs. This web-based annotation system is useful and valuable in monitoring patient care in multicentric clinical trials, telemedicine, teaching and routine clinical settings.

  12. Evaluating Hierarchical Structure in Music Annotations

    PubMed Central

    McFee, Brian; Nieto, Oriol; Farbood, Morwaread M.; Bello, Juan Pablo

    2017-01-01

    Music exhibits structure at multiple scales, ranging from motifs to large-scale functional components. When inferring the structure of a piece, different listeners may attend to different temporal scales, which can result in disagreements when they describe the same piece. In the field of music informatics research (MIR), it is common to use corpora annotated with structural boundaries at different levels. By quantifying disagreements between multiple annotators, previous research has yielded several insights relevant to the study of music cognition. First, annotators tend to agree when structural boundaries are ambiguous. Second, this ambiguity seems to depend on musical features, time scale, and genre. Furthermore, it is possible to tune current annotation evaluation metrics to better align with these perceptual differences. However, previous work has not directly analyzed the effects of hierarchical structure because the existing methods for comparing structural annotations are designed for “flat” descriptions, and do not readily generalize to hierarchical annotations. In this paper, we extend and generalize previous work on the evaluation of hierarchical descriptions of musical structure. We derive an evaluation metric which can compare hierarchical annotations holistically across multiple levels. Using this metric, we investigate inter-annotator agreement on the multilevel annotations of two different music corpora, investigate the influence of acoustic properties on hierarchical annotations, and evaluate existing hierarchical segmentation algorithms against the distribution of inter-annotator agreement. PMID:28824514

  13. IMG ER: a system for microbial genome annotation expert review and curation.

    PubMed

    Markowitz, Victor M; Mavromatis, Konstantinos; Ivanova, Natalia N; Chen, I-Min A; Chu, Ken; Kyrpides, Nikos C

    2009-09-01

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  14. HAMAP in 2013, new developments in the protein family classification and annotation system

    PubMed Central

    Pedruzzi, Ivo; Rivoire, Catherine; Auchincloss, Andrea H.; Coudert, Elisabeth; Keller, Guillaume; de Castro, Edouard; Baratin, Delphine; Cuche, Béatrice A.; Bougueleret, Lydie; Poux, Sylvain; Redaschi, Nicole; Xenarios, Ioannis; Bridge, Alan

    2013-01-01

    HAMAP (High-quality Automated and Manual Annotation of Proteins—available at http://hamap.expasy.org/) is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated annotation rules that specify annotations that apply to family members. HAMAP was originally developed to support the manual curation of UniProtKB/Swiss-Prot records describing microbial proteins. Here we describe new developments in HAMAP, including the extension of HAMAP to eukaryotic proteins, the use of HAMAP in the automated annotation of UniProtKB/TrEMBL, providing high-quality annotation for millions of protein sequences, and the future integration of HAMAP into a unified system for UniProtKB annotation, UniRule. HAMAP is continuously updated by expert curators with new family profiles and annotation rules as new protein families are characterized. The collection of HAMAP family classification profiles and annotation rules can be browsed and viewed on the HAMAP website, which also provides an interface to scan user sequences against HAMAP profiles. PMID:23193261

  15. The Functional Genomics Network in the evolution of biological text mining over the past decade.

    PubMed

    Blaschke, Christian; Valencia, Alfonso

    2013-03-25

    Different programs of the European Science Foundation (ESF) have contributed significantly to connecting researchers in Europe and beyond through several initiatives. This support was particularly relevant for the development of the areas related to extracting information from papers (text mining), because it supported the field in its early phases, long before it was recognized by the community. We review the historical development of text mining research and how it was introduced into bioinformatics. Specific applications in (functional) genomics are described, such as its integration into genome annotation pipelines and its support for the analysis of high-throughput genomics experimental data, and we highlight the activities of method evaluation and benchmarking for which the ESF programme support was instrumental. Copyright © 2013 Elsevier B.V. All rights reserved.

  16. Automatic Semantic Orientation of Adjectives for Indonesian Language Using PMI-IR and Clustering

    NASA Astrophysics Data System (ADS)

    Riyanti, Dewi; Arif Bijaksana, M.; Adiwijaya

    2018-03-01

    We present our work in the area of sentiment analysis for the Indonesian language. We focus on building an automatic semantic orientation method using available resources in Indonesian. In this research we used an Indonesian corpus containing 9 million words from kompas.txt and tempo.txt that were manually tagged and annotated with a part-of-speech tagset. We then constructed a dataset by taking all the adjectives from the corpus and removing adjectives with no orientation; the set contained 923 adjective words. The system includes several steps, such as text pre-processing and clustering. The text pre-processing aims to increase the accuracy. Finally, a clustering method classifies each word into the related sentiment, positive or negative. With improvements to the text pre-processing, an accuracy of 72% can be achieved.
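
    The PMI-IR idea behind such semantic-orientation methods can be sketched as pointwise mutual information over co-occurrence counts, with a Turney-style orientation score as the difference of PMIs against positive and negative seed words. All counts below are hypothetical, not taken from the kompas/tempo corpus, and the paper's exact method may differ.

```java
public class PmiDemo {
    // Pointwise mutual information (base 2) from raw counts:
    // PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    static double pmi(long together, long count1, long count2, long total) {
        double pxy = (double) together / total;
        double px  = (double) count1 / total;
        double py  = (double) count2 / total;
        return Math.log(pxy / (px * py)) / Math.log(2);
    }

    public static void main(String[] args) {
        long total = 9_000_000L;  // corpus size in words, as in the abstract
        // Semantic orientation of a hypothetical adjective:
        // PMI with a positive seed word minus PMI with a negative seed word.
        double so = pmi(400, 2_000, 5_000, total)   // near the positive seed
                  - pmi(50, 2_000, 4_000, total);   // near the negative seed
        System.out.println(so > 0 ? "positive" : "negative");  // prints positive
    }
}
```

    A word co-occurring disproportionately with positive seeds gets a positive score and is clustered with the positive class.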

  17. Agile Text Mining for the 2014 i2b2/UTHealth Cardiac Risk Factors Challenge

    PubMed Central

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2016-01-01

    This paper describes the use of an agile text mining platform (Linguamatics’ Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 Challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. PMID:26209007

  18. Correlated accumulation of anthocyanins and rosmarinic acid in mechanically stressed red cell suspensions of basil (Ocimum basilicum).

    PubMed

    Strazzer, Pamela; Guzzo, Flavia; Levi, Marisa

    2011-02-15

    A red basil cell line (T2b) rich in rosmarinic acid (RA) was selected for the stable production of anthocyanins (ACs) in the dark. Cell suspension cultures were subjected to mechanical stress through increased agitation (switch from 90 to 150 rpm) to determine the relationship between AC and RA accumulation. Cell extracts were analyzed by HPLC and LC-MS, and the resulting data were processed with multivariate statistical analysis. MS and MS/MS spectra facilitated the putative annotation of several complex cyanidin-based ACs, which were esterified with coumaric acid and, in some cases, also with malonic acid. It was also possible to identify various RA-related molecules, some caffeic and coumaric acid derivatives and some flavanones. Mechanical stress increased the total AC and RA contents, but reduced biomass accumulation. Many metabolites were induced by mechanical stress, including RA and some of its derivatives, most ACs, hydroxycinnamic acids and flavonoids, whereas the abundance of some RA dimers was reduced. Although AC and RA share a common early biosynthetic pathway (from phenylalanine to 4-coumaroyl-CoA) and could have similar or overlapping functions providing antioxidant activity against stress-generated reactive oxygen species, there appeared to be no competition between their individual pathways. Copyright © 2010 Elsevier GmbH. All rights reserved.

  19. Larval RNA Interference in the Red Flour Beetle, Tribolium castaneum

    PubMed Central

    Tomoyasu, Yoshinori

    2014-01-01

The red flour beetle, Tribolium castaneum, offers a repertoire of experimental tools for genetic and developmental studies, including a fully annotated genome sequence, transposon-based transgenesis, and effective RNA interference (RNAi). Among these advantages, RNAi-based gene knockdown techniques are at the core of Tribolium research. T. castaneum shows a robust systemic RNAi response, making it possible to perform RNAi at any life stage by simply injecting double-stranded RNA (dsRNA) into the beetle’s body cavity. In this report, we provide an overview of our larval RNAi technique in T. castaneum. The protocol includes (i) isolation of the proper stage of T. castaneum larvae for injection, (ii) preparation for the injection setting, and (iii) dsRNA injection. Larval RNAi is a simple, but powerful technique that provides us with quick access to loss-of-function phenotypes, including multiple gene knockdown phenotypes as well as a series of hypomorphic phenotypes. Since virtually all T. castaneum tissues are susceptible to extracellular dsRNA, the larval RNAi technique allows researchers to study a wide variety of tissues in diverse contexts, including the genetic basis of organismal responses to the outside environment. In addition, the simplicity of this technique stimulates more student involvement in research, making T. castaneum an ideal genetic system for use in a classroom setting. PMID:25350485

20. America's National Parks Viewed in 3D by NASA MISR, Anaglyph 2

    NASA Image and Video Library

    2016-08-25

Just in time for the U.S. National Park Service's Centennial celebration on Aug. 25, NASA's Multiangle Imaging SpectroRadiometer (MISR) instrument aboard NASA's Terra satellite is releasing four new anaglyphs that showcase 33 of our nation's national parks, monuments, historical sites and recreation areas in glorious 3D. Shown in the annotated image are Grand Teton National Park, John D. Rockefeller Memorial Parkway, Yellowstone National Park, and parts of Craters of the Moon National Monument. MISR views Earth with nine cameras pointed at different angles, giving it the unique capability to produce anaglyphs, stereoscopic images that allow the viewer to experience the landscape in three dimensions. The anaglyphs were made by combining data from MISR's vertical-viewing and 46-degree forward-pointing cameras. You will need red-blue glasses in order to experience the 3D effect; ensure you place the red lens over your left eye. The images have been rotated so that north is to the left in order to enable 3D viewing because the Terra satellite flies from north to south. All of the images are 235 miles (378 kilometers) from west to east. These data were acquired June 25, 2016, Orbit 87876. http://photojournal.jpl.nasa.gov/catalog/PIA20890

  1. Apratoxin H and Apratoxin A Sulfoxide from the Red Sea Cyanobacterium Moorea producens

    PubMed Central

    Thornburg, Christopher C.; Cowley, Elise S.; Sikorska, Justyna; Shaala, Lamiaa A.; Ishmael, Jane E.; Youssef, Diaa T.A.; McPhail, Kerry L.

    2014-01-01

    Cultivation of the marine cyanobacterium Moorea producens, collected from the Nabq Mangroves in the Gulf of Aqaba (Red Sea), led to the isolation of new apratoxin analogues, apratoxin H (1) and apratoxin A sulfoxide (2), together with the known apratoxins A-C, lyngbyabellin B and hectochlorin. The absolute configuration of these new potent cytotoxins was determined by chemical degradation, MS, NMR, and CD spectroscopy. Apratoxin H (1) contains pipecolic acid in place of the proline residue present in apratoxin A, expanding the known suite of naturally occurring analogues that display amino acid substitutions within the final module of the apratoxin biosynthetic pathway. The oxidation site of apratoxin A sulfoxide (2) was deduced from MS fragmentation patterns and IR data, and 2 could not be generated experimentally by oxidation of apratoxin A. The cytotoxicity of 1 and 2 to human NCI-H460 lung cancer cells (IC50 = 3.4 and 89.9 nM, respectively) provides further insight into the structure–activity relationships in the apratoxin series. Phylogenetic analysis of the apratoxin-producing cyanobacterial strains belonging to the genus Moorea, coupled with the recently annotated apratoxin biosynthetic pathway, supports the notion that apratoxin production and structural diversity may be specific to their geographical niche. PMID:24016099

2. America's National Parks Viewed in 3D by NASA MISR, Anaglyph 4

    NASA Image and Video Library

    2016-08-25

Just in time for the U.S. National Park Service's Centennial celebration on Aug. 25, NASA's Multiangle Imaging SpectroRadiometer (MISR) instrument aboard NASA's Terra satellite is releasing four new anaglyphs that showcase 33 of our nation's national parks, monuments, historical sites and recreation areas in glorious 3D. Shown in the annotated image are Sequoia National Park, Kings Canyon National Park, Manzanar National Historic Site, Devils Postpile National Monument, Yosemite National Park, and parts of Death Valley National Park. MISR views Earth with nine cameras pointed at different angles, giving it the unique capability to produce anaglyphs, stereoscopic images that allow the viewer to experience the landscape in three dimensions. The anaglyphs were made by combining data from MISR's vertical-viewing and 46-degree forward-pointing cameras. You will need red-blue glasses in order to experience the 3D effect; ensure you place the red lens over your left eye. The images have been rotated so that north is to the left in order to enable 3D viewing because the Terra satellite flies from north to south. All of the images are 235 miles (378 kilometers) from west to east. These data were acquired July 7, 2016, Orbit 88051. http://photojournal.jpl.nasa.gov/catalog/PIA20892

  3. Validating a strategy for psychosocial phenotyping using a large corpus of clinical text.

    PubMed

    Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H

    2013-12-01

To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles was chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 0–1.6). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype.
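The corpus statistics used to rank note titles can be reproduced from the figures in the abstract. The sample size n used for the confidence interval below is an assumption chosen for illustration, not a number reported in the record:

```python
import math

def hit_rate(concepts, documents):
    """Concepts found per document, as used in the note-title ranking step."""
    return concepts / documents

def wald_ci(p_hat, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# Figures from the abstract: 58 707 concepts over 316 355 documents.
rate = hit_rate(58707, 316355)   # ~0.19, the abstract's "0.2 concepts per document"

# Sensitivity 49%; n = 260 is an illustrative assumption that reproduces
# an interval of roughly 43% to 55%, as reported.
lo, hi = wald_ci(0.49, 260)
```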

  4. Validating a strategy for psychosocial phenotyping using a large corpus of clinical text

    PubMed Central

    Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H

    2013-01-01

Objective To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. Materials and methods From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles was chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. Results A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 0–1.6). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Conclusions Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype. PMID:24169276

  5. Defense Department Pursuit of Insurers for Superfund Cost Recovery

    DTIC Science & Technology

    1992-04-01

Aug. 14, 1991). 95. See generally Greenlaw, supra note 90, at 235-52; Thomas A. Gordon & Roger Westendorf, Liability Coverage for Toxic Tort...121 103. See American Home Products, 565 F. Supp. at 1500-03; see also supra notes 100 and accompanying text. 104. Gordon & Westendorf, supra note 95...Annotated 409 (1969 Supplement). 111. Gordon & Westendorf, supra note 95, at 575; see also Greenlaw, supra note 90, at 244; Salisbury, supra note 86, at

  6. A method for the development of disease-specific reference standards vocabularies from textual biomedical literature resources

    PubMed Central

    Wang, Liqin; Bray, Bruce E.; Shi, Jianlin; Fiol, Guilherme Del; Haug, Peter J.

    2017-01-01

Objective Disease-specific vocabularies are fundamental to many knowledge-based intelligent systems and applications like text annotation, cohort selection, disease diagnostic modeling, and therapy recommendation. Reference standards are critical in the development and validation of automated methods for disease-specific vocabularies. The goal of the present study is to design and test a generalizable method for the development of vocabulary reference standards from expert-curated, disease-specific biomedical literature resources. Methods We formed disease-specific corpora from literature resources such as textbooks, evidence-based synthesized online sources, clinical practice guidelines, and journal articles. Medical experts annotated and adjudicated disease-specific terms in four classes (i.e., causes or risk factors, signs or symptoms, diagnostic tests or results, and treatment). Annotations were mapped to UMLS concepts. We assessed source variation, the contribution of each source to build disease-specific vocabularies, the saturation of the vocabularies with respect to the number of used sources, and the generalizability of the method with different diseases. Results The study resulted in 2588 string-unique annotations for heart failure in four classes, and 193 and 425 respectively for pulmonary embolism and rheumatoid arthritis in the treatment class. Approximately 80% of the annotations were mapped to UMLS concepts. The agreement among heart failure sources ranged between 0.28 and 0.46. The contribution of these sources to the final vocabulary ranged between 18% and 49%. With the sources explored, the heart failure vocabulary reached near saturation in all four classes with the inclusion of a minimum of six sources (or between four and seven sources if only counting terms that occurred in two or more sources). It took fewer sources to reach near saturation for the other two diseases in terms of the treatment class. 
Conclusions We developed a method for the development of disease-specific reference vocabularies. Expert-curated biomedical literature resources are a substantial source for acquiring disease-specific medical knowledge. It is feasible to reach near saturation in a disease-specific vocabulary using a relatively small number of literature sources. PMID:26971304

  7. Apollo: a sequence annotation editor

    PubMed Central

    Lewis, SE; Searle, SMJ; Harris, N; Gibson, M; Iyer, V; Richter, J; Wiel, C; Bayraktaroglu, L; Birney, E; Crosby, MA; Kaminker, JS; Matthews, BB; Prochnik, SE; Smith, CD; Tupy, JL; Rubin, GM; Misra, S; Mungall, CJ; Clamp, ME

    2002-01-01

    The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects. PMID:12537571

  8. Tharsis Limb Cloud

    NASA Technical Reports Server (NTRS)

    2005-01-01

    [figure removed for brevity, see original site] Annotated image of Tharsis Limb Cloud

    7 September 2005 This composite of red and blue Mars Global Surveyor (MGS) Mars Orbiter Camera (MOC) daily global images acquired on 6 July 2005 shows an isolated water ice cloud extending more than 30 kilometers (more than 18 miles) above the martian surface. Clouds such as this are common in late spring over the terrain located southwest of the Arsia Mons volcano. Arsia Mons is the dark, oval feature near the limb, just to the left of the 'T' in the 'Tharsis Montes' label. The dark, nearly circular feature above the 'S' in 'Tharsis' is the volcano, Pavonis Mons, and the other dark circular feature, above and to the right of 's' in 'Montes,' is Ascraeus Mons. Illumination is from the left/lower left.

    Season: Northern Autumn/Southern Spring

  9. Modeling multiple time series annotations as noisy distortions of the ground truth: An Expectation-Maximization approach.

    PubMed

    Gupta, Rahul; Audhkhasi, Kartik; Jacokes, Zach; Rozga, Agata; Narayanan, Shrikanth

    2018-01-01

Studies of time-continuous human behavioral phenomena often rely on ratings from multiple annotators. Since the ground truth of the target construct is often latent, the standard practice is to use ad-hoc metrics (such as averaging annotator ratings). Despite being easy to compute, such metrics may not provide accurate representations of the underlying construct. In this paper, we present a novel method for modeling multiple time series annotations over a continuous variable that computes the ground truth by modeling annotator-specific distortions. We condition the ground truth on a set of features extracted from the data and further assume that the annotators provide their ratings as modifications of the ground truth, with each annotator having specific distortion tendencies. We train the model using an Expectation-Maximization based algorithm and evaluate it on a study involving natural interaction between a child and a psychologist, to predict confidence ratings of the children's smiles. We compare and analyze the model against two baselines where: (i) the ground truth is considered to be the framewise mean of ratings from the various annotators and (ii) each annotator is assumed to bear a distinct time delay in annotation, and their annotations are aligned before computing the framewise mean.
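The two baselines can be sketched as follows. This is a simplified stand-in that estimates each annotator's delay by a lag search maximizing correlation; the paper's EM-based distortion model itself is more involved:

```python
import numpy as np

def framewise_mean(ratings):
    """Baseline (i): ground truth taken as the per-frame mean over annotators."""
    return np.mean(ratings, axis=0)

def delay_aligned_mean(ratings, max_lag=10):
    """Baseline (ii): shift each annotator's trace to best correlate with the
    raw mean, then average the aligned traces."""
    ref = framewise_mean(ratings)
    aligned = []
    for r in ratings:
        # Pick the integer lag maximizing correlation with the reference trace.
        best = max(range(-max_lag, max_lag + 1),
                   key=lambda k: np.corrcoef(np.roll(r, k), ref)[0, 1])
        aligned.append(np.roll(r, best))
    return np.mean(aligned, axis=0)
```

On synthetic data where annotators are pure time-shifted copies of a latent signal, alignment recovers the signal exactly, while the raw framewise mean smears it.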

10. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  11. Cross-organism learning method to discover new gene functionalities.

    PubMed

    Domeniconi, Giacomo; Masseroli, Marco; Moro, Gianluca; Pinoli, Pietro

    2016-04-01

Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In the last years, several databases of such annotations have become available; yet, these valuable annotations are incomplete, include errors and only some of them represent highly reliable human curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. Here, we propose a novel cross-organism learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict with good accuracy unknown gene annotations. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organism learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. We tested and compared our method with the equivalent single organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). 
Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, regardless of the evolutionary distance between the organisms considered. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
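The core idea, scoring candidate annotations for a target gene from the annotation profiles of a better-studied organism, can be illustrated with a toy similarity-based scorer. The Jaccard nearest-neighbor scheme below is a deliberate simplification, not the paper's supervised algorithm, and the gene/term names are made up:

```python
def jaccard(a, b):
    """Similarity between two annotation profiles (sets of term IDs)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_annotations(source_genes, target_profile, k=2):
    """Rank novel candidate terms for a target gene by the similarity-weighted
    votes of its k most similar source-organism genes."""
    neighbors = sorted(source_genes.items(),
                       key=lambda kv: jaccard(kv[1], target_profile),
                       reverse=True)[:k]
    scores = {}
    for _, terms in neighbors:
        w = jaccard(terms, target_profile)
        for term in terms - target_profile:   # only terms the target lacks
            scores[term] = scores.get(term, 0.0) + w
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

The scores play the role of the likelihood values attached to the ranked lists described above.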

  12. Transcriptomic Analysis of Paeonia delavayi Wild Population Flowers to Identify Differentially Expressed Genes Involved in Purple-Red and Yellow Petal Pigmentation

    PubMed Central

    Wang, Yan; Li, Kui; Zheng, Baoqiang; Miao, Kun

    2015-01-01

    Tree peony (Paeonia suffruticosa Andrews) is a very famous traditional ornamental plant in China. P. delavayi is a species endemic to Southwest China that has aroused great interest from researchers as a precious genetic resource for flower color breeding. However, the current understanding of the molecular mechanisms of flower pigmentation in this plant is limited, hindering the genetic engineering of novel flower color in tree peonies. In this study, we conducted a large-scale transcriptome analysis based on Illumina HiSeq sequencing of cDNA libraries generated from yellow and purple-red P. delavayi petals. A total of 90,202 unigenes were obtained by de novo assembly, with an average length of 721 nt. Using Blastx, 44,811 unigenes (49.68%) were found to have significant similarity to accessions in the NR, NT, and Swiss-Prot databases. We also examined COG, GO and KEGG annotations to better understand the functions of these unigenes. Further analysis of the two digital transcriptomes revealed that 6,855 unigenes were differentially expressed between yellow and purple-red flower petals, with 3,430 up-regulated and 3,425 down-regulated. According to the RNA-Seq data and qRT-PCR analysis, we proposed that four up-regulated key structural genes, including F3H, DFR, ANS and 3GT, might play an important role in purple-red petal pigmentation, while high co-expression of THC2'GT, CHI and FNS II ensures the accumulation of pigments contributing to the yellow color. We also found 50 differentially expressed transcription factors that might be involved in flavonoid biosynthesis. This study is the first to report genetic information for P. delavayi. The large number of gene sequences produced by transcriptome sequencing and the candidate genes identified using pathway mapping and expression profiles will provide a valuable resource for future association studies aimed at better understanding the molecular mechanisms underlying flower pigmentation in tree peonies. 
PMID:26267644

  13. eRAM: encyclopedia of rare disease annotations for precision medicine.

    PubMed

    Jia, Jinmeng; An, Zhongxin; Ming, Yue; Guo, Yongli; Li, Wei; Liang, Yunxiang; Guo, Dongming; Li, Xin; Tai, Jun; Chen, Geng; Jin, Yaqiong; Liu, Zhimei; Ni, Xin; Shi, Tieliu

    2018-01-04

Rare diseases affect over a hundred million people worldwide, yet most of these patients are not accurately diagnosed and effectively treated. The limited knowledge of rare diseases forms the biggest obstacle to improving their treatment. Detailed clinical phenotyping is considered a keystone of deciphering genes and realizing precision medicine for rare diseases. Here, we present a standardized system for various types of rare diseases, called the encyclopedia of Rare disease Annotations for Precision Medicine (eRAM). eRAM was built by text-mining nearly 10 million scientific publications and electronic medical records, and integrating various data in existing recognized databases (such as the Unified Medical Language System (UMLS), Human Phenotype Ontology, Orphanet, OMIM, GWAS). eRAM systematically incorporates currently available data on clinical manifestations and molecular mechanisms of rare diseases and uncovers many novel associations among diseases. eRAM provides enriched annotations for 15 942 rare diseases, yielding 6147 human disease-related phenotype terms, 31 661 mammalian phenotype terms, 10 202 symptoms from UMLS, 18 815 genes and 92 580 genotypes. eRAM can not only provide information about rare disease mechanisms but also help clinicians make accurate diagnostic and therapeutic decisions for rare diseases. eRAM can be freely accessed at http://www.unimd.org/eram/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  14. An information gathering system for medical image inspection

    NASA Astrophysics Data System (ADS)

    Lee, Young-Jin; Bajcsy, Peter

    2005-04-01

    We present an information gathering system for medical image inspection that consists of software tools for capturing computer-centric and human-centric information. Computer-centric information includes (1) static annotations, such as (a) image drawings enclosing any selected area, a set of areas with similar colors, a set of salient points, and (b) textual descriptions associated with either image drawings or links between pairs of image drawings, and (2) dynamic (or temporal) information, such as mouse movements, zoom level changes, image panning and frame selections from an image stack. Human-centric information is represented by video and audio signals that are acquired by computer-mounted cameras and microphones. The short-term goal of the presented system is to facilitate learning of medical novices from medical experts, while the long-term goal is to data mine all information about image inspection for assisting in making diagnoses. In this work, we built basic software functionality for gathering computer-centric and human-centric information of the aforementioned variables. Next, we developed the information playback capabilities of all gathered information for educational purposes. Finally, we prototyped text-based and image template-based search engines to retrieve information from recorded annotations, for example, (a) find all annotations containing the word "blood vessels", or (b) search for similar areas to a selected image area. The information gathering system for medical image inspection reported here has been tested with images from the Histology Atlas database.
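The computer-centric information described above (static annotations plus a time-stamped stream of dynamic events) maps naturally onto a small data model. The field names here are illustrative assumptions, not the system's actual schema:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Annotation:
    """A static annotation: a drawn region plus optional linked text."""
    region: list          # polygon vertices as (x, y) pairs
    text: str = ""

@dataclass
class InspectionLog:
    """One inspection session: static annotations and dynamic events."""
    annotations: list = field(default_factory=list)
    events: list = field(default_factory=list)   # (timestamp, kind, payload)

    def record(self, kind, payload):
        """Log a dynamic event such as a zoom change or mouse movement."""
        self.events.append((time.monotonic(), kind, payload))

    def to_json(self):
        """Serialize the whole session for later playback or text search."""
        return json.dumps(asdict(self))
```

A serialized log like this supports both the playback capability and the text-based retrieval (e.g., finding annotations containing "blood vessels") mentioned in the record.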

  15. pocketZebra: a web-server for automated selection and classification of subfamily-specific binding sites by bioinformatic analysis of diverse protein families.

    PubMed

    Suplatov, Dmitry; Kirilin, Eugeny; Arbatsky, Mikhail; Takhaveev, Vakil; Svedas, Vytas

    2014-07-01

    The new web-server pocketZebra implements the power of bioinformatics and geometry-based structural approaches to identify and rank subfamily-specific binding sites in proteins by functional significance, and select particular positions in the structure that determine selective accommodation of ligands. A new scoring function has been developed to annotate binding sites by the presence of the subfamily-specific positions in diverse protein families. pocketZebra web-server has multiple input modes to meet the needs of users with different experience in bioinformatics. The server provides on-site visualization of the results as well as off-line version of the output in annotated text format and as PyMol sessions ready for structural analysis. pocketZebra can be used to study structure-function relationship and regulation in large protein superfamilies, classify functionally important binding sites and annotate proteins with unknown function. The server can be used to engineer ligand-binding sites and allosteric regulation of enzymes, or implemented in a drug discovery process to search for potential molecular targets and novel selective inhibitors/effectors. The server, documentation and examples are freely available at http://biokinet.belozersky.msu.ru/pocketzebra and there are no login requirements. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
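A scoring function of this general kind can be sketched as counting the subfamily-specific residue positions enclosed by each pocket. This is a simplification of the server's actual function, and all names are hypothetical:

```python
def score_pockets(pockets, specific_positions):
    """Rank binding pockets by how many subfamily-specific residue
    positions each one encloses (higher = more functionally significant)."""
    ranked = [(name, len(set(residues) & specific_positions))
              for name, residues in pockets.items()]
    return sorted(ranked, key=lambda kv: -kv[1])
```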

  16. OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software.

    PubMed

    Kriventseva, Evgenia V; Tegenfeldt, Fredrik; Petty, Tom J; Waterhouse, Robert M; Simão, Felipe A; Pozdnyakov, Igor A; Ioannidis, Panagiotis; Zdobnov, Evgeny M

    2015-01-01

    Orthology, refining the concept of homology, is the cornerstone of evolutionary comparative studies. With the ever-increasing availability of genomic data, inference of orthology has become instrumental for generating hypotheses about gene functions crucial to many studies. This update of the OrthoDB hierarchical catalog of orthologs (http://www.orthodb.org) covers 3027 complete genomes, including the most comprehensive set of 87 arthropods, 61 vertebrates, 227 fungi and 2627 bacteria (sampling the most complete and representative genomes from over 11,000 available). In addition to the most extensive integration of functional annotations from UniProt, InterPro, GO, OMIM, model organism phenotypes and COG functional categories, OrthoDB uniquely provides evolutionary annotations including rates of ortholog sequence divergence, copy-number profiles, sibling groups and gene architectures. We re-designed the entirety of the OrthoDB website from the underlying technology to the user interface, enabling the user to specify species of interest and to select the relevant orthology level by the NCBI taxonomy. The text searches allow use of complex logic with various identifiers of genes, proteins, domains, ontologies or annotation keywords and phrases. Gene copy-number profiles can also be queried. This release comes with the freely available underlying ortholog clustering pipeline (http://www.orthodb.org/software). © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. Extracting microRNA-gene relations from biomedical literature using distant supervision

    PubMed Central

Lamurias, Andre; Clarke, Luka A.; Couto, Francisco M.

    2017-01-01

    Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel. PMID:28263989
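The distant-supervision step, replacing manual sentence labels with a knowledge base, can be sketched as follows. Entity mentions are assumed to be pre-extracted, and the names are illustrative rather than IBRel's actual API:

```python
def distant_labels(sentences, knowledge_base):
    """Label each co-occurring (miRNA, gene) pair using the knowledge base
    instead of manual annotation: positive if the KB lists the pair."""
    instances = []
    for sent, mirnas, genes in sentences:
        for m in mirnas:
            for g in genes:
                instances.append((sent, m, g, (m, g) in knowledge_base))
    return instances
```

Multi-instance learning then treats all sentences sharing a pair as one bag, which softens the noise this heuristic labeling introduces.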

  18. Extracting microRNA-gene relations from biomedical literature using distant supervision.

    PubMed

    Lamurias, Andre; Clarke, Luka A; Couto, Francisco M

    2017-01-01

    Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel.

  19. Apollo2Go: a web service adapter for the Apollo genome viewer to enable distributed genome annotation.

    PubMed

    Klee, Kathrin; Ernst, Rebecca; Spannagl, Manuel; Mayer, Klaus F X

    2007-08-30

    Apollo, a genome annotation viewer and editor, has become a widely used genome annotation and visualization tool for distributed genome annotation projects. When using Apollo for annotation, database updates are carried out by uploading intermediate annotation files into the respective database. This non-direct database upload is laborious and evokes problems of data synchronicity. To overcome these limitations we extended the Apollo data adapter with a generic, configurable web service client that is able to retrieve annotation data in a GAME-XML-formatted string and pass it on to Apollo's internal input routine. This Apollo web service adapter, Apollo2Go, simplifies the data exchange in distributed projects and aims to render the annotation process more comfortable. The Apollo2Go software is freely available from ftp://ftpmips.gsf.de/plants/apollo_webservice.

  20. Apollo2Go: a web service adapter for the Apollo genome viewer to enable distributed genome annotation

    PubMed Central

    Klee, Kathrin; Ernst, Rebecca; Spannagl, Manuel; Mayer, Klaus FX

    2007-01-01

    Background Apollo, a genome annotation viewer and editor, has become a widely used genome annotation and visualization tool for distributed genome annotation projects. When using Apollo for annotation, database updates are carried out by uploading intermediate annotation files into the respective database. This non-direct database upload is laborious and evokes problems of data synchronicity. Results To overcome these limitations we extended the Apollo data adapter with a generic, configurable web service client that is able to retrieve annotation data in a GAME-XML-formatted string and pass it on to Apollo's internal input routine. Conclusion This Apollo web service adapter, Apollo2Go, simplifies the data exchange in distributed projects and aims to render the annotation process more comfortable. The Apollo2Go software is freely available from ftp://ftpmips.gsf.de/plants/apollo_webservice. PMID:17760972

  1. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Leung, Elo; Huang, Amy; Cadag, Eithon

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for the integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein FASTA data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well-annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta-server that combines the results from several sequence-based annotation and function-prediction tools, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA format. PSAT is most appropriately applied to the annotation of large protein FASTA sets that may or may not be associated with a single genome.

  2. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGES

    Leung, Elo; Huang, Amy; Cadag, Eithon; ...

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for the integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein FASTA data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well-annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta-server that combines the results from several sequence-based annotation and function-prediction tools, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA format. PSAT is most appropriately applied to the annotation of large protein FASTA sets that may or may not be associated with a single genome.

  3. Real-time image annotation by manifold-based biased Fisher discriminant analysis

    NASA Astrophysics Data System (ADS)

    Ji, Rongrong; Yao, Hongxun; Wang, Jicheng; Sun, Xiaoshuai; Liu, Xianming

    2008-01-01

    Automatic linguistic annotation is a promising solution to bridge the semantic gap in content-based image retrieval. However, two crucial issues are not well addressed in state-of-the-art annotation algorithms: (1) the Small Sample Size (3S) problem in keyword classifier/model learning; and (2) most annotation algorithms cannot be extended to real-time online usage due to their low computational efficiency. This paper presents a novel Manifold-based Biased Fisher Discriminant Analysis (MBFDA) algorithm to address these two issues by transductive semantic learning and keyword filtering. To address the 3S problem, co-training-based manifold learning is adopted for keyword model construction. To achieve real-time annotation, a Biased Fisher Discriminant Analysis (BFDA) based semantic feature reduction algorithm is presented for keyword confidence discrimination and semantic feature reduction. Different from all existing annotation methods, MBFDA views image annotation from a novel Eigen semantic feature (which corresponds to keywords) selection aspect. As demonstrated in experiments, our manifold-based biased Fisher discriminant analysis annotation algorithm outperforms classical and state-of-the-art annotation methods (1. K-NN expansion; 2. one-to-all SVM; 3. PWC-SVM) in both computational time and annotation accuracy by a large margin.

  4. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments

    PubMed Central

    Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J.

    2011-01-01

    The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http://www.cbil.upenn.edu/annotCompute/ PMID:22190598
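    The conceptual dissimilarity idea described above can be illustrated with a simple set-based measure over controlled-vocabulary terms. The abstract does not specify AnnotCompute's actual formula, so the Jaccard-style measure and the MGED-like term labels below are assumptions for illustration only.

    ```python
    # Illustrative Jaccard-style dissimilarity between two experiments'
    # annotation term sets; not AnnotCompute's actual measure.
    def annotation_dissimilarity(terms_a, terms_b):
        """0.0 = identical annotation sets, 1.0 = completely disjoint."""
        a, b = set(terms_a), set(terms_b)
        if not a and not b:
            return 0.0  # two unannotated experiments are trivially alike
        return 1.0 - len(a & b) / len(a | b)

    # Hypothetical MGED-Ontology-like annotations for two experiments.
    exp1 = {"MO:organism_part", "MO:Affymetrix", "MO:time_series"}
    exp2 = {"MO:organism_part", "MO:Affymetrix", "MO:dose_response"}
    d = annotation_dissimilarity(exp1, exp2)  # 1 - 2/4 = 0.5
    ```

    Pairwise dissimilarities of this kind can then feed directly into clustering or query-by-example ranking, as the system description suggests.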

  5. MitoFish and MitoAnnotator: A Mitochondrial Genome Database of Fish with an Accurate and Automatic Annotation Pipeline

    PubMed Central

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P.; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-01-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface. PMID:23955518

  6. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

    PubMed

    Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

    2013-01-01

    The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing amount of curated abstracts (97,000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, moving only from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated into semi-automatic workflows or fully automatic pipelines, such systems provide increasingly effective assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/
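    A k-Nearest Neighbours functional profiler of the kind described can be sketched with plain token-count similarity: the GO terms of the most similar curated abstracts are ranked and proposed for the unseen abstract. This is a minimal illustration, not GOCat's implementation; the toy corpus and GO identifiers are hypothetical.

    ```python
    # Toy k-NN GO category assignment over curated abstracts.
    # Not GOCat's implementation; corpus and GO IDs are made up.
    from collections import Counter
    from math import sqrt

    def cosine(c1, c2):
        dot = sum(c1[t] * c2[t] for t in c1)
        n1 = sqrt(sum(v * v for v in c1.values()))
        n2 = sqrt(sum(v * v for v in c2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def knn_go_terms(query, curated, k=2):
        """Rank GO terms by summed similarity of the k nearest curated abstracts."""
        q = Counter(query.lower().split())
        neighbours = sorted(
            curated,
            key=lambda d: -cosine(q, Counter(d["text"].lower().split())))[:k]
        scores = Counter()
        for doc in neighbours:
            sim = cosine(q, Counter(doc["text"].lower().split()))
            for term in doc["go_terms"]:
                scores[term] += sim
        return [t for t, _ in scores.most_common()]

    curated = [
        {"text": "apoptosis regulation in tumour cells", "go_terms": ["GO:0006915"]},
        {"text": "dna repair after uv damage", "go_terms": ["GO:0006281"]},
    ]
    ranked = knn_go_terms("regulation of apoptosis in cancer cells", curated, k=1)
    ```

    As the abstract argues, a profiler like this improves simply as the curated knowledge base grows, with no change to the algorithm itself.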

  7. Semantic annotation of consumer health questions.

    PubMed

    Kilicoglu, Halil; Ben Abacha, Asma; Mrabet, Yassine; Shooshan, Sonya E; Rodriguez, Laritza; Masterton, Kate; Demner-Fushman, Dina

    2018-02-06

    Consumers increasingly use online resources for their health information needs. While current search engines can address these needs to some extent, they generally do not take into account that most health information needs are complex and can only fully be expressed in natural language. Consumer health question answering (QA) systems aim to fill this gap. A major challenge in developing consumer health QA systems is extracting relevant semantic content from the natural language questions (question understanding). To develop effective question understanding tools, question corpora semantically annotated for relevant question elements are needed. In this paper, we present a two-part consumer health question corpus annotated with several semantic categories: named entities, question triggers/types, question frames, and question topic. The first part (CHQA-email) consists of relatively long email requests received by the U.S. National Library of Medicine (NLM) customer service, while the second part (CHQA-web) consists of shorter questions posed to the MedlinePlus search engine as queries. Each question has been annotated by two annotators. The annotation methodology is largely the same between the two parts of the corpus; however, we also explain and justify the differences between them. Additionally, we provide information about corpus characteristics, inter-annotator agreement, and our attempts to measure annotation confidence in the absence of adjudication of annotations. The resulting corpus consists of 2614 questions (CHQA-email: 1740, CHQA-web: 874). Problems are the most frequent named entities, while treatment and general information questions are the most common question types. Inter-annotator agreement was generally modest: question types and topics yielded the highest agreement, while agreement for the more complex frame annotations was lower. Agreement in CHQA-web was consistently higher than that in CHQA-email. Pairwise inter-annotator agreement proved most useful in estimating annotation confidence. To our knowledge, our corpus is the first focusing on the annotation of uncurated consumer health questions. It is currently used to develop machine learning-based methods for question understanding. We make the corpus publicly available to stimulate further research on consumer health QA.
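    Inter-annotator agreement of the kind reported above is commonly measured with Cohen's kappa, which corrects raw agreement for chance. The sketch below assumes simple categorical labels (e.g. question types); it is a generic illustration, not this paper's exact agreement computation.

    ```python
    # Generic Cohen's kappa for two annotators labeling the same items.
    # Illustrative labels only; not the paper's actual computation.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        # Expected agreement if both annotators labeled at random
        # according to their own label frequencies.
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Hypothetical question-type labels from two annotators.
    a = ["treatment", "info", "info", "cause", "info", "treatment"]
    b = ["treatment", "info", "cause", "cause", "info", "info"]
    kappa = cohens_kappa(a, b)  # raw agreement 4/6, kappa is lower
    ```

    Values near 1 indicate strong agreement; the "generally modest" agreement reported above corresponds to mid-range kappa values.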

  8. Annotation of rule-based models with formal semantics to enable creation, analysis, reuse and visualization.

    PubMed

    Misirli, Goksel; Cavaliere, Matteo; Waites, William; Pocock, Matthew; Madsen, Curtis; Gilfellon, Owen; Honorato-Zimmer, Ricardo; Zuliani, Paolo; Danos, Vincent; Wipat, Anil

    2016-03-15

    Biological systems are complex and challenging to model and therefore model reuse is highly desirable. To promote model reuse, models should include both information about the specifics of simulations and the underlying biology in the form of metadata. The availability of computationally tractable metadata is especially important for the effective automated interpretation and processing of models. Metadata are typically represented as machine-readable annotations which enhance programmatic access to information about models. Rule-based languages have emerged as a modelling framework to represent the complexity of biological systems. Annotation approaches have been widely used for reaction-based formalisms such as SBML. However, rule-based languages still lack a rich annotation framework to add semantic information, such as machine-readable descriptions, to the components of a model. We present an annotation framework and guidelines for annotating rule-based models, encoded in the commonly used Kappa and BioNetGen languages. We adapt widely adopted annotation approaches to rule-based models. We initially propose a syntax to store machine-readable annotations and describe a mapping between rule-based modelling entities, such as agents and rules, and their annotations. We then describe an ontology to both annotate these models and capture the information contained therein, and demonstrate annotating these models using examples. Finally, we present a proof of concept tool for extracting annotations from a model that can be queried and analyzed in a uniform way. The uniform representation of the annotations can be used to facilitate the creation, analysis, reuse and visualization of rule-based models. Although examples are given using specific implementations, the proposed techniques can be applied to rule-based models in general. 
    The annotation ontology for rule-based models can be found at http://purl.org/rbm/rbmo. The krdf tool and associated executable examples are available at http://purl.org/rbm/rbmo/krdf. Contact: anil.wipat@newcastle.ac.uk or vdanos@inf.ed.ac.uk. © The Author 2015. Published by Oxford University Press.

  9. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.

    PubMed

    Law, MeiYee; Childs, Kevin L; Campbell, Michael S; Stein, Joshua C; Olson, Andrew J; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M; Lawrence, Carolyn J; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. © 2015 American Society of Plant Biologists. All Rights Reserved.

  10. Text de-identification for privacy protection: a study of its impact on clinical text information content.

    PubMed

    Meystre, Stéphane M; Ferrández, Óscar; Friedlin, F Jeffrey; South, Brett R; Shen, Shuying; Samore, Matthew H

    2014-08-01

    As more and more electronic clinical information is becoming easier to access for secondary uses such as clinical research, approaches that enable faster and more collaborative research while protecting patient privacy and confidentiality are becoming more important. Clinical text de-identification offers such advantages but is typically a tedious manual process. Automated Natural Language Processing (NLP) methods can alleviate this process, but their impact on subsequent uses of the automatically de-identified clinical narratives has only barely been investigated. In the context of a larger project to develop and investigate automated text de-identification for Veterans Health Administration (VHA) clinical notes, we studied the impact of automated text de-identification on clinical information in a stepwise manner. Our approach started with a high-level assessment of clinical notes informativeness and formatting, and ended with a detailed study of the overlap of select clinical information types and Protected Health Information (PHI). To investigate the informativeness (i.e., document type information, select clinical data types, and interpretation or conclusion) of VHA clinical notes, we used five different existing text de-identification systems. The informativeness was only minimally altered by these systems while formatting was only modified by one system. To examine the impact of de-identification on clinical information extraction, we compared counts of SNOMED-CT concepts found by an open source information extraction application in the original (i.e., not de-identified) version of a corpus of VHA clinical notes, and in the same corpus after de-identification. Only about 1.2-3% fewer SNOMED-CT concepts were found in de-identified versions of our corpus, and many of these concepts were PHI that was erroneously identified as clinical information. 
    To study this impact in more detail and assess how generalizable our findings were, we examined the overlap between select clinical information annotated in the 2010 i2b2 NLP challenge corpus and automatic PHI annotations from our best-of-breed VHA clinical text de-identification system (nicknamed 'BoB'). Overall, only 0.81% of the clinical information exactly overlapped with PHI, and 1.78% partly overlapped. We conclude that automated text de-identification's impact on clinical information is small, but not negligible, and that improved disambiguation of clinical acronyms and eponyms could significantly reduce this impact. Copyright © 2014 Elsevier Inc. All rights reserved.
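    The overlap analysis between PHI annotations and clinical-information annotations can be illustrated by classifying character-span pairs as exact, partial, or non-overlapping. This is a schematic reconstruction under assumed (start, end) span conventions, not the study's actual code; the spans are made up.

    ```python
    # Schematic exact/partial overlap counting between annotation spans.
    # Assumes half-open (start, end) character offsets; illustrative only.
    def overlap_type(span_a, span_b):
        """Classify the overlap between two (start, end) spans."""
        (s1, e1), (s2, e2) = span_a, span_b
        if (s1, e1) == (s2, e2):
            return "exact"
        if s1 < e2 and s2 < e1:
            return "partial"
        return "none"

    def overlap_rates(clinical_spans, phi_spans):
        """Fraction of clinical spans exactly / partially covered by PHI."""
        exact = partial = 0
        for c in clinical_spans:
            kinds = {overlap_type(c, p) for p in phi_spans}
            if "exact" in kinds:
                exact += 1
            elif "partial" in kinds:
                partial += 1
        n = len(clinical_spans)
        return exact / n, partial / n

    clin = [(0, 10), (20, 30), (40, 50), (60, 70)]  # hypothetical offsets
    phi = [(20, 30), (45, 55)]
    ex, pa = overlap_rates(clin, phi)  # 1/4 exact, 1/4 partial
    ```

    Aggregating such rates over a whole corpus yields figures analogous to the 0.81% exact and 1.78% partial overlap reported above.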

  11. The distributed annotation system.

    PubMed

    Dowell, R D; Jokerst, R M; Day, A; Eddy, S R; Stein, L

    2001-01-01

    Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory. Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is defined by the DAS XML specification. Annotations are displayed in layers, one per server. Any client or server adhering to the DAS XML specification can participate in the system; we describe a simple prototype client and server example. The DAS specification is being used experimentally by Ensembl, WormBase, and the Berkeley Drosophila Genome Project. Continued success will depend on the readiness of the research community to adopt DAS and provide annotations. All components are freely available from the project website http://www.biodas.org/.
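    On the client side, consuming a DAS annotation layer amounts to fetching an XML feature document from each server and overlaying the parsed features. The snippet below parses a deliberately simplified DASGFF-like response; the real DAS 1.5 XML carries more attributes and elements than shown here.

    ```python
    # Parsing a simplified DAS-style feature response into annotation
    # "layers". Illustrative schema only, not the full DASGFF format.
    import xml.etree.ElementTree as ET

    DAS_XML = """<DASGFF><GFF><SEGMENT id="chr1" start="1" stop="1000">
      <FEATURE id="gene1"><TYPE id="gene">gene</TYPE>
        <START>100</START><END>400</END></FEATURE>
      <FEATURE id="exon1"><TYPE id="exon">exon</TYPE>
        <START>100</START><END>200</END></FEATURE>
    </SEGMENT></GFF></DASGFF>"""

    def parse_features(xml_text):
        """Return one dict per FEATURE, as a client would overlay them."""
        root = ET.fromstring(xml_text)
        feats = []
        for f in root.iter("FEATURE"):
            feats.append({
                "id": f.get("id"),
                "type": f.find("TYPE").text,
                "start": int(f.find("START").text),
                "end": int(f.find("END").text),
            })
        return feats

    features = parse_features(DAS_XML)
    ```

    Because each server returns its own document in the shared format, the client can layer annotations from independent third-party annotators without any server-side coordination, which is the core idea of DAS.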

  12. Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX

    PubMed Central

    Martínez Barrio, Álvaro; Lagercrantz, Erik; Sperber, Göran O; Blomberg, Jonas; Bongcam-Rudloff, Erik

    2009-01-01

    Background The Distributed Annotation System (DAS) is a widely used network protocol for sharing biological information. The distributed aspects of the protocol enable the use of various reference and annotation servers for connecting biological sequence data to pertinent annotations in order to depict an integrated view of the data for the final user. Results An annotation server has been devised to provide information about the endogenous retroviruses detected and annotated by a specialized in silico tool called RetroTector. We describe the procedure to implement the DAS 1.5 protocol commands necessary for constructing the DAS annotation server. We use our server to exemplify those steps. Data distribution is kept separated from visualization, which is carried out by eBioX, an easy to use open source program incorporating multiple bioinformatics utilities. Some well characterized endogenous retroviruses are shown in two different DAS clients. A rapid analysis of areas free from retroviral insertions could be facilitated by our annotations. Conclusion The DAS protocol has shown to be advantageous in the distribution of endogenous retrovirus data. The distributed nature of the protocol is also found to aid in combining annotation and visualization along a genome in order to enhance the understanding of ERV contribution to its evolution. Reference and annotation servers are conjointly used by eBioX to provide visualization of ERV annotations as well as other data sources. Our DAS data source can be found in the central public DAS service repository. PMID:19534743

  13. A System for the Semantic Multimodal Analysis of News Audio-Visual Content

    NASA Astrophysics Data System (ADS)

    Mezaris, Vasileios; Gidaros, Spyros; Papadopoulos, Georgios Th; Kasper, Walter; Steffen, Jörg; Ordelman, Roeland; Huijbregts, Marijn; de Jong, Franciska; Kompatsiaris, Ioannis; Strintzis, Michael G

    2010-12-01

    News-related content is nowadays among the most popular types of content for users in everyday applications. Although the generation and distribution of news content has become commonplace, due to the availability of inexpensive media capturing devices and the development of media sharing services targeting both professional and user-generated news content, the automatic analysis and annotation that is required for supporting intelligent search and delivery of this content remains an open issue. In this paper, a complete architecture for knowledge-assisted multimodal analysis of news-related multimedia content is presented, along with its constituent components. The proposed analysis architecture employs state-of-the-art methods for the analysis of each individual modality (visual, audio, text) separately and proposes a novel fusion technique based on the particular characteristics of news-related content for the combination of the individual modality analysis results. Experimental results on news broadcast video illustrate the usefulness of the proposed techniques in the automatic generation of semantic annotations.

  14. An Evaluation of a Natural Language Processing Tool for Identifying and Encoding Allergy Information in Emergency Department Clinical Notes

    PubMed Central

    Goss, Foster R.; Plasek, Joseph M.; Lau, Jason J.; Seger, Diane L.; Chang, Frank Y.; Zhou, Li

    2014-01-01

    Emergency department (ED) visits due to allergic reactions are common. Allergy information is often recorded in free-text provider notes; however, this domain has not yet been widely studied by the natural language processing (NLP) community. We developed an allergy module built on the MTERMS NLP system to identify and encode food, drug, and environmental allergies and allergic reactions. The module included updates to our lexicon using standard terminologies, and novel disambiguation algorithms. We developed an annotation schema and annotated 400 ED notes that served as a gold standard for comparison to MTERMS output. MTERMS achieved an F-measure of 87.6% for the detection of allergen names and no known allergies, 90% for identifying true reactions in each allergy statement where true allergens were also identified, and 69% for linking reactions to their allergen. These preliminary results demonstrate the feasibility of using NLP to extract and encode allergy information from clinical notes. PMID:25954363

  15. Citation Sentiment Analysis in Clinical Trial Papers

    PubMed Central

    Xu, Jun; Zhang, Yaoyun; Wu, Yonghui; Wang, Jingqi; Dong, Xiao; Xu, Hua

    2015-01-01

    In scientific writing, positive credits and negative criticisms can often be seen in the text mentioning the cited papers, providing useful information about whether a study can be reproduced or not. In this study, we focus on citation sentiment analysis, which aims to determine the sentiment polarity that the citation context carries towards the cited paper. A citation sentiment corpus was first annotated on clinical trial papers. The effectiveness of n-gram and sentiment lexicon features, and problem-specific structure features for citation sentiment analysis was then examined using the annotated corpus. The combined features from the word n-grams, the sentiment lexicons and the structure information achieved the highest Micro-F score of 0.860 and Macro-F score of 0.719, indicating that it is feasible to use machine learning methods for citation sentiment analysis in biomedical publications. A comprehensive comparison between citation sentiment analysis in clinical trial papers and in other general domains was conducted, which additionally highlights the unique challenges within this domain. PMID:26958274
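    The feature families described (word n-grams plus sentiment-lexicon counts) can be sketched as a simple extractor over a single citation context. The tokenization, lexicons, and example sentence below are illustrative assumptions, not the paper's actual pipeline.

    ```python
    # Toy n-gram + sentiment-lexicon feature extractor for one citation
    # context; lexicons and example are hypothetical, not the paper's.
    from collections import Counter

    def citation_features(context, pos_lexicon, neg_lexicon):
        """Uni/bigram counts plus sentiment-lexicon hit counts."""
        tokens = context.lower().replace(",", "").replace(".", "").split()
        feats = Counter(tokens)                                     # unigrams
        feats.update(" ".join(p) for p in zip(tokens, tokens[1:]))  # bigrams
        feats["LEX_POS"] = sum(t in pos_lexicon for t in tokens)
        feats["LEX_NEG"] = sum(t in neg_lexicon for t in tokens)
        return feats

    pos = {"robust", "effective", "reproducible"}
    neg = {"failed", "biased", "inconsistent"}
    f = citation_features(
        "The trial in [3] was robust but its follow-up failed.", pos, neg)
    ```

    A classifier trained on such sparse feature vectors is one plausible realization of the feature combination the abstract evaluates.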

  16. Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae

    PubMed Central

    Meng, Shaowu; Brown, Douglas E; Ebbole, Daniel J; Torto-Alalibo, Trudy; Oh, Yeon Yee; Deng, Jixin; Mitchell, Thomas K; Dean, Ralph A

    2009-01-01

    Background Magnaporthe oryzae, the causal agent of rice blast, causes the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced and an automated annotation has recently been updated to Version 6. However, a comprehensive manual curation remains to be performed. Gene Ontology (GO) annotation is a valuable means of assigning functional information using standardized vocabulary. We report an overview of the GO annotation for Version 5 of the M. oryzae genome assembly. Methods A similarity-based (i.e., computational) GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For similarity-based GO annotation, a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated with manually reviewed data, conserved domains, or data determined by wet lab experiments. Additionally, the biological appropriateness of the functional assignments was manually checked. Results In total, 6,286 proteins received GO term assignment via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for the Plant-Associated Microbe Gene Ontology (PAMGO). In addition, 67 experimentally determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57%) being annotated with 1,957 distinct and specific GO terms. Unannotated proteins were assigned to the 3 root terms. The Version 5 GO annotation is publicly queryable via the GO site. 
Additionally, the genome of M. oryzae is constantly being refined and updated as new information is incorporated. For the latest GO annotation of the Version 6 genome, please visit our website. The preliminary GO annotation of the Version 6 genome is stored in a local MySQL database that is publicly queryable via a user-friendly interface, the Ad hoc Query System. Conclusion Our analysis provides comprehensive and robust GO annotations of the M. oryzae genome assemblies that will provide a solid foundation for further functional interrogation of M. oryzae. PMID:19278556
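The reciprocal best hits criterion used in the similarity-based annotation above can be sketched as follows. This is a minimal illustration with invented protein IDs and alignment scores, not the study's actual pipeline: a query protein inherits GO terms only when it and a GO-annotated protein are each other's top-scoring hit.

```python
# Hypothetical sketch of the reciprocal best hits (RBH) idea: hit lists
# and scores here are illustrative stand-ins for real alignment output.

def best_hit(hits):
    """Return the partner with the highest alignment score, or None."""
    return max(hits, key=lambda h: h[1])[0] if hits else None

def reciprocal_best_hits(query_hits, subject_hits):
    """Yield (query, subject) pairs that are mutual best hits."""
    for query, hits in query_hits.items():
        subject = best_hit(hits)
        if subject is not None and best_hit(subject_hits.get(subject, [])) == query:
            yield query, subject

# Toy alignment tables: protein -> [(partner, score), ...]
moryzae = {"MGG_01234": [("P0A9Q5", 310.0), ("Q9XYZ1", 120.0)]}
go_db   = {"P0A9Q5":   [("MGG_01234", 305.0), ("MGG_99999", 90.0)]}

pairs = list(reciprocal_best_hits(moryzae, go_db))
print(pairs)  # [('MGG_01234', 'P0A9Q5')]
```

Only mutual-best pairs survive, which is why the method is described as stringent: a one-directional best hit is not enough to transfer annotation.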

  17. Red flag screening for low back pain: nothing to see here, move along: a narrative review.

    PubMed

    Cook, Chad E; George, Steven Z; Reiman, Michael P

    2018-04-01

    Screening for red flags in individuals with low back pain (LBP) has been a historical hallmark of musculoskeletal management. Red flag screening is endorsed by most LBP clinical practice guidelines, despite a lack of support for its diagnostic capacity. We share four major reasons why red flag screening is not consistent with best practice in LBP management: (1) clinicians do not actually screen for red flags, they manage the findings; (2) red flag symptomology negates the utility of clinical findings; (3) the tests lack the negative likelihood ratio needed to serve as a screen; and (4) clinical practice guidelines do not include specific processes that aid decision-making. Based on these findings, we propose that clinicians consider: (1) the importance of watchful waiting; (2) that value-based care does not support a clinical examination driven by red flag symptoms; and (3) that red flag symptoms may have a stronger relationship with prognosis than with diagnosis. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  18. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation

    PubMed Central

    Beijbom, Oscar; Edmunds, Peter J.; Roelfsema, Chris; Smith, Jennifer; Kline, David I.; Neal, Benjamin P.; Dunlap, Matthew J.; Moriarty, Vincent; Fan, Tung-Yung; Tan, Chih-Jui; Chan, Stephen; Treibitz, Tali; Gamst, Anthony; Mitchell, B. Greg; Kriegman, David

    2015-01-01

    Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time-consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey images captured at four Pacific coral reefs. Inter- and intra-annotator variability among six human experts was quantified and compared to semi- and fully-automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys. PMID:26154157
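The semi-automated mode described above can be sketched as a confidence-gated hand-off: a classifier keeps its label for a point when confident and defers uncertain points to a human, after which cover is estimated as the share of points per category. Labels, confidences, and the threshold below are invented for illustration; this is not the CoralNet implementation.

```python
from collections import Counter

def semi_automated(points, human_label, threshold=0.8):
    """points: [(auto_label, confidence)]; defer low-confidence points to a human."""
    return [lab if conf >= threshold else human_label(i)
            for i, (lab, conf) in enumerate(points)]

def percent_cover(labels):
    """Benthic cover estimate: fraction of annotated points per category."""
    counts = Counter(labels)
    return {cat: n / len(labels) for cat, n in counts.items()}

# Toy survey points: two confident coral calls, two uncertain algal calls.
points = [("coral", 0.95), ("turf", 0.55), ("coral", 0.90), ("CCA", 0.60)]
labels = semi_automated(points, human_label=lambda i: "turf")
print(percent_cover(labels))  # {'coral': 0.5, 'turf': 0.5}
```

Raising the threshold shifts more decisions to the human annotator, trading annotation speed for the expert-level agreement the study reports.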

  19. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation.

    PubMed

    Lugli, Gabriele Andrea; Milani, Christian; Mancabelli, Leonardo; van Sinderen, Douwe; Ventura, Marco

    2016-04-01

    Genome annotation is one of the key actions that must be undertaken in order to decipher the genetic blueprint of organisms. Thus, a correct and reliable annotation is essential in rendering genomic data valuable. Here, we describe a bioinformatics pipeline based on freely available software programs coordinated by a multithreaded script named MEGAnnotator (Multithreaded Enhanced prokaryotic Genome Annotator). This pipeline allows the generation of multiple annotated formats fulfilling the NCBI guidelines for assembled microbial genome submission, based on DNA shotgun sequencing reads, and minimizes manual intervention, while also reducing waiting times between software program executions and improving final quality of both assembly and annotation outputs. MEGAnnotator provides an efficient way to pre-arrange the assembly and annotation work required to process NGS genome sequence data. The script improves the final quality of microbial genome annotation by reducing ambiguous annotations. Moreover, the MEGAnnotator platform allows the user to perform a partial annotation of pre-assembled genomes and includes an option to accomplish metagenomic data set assemblies. MEGAnnotator platform will be useful for microbiologists interested in genome analyses of bacteria as well as those investigating the complexity of microbial communities that do not possess the necessary skills to prepare their own bioinformatics pipeline. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  20. Gene function prediction based on the Gene Ontology hierarchical structure.

    PubMed

    Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao

    2014-01-01

    The information in the Gene Ontology annotation is helpful in the explanation of life science phenomena, and can provide great support for research in the biomedical field. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform it into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance between positive and negative training samples. Meanwhile, the method enhances the discriminating ability of classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationships among target classes into consideration and thus resolves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value of 50.7% (precision: 52.7%, recall: 48.9%). The experimental results demonstrate that when the size of the training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to sets of texts in an ontology structure or with a hierarchical relationship.
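The top-down idea above can be sketched as descending an ontology tree and testing a gene against a node's classifier only when its parent was predicted positive, so predictions can never contradict the hierarchy. The tree and the keyword-membership "classifiers" below are invented for illustration; the paper's classifiers are learned from text.

```python
# Minimal sketch of hierarchy-consistent top-down classification.
def top_down_classify(item, tree, classify, node="root"):
    """Return all nodes predicted positive, descending only through positives."""
    positives = [node]
    for child in tree.get(node, []):
        if classify(item, child):
            positives += top_down_classify(item, tree, classify, child)
    return positives

# Toy ontology and per-node positive sets (stand-ins for trained classifiers).
tree = {"root": ["metabolism", "signaling"], "metabolism": ["lipid_metabolism"]}
terms_for = {"metabolism": {"g1", "g2"}, "lipid_metabolism": {"g1"}, "signaling": {"g3"}}
classify = lambda gene, node: gene in terms_for.get(node, set())

print(top_down_classify("g1", tree, classify))
# ['root', 'metabolism', 'lipid_metabolism']
```

Because a child is only reached through a positive parent, the output is always a rooted subtree, which is the consistency property the abstract highlights.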

  1. Extracting PICO Sentences from Clinical Trial Reports using Supervised Distant Supervision

    PubMed Central

    Wallace, Byron C.; Kuiper, Joël; Sharma, Aakash; Zhu, Mingxi (Brian); Marshall, Iain J.

    2016-01-01

    Systematic reviews underpin Evidence Based Medicine (EBM) by addressing precise clinical questions via comprehensive synthesis of all relevant published evidence. Authors of systematic reviews typically define a Population/Problem, Intervention, Comparator, and Outcome (a PICO criteria) of interest, and then retrieve, appraise and synthesize results from all reports of clinical trials that meet these criteria. Identifying PICO elements in the full-texts of trial reports is thus a critical yet time-consuming step in the systematic review process. We seek to expedite evidence synthesis by developing machine learning models to automatically extract sentences from articles relevant to PICO elements. Collecting a large corpus of training data for this task would be prohibitively expensive. Therefore, we derive distant supervision (DS) with which to train models using previously conducted reviews. DS entails heuristically deriving ‘soft’ labels from an available structured resource. However, we have access only to unstructured, free-text summaries of PICO elements for corresponding articles; we must derive from these the desired sentence-level annotations. To this end, we propose a novel method – supervised distant supervision (SDS) – that uses a small amount of direct supervision to better exploit a large corpus of distantly labeled instances by learning to pseudo-annotate articles using the available DS. We show that this approach tends to outperform existing methods with respect to automated PICO extraction. PMID:27746703
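The distant-supervision step above can be sketched as scoring each sentence in a trial report by its word overlap with the review's free-text PICO summary, yielding 'soft' sentence labels. The summary and sentences below are invented, and the real SDS method learns the pseudo-annotation mapping rather than fixing a similarity rule.

```python
# Toy derivation of soft sentence labels from an unstructured PICO summary.
def soft_labels(sentences, summary):
    """Score each sentence by Jaccard word overlap with the summary text."""
    target = set(summary.lower().split())
    scores = []
    for s in sentences:
        words = set(s.lower().split())
        scores.append(len(words & target) / len(words | target))
    return scores

summary = "adults with type 2 diabetes"
sentences = ["We enrolled adults with type 2 diabetes mellitus",
             "Funding was provided by the sponsor"]
scores = soft_labels(sentences, summary)
print(scores[0] > scores[1])  # True
```

Sentences scoring highly against the summary become positive training instances for the PICO sentence extractor, which is what makes a large corpus of prior reviews usable without direct sentence-level labels.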

  2. compMS2Miner: An Automatable Metabolite Identification, Visualization, and Data-Sharing R Package for High-Resolution LC-MS Data Sets.

    PubMed

    Edmands, William M B; Petrick, Lauren; Barupal, Dinesh K; Scalbert, Augustin; Wilson, Mark J; Wickliffe, Jeffrey K; Rappaport, Stephen M

    2017-04-04

    A long-standing challenge of untargeted metabolomic profiling by ultrahigh-performance liquid chromatography-high-resolution mass spectrometry (UHPLC-HRMS) is efficient transition from unknown mass spectral features to confident metabolite annotations. The compMS2Miner (Comprehensive MS2 Miner) package was developed in the R language to facilitate rapid, comprehensive feature annotation using peak-picker output and MS2 data files as inputs. Far more MS2 spectra can be collected during a metabolomic profiling experiment than can feasibly undergo painstaking manual interpretation; therefore, a degree of software workflow autonomy is required for broad-scale metabolite annotation. CompMS2Miner integrates many useful tools in a single workflow for metabolite annotation and also provides a means to overview the MS2 data with a web application GUI, compMS2Explorer (Comprehensive MS2 Explorer), that also facilitates data sharing and transparency. The automatable compMS2Miner workflow consists of the following steps: (i) matching unknown MS1 features to precursor MS2 scans, (ii) filtration of spectral noise (dynamic noise filter), (iii) generation of composite mass spectra by multiple similar spectrum signal summation and redundant/contaminant spectra removal, (iv) interpretation of possible fragment ion substructure using an internal database, (v) annotation of unknowns with chemical and spectral databases with prediction of mammalian biotransformation metabolites, wrapper functions for in silico fragmentation software, nearest neighbor chemical similarity scoring, random forest based retention time prediction, text-mining based false positive removal/true positive ranking, chemical taxonomic prediction and differential evolution based global annotation score optimization, and (vi) network graph visualizations, data curation, and sharing via the compMS2Explorer application. 
Metabolite identities and comments can also be recorded using an interactive table within compMS2Explorer. The utility of the package is illustrated with a data set of blood serum samples from 7 diet-induced obese (DIO) and 7 nonobese (NO) C57BL/6J mice, which were also treated with an antibiotic (streptomycin) to knock down the gut microbiota. The results of fully autonomous and objective usage of compMS2Miner are presented here. All automatically annotated spectra output by the workflow are provided in the Supporting Information and can alternatively be explored as publicly available compMS2Explorer applications for both positive and negative modes ( https://wmbedmands.shinyapps.io/compMS2_mouseSera_POS and https://wmbedmands.shinyapps.io/compMS2_mouseSera_NEG ). The workflow provided rapid annotation of a diversity of endogenous and gut microbiota-derived metabolites affected by both diet and antibiotic treatment, which conformed to previously published reports. Composite spectra (n = 173) were autonomously matched to entries of the MassBank of North America (MoNA) spectral repository. These experimental and virtual (lipidBlast) spectra corresponded to 29 common endogenous compound classes (e.g., 51 lysophosphatidylcholine spectra) and were then used to calculate the ranking capability of 7 individual scoring metrics. An average of the 7 individual scoring metrics provided the most effective ranking (weighted average rank of 3) for the MoNA-matched spectra, in spite of the potential risk of false positive annotations emerging from automation. Minor structural differences, such as relative carbon-carbon double bond positions, were found in several cases to affect the correct rank of the MoNA-annotated metabolite. 
The latest release and an example workflow are available in the package vignette ( https://github.com/WMBEdmands/compMS2Miner ), and a version of the published application is available on the shinyapps.io site ( https://wmbedmands.shinyapps.io/compMS2Example ).
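Step (i) of the workflow, matching unknown MS1 features to precursor MS2 scans, can be sketched as a tolerance search in mass (ppm) and retention time. The feature and scan values below are fabricated, and compMS2Miner's actual matching logic lives in the R package; this Python sketch only illustrates the idea.

```python
# Hedged sketch of MS1-feature-to-MS2-precursor matching by m/z and RT tolerance.
def match_features(features, ms2_scans, ppm=10.0, rt_tol=5.0):
    """Pair each MS1 feature (mz, rt) with MS2 scans inside both tolerances."""
    matches = {}
    for fid, (mz, rt) in features.items():
        tol = mz * ppm / 1e6  # absolute m/z tolerance from relative ppm
        matches[fid] = [sid for sid, (pmz, prt) in ms2_scans.items()
                        if abs(pmz - mz) <= tol and abs(prt - rt) <= rt_tol]
    return matches

# Invented example: one feature, one in-tolerance scan and one decoy.
features = {"F1": (524.3718, 312.0)}
ms2_scans = {"S1": (524.3720, 310.5), "S2": (524.4300, 311.0)}
print(match_features(features, ms2_scans))  # {'F1': ['S1']}
```

Using a ppm-relative mass window rather than a fixed dalton window keeps the tolerance proportionate across the m/z range, which is the usual convention for high-resolution instruments.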

  3. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
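Sma3s's three-module progression can be sketched as a fallback chain: try the most reliable evidence first and fall back to weaker evidence only when nothing is found. The module functions and GO terms below are invented stand-ins, not the tool's real algorithms.

```python
# Illustrative sketch of progressive, module-by-module annotation.
def annotate(seq, modules):
    """Return the first non-empty annotation produced by the module chain."""
    for name, module in modules:
        result = module(seq)
        if result:
            return name, result
    return "unannotated", set()

# Toy modules mirroring the three Sma3s stages; here (i) finds nothing,
# so the query falls through to (ii).
modules = [
    ("close_homologue", lambda s: set()),            # (i) very similar homologues
    ("orthologue",      lambda s: {"GO:0016301"}),   # (ii) orthologous sequences
    ("enriched_terms",  lambda s: {"GO:0008150"}),   # (iii) terms enriched in homologous groups
]
print(annotate("MKT...", modules))  # ('orthologue', {'GO:0016301'})
```

Ordering the modules by reliability is what lets the pipeline trade a single global similarity threshold for graduated evidence, the configuration problem the abstract describes.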

  4. Community annotation and bioinformatics workforce development in concert--Little Skate Genome Annotation Workshops and Jamborees.

    PubMed

    Wang, Qinghua; Arighi, Cecilia N; King, Benjamin L; Polson, Shawn W; Vincent, James; Chen, Chuming; Huang, Hongzhan; Kingham, Brewster F; Page, Shallee T; Rendino, Marc Farnum; Thomas, William Kelley; Udwary, Daniel W; Wu, Cathy H

    2012-01-01

    Recent advances in high-throughput DNA sequencing technologies have equipped biologists with a powerful new set of tools for advancing research goals. The resulting flood of sequence data has made it critically important to train the next generation of scientists to handle the inherent bioinformatic challenges. The North East Bioinformatics Collaborative (NEBC) is undertaking the genome sequencing and annotation of the little skate (Leucoraja erinacea) to promote advancement of bioinformatics infrastructure in our region, with an emphasis on practical education to create a critical mass of informatically savvy life scientists. In support of the Little Skate Genome Project, the NEBC members have developed several annotation workshops and jamborees to provide training in genome sequencing, annotation and analysis. Acting as a nexus for both curation activities and dissemination of project data, a project web portal, SkateBase (http://skatebase.org) has been developed. As a case study to illustrate effective coupling of community annotation with workforce development, we report the results of the Mitochondrial Genome Annotation Jamborees organized to annotate the first completely assembled element of the Little Skate Genome Project, as a culminating experience for participants from our three prior annotation workshops. We are applying the physical/virtual infrastructure and lessons learned from these activities to enhance and streamline the genome annotation workflow, as we look toward our continuing efforts for larger-scale functional and structural community annotation of the L. erinacea genome.

  5. Community annotation and bioinformatics workforce development in concert—Little Skate Genome Annotation Workshops and Jamborees

    PubMed Central

    Wang, Qinghua; Arighi, Cecilia N.; King, Benjamin L.; Polson, Shawn W.; Vincent, James; Chen, Chuming; Huang, Hongzhan; Kingham, Brewster F.; Page, Shallee T.; Farnum Rendino, Marc; Thomas, William Kelley; Udwary, Daniel W.; Wu, Cathy H.

    2012-01-01

    Recent advances in high-throughput DNA sequencing technologies have equipped biologists with a powerful new set of tools for advancing research goals. The resulting flood of sequence data has made it critically important to train the next generation of scientists to handle the inherent bioinformatic challenges. The North East Bioinformatics Collaborative (NEBC) is undertaking the genome sequencing and annotation of the little skate (Leucoraja erinacea) to promote advancement of bioinformatics infrastructure in our region, with an emphasis on practical education to create a critical mass of informatically savvy life scientists. In support of the Little Skate Genome Project, the NEBC members have developed several annotation workshops and jamborees to provide training in genome sequencing, annotation and analysis. Acting as a nexus for both curation activities and dissemination of project data, a project web portal, SkateBase (http://skatebase.org) has been developed. As a case study to illustrate effective coupling of community annotation with workforce development, we report the results of the Mitochondrial Genome Annotation Jamborees organized to annotate the first completely assembled element of the Little Skate Genome Project, as a culminating experience for participants from our three prior annotation workshops. We are applying the physical/virtual infrastructure and lessons learned from these activities to enhance and streamline the genome annotation workflow, as we look toward our continuing efforts for larger-scale functional and structural community annotation of the L. erinacea genome. PMID:22434832

  6. Automated analysis and reannotation of subcellular locations in confocal images from the Human Protein Atlas.

    PubMed

    Li, Jieyue; Newberg, Justin Y; Uhlén, Mathias; Lundberg, Emma; Murphy, Robert F

    2012-01-01

    The Human Protein Atlas contains immunofluorescence images showing subcellular locations for thousands of proteins. These are currently annotated by visual inspection. In this paper, we describe automated approaches to analyze the images and their use to improve annotation. We began by training classifiers to recognize the annotated patterns. By ranking proteins according to the confidence of the classifier, we generated a list of proteins that were strong candidates for reexamination. In parallel, we applied hierarchical clustering to group proteins and identified proteins whose annotations were inconsistent with the remainder of the proteins in their cluster. These proteins were reexamined by the original annotators, and a significant fraction had their annotations changed. The results demonstrate that automated approaches can provide an important complement to visual annotation.

  7. Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

    PubMed

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2015-12-01

    This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. Copyright © 2015 Elsevier Inc. All rights reserved.

  8. ODMedit: uniform semantic annotation for data integration in medicine based on a public metadata repository.

    PubMed

    Dugas, Martin; Meidt, Alexandra; Neuhaus, Philipp; Storck, Michael; Varghese, Julian

    2016-06-01

    The volume and complexity of patient data - especially in personalised medicine - is steadily increasing, regarding both clinical data and genomic profiles: typically more than 1,000 items (e.g., laboratory values, vital signs, diagnostic tests, etc.) are collected per patient in clinical trials. In oncology, hundreds of mutations can potentially be detected for each patient by genomic profiling. Data integration from multiple sources therefore constitutes a key challenge for medical research and healthcare. Semantic annotation of data elements can facilitate the identification of matching data elements in different sources and thereby support data integration. Millions of different annotations are required due to the semantic richness of patient data. These annotations should be uniform, i.e., two matching data elements shall contain the same annotations. However, large terminologies like SNOMED CT or UMLS do not provide uniform coding. It is proposed to develop semantic annotations of medical data elements based on a large-scale public metadata repository. To achieve uniform codes, semantic annotations shall be re-used if a matching data element is available in the metadata repository. A web-based tool called ODMedit ( https://odmeditor.uni-muenster.de/ ) was developed to create data models with uniform semantic annotations. It contains ~800,000 terms with semantic annotations, which were derived from ~5,800 models from the portal of medical data models (MDM). The tool was successfully applied to manually annotate 22 forms with 292 data items from CDISC and to update 1,495 data models of the MDM portal. Uniform manual semantic annotation of data models is feasible in principle, but requires a large-scale collaborative effort due to the semantic richness of patient data. A web-based tool for these annotations is available, which is linked to a public metadata repository.
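The reuse rule above can be sketched simply: when a matching data element already exists in the metadata repository, its semantic codes are reused, so two matching elements always carry identical annotations. The repository contents and the UMLS-style codes below are invented for illustration, not taken from the MDM portal.

```python
# Simplified sketch of repository-backed annotation reuse.
repository = {"systolic blood pressure": ["C0871470"],
              "heart rate": ["C0018810"]}

def annotate_item(label, repo):
    """Reuse repository codes for a known label; flag new labels for curation."""
    key = label.strip().lower()
    if key in repo:
        return repo[key]          # uniform: identical codes for matching elements
    repo[key] = []                # new element: queued for manual annotation
    return repo[key]

print(annotate_item("Systolic Blood Pressure", repository))  # ['C0871470']
```

Normalizing the label before lookup is the minimal matching step; the real tool matches data elements across ~5,800 models, which is where the collaborative curation effort comes in.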

  9. RNA-seq analysis of Rubus idaeus cv. Nova: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches.

    PubMed

    Hyun, Tae Kyung; Lee, Sarah; Kumar, Dhinesh; Rim, Yeonggil; Kumar, Ritesh; Lee, Sang Yeol; Lee, Choong Hwan; Kim, Jae-Yean

    2014-10-01

    Using Illumina sequencing technology, we generated large-scale transcriptome sequencing data containing abundant information on genes involved in the metabolic pathways of R. idaeus cv. Nova fruits. Rubus idaeus (red raspberry) is an economically important crop that possesses numerous nutrients, micronutrients and phytochemicals with essential health benefits to humans. The molecular mechanism underlying the ripening process and phytochemical biosynthesis in red raspberry is attributed to changes in gene expression, but very limited transcriptomic and genomic information is available in public databases. To address this issue, we generated more than 51 million sequencing reads from R. idaeus cv. Nova fruit using Illumina RNA-Seq technology. After de novo assembly, we obtained 42,604 unigenes with an average length of 812 bp. At the protein level, the Nova fruit transcriptome showed 77 and 68 % sequence similarities with Rubus coreanus and Fragaria vesca, respectively, indicating the evolutionary relationship between them. In addition, 69 % of assembled unigenes were annotated using public databases including the NCBI non-redundant, Cluster of Orthologous Groups and Gene Ontology databases, suggesting that our transcriptome dataset provides a valuable resource for investigating metabolic processes in red raspberry. To analyze the relationship between several novel transcripts and the amounts of metabolites such as γ-aminobutyric acid and anthocyanins, real-time PCR and target metabolite analysis were performed on two different ripening stages of Nova. This is the first attempt to use the Illumina sequencing platform for RNA sequencing and de novo assembly of Nova fruit without a reference genome. Our data provide the most comprehensive transcriptome resource available for Rubus fruits, and will be useful for understanding the ripening process and for breeding R. idaeus cultivars with improved fruit quality.

  10. Developing expressed sequence tag libraries and the discovery of simple sequence repeat markers for two species of raspberry (Rubus L.).

    PubMed

    Bushakra, Jill M; Lewers, Kim S; Staton, Margaret E; Zhebentyayeva, Tetyana; Saski, Christopher A

    2015-10-26

    Due to a relatively high level of codominant inheritance and transferability within and among taxonomic groups, simple sequence repeat (SSR) markers are important elements in comparative mapping and delineation of genomic regions associated with traits of economic importance. Expressed sequence tags (ESTs) are a source of SSRs that can be used to develop markers to facilitate plant breeding and for more basic research across genera and higher plant orders. Leaf and meristem tissue from 'Heritage' red raspberry (Rubus idaeus) and 'Bristol' black raspberry (R. occidentalis) were utilized for RNA extraction. After conversion to cDNA and library construction, ESTs were sequenced, quality-verified, assembled and scanned for SSRs. Primers flanking the SSRs were designed and a subset tested for amplification, polymorphism and transferability across species. ESTs containing SSRs were functionally annotated using the GenBank non-redundant (nr) database and further classified using the Gene Ontology database. To accelerate development of EST-SSRs in the genus Rubus (Rosaceae), 1149 and 2358 cDNA sequences were generated from red raspberry and black raspberry, respectively. The cDNA sequences were screened using rigorous filtering criteria, which resulted in the identification of 121 and 257 SSR loci for red and black raspberry, respectively. Primers were designed from the surrounding sequences, resulting in 131 and 288 primer pairs, respectively, as some sequences contained more than one SSR locus. Sequence analysis revealed that the SSR-containing genes span a diversity of functions and share more sequence identity with strawberry genes than with those of other Rosaceous species. This resource of Rubus-specific, gene-derived markers will facilitate the construction of linkage maps composed of transferable markers for studying and manipulating important traits in this economically important genus.
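The SSR-scanning step above can be sketched with a backreference regular expression that finds short tandem repeats. The motif lengths (di- and trinucleotide), the minimum of 5 repeat units, and the toy EST sequence are illustrative choices, not the paper's filtering criteria.

```python
import re

# Minimal sketch of EST-SSR discovery: a 2-3 bp motif repeated >= 5 times.
SSR_RE = re.compile(r"(([ACGT]{2,3})\2{4,})")

def find_ssrs(seq):
    """Return (motif, repeat_count, start) for each SSR found in seq."""
    return [(m.group(2), len(m.group(1)) // len(m.group(2)), m.start())
            for m in SSR_RE.finditer(seq)]

# Invented EST with one dinucleotide and one trinucleotide repeat.
est = "TTGCA" + "AG" * 7 + "CCGTA" + "ATC" * 6 + "GG"
print(find_ssrs(est))  # [('AG', 7, 5), ('ATC', 6, 24)]
```

Loci found this way are then the anchors around which flanking primers are designed; real pipelines add further filters (e.g., minimum flanking sequence for primer design, compound-repeat handling) before a locus becomes a candidate marker.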

  11. Tinder Fire in Arizona Viewed by NASA's MISR

    NASA Image and Video Library

    2018-05-02

    On April 27, 2018, the Tinder Fire ignited in eastern Arizona near the Blue Ridge Reservoir, about 50 miles (80 kilometers) southeast of Flagstaff and 20 miles (32 kilometers) northeast of Payson. During the first 24 hours it remained relatively small at 500 acres (202 hectares), but on April 29, during red flag wind conditions, it exploded to 8,600 acres (3,480 hectares). Residents of rural communities in the area were forced to evacuate, and an unknown number of structures were burned. As of April 30, the Tinder Fire had burned a total of 11,400 acres (4,613 hectares). On April 30 at 11:15 a.m. local time, the Multi-angle Imaging SpectroRadiometer (MISR), flying aboard NASA's Terra satellite, captured imagery of the Tinder Fire as it passed overhead. The MISR instrument has nine cameras that view Earth at different angles. This image shows the view from MISR's nadir (downward-pointing) camera. The angular information from MISR's images is used to calculate the height of the smoke plume, the results of which are superimposed on the right-hand image. This shows that the plume top near the active fire was at approximately 13,000 feet (4,000 meters) altitude. In general, higher-altitude plumes transport smoke greater distances from the source, impacting communities downwind. A stereo anaglyph providing a three-dimensional view of the plume is also shown. Red-blue glasses with the red lens placed over your left eye are required to observe the 3D effect. These data were acquired during Terra orbit 97691. An annotated figure and anaglyph are available at https://photojournal.jpl.nasa.gov/catalog/PIA00698

  12. Comparative transcriptome analysis of three color variants of the sea cucumber Apostichopus japonicus.

    PubMed

    Jo, Jihoon; Park, Jongsun; Lee, Hyun-Gwan; Kern, Elizabeth M A; Cheon, Seongmin; Jin, Soyeong; Park, Joong-Ki; Cho, Sung-Jin; Park, Chungoo

    2016-08-01

    The sea cucumber Apostichopus japonicus Selenka 1867 represents an important resource in biomedical research, traditional medicine, and the seafood industry. Much of the commercial value of A. japonicus is determined by dorsal/ventral color variation (red, green, and black), yet the taxonomic relationships between these color variants are not clearly understood. We performed the first comparative analysis of de novo assembled transcriptome data from three color variants of A. japonicus. Using the Illumina platform, we sequenced 177,596,774 clean reads representing a total of 18.2 Gbp of sea cucumber transcriptome. A comparison of over 0.3 million transcript scaffolds against the UniProt/Swiss-Prot database yielded 8513, 8602, and 8588 positive matches for the green, red, and black body color transcriptomes, respectively. Using the PANTHER gene classification system, we assessed an extensive and diverse set of expressed genes in the three color variants and found that (1) among the three color variants of A. japonicus, genes associated with RNA binding protein, oxidoreductase, nucleic acid binding, transferase, and KRAB box transcription factor were most commonly expressed; and (2) the main protein functional classes are differently regulated in the three color variants (extracellular matrix protein and phosphatase for green, transporter and potassium channel for red, and G-protein modulator and enzyme modulator for black). This work will assist in the discovery and annotation of novel genes that play significant morphological and physiological roles in color variants of A. japonicus, and these sequence data will provide a useful set of resources for the rapidly growing sea cucumber aquaculture industry. Copyright © 2016 Elsevier B.V. All rights reserved.

  13. Genome-wide analysis of long non-coding RNAs at the mature stage of sea buckthorn (Hippophae rhamnoides Linn) fruit.

    PubMed

    Zhang, Guoyun; Duan, Aiguo; Zhang, Jianguo; He, Caiyun

    2017-01-05

    Long non-coding RNAs (lncRNAs), transcripts longer than 200 nt, potentially play important roles in almost all biological processes in plants and mammals. However, the functions and profiles of lncRNAs in fruit are less well understood, so it is both urgent and necessary to identify and analyze the functions of lncRNAs in sea buckthorn. Using RNA sequencing, we systematically identified lncRNAs in mature fruit from red and yellow sea buckthorn. We obtained 567,778,938 clean reads from six samples and identified 3428 lncRNAs in mature fruit, including 2498 intergenic lncRNAs, 593 antisense lncRNAs, and 337 intronic lncRNAs. We also identified 3819 and 2295 circular RNAs in red and yellow sea buckthorn fruit, respectively. In terms of gene architecture and expression, our results showed significant differences among the three lncRNA subtypes. We also investigated the effect of lncRNAs on their cis and trans target genes. Based on target gene analysis, we obtained 61 differentially expressed lncRNAs (DE-lncRNAs) between the two sea buckthorns, including 23 lncRNAs specifically expressed in red fruit and 22 specifically expressed in yellow fruit. Importantly, we found that several DE-lncRNAs act in cis and in trans on genes in the carotenoid biosynthesis, ascorbate and aldarate metabolism, and fatty acid metabolism pathways. Our study provides a resource for lncRNA research in mature fruit, may encourage deeper study of fruit coloring, and expands our knowledge of lncRNA biology and the annotation of the sea buckthorn genome. Copyright © 2016 Elsevier B.V. All rights reserved.
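    The three lncRNA subtypes reported above (intergenic, antisense, intronic) are defined by a transcript's position relative to annotated genes. The classification rule can be sketched as follows; this is a hypothetical minimal version, not the pipeline used in the study, and it ignores edge cases (multi-gene overlaps, partial exon overlap thresholds) that a real annotation workflow handles:

    ```python
    def classify_lncrna(lnc, genes):
        """Classify one lncRNA interval against annotated genes on a chromosome.

        lnc: (start, end, strand); genes: list of (start, end, strand, exons),
        where exons is a list of (start, end). Minimal sketch only.
        """
        for gs, ge, gstrand, exons in genes:
            if lnc[1] < gs or lnc[0] > ge:
                continue  # no overlap with this gene
            if gstrand != lnc[2]:
                return "antisense"   # overlaps a gene on the opposite strand
            if not any(lnc[0] < ee and lnc[1] > es for es, ee in exons):
                return "intronic"    # same strand, but overlaps introns only
            return "overlapping"     # exonic same-strand overlap: usually filtered out
        return "intergenic"          # overlaps no annotated gene

    print(classify_lncrna((5000, 6000, "+"), []))  # intergenic
    ```

    The same interval logic, applied genome-wide, yields the subtype counts reported in the abstract.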

  14. Current and future trends in marine image annotation software

    NASA Astrophysics Data System (ADS)

    Gomes-Pereira, Jose Nuno; Auger, Vincent; Beisiegel, Kolja; Benjamin, Robert; Bergmann, Melanie; Bowden, David; Buhl-Mortensen, Pal; De Leo, Fabio C.; Dionísio, Gisela; Durden, Jennifer M.; Edwards, Luke; Friedman, Ariell; Greinert, Jens; Jacobsen-Stout, Nancy; Lerner, Steve; Leslie, Murray; Nattkemper, Tim W.; Sameoto, Jessica A.; Schoening, Timm; Schouten, Ronald; Seager, James; Singh, Hanumant; Soubigou, Olivier; Tojeira, Inês; van den Beld, Inge; Dias, Frederico; Tempera, Fernando; Santos, Ricardo S.

    2016-12-01

    Given the need to describe, analyze and index large quantities of marine imagery data for exploration and monitoring activities, a range of specialized image annotation tools have been developed worldwide. Image annotation, the process of transposing objects or events represented in a video or still image to the semantic level, may involve human interaction and computer-assisted solutions. Marine image annotation software (MIAS) has enabled over 500 publications to date. We review functioning, application trends and developments by comparing general and advanced features of 23 different tools utilized in underwater image analysis. MIAS requiring human input are essentially a graphical user interface, with a video player or image browser that recognizes a specific time code or image code, allowing users to log events in a time-stamped (and/or geo-referenced) manner. MIAS differ from similar software in their capability to integrate data associated with video collection, the simplest being the position coordinates of the video recording platform. MIAS have three main characteristics: annotating events in real time, annotating after acquisition, and interacting with a database. These range from simple annotation interfaces to full onboard data management systems with a variety of toolboxes. Advanced packages allow users to input and display data from multiple sensors or multiple annotators via intranet or internet. Tools for posterior human-mediated annotation often include data display and image analysis features, e.g. length, area, image segmentation and point counts, and in a few cases the possibility of browsing and editing previous dive logs or analyzing the annotations. The interaction with a database allows the automatic integration of annotations from different surveys, repeated and collaborative annotation of shared datasets, and browsing and querying of data.
Progress in the field of automated annotation is mostly in post-processing, for stable platforms or still images. Integration into available MIAS is currently limited to semi-automated processes of pixel recognition through computer-vision modules that compile expert-based knowledge. Important topics aiding the choice of a specific software are outlined, the ideal software is discussed and future trends are presented.
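    The time-stamped, geo-referenced event log these tools maintain can be sketched as a minimal record structure. The field names below are hypothetical illustrations, not the schema of any specific MIAS package:

    ```python
    from dataclasses import dataclass, asdict
    from typing import Optional

    @dataclass
    class AnnotationEvent:
        """One logged observation: a label tied to a time code and, optionally,
        to the position of the video recording platform."""
        timecode: str              # video time code or still-image identifier
        label: str                 # observed object or event
        annotator: str             # who logged it (supports multi-annotator merging)
        lat: Optional[float] = None
        lon: Optional[float] = None

    # a dive log: two events, one with platform coordinates, one without
    log = [
        AnnotationEvent("00:12:31:05", "sea cucumber", "jd", 38.53, -28.63),
        AnnotationEvent("00:14:02:17", "litter: plastic", "jd"),
    ]

    # a database-backed MIAS would store rows like these and merge them across surveys
    rows = [asdict(e) for e in log]
    ```

    Querying and integrating such rows across surveys is exactly the database interaction the review identifies as a defining MIAS capability.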

  15. New challenges for text mining: mapping between text and manually curated pathways

    PubMed Central

    Oda, Kanae; Kim, Jin-Dong; Ohta, Tomoko; Okanohara, Daisuke; Matsuzaki, Takuya; Tateisi, Yuka; Tsujii, Jun'ichi

    2008-01-01

    Background Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge. Results To address these challenges, we constructed new resources to link text with a model pathway: the GENIA pathway corpus with event annotation, and the NF-κB pathway. Through their detailed analysis, we identify an untapped resource, ‘bio-inference,’ as well as the differences between text and pathway representations. Here, we show precise comparisons of their representations and the nine classes of ‘bio-inference’ schemes observed in the pathway corpus. Conclusions We believe that the creation of such rich resources and their detailed analysis is a significant first step toward accelerating research on the automatic construction of pathways from text. PMID:18426550

  16. Approaches to Fungal Genome Annotation

    PubMed Central

    Haas, Brian J.; Zeng, Qiandong; Pearson, Matthew D.; Cuomo, Christina A.; Wortman, Jennifer R.

    2011-01-01

    Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment. PMID:22059117

  17. Solving the Problem: Genome Annotation Standards before the Data Deluge.

    PubMed

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D; Tatusova, Tatiana

    2011-10-15

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.

  18. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation.

    PubMed

    Bolleman, Jerven T; Mungall, Christopher J; Strozzi, Francesco; Baran, Joachim; Dumontier, Michel; Bonnal, Raoul J P; Buels, Robert; Hoehndorf, Robert; Fujisawa, Takatomo; Katayama, Toshiaki; Cock, Peter J A

    2016-06-13

    Nucleotide and protein sequence feature annotations are essential to understand biology at the genomic, transcriptomic, and proteomic levels. When using Semantic Web technologies to query biological annotations, however, there was no standard that described this potentially complex location information as subject-predicate-object triples. We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned "omics" areas. Using the same data format to represent sequence positions independently of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe, and potentially merge, sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federated SPARQL queries against public SPARQL endpoints and/or local private triple stores.
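    The subject-predicate-object encoding of a feature location can be sketched as plain triples following the general FALDO pattern (a feature linked to a region whose begin and end are exact positions on a reference sequence). The feature and reference URIs below are hypothetical examples, and the term usage is a sketch of the pattern rather than an authoritative rendering of the ontology:

    ```python
    # Encode "gene X spans bases 100-500 on chromosome 1" as FALDO-style triples.
    FALDO = "http://biohackathon.org/resource/faldo#"   # FALDO namespace
    gene = "http://example.org/gene/X"                  # hypothetical feature URI
    chrom = "http://example.org/chr1"                   # hypothetical reference URI

    def exact_position(uri, pos, ref):
        """Triples describing a faldo:ExactPosition on a reference sequence."""
        return [
            (uri, "rdf:type", FALDO + "ExactPosition"),
            (uri, FALDO + "position", pos),
            (uri, FALDO + "reference", ref),
        ]

    triples = [
        (gene, FALDO + "location", gene + "#region"),
        (gene + "#region", "rdf:type", FALDO + "Region"),
        (gene + "#region", FALDO + "begin", gene + "#b"),
        (gene + "#region", FALDO + "end", gene + "#e"),
    ]
    triples += exact_position(gene + "#b", 100, chrom)
    triples += exact_position(gene + "#e", 500, chrom)
    ```

    Because every position triple names its reference sequence explicitly, annotations produced against the same reference by different sources can be merged or queried together, which is the integration point the abstract emphasizes.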

  19. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation

    DOE PAGES

    Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco; ...

    2016-06-13

    Nucleotide and protein sequence feature annotations are essential to understand biology at the genomic, transcriptomic, and proteomic levels. When using Semantic Web technologies to query biological annotations, however, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data format to represent sequence positions independently of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe, and potentially merge, sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federated SPARQL queries against public SPARQL endpoints and/or local private triple stores.

  20. Using comparative genome analysis to identify problems in annotated microbial genomes.

    PubMed

    Poptsova, Maria S; Gogarten, J Peter

    2010-07-01

    Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to newly sequenced genomes, we emphasize that continuously updating popular public databases is an urgent and unresolved problem. Owing to progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.
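    Orthologue-based refinement of the kind described above is commonly built on reciprocal-best-hit comparisons: a region of genome B becomes a candidate missed gene when a gene from genome A hits it as its best match and no annotated B gene returns the favor. The sketch below operates on precomputed hit scores with toy data; it is a minimal illustration of the principle, not the specific method of the paper:

    ```python
    def reciprocal_best_hits(hits_ab, hits_ba):
        """Return (a, b) pairs where a's best hit in B is b and b's best hit in A is a.

        hits_ab / hits_ba map each query gene to a list of (subject, score),
        e.g. parsed from tabular BLAST output.
        """
        def best(hits):
            # best-scoring subject for each query
            return {q: max(subs, key=lambda s: s[1])[0]
                    for q, subs in hits.items() if subs}

        best_ab, best_ba = best(hits_ab), best(hits_ba)
        return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}

    # toy scores: geneA1 and geneB1 are mutual best hits; geneB2 is not
    hits_ab = {"geneA1": [("geneB1", 950), ("geneB2", 300)]}
    hits_ba = {"geneB1": [("geneA1", 940)], "geneB2": [("geneA1", 290)]}
    print(reciprocal_best_hits(hits_ab, hits_ba))  # {('geneA1', 'geneB1')}
    ```

    An A gene whose best hit lands in an unannotated stretch of genome B, with no reciprocal partner, is then flagged for manual inspection as a possibly missed gene.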
